Archive for shrinkage estimation

same risk, different estimators

Posted in Statistics with tags , , , , , on November 10, 2017 by xi'an

An interesting question on X validated reminded me of the epiphany I had some twenty years ago when reading a Annals of Statistics paper by Anirban Das Gupta and Bill Strawderman on shrinkage estimators, namely that some estimators shared the same risk function, meaning their integrated loss was the same for all values of the parameter. As indicated in this question, Stefan‘s instructor seems to believe that two estimators having the same risk function must be a.s. identical. Which is not true as exemplified by the James-Stein (1960) estimator with scale 2(p-2), which has constant risk p, just like the maximum likelihood estimator. I presume the confusion stemmed from the concept of completeness, where having a function with constant expectation under all values of the parameter implies that this function is constant. But, for loss functions, the concept does not apply since the loss depends both on the observation (that is complete in a Normal model) and on the parameter.

Charles M. Stein [1920-2016]

Posted in Books, pictures, Statistics, University life with tags , , , , , , , , , on November 26, 2016 by xi'an

I have just heard that Charles Stein, Professor at Stanford University, passed away last night. Although the following image is definitely over-used, I truly feel this is the departure of a giant of statistics.  He has been deeply influential on the fields of probability and mathematical statistics, primarily in decision theory and approximation techniques. On the first field, he led to considerable changes in the perception of optimality by exhibiting the Stein phenomenon, where the aggregation of several admissible estimators of unrelated quantities may (and will) become inadmissible for the joint estimation of those quantities! Although the result can be explained by mathematical and statistical reasoning, it was still dubbed a paradox due to its counter-intuitive nature. More foundationally, it led to expose the ill-posed nature of frequentist optimality criteria and certainly contributed to the Bayesian renewal of the 1980’s, before the MCMC revolution. (It definitely contributed to my own move, as I started working on the Stein phenomenon during my thesis, before realising the fundamentally Bayesian nature of the domination results.)

“…the Bayesian point of view is often accompanied by an insistence that people ought to agree to a certain doctrine even without really knowing what this doctrine is.” (Statistical Science, 1986)

The second major contribution of Charles Stein was the introduction of a new technique for normal approximation that is now called the Stein method. It relies on a differential operator and produces estimates of approximation error in Central Limit theorems, even in dependent settings. While I am much less familiar with this aspect of Charles Stein’s work, I believe the impact it has had on the field is much more profound and durable than the Stein effect in Normal mean estimation.

(During the Vietnam War, he was quite active in the anti-war movement and the above picture from 2003 shows that his opinions had not shifted over time!) A giant truly has gone.

Bayesian propaganda?

Posted in Books, Kids, pictures, Statistics, University life with tags , , , , , , , , , on April 20, 2015 by xi'an

“The question is about frequentist approach. Bayesian is admissable [sic] only by wrong definition as it starts with the assumption that the prior is the correct pre-information. James-Stein beats OLS without assumptions. If there is an admissable [sic] frequentist estimator then it will correspond to a true objective prior.”

I had a wee bit of a (minor, very minor!) communication problem on X validated, about a question on the existence of admissible estimators of the linear regression coefficient in multiple dimensions, under squared error loss. When I first replied that all Bayes estimators with finite risk were de facto admissible, I got the above reply, which clearly misses the point, and as I had edited the OP question to include more tags, the edited version was reverted with a comment about Bayesian propaganda! This is rather funny, if not hilarious, as (a) Bayes estimators are indeed admissible in the classical or frequentist sense—I actually fail to see a definition of admissibility in the Bayesian sense—and (b) the complete class theorems of Wald, Stein, and others (like Jack Kiefer, Larry Brown, and Jim Berger) come from the frequentist quest for best estimator(s). To make my point clearer, I also reproduced in my answer the Stein’s necessary and sufficient condition for admissibility from my book but it did not help, as the theorem was “too complex for [the OP] to understand”, which shows in fine the point of reading textbooks!

Cancún, ISBA 2014 [day #1]

Posted in pictures, Statistics, Travel, University life with tags , , , , , , , on July 18, 2014 by xi'an

sunrise in Cancún, July 15, 2014The first full day of talks at ISBA 2014, Cancún, was full of goodies, from the three early talks on specifically developed software, including one by Daniel Lee on STAN that completed the one given by Bob Carpenter a few weeks ago in Paris (which gives me the opportunity to advertise STAN tee-shirts!). To the poster session (which just started a wee bit late for my conference sleep pattern!). Sylvia Richardson gave an impressive lecture full of information on Bayesian genomics. I also enjoyed very much two sessions with young Bayesian statisticians, one on Bayesian econometrics and the other one more diverse and sponsored by ISBA. Overall, and this also applies to the programme of the following days, I found that the proportion of non-parametric talks was quite high this year, possibly signalling a switch in the community and the interest of Bayesians. And conversely very few talks on computing related issues. (With most scheduled after my early departure…)

In the first of those sessions, Brendan Kline talked about partially identified parameters, a topic quite close to my interests, although I did not buy the overall modelling adopted in the analysis. For instance, Brendan Kline presented the example of a parameter θ that is the expectation of a random variable Y which is indirectly observed through x <Y< x̅ . While he maintained that inference should be restricted to an interval around θ and that using a prior on θ was doomed to fail (and against econometrics culture), I would have prefered to see this example as a missing data one, with both x and x̅ containing information about θ. And somewhat object to the argument against the prior as it would equally apply to any prior modelling. Although unrelated in the themes, Angela Bitto presented a work on the impact of different prior modellings on the estimation of time-varying parameters in time-series models. À la Harrison and West 1994 Discriminating between good and poor shrinkage in a way I could not spot. Unless it was based on the data fit (horror!). And a third talk of interest by Andriy Norets that (very loosely) related to Angela’s talk by presenting a framework to modify credible sets towards frequentist properties: one example was the credible interval on a positive normal mean that led to a frequency-valid confidence interval with a modified prior. This reminded me very much of the shrinkage confidence intervals of the James-Stein era.

new MCMC algorithm for Bayesian variable selection

Posted in pictures, Statistics, Travel, University life with tags , , , , , , , , , on February 25, 2014 by xi'an

Flight from Bristol to Amsterdam, April 03, 2011Unfortunately, I will miss the incoming Bayes in Paris seminar next Thursday (27th February), as I will be flying to Montréal and then Québec at the time (despite having omitted to book a flight till now!). Indeed Amandine Shreck will give a talk at 2pm in room 18 of ENSAE, Malakoff, on A shrinkage-thresholding Metropolis adjusted Langevin algorithm for Bayesian variable selection, a work written jointly with Gersende Fort, Sylvain Le Corff, and Eric Moulines, and arXived at the end of 2013 (which may explain why I missed it!). Here is the abstract:

This paper introduces a new Markov Chain Monte Carlo method to perform Bayesian variable selection in high dimensional settings. The algorithm is a Hastings-Metropolis sampler with a proposal mechanism which combines (i) a Metropolis adjusted Langevin step to propose local moves associated with the differentiable part of the target density with (ii) a shrinkage-thresholding step based on the non-differentiable part of the target density which provides sparse solutions such that small components are shrunk toward zero. This allows to sample from distributions on spaces with different dimensions by actually setting some components to zero. The performances of this new procedure are illustrated with both simulated and real data sets. The geometric ergodicity of this new transdimensional Markov Chain Monte Carlo sampler is also established.

(I will definitely get a look at the paper over the coming days!)

reading classics (#3)

Posted in Statistics, University life with tags , , , , , , , , , , , , on November 15, 2012 by xi'an

Following in the reading classics series, my Master students in the Reading Classics Seminar course, listened today to Kaniav Kamary analysis of Denis Lindley’s and Adrian Smith’s 1972 linear Bayes paper Bayes Estimates for the Linear Model in JRSS Series B. Here are her (Beamer) slides

At a first (mathematical) level this is an easier paper in the list, because it relies on linear algebra and normal conditioning. Of course, this is not the reason why Bayes Estimates for the Linear Model is in the list and how it impacted the field. It is indeed one of the first expositions on hierarchical Bayes programming, with some bits of empirical Bayes shortcuts when computation got a wee in the way. (Remember, this is 1972, when shrinkage estimation and its empirical Bayes motivations is in full blast…and—despite Hstings’ 1970 Biometrika paper—MCMC is yet to be imagined, except maybe by Julian Besag!) So, at secondary and tertiary levels, it is again hard to discuss, esp. with Kaniav’s low fluency in English. For instance, a major concept in the paper is exchangeability, not such a surprise given Adrian Smith’s translation of de Finetti into English. But this is a hard concept if only looking at the algebra within the paper, as a motivation for exchangeability and partial exchangeability (and hierarchical models) comes from applied fields like animal breeding (as in Sørensen and Gianola’s book). Otherwise, piling normal priors on top of normal priors is lost on the students. An objection from a 2012 reader is also that the assumption of exchangeability on the parameters of a regression model does not really make sense when the regressors are not normalised (this is linked to yesterday’s nefarious post!): I much prefer the presentation we make of the linear model in Chapter 3 of our Bayesian Core. Based on Arnold Zellner‘s g-prior. An interesting question from one student was whether or not this paper still had any relevance, other than historical. I was a bit at a loss on how to answer as, again, at a first level, the algebra was somehow natural and, at a statistical level, less informative priors could be used. However, the idea of grouping parameters together in partial exchangeability clusters remained quite appealing and bound to provide gains in precision….

reading classics (#2)

Posted in Statistics, University life with tags , , , , , , , , , , , on November 8, 2012 by xi'an

Following last week read of Hartigan and Wong’s 1979 K-Means Clustering Algorithm, my Master students in the Reading Classics Seminar course, listened today to Agnė Ulčinaitė covering Rob Tibshirani‘s original LASSO paper Regression shrinkage and selection via the lasso in JRSS Series B. Here are her (Beamer) slides

Again not the easiest paper in the list, again mostly algorithmic and requiring some background on how it impacted the field. Even though Agnė also went through the Elements of Statistical Learning by Hastie, Friedman and Tibshirani, it was hard to get away from the paper to analyse more widely the importance of the paper, the connection with the Bayesian (linear) literature of the 70’s, its algorithmic and inferential aspects, like the computational cost, and the recent extensions like Bayesian LASSO. Or the issue of handling n<p models. Remember that one of the S in LASSO stands for shrinkage: it was quite pleasant to hear again about ridge estimators and Stein’s unbiased estimator of the risk, as those were themes of my Ph.D. thesis… (I hope the students do not get discouraged by the complexity of those papers: there were fewer questions and fewer students this time. Next week, the compass will move to the Bayesian pole with a talk on Lindley and Smith’s 1973 linear Bayes paper by one of my PhD students.)