Archive for shrinkage estimation

Bayesian propaganda?

Posted in Books, Kids, pictures, Statistics, University life on April 20, 2015 by xi'an

“The question is about frequentist approach. Bayesian is admissable [sic] only by wrong definition as it starts with the assumption that the prior is the correct pre-information. James-Stein beats OLS without assumptions. If there is an admissable [sic] frequentist estimator then it will correspond to a true objective prior.”

I had a wee bit of a (minor, very minor!) communication problem on X validated, about a question on the existence of admissible estimators of the linear regression coefficient in multiple dimensions, under squared error loss. When I first replied that all Bayes estimators with finite risk were de facto admissible, I got the above reply, which clearly misses the point, and as I had edited the OP question to include more tags, the edited version was reverted with a comment about Bayesian propaganda! This is rather funny, if not hilarious, as (a) Bayes estimators are indeed admissible in the classical or frequentist sense (I actually fail to see a definition of admissibility in the Bayesian sense) and (b) the complete class theorems of Wald, Stein, and others (like Jack Kiefer, Larry Brown, and Jim Berger) come from the frequentist quest for best estimator(s). To make my point clearer, I also reproduced in my answer Stein’s necessary and sufficient condition for admissibility from my book, but it did not help, as the theorem was “too complex for [the OP] to understand”, which shows in fine the point of reading textbooks!
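As an aside on the claim that “James-Stein beats OLS”, here is a minimal simulation sketch, under assumptions that are mine and not part of the X validated exchange: the regression problem is reduced to estimating a normal mean with identity covariance (as with an orthonormal design and known unit variance), in dimension at least 3, under squared error loss. The function names are hypothetical.

```python
import numpy as np

def james_stein(x):
    """Plain James-Stein estimator shrinking each row of x toward 0 (dimension >= 3)."""
    p = x.shape[-1]
    # Shrinkage factor 1 - (p - 2) / ||x||^2, computed row by row
    shrink = 1.0 - (p - 2) / np.sum(x**2, axis=-1, keepdims=True)
    return shrink * x

def simulated_risks(theta, n_rep=20000, seed=0):
    """Monte Carlo estimates of the quadratic risks of x and of James-Stein at theta."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta, dtype=float)
    # Each row is one draw x ~ N(theta, I)
    x = theta + rng.standard_normal((n_rep, theta.size))
    risk_mle = np.mean(np.sum((x - theta) ** 2, axis=1))
    risk_js = np.mean(np.sum((james_stein(x) - theta) ** 2, axis=1))
    return risk_mle, risk_js

if __name__ == "__main__":
    for theta in (np.zeros(10), np.full(5, 2.0)):
        print(theta.size, simulated_risks(theta))
```

The simulated risk of the James-Stein estimator should fall below that of the unshrunk estimator at every mean vector tried, which illustrates domination, not admissibility: the plain James-Stein estimator is itself dominated by its positive-part version.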

Cancún, ISBA 2014 [day #1]

Posted in pictures, Statistics, Travel, University life on July 18, 2014 by xi'an

The first full day of talks at ISBA 2014, Cancún, was full of goodies, from the three early talks on specifically developed software, including one by Daniel Lee on STAN that complemented the one given by Bob Carpenter a few weeks ago in Paris (which gives me the opportunity to advertise STAN tee-shirts!), to the poster session (which just started a wee bit late for my conference sleep pattern!). Sylvia Richardson gave an impressive lecture full of information on Bayesian genomics. I also enjoyed very much two sessions with young Bayesian statisticians, one on Bayesian econometrics and the other one more diverse and sponsored by ISBA. Overall, and this also applies to the programme of the following days, I found that the proportion of non-parametric talks was quite high this year, possibly signalling a switch in the community and the interest of Bayesians. And conversely very few talks on computing-related issues. (With most scheduled after my early departure…)

In the first of those sessions, Brendan Kline talked about partially identified parameters, a topic quite close to my interests, although I did not buy the overall modelling adopted in the analysis. For instance, Brendan Kline presented the example of a parameter θ that is the expectation of a random variable Y which is indirectly observed through x̲ < Y < x̅. While he maintained that inference should be restricted to an interval around θ and that using a prior on θ was doomed to fail (and against econometrics culture), I would have preferred to see this example as a missing data one, with both x̲ and x̅ containing information about θ. I would also somewhat object to the argument against the prior, as it would equally apply to any prior modelling. Although unrelated in the themes, Angela Bitto presented a work on the impact of different prior modellings on the estimation of time-varying parameters in time-series models, à la Harrison and West 1994, discriminating between good and poor shrinkage in a way I could not spot. Unless it was based on the data fit (horror!). And a third talk of interest was given by Andriy Norets, (very loosely) related to Angela’s talk in that it presented a framework to modify credible sets towards frequentist properties: one example was the credible interval on a positive normal mean that led to a frequency-valid confidence interval with a modified prior. This reminded me very much of the shrinkage confidence intervals of the James-Stein era.

new MCMC algorithm for Bayesian variable selection

Posted in pictures, Statistics, Travel, University life on February 25, 2014 by xi'an

Unfortunately, I will miss the upcoming Bayes in Paris seminar next Thursday (27th February), as I will be flying to Montréal and then Québec at the time (despite having omitted to book a flight till now!). Indeed Amandine Schreck will give a talk at 2pm in room 18 of ENSAE, Malakoff, on A shrinkage-thresholding Metropolis adjusted Langevin algorithm for Bayesian variable selection, a work written jointly with Gersende Fort, Sylvain Le Corff, and Eric Moulines, and arXived at the end of 2013 (which may explain why I missed it!). Here is the abstract:

This paper introduces a new Markov Chain Monte Carlo method to perform Bayesian variable selection in high dimensional settings. The algorithm is a Hastings-Metropolis sampler with a proposal mechanism which combines (i) a Metropolis adjusted Langevin step to propose local moves associated with the differentiable part of the target density with (ii) a shrinkage-thresholding step based on the non-differentiable part of the target density which provides sparse solutions such that small components are shrunk toward zero. This allows to sample from distributions on spaces with different dimensions by actually setting some components to zero. The performances of this new procedure are illustrated with both simulated and real data sets. The geometric ergodicity of this new transdimensional Markov Chain Monte Carlo sampler is also established.

(I will definitely get a look at the paper over the coming days!)
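To make the two ingredients of the abstract concrete, here is a minimal sketch of a proposal that combines a Langevin-type gradient step with soft-thresholding. It is my own illustration, not the authors' implementation: the specific operator (plain soft-thresholding), the names `grad_g`, `sigma` and `gamma`, and the order of the steps are assumptions, and the Hastings-Metropolis acceptance ratio, which is the delicate part for a proposal with point masses at zero, is deliberately left out.

```python
import numpy as np

def soft_threshold(u, gamma):
    """Shrink every component of u toward 0 by gamma, setting small ones exactly to 0."""
    return np.sign(u) * np.maximum(np.abs(u) - gamma, 0.0)

def stmala_style_proposal(x, grad_g, sigma, gamma, rng):
    """One draw from a shrinkage-thresholding Langevin-type proposal (illustrative only).

    grad_g : gradient of the differentiable part of the log target (assumed supplied)
    sigma  : step size of the Langevin move
    gamma  : threshold level, tied to the non-differentiable (sparsity) part
    """
    # (i) Langevin step: drift along the gradient, then add Gaussian noise
    noisy = x + 0.5 * sigma**2 * grad_g(x) + sigma * rng.standard_normal(x.shape)
    # (ii) thresholding step: components below gamma in absolute value become exactly 0
    return soft_threshold(noisy, gamma)

if __name__ == "__main__":
    rng = np.random.default_rng(42)
    x = np.array([2.0, -0.1, 0.0, 1.5])
    # Toy differentiable part: log-density of a standard normal, with gradient -x
    print(stmala_style_proposal(x, lambda z: -z, sigma=0.3, gamma=0.2, rng=rng))
```

Because the thresholding sets some coordinates exactly to zero, the proposed point can live on a lower-dimensional subspace, which is what lets such a sampler move between models of different sizes.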

reading classics (#3)

Posted in Statistics, University life on November 15, 2012 by xi'an

Following in the reading classics series, my Master students in the Reading Classics Seminar course listened today to Kaniav Kamary’s analysis of Dennis Lindley’s and Adrian Smith’s 1972 linear Bayes paper Bayes Estimates for the Linear Model in JRSS Series B. Here are her (Beamer) slides.

At a first (mathematical) level this is an easier paper in the list, because it relies on linear algebra and normal conditioning. Of course, this is not the reason why Bayes Estimates for the Linear Model is in the list, nor how it impacted the field. It is indeed one of the first expositions on hierarchical Bayes programming, with some bits of empirical Bayes shortcuts when computation got a wee bit in the way. (Remember, this is 1972, when shrinkage estimation and its empirical Bayes motivations are in full blast… and, despite Hastings’ 1970 Biometrika paper, MCMC is yet to be imagined, except maybe by Julian Besag!) So, at secondary and tertiary levels, it is again hard to discuss, esp. with Kaniav’s low fluency in English. For instance, a major concept in the paper is exchangeability, not such a surprise given Adrian Smith’s translation of de Finetti into English. But this is a hard concept if only looking at the algebra within the paper, as a motivation for exchangeability and partial exchangeability (and hierarchical models) comes from applied fields like animal breeding (as in Sørensen and Gianola’s book). Otherwise, piling normal priors on top of normal priors is lost on the students. An objection from a 2012 reader is also that the assumption of exchangeability on the parameters of a regression model does not really make sense when the regressors are not normalised (this is linked to yesterday’s nefarious post!): I much prefer the presentation we make of the linear model in Chapter 3 of our Bayesian Core, based on Arnold Zellner’s g-prior. An interesting question from one student was whether or not this paper still had any relevance, other than historical. I was a bit at a loss on how to answer as, again, at a first level, the algebra was somehow natural and, at a statistical level, less informative priors could be used. However, the idea of grouping parameters together in partial exchangeability clusters remained quite appealing and bound to provide gains in precision…
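For readers lost by the piling of normal priors, the simplest exchangeable case can be sketched in a few lines. The setting is an assumption of mine chosen for illustration: observations y_i ~ N(θ_i, σ²), exchangeable means θ_i ~ N(μ, τ²), a flat prior on μ, and both variances known. The function name is hypothetical.

```python
import numpy as np

def exchangeable_posterior_means(y, sigma2, tau2):
    """Posterior means of theta_i when y_i ~ N(theta_i, sigma2), theta_i ~ N(mu, tau2),
    with a flat prior on mu and known variances (illustrative two-stage model)."""
    y = np.asarray(y, dtype=float)
    # Weight given to the grand mean: large when the data are noisy relative to
    # the spread of the theta_i's
    b = sigma2 / (sigma2 + tau2)
    # Each observation is pulled toward the grand mean by the same factor
    return (1.0 - b) * y + b * y.mean()

if __name__ == "__main__":
    print(exchangeable_posterior_means([0.0, 2.0, 4.0], sigma2=1.0, tau2=1.0))
```

Every component is shrunk toward the common mean, the gain in precision coming from borrowing strength across parameters deemed exchangeable; with partial exchangeability the same pull would operate within each cluster.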

reading classics (#2)

Posted in Statistics, University life on November 8, 2012 by xi'an

Following last week’s reading of Hartigan and Wong’s 1979 K-Means Clustering Algorithm, my Master students in the Reading Classics Seminar course listened today to Agnė Ulčinaitė covering Rob Tibshirani’s original LASSO paper Regression shrinkage and selection via the lasso in JRSS Series B. Here are her (Beamer) slides.

Again not the easiest paper in the list, again mostly algorithmic and requiring some background on how it impacted the field. Even though Agnė also went through the Elements of Statistical Learning by Hastie, Tibshirani and Friedman, it was hard to get away from the paper to analyse more widely the importance of the paper, the connection with the Bayesian (linear) literature of the 70’s, its algorithmic and inferential aspects, like the computational cost, and the recent extensions like Bayesian LASSO. Or the issue of handling n<p models. Remember that one of the S’s in LASSO stands for shrinkage: it was quite pleasant to hear again about ridge estimators and Stein’s unbiased estimator of the risk, as those were themes of my Ph.D. thesis… (I hope the students do not get discouraged by the complexity of those papers: there were fewer questions and fewer students this time. Next week, the compass will move to the Bayesian pole with a talk on Lindley and Smith’s 1972 linear Bayes paper by one of my PhD students.)

A Tribute to Charles Stein

Posted in Statistics, University life on March 28, 2012 by xi'an

Statistical Science just ran a special issue (Feb. 2012) as a tribute to Charles Stein that focused on shrinkage estimation. Shrinkage and the Stein effect have been my entries to the Bayesian (wonderful) world, so I read through this series of papers edited by Ed George and Bill Strawderman with fond remembrance. All the more because most of the authors are good friends! Jim Berger, Bill Jefferys, and Peter Müller consider shrinkage estimation for wavelet coefficients and apply it to Cepheid variable stars. The paper by Ann Brandwein and Bill Strawderman is a survey of shrinkage estimation and the Stein effect for spherically elliptical distributions, precisely my PhD thesis topic and main result! Larry Brown and Linda Zhao give a geometric interpretation of the original Stein (1956) paper. Tony Cai discusses the concepts of minimaxity and shrinkage estimators in functional spaces. George Casella and Juinn Gene Hwang recall the impact of shrinkage estimation on confidence sets. Dominique Fourdrinier and Marty Wells give an expository development of loss estimation using shrinkage estimators. Ed George, Feng Liang and Xinyi Xu recall how shrinkage estimation was recently extended to prediction using Kullback-Leibler losses. Carl Morris and Martin Lysy detail the reversed shrinkage defect and Model-II minimaxity in the normal case. Gauri Datta and Malay Ghosh explain how shrinkage estimators are paramount in small area estimation, providing a synthesis between the Bayesian and the frequentist points of view. Finally, Michael Perlman and Sanjay Chaudhuri reflect on the reversed shrinkage effect, providing us with several pages of Star Trek dialogues on this issue, and more seriously voicing a valid Bayesian reservation!

Comparison of the Bayesian and frequentist approaches

Posted in Books, Statistics, University life on September 1, 2010 by xi'an

I came upon this new book at the Springer booth at JSM 2010. Because its purpose [as stated on the back cover] seemed intriguing enough (“This monograph contributes to the area of comparative statistical inference. Attention is restricted to the important subfield of statistical estimation. (…) The necessary background on Decision Theory and the frequentist and Bayesian approaches to estimation is presented and carefully discussed in Chapters 1–3. The ‘threshold problem’ – identifying the boundary between Bayes estimators which tend to outperform standard frequentist estimators and Bayes estimators which don’t – is formulated in an analytically tractable way in Chapter 4. The formulation includes a specific (decision-theory based) criterion for comparing estimators.”), I bought it and read it during the past month spent travelling through California.

“Robert’s (2001) book, The Bayesian Choice, has similarities to the present work in that the author seeks to determine whether one should be a Bayesian or a frequentist. The main difference between our books is that I come to a different conclusion!” A comparison of the Bayesian and frequentist approaches to estimation, F. Samaniego.

This quote from the preface is admittedly the final reason that made me buy the book by F. Samaniego! When going through the chapters of A comparison of the Bayesian and frequentist approaches to estimation, I found them pleasant to read, written in a congenial (if sometimes repetitive) style, and some places were indeed reminiscent of The Bayesian Choice. However, my overall impression is that this monograph is too inconclusive to attract a large flock of readers and that the two central notions around which the book revolves, namely the threshold between “good and bad priors”, and the self-consistency, are rather weakly supported, at least when seen from my Bayesian perspective.

“Where this [generalised Bayes] approach runs afoul of the laws of coherent Bayesian inference is in its failure to use probability assessments in the qualification of uncertainty”. A comparison of the Bayesian and frequentist approaches to estimation, F. Samaniego.

The book is set within a restrictive setup, which is the Lehmann-Scheffé point estimation framework where there exists one “best” unbiased estimator. Of course, in most estimation problems, there is no unbiased estimator (see Lehmann and Casella’s Theory of point estimation, for instance). The presentation of the Bayesian principles tends to exclude improper priors as being incoherent (see the above quote) and it calls estimators associated with improper priors generalised Bayes estimators, while I take the alternative stance of calling generalised Bayes estimators those associated with an infinite Bayes risk. (The main appeal of the Bayesian approach, namely to provide all at once a complete inferential machine covering testing as well as estimation aspects, is not covered in the Bayesian chapter.)

“Which method stands to give the “better answers” in real problems of real interest?” A comparison of the Bayesian and frequentist approaches to estimation, F. Samaniego.

The central topic of the book is the comparison of frequentist and Bayesian procedures. Since under a given prior G, the optimal procedure is the Bayesian procedure associated with G and with the loss function, Samaniego introduces a “true prior” G0 to run the comparison between frequentist and Bayesian procedures. The following chapters then revolve around the same type of conclusion: if the prior is close enough to the “true prior” G0 then the Bayesian procedure does better than the frequentist one. Because the conditions for improvement depend on an unknown “truth”, the results are mathematically correct but operationally unappealing: when is one’s prior close enough to the truth? Stating that the threshold separates between “good and bad priors” does not have a strong content, besides the obvious. (From a Bayesian perspective, using the “wrong” prior was studied for a while in the 1990’s, under the category of Bayesian robustness.)
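To see where such a threshold comes from, here is a short derivation sketched under assumptions that are mine rather than quoted from the book: squared error loss, a frequentist estimator \hat\theta that is unbiased with Bayes risk r_0 under G0, and a linear Bayes estimator \delta = (1-\eta)\theta_G + \eta\hat\theta with 0 \le \eta < 1, where \theta_G is the mean of the working prior G, while G0 has mean \theta_0 and variance V_0.

```latex
\begin{align*}
r(G_0,\delta)
  &= \mathbb{E}\Big[\big\{\eta(\hat\theta-\theta) + (1-\eta)(\theta_G-\theta)\big\}^2\Big] \\
  &= \eta^2\, r_0 + (1-\eta)^2\,\big\{V_0 + (\theta_G-\theta_0)^2\big\},
\end{align*}
% the cross term vanishes because \hat\theta is unbiased given \theta; hence
\[
r(G_0,\delta) < r_0
\quad\Longleftrightarrow\quad
V_0 + (\theta_G-\theta_0)^2 < \frac{1+\eta}{1-\eta}\; r_0 .
\]
```

The condition involves \theta_0 and V_0, which belong to the unknown “truth”, and this is exactly why the result is hard to use in practice.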

“Whatever the merits of an objective Bayesian analysis might be, one should recognize that the approach is patently non-Bayesian.” A comparison of the Bayesian and frequentist approaches to estimation, F. Samaniego.

The restricted perspective on the Bayesian paradigm is also reflected by the insistence on using conjugate priors and linear estimators. The notion of self-consistency in Chapter 6 does not make sense outside this setting: a prior \pi on \theta = \mathbb{E}[X] is self-consistent if, when x=\mathbb{E}_\pi[\theta],

\mathbb{E}_\pi[\theta|x]= \mathbb{E}_\pi[\theta].

In other words, if the prior expectation and the observation coincide, the posterior expectation should be the same. This may sound “reasonable” but it only applies to a specific parameterisation of the problem, i.e. is not invariant under reparameterisation of either x or \theta. It is also essentially restricted to natural conjugate priors, e.g. it does not apply to mixtures of conjugate priors… I also find the relevance of conjugate priors diminished by the following chapter on shrinkage estimation, since the truly Bayesian shrinkage estimators correspond to hierarchical priors, not to conjugate priors.
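The failure for mixtures can be checked numerically. The sketch below assumes a single observation x ~ N(θ, σ²) with known σ² and a prior that is a finite mixture of normal (conjugate) components; the function name and the numerical values are illustrative choices of mine, not taken from the book.

```python
import numpy as np

def mixture_posterior_mean(x, weights, means, tau2s, sigma2):
    """Posterior mean of theta given x ~ N(theta, sigma2) under the prior
    sum_k weights[k] * N(means[k], tau2s[k])."""
    w = np.asarray(weights, dtype=float)
    m = np.asarray(means, dtype=float)
    t2 = np.asarray(tau2s, dtype=float)
    marg_var = sigma2 + t2
    # Marginal density of x under each component updates the mixture weights
    marg = np.exp(-0.5 * (x - m) ** 2 / marg_var) / np.sqrt(2 * np.pi * marg_var)
    post_w = w * marg
    post_w /= post_w.sum()
    # Usual conjugate posterior mean within each component
    comp_means = (t2 * x + sigma2 * m) / marg_var
    return float(np.sum(post_w * comp_means))

if __name__ == "__main__":
    # Single conjugate prior: observing x equal to the prior mean leaves it unchanged
    print(mixture_posterior_mean(1.5, [1.0], [1.5], [2.0], sigma2=1.0))
    # Two-component mixture with prior mean 0.8 * 0 + 0.2 * 5 = 1
    print(mixture_posterior_mean(1.0, [0.8, 0.2], [0.0, 5.0], [1.0, 1.0], sigma2=1.0))
```

With a single normal prior the posterior mean at x equal to the prior mean stays put, whereas with the two-component mixture the observation x = 1 mostly supports the component centred at 0 and the posterior mean moves well away from 1.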

“The potential (indeed, typical) lack of consistency of the Bayes estimators of a nonidentifiable parameter need not be considered to be a fatal flaw.” A comparison of the Bayesian and frequentist approaches to estimation, F. Samaniego.

Chapter 9 offers a new perspective on nonidentifiability, but this is highly controversial in that Samaniego’s perspective is to look at the performances of the Bayesian estimates of the nonidentifiable part! I think instead that the appeal of using a Bayesian approach in non-identifiable settings is to be able to infer on the identifiable parts, integrating out the nonidentifiable part thanks to the prior. Chapters 10 and 11 about combining experiments in a vaguely empirical Bayes fashion are more interesting but the proposed solutions sound rather ad hoc. A modern Bayesian analysis would resort to a non-parametric modelling to gather information from past/other experiments.

“But ‘steadfast’ Bayesians and ‘steadfast’ frequentists should also find ample food for thought in these pages.” A comparison of the Bayesian and frequentist approaches to estimation, F. Samaniego.

In conclusion, this book recapitulates the works of F. Samaniego and of his co-authors on the frequentist-Bayesian “fusion” into a coherent monograph. I however fear that this treatise cannot contribute to a large extent to the philosophical debate about the relevance of using Bayesian procedures to increase frequentist efficiency or to rely on frequentist estimates when the prior information is shaky. It could appeal to “old timers” from the decision-theoretic creed, but undergraduate and graduate students may find the topic far too narrow and the book too inconclusive to register for a corresponding course. I again agree that decision theory is a nice and reasonable entry into Bayesian analysis, and one that thoroughly convinced me to follow the Bayesian path, but the final appeal (and hence my Choice) stems from the universality of the posterior distribution, which covers all aspects of inference.