Archive for empirical Bayes methods

Posted in Statistics, University life on March 25, 2013 by xi'an

Here are the slides of my talk in Padova for the workshop Recent Advances in statistical inference: theory and case studies (very similar to the slides for the Varanasi and Gainesville meetings, obviously!, with Peter Müller commenting [at last!] that I had picked the wrong photos from Khajuraho!)

The worthy Padova addendum is that I had two discussants, Stefano Cabras from Universidad Carlos III in Madrid, whose slides are :

and Francesco Pauli, from Trieste, whose slides are:

These were kind and rich discussions with many interesting openings. Stefano’s idea of estimating the pivotal function h opens new directions, obviously, as it indicates an additional degree of freedom in calibrating the method, especially when considering the high variability of the empirical likelihood fit depending on the function h. For instance, one could start with a large collection of candidate functions and build a regression or a principal component reparameterisation from this collection… (Actually, I did not get point #1 about ignoring f: the empirical likelihood by essence ignores anything outside the identifying equation, so as long as the equation is valid…) Point #2: opposing sample-free and simulation-free techniques is another interesting avenue, although I would not say ABC is “sample free”. As to point #3, I will certainly have a look at Monahan and Boos (1992) to see if this can drive the choice of a specific type of pseudo-likelihood. I like the idea of checking the “coverage of posterior sets” and even more “the likelihood must be the density of a statistic, not necessarily sufficient”, as it obviously relates to our current ABC model comparison work… Especially when the very same paper is mentioned by Francesco as well. Grazie, Stefano!

I also appreciate the survey made by Francesco on the consistency conditions, because I think this is an important issue that should be taken into consideration when designing ABC algorithms. (Just pointing out again that, in the theorem of Fearnhead and Prangle (2012) quoting Bernardo and Smith (1992), some conditions are missing for the mathematical consistency to apply.) I also like the agreement we seem to reach about ABC being evaluated per se rather than as a poor man’s Bayesian method. Francesco’s analysis of Monahan and Boos (1992) as validating or not empirical likelihood points out a possible link with the recent coverage analysis of Prangle et al., discussed on the ‘Og a few weeks ago. And an unsuspected link with Larry Wasserman! Grazie, Francesco!

Winter workshop, Gainesville (day 2)

Posted in pictures, Running, Travel, University life, Wines on January 21, 2013 by xi'an

On day #2, besides my talk on “empirical Bayes” (ABCel) computation (mostly recycled from Varanasi, photos included), Christophe Andrieu gave a talk on exact approximations, using unbiased estimators of the likelihood and characterising estimators guaranteeing geometric convergence (bounded weights, essentially, which is a condition popping up again and again in the Monte Carlo literature). Then Art Owen (father of empirical likelihood among other things!) spoke about QMC for MCMC, a topic that has always intrigued me.

Indeed, while I see the point of using QMC for specific integration problems, I am more uncertain about its relevance for statistics as a simulation device. Having points uniformly distributed over the unit hypercube in a much more efficient way than a random sample is not helping much when only a tiny region of the unit hypercube, namely the one where the likelihood concentrates, matters. (In other words, we are rarely interested in the uniform distribution over the unit hypercube: we instead want to simulate from a highly irregular and definitely concentrated distribution.) I have the same reservation about the applicability of stratified sampling: the strata have to be constructed in relation with the target distribution. The method Art advocates uses a CUD (completely uniformly distributed) sequence as the underlying (deterministic) pseudo-uniform sequence. Highly interesting, and I want to read the paper in greater detail, but the fact that most simulation steps use a random number of uniforms seems detrimental to the performance of the method in general.
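
To make the reservation about concentrated targets concrete, here is a minimal Python sketch. It uses a plain two-dimensional Halton sequence as a stand-in for a low-discrepancy point set (an assumption of this illustration: it is not the CUD construction discussed in the talk), and the small square playing the role of the region "where the likelihood concentrates" is entirely hypothetical.

```python
def radical_inverse(index, base):
    """Van der Corput radical inverse of a positive integer in the given base."""
    value, scale = 0.0, 1.0 / base
    while index > 0:
        index, digit = divmod(index, base)
        value += digit * scale
        scale /= base
    return value


def halton_2d(n):
    """First n points of the two-dimensional Halton sequence (bases 2 and 3)."""
    return [(radical_inverse(i, 2), radical_inverse(i, 3)) for i in range(1, n + 1)]


def fraction_in_box(points, low, high):
    """Share of the points falling in the square [low, high] x [low, high]."""
    hits = sum(1 for u, v in points if low <= u <= high and low <= v <= high)
    return hits / len(points)


if __name__ == "__main__":
    points = halton_2d(1024)
    # Hypothetical "high-likelihood" region: a square of area 1e-4.
    print(fraction_in_box(points, 0.50, 0.51))
```

However evenly the points cover the unit square, the share landing in the small square stays of the order of its area, so almost none of the points are spent where the target mass would sit.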

After a lunch break at a terrific BBQ place, with a stop at Lake Alice to watch the alligator(s) I had missed during my morning run, I was able this time to attend Xiao-Li Meng’s talk till the end, where he presented new improvements on bridge sampling based on location-scale (or warping) transforms of the two original samples to make them share mean and variance. Hani Doss concluded the meeting with a talk on the computation of Bayes factors when using (non-parametric) Dirichlet mixture priors, whose resolution does not require simulations for each value of the scale parameter of the Dirichlet prior, thanks to a Radon-Nikodym derivative representation. (Which nicely connected with Art’s talk, in which he mentioned that most simulation methods are actually based on Riemann integration rather than Lebesgue integration. Hani’s representation is not, with nested sampling being another example.)

We ended the day with a(nother) barbecue outside, under the stars, in the peace and quiet of a local wood, with wine and laughs, just like George would have concluded the workshop. This was a fitting ending to a meeting dedicated to his memory…

empirical Bayes (CHANCE)

Posted in Books, Statistics, University life on April 23, 2012 by xi'an

As I decided to add a vignette on empirical Bayes methods to my review of Brad Efron’s Large-scale Inference in the next issue of CHANCE [25(3)], here it is.

Empirical Bayes methods can crudely be seen as the poor man’s Bayesian analysis. They start from a Bayesian modelling, for instance the parameterised prior

$x\sim f(x|\theta)\,,\quad \theta\sim\pi(\theta|\alpha)$

and then, instead of setting α to a specific value or of assigning a hyperprior to this hyperparameter α, as in a regular or a hierarchical Bayes approach, the empirical Bayes paradigm consists in estimating α from the data. Hence the “empirical” label. The reference model used for the estimation is the integrated likelihood (or conditional marginal)

$m(x|\alpha) = \int f(x|\theta) \pi(\theta|\alpha)\,\text{d}\theta$

which defines a distribution density indexed by α and thus allows for the use of any statistical estimation method (moments, maximum likelihood or even Bayesian!). A classical example is provided by the normal exchangeable sample: if

$x_i\sim \mathcal{N}(\theta_i,\sigma^2)\qquad \theta_i\sim \mathcal{N}(\mu,\tau^2)\quad i=1,\ldots,p$

then, marginally,

$x_i \sim \mathcal{N}(\mu,\tau^2+\sigma^2)$

and μ can be estimated by the empirical average of the observations. The next step in an empirical Bayes analysis is to act as if α had not been estimated from the data and to conduct a regular Bayesian processing of the data with this estimated prior distribution. In the above normal example, this means estimating the θ_i’s by

$\dfrac{\sigma^2 \bar{x} + \tau^2 x_i}{\sigma^2+\tau^2}$

with the characteristic shrinkage (to the average) property of the resulting estimator (Efron and Morris, 1973).
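
As a rough illustration of the normal example above, here is a minimal Python sketch of the plug-in shrinkage estimator. It assumes σ² is known and, as an assumption of this sketch rather than something stated in the vignette, estimates τ² by a moment estimator (empirical variance of the observations minus σ², truncated at zero). The function name is hypothetical.

```python
import random
import statistics


def empirical_bayes_shrinkage(x, sigma2):
    """Plug-in estimates of the theta_i's in the normal exchangeable model.

    x      : observations, x_i ~ N(theta_i, sigma2)
    sigma2 : sampling variance, assumed known and positive in this sketch
    """
    xbar = statistics.fmean(x)              # estimate of mu
    marginal_var = statistics.variance(x)   # estimates tau^2 + sigma^2
    tau2 = max(marginal_var - sigma2, 0.0)  # moment estimate (assumption), truncated at 0
    weight = tau2 / (sigma2 + tau2)         # weight given to x_i
    # Same as (sigma2 * xbar + tau2 * x_i) / (sigma2 + tau2)
    estimates = [(1.0 - weight) * xbar + weight * xi for xi in x]
    return estimates, xbar, tau2


if __name__ == "__main__":
    random.seed(0)
    theta = [random.gauss(2.0, 1.0) for _ in range(200)]  # simulated theta_i's
    x = [random.gauss(t, 1.0) for t in theta]             # simulated observations
    estimates, xbar, tau2 = empirical_bayes_shrinkage(x, sigma2=1.0)
    print(xbar, tau2, estimates[:3])
```

Each estimate is a convex combination of x_i and the overall average, which is the shrinkage property mentioned above; when the moment estimate of τ² is zero, every estimate collapses to the average.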

“…empirical Bayes isn’t Bayes.” B. Efron (p.90)

While using Bayesian tools, this technique is outside of the Bayesian paradigm for several reasons: (a) the prior depends on the data, hence it lacks foundational justifications; (b) the prior varies with the data, hence it lacks theoretical validations like Wald’s complete class theorem; (c) the prior uses the data once, hence the posterior uses the data twice (see the vignette about this sin in the previous issue); (d) the prior relies on an estimator, whose variability is not accounted for in the subsequent analysis (Morris, 1983). The original motivation for the approach (Robbins, 1955) was more non-parametric; however, it gained popularity in the 70’s and 80’s both in conjunction with the Stein effect and as a practical means of bypassing complex Bayesian computations. As illustrated by Efron’s book, it recently met with renewed interest in connection with multiple testing.

Large-scale Inference

Posted in Books, R, Statistics, University life on February 24, 2012 by xi'an

Large-scale Inference by Brad Efron is the first IMS Monograph in this new series, coordinated by David Cox and published by Cambridge University Press. Since I read this book immediately after Cox’s and Donnelly’s Principles of Applied Statistics, I was thinking of drawing a parallel between the two books. However, while neither of them can be classified as a textbook [even though Efron's has exercises], they differ very much in their intended audience and their purpose. As I wrote in the review of Principles of Applied Statistics, that book has an encompassing scope with the goal of covering all the methodological steps required by a statistical study. In Large-scale Inference, Efron focuses on empirical Bayes methodology for large-scale inference, by which he mostly means multiple testing (rather than, say, data mining). As a result, the book is centred on mathematical statistics and is more technical. (Which does not mean it is less of an exciting read!) The book was recently reviewed by Jordi Prats for Significance. Like the previous reviewer, and unsurprisingly, I found the book nicely written, with a wealth of R (colour!) graphs (the R programs and dataset are available on Brad Efron’s home page).

“I have perhaps abused the “mono” in monograph by featuring methods from my own work of the past decade.” (p.xi)

Sadly, I cannot remember if I read my first Efron paper via his 1977 introduction to the Stein phenomenon with Carl Morris in Pour la Science (the French translation of Scientific American) or through his 1983 Pour la Science paper with Persi Diaconis on computer intensive methods. (I would bet on the latter though.) In any case, I certainly read a lot of Efron’s papers on the Stein phenomenon during my thesis and it was thus with great pleasure that I saw he introduced empirical Bayes notions through the Stein phenomenon (Chapter 1). It actually took me a while but I eventually (by page 90) realised that empirical Bayes was a proper subtitle to Large-Scale Inference, in that the large samples were giving some weight to the validation of empirical Bayes analyses, in the sense of reducing the importance of a genuine Bayesian modelling (even though I do not see why this genuine Bayesian modelling could not be implemented in the cases covered in the book).

“Large N isn’t infinity and empirical Bayes isn’t Bayes.” (p.90)

The core of Large-scale Inference is multiple testing and the empirical Bayes justification/construction of Fdr’s (false discovery rates). Efron wrote more than a dozen papers on this topic, covered in the book and building on the groundbreaking and highly cited Series B 1995 paper by Benjamini and Hochberg. (In retrospect, it should have been a Read Paper and so was made a “retrospective read paper” by the Research Section of the RSS.) Fdr’s are essentially posterior probabilities and therefore open to empirical Bayes approximations when priors are not selected. Before reaching the concept of Fdr’s in Chapter 4, Efron goes over earlier procedures for removing multiple testing biases. As shown by a section title (“Is FDR Control “Hypothesis Testing”?”, p.58), one major point in the book is that an Fdr is more of an estimation procedure than a significance-testing object. (This is not a surprise from a Bayesian perspective since the posterior probability is an estimate as well.)
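
For readers who have not met it, the step-up rule of Benjamini and Hochberg can be sketched in a few lines of Python. This is the standard frequentist procedure as commonly stated, not Efron’s empirical Bayes construction, and the function name and default level q are choices made for this illustration only.

```python
def benjamini_hochberg(pvalues, q=0.1):
    """Indices of the hypotheses rejected by the BH step-up rule at level q.

    With N p-values sorted as p_(1) <= ... <= p_(N), find the largest k such
    that p_(k) <= k * q / N and reject the k hypotheses with smallest p-values.
    """
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])  # indices by increasing p-value
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * q / n:
            k_max = rank  # keep the largest rank meeting the bound (step-up)
    return sorted(order[:k_max])


if __name__ == "__main__":
    pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
    print(benjamini_hochberg(pvals, q=0.05))
```

The "step-up" aspect is that a p-value failing its own bound can still be rejected when a larger p-value meets a later bound, which is why the loop keeps the largest qualifying rank instead of stopping at the first failure.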

“Scientific applications of single-test theory most often suppose, or hope for rejection of the null hypothesis (…) Large-scale studies are usually carried out with the expectation that most of the N cases will accept the null hypothesis.” (p.89)

On the innovations proposed by Efron and described in Large-scale Inference, I particularly enjoyed the notions of local Fdrs in Chapter 5 (essentially plug-in posterior probabilities that a given observation stems from the null component of the mixture) and of the (Bayesian) improvement brought by empirical null estimation in Chapter 6 (“not something one estimates in classical hypothesis testing”, p.97) and the explanation for the inaccuracy of the bootstrap (which “stems from a simpler cause”, p.139), but found less crystal-clear the empirical evaluation of the accuracy of Fdr estimates (Chapter 7, “independence is only a dream”, p.113), maybe in relation with my early career inability to explain Morris’s (1983) correction for empirical Bayes confidence intervals (pp. 12-13). I also discovered the notion of enrichment in Chapter 9, with permutation tests resembling some low-key bootstrap, and multiclass models in Chapter 10, which appear as if they could benefit from a hierarchical Bayes perspective. The last chapter happily concludes with one of my preferred stories, namely the missing species problem (on which I hope to work this very Spring).
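
As a reminder of what the local Fdr of Chapter 5 stands for, here is the usual two-component mixture formulation. The symbols π0 (null proportion), f0 (null density) and f1 (non-null density) are the customary notation and are recalled here as an aid to the reader, not quoted from the book:

$f(z) = \pi_0 f_0(z) + (1-\pi_0) f_1(z)\,,\qquad \text{fdr}(z) = \dfrac{\pi_0 f_0(z)}{f(z)}$

that is, the posterior probability that a case with score z comes from the null component, which becomes “empirical” once π0, f0 and f are replaced with estimates.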

improper priors, incorporated

Posted in Books, Statistics, University life on January 11, 2012 by xi'an

“If a statistical procedure is to be judged by a criterion such as a conventional loss function (…) we should not expect optimal results from a probabilistic theory that demands multiple observations and multiple parameters.” P. McCullagh & H. Han

Peter McCullagh and Han Han have just published in the Annals of Statistics a paper on Bayes’ theorem for improper mixtures. This is a fascinating piece of work, even though some parts do elude me… The authors indeed propose a framework based on Kingman’s Poisson point processes that allows for including (countable) improper priors in a coherent probabilistic framework. This framework requires the definition of a test set A in the sampling space, the observations being then the events Y∩A, Y being an infinite random set when the prior is infinite. It is therefore complicated to perceive this representation in a genuine Bayesian framework, i.e. for a single observation, corresponding to a single parameter value. In that sense it seems closer to the original empirical Bayes, à la Robbins.

“An improper mixture is designed for a generic class of problems, not necessarily related to one another scientifically, but all having the same mathematical structure.” P. McCullagh & H. Han

The paper thus misses, in my opinion, a clear link with the design of improper priors. And it does not offer a resolution of the improper prior Bayes factor conundrum. However, it provides a perfectly valid environment for working with improper priors. For instance, the final section on the marginalisation “paradoxes” is illuminating in this respect as it does not demand using a limit of proper priors.