After a rather intense period of new simulations and versions, Jeong Eun (Kate) Lee and I have now resubmitted our paper on (some) importance sampling schemes for evidence approximation in mixture models to Bayesian Analysis. There is no fundamental change in the new version but rather a more detailed description of what those importance schemes mean in practice. The original idea in the paper is to improve upon the Rao-Blackwellisation solution proposed by Berkhof et al. (2003) and later by Marin et al. (2005) to avoid the impact of label switching on Chib’s formula. The Rao-Blackwellisation consists in averaging over all permutations of the labels while the improvement relies on the elimination of useless permutations, namely those that produce a negligible conditional density in Chib’s (candidate’s) formula. While the improvement implies truncating the overall sum and hence induces a potential bias (which was the concern of one referee), the determination of the irrelevant permutations after relabelling next to a single mode does not appear to cause any bias, while reducing the computational overload. Referees also made us aware of many recent proposals that lead to different evidence approximations, albeit not directly related to our purpose. (One was Rodrigues and Walker, 2014, discussed and commented on in a recent post.)
Archive for Chib’s approximation
Jeong Eun (Kate) Lee and I completed this paper, “Importance sampling schemes for evidence approximation in mixture models”, now posted on arXiv. (With the customary one-day lag for posting, making me bemoan the days of yore when arXiv would give a definitive arXiv number at the time of submission.) Kate came twice to Paris in the past years to work with me on this evaluation of Chib’s original marginal likelihood estimate (also called the candidate formula by Julian Besag). And on the improvement proposed by Berkhof, van Mechelen, and Gelman (2003), based on averaging over all permutations, an idea that we rediscovered in an earlier paper with Jean-Michel Marin. (And that Andrew seemed to have completely forgotten. Despite being the very first one to publish [in English] a paper on a Gibbs sampler for mixtures.) Given that this averaging can get quite costly, we propose a preliminary step to reduce the number of relevant permutations to be considered in the averaging. This step, called dual importance sampling, removes far-away modes that do not contribute to the Rao-Blackwell estimate. We also considered modelling the posterior as a product of k-component mixtures on the components, following a vague idea I had in the back of my mind for many years, but it did not help. In the above boxplot comparison of estimators, the marginal likelihood estimators are
- Chib’s method using T = 5000 samples with a permutation correction by multiplying by k!.
- Chib’s method (1), using T = 5000 samples which are randomly permuted.
- Importance sampling estimate (7), using the maximum likelihood estimate (MLE) of the latents as centre.
- Dual importance sampling using q in (8).
- Dual importance sampling using an approximate in (14).
- Bridge sampling (3). Here, label switching is imposed in hyperparameters.
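To make the permutation averaging behind these estimators concrete, here is a minimal sketch of the Rao-Blackwell estimate of the posterior density at a point, averaged over all k! relabellings. It is only the denominator of Chib’s formula, not the dual importance sampling scheme of the paper. The toy model (a normal mixture with known unit variance, independent normal priors on the means, and allocations standing in for a Gibbs output stuck near one mode) and all function names are assumptions made for illustration.

```python
import itertools
import math


def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def cond_density(mu, z, x, k, sigma2=1.0, prior_mean=0.0, prior_var=10.0):
    """pi(mu | x, z) in a toy k-component normal mixture (assumed model):
    known variance sigma2 and independent N(prior_mean, prior_var) priors on the means."""
    dens = 1.0
    for j in range(k):
        xs = [xi for xi, zi in zip(x, z) if zi == j]
        post_var = 1.0 / (1.0 / prior_var + len(xs) / sigma2)
        post_mean = post_var * (prior_mean / prior_var + sum(xs) / sigma2)
        dens *= normal_pdf(mu[j], post_mean, post_var)
    return dens


def plain_density(mu_star, allocations, x, k):
    """Rao-Blackwell estimate without any permutation averaging."""
    return sum(cond_density(mu_star, z, x, k) for z in allocations) / len(allocations)


def permutation_averaged_density(mu_star, allocations, x, k):
    """Rao-Blackwell estimate averaged over all k! relabellings of mu_star."""
    perms = list(itertools.permutations(range(k)))
    total = 0.0
    for z in allocations:
        for perm in perms:
            permuted = [mu_star[perm[j]] for j in range(k)]
            total += cond_density(permuted, z, x, k)
    return total / (len(allocations) * len(perms))


# Hypothetical data and allocations (a stand-in for a Gibbs output with no label switching).
x = [-2.1, -1.9, -2.3, 2.0, 1.8, 2.2]
allocations = [[0, 0, 0, 1, 1, 1], [0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 1]]
mu_star, swapped = [-2.0, 2.0], [2.0, -2.0]

plain_a = plain_density(mu_star, allocations, x, 2)
plain_b = plain_density(swapped, allocations, x, 2)
averaged_a = permutation_averaged_density(mu_star, allocations, x, 2)
averaged_b = permutation_averaged_density(swapped, allocations, x, 2)
```

Without averaging, the estimate depends heavily on which labelling of the point is used, while the averaged version is invariant to relabelling, as the exchangeable posterior requires. The cost grows as k! per simulated allocation, which is what motivates pruning negligible permutations.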
Perrakis, Ntzoufras, and Tsionas just arXived a paper on marginal likelihood (evidence) approximation (with the above title). The idea behind the paper is to base importance sampling for the evidence on simulations from the product of the (block) marginal posterior distributions. Those simulations can be directly derived from an MCMC output by randomly permuting the components. The only critical issue is to find good approximations to the marginal posterior densities. This is handled in the paper either by normal approximations or by Rao-Blackwell estimates, the latter being rather costly since one importance weight involves B×L computations, where B is the number of blocks and L the number of samples used in the Rao-Blackwell estimates. The time factor does not seem to be included in the comparison studies run by the authors, although it would seem necessary when comparing scenarii.
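As a rough illustration of this idea, the sketch below runs importance sampling from a product of marginals on a toy two-block problem. Everything in it is an assumption made for the example rather than the authors’ implementation: the unnormalised target is a correlated bivariate normal kernel (so the “evidence” is known to be 2π√(1−ρ²)), the posterior draws are simulated exactly instead of by MCMC, and the marginals are approximated by fitted normals.

```python
import math
import random


def log_unnormalised_target(t1, t2, rho):
    # Stand-in for likelihood x prior: a correlated bivariate normal kernel whose
    # normalising constant, 2*pi*sqrt(1 - rho^2), plays the role of the evidence.
    return -0.5 * (t1 * t1 - 2 * rho * t1 * t2 + t2 * t2) / (1 - rho * rho)


def normal_logpdf(x, mean, var):
    return -0.5 * math.log(2 * math.pi * var) - 0.5 * (x - mean) ** 2 / var


def mean_var(values):
    m = sum(values) / len(values)
    v = sum((u - m) ** 2 for u in values) / (len(values) - 1)
    return m, v


def marginal_product_evidence(block1, block2, rho, rng):
    # Normal approximations to the two marginal posteriors, fitted on the draws.
    m1, v1 = mean_var(block1)
    m2, v2 = mean_var(block2)
    # Randomly permuting one block breaks the dependence between blocks, so the
    # pairs behave (approximately) as draws from the product of the marginals.
    shuffled2 = block2[:]
    rng.shuffle(shuffled2)
    weights = [
        math.exp(log_unnormalised_target(a, b, rho)
                 - normal_logpdf(a, m1, v1) - normal_logpdf(b, m2, v2))
        for a, b in zip(block1, shuffled2)
    ]
    return sum(weights) / len(weights)


rng = random.Random(1)
rho, n = 0.5, 20000
block1, block2 = [], []
for _ in range(n):
    z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
    block1.append(z1)                                        # first block
    block2.append(rho * z1 + math.sqrt(1 - rho * rho) * z2)  # second block

estimate = marginal_product_evidence(block1, block2, rho, rng)
truth = 2 * math.pi * math.sqrt(1 - rho * rho)
```

With the normal approximation, each weight costs a single density evaluation per block. Replacing it by Rao-Blackwell estimates of the marginals would multiply that cost by the number of samples used in each estimate, which is the B×L factor mentioned above.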
After a standard regression example (that did not include Chib’s solution in the comparison), the paper considers 2- and 3-component mixtures. The discussion centres around label switching (of course) and the deficiencies of Chib’s solution against the current method and Neal’s reference. The study does not include averaging Chib’s solution over permutations as in Berkhof et al. (2003) and Marin et al. (2005), an approach that does eliminate the bias, especially for a small number of components. Instead, the authors stick to the log(k!) correction, despite it being known for being quite unreliable (depending on the amount of overlap between modes). The final example is Diggle et al.’s (1995) longitudinal Poisson regression with random effects on epileptic patients. The appeal of this model is the unavailability of the integrated likelihood, which implies either estimating it by Rao-Blackwellisation or including the 58 latent variables in the analysis. (There is no comparison with other methods.)
On the last/my day of the ISBA meeting in Varanasi, I attended a few talks before being kindly driven to the airport (early, too early, but with the unpredictable traffic there, it was better to err on the cautionary side!). In the dynamical model session, Simon Wilson presented a way to approximate posteriors for HMMs based on Chib’s (or Bayes’!) formula, while Jonathan Stroud exposed another approach to state-space model approximation involving a move of the state parameter based on a normal approximation of its conditional given the observable, an approximation which seemed acceptable for the cloud analysis model he was processing. Nicolas Chopin then gave a quick introduction to particle MCMC, all the way to SMC². (As a stern chairman of the session, I know Nicolas felt he did not have enough time but he did a really good job of motivating those different methods, in particular in explaining why the auxiliary variable approach makes the unbiased estimator of the likelihood a valid MCMC method.) Peter Green’s plenary talk was about an emission tomography image analysis whose statistical processing turned into a complex (Bernstein-von Mises) convergence theorem (whose preliminary version I saw in Bristol during Natalia Bochkina’s talk).
Overall, as forewarned by and expected from the program, this ISBA meeting was of the highest scientific quality. (I only wish I had had Hindu god abilities to duplicate and attend several parallel sessions at the same time!) Besides, much besides!, the warm attention paid to everyone by the organisers was just simply un-be-lie-vable! The cultural program was on a par with the scientific program. The numerous graduate students and faculty involved in the workshop organisation had a minute knowledge of our schedules and locations, and were constantly anticipating our needs and moves. Almost to a fault, i.e. to a point that was close to embarrassing for our cultural habits. I am therefore immensely grateful [personally and as former ISBA president] to all those people who contributed to the success of this ISBA meeting and first and foremost to Professor Satyanshu Upadhyay who worked relentlessly towards this goal during many months! (As a conference organiser, I realise I was and am simply unable to provide this level of welcome to the participants, even for much smaller meetings… The contrast with my previous conference in Berlin could not be more extreme as, for a much higher registration fee, the return was very, very limited.) I will forever (at least until my next reincarnation!) keep the memory of this meeting as a very special one, quite besides giving me the opportunity of my first visit to India…
Another arXiv posting I had had no time to comment on is Nial Friel’s and Jason Wyse’s “Estimating the model evidence: a review”. This is a review in the spirit of two of our papers, “Importance sampling methods for Bayesian discrimination between embedded models” with Jean-Michel Marin (published in Jim Berger’s Festschrift, Frontiers of Statistical Decision Making and Bayesian Analysis: In Honor of James O. Berger, but not mentioned in the review) and “Computational methods for Bayesian model choice” with Darren Wraith (referred to by the review). Indeed, it considers a series of competing computational methods for approximating evidence, aka marginal likelihood:
- Laplace approximation
- harmonic mean estimator
- Chib’s method
- annealed importance sampling (à la Neal, 2001)
- nested sampling
- power posteriors (which is actually a form of path sampling, à la Friel and Pettitt)
The paper correctly points out the difficulty with the naïve harmonic mean estimator. (But it does not cover the extension to the finite variance solutions found in “Importance sampling methods for Bayesian discrimination between embedded models” and in “Computational methods for Bayesian model choice”.) It also misses the whole collection of bridge and umbrella sampling techniques covered in, e.g., Chen, Shao and Ibrahim, 2000. In their numerical evaluations of the methods, the authors use the Pima Indian diabetes dataset we also used in “Importance sampling methods for Bayesian discrimination between embedded models”. The outcome is that the Laplace approximation does extremely well in this case (due to the fact that the posterior is very close to normal), Chib’s method being a very near second. The harmonic mean estimator does extremely poorly (not a surprise!) and the nested sampling approximation is not as accurate as the other (non-harmonic) methods. If we compare with our 2009 study, importance sampling based on the normal approximation (almost the truth!) did best, followed by our harmonic mean solution based on the same normal approximation. (Chib’s solution was then third, with a standard deviation ten times larger.)
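For readers unfamiliar with the finite variance harmonic mean alternative, here is a minimal sketch on a toy conjugate model where the evidence is available in closed form. The model (normal observations with unit variance and a normal prior on the mean), the choice of instrumental density φ (a normal centred at the posterior mean with half the posterior variance, hence lighter tails than the posterior), and the use of exact posterior draws in place of an MCMC output are all assumptions made for illustration, not the settings of the papers quoted above.

```python
import math
import random


def log_normal_pdf(x, mean, var):
    return -0.5 * math.log(2 * math.pi * var) - 0.5 * (x - mean) ** 2 / var


def log_mean_exp(values):
    # Numerically stable log of the average of exp(values).
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values) / len(values))


rng = random.Random(2)
tau2 = 4.0                                   # prior variance: theta ~ N(0, tau2)
data = [rng.gauss(1.0, 1.0) for _ in range(10)]  # x_i ~ N(theta, 1)
n, s = len(data), sum(data)

# Conjugate posterior N(post_mean, post_var) and exact log evidence.
post_var = 1.0 / (n + 1.0 / tau2)
post_mean = post_var * s
log_exact = (-0.5 * n * math.log(2 * math.pi) - 0.5 * math.log(1 + n * tau2)
             - 0.5 * (sum(x * x for x in data) - tau2 * s * s / (1 + n * tau2)))


def log_joint(theta):
    # log likelihood + log prior
    loglik = sum(log_normal_pdf(x, theta, 1.0) for x in data)
    return loglik + log_normal_pdf(theta, 0.0, tau2)


# Posterior draws (exact here; an MCMC sample in practice).
draws = [rng.gauss(post_mean, math.sqrt(post_var)) for _ in range(5000)]

# Harmonic mean identity with instrumental density phi:
# 1/m(x) = E_posterior[ phi(theta) / (likelihood(theta) * prior(theta)) ].
log_ratios = [log_normal_pdf(t, post_mean, post_var / 2) - log_joint(t) for t in draws]
log_estimate = -log_mean_exp(log_ratios)
```

Because φ has lighter tails than the posterior, the ratio φ/posterior is bounded and the estimator has finite variance. Taking φ equal to the prior recovers the naïve harmonic mean estimator, whose ratio is typically unbounded.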
An ‘Og reader, Emmanuel Charpentier, sent me the following email about model choice:
I read with great interest your critique of Peter Congdon’s 2006 paper (CSDA, 50(2):346-357) proposing a method of estimation of posterior model probabilities based on improper distributions for parameters not present in the model under examination, as well as a more general critique in your recent review of M. Aitkin’s recent book.
However, Peter Congdon’s 2007 proposal (Statistical Methodology. 4(2):143-157.) of another method for model weighting seems to have flown under your radar ; more generally, while the 2006 proposal seems to have been somewhat quoted and used in at least one biological application and two financial applications, his 2007 proposal seems to have been largely ignored (as far as a naïve Google Scholar’s user can tell) ; I found no allusion to this technique either in your blog or on Andrew Gelman’s blog.
This proposal, which uses a full probability model with proper priors and pseudo-priors, seems, however, to answer your critiques, and offers a number of technical advantages over other proposals :
- it can be computed from separate MCMC samples, with no regard to the MCMC sampling technique used to obtain them, therefore allowing the use of the « canned expertise » existing in WinBUGS, OpenBUGS or JAGS (which entails the impossibility of controlling the exact sampling methods used to solve a given problem) ;
- it avoids the need for very long runs to sufficiently explore unlikely models (which is the curse of Carlin & Chib (1995) method) ;
- it seems relatively easy to compute in most situations.
I’d be quite interested in any writings, thoughts or reactions to this proposal.
As I had indeed missed this paper, I went and took a look at it.
Last week, I received a box of books from the International Statistical Review, for reviewing them. I thus grabbed the one whose title was most appealing to me, namely Bayesian Model Selection and Statistical Modeling by Tomohiro Ando. I am indeed interested in both the nature of testing hypotheses or more accurately of assessing models, as discussed in both my talk at the Seminar of philosophy of mathematics at Université Paris Diderot a few days ago and the post on Murray Aitkin’s alternative, and the computational aspects of the resulting Bayesian procedures, including evidence, the Savage-Dickey paradox, nested sampling, harmonic mean estimators, and more…
After reading through the book, I am alas rather disappointed. The parts I consider to be innovative or at least “novel” in comparison with existing books (like Chen, Shao and Ibrahim, 2000, which remains a reference on this topic) are based on papers written by the author over the past five years, and they are mostly a sort of asymptotic Bayes analysis that I do not see as particularly Bayesian, because it involves the “true” distribution of the data. The coverage of the existing literature on Bayesian model choice is often incomplete and sometimes misses the point, as discussed below. This is especially true for the computational aspects that are generally mistreated or at least not treated in a way from which a newcomer to the field would benefit. The author often takes complex econometric examples for illustration, which is nice; however, he does not pursue the details far enough for the reader to be able to replicate the study without further reading. (An example is given by the coverage of stochastic volatility in Section 4.5.1, pages 83-84.) The few exercises at the end of each chapter are rather unhelpful, often sounding more like notes than true problems. (An extreme case is Exercise 6 on pages 196-197, which introduces the Metropolis-Hastings algorithm within the exercise, although it has already been defined on pages 66-67, and then asks to derive the marginal likelihood estimator. Another such exercise on pages 164-165 introduces the theory of DNA microarrays and gene expression in ten lines, which are later repeated verbatim on page 227, then asks to identify marker genes responsible for a certain trait.) The overall feeling after reading this book is thus that the contribution to the field of Bayesian Model Selection and Statistical Modeling is too limited and disorganised for the book to be recommended as “helping you choose the right Bayesian model” (back cover).