**P**anayiota Touloupou (Warwick), Naif Alzahrani, Peter Neal, Simon Spencer (Warwick) and Trevelyan McKinley arXived a paper yesterday on Model comparison with missing data using MCMC and importance sampling, where they proposed an importance sampling strategy based on an early MCMC run to approximate the marginal likelihood a.k.a. the evidence. Another instance of estimating a constant. It is thus similar to our Frontier paper with Jean-Michel, as well as to the recent Pima Indian survey of James and Nicolas. The authors give the difficulty to calibrate reversible jump MCMC as the starting point to their research. The importance sampler they use is the natural choice of a Gaussian or *t* distribution centred at some estimate of θ and with covariance matrix associated with Fisher’s information. Or derived from the warmup MCMC run. The comparison between the different approximations to the evidence are done first over longitudinal epidemiological models. Involving 11 parameters in the example processed therein. The competitors to the 9 versions of importance samplers investigated in the paper are the raw harmonic mean [rather than our HPD truncated version], Chib’s, path sampling and RJMCMC [which does not make much sense when comparing two models]. But neither bridge sampling, nor nested sampling. Without any surprise (!) harmonic means do not converge to the right value, but more surprisingly Chib’s method happens to be less accurate than most importance solutions studied therein. It may be due to the fact that Chib’s approximation requires three MCMC runs and hence is quite costly. The fact that the mixture (or defensive) importance sampling [with 5% weight on the prior] did best begs for a comparison with bridge sampling, no? The difficulty with such study is obviously that the results only apply in the setting of the simulation, hence that e.g. another mixture importance sampler or Chib’s solution would behave differently in another model. In particular, it is hard to judge of the impact of the dimensions of the parameter and of the missing data.

## Archive for Chib’s approximation

## approximating evidence with missing data

Posted in Books, pictures, Statistics, University life with tags Bayes factor, Bayesian Choice, Bayesian model comparison, bridge sampling, Chib's approximation, defensive mixture, harmonic mean, importance sampling, MCMC algorithms, mixture, Monte Carlo Statistical Methods, nested sampling, Pima Indians, reversible jump MCMC, simulation, University of Warwick on December 23, 2015 by xi'an## Leave the Pima Indians alone!

Posted in Books, R, Statistics, University life with tags ABC, Akimel O'odham, Bayes factor, benchmark, Chib's approximation, CPU, diabetes, EP-ABC, expectation-propagation, Gibbs sampling, Jim Berger, logistic regression, MCMC algorithms, Monte Carlo Statistical Methods, Newton-Raphson algorithm, Pima Indians, probit model, R on July 15, 2015 by xi'an

“…our findings shall lead to us be critical of certain current practices. Specifically, most papers seem content with comparing some new algorithm with Gibbs sampling, on a few small datasets, such as the well-known Pima Indians diabetes dataset (8 covariates). But we shall see that, for such datasets, approaches that are even more basic than Gibbs sampling are actually hard to beat. In other words, datasets considered in the literature may be too toy-like to be used as a relevant benchmark. On the other hand, if ones considers larger datasets (with say 100 covariates), then not so many approaches seem to remain competitive” (p.1)

**N**icolas Chopin and James Ridgway (CREST, Paris) completed and arXived a paper they had “threatened” to publish for a while now, namely why using the Pima Indian R logistic or probit regression benchmark for checking a computational algorithm is not such a great idea! Given that I am definitely guilty of such a sin (in papers not reported in the survey), I was quite eager to read the reasons why! Beyond the debate on the worth of such a benchmark, the paper considers a wider perspective as to how Bayesian computation algorithms should be compared, including the murky waters of CPU time versus designer or programmer time. Which plays against most MCMC sampler.

As a first entry, Nicolas and James point out that the MAP can be derived by standard a Newton-Raphson algorithm when the prior is Gaussian, and even when the prior is Cauchy as it seems most datasets allow for Newton-Raphson convergence. As well as the Hessian. We actually took advantage of this property in our comparison of evidence approximations published in the Festschrift for Jim Berger. Where we also noticed the awesome performances of an importance sampler based on the Gaussian or Laplace approximation. The authors call this proposal their *gold standard*. Because they also find it hard to beat. They also pursue this approximation to its logical (?) end by proposing an evidence approximation based on the above and Chib’s formula. Two close approximations are provided by INLA for posterior marginals and by a Laplace-EM for a Cauchy prior. Unsurprisingly, the expectation-propagation (EP) approach is also implemented. What EP lacks in theoretical backup, it seems to recover in sheer precision (in the examples analysed in the paper). And unsurprisingly as well the paper includes a randomised quasi-Monte Carlo version of the Gaussian importance sampler. (The authors report that “the improvement brought by RQMC varies strongly across datasets” without elaborating for the reasons behind this variability. They also do not report the CPU time of the IS-QMC, maybe identical to the one for the regular importance sampling.) Maybe more surprising is the absence of a nested sampling version.

In the Markov chain Monte Carlo solutions, Nicolas and James compare Gibbs, Metropolis-Hastings, Hamiltonian Monte Carlo, and NUTS. Plus a tempering SMC, All of which are outperformed by importance sampling for small enough datasets. But get back to competing grounds for large enough ones, since importance sampling then fails.

“…let’s all refrain from now on from using datasets and models that are too simple to serve as a reasonable benchmark.” (p.25)

This is a very nice survey on the theme of binary data (more than on the comparison of algorithms in that the authors do not really take into account design and complexity, but resort to MSEs versus CPus). I however do not agree with their overall message to leave the Pima Indians alone. Or at least not for the reason provided therein, namely that faster and more accurate approximations methods are available and cannot be beaten. Benchmarks always have the limitation of “what you get is what you see”, i.e., the output associated with a single dataset that only has that many idiosyncrasies. Plus, the closeness to a perfect normal posterior makes the logistic posterior too regular to pause a real challenge (even though MCMC algorithms are as usual slower than iid sampling). But having faster and more precise resolutions should on the opposite be cause for cheers, as this provides a reference value, a golden standard, to check against. In a sense, for every Monte Carlo method, there is a much better answer, namely the exact value of the integral or of the optimum! And one is hardly aiming at a more precise inference for the benchmark itself: those Pima Indians [whose actual name is Akimel O’odham] with diabetes involved in the original study are definitely beyond help from statisticians and the model is unlikely to carry out to current populations. When the goal is to compare methods, as in our 2009 paper for Jim Berger’s 60th birthday, what matters is relative speed and relative ease of implementation (besides the obvious convergence to the proper target). In that sense bigger and larger is not always relevant. Unless one tackles really big or really large datasets, for which there is neither benchmark method nor reference value.

## Robert’s paradox [reading in Reading]

Posted in Statistics, Travel, University life with tags BayesComp, biking, Chib's approximation, doubly intractable problems, exchange algorithm, exchangeability, John Deely, label switching, mixture estimation, Purdue University, trains, University of Oxford, University of Reading on January 28, 2015 by xi'an**O**n Wednesday afternoon, Richard Everitt and Dennis Prangle organised an RSS workshop in Reading on Bayesian Computation. And invited me to give a talk there, along with John Hemmings, Christophe Andrieu, Marcelo Pereyra, and themselves. Given the proximity between Oxford and Reading, this felt like a neighbourly visit, especially when I realised I could take my bike on the train! John Hemmings gave a presentation on synthetic models for climate change and their evaluation, which could have some connection with Tony O’Hagan’s recent talk in Warwick, Dennis told us about “the lazier ABC” version in connection with his “lazy ABC” paper, [from my very personal view] Marcelo expanded on the Moreau-Yoshida expansion he had presented in Bristol about six months ago, with the notion that using a Gaussian tail regularisation of a super-Gaussian target in a Langevin algorithm could produce better convergence guarantees than the competition, including Hamiltonian Monte Carlo, Luke Kelly spoke about an extension of phylogenetic trees using a notion of lateral transfer, and Richard introduced a notion of biased approximation to Metropolis-Hasting acceptance ratios, notion that I found quite attractive if not completely formalised, as there should be a Monte Carlo equivalent to the improvement brought by biased Bayes estimators over unbiased classical counterparts. (Repeating a remark by Persi Diaconis made more than 20 years ago.) Christophe Andrieu also exposed some recent developments of his on exact approximations à la Andrieu and Roberts (2009).

Since those developments are not yet finalised into an archived document, I will not delve into the details, but I found the results quite impressive and worth exploring, so I am looking forward to the incoming publication. One aspect of the talk which I can comment on is related to the exchange algorithm of Murray et al. (2006). Let me recall that this algorithm handles double intractable problems (i.e., likelihoods with intractable normalising constants like the Ising model), by introducing auxiliary variables with the same distribution as the data given the new value of the parameter and computing an augmented acceptance ratio which expectation is the targeted acceptance ratio and which conveniently removes the unknown normalising constants. This auxiliary scheme produces a random acceptance ratio and hence differs from the exact-approximation MCMC approach, which target directly the intractable likelihood. It somewhat replaces the unknown constant with the density taken at a plausible realisation, hence providing a proper scale. At least for the new value. I wonder if a comparison has been conducted between both versions, the naïve intuition being that the ratio of estimates should be more variable than the estimate of the ratio. More generally, it seemed to me [during the introductory part of Christophe’s talk] that those different methods always faced a harmonic mean danger when being phrased as expectations of ratios, since those ratios were not necessarily squared integrable. And not necessarily bounded. Hence my rather gratuitous suggestion of using other tools than the expectation, like maybe a median, thus circling back to the biased estimators of Richard. (And later cycling back, unscathed, to Reading station!)

On top of the six talks in the afternoon, there was a small poster session during the tea break, where I met Garth Holloway, working in agricultural economics, who happened to be a (unsuspected) fan of mine!, to the point of entitling his poster “Robert’s paradox”!!! The problem covered by this undeserved denomination connected to the bias in Chib’s approximation of the evidence in mixture estimation, a phenomenon that I related to the exchangeability of the component parameters in an earlier paper or set of slides. So “my” paradox is essentially label (un)switching and its consequences. For which I cannot claim any fame! Still, I am looking forward the completed version of this poster to discuss Garth’s solution, but we had a beer together after the talks, drinking to the health of our mutual friend John Deely.

## importance sampling schemes for evidence approximation [revised]

Posted in Statistics, University life with tags Andrew Gelman, candidate approximation, Chib's approximation, evidence, finite mixtures, label switching, permutations, Rao-Blackwellisation on November 18, 2014 by xi'an**A**fter a rather intense period of new simulations and versions, Juong Een (Kate) Lee and I have now resubmitted our paper on (some) importance sampling schemes for evidence approximation in mixture models to Bayesian Analysis. There is no fundamental change in the new version but rather a more detailed description of what those importance schemes mean in practice. The original idea in the paper is to improve upon the Rao-Blackwellisation solution proposed by Berkoff et al. (2002) and later by Marin et al. (2005) to avoid the impact of label switching on Chib’s formula. The Rao-Blackwellisation consists in averaging over all permutations of the labels while the improvement relies on the elimination of useless permutations, namely those that produce a negligible conditional density in Chib’s (candidate’s) formula. While the improvement implies truncated the overall sum and hence induces a potential bias (which was the concern of one referee), the determination of the irrelevant permutations after relabelling next to a single mode does not appear to cause any bias, while reducing the computational overload. Referees also made us aware of many recent proposals that conduct to different evidence approximations, albeit not directly related with our purpose. (One was Rodrigues and Walker, 2014, discussed and commented in a recent post.)

## Importance sampling schemes for evidence approximation in mixture models

Posted in R, Statistics, University life with tags arXiv, Chib's approximation, evidence, label switching, marginal likelihood, mixture estimation, Monte Carlo Statistical Methods, path sampling, permutations, subsampling on November 27, 2013 by xi'an**J**eong Eun (Kate) Lee and I completed this paper, “Importance sampling schemes for evidence approximation in mixture models“, now posted on arXiv. *(With the customary one-day lag for posting, making me bemoan the days of yore when arXiv would give a definitive arXiv number at the time of submission.)* Kate came twice to Paris in the past years to work with me on this evaluation of Chib’s original marginal likelihood estimate (also called the candidate formula by Julian Besag). And on the improvement proposed by Berkhof, van Mechelen, and Gelman (2003), based on averaging over all permutations, idea that we rediscovered in an earlier paper with Jean-Michel Marin. *(And that Andrew seemed to have completely forgotten. Despite being the very first one to publish* [in English]* a paper on a Gibbs sampler for mixtures.)* Given that this averaging can get quite costly, we propose a preliminary step to reduce the number of relevant permutations to be considered in the averaging, removing far-away modes that do not contribute to the Rao-Blackwell estimate and called dual importance sampling. We also considered modelling the posterior as a product of k-component mixtures on the components, following a vague idea I had in the back of my mind for many years, but it did not help. In the above boxplot comparison of estimators, the marginal likelihood estimators are

- Chib’s method using T = 5000 samples with a permutation correction by multiplying by k!.
- Chib’s method (1), using T = 5000 samples which are randomly permuted.
- Importance sampling estimate (7), using the maximum likelihood estimate (MLE) of the latents as centre.
- Dual importance sampling using q in (8).
- Dual importance sampling using an approximate in (14).
- Bridge sampling (3). Here, label switching is imposed in hyperparameters.

## On the use of marginal posteriors in marginal likelihood estimation via importance-sampling

Posted in R, Statistics, University life with tags Bayes factor, Chib's approximation, evidence, harmonic mean estimator, label switching, latent variable, marginal likelihood, MCMC, mixtures, Monte Carlo Statistical Methods, nested sampling, Poisson regression, Rao-Blackwellisation, simulation on November 20, 2013 by xi'an**P**errakis, Ntzoufras, and Tsionas just arXived a paper on marginal likelihood (evidence) approximation (with the above title). The idea behind the paper is to base importance sampling for the evidence on simulations from the product of the (block) marginal posterior distributions. Those simulations can be directly derived from an MCMC output by randomly permuting the components. The only critical issue is to find good approximations to the marginal posterior densities. This is handled in the paper either by normal approximations or by Rao-Blackwell estimates. the latter being rather costly since one importance weight involves B.L computations, where B is the number of blocks and L the number of samples used in the Rao-Blackwell estimates. The time factor does not seem to be included in the comparison studies run by the authors, although it would seem necessary when comparing scenarii.

**A**fter a standard regression example (that did not include Chib’s solution in the comparison), the paper considers 2- and 3-component mixtures. The discussion centres around label switching (of course) and the deficiencies of Chib’s solution against the current method and Neal’s reference. The study does not include averaging Chib’s solution over permutations as in Berkoff et al. (2003) and Marin et al. (2005), an approach that does eliminate the bias. Especially for a small number of components. Instead, the authors stick to the log(k!) correction, despite it being known for being quite unreliable (depending on the amount of overlap between modes). The final example is Diggle et al. (1995) longitudinal Poisson regression with random effects on epileptic patients. The appeal of this model is the unavailability of the integrated likelihood which implies either estimating it by Rao-Blackwellisation or including the 58 latent variables in the analysis. (There is no comparison with other methods.)

**A**s a side note, among the many references provided by this paper, I did not find trace of Skilling’s nested sampling or of safe harmonic means (as exposed in our own survey on the topic).