**H**ere are the slides of the presentation I gave at the EPSRC Advanced Computational methods for complex models in Biology at University College London, last week. Introducing random forests as proper summaries for both model choice and parameter estimation (with considerable overlap with earlier slides, obviously!). The other talks of that highly interesting day on computational Biology were mostly about ancestral graphs, using Wright-Fisher diffusions for coalescents, plus a comparison of expectation-propagation and ABC on a genealogy model by Mark Beaumont and the decision theoretic approach to HMM order estimation by Chris Holmes. In addition, it gave me the opportunity to come back to the Department of Statistics at UCL more than twenty years after my previous visit, at a time when my friend Costas Goutis was still there. And to realise it had moved from its historical premises years ago. (I wonder what happened to the two staircases built to reduce frictions between Fisher and Pearson if I remember correctly…)

## Archive for expectation-propagation

## advanced computational methods for complex models in Biology [talk]

Posted in Books, pictures, Statistics, Travel, University life with tags ABC, Bayesian computing, Biology, coalescent, computational biology, England, EPSRC, expectation-propagation, London, random forests, UCL, University College London, Wright-Fisher model on September 29, 2016 by xi'an## approximate Bayesian inference

Posted in Books, pictures, Statistics, Travel, University life with tags ABC in London, alive particle filter, Bayesian Analysis, Dirichlet prior, empirical likelihood, expectation-propagation, integer time-series, pMCMC, pseudo-marginal MCMC on March 23, 2016 by xi'an**M**aybe it is just a coincidence, but both most recent issues of Bayesian Analysis have an article featuring approximate Bayesian inference. One is by Daniel Add Contact Form Graham and co-authors on Approximate Bayesian Inference for Doubly Robust Estimation, while the other one is by Chris Drovandi and co-authors from QUT on Exact and Approximate Bayesian Inference for Low Integer-Valued Time Series Models with Intractable Likelihoods. The first paper has little connection with ABC. Even though it (a) uses a lot of three letter acronyms [which does not help with speed reading] and (b) relies on moment based and propensity score models. Instead, it relies on Bayesian bootstrap, which suddenly seems to me to be rather connected with empirical likelihood! Except the weights are estimated via a Dirichlet prior instead of being optimised. The approximation lies in using the bootstrap to derive a posterior predictive. I did not spot any assessment or control of the approximation effect in the paper.

“Note that we are always using the full data so avoiding the need to choose a summary statistic” (p.326)

The second paper connects pMCMC with ABC. Plus pseudo-marginals on the side! And even simplified reversible jump MCMC!!! I am far from certain I got every point of the paper, though, especially the notion of dimension reduction associated with *this* version of reversible jump MCMC. It may mean that latent variables are integrated out in approximate (marginalised) likelihoods [as explicated in Andrieu and Roberts (2009)].

“The difference with the common ABC approach is that we match on observations one-at-a-time” (p.328)

The model that the authors study is an integer value time-series, like the INAR(p) model. Which integer support allows for a non-zero probability of exact matching between simulated and observed data. One-at-a-time as indicated in the above quote. And integer valued tolerances like ε=1 otherwise. In the case auxiliary variables are necessary, the authors resort to the alive particle filter of Jasra et al. (2013), which main point is to produce an unbiased estimate of the (possibly approximate) likelihood, to be exploited by pseudo-marginal techniques. However, unbiasedness sounds less compelling when moving to approximate methods, as illustrated by the subsequent suggestion to use a more stable estimate of the log-likelihood. In fact, when the tolerance ε is positive, the pMCMC acceptance probability looks quite close to an ABC-MCMC probability when relying on several pseudo-data simulations. Which is unbiased for the “right” approximate target. A fact that may actually holds for *all* ABC algorithms. One quite interesting aspect of the paper is its reflection about the advantage of pseudo-marginal techniques for RJMCMC algorithms since they allow for trans-dimension moves to be simplified, as they consider marginals on the space of interest. Up to this day, I had not realised Andrieu and Roberts (2009) had a section on this aspect… I am still unclear about the derivation of the posterior probabilities of the models under comparison, unless it is a byproduct of the RJMCMC algorithm. A last point is that, for some of the Markov models used in the paper, the pseudo observations can be produced as a random one-time move away from the current true observation, which makes life much easier for ABC and explain why exact simulations can sometimes be produced. (*A side note:* the authors mention on p.326 that EP is only applicable when the posterior is from an exponential family, while my understanding is that it uses an exponential family to approximate the true posterior.)

## at CIRM [#3]

Posted in Kids, Mountains, pictures, Running, Statistics, Travel, University life with tags ABC, ABC-SMC, Bayesian statistics, CIRM, component of a mixture, cross validated, expectation-propagation, high dimensions, identifiability, Luminy, Marseille, MCMC, Mont Puget, Monte Carlo Statistical Methods, particle filter, particle Gibbs sampler, summer school on March 4, 2016 by xi'an**S**imon Barthelmé gave his mini-course on EP, with loads of details on the implementation of the method. Focussing on the EP-ABC and MCMC-EP versions today. Leaving open the difficulty of assessing to which limit EP is converging. But mentioning the potential for asynchronous EP (on which I would like to hear more). Ironically using several times a logistic regression example, if not on the Pima Indians benchmark! He also talked about approximate EP solutions that relate to consensus MCMC. With a connection to Mark Beaumont’s talk at NIPS [at the time as mine!] on the comparison with ABC. While we saw several talks on EP during this week, I am still agnostic about the potential of the approach. It certainly produces a fast proxy to the true posterior and hence can be exploited *ad nauseam* in inference methods based on pseudo-models like indirect inference. In conjunction with other quick and dirty approximations when available. As in ABC, it would be most useful to know how far from the (ideal) posterior distribution does the approximation stands. Machine learning approaches presumably allow for an evaluation of the predictive performances, but less so for the modelling accuracy, even with new sampling steps. [But I know nothing, I know!]

Dennis Prangle presented some on-going research on high dimension [data] ABC. Raising the question of what is the true meaning of dimension in ABC algorithms. Or of sample size. Because the inference relies on the event d(s(y),s(y’))≤ξ or on the likelihood l(θ|x). Both one-dimensional. Mentioning Iain Murray’s talk at NIPS [that I also missed]. Re-expressing as well the perspective that ABC can be seen as a missing or estimated normalising constant problem as in Bornn et al. (2015) I discussed earlier. The central idea is to use SMC to simulate a particle cloud evolving as the target tolerance ξ decreases. Which supposes a latent variable structure lurking in the background.

Judith Rousseau gave her talk on non-parametric mixtures and the possibility to learn parametrically about the component weights. Starting with a rather “magic” result by Allman et al. (2009) that three repeated observations per individual, all terms in a mixture are identifiable. Maybe related to that simpler fact that mixtures of Bernoullis are not identifiable while mixtures of Binomial are identifiable, even when n=2. As “shown” in this plot made for X validated. Actually truly related because Allman et al. (2009) prove identifiability through a finite dimensional model. (I am surprised I missed this most interesting paper!) With the side condition that a mixture of p components made of r Bernoulli products is identifiable when p ≥ 2[log² r] +1, when log² is base 2-logarithm. And [x] the upper rounding. I also find most relevant this distinction between the weights and the remainder of the mixture as weights behave quite differently, hardly parameters in a sense.

## expectation-propagation from Les Houches

Posted in Books, Mountains, pictures, Statistics, University life with tags belief propagation, book review, Chamonix, CHANCE, expectation-propagation, French Alps, Lasso, Les Houches, Oxford University Press, sparsity, summer school on February 3, 2016 by xi'an**A**s CHANCE book editor, I received the other day from Oxford University Press acts from an École de Physique des Houches on Statistical Physics, Optimisation, Inference, and Message-Passing Algorithms that took place there in September 30 – October 11, 2013. While it is mostly unrelated with Statistics, and since Igor Caron already reviewed the book a year and more ago, I skimmed through the few chapters connected to my interest, from Devavrat Shah’s chapter on graphical models and belief propagation, to Andrea Montanari‘s denoising and sparse regression, including LASSO, and only read in some detail Manfred Opper’s expectation propagation chapter. This paper made me realise (or re-realise as I had presumably forgotten an earlier explanation!) that expectation propagation can be seen as a sort of variational approximation that produces by a sequence of iterations the distribution within a certain parametric (exponential) family that is the closest to the distribution of interest. By writing the Kullback-Leibler divergence the opposite way from the usual variational approximation, the solution equates the expectation of the natural sufficient statistic under both models… Another interesting aspect of this chapter is the connection with estimating normalising constants. (I noticed a slight typo on p.269 in the final form of the Kullback approximation q() to p().

## Leave the Pima Indians alone!

Posted in Books, R, Statistics, University life with tags ABC, Bayes factor, benchmark, Chib's approximation, CPU, diabetes, EP-ABC, expectation-propagation, Gibbs sampling, Jim Berger, logistic regression, MCMC algorithms, Monte Carlo Statistical Methods, Newton-Raphson algorithm, Pima Indians, probit model, R on July 15, 2015 by xi'an

“…our findings shall lead to us be critical of certain current practices. Specifically, most papers seem content with comparing some new algorithm with Gibbs sampling, on a few small datasets, such as the well-known Pima Indians diabetes dataset (8 covariates). But we shall see that, for such datasets, approaches that are even more basic than Gibbs sampling are actually hard to beat. In other words, datasets considered in the literature may be too toy-like to be used as a relevant benchmark. On the other hand, if ones considers larger datasets (with say 100 covariates), then not so many approaches seem to remain competitive” (p.1)

**N**icolas Chopin and James Ridgway (CREST, Paris) completed and arXived a paper they had “threatened” to publish for a while now, namely why using the Pima Indian R logistic or probit regression benchmark for checking a computational algorithm is not such a great idea! Given that I am definitely guilty of such a sin (in papers not reported in the survey), I was quite eager to read the reasons why! Beyond the debate on the worth of such a benchmark, the paper considers a wider perspective as to how Bayesian computation algorithms should be compared, including the murky waters of CPU time versus designer or programmer time. Which plays against most MCMC sampler.

As a first entry, Nicolas and James point out that the MAP can be derived by standard a Newton-Raphson algorithm when the prior is Gaussian, and even when the prior is Cauchy as it seems most datasets allow for Newton-Raphson convergence. As well as the Hessian. We actually took advantage of this property in our comparison of evidence approximations published in the Festschrift for Jim Berger. Where we also noticed the awesome performances of an importance sampler based on the Gaussian or Laplace approximation. The authors call this proposal their *gold standard*. Because they also find it hard to beat. They also pursue this approximation to its logical (?) end by proposing an evidence approximation based on the above and Chib’s formula. Two close approximations are provided by INLA for posterior marginals and by a Laplace-EM for a Cauchy prior. Unsurprisingly, the expectation-propagation (EP) approach is also implemented. What EP lacks in theoretical backup, it seems to recover in sheer precision (in the examples analysed in the paper). And unsurprisingly as well the paper includes a randomised quasi-Monte Carlo version of the Gaussian importance sampler. (The authors report that “the improvement brought by RQMC varies strongly across datasets” without elaborating for the reasons behind this variability. They also do not report the CPU time of the IS-QMC, maybe identical to the one for the regular importance sampling.) Maybe more surprising is the absence of a nested sampling version.

In the Markov chain Monte Carlo solutions, Nicolas and James compare Gibbs, Metropolis-Hastings, Hamiltonian Monte Carlo, and NUTS. Plus a tempering SMC, All of which are outperformed by importance sampling for small enough datasets. But get back to competing grounds for large enough ones, since importance sampling then fails.

“…let’s all refrain from now on from using datasets and models that are too simple to serve as a reasonable benchmark.” (p.25)

This is a very nice survey on the theme of binary data (more than on the comparison of algorithms in that the authors do not really take into account design and complexity, but resort to MSEs versus CPus). I however do not agree with their overall message to leave the Pima Indians alone. Or at least not for the reason provided therein, namely that faster and more accurate approximations methods are available and cannot be beaten. Benchmarks always have the limitation of “what you get is what you see”, i.e., the output associated with a single dataset that only has that many idiosyncrasies. Plus, the closeness to a perfect normal posterior makes the logistic posterior too regular to pause a real challenge (even though MCMC algorithms are as usual slower than iid sampling). But having faster and more precise resolutions should on the opposite be cause for cheers, as this provides a reference value, a golden standard, to check against. In a sense, for every Monte Carlo method, there is a much better answer, namely the exact value of the integral or of the optimum! And one is hardly aiming at a more precise inference for the benchmark itself: those Pima Indians [whose actual name is Akimel O’odham] with diabetes involved in the original study are definitely beyond help from statisticians and the model is unlikely to carry out to current populations. When the goal is to compare methods, as in our 2009 paper for Jim Berger’s 60th birthday, what matters is relative speed and relative ease of implementation (besides the obvious convergence to the proper target). In that sense bigger and larger is not always relevant. Unless one tackles really big or really large datasets, for which there is neither benchmark method nor reference value.