## Initializing adaptive importance sampling with Markov chains

Posted in Statistics on May 6, 2013 by xi'an

Another paper recently arXived, by Beaujean and Caldwell, elaborates on our population Monte Carlo papers (Cappé et al., 2005, Douc et al., 2007, Wraith et al., 2010) to design a more thorough starting distribution. Interestingly, the authors stress that PMC is an EM-type algorithm in order to emphasize the importance of the starting distribution: with a “poor proposal, PMC fails as proposal updates lead to a consecutively poorer approximation of the target” (p.2). I had not thought of this possible feature of PMC, which indeed proceeds along integrated EM steps and thus could converge to a local optimum (though not one poorer than the start, since the Kullback-Leibler divergence decreases).

The solution proposed in this paper is similar to the one we developed in our AMIS paper. An important part of the simulation is dedicated to the construction of the starting distribution, a mixture deduced from multiple Metropolis-Hastings runs. I find the method spends an unnecessarily long time refining this mixture by culling the number of components: off-the-shelf clustering techniques should be sufficient, especially if one considers that the value of the target is available at every simulated point. This has been my pet (if idle) theory for a long while: we do not take (enough) advantage of this informative feature in our simulation methods… I also find the Student’s t versus Gaussian kernel debate (p.6) somewhat superfluous: as we showed in Douc et al. (2007), we can process Student’s t distributions, so we may as well work with those. And rather worry about the homogeneity assumption this choice implies: working with any elliptically symmetric kernel assumes a local Euclidean structure on the parameter space, for all components, and does not model highly curved spaces properly. Another pet theory of mine. As for picking the necessary number of simulations at each PMC iteration, I would add to the ESS and the survival rate of the components a measure of the Kullback-Leibler divergence, as it should decrease at each iteration (with an infinite number of particles).
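Diagnostics of this kind (ESS, component survival) are easy to monitor in a bare-bones PMC run. Below is a minimal sketch of adaptive importance sampling with a Gaussian mixture proposal and a Rao-Blackwellised update in the spirit of Cappé et al.; the bimodal target, the initial components, and all tuning constants are my own toy choices, not taken from any of the papers discussed:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy bimodal target, known up to a normalising constant
def target_pdf(x):
    return 0.5 * np.exp(-0.5 * (x + 3) ** 2) + 0.5 * np.exp(-0.5 * (x - 3) ** 2)

def norm_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

# initial mixture proposal: three deliberately mis-placed components
alpha = np.full(3, 1.0 / 3.0)
mu = np.array([-5.0, 0.0, 5.0])
var = np.full(3, 4.0)
T = 5000

for it in range(5):
    # sample from the current mixture proposal
    comp = rng.choice(3, size=T, p=alpha)
    x = rng.normal(mu[comp], np.sqrt(var[comp]))
    # importance weights and effective sample size
    q = (alpha * norm_pdf(x[:, None], mu, var)).sum(axis=1)
    w = target_pdf(x) / q
    wbar = w / w.sum()
    ess = 1.0 / np.sum(wbar ** 2)
    # Rao-Blackwellised update of the mixture parameters
    rho = alpha * norm_pdf(x[:, None], mu, var)
    rho /= rho.sum(axis=1, keepdims=True)
    wr = wbar[:, None] * rho
    alpha = wr.sum(axis=0)
    mu = (wr * x[:, None]).sum(axis=0) / alpha
    var = (wr * (x[:, None] - mu) ** 2).sum(axis=0) / alpha
    var = np.maximum(var, 1e-3)      # guard against component collapse
    alpha /= alpha.sum()             # renormalise against rounding drift

est_mean = np.sum(wbar * x)  # self-normalised estimate of E[X] (truth: 0)
```

On this toy target, tracking the ESS along with the self-normalised estimate of E[X] (zero by symmetry) gives a quick check that the proposal updates are not degrading the approximation of the target.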

Another interesting feature is the comparison with MultiNest, the current version of nested sampling, developed by Farhan Feroz. This is the second paper involving nested sampling I have read in the past two days. While this PMC implementation does better than nested sampling on the examples processed in the paper, the MultiNest outcome remains relevant, particularly because it handles multi-modality fairly well. The authors seem to think parallelisation is an issue with nested sampling, though I do not see why: at the most naïve level, several nested samplers can be run in parallel and the outcomes pooled together.

## Harmonic means, again again

Posted in Books, R, Statistics, University life on January 10, 2012 by xi'an

Another arXiv posting I had had no time to comment on is Nial Friel and Jason Wyse’s “Estimating the model evidence: a review“. This is a review in the spirit of two of our papers, “Importance sampling methods for Bayesian discrimination between embedded models” with Jean-Michel Marin (published in the Jim Berger Festschrift, Frontiers of Statistical Decision Making and Bayesian Analysis: In Honor of James O. Berger, but not mentioned in the review) and “Computational methods for Bayesian model choice” with Darren Wraith (referred to by the review). Indeed, it considers a series of competing computational methods for approximating the evidence, aka the marginal likelihood:

The paper correctly points out the difficulty with the naïve harmonic mean estimator. (But it does not cover the extension to the finite-variance solutions found in “Importance sampling methods for Bayesian discrimination between embedded models” and in “Computational methods for Bayesian model choice“.) It also misses the whole collection of bridge and umbrella sampling techniques covered in, e.g., Chen, Shao and Ibrahim (2000). In their numerical evaluations of the methods, the authors use the Pima Indian diabetes dataset we also used in “Importance sampling methods for Bayesian discrimination between embedded models“. The outcome is that the Laplace approximation does extremely well in this case (because the posterior is very close to normal), Chib’s method being a very near second. The harmonic mean estimator does extremely poorly (not a surprise!) and the nested sampling approximation is not as accurate as the other (non-harmonic) methods. If we compare with our 2009 study, importance sampling based on the normal approximation (almost the truth!) did best, followed by our harmonic mean solution based on the same normal approximation. (Chib’s solution was then third, with a standard deviation ten times larger.)
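The strength of the Laplace approximation when the posterior is nearly normal is easy to reproduce. Here is a minimal sketch on a conjugate normal toy model of my own (not the Pima data), where the log joint is exactly quadratic, so the Laplace evidence agrees with a quadrature benchmark essentially to machine precision:

```python
import numpy as np

rng = np.random.default_rng(4)
y = rng.normal(0.5, 1.0, size=20)
n = len(y)

def log_joint(theta):
    # log prior N(0,1) + log likelihood of y under N(theta,1)
    ll = -0.5 * n * np.log(2 * np.pi) \
         - 0.5 * ((y[None, :] - theta[:, None]) ** 2).sum(axis=1)
    return ll - 0.5 * np.log(2 * np.pi) - 0.5 * theta ** 2

# mode and curvature of the log joint (analytic for this conjugate model)
mode = y.sum() / (n + 1.0)
hess = n + 1.0  # minus the second derivative of log_joint

# Laplace approximation: log h(mode) + (1/2) log(2*pi / |H|)
log_Z_laplace = log_joint(np.array([mode]))[0] + 0.5 * np.log(2 * np.pi / hess)

# brute-force quadrature benchmark for the evidence
grid = np.linspace(mode - 10, mode + 10, 200001)
log_Z_quad = np.log(np.trapz(np.exp(log_joint(grid)), grid))
```

The agreement is exact here only because the posterior is Gaussian; on a genuinely skewed posterior, the same two lines would quantify the Laplace error instead.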

## Random construction of interpolating sets

Posted in Kids, Statistics, University life on January 5, 2012 by xi'an

One of the many arXiv papers I could not discuss earlier is Huber and Schott’s “Random construction of interpolating sets for high dimensional integration”, which relates to their earlier TPA paper at the València meeting. (A paper we discussed with Nicolas Chopin.) TPA stands for tootsie pop algorithm. The paper is very pleasant to read, just like its predecessor. The principle behind TPA is that the number of steps in the algorithm is Poisson with a parameter connected to the unknown measure of the inner set:

$N\sim\mathcal{P}(\ln[\mu(B)/\mu(B^\prime)])$

Therefore, the variance of the estimator is known as well. This is a significant property of a mathematically elegant solution. As already argued in our earlier discussion, it seems to me, however, that the paper defends an integral approximation that is far from realistic. Indeed, the TPA method fundamentally requires the ability to simulate from a measure μ restricted to a level set A(β). Exact simulation seems close to impossible in any realistic problem, just as in Skilling’s (2006) nested sampling. Furthermore, the comparison with nested sampling is dismissed rather summarily: that the variance of this alternative cannot be computed “prior to running the algorithm” does not mean it is larger than that of the TPA method. If the proposal is to become a realistic algorithm, some degree of comparison with existing methods should appear in the paper. (A further, if minor, comment about the introduction is that the reason for picking the ideal relative balance α=0.2031 between the embedded sets is not clear. Not that it really matters in the implementation, unless Section 5 on well-balanced sets is connected with this ideal ratio…)
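The Poisson property is simple to check in a toy case where the restricted simulation is trivial: nested balls under Lebesgue measure, for which μ(B)/μ(B′) = (r/r′)^d. The set-up below is my own illustration, not one of the paper's examples:

```python
import numpy as np

rng = np.random.default_rng(1)

def tpa_run(r_outer, r_inner, d, rng):
    # one TPA run between nested balls B(0, r_outer) and B'(0, r_inner):
    # draw X uniformly in the current ball; the next ball has radius |X|
    r, draws = r_outer, 0
    while r > r_inner:
        r *= rng.random() ** (1.0 / d)  # |X|/r ~ U^(1/d) for X uniform in a d-ball
        draws += 1
    return draws - 1  # the final draw, landing inside B', is not counted

d, r_out, r_in = 2, 1.0, 0.5
lam = d * np.log(r_out / r_in)  # = ln[mu(B)/mu(B')]
runs = np.array([tpa_run(r_out, r_in, d, rng) for _ in range(20000)])
```

Both the empirical mean and the empirical variance of the step counts should match ln[μ(B)/μ(B′)], which is exactly the property that makes the variance of the TPA estimator available in advance.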

## recent arXiv postings

Posted in Statistics, University life on October 17, 2011 by xi'an

Three interesting recent arXiv postings and not enough time to read them all and in the ‘Og bind them! (Of course, comments from readers welcome!)

Formulating a statistical inverse problem as one of inference in a Bayesian model has great appeal, notably for what this brings in terms of coherence, the interpretability of regularisation penalties, the integration of all uncertainties, and the principled way in which the set-up can be elaborated to encompass broader features of the context, such as measurement error, indirect observation, etc. The Bayesian formulation comes close to the way that most scientists intuitively regard the inferential task, and in principle allows the free use of subject knowledge in probabilistic model building. However, in some problems where the solution is not unique, for example in ill-posed inverse problems, it is important to understand the relationship between the chosen Bayesian model and the resulting solution. Taking emission tomography as a canonical example for study, we present results about consistency of the posterior distribution of the reconstruction, and a general method to study convergence of posterior distributions. To study efficiency of Bayesian inference for ill-posed linear inverse problems with constraint, we prove a version of the Bernstein-von Mises theorem for nonregular Bayesian models.

(Certainly unlikely to please the member of the audience in Zürich who questioned my Bayesian credentials for considering “true” models and consistency….)

Recently, Andrieu, Doucet and Holenstein (2010) introduced a general framework for using particle filters (PFs) to construct proposal kernels for Markov chain Monte Carlo (MCMC) methods. This framework, termed Particle Markov chain Monte Carlo (PMCMC), was shown to provide powerful methods for joint Bayesian state and parameter inference in nonlinear/non-Gaussian state-space models. However, the mixing of the resulting MCMC kernels can be quite sensitive, both to the number of particles used in the underlying PF and to the number of observations in the data. In this paper we suggest alternatives to the three PMCMC methods introduced in Andrieu et al. (2010), which are much more robust to a low number of particles as well as a large number of observations. We consider some challenging inference problems and show in a simulation study that, for problems where existing PMCMC methods require around 1000 particles, the proposed methods provide satisfactory results with as few as 5 particles.

(I have not read the paper in enough depth to be critical; however, “hard” figures like 5, or 10³, are always suspicious in that they cannot carry over to the general case…)
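The sensitivity to the number of particles is easy to observe with a bootstrap particle filter on the benchmark nonlinear state-space model used in Andrieu et al. (2010). The sketch below (my own minimal implementation, with toy run lengths and replication counts) compares the variability of the log-likelihood estimate for small versus moderate particle numbers:

```python
import numpy as np

def simulate(T, rng):
    # benchmark nonlinear state-space model of Andrieu et al. (2010)
    x, y = 0.0, np.zeros(T)
    for t in range(T):
        x = 0.5 * x + 25 * x / (1 + x ** 2) + 8 * np.cos(1.2 * t) \
            + rng.normal(0, np.sqrt(10))
        y[t] = x ** 2 / 20 + rng.normal(0, 1)
    return y

def pf_loglik(y, N, rng):
    # bootstrap particle filter estimate of the log-likelihood
    x = rng.normal(0, np.sqrt(10), N)
    ll = 0.0
    for t, yt in enumerate(y):
        x = 0.5 * x + 25 * x / (1 + x ** 2) + 8 * np.cos(1.2 * t) \
            + rng.normal(0, np.sqrt(10), N)
        logw = -0.5 * np.log(2 * np.pi) - 0.5 * (yt - x ** 2 / 20) ** 2
        m = logw.max()
        w = np.exp(logw - m)
        ll += m + np.log(w.mean())             # log-sum-exp for stability
        x = x[rng.choice(N, size=N, p=w / w.sum())]  # multinomial resampling
    return ll

rng = np.random.default_rng(2)
y = simulate(50, rng)
ll5 = np.array([pf_loglik(y, 5, rng) for _ in range(40)])
ll100 = np.array([pf_loglik(y, 100, rng) for _ in range(40)])
```

The spread of the 5-particle estimates dwarfs that of the 100-particle ones, which is exactly the noise that degrades the mixing of a PMCMC kernel built on so few particles.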

In this paper we present an algorithm for rapid Bayesian analysis that combines the benefits of nested sampling and artificial neural networks. The blind accelerated multimodal Bayesian inference (BAMBI) algorithm implements the MultiNest package for nested sampling as well as the training of an artificial neural network (NN) to learn the likelihood function. In the case of computationally expensive likelihoods, this allows the substitution of a much more rapid approximation in order to increase significantly the speed of the analysis. We begin by demonstrating, with a few toy examples, the ability of a NN to learn complicated likelihood surfaces. BAMBI’s ability to decrease running time for Bayesian inference is then demonstrated in the context of estimating cosmological parameters from WMAP and other observations. We show that valuable speed increases are achieved in addition to obtaining NNs trained on the likelihood functions for the different model and data combinations. These NNs can then be used for an even faster follow-up analysis using the same likelihood and different priors. This is a fully general algorithm that can be applied, without any pre-processing, to other problems with computationally expensive likelihood functions.

(This is primarily an astronomy paper that uses a sample produced by the nested sampling algorithm MultiNest to train a neural network that stands in for the model likelihood. The algorithm thus requires the likelihood to be available at some stage.)
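The underlying idea — fit a regression model to pairs (θ, log L(θ)) gathered during sampling, then query the cheap surrogate instead of the expensive likelihood — can be sketched with a tiny one-hidden-layer network in plain numpy. Everything below (the synthetic likelihood surface, the architecture, the training loop) is my own toy stand-in for BAMBI's actual machinery:

```python
import numpy as np

rng = np.random.default_rng(4)

# toy "expensive" log-likelihood surface to be learned
def loglik(theta):
    return -0.5 * (theta - 1.0) ** 2 / 0.3

# training pairs (theta, log L(theta)), as would be collected during sampling
theta = rng.uniform(-3, 5, size=(256, 1))
target = loglik(theta)

# one-hidden-layer network, tanh activation, full-batch gradient descent
H = 32
W1 = rng.normal(0, 1.0, (1, H)); b1 = np.zeros(H)
W2 = rng.normal(0, 0.1, (H, 1)); b2 = np.zeros(1)
lr = 1e-3
losses = []
for step in range(5000):
    h = np.tanh(theta @ W1 + b1)          # hidden layer
    pred = h @ W2 + b2                    # surrogate log-likelihood
    err = pred - target
    losses.append(float(np.mean(err ** 2)))
    # backpropagation of the mean squared error
    gpred = 2 * err / len(theta)
    gW2 = h.T @ gpred; gb2 = gpred.sum(0)
    gh = (gpred @ W2.T) * (1 - h ** 2)
    gW1 = theta.T @ gh; gb1 = gh.sum(0)
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1
```

Once trained, evaluating the network is orders of magnitude cheaper than a real cosmological likelihood, which is the whole point of the substitution; the catch, as noted above, is that the true likelihood must still be evaluated to produce the training set.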

## Typo in Biometrika 97(3): 747

Posted in Statistics, Travel, University life on November 4, 2010 by xi'an

Yesterday while in Philadelphia airport I got this email from Javier Rubio:

Prof. Robert,

I am a first year PhD student in the University of Warwick. I was reading your paper “Properties of nested sampling” which I found very interesting.  I have a question about it. Is the second equation in pp. 747 correct?
I think this equation is related with the first equation in pp. 12 from your paper “Importance sampling methods for Bayesian discrimination between embedded models”.
Kind regards,
Javier.

Indeed, there is a typo in that this formula (page 747) should be

$\widehat{Z}_1=1\bigg/\left\{\dfrac{1}{T}\,\sum_{t=1}^T {g(\theta^{(t)}) }\big/\pi(\theta^{(t)})L(\theta^{(t)})\right\}$

(I checked my LaTeX code and there is a trace of a former \dfrac that got erased but was not replaced with a \big/ symbol… I am quite sorry for the typo, all the more so because this paper went through many revisions.) There is no typo in the corresponding chapter of Frontiers of Statistical Decision Making and Bayesian Analysis: In Honor of James O. Berger.
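The corrected estimator is easy to validate on a conjugate toy model where the evidence is available by quadrature. In the sketch below (my own example), the posterior draws are exact rather than MCMC output, and g is a normal density with lighter tails than the posterior, which keeps the variance of the estimator finite:

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.normal(0.5, 1.0, size=20)
n = len(y)

def log_joint(theta):
    # log pi(theta) + log L(theta): N(0,1) prior, N(theta,1) likelihood
    ll = -0.5 * n * np.log(2 * np.pi) \
         - 0.5 * ((y[None, :] - theta[:, None]) ** 2).sum(axis=1)
    return ll - 0.5 * np.log(2 * np.pi) - 0.5 * theta ** 2

# the posterior is N(m, s2) here, so the "simulation" step is exact
s2 = 1.0 / (n + 1.0)
m = y.sum() / (n + 1.0)
T = 100000
theta = rng.normal(m, np.sqrt(s2), size=T)

# instrumental g = N(m, s2/2): lighter tails than the posterior
log_g = -0.5 * np.log(np.pi * s2) - (theta - m) ** 2 / s2

# hat Z = 1 / mean{ g(theta) / [pi(theta) L(theta)] }, via log-sum-exp
log_w = log_g - log_joint(theta)
c = log_w.max()
log_Z_hat = -(c + np.log(np.mean(np.exp(log_w - c))))

# quadrature benchmark for the evidence
grid = np.linspace(m - 10, m + 10, 200001)
log_Z = np.log(np.trapz(np.exp(log_joint(grid)), grid))
```

Replacing g by a density with heavier tails than the posterior (e.g. the prior, which recovers the naïve harmonic mean) would destroy this agreement: the estimator would remain consistent but with unbounded or infinite variance.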