Archive for population Monte Carlo

Initializing adaptive importance sampling with Markov chains

Posted in Statistics with tags , , , , , , , , , , , on May 6, 2013 by xi'an

Another paper recently arXived by Beaujean and Caldwell elaborated on our population Monte Carlo papers (Cappé et al., 2005, Douc et al., 2007, Wraith et al., 2010) to design a more thorough starting distribution. Interestingly, the authors mention the fact that PMC is an EM-type algorithm to emphasize the importance of the starting distribution, as with “poor proposal, PMC fails as proposal updates lead to a consecutively poorer approximation of the target” (p.2). I had not thought of this possible feature of PMC, which indeed proceeds along integrated EM steps, and thus could converge to a local optimum (if not poorer than the start as the Kullback-Leibler divergence decreases).

The solution proposed in this paper is similar to the one we developed in our AMIS paper. An important part of the simulation is dedicated to the construction of the starting distribution, which is a mixture deduced from multiple Metropolis-Hastings runs. I find the method spends an unnecessary long time on refining this mixture by culling the number of components: down-the-shelf clustering techniques should be sufficient, esp. if one considers that the value of the target is available at every simulated point. This has been my pet (if idle) theory for a long while: we do not take (enough) advantage of this informative feature in our simulation methods… I also find the Student’s t versus Gaussian kernel debate (p.6) somehow superfluous: as we shown in Douc et al., 2007, we can process Student’s t distributions so we can as well work with those. And rather worry about the homogeneity assumption this choice implies: working with any elliptically symmetric kernel assumes a local Euclidean structure on the parameter space, for all components, and does not model properly highly curved spaces. Another pet theory of mine’s. As for picking the necessary number of simulations at each PMC iteration, I would add to the ESS and the survival rate of the components a measure of the Kullback-Leibler divergence, as it should decrease at each iteration (with an infinite number of particles).

Another interesting feature is in the comparison with Multinest, the current version of nested sampling, developed by Farhan Feroz. This is the second time I read a paper involving nested sampling in the past two days. While this PMC implementation does better than nested sampling on the examples processed in the paper, the Multinest outcome remains relevant, particularly because it handles multi-modality fairly well. The authors seem to think parallelisation is an issue with nested sampling, while I do see why: at the most naïve stage, several nested samplers can be run in parallel and the outcomes pulled together.

ABC-MCMC for parallel tempering

Posted in Mountains, pictures, Statistics, Travel, University life with tags , , , , , , , , , , , on February 9, 2012 by xi'an

In this paper a new algorithm combining population-based MCMC methods with ABC requirements is proposed, using an analogy with the Parallel Tempering algorithm (Geyer, 1991).

Another of those arXiv papers that had sat on my to-read pile for way too long: Likelihood-free parallel tempering by Meïli Baragtti, Agnès Grimaud, and Denys Pommeret, from Luminy, Marseilles. The paper mentions our population Monte Carlo (PMC) algorithm (Beaumont et al., 2009) and other ABC-SMC algorithms, but opts instead for an ABC-MCMC basis. The purpose is to build a parallel tempering method. Tolerances and temperatures evolve simultaneously. I however fail to see where the tempering occurs in the algorithm (page 7): there is a set of temperatures T1,….,TN, but they do not appear within the algorithm. My first idea of a tempering mechanism in a likelihood-free setting was to replicate our SAME algorithm (Doucet, Godsill, and Robert, 2004), by creating Tj copies of the [pseudo-]observations to mimic the likelihood taken to the power Tj. But this is annealing, not tempering, and I cannot think of the opposite of copies of the data. Unless of course a power of the likelihood can be simulated (and even then, what would the equivalent be for the data…?) Maybe a natural solution would be to operate some kind of data-attrition, e.g. by subsampling the original vector of observations.

Discussing the issue with Jean-Michel Marin, during a visit to Montpellier today, I realised that the true tempering came from the tolerances εi, while the temperatures Tj were there to calibrate the proposal distributions. And that the major innovation contained in the thesis (if not so clearly in the paper) was to boost exchanges between different tolerances, improving upon the regular ABC-MCMC sampler by an equi-energy move.

multiple try/point Metropolis algorithm

Posted in Statistics, Travel with tags , , , , , on January 23, 2012 by xi'an

Among the arXiv documents I printed at the turn of the year in order to get a better look at them (in the métro if nowhere else!), there were two papers by Luca Martino and co-authors from Universidad Carlos III, Madrid, A multi-point Metropolis scheme with generic weight functions and Different acceptance functions for multiple try Metropolis schemes. The multiple-try algorithm sounds like another version of the delayed sampling algorithm of Tierney and Mira (1999) and Green and Mira (2001). I somehow missed it, even though it was introduced in Liu et al. (2000) and Quin and Liu (2001). Multiple-try Metropolis builds upon the idea that, instead of making one proposal at a time, it is feasible to build a sequence of proposals and to pick one among those, presumably rather likely and hence more open to being accepted. The sequence of proposals may depend upon the past propositions as well as on the current value, lending some degree of adaptability to the scheme. In the current implementation, the algorithm remains rather clumsy [in my opinion] in that (a) a fixed horizon N need be fixed and (b) an additional series of backward simulations need be produced simply to keep the balance equation happy… Hence a total number of simulations O(N) for one possible acceptance. The first note slightly extends Quin and Liu (2001) by using a fairly general weighting scheme. The second paper studies some particular choices for the weights in a much less adaptive scheme (where parallelisation would be an appropriate alternative, since each proposal in the multiple try only depends on the current value of the chain). But it does not demonstrate a more efficient behaviour than when using a cycle or a mixture of Metropolis-Hastings algorithms. The method seems to regain popularity, though, as Roberto Casarin, Radu Craiu and Fabrizio Leisen (also from Carlos III)  arXived a paper on a multiple-try algorithm, connected with population Monte Carlo, and more recently published it in Statistics and Computing.

WSC [2]011

Posted in Mountains, pictures, Running, Statistics, Travel, University life with tags , , , , , , , , , , , , on December 15, 2011 by xi'an

Last day at WSC 2011: as it was again raining, I could not run a second time into the South Mountain Preserve park. (I had a swim at 5am instead and ended up having a nice chat with an old man in the pool under the rain!) My first morning session was rather disappointing with two talks that remained at such a high level of generality as to be useless and a mathematical talk about new forms of stochastic approximation that included proofs and no indication on the calibration of its many parameters. During the coffee break, I tried to have a chat with a vendor of a simulation software but we were using so different vocabularies that I soon gave up. (A lot of the software on display was a-statistical in that users would build a network, specify all parameters, incl. the distributions at the different nodes and start calibrating those parameters towards a behaviour that suited them.) The second session was much more in my area of interest/expertise, with Paul Dupuis giving a talk in the same spirit as the one he gave in New York last September. using large deviations and importance sampling on diffusions. Both following talks were about specially designed importance sampling techniques for rare events and about approximating the zero variance optimal importance function: Yixi Shin gave a talk on cross-entropy based selection of mixtures for the simulation of tail events, connecting somehow with the talk on mixtures of importance sampling distributions I attended yesterday. Although I am afraid I dozed a while during the talk, it was an interesting mix with the determination of the weights by cross-entropy arguments reminded me of what we did for the population Monte Carlo approach (since it also involved some adaptive entropy optimisation). Zdravko Botev gave a talk on approximating the ideal zero variance importance function by MCMC and a sort of Rao-Blackwell estimator that gives an unbiased estimator of this density under stationarity. Then it was time to leave for the airport (and wait in a Starbucks for the plane to Minneapolis and then London to depart, as there is no such thing as a lounge in Phoenix airport…). I had an interesting exchange with a professional magician in the first plane, The Amazing Hondo!, who knew about Persi and was a former math teacher. He explained a few tricks to me, plus showed me his indeed amazing sleight of hands in manipulating cards. In exchange, I took Persi’s book on Magic and Mathematics out of my bag so that he could have look at it. (The trip to London was completely uneventful as I slept most of the way.)

Overall, WSC 2011 was an interesting experience in that (a) the talks I attended on zero variance importance simulation set me thinking again on potential applications of the apparently useless optimality result; (b) it showed me that most people using simulation do not, N.O.T., relate to Monte Carlo techniques (to the extent of being completely foreign to my domains of expertise); and (c) among the parallel sessions that cover military applications, health care simulation, &tc., there always is a theme connecting to mines, which means that I will find sessions to attend when taking part in WSC 2012 in Berlin next year (since I have been invited for a talk). This will be the first time WSC is held outside North America. Hopefully, this will attract simulation addicts from Europe as well as elsewhere.

On optimality of kernels for ABC-SMC

Posted in Statistics, University life with tags , , , , , , , , , , on December 11, 2011 by xi'an

This freshly arXived paper by Sarah Filippi, Chris Barnes, Julien Cornebise, and Michael Stumpf, is in the lineage of our 2009 Biometrika ABC-PMC (population Monte Carlo) paper with Marc Beaumont, Jean-Marie Cornuet and Jean-Michel Marin. (I actually missed the first posting while in Berlin last summer. Flying to Utah gave me the opportunity to read it at length!)  The  paper focusses on the impact of the transition kernel in our PMC scheme: while we used component-wise adaptive proposals, the paper studies multivariate adaptivity with a covariance matrix adapted from the whole population, or locally or from an approximation to the information matrix. The simulation study run in the paper shows that, even when accounting for the additional cost due to the derivation of the matrix, the multivariate adaptation can improve the acceptance rate by a fair amount. So this is an interesting and positive sequel to our paper (that I may well end up refereeing one of those days, like an earlier paper from some of the authors!)

The main criticism I may have about the paper is that the selection of the tolerance sequence is not done in an adaptive way, while it could, given the recent developments of Del Moral et al. and of Drovandri and Pettitt (as well as our even more recent still-un-arXived submission to Stat & Computing!). While the target is the same for all transition kernels, thus the comparison still makes sense as is, the final product is to build a complete adaptive scheme that comes as close as possible to the genuine posterior.

This paper also raised a new question: there is a slight distinction between the Kullback-Leibler divergence we used and the Kullback-Leibler divergence the authors use here. (In fact, we do not account for the change in the tolerance.) Now, since what only matters is the distribution of the current particles, and while the distribution on the past particles is needed to compute the double integral leading to the divergence, there is a complete freedom in the choice of this past distribution. As in Del Moral et al., the distribution L(θ:t-1t) could therefore be chosen towards an optimal acceptance rate or something akin. I wonder if anyone ever looked at this…


Follow

Get every new post delivered to your Inbox.

Join 338 other followers