## efficient exploration of multi-modal posterior distributions

Posted in Books, Statistics, University life with tags , , , , on September 1, 2014 by xi'an

The title of this recent arXival had potential appeal, however the proposal ends up being rather straightforward and hence  anti-climactic! The paper by Hu, Hendry and Heng proposes to run a mixture of proposals centred at the various modes of  the target for an efficient exploration. This is a correct MCMC algorithm, granted!, but the requirement to know beforehand all the modes to be explored is self-defeating, since the major issue with MCMC is about modes that are  omitted from the exploration and remain undetected throughout the simulation… As provided, this is a standard MCMC algorithm with no adaptive feature and I would rather suggest our population Monte Carlo version, given the available information. Another connection with population Monte Carlo is that I think the performances would improve by Rao-Blackwellising the acceptance rate, i.e. removing the conditioning on the (ancillary) component of the index. For PMC we proved that using the mixture proposal in the ratio led to an ideally minimal variance estimate and I do not see why randomising the acceptance ratio in the current case would bring any improvement.

## PMC for combinatoric spaces

Posted in Statistics, University life with tags , , , , , , , on July 28, 2014 by xi'an

I received this interesting [edited] email from Xiannian Fan at CUNY:

I am trying to use PMC to solve Bayesian network structure learning problem (which is in a combinatorial space, not continuous space).

In PMC, the proposal distributions qi,t can be very flexible, even specific to each iteration and each instance. My problem occurs due to the combinatorial space.

For importance sampling, the requirement for proposal distribution, q, is:

support (p) ⊂ support (q)             (*)

For PMC, what is the support of the proposal distribution in iteration t? is it

support (p) ⊂ U support(qi,t)    (**)

or does (*) apply to every qi,t?

For continuous problem, this is not a big issue. We can use random walk of Normal distribution to do local move satisfying (*). But for combination search, local moving only result in finite states choice, just not satisfying (*). For example for a permutation (1,3,2,4), random swap has only choose(4,2)=6 neighbor states.

Fairly interesting question about population Monte Carlo (PMC), a sequential version of importance sampling we worked on with French colleagues in the early 2000’s.  (The name population Monte Carlo comes from Iba, 2000.)  While MCMC samplers do not have to cover the whole support of p at each iteration, it is much harder for importance samplers as their core justification is to provide an unbiased estimator to for all integrals of interest. Thus, when using the PMC estimate,

1/n ∑i,t {p(xi,t)/qi,t(xi,t)}h(qi,t),  xi,t~qi,t(x)

this estimator is only unbiased when the supports of the qi,t “s are all containing the support of p. The only other cases I can think of are

1. associating the qi,t “s with a partition Si,t of the support of p and using instead

i,t {p(xi,t)/qi,t(xi,t)}h(qi,t), xi,t~qi,t(x)

2. resorting to AMIS under the assumption (**) and using instead

1/n ∑i,t {p(xi,t)/∑j,t qj,t(xi,t)}h(qi,t), xi,t~qi,t(x)

but I am open to further suggestions!

## ABC for design

Posted in Statistics with tags , , , , , , , on August 30, 2013 by xi'an

I wrote a comment on this arXived paper on simulation based design that starts from Müller (1999) and gets an ABC perspective a while ago on my iPad when travelling to Montpellier and then forgot to download it…

Hainy, [Wener] Müller, and Wagner recently arXived a paper called “Likelihood-free Simulation-based Optimal Design“, paper which relies on ABC to construct optimal designs . Remember that [Peter] Müller (1999) uses a natural simulated annealing that is quite similar to our MAP [SAME] algorithm with Arnaud Doucet and Simon Godsill, relying on multiple versions of the data set to get to the maximum. The paper also builds upon our 2006 JASA paper with my then PhD student Billy Amzal, Eric Parent, and Frederic Bois, paper that took advantage of the then emerging particle methods to improve upon a static horizon target. While our method is sequential in that it pursues a moving target, it does not rely on the generic methodology developed by del Moral et al. (2006), where a backward kernel brings more stability to the moves. The paper also implements a version of our population Monte Carlo ABC algorithm (Beaumont et al., 2009), as a first step before an MCMC simulation. Overall, the paper sounds more like a review than like a strongly directive entry into ABC based design in that it remains quite generic. Not that I have specific suggestions, mind!, but I fear a realistic implementation (as opposed to the linear model used in the paper) would require a certain amount of calibration. There are missing references of recent papers using ABC for design, including some by Michael Stumpf I think.

I did not know about the Kuck et al. reference… Which is reproducing our 2006 approach within the del Moral framework. It uses a continuous temperature scale that I find artificial and not that useful, again a maybe superficial comment as I didn’t get very much into the paper … Just that integer powers lead to multiples of the sample and have a nice algorithmic counterpart.

## Initializing adaptive importance sampling with Markov chains

Posted in Statistics with tags , , , , , , , , , , , on May 6, 2013 by xi'an

Another paper recently arXived by Beaujean and Caldwell elaborated on our population Monte Carlo papers (Cappé et al., 2005, Douc et al., 2007, Wraith et al., 2010) to design a more thorough starting distribution. Interestingly, the authors mention the fact that PMC is an EM-type algorithm to emphasize the importance of the starting distribution, as with “poor proposal, PMC fails as proposal updates lead to a consecutively poorer approximation of the target” (p.2). I had not thought of this possible feature of PMC, which indeed proceeds along integrated EM steps, and thus could converge to a local optimum (if not poorer than the start as the Kullback-Leibler divergence decreases).

The solution proposed in this paper is similar to the one we developed in our AMIS paper. An important part of the simulation is dedicated to the construction of the starting distribution, which is a mixture deduced from multiple Metropolis-Hastings runs. I find the method spends an unnecessary long time on refining this mixture by culling the number of components: down-the-shelf clustering techniques should be sufficient, esp. if one considers that the value of the target is available at every simulated point. This has been my pet (if idle) theory for a long while: we do not take (enough) advantage of this informative feature in our simulation methods… I also find the Student’s t versus Gaussian kernel debate (p.6) somehow superfluous: as we shown in Douc et al., 2007, we can process Student’s t distributions so we can as well work with those. And rather worry about the homogeneity assumption this choice implies: working with any elliptically symmetric kernel assumes a local Euclidean structure on the parameter space, for all components, and does not model properly highly curved spaces. Another pet theory of mine’s. As for picking the necessary number of simulations at each PMC iteration, I would add to the ESS and the survival rate of the components a measure of the Kullback-Leibler divergence, as it should decrease at each iteration (with an infinite number of particles).

Another interesting feature is in the comparison with Multinest, the current version of nested sampling, developed by Farhan Feroz. This is the second time I read a paper involving nested sampling in the past two days. While this PMC implementation does better than nested sampling on the examples processed in the paper, the Multinest outcome remains relevant, particularly because it handles multi-modality fairly well. The authors seem to think parallelisation is an issue with nested sampling, while I do see why: at the most naïve stage, several nested samplers can be run in parallel and the outcomes pulled together.

## ABC-MCMC for parallel tempering

Posted in Mountains, pictures, Statistics, Travel, University life with tags , , , , , , , , , , , on February 9, 2012 by xi'an

In this paper a new algorithm combining population-based MCMC methods with ABC requirements is proposed, using an analogy with the Parallel Tempering algorithm (Geyer, 1991).

Another of those arXiv papers that had sat on my to-read pile for way too long: Likelihood-free parallel tempering by Meïli Baragtti, Agnès Grimaud, and Denys Pommeret, from Luminy, Marseilles. The paper mentions our population Monte Carlo (PMC) algorithm (Beaumont et al., 2009) and other ABC-SMC algorithms, but opts instead for an ABC-MCMC basis. The purpose is to build a parallel tempering method. Tolerances and temperatures evolve simultaneously. I however fail to see where the tempering occurs in the algorithm (page 7): there is a set of temperatures T1,….,TN, but they do not appear within the algorithm. My first idea of a tempering mechanism in a likelihood-free setting was to replicate our SAME algorithm (Doucet, Godsill, and Robert, 2004), by creating Tj copies of the [pseudo-]observations to mimic the likelihood taken to the power Tj. But this is annealing, not tempering, and I cannot think of the opposite of copies of the data. Unless of course a power of the likelihood can be simulated (and even then, what would the equivalent be for the data…?) Maybe a natural solution would be to operate some kind of data-attrition, e.g. by subsampling the original vector of observations.

Discussing the issue with Jean-Michel Marin, during a visit to Montpellier today, I realised that the true tempering came from the tolerances εi, while the temperatures Tj were there to calibrate the proposal distributions. And that the major innovation contained in the thesis (if not so clearly in the paper) was to boost exchanges between different tolerances, improving upon the regular ABC-MCMC sampler by an equi-energy move.

## multiple try/point Metropolis algorithm

Posted in Statistics, Travel with tags , , , , , on January 23, 2012 by xi'an

Among the arXiv documents I printed at the turn of the year in order to get a better look at them (in the métro if nowhere else!), there were two papers by Luca Martino and co-authors from Universidad Carlos III, Madrid, A multi-point Metropolis scheme with generic weight functions and Different acceptance functions for multiple try Metropolis schemes. The multiple-try algorithm sounds like another version of the delayed sampling algorithm of Tierney and Mira (1999) and Green and Mira (2001). I somehow missed it, even though it was introduced in Liu et al. (2000) and Quin and Liu (2001). Multiple-try Metropolis builds upon the idea that, instead of making one proposal at a time, it is feasible to build a sequence of proposals and to pick one among those, presumably rather likely and hence more open to being accepted. The sequence of proposals may depend upon the past propositions as well as on the current value, lending some degree of adaptability to the scheme. In the current implementation, the algorithm remains rather clumsy [in my opinion] in that (a) a fixed horizon N need be fixed and (b) an additional series of backward simulations need be produced simply to keep the balance equation happy… Hence a total number of simulations O(N) for one possible acceptance. The first note slightly extends Quin and Liu (2001) by using a fairly general weighting scheme. The second paper studies some particular choices for the weights in a much less adaptive scheme (where parallelisation would be an appropriate alternative, since each proposal in the multiple try only depends on the current value of the chain). But it does not demonstrate a more efficient behaviour than when using a cycle or a mixture of Metropolis-Hastings algorithms. The method seems to regain popularity, though, as Roberto Casarin, Radu Craiu and Fabrizio Leisen (also from Carlos III)  arXived a paper on a multiple-try algorithm, connected with population Monte Carlo, and more recently published it in Statistics and Computing.

## WSC [2]011

Posted in Mountains, pictures, Running, Statistics, Travel, University life with tags , , , , , , , , , , , , on December 15, 2011 by xi'an

Last day at WSC 2011: as it was again raining, I could not run a second time into the South Mountain Preserve park. (I had a swim at 5am instead and ended up having a nice chat with an old man in the pool under the rain!) My first morning session was rather disappointing with two talks that remained at such a high level of generality as to be useless and a mathematical talk about new forms of stochastic approximation that included proofs and no indication on the calibration of its many parameters. During the coffee break, I tried to have a chat with a vendor of a simulation software but we were using so different vocabularies that I soon gave up. (A lot of the software on display was a-statistical in that users would build a network, specify all parameters, incl. the distributions at the different nodes and start calibrating those parameters towards a behaviour that suited them.) The second session was much more in my area of interest/expertise, with Paul Dupuis giving a talk in the same spirit as the one he gave in New York last September. using large deviations and importance sampling on diffusions. Both following talks were about specially designed importance sampling techniques for rare events and about approximating the zero variance optimal importance function: Yixi Shin gave a talk on cross-entropy based selection of mixtures for the simulation of tail events, connecting somehow with the talk on mixtures of importance sampling distributions I attended yesterday. Although I am afraid I dozed a while during the talk, it was an interesting mix with the determination of the weights by cross-entropy arguments reminded me of what we did for the population Monte Carlo approach (since it also involved some adaptive entropy optimisation). Zdravko Botev gave a talk on approximating the ideal zero variance importance function by MCMC and a sort of Rao-Blackwell estimator that gives an unbiased estimator of this density under stationarity. Then it was time to leave for the airport (and wait in a Starbucks for the plane to Minneapolis and then London to depart, as there is no such thing as a lounge in Phoenix airport…). I had an interesting exchange with a professional magician in the first plane, The Amazing Hondo!, who knew about Persi and was a former math teacher. He explained a few tricks to me, plus showed me his indeed amazing sleight of hands in manipulating cards. In exchange, I took Persi’s book on Magic and Mathematics out of my bag so that he could have look at it. (The trip to London was completely uneventful as I slept most of the way.)

Overall, WSC 2011 was an interesting experience in that (a) the talks I attended on zero variance importance simulation set me thinking again on potential applications of the apparently useless optimality result; (b) it showed me that most people using simulation do not, N.O.T., relate to Monte Carlo techniques (to the extent of being completely foreign to my domains of expertise); and (c) among the parallel sessions that cover military applications, health care simulation, &tc., there always is a theme connecting to mines, which means that I will find sessions to attend when taking part in WSC 2012 in Berlin next year (since I have been invited for a talk). This will be the first time WSC is held outside North America. Hopefully, this will attract simulation addicts from Europe as well as elsewhere.