## Archive for population Monte Carlo

Posted in Books, pictures, Statistics, University life with tags , , , , , , on July 30, 2015 by xi'an

Ingmar Schuster, who visited Paris-Dauphine last Spring (and is soon to return here as a postdoc funded by Fondation des Sciences Mathématiques de Paris) has arXived last week a paper on gradient importance sampling. In this paper, he builds a sequential importance sampling (or population Monte Carlo) algorithm that exploits the additional information contained in the gradient of the target. The proposal or importance function being essentially the MALA move as its proposal, mixed across the elements of the previous population. When compared with our original PMC mixture of random walk proposals found in e.g. this paper, each term in the mixture thus involves an extra gradient, with a scale factor that decreases to zero as 1/t√t. Ingmar compares his proposal with an adaptive Metropolis, an adaptive MALTa and an HM algorithms, for two mixture distributions and the banana target of Haario et al. (1999) we also used in our paper. As well as a logistic regression. In each case, he finds both a smaller squared error and a smaller bias for the same computing time (evaluated as the number of likelihood evaluations). While we discussed this scheme when he visited, I remain intrigued as to why it works so well when compared with the other solutions. One possible explanation is that the use of the gradient drift is more efficient on a population of particles than on a single Markov chain, provided the population covers all modes of importance on the target surface: the “fatal” attraction of the local model is then much less of an issue…

## Bayesian model averaging in astrophysics

Posted in Books, Statistics, University life with tags , , , , , , , , , , on July 29, 2015 by xi'an

[A 2013 post that somewhat got lost in a pile of postponed entries and referee’s reports…]

In this review paper, now published in Statistical Analysis and Data Mining 6, 3 (2013), David Parkinson and Andrew R. Liddle go over the (Bayesian) model selection and model averaging perspectives. Their argument in favour of model averaging is that model selection via Bayes factors may simply be too inconclusive to favour one model and only one model. While this is a correct perspective, this is about it for the theoretical background provided therein. The authors then move to the computational aspects and the first difficulty is their approximation (6) to the evidence

$P(D|M) = E \approx \frac{1}{n} \sum_{i=1}^n L(\theta_i)Pr(\theta_i)\, ,$

where they average the likelihood x prior terms over simulations from the posterior, which does not provide a valid (either unbiased or converging) approximation. They surprisingly fail to account for the huge statistical literature on evidence and Bayes factor approximation, incl. Chen, Shao and Ibrahim (2000). Which covers earlier developments like bridge sampling (Gelman and Meng, 1998).

As often the case in astrophysics, at least since 2007, the authors’ description of nested sampling drifts away from perceiving it as a regular Monte Carlo technique, with the same convergence speed n1/2 as other Monte Carlo techniques and the same dependence on dimension. It is certainly not the only simulation method where the produced “samples, as well as contributing to the evidence integral, can also be used as posterior samples.” The authors then move to “population Monte Carlo [which] is an adaptive form of importance sampling designed to give a good estimate of the evidence”, a particularly restrictive description of a generic adaptive importance sampling method (Cappé et al., 2004). The approximation of the evidence (9) based on PMC also seems invalid:

$E \approx \frac{1}{n} \sum_{i=1}^n \dfrac{L(\theta_i)}{q(\theta_i)}\, ,$

is missing the prior in the numerator. (The switch from θ in Section 3.1 to X in Section 3.4 is  confusing.) Further, the sentence “PMC gives an unbiased estimator of the evidence in a very small number of such iterations” is misleading in that PMC is unbiased at each iteration. Reversible jump is not described at all (the supposedly higher efficiency of this algorithm is far from guaranteed when facing a small number of models, which is the case here, since the moves between models are governed by a random walk and the acceptance probabilities can be quite low).

The second quite unrelated part of the paper covers published applications in astrophysics. Unrelated because the three different methods exposed in the first part are not compared on the same dataset. Model averaging is obviously based on a computational device that explores the posteriors of the different models under comparison (or, rather, averaging), however no recommendation is found in the paper as to efficiently implement the averaging or anything of the kind. In conclusion, I thus find this review somehow anticlimactic.

## extending ABC to high dimensions via Gaussian copula

Posted in Books, pictures, Statistics, Travel, Uncategorized, University life with tags , , , on April 28, 2015 by xi'an

Li, Nott, Fan, and Sisson arXived last week a new paper on ABC methodology that I read on my way to Warwick this morning. The central idea in the paper is (i) to estimate marginal posterior densities for the components of the model parameter by non-parametric means; and (ii) to consider all pairs of components to deduce the correlation matrix R of the Gaussian (pdf) transform of the pairwise rank statistic. From those two low-dimensional estimates, the authors derive a joint Gaussian-copula distribution by using inverse  pdf transforms and the correlation matrix R, to end up with a meta-Gaussian representation

$f(\theta)=\dfrac{1}{|R|^{1/2}}\exp\{\eta^\prime(I-R^{-1})\eta/2\}\prod_{i=1}^p g_i(\theta_i)$

where the η’s are the Gaussian transforms of the inverse-cdf transforms of the θ’s,that is,

$\eta_i=\Phi^{-1}(G_i(\theta_i))$

Or rather

$\eta_i=\Phi^{-1}(\hat{G}_i(\theta_i))$

given that the g’s are estimated.

This is obviously an approximation of the joint in that, even in the most favourable case when the g’s are perfectly estimated, and thus the components perfectly Gaussian, the joint is not necessarily Gaussian… But it sounds quite interesting, provided the cost of running all those transforms is not overwhelming. For instance, if the g’s are kernel density estimators, they involve sums of possibly a large number of terms.

One thing that bothers me in the approach, albeit mostly at a conceptual level for I realise the practical appeal is the use of different summary statistics for approximating different uni- and bi-dimensional marginals. This makes for an incoherent joint distribution, again at a conceptual level as I do not see immediate practical consequences… Those local summaries also have to be identified, component by component, which adds another level of computational cost to the approach, even when using a semi-automatic approach as in Fernhead and Prangle (2012). Although the whole algorithm relies on a single reference table.

The examples in the paper are (i) the banana shaped “Gaussian” distribution of Haario et al. (1999) that we used in our PMC papers, with a twist; and (ii) a g-and-k quantile distribution. The twist in the banana (!) is that the banana distribution is the prior associated with the mean of a Gaussian observation. In that case, the meta-Gaussian representation seems to hold almost perfectly, even in p=50 dimensions. (If I remember correctly, the hard part in analysing the banana distribution was reaching the tails, which are extremely elongated in at least one direction.) For the g-and-k quantile distribution, the same holds, even for a regular ABC. What seems to be of further interest would be to exhibit examples where the meta-Gaussian is clearly an approximation. If such cases exist.

## optimal mixture weights in multiple importance sampling

Posted in Statistics, University life with tags , , , , , , on December 12, 2014 by xi'an

Multiple importance sampling is back!!! I am always interested in this improvement upon regular importance sampling, even or especially after publishing a recent paper about our AMIS (for adaptive multiple importance sampling) algorithm, so I was quite eager to see what was in Hera He’s and Art Owen’s newly arXived paper. The paper is definitely exciting and set me on a new set of importance sampling improvements and experiments…

Some of the most interesting developments in the paper are that, (i) when using a collection of importance functions qi with the same target p, every ratio qi/p is a control variate function with expectation 1 [assuming each of the qi‘s has a support smaller than the support of p]; (ii) the weights of a mixture of the qi‘s can be chosen in an optimal way towards minimising the variance for a certain integrand; (iii) multiple importance sampling incorporates quite naturally stratified sampling, i.e. the qi‘s may have disjoint supports; )iv) control variates contribute little, esp. when compared with the optimisation over the weights [which does not surprise me that much, given that the control variates have little correlation with the integrands]; (v) Veach’s (1997) seminal PhD thesis remains a driving force behind those results [and in getting Eric Veach an Academy Oscar in 2014!].

One extension that I would find of the uttermost interest deals with unscaled densities, both for p and the qi‘s. In that case, the weights do not even sum up to a know value and I wonder at how much more difficult it is to analyse this realistic case. And unscaled densities led me to imagine using geometric mixtures instead. Or even harmonic mixtures! (Maybe not.)

Another one is more in tune with our adaptive multiple mixture paper. The paper works with regret, but one could also work with remorse! Besides the pun, this means that one could adapt the weights along iterations and even possible design new importance functions from the past outcome, i.e., be adaptive once again. He and Owen suggest mixing their approach with our adaptive sequential Monte Carlo model.

## efficient exploration of multi-modal posterior distributions

Posted in Books, Statistics, University life with tags , , , , on September 1, 2014 by xi'an

The title of this recent arXival had potential appeal, however the proposal ends up being rather straightforward and hence  anti-climactic! The paper by Hu, Hendry and Heng proposes to run a mixture of proposals centred at the various modes of  the target for an efficient exploration. This is a correct MCMC algorithm, granted!, but the requirement to know beforehand all the modes to be explored is self-defeating, since the major issue with MCMC is about modes that are  omitted from the exploration and remain undetected throughout the simulation… As provided, this is a standard MCMC algorithm with no adaptive feature and I would rather suggest our population Monte Carlo version, given the available information. Another connection with population Monte Carlo is that I think the performances would improve by Rao-Blackwellising the acceptance rate, i.e. removing the conditioning on the (ancillary) component of the index. For PMC we proved that using the mixture proposal in the ratio led to an ideally minimal variance estimate and I do not see why randomising the acceptance ratio in the current case would bring any improvement.

## PMC for combinatoric spaces

Posted in Statistics, University life with tags , , , , , , , on July 28, 2014 by xi'an

I received this interesting [edited] email from Xiannian Fan at CUNY:

I am trying to use PMC to solve Bayesian network structure learning problem (which is in a combinatorial space, not continuous space).

In PMC, the proposal distributions qi,t can be very flexible, even specific to each iteration and each instance. My problem occurs due to the combinatorial space.

For importance sampling, the requirement for proposal distribution, q, is:

support (p) ⊂ support (q)             (*)

For PMC, what is the support of the proposal distribution in iteration t? is it

support (p) ⊂ U support(qi,t)    (**)

or does (*) apply to every qi,t?

For continuous problem, this is not a big issue. We can use random walk of Normal distribution to do local move satisfying (*). But for combination search, local moving only result in finite states choice, just not satisfying (*). For example for a permutation (1,3,2,4), random swap has only choose(4,2)=6 neighbor states.

Fairly interesting question about population Monte Carlo (PMC), a sequential version of importance sampling we worked on with French colleagues in the early 2000’s.  (The name population Monte Carlo comes from Iba, 2000.)  While MCMC samplers do not have to cover the whole support of p at each iteration, it is much harder for importance samplers as their core justification is to provide an unbiased estimator to for all integrals of interest. Thus, when using the PMC estimate,

1/n ∑i,t {p(xi,t)/qi,t(xi,t)}h(qi,t),  xi,t~qi,t(x)

this estimator is only unbiased when the supports of the qi,t “s are all containing the support of p. The only other cases I can think of are

1. associating the qi,t “s with a partition Si,t of the support of p and using instead

i,t {p(xi,t)/qi,t(xi,t)}h(qi,t), xi,t~qi,t(x)

2. resorting to AMIS under the assumption (**) and using instead

1/n ∑i,t {p(xi,t)/∑j,t qj,t(xi,t)}h(qi,t), xi,t~qi,t(x)

but I am open to further suggestions!

## ABC for design

Posted in Statistics with tags , , , , , , , on August 30, 2013 by xi'an

I wrote a comment on this arXived paper on simulation based design that starts from Müller (1999) and gets an ABC perspective a while ago on my iPad when travelling to Montpellier and then forgot to download it…

Hainy, [Wener] Müller, and Wagner recently arXived a paper called “Likelihood-free Simulation-based Optimal Design“, paper which relies on ABC to construct optimal designs . Remember that [Peter] Müller (1999) uses a natural simulated annealing that is quite similar to our MAP [SAME] algorithm with Arnaud Doucet and Simon Godsill, relying on multiple versions of the data set to get to the maximum. The paper also builds upon our 2006 JASA paper with my then PhD student Billy Amzal, Eric Parent, and Frederic Bois, paper that took advantage of the then emerging particle methods to improve upon a static horizon target. While our method is sequential in that it pursues a moving target, it does not rely on the generic methodology developed by del Moral et al. (2006), where a backward kernel brings more stability to the moves. The paper also implements a version of our population Monte Carlo ABC algorithm (Beaumont et al., 2009), as a first step before an MCMC simulation. Overall, the paper sounds more like a review than like a strongly directive entry into ABC based design in that it remains quite generic. Not that I have specific suggestions, mind!, but I fear a realistic implementation (as opposed to the linear model used in the paper) would require a certain amount of calibration. There are missing references of recent papers using ABC for design, including some by Michael Stumpf I think.

I did not know about the Kuck et al. reference… Which is reproducing our 2006 approach within the del Moral framework. It uses a continuous temperature scale that I find artificial and not that useful, again a maybe superficial comment as I didn’t get very much into the paper … Just that integer powers lead to multiples of the sample and have a nice algorithmic counterpart.