Archive for pseudo-marginal MCMC

unbiased product of expectations

Posted in Books, Statistics, University life with tags , , , , , , , , on August 5, 2019 by xi'an

m_biomet_106_2coverWhile I was not involved in any way, or even aware of this research, Anthony Lee, Simone Tiberi, and Giacomo Zanella have an incoming paper in Biometrika, and which was partly written while all three authors were at the University of Warwick. The purpose is to design an efficient manner to approximate the product of n unidimensional expectations (or integrals) all computed against the same reference density. Which is not a real constraint. A neat remark that motivates the method in the paper is that an improved estimator can be connected with the permanent of the n x N matrix A made of the values of the n functions computed at N different simulations from the reference density. And involves N!/ (N-n)! terms rather than N to the power n. Since it is NP-hard to compute, a manageable alternative uses random draws from constrained permutations that are reasonably easy to simulate. Especially since, given that the estimator recycles most of the particles, it requires a much smaller version of N. Essentially N=O(n) with this scenario, instead of O(n²) with the basic Monte Carlo solution, towards a similar variance.

This framework offers many applications in latent variable models, including pseudo-marginal MCMC, of course, but also for ABC since the ABC posterior based on getting each simulated observation close enough from the corresponding actual observation fits this pattern (albeit the dependence on the chosen ordering of the data is an issue that can make the example somewhat artificial).

likelihood free nested sampling

Posted in Books, Statistics with tags , , , , , , , , , , , on April 26, 2019 by xi'an

A recent paper by Mikelson and Khammash found on bioRxiv considers the (paradoxical?) mixture of nested sampling and intractable likelihood. They however cover only the case when a particle filter or another unbiased estimator of the likelihood function can be found. Unless I am missing something in the paper, this seems a very costly and convoluted approach when pseudo-marginal MCMC is available. Or the rather substantial literature on computational approaches to state-space models. Furthermore simulating under the lower likelihood constraint gets even more intricate than for standard nested sampling as the parameter space is augmented with the likelihood estimator as an extra variable. And this makes a constrained simulation the harder, to the point that the paper need resort to a Dirichlet process Gaussian mixture approximation of the constrained density. It thus sounds quite an intricate approach to the problem. (For one of the realistic examples, the authors mention a 12 hour computation on a 48 core cluster. Producing an approximation of the evidence that is not unarguably stabilised, contrary to the above.) Once again, not being completely up-to-date in sequential Monte Carlo, I may miss a difficulty in analysing such models with other methods, but the proposal seems to be highly demanding with respect to the target.

bandits for doubly intractable posteriors

Posted in Statistics with tags , , , , , , , , on April 17, 2019 by xi'an

Last Friday, Guanyang Wang arXived a paper on the use of multi-armed bandits (hence the reference to the three bandits) to handle intractable normalising constants. The bandit compares or mixes Møller et al. (2006) auxiliary variable solution with Murray et al. (2006) exchange algorithm. Which are both special cases of pseudo-marginal MCMC algorithms. In both cases, the auxiliary variables produce an unbiased estimator of the ratio of the constants. Rather than the ratio of two unbiased estimators as in the more standard pseudo-marginal MCMC. The current paper tries to compare the two approaches based on the variance of the ratio estimate, but cannot derive a general ordering. The multi-armed bandit algorithm exploits both estimators of the acceptance ratio to pick the one that is almost the largest, almost because there is a correction for validating the step by detailed balance. The bandit acceptance probability is the maximum [over the methods] of the minimum [over the time directions] of the original acceptance ratio. While this appears to be valid, note that the resulting algorithm implies four times as many auxiliary variates as the original ones, which makes me wonder at the gain when compared with a parallel implementation of these methods, coupled at random times. (The fundamental difficulty of simulating from likelihoods with an unknown normalising constant remains, see p.4.)

asymptotics of synthetic likelihood [a reply from the authors]

Posted in Books, Statistics, University life with tags , , , , , , , , , , , , on March 19, 2019 by xi'an

[Here is a reply from David, Chris, and Robert on my earlier comments, highlighting some points I had missed or misunderstood.]

Dear Christian

Thanks for your interest in our synthetic likelihood paper and the thoughtful comments you wrote about it on your blog.  We’d like to respond to the comments to avoid some misconceptions.

Your first claim is that we don’t account for the differing number of simulation draws required for each parameter proposal in ABC and synthetic likelihood.  This doesn’t seem correct, see the discussion below Lemma 4 at the bottom of page 12.  The comparison between methods is on the basis of effective sample size per model simulation.

As you say, in the comparison of ABC and synthetic likelihood, we consider the ABC tolerance \epsilon and the number of simulations per likelihood estimate M in synthetic likelihood as functions of n.  Then for tuning parameter choices that result in the same uncertainty quantification asymptotically (and the same asymptotically as the true posterior given the summary statistic) we can look at the effective sample size per model simulation.  Your objection here seems to be that even though uncertainty quantification is similar for large n, for a finite n the uncertainty quantification may differ.  This is true, but similar arguments can be directed at almost any asymptotic analysis, so this doesn’t seem a serious objection to us at least.  We don’t find it surprising that the strong synthetic likelihood assumptions, when accurate, give you something extra in terms of computational efficiency.

We think mixing up the synthetic likelihood/ABC comparison with the comparison between correctly specified and misspecified covariance in Bayesian synthetic likelihood is a bit unfortunate, since these situations are quite different.  The first involves correct uncertainty quantification asymptotically for both methods.  Only a very committed reader who looked at our paper in detail would understand what you say here.  The question we are asking with the misspecified covariance is the following.  If the usual Bayesian synthetic likelihood analysis is too much for our computational budget, can something still be done to quantify uncertainty?  We think the answer is yes, and with the misspecified covariance we can reduce the computational requirements by an order of magnitude, but with an appropriate cost statistically speaking.  The analyses with misspecified covariance give valid frequentist confidence regions asymptotically, so this may still be useful if it is all that can be done.  The examples as you say show something of the nature of the trade-off involved.

We aren’t quite sure what you mean when you are puzzled about why we can avoid having M to be O(√n).  Note that because of the way the summary statistics satisfy a central limit theorem, elements of the covariance matrix of S are already O(1/n), and so, for example, in estimating μ(θ) as an average of M simulations for S, the elements of the covariance matrix of the estimator of μ(θ) are O(1/(Mn)).  Similar remarks apply to estimation of Σ(θ).  I’m not sure whether that gets to the heart of what you are asking here or not.

In our email discussion you mention the fact that if M increases with n, then the computational burden of a single likelihood approximation and hence generating a single parameter sample also increases with n.  This is true, but unavoidable if you want exact uncertainty quantification asymptotically, and M can be allowed to increase with n at any rate.  With a fixed M there will be some approximation error, which is often small in practice.  The situation with vanilla ABC methods will be even worse, in terms of the number of proposals required to generate a single accepted sample, in the case where exact uncertainty quantification is desired asymptotically.  As shown in Li and Fearnhead (2018), if regression adjustment is used with ABC and you can find a good proposal in their sense, one can avoid this.  For vanilla ABC, if the focus is on point estimation and exact uncertainty quantification is not required, the situation is better.  Of course as you show in your nice ABC paper for misspecified models jointly with David Frazier and Juidth Rousseau recently the choice of whether to use regression adjustment can be subtle in the case of misspecification.

In our previous paper Price, Drovandi, Lee and Nott (2018) (which you also reviewed on this blog) we observed that if the summary statistics are exactly normal, then you can sample from the summary statistic posterior exactly with finite M in the synthetic likelihood by using pseudo-marginal ideas together with an unbiased estimate of a normal density due to Ghurye and Olkin (1962).  When S satisfies a central limit theorem so that S is increasingly close to normal as n gets large, we conjecture that it is possible to get exact uncertainty quantification asymptotically with fixed M if we use the Ghurye and Olkin estimator, but we have no proof of that yet (if it is true at all).

Thanks again for being interested enough in the paper to comment, much appreciated.

David, Chris, Robert.

easy-to-use empirical likelihood ABC

Posted in Statistics, University life with tags , , , , , , , on October 23, 2018 by xi'an

A newly arXived paper from a group of researchers at NUS I wish we had discussed when I was there last month. As we wrote this empirical ABCe paper in PNAS with Kerrie Mengersen and Pierre Pudlo in 2012. Plus the SAME paper with Arnaud Doucet and Simon Godsill ten years earlier, which the authors prefer to call data cloning in continuation of the more recent Lele et al. (2007). They could actually have used my original denomination of prior feedback (1992? I remember presenting the idea at Camp Casella in Cornell that summer) as well! Actually, I am not certain invoking prior feedback is quite necessary since this is a form of simulated method of moments as well.

Now, did we really assume that some moments of the distribution were analytically available, although the likelihood was not?! Even before going through the paper, it dawned on me that these theoretical moments could have been simulated instead, since the model is a generative one: for a given parameter value, a direct Monte Carlo approximation to the exact moment can be produced and can serve as a constraint for the empirical likelihood definition. I am surprised and aggrieved that we would not think of this empirical likelihood version of a method of moments. Which is central to the current paper. In the sense that, were the parameter exact, the differences between the moments based on the actual data x⁰ and the moments based on m replicas of the simulated data x¹,x²,… have mean zero, meaning the moment constraint is immediately available. Meaning an empirical likelihood is easily constructed, replacing the actual likelihood in an MCMC scheme, albeit at a rather high computing cost. Congratulations to the authors for uncovering this possibility that we missed!

“The summary statistics in this example were judiciously chosen.”

One point in the paper on which I disagree with the authors is the argument that MCMC sampling based on an empirical likelihood can be seen as an implementation of the pseudo-marginal Metropolis-Hastings method. The major difference in my opinion is that there is no unbiasedness here (and no generic result that indicates convergence to the exact posterior as the number of simulations grows to infinity). The other point unclear to me is about the selection of summaries [or moments] for implementing the method, which seems to be based on their performances in the subsequent estimation, performances that are hard to assess properly in intractable likelihood cases. In the last example of stereological extremes (not covered in our paper), for instance, the output is compared with the parallel synthetic likelihood result.

IMS workshop [day 5]

Posted in Books, pictures, Statistics, Travel with tags , , , , , , , , on September 3, 2018 by xi'an

The last day of the starting workshop [and my last day in Singapore] was a day of importance [sampling] with talks by Matti Vihola opposing importance sampling and delayed acceptance and particle MCMC, related to several papers of his that I missed. To be continued in the coming weeks at the IMS, which is another reason to regret having to leave that early [as my Parisian semester starts this Monday with an undergrad class at 8:30!]

And then a talk by Joaquín Miguez on stabilizing importance sampling by truncation which reminded me very much of the later work by Andrew Gelman and Aki Vehtari on Pareto smoothed importance sampling, with further operators adapted to sequential settings and the similar drawback that when the importance sampler is poor, i.e., when the simulated points are all very far from the centre of mass, no amount of fudging with the weights will bring the points closer. AMIS made an appearance as a reference method, to be improved by this truncation of the weights, a wee bit surprising as it should bring the large weights of the earlier stages down.

Followed by an almost silent talk by Nick Whiteley, who having lost his voice to the air conditioning whispered his talk in the microphone. Having once faced a lost voice during an introductory lecture to a large undergraduate audience, I could not but completely commiserate for the hardship of the task. Although this made the audience most silent and attentive. His topic was the Viterbi process and its parallelisation, by using a truncated horizon (presenting connection with overdamped Langevin, eg Durmus and Moulines and Dalalyan).

And due to a pressing appointment with my son and his girlfriend [who were traveling through Singapore on that day] for a chili crab dinner on my way to the airport, I missed the final talk by Arnaud Doucet, where he was to reconsider PDMP algorithms without the continuous time layer, a perspective I find most appealing!

Overall, this was a quite diverse and rich [starting] seminar, backed by the superb organisation of the IMS and the smooth living conditions on the NUS campus [once I had mastered the bus routes], which would have made much more sense for me as part of a longer stay, which is actually what happened the previous time I visited the IMS (in 2005), again clashing with my course schedule at home… And as always, I am impressed with the city-state of Singapore, for the highly diverse food scene in particular, but also this [maybe illusory] impression of coexistence between communities. And even though the ecological footprint could certainly be decreased, measures to curb car ownership (with a 150% purchase tax) and use (with congestion charges).

IMS workshop [day 4]

Posted in pictures, Statistics, Travel, University life with tags , , , , , , , , , on August 31, 2018 by xi'an

While I did not repeat the mistake of yesterday morning, just as well because the sun was unbearably strong!, I managed this time to board a bus headed in the wrong direction and as a result went through several remote NUS campi! Missing the first talk of the day as a result. By Youssef Marzouk, with a connection between sequential Monte Carlo and optimal transport. Transport for sampling, that is. The following talk by Tiangang Cui was however related, with Marzouk a co-author, as it aimed at finding linear transforms towards creating Normal approximations to the target to be used as proposals in Metropolis algorithms. Which may sound like something already tried a zillion times in the MCMC literature, except that the setting was rather specific to some inverse problems, imposing a generalised Normal structure on the transform, then optimised by transport arguments. It is unclear to me [from just attending the talk] how complex this derivation is and how dimension steps in, but the produced illustrations were quite robust to an increase in dimension.

The remaining talks for the day were mostly particular, from Anthony Lee introducing a new and almost costless way of producing variance estimates in particle filters, exploiting only the ancestry of particles, to Mike Pitt discussing the correlated pseudo-marginal algorithm developed with George Deligiannidis and Arnaud Doucet. Which somewhat paradoxically managed to fight the degeneracy [i.e., the need for a number of terms increasing like the time index T] found in independent pseudo-marginal resolutions, moving down to almost log(T)… With an interesting connection to the quasi SMC approach of Mathieu and Nicolas. And Sebastian Reich also stressed the links with optimal transport in a talk about data assimilation that was way beyond my reach. The day concluded with fireworks, through a magistral lecture by Professeur Del Moral on a continuous time version of PMCMC using the Feynman-Kac terminology. Pierre did a superb job during his lecture towards leading the whole room to the conclusion.