Parallel processing of independent Metropolis-Hastings algorithms

With Pierre Jacob, my PhD student, and Murray Smith, from the National Institute of Water and Atmospheric Research, Wellington, who actually started us on this project at the last (and latest) Valencia meeting, we have completed a paper on using parallel computing in independent Metropolis-Hastings algorithms. The paper is arXived and the abstract goes as follows:

In this paper, we consider the implications of the fact that parallel raw-power can be exploited by a generic Metropolis–Hastings algorithm if the proposed values are independent. In particular, we present improvements to the independent Metropolis–Hastings algorithm that significantly decrease the variance of any estimator derived from the MCMC output, at a null computing cost, since those improvements are based on a fixed number of target density evaluations. Furthermore, the techniques developed in this paper do not jeopardize the Markovian convergence properties of the algorithm, since they are based on the Rao–Blackwell principles of Gelfand and Smith (1990), already exploited in Casella and Robert (1996), Atchadé and Perron (2005) and Douc and Robert (2010). We illustrate these improvements both on a toy normal example and on a classical probit regression model, but insist on the fact that they are universally applicable.

I am quite excited about the results in this paper, which took advantage of (a) older works of mine on Rao-Blackwellisation, (b) Murray’s interests in costly likelihoods, and (c) our mutual excitement when hearing about GPU parallel possibilities from Chris Holmes’ talk in Valencia. (As well as directions drafted in an exciting session in Vancouver!) The (free) gains over standard independent Metropolis-Hastings estimates are equivalent to importance sampling gains, while keeping the Markov structure of the original chain. Given that 100 or more parallel threads can be harnessed on current GPU cards, this is clearly a field with much potential! The graph below gives the variance improvements brought by three Rao-Blackwell estimates taking advantage of parallelisation over the initial MCMC estimate (first entry), along with the importance sampling estimate (last entry), using only 10 parallel threads.
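To give a rough flavour of the scheme (this is only an illustrative sketch with a toy target and names of my own, not the code or the exact algorithm of the paper), the snippet below draws a single sequence of independent proposals, hence a single set of costly target evaluations, and then runs p "replays" of the independent Metropolis-Hastings chain over that same sequence with different uniforms before averaging the resulting estimates:

```python
# Illustrative sketch only: toy target and proposal, invented function names.
# The proposals (and the costly target evaluations) are computed once; the p
# replays with fresh uniforms reuse them for free.
import numpy as np

rng = np.random.default_rng(0)

def log_target(x):
    # toy N(0, 1) target, standing in for an expensive posterior
    return -0.5 * x**2

def log_proposal(x):
    # independent N(0, 2^2) proposal, up to an additive constant
    return -0.5 * (x / 2.0)**2

def imh_replay(proposals, log_w, x0, uniforms):
    """Run one independent MH pass over a fixed sequence of proposals."""
    x, lw_x = x0, log_target(x0) - log_proposal(x0)
    chain = np.empty(len(proposals))
    for t, (y, lw_y, u) in enumerate(zip(proposals, log_w, uniforms)):
        if np.log(u) < lw_y - lw_x:        # accept with prob min(1, w(y)/w(x))
            x, lw_x = y, lw_y
        chain[t] = x
    return chain

T, p = 1000, 10
proposals = rng.normal(0.0, 2.0, size=T)                  # drawn once
log_w = log_target(proposals) - log_proposal(proposals)   # costly part, computed once

# p replays over the same proposals: embarrassingly parallel and free in terms
# of target evaluations, since log_w is simply reused
estimates = [imh_replay(proposals, log_w, 0.0, rng.random(T)).mean()
             for _ in range(p)]
print("single IMH estimate   :", estimates[0])
print("average over p replays:", np.mean(estimates))
```

Each replay is a valid IMH chain in its own right, so averaging them keeps the Markovian structure while reducing the variance of the final estimate, and the replays can be dispatched to separate threads (or a GPU) at no extra target-evaluation cost.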

20 Responses to “Parallel processing of independent Metropolis-Hastings algorithms”

  1. […] seem to be all the rage these days. At the last Bayesian Valencia meeting, Chris Holmes gave a nice talk on how GPUs could […]

  2. […] of the conference… To wit, Pierre Jacob got one of the poster prizes for his poster on our parallelisation paper.) As I left early, I am sorry I missed Keith Baggerly’s report of his sleuth work on […]

  3. […] is the poster presented at MCMSki III next week by Pierre Jacob about our joint paper on […]

  4. […] Jacob and I got this email from a student about our parallel Rao-Blackwellisation paper. Here are some parts of the questions and our […]

  5. […] results in a most synthetic and intuitive manner… I also got new ideas about generalising our parallel computing paper for random walk Metropolis-Hastings algorithms and for optimising across permutation […]

  6. […] whereas the standard Metropolis-Hastings algorithm is essentially sequential. See as well Pierre, Christian and Murray Smith’s block Independent Metropolis-Hastings algorithm for further […]

  7. […] address not unique?! When trying to submit a paper yesterday, I got the following […]

  8. Nathan Urban Says:

    How might this algorithm compare to the adaptive importance sampling of Cappé et al. (2008)?

    • Nathan, there is no connection or comparison possible between the two papers. The current one is about improving MCMC, the earlier one about adaptive importance sampling…

    • Nathan Urban Says:

      Yes, I understand they are different algorithms. I mean “compare” in the sense of computational performance. My question is possibly naive, but: if one is interested in doing parallel Monte Carlo on a GPU, as you suggest, what guidance is there for a practitioner to decide between the different parallel algorithms in the literature (e.g., MCMC vs. adaptive importance sampling)? I am sure it is problem-dependent, but, for example, if one applied parallel MCMC to the cosmological parameter estimation problem of Wraith et al. (2009), might it be expected to be more or less efficient than the population Monte Carlo applied in that paper?

      • For the cosmological parameter estimation problem of Wraith et al. (2009), we could not use an MCMC algorithm as the likelihoods were awfully costly. We did run an adaptive importance sampling algorithm for this very reason on a cluster with 124 processors at the Institute of Astrophysics in Paris. Now, once a stable PMC proposal is obtained, it could be used as an independent proposal within an MCMC algorithm and the above parallel processing could then be applied. I am however doubtful as to the improvement…

    • Nathan, provided that the adaptation is learning about the posterior… adaptive approaches are likely to be more efficient than using a fixed starting proposal, as in IMH, when good information about the posterior is not available from the start. The ability to parallelise adaptive approaches is then a further computational benefit. As pointed out in the paper, there are problems where IMH can be used and, say, importance sampling cannot (e.g. for Metropolis updates within a Gibbs sampler). Adaptation can also be overused, and for many problems good information about the posterior is available or can be learned quickly at the start. The computational approach outlined in the paper is a way to make sampling from such fixed proposals more efficient in terms of time (even if the computer is doing much work! :-) ) and, given the benefits shown, should make the method more appealing in this sense. (Xian, I am just trying to help, feel free to delete or add as you wish.)

  9. Very interesting. I just quickly glanced at the paper so this probably is explained if I read carefully, but I am confused about whether the advantage comes from Step 5 or Step 9 in Algorithm 2. That is:

    Does BIMH yield one valid chain of the same length as IMH, with the same effective sample size and at the same computational cost, but with p parallel copies that shuffle the order and the uniforms and are free if we parallelize Step 9? These copies can then be used to integrate out some of the randomness via a Rao-Blackwellization-type scheme, and that is where the benefit comes from?

    Or is the point that I immediately get a linear speed-up of p in Step 5 from being able to parallelize the ratios \omega_t, and then get an additional benefit from Rao-Blackwellization, which I can obtain very quickly by doing Step 9 in parallel?

    • There is a gain in time in computing the p ratios in parallel, especially with costly likelihoods. In this paper, however, we study the additional and free gain obtained from using the p parallel “stories” over different uniform sequences and different orders. This is where the difference between IMH and BIMH resides…
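    To sketch this distinction roughly (the code, the toy target and the names below are purely illustrative and not the actual algorithm from the paper): the costly part, namely the p ratios of a block of proposals, is computed once and could itself be parallelised, while the extra free gain comes from averaging several passes over that block under shuffled orders and fresh uniforms, all reusing the stored ratios.

```python
# Illustrative sketch only (toy target, invented names, not the paper's code).
import numpy as np

rng = np.random.default_rng(1)

def log_target(x):
    return -0.5 * x**2            # toy N(0, 1) target

def log_proposal(x):
    return -0.5 * (x / 2.0)**2    # independent N(0, 2^2) proposal

def block_story(order, block, log_w, x, lw_x, uniforms):
    """One IMH pass over a permuted block, reusing the precomputed ratios."""
    out = np.empty(len(order))
    for t, i in enumerate(order):
        if np.log(uniforms[t]) < log_w[i] - lw_x:   # usual IMH acceptance
            x, lw_x = block[i], log_w[i]
        out[t] = x
    return out

p, n_stories = 10, 50
block = rng.normal(0.0, 2.0, size=p)               # p proposed values
log_w = log_target(block) - log_proposal(block)    # the only costly evaluations

x0, lw0 = 0.0, log_target(0.0) - log_proposal(0.0)
stories = [block_story(rng.permutation(p), block, log_w, x0, lw0, rng.random(p))
           for _ in range(n_stories)]              # free: no new target calls

print("single story estimate :", stories[0].mean())
print("average over stories  :", np.mean([s.mean() for s in stories]))
```

    The only calls to the target occur when log_w is filled, so all the stories reuse the same evaluations; the averaging over orders and uniforms is where the additional variance reduction comes from.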

  10. […] data(Pima.te) ?Pima.te Update: my PhD advisor Christian P. Robert blogs about the article as well, in more details. Leave a […]
