## Archive for acceleration of MCMC algorithms

## delayed but published!

Posted in Statistics with tags acceleration of MCMC algorithms, AIMS, delayed acceptance, Foundations of Data Science, Hastings-Metropolis sampler, Monte Carlo Statistical Methods, publication on June 20, 2019 by xi'an## accelerating MCMC

Posted in Books, Statistics, University life with tags acceleration of MCMC algorithms, algorithms, arXiv, cross validated, MCMC, Monte Carlo Statistical Methods, referee, simulation, Telecom Lille, typology, Université Paris Dauphine, University of Warwick, WIREs on April 11, 2018 by xi'an**A**s forecasted a rather long while ago (!), I wrote a short and incomplete survey on some approaches to accelerating MCMC. With the massive help of Victor Elvira (Lille), Nick Tawn (Warwick) and Changye Wu (Dauphine). Survey which current version just got arXived and which has now been accepted by WIREs Computational Statistics. The typology (and even the range of methods) adopted here is certainly mostly arbitrary, with suggestions for different divisions made by a very involved and helpful reviewer. While we achieved a quick conclusion to the review process, suggestions and comments are most welcome! Even if we cannot include every possible suggestion, just like those already made on X validated. (WIREs stands for Wiley Interdisciplinary Reviews and its dozen topics cover several fields, from computational stats to biology, to medicine, to engineering.)

## accelerating MCMC

Posted in Statistics with tags acceleration of MCMC algorithms, coupling, Hamiltonian Monte Carlo, India, MCMC, Monte Carlo Statistical Methods, motorbike, NUTS, Rajasthan, review, survey, tempering, WIREs on May 29, 2017 by xi'an**I** have recently [well, not so recently!] been asked to write a review paper on ways of accelerating MCMC algorithms for the [review] journal WIREs Computational Statistics and would welcome all suggestions towards the goal of accelerating MCMC algorithms. Besides [and including more on]

- coupling strategies using different kernels and switching between them;
- tempering strategies using flatter or lower dimensional targets as intermediary steps, e.g., à la Neal;
- sequential Monte Carlo with particle systems targeting again flatter or lower dimensional targets and adapting proposals to this effect;
- Hamiltonian MCMC, again with connections to Radford (and more generally ways of avoiding rejections);
- adaptive MCMC, obviously;
- Rao-Blackwellisation, just as obviously (in the sense that increasing the precision in the resulting estimates means less simulations).

## ISBA 2016 [#5]

Posted in Mountains, pictures, Running, Statistics, Travel with tags acceleration of MCMC algorithms, Bayesian foundations, Bayesian lasso, change-point, fiducial distribution, maquis, mare e monti, modularisation, Monte Carlo Statistical Methods, regularisation, Santa Margherita di Pula, Sardinia, The Likelihood Principle, thorn on June 18, 2016 by xi'anOn Thursday, I started the day by a rather masochist run to the nearby hills, not only because of the very hour but also because, by following rabbit trails that were not intended for my size, I ended up being scratched by thorns and bramble all over!, but also with neat views of the coast around Pula. From there, it was all downhill [joke]. The first morning talk I attended was by Paul Fearnhead and about efficient change point estimation (which is an NP hard problem or close to). The method relies on dynamic programming [which reminded me of one of my earliest Pascal codes about optimising a dam debit]. From my spectator’s perspective, I wonder[ed] at easier models, from Lasso optimisation to spline modelling followed by testing equality between bits. Later that morning, James Scott delivered the first Bayarri Lecture, created in honour of our friend Susie who passed away between the previous ISBA meeting and this one. James gave an impressive coverage of regularisation through three complex models, with the [hopefully not degraded by my translation] message that we should [as Bayesians] focus on important parts of those models and use non-Bayesian tools like regularisation. I can understand the practical constraints for doing so, but optimisation leads us away from a Bayesian handling of inference problems, by removing the ascertainment of uncertainty…

Later in the afternoon, I took part in the Bayesian foundations session, discussing the shortcomings of the Bayes factor and suggesting the use of mixtures instead. With rebuttals from [friends in] the audience!

This session also included a talk by Victor Peña and Jim Berger analysing and answering the recent criticisms of the Likelihood principle. I am not sure this answer will convince the critics, but I won’t comment further as I now see the debate as resulting from a vague notion of inference in Birnbaum‘s expression of the principle. Jan Hannig gave another foundation talk introducing fiducial distributions (a.k.a., Fisher’s Bayesian mimicry) but failing to provide a foundational argument for replacing Bayesian modelling. (Obviously, I am definitely prejudiced in this regard.)

The last session of the day was sponsored by BayesComp and saw talks by Natesh Pillai, Pierre Jacob, and Eric Xing. Natesh talked about his paper on accelerated MCMC recently published in JASA. Which surprisingly did not get discussed here, but would definitely deserve to be! As hopefully corrected within a few days, when I recoved from conference burnout!!! Pierre Jacob presented a work we are currently completing with Chris Holmes and Lawrence Murray on modularisation, inspired from the cut problem (as exposed by Plummer at MCMski IV in Chamonix). And Eric Xing spoke about embarrassingly parallel solutions, discussed a while ago here.

## self-healing umbrella sampling

Posted in Kids, pictures, Statistics, University life with tags acceleration of MCMC algorithms, adaptive MCMC methods, Monte Carlo experiment, multimodality, Tintin, umbrella sampling, Wang-Landau algorithm, well-tempered algorithm on November 5, 2014 by xi'an**T**en days ago, Gersende Fort, Benjamin Jourdain, Tony Lelièvre, and Gabriel Stoltz arXived a study about an adaptive umbrella sampler that can be re-interpreted as a Wang-Landau algorithm, if not the most efficient version of the latter. This reminded me very much of the workshop we had all together in Edinburgh last June. And even more of the focus of the molecular dynamics talks in this same ICMS workshop about accelerating the MCMC exploration of multimodal targets. The self-healing aspect of the sampler is to adapt to the multimodal structure thanks to a partition that defines a biased sampling scheme spending time in each set of the partition in a frequency proportional to weights. While the optimal weights are the weights of the sets against the target distribution (are they truly optimal?! I would have thought lifting low density regions, i.e., marshes, could improve the mixing of the chain for a given proposal), those are unknown and they need to be estimated by an adaptive scheme that makes staying in a given set the less desirable the more one has visited it. By increasing the inverse weight of a given set by a factor each time it is visited. Which sounds indeed like Wang-Landau. The plus side of the self-healing umbrella sampler is that it only depends on a scale γ (and on the partition). Besides converging to the right weights of course. The downside is that it does not reach the most efficient convergence, since the adaptivity weight decreases in 1/n rather than 1/√n.

Note that the paper contains a massive experimental side where the authors checked the impact of various parameters by Monte Carlo studies of estimators involving more than a billion iterations. Apparently repeated a large number of times.

The next step in adaptivity should be about the adaptive determination of the partition, hoping for a robustness against the dimension of the space. Which may be unreachable if I judge by the apparent deceleration of the method when the number of terms in the partition increases.

## delayed acceptance [alternative]

Posted in Books, Kids, Statistics, University life with tags acceleration of MCMC algorithms, delayed acceptance, detailed balance, MCMC, Monte Carlo Statistical Methods, reversibility, simulation on October 22, 2014 by xi'an**I**n a comment on our *Accelerating Metropolis-Hastings algorithms: Delayed acceptance with prefetching* paper, Philip commented that he had experimented with an alternative splitting technique retaining the right stationary measure: the idea behind his alternative acceleration is again (a) to divide the target into bits and (b) run the acceptance step by parts, towards a major reduction in computing time. The difference with our approach is to represent the overall acceptance probability

and, even more surprisingly than in our case, this representation remains associated with the right (posterior) target!!! Provided the ordering of the terms is random with a symmetric distribution on the permutation. This property can be directly checked via the detailed balance condition*.*

In a toy example, I compared the acceptance rates (*acrat*) for our delayed solution (*letabin.R*), for this alternative (*letamin.R*), and for a non-delayed reference (*letabaz.R*), when considering more and more fractured decompositions of a Bernoulli likelihood.

> system.time(source("letabin.R")) user system elapsed 225.918 0.444 227.200 > acrat [1] 0.3195 0.2424 0.2154 0.1917 0.1305 0.0958 > system.time(source("letamin.R")) user system elapsed 340.677 0.512 345.389 > acrat [1] 0.4045 0.4138 0.4194 0.4003 0.3998 0.4145 > system.time(source("letabaz.R")) user system elapsed 49.271 0.080 49.862 > acrat [1] 0.6078 0.6068 0.6103 0.6086 0.6040 0.6158

A very interesting outcome since the acceptance rate does not change with the number of terms in the decomposition for the alternative delayed acceptance method… Even though it logically takes longer than our solution. However, the drawback is that detailed balance implies picking the order at random, hence loosing on the gain in computing the cheap terms first. If reversibility could be bypassed, then this alternative would definitely get very appealing!

## thermodynamic Monte Carlo

Posted in Books, Statistics, University life with tags acceleration of MCMC algorithms, differential geometry, Hamiltonian Monte Carlo, Riemann manifold, University of Warwick on June 27, 2014 by xi'an**M**ichael Betancourt, my colleague from Warwick, arXived a month ago a paper about a differential geometry approach to relaxation. *(In the Monte Carlo rather than the siesta sense of the term relaxation!)* He is considering the best way to link a simple base measure ϖ to a measure of interest π by the sequence

where Z(β) is the normalising constant (or *partition function* in the thermodynamic translation). Most methods are highly dependent on how the sequence of β’s is chosen. A first nice result (for me) is that the Kullback-Leibler distance and the partition function are strongly related in that

which means that the variation in the normalising constant is driving the variation in the Kullback-Leibler distance. The next section goes into differential geometry and the remains from my Master course in differential geometry alas are much too scattered for me to even remember some notions like that of a *bundle*… So, like Andrew, I have trouble making sense of the resulting algorithm, which updates the temperature β along with the position and speed. (It sounds like an extra and corresponding energy term is added to the original Hamiltonian function.) Even the Beta-Binomial

example is somewhat too involved for me. So I tried to write down the algorithm step by step in this special case. Which led to

- update β into β-εδp’²
- update p into p-εδp’
- update p’ into p’+ε{(1-a)/p+(b-1)/(1-p)}
- compute the average log-likelihood, λ* under the tempered version of the target (at temperature β)
- update p’ into p’+2εβ{(1-a)/p+(b-1)/(1-p)}-ε[λ-λ*]p’
- update p’ into p’+ε{(1-a)/p+(b-1)/(1-p)}
- update β into β-εδp’²
- update p into p-εδp’

where p’ denotes the momentum auxiliary variable associated with the kinetic energy. And λ is the current log-likelihood. (The parameter ε was equal to 0.005 and I could not find the value of δ.) The only costly step in the above list is the approximation of the log-likelihood average λ*. The above details make the algorithm quite clear but I am still missing the intuition behind…