Archive for acceleration of MCMC algorithms

self-healing umbrella sampling

Posted in Kids, pictures, Statistics, University life with tags , , , , , , , on November 5, 2014 by xi'an

Ten days ago, Gersende Fort, Benjamin Jourdain, Tony Lelièvre, and Gabriel Stoltz arXived a study about an adaptive umbrella sampler that can be re-interpreted as a Wang-Landau algorithm, if not the most efficient version of the latter. This reminded me very much of the workshop we had all together in Edinburgh last June. And even more of the focus of the molecular dynamics talks in this same ICMS workshop about accelerating the MCMC exploration of multimodal targets. The self-healing aspect of the sampler is to adapt to the multimodal structure thanks to a partition that defines a biased sampling scheme spending time in each set of the partition in a frequency proportional to weights. While the optimal weights are the weights of the sets against the target distribution (are they truly optimal?! I would have thought lifting low density regions, i.e., marshes, could improve the mixing of the chain for a given proposal), those are unknown and they need to be estimated by an adaptive scheme that makes staying in a given set the less desirable the more one has visited it. By increasing the inverse weight of a given set by a factor each time it is visited. Which sounds indeed like Wang-Landau. The plus side of the self-healing umbrella sampler is that it only depends on a scale γ (and on the partition). Besides converging to the right weights of course. The downside is that it does not reach the most efficient convergence, since the adaptivity weight decreases in 1/n rather than 1/√n.

Note that the paper contains a massive experimental side where the authors checked the impact of various parameters by Monte Carlo studies of estimators involving more than a billion iterations. Apparently repeated a large number of times.

The next step in adaptivity should be about the adaptive determination of the partition, hoping for a robustness against the dimension of the space. Which may be unreachable if I judge by the apparent deceleration of the method when the number of terms in the partition increases.

delayed acceptance [alternative]

Posted in Books, Kids, Statistics, University life with tags , , , , , , on October 22, 2014 by xi'an

In a comment on our Accelerating Metropolis-Hastings algorithms: Delayed acceptance with prefetching paper, Philip commented that he had experimented with an alternative splitting technique retaining the right stationary measure: the idea behind his alternative acceleration is again (a) to divide the target into bits and (b) run the acceptance step by parts, towards a major reduction in computing time. The difference with our approach is to represent the  overall acceptance probability

\min_{k=0,..,d}\left\{\prod_{j=1}^k \rho_j(\eta,\theta),1\right\}

and, even more surprisingly than in our case, this representation remains associated with the right (posterior) target!!! Provided the ordering of the terms is random with a symmetric distribution on the permutation. This property can be directly checked via the detailed balance condition.

In a toy example, I compared the acceptance rates (acrat) for our delayed solution (letabin.R), for this alternative (letamin.R), and for a non-delayed reference (letabaz.R), when considering more and more fractured decompositions of a Bernoulli likelihood.

> system.time(source("letabin.R"))
user system elapsed
225.918 0.444 227.200
> acrat
[1] 0.3195 0.2424 0.2154 0.1917 0.1305 0.0958
> system.time(source("letamin.R"))
user system elapsed
340.677 0.512 345.389
> acrat
[1] 0.4045 0.4138 0.4194 0.4003 0.3998 0.4145
> system.time(source("letabaz.R"))
user system elapsed
49.271 0.080 49.862
> acrat
[1] 0.6078 0.6068 0.6103 0.6086 0.6040 0.6158

A very interesting outcome since the acceptance rate does not change with the number of terms in the decomposition for the alternative delayed acceptance method… Even though it logically takes longer than our solution. However, the drawback is that detailed balance implies picking the order at random, hence loosing on the gain in computing the cheap terms first. If reversibility could be bypassed, then this alternative would definitely get very appealing!

thermodynamic Monte Carlo

Posted in Books, Statistics, University life with tags , , , , on June 27, 2014 by xi'an

Department of Mathematics and Statistics, University of Warwick

Michael Betancourt, my colleague from Warwick, arXived a month ago a paper about a differential geometry approach to relaxation. (In the Monte Carlo rather than the siesta sense of the term relaxation!) He is considering the best way to link a simple base measure ϖ to a measure of interest π by the sequence

\pi_\beta(x) = \dfrac{e^{-\beta\Delta V(x)}\varpi(x)}{Z(\beta)}

where Z(β) is the normalising constant (or partition function in the  thermodynamic translation). Most methods are highly dependent on how the sequence of β’s is chosen. A first nice result (for me) is that the Kullback-Leibler distance and the partition function are strongly related in that

K(\pi_\beta,\pi_\eta) \approx (\eta-\beta) \dfrac{\text{d}Z}{\text{d}\beta}

which means that the variation in the normalising constant is driving the variation in the Kullback-Leibler distance. The next section goes into differential geometry and the remains from my Master course in differential geometry alas are much too scattered for me to even remember some notions like that of a bundle… So, like Andrew, I have trouble making sense of the resulting algorithm, which updates the temperature β along with the position and speed. (It sounds like an extra and corresponding energy term is added to the original Hamiltonian function.) Even the Beta-Binomial

k|p\sim\mathrm{B}(n,p)\,,\ p\sim\mathrm{Be}(a,b)

example is somewhat too involved for me.  So I tried to write down the algorithm step by step in this special case. Which led to

  1. update β into β-εδp’²
  2. update p into p-εδp’
  3. update p’ into p’+ε{(1-a)/p+(b-1)/(1-p)}
  4. compute the average log-likelihood, λ* under the tempered version of the target (at temperature β)
  5. update p’ into p’+2εβ{(1-a)/p+(b-1)/(1-p)}-ε[λ-λ*]p’
  6. update p’ into p’+ε{(1-a)/p+(b-1)/(1-p)}
  7. update β into β-εδp’²
  8. update p into p-εδp’

where p’ denotes the momentum auxiliary variable associated with the kinetic energy. And λ is the current log-likelihood. (The parameter ε was equal to 0.005 and I could not find the value of δ.) The only costly step in the above list is the approximation of the log-likelihood average λ*. The above details make the algorithm quite clear but I am still missing the intuition behind…

early rejection MCMC

Posted in Books, Statistics, University life with tags , , , , , , , , on June 16, 2014 by xi'an

In a (relatively) recent Bayesian Analysis paper on efficient MCMC algorithms for climate models, Antti Solonen, Pirkka Ollinaho, Marko Laine, Heikki Haario, Johanna Tamminen and Heikki Järvinen propose an early rejection scheme to speed up Metropolis-Hastings algorithms. The idea is to consider a posterior distribution (proportional to)

\pi(\theta|y)= \prod_{k=1}^nL_i(\theta|y)

such that all terms in the product are less than one and to compare the uniform u in the acceptance step of the Metropolis-Hastings algorithm to

L_1(\theta'|y)/\pi(\theta|y),

then, if u is smaller than the ratio, to

L_1(\theta'|y)L_2(\theta'|y)/\pi(\theta|y),

and so on, until the new value has been rejected or all terms have been evaluated. The scheme obviously stops earlier than the regular Metropolis-Hastings algorithm, at no significant extra cost when the product above does not factor through a sufficient statistic. Solonen et al.  suggest ordering the terms so that the computationally simpler ones are computed first. The upper bound assumption requires and is equivalent to finding the maximum on each term of the product, though, which may be costly in its own for non-standard distributions. With my students Marco Banterle and Clara Grazian, we actually came upon this paper when preparing our delayed acceptance paper as (a) it belongs to the same category of accelerated MCMC methods (delayed acceptance and early rejection are somehow synonymous!) and (b) it mentions the early prefetching papers of Brockwell (2005) and Strid (2009).

“The acceptance probability in ABC is commonly very low, and many proposals are rejected, and ER can potentially help to detect the rejections sooner.”

In the conclusion, Solonen et al. point out a possible link with ABC but, apart from the general idea of rejecting earlier by looking at a subsample or at a proxy simulation of a summary statistics, which is also the idea at the core of Dennis Prangle’s lazy ABC, there is no obvious impact on a likelihood-free method like ABC.

Asymptotically Exact, Embarrassingly Parallel MCMC

Posted in Books, Statistics, University life with tags , , , , , , on November 26, 2013 by xi'an

Departamento de Matemática, Universidad Complutense de Madrid, 11/11/11Willie Neiswanger, Chong Wang, and Eric Xing (from CMU) recently arXived a paper entitled as above. The “embarrassing” in the title refers to the “embarrassingly simple” solution proposed therein, namely to solve the difficulty in handling very large datasets by running completely independent parallel MCMC samplers on parallel threads or computers and using the outcomes of those samplers as density estimates, pulled together as a product towards an approximation of the true posterior density. In other words, the idea is to break the posterior as

p(\theta|x) = \prod_{i=1}^m p_i(\theta|x)

and to use the estimate

\hat p(\theta|x) = \prod_{i=1}^m \hat p_i(\theta|x)

where the individual estimates are obtained by, say, non-parametric estimates. The method is then “asymptotically exact” in the weak (and unsurprising) sense of being converging in the number of MCMC iterations. Still, there is a theoretical justification that is not found in previous parallel methods that mixed all resulting samples without accounting for the subsampling. And I also appreciate the point that, in many cases, running MCMC samplers with subsamples produces faster convergence.

In the paper, the division of p into its components is done by partitioning the iid data into m subsets. And taking a power 1/m of the prior in each case. (Which may induce improperness issues.) However, the subdivision is arbitrary and can thus be implemented in other cases than the fairly restrictive iid setting. Because each (subsample)  non-parametric estimate involves T terms, the resulting overall estimate contains Tm terms and the authors suggest using an independent Metropolis-within-Gibbs sampler to handle this complexity. Which is necessary [took me a while to realise this!] for producing a final sample from the (approximate) true posterior distribution. As an aside, I wonder why the bandwidths are all equal across all subsamples, as they should depend on those. And as it would not make much of a difference. It would also be interesting to build a typology of cases where subsampling leads to subposteriors that are close to orthogonal, preventing the implementation of the method.

As it happened, I read this paper on the very day Nial Friel (University College Dublin) gave a seminar at the Big’MC seminar on the convergence of approximations to ergodic transition kernels, based on the recent results of Mitrophanov on the stability of Markov chains, where he introduced the topic by the case of datasets large enough to prevent the computation of the likelihood function.

accelerated ABC

Posted in R, Statistics, Travel, University life with tags , , , , , on October 17, 2013 by xi'an

AF flight to Montpellier, Feb. 07, 2012On the flight back from Warwick, I read a fairly recently arXived paper by Umberto Picchini and Julie Forman entitled “Accelerating inference for diffusions observed with measurement error and large sample sizes using Approximate Bayesian Computation: A case study” that relates to earlier ABC works (and the MATLAB abc-sde package) by the first author (earlier works I missed). Among other things, the authors propose an acceleration device for ABC-MCMC: when simulating from the proposal, the Metropolis-Hastings acceptance probability can be computed and compared with a uniform rv prior to simulating pseudo-data. In case of rejection, the pseudo-data does not need to be simulated. In case of acceptance, it is compared with the observed data as usual. This is interesting for two reasons: first it always speeds up the algorithm. Second, it shows the strict limitations of ABC-MCMC, since the rejection takes place without incorporating the information contained in the data. (Even when the proposal incorporates this information, the comparison with the prior does not go this way.) This also relates to one of my open problems, namely how to simulate directly summary statistics without simulating the whole pseudo-dataset.

Another thing (related with acceleration) is that the authors use a simulated subsample rather than the simulated sample in order to gain time: this worries me somehow as the statistics corresponding to the observed data is based on the whole observed data. I thus wonder how both statistics could be compared, since they have different distributions and variabilities, even when using the same parameter value. Or is this a sort of pluggin/bootstrap principle, the true parameter being replaced with its estimator based on the whole data? Maybe this does not matter in the end (when compared with the several levels of approximation)…

Questions on the parallel Rao-Blackwellisation

Posted in R, Statistics, University life with tags , , , , , on December 22, 2010 by xi'an

Pierre Jacob and I got this email from a student about our parallel Rao-Blackwellisation paper. Here are some parts of the questions and our answer:

Although I understand how the strategy proposed in the paper helps in variance reduction, I do not understand why you set b=1 (mentioned in Section 3.2) and why it plays no role in the comparison. If b=1 and p=4 for example, every chains (out of the 4 chains of the block) has a length of 4 samples. And there are no additional blocks. How is this length enough for the sampler to converge and sample the distribution adequately? You mention that you do 10000 independent runs (Which means you do the above procedure 10000 times independently) but aren’t all these runs equally short and inadequate? Even for p=100, aren’t the chains too short? Continue reading

Follow

Get every new post delivered to your Inbox.

Join 717 other followers