turn-key and scalable synchronous distributed MCMC algorithm

Posted in Statistics, University life on April 29, 2022 by xi'an

Last week, I attended a Lagrange seminar where Vincent Plassier presented an ICML²¹ paper he had co-authored with Maxime Vono, Alain Durmus, and Eric Moulines. Aiming at distributed MCMC algorithms that operate on several machines, with a target distribution that breaks into a product

$\int\prod_{i=1}^b \pi_i(\theta,z_i)\,\text d\mathbf{z}=\prod_{i=1}^b e^{U_i(A_i\theta)}$

where θ is common to all terms. And each term in the product can (only) be computed locally. This setup is obviously the same as for the embarrassingly parallel approaches of Neiswanger et al. (2014) and Scott et al. (2016). And it follows an earlier proposal of Vono et al. (2020), which appears as a full Gibbs algorithm on the augmented parameters (θ,z), assuming each term is a conditional density in the latent z’s. Which requires constant communications between the b workers and the central “master” node when θ is concerned. The ICML²¹ paper overcomes this difficulty by defining an approximate target with a Normal component in z. Meaning that the (approximate) conditional distribution of θ given the latent z is Normal, i.e. considering the augmented joint

$\prod_{i=1}^b\exp\left\{U_i(z_i)-\rho_i\|z_i-A_i\theta\|^2\right\}$

but despite the Gaussian aspect, this is not always practical:

“When d [is large], this Gibbs sampling scheme unfortunately leads to prohibitive computational costs and hence prevents its practical use for general Bayesian inference problems.”

The authors then move to simulating from several Langevin steps, more specifically running one move of the Euler-Maruyama discretisation scheme of the overdamped Langevin stochastic differential equation. Communication with the central node is then reduced. The paper proposes a proof of convergence in this unusual (since overdamped) setup. As well as bounds on the bias due to the inclusion of the latent variables. They also manage to find the required scaling of the various parameters involved (Normal variance, discretisation scale, Langevin runs) to achieve convergence, which I find rather remarkable. The table at the top illustrates the comparison with earlier methods, whenever available.
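To make the Euler-Maruyama move concrete, here is a minimal sketch of the unadjusted Langevin recursion, θ' = θ + γ∇log π(θ) + √(2γ) ξ, on a toy one-dimensional standard Normal target. This is not the distributed algorithm of the paper: the target, the step size γ = 0.05, and the function names `ula_step`, `run_ula` and `grad_log_std_normal` are all assumptions made for illustration.

```python
import math
import random


def ula_step(x, grad_log_target, step, rng):
    """One Euler-Maruyama step of the overdamped Langevin diffusion."""
    noise = rng.gauss(0.0, 1.0)
    return x + step * grad_log_target(x) + math.sqrt(2.0 * step) * noise


def run_ula(x0, grad_log_target, step, n_steps, seed=0):
    """Iterate ula_step without any Metropolis correction."""
    rng = random.Random(seed)
    x = x0
    chain = []
    for _ in range(n_steps):
        x = ula_step(x, grad_log_target, step, rng)
        chain.append(x)
    return chain


def grad_log_std_normal(x):
    # Toy target (an assumption): log pi(x) = -x^2/2 up to a constant
    return -x


step = 0.05
chain = run_ula(x0=3.0, grad_log_target=grad_log_std_normal,
                step=step, n_steps=20000)
mean = sum(chain) / len(chain)
var = sum((x - mean) ** 2 for x in chain) / len(chain)
# For this toy target the recursion is an AR(1) process whose stationary
# variance is 1 / (1 - step / 2), slightly above the target variance of 1.
print(mean, var, 1.0 / (1.0 - step / 2.0))
```

On this toy case the discretisation bias is explicit, since the chain is stationary with variance 1/(1-γ/2) rather than 1. This is the kind of bias that, along with the one due to the latent variables, has to be controlled by the scaling of the parameters.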

probability comparisons

Posted in Books, Kids, pictures, Statistics on November 6, 2020 by xi'an

my likelihood is dominating my prior [not!]

Posted in Kids, Statistics on August 29, 2019 by xi'an

An interesting misconception read on X validated today, with a confusion between the absolute value of the likelihood function and its variability. Which I have trouble explaining except possibly by the extrapolation from the discrete case and a confusion between the probability density of the data [scaled as a probability] and the likelihood function [scale-less]. I also had trouble convincing the originator of the question of the irrelevance of the scale of the likelihood per se, even when demonstrating that $|\Sigma|$ could vanish from the posterior with no consequence whatsoever. It is only when I thought of the case when the likelihood is constant in $\theta$ that I managed to make my case.
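The argument can be checked numerically with a small sketch, under assumptions of my own choosing: a single Normal observation with known scale σ = 0.1 (so that the density takes values well above one), a standard Normal prior, and a posterior evaluated on a grid. The names `grid_posterior`, `log_density`, `log_kernel` and `log_flat` are hypothetical helpers, not taken from the original question.

```python
import math


def grid_posterior(log_lik, log_prior, grid):
    """Normalised posterior weights on a grid, computed on the log scale."""
    log_post = [log_lik(t) + log_prior(t) for t in grid]
    top = max(log_post)
    weights = [math.exp(v - top) for v in log_post]
    total = sum(weights)
    return [w / total for w in weights]


x_obs, sigma = 1.2, 0.1  # hypothetical observation and known scale
grid = [-3.0 + 0.01 * k for k in range(601)]


def log_prior(t):
    return -0.5 * t * t


def log_density(t):
    # Full Normal log-density, including the normalising constant in sigma
    return -math.log(sigma * math.sqrt(2.0 * math.pi)) - 0.5 * ((x_obs - t) / sigma) ** 2


def log_kernel(t):
    # Same likelihood with the constant (free of theta) dropped
    return -0.5 * ((x_obs - t) / sigma) ** 2


def log_flat(t):
    # A likelihood that is constant in theta, however large its value
    return 1000.0


post_full = grid_posterior(log_density, log_prior, grid)
post_kernel = grid_posterior(log_kernel, log_prior, grid)
post_flat = grid_posterior(log_flat, log_prior, grid)
prior_only = grid_posterior(lambda t: 0.0, log_prior, grid)

print(max(abs(a - b) for a, b in zip(post_full, post_kernel)))
print(max(abs(a - b) for a, b in zip(post_flat, prior_only)))
```

Both printed discrepancies are at the level of floating point error: dropping the constant leaves the posterior unchanged, and a likelihood that is constant in θ returns the prior, whatever its absolute value.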

revised empirical HMC

Posted in Statistics, University life on March 12, 2019 by xi'an

Following the informed and helpful comments from Matt Graham and Bob Carpenter on our eHMC paper [arXival] last month, we produced a revised and re-arXived version of the paper based on new experiments run by Changye Wu and Julien Stoehr. Here are some quick replies to these comments, reproduced for convenience. (Warning: this is a loooong post, much longer than usual.)

scalable Metropolis-Hastings

Posted in Books, Statistics, Travel on February 12, 2019 by xi'an

Among the flurry of arXived papers of last week (414!), including a fair chunk of papers submitted to ICML 2019, I spotted one entry by Cornish et al. on scalable Metropolis-Hastings, which Arnaud Doucet had mentioned to me yesterday when in Oxford. The paper builds on the delayed acceptance paper we wrote with Marco Banterlé, Clara Grazian and Anthony Lee, itself relying on a factorisation decomposition of the likelihood, combined with control variate accelerating techniques. The factorisation of both the target and the proposal allows for a (less efficient) Metropolis-Hastings acceptance ratio that is the product

$\prod_{i=1}^m \alpha_i(\theta,\theta')$

of individual Metropolis-Hastings acceptance ratios, but which allows for quicker rejection if one of the probabilities in the product is small, because the corresponding Bernoulli draw is zero with high probability. One advance made in Michel et al. (2017) [which I doubly missed] is that subsampling is achievable by thinning (as in PDMPs, where these authors have been quite active) through an algorithm of Shanthikumar (1985) [described in Devroye’s bible]. Provided each Metropolis-Hastings probability can be lower bounded:

$\alpha_i(\theta,\theta') \ge \exp\{-\psi_i \phi(\theta,\theta')\}$

by a term where the transition φ does not depend on the index i in the product. The computing cost of the thinning process thus depends on the efficiency of the subsampling, namely whether or not the (Poisson) number of terms is much smaller than m, the number of terms in the product. A neat trick in the current paper that extends the Fukui-Todo procedure is to switch to the original Metropolis-Hastings when the overall lower bound is too small, recovering the geometric ergodicity of this original if it holds (Theorem 2.1). Another neat remark is that when using the naïve factorisation as the product of the n individual likelihoods, the resulting algorithm is sort of doomed as n grows, even with an optimal scaling of the proposals. To achieve scalability, the authors introduce a Taylor (i.e., Gaussian) approximation to each local target in the product and start the acceptance decomposition by using the resulting overall Gaussian approximation. Meaning that the remaining product is now made of ratios of targets over their local Taylor approximations, hence most likely close to one. And potentially lower-bounded by the remainder term in the Taylor expansion. Leading to the conclusion that, when everything goes well, meaning that the Taylor expansions can be conducted and the bounds derived for the appropriate expansion, the order of the Poisson scale is O(1/√n)..! The proposal for the Metropolis-Hastings move is actually tuned to the Gaussian approximation, appearing as a variant of the Langevin move or more exactly a discretization of a Hamiltonian move. Obviously, I cannot judge the complexity of implementing this new scheme from just reading the paper, but this development on the split target is definitely an exciting prospect for handling huge datasets and their friends!
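As an illustration of the factorised acceptance ratio described above, here is a minimal sketch of the naïve version only: a random walk proposal accepted with probability ∏ min(1, π_i(θ')/π_i(θ)), with the Bernoulli draws made one factor at a time so that a rejection stops the evaluation early. It involves neither the thinning step nor the Taylor control variates of the paper. The toy target (five Normal likelihood terms with unit variance and a flat prior), the data values, and the names `factorised_mh_step` and `make_log_factor` are assumptions for the example.

```python
import math
import random


def make_log_factor(x):
    """Log of one likelihood term, here N(x | theta, 1) up to a constant."""
    return lambda theta: -0.5 * (x - theta) ** 2


def factorised_mh_step(theta, log_factors, scale, rng):
    """Random-walk move accepted with probability prod_i min(1, ratio_i).

    Returns the new state and the number of factors actually evaluated.
    """
    prop = theta + scale * rng.gauss(0.0, 1.0)
    evaluated = 0
    for log_f in log_factors:
        evaluated += 1
        log_ratio = log_f(prop) - log_f(theta)
        # Bernoulli draw with success probability min(1, exp(log_ratio))
        if log_ratio < 0.0 and rng.random() >= math.exp(log_ratio):
            return theta, evaluated  # early rejection, remaining factors skipped
    return prop, evaluated


data = [0.3, -0.5, 1.1, 0.8, 0.2]  # hypothetical observations
log_factors = [make_log_factor(x) for x in data]

rng = random.Random(1)
theta, chain, costs = 0.0, [], []
for _ in range(20000):
    theta, used = factorised_mh_step(theta, log_factors, 1.0, rng)
    chain.append(theta)
    costs.append(used)

post_mean = sum(chain) / len(chain)
avg_cost = sum(costs) / len(costs)
# Under the flat prior, the exact posterior mean is the sample mean of the data
print(post_mean, sum(data) / len(data), avg_cost)
```

Each factor keeps the symmetric form min(π_i(θ), π_i(θ')) in the balance equation, which is why the product still leaves the target invariant, at the price of a lower acceptance rate than standard Metropolis-Hastings. The average number of factors evaluated per iteration (`avg_cost`) shows the saving brought by early rejection in this toy setting.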