## unbiased Hamiltonian Monte Carlo with couplings

Posted in Books, Kids, Statistics, University life with tags , , , , , , on October 25, 2019 by xi'an

In the June issue of Biometrika, which had been sitting for a few weeks on my desk under my teapot!, Jeremy Heng and Pierre Jacob published a paper on unbiased estimators for Hamiltonian Monte Carlo using couplings. (Disclaimer: I was not involved with the review or editing of this paper.) Which extends to HMC environments the earlier paper of Pierre Jacob, John O’Leary and Yves Atchadé, to be discussed soon at the Royal Statistical Society. The fundamentals are the same, namely that an unbiased estimator can be produced from a converging sequence of estimators and that it can be de facto computed if two Markov chains with the same marginal can be coupled. The issue with Hamiltonians is to figure out how to couple their dynamics. In the Gaussian case, it is relatively easy to see that two chains with the same initial momentum meet periodically. In general, there is contraction within a compact set (Lemma 1). The coupling extends to a time discretisation of the Hamiltonian flow by a leap-frog integrator, still using the same momentum. Which roughly amounts in using the same random numbers in both chains. When defining a relaxed meeting (!) where both chains are within δ of one another, the authors rely on a drift condition (8) that reminds me of the early days of MCMC convergence and seem to imply the existence of a small set “where the target distribution [density] is strongly log-concave”. And which makes me wonder if this small set could be used instead to create renewal events that would in turn ensure both stationarity and unbiasedness without the recourse to a second coupled chain. When compared on a Gaussian example with couplings on Metropolis-Hastings and MALA (Fig. 1), the coupled HMC sees hardly any impact of the dimension of the target (in the average coupling time), with a much lower value. However, I wonder at the relevance of the meeting time as an assessment of efficiency. In the sense that the coupling time is not a convergence time but reflects as well on the initial conditions. I acknowledge that this allows for an averaging over  parallel implementations but I remain puzzled by the statement that this leads to “estimators that are consistent in the limit of the number of replicates, rather than in the usual limit of the number of Markov chain iterations”, since a particularly poor initial distribution could on principle lead to a mode of the target being never explored or on the coupling time being ever so rarely too large for the computing abilities at hand.

## dynamic nested sampling for stars

Posted in Books, pictures, Statistics, Travel with tags , , , , , , , , , , , , , , , , , on April 12, 2019 by xi'an

In the sequel of earlier nested sampling packages, like MultiNest, Joshua Speagle has written a new package called dynesty that manages dynamic nested sampling, primarily intended for astronomical applications. Which is the field where nested sampling is the most popular. One of the first remarks in the paper is that nested sampling can be more easily implemented by using a Uniform reparameterisation of the prior, that is, a reparameterisation that turns the prior into a Uniform over the unit hypercube. Which means in fine that the prior distribution can be generated from a fixed vector of uniforms and known transforms. Maybe not such an issue given that this is the prior after all.  The author considers this makes sampling under the likelihood constraint a much simpler problem but it all depends in the end on the concentration of the likelihood within the unit hypercube. And on the ability to reach the higher likelihood slices. I did not see any special trick when looking at the documentation, but reflected on the fundamental connection between nested sampling and this ability. As in the original proposal by John Skilling (2006), the slice volumes are “estimated” by simulated Beta order statistics, with no connection with the actual sequence of simulation or the problem at hand. We did point out our incomprehension for such a scheme in our Biometrika paper with Nicolas Chopin. As in earlier versions, the algorithm attempts at visualising the slices by different bounding techniques, before proceeding to explore the bounded regions by several exploration algorithms, including HMC.

“As with any sampling method, we strongly advocate that Nested Sampling should not be viewed as being strictly“better” or “worse” than MCMC, but rather as a tool that can be more or less useful in certain problems. There is no “One True Method to Rule Them All”, even though it can be tempting to look for one.”

When introducing the dynamic version, the author lists three drawbacks for the static (original) version. One is the reliance on this transform of a Uniform vector over an hypercube. Another one is that the overall runtime is highly sensitive to the choice the prior. (If simulating from the prior rather than an importance function, as suggested in our paper.) A third one is the issue that nested sampling is impervious to the final goal, evidence approximation versus posterior simulation, i.e., uses a constant rate of prior integration. The dynamic version simply modifies the number of point simulated in each slice. According to the (relative) increase in evidence provided by the current slice, estimated through iterations. This makes nested sampling a sort of inversted Wang-Landau since it sharpens the difference between slices. (The dynamic aspects for estimating the volumes of the slices and the stopping rule may hinder convergence in unclear ways, which is not discussed by the paper.) Among the many examples produced in the paper, a 200 dimension Normal target, which is an interesting object for posterior simulation in that most of the posterior mass rests on a ring away from the maximum of the likelihood. But does not seem to merit a mention in the discussion. Another example of heterogeneous regression favourably compares dynesty with MCMC in terms of ESS (but fails to include an HMC version).

[Breaking News: Although I wrote this post before the exciting first image of the black hole in M87 was made public and hence before I was aware of it, the associated AJL paper points out relying on dynesty for comparing several physical models of the phenomenon by nested sampling.]

## revised empirical HMC

Posted in Statistics, University life with tags , , , , , , , , on March 12, 2019 by xi'an

Following the informed and helpful comments from Matt Graham and Bob Carpenter on our eHMC paper [arXival] last month, we produced a revised and re-arXived version of the paper based on new experiments ran by Changye Wu and Julien Stoehr. Here are some quick replies to these comments, reproduced for convenience. (Warning: this is a loooong post, much longer than usual.) Continue reading

## scalable Metropolis-Hastings

Posted in Books, Statistics, Travel with tags , , , , , , , , , on February 12, 2019 by xi'an

Among the flury of arXived papers of last week (414!), including a fair chunk of papers submitted to ICML 2019, I spotted one entry by Cornish et al. on scalable Metropolis-Hastings, which Arnaud Doucet had mentioned to me yesterday when in Oxford. The paper builds on the delayed acceptance paper we wrote with Marco Banterlé, Clara Grazian and Anthony Lee, itself relying on a factorisation decomposition of the likelihood, combined with control variate accelerating techniques. The factorisation of both the target and the proposal allows for a (less efficient) Metropolis-Hastings acceptance ratio that is the product

$\prod_{i=1}^m \alpha_i(\theta,\theta')$

of individual Metropolis-Hastings acceptance ratios, but which allows for quicker rejection if one of the probabilities in the product is small, because the corresponding Bernoulli draw is zero with high probability. One advance made in Michel et al. (2017) [which I doubly missed] is that subsampling is achievable by thinning (as in PDMPs, where these authors have been quite active) through an algorithm of Shantikumar (1985) [described in Devroye’s bible]. Provided each Metropolis-Hastings probability can be lower bounded:

$\alpha_i(\theta,\theta') \ge \exp\{-\psi_i \phi(\theta,\theta')\}$

by a term where the transition φ does not depend on the index i in the product. The computing cost of the thinning process thus depends on the efficiency of the subsampling, namely whether or not the (Poisson) number of terms is much smaller than m, number of terms in the product. A neat trick in the current paper that extends the the Fukui-Todo procedure is to switch to the original Metropolis-Hastings when the overall lower bound is too small, recovering the geometric ergodicity of this original if it holds (Theorem 2.1). Another neat remark is that when using the naïve factorisation as the product of the n individual likelihoods, the resulting algorithm is sort of doomed as n grows, even with an optimal scaling of the proposals. To achieve scalability, the authors introduce a Taylor (i.e., Gaussian) approximation to each local target in the product and start the acceptance decomposition by using the resulting overall Gaussian approximation. Meaning that the remaining product is now made of ratios of targets over their local Taylor approximations, hence most likely close to one. And potentially lower-bounded by the remainder term in the Taylor expansion. Leading to the conclusion that, when everything goes well, meaning that the Taylor expansions can be conducted and the bounds derived for the appropriate expansion, the order of the Poisson scale is O(1/√n)..! The proposal for the Metropolis-Hastings move is actually tuned to the Gaussian approximation, appearing as a variant of the Langevin move or more exactly a discretization of an Hamiltonian move. Obviously, I cannot judge of the complexity in implementing this new scheme from just reading the paper, but this development on the split target is definitely an exciting prospect for handling huge datasets and their friends!

## accelerating HMC by learning the leapfrog scale

Posted in Books, Statistics with tags , , , , , , , , on October 12, 2018 by xi'an

In this new arXiv submission that was part of Changye Wu’s thesis [defended last week],  we try to reduce the high sensitivity of the HMC algorithm to its hand-tuned parameters, namely the step size ε  of the discretisation scheme, the number of steps L of the integrator, and the covariance matrix of the auxiliary variables. By calibrating the number of steps of the Leapfrog integrator towards avoiding both slow mixing chains and wasteful computation costs. We do so by learning from the No-U-Turn Sampler (NUTS) of Hoffman and Gelman (2014) which already automatically tunes both the step size and the number of leapfrogs.

The core idea behind NUTS is to pick the step size via primal-dual averaging in a burn-in (warmup, Andrew would say) phase and to build at each iteration a proposal based on following a locally longest path on a level set of the Hamiltonian. This is achieved by a recursive algorithm that, at each call to the leapfrog integrator, requires to evaluate both the gradient of the target distribution and the Hamiltonianitself. Roughly speaking an iteration of NUTS costs twice as much as regular HMC with the same number of calls to the integrator. Our approach is to learn from NUTS the scale of the leapfrog length and use the resulting empirical distribution of the longest leapfrog path to randomly pick the value of  L at each iteration of an HMC scheme. This obviously preserves the validity of the HMC algorithm.

While a theoretical comparison of the convergence performances of NUTS and this eHMC proposal seem beyond our reach, we ran a series of experiments to evaluate these performances, using as a criterion an ESS value that is calibrated by the evaluation cost of the logarithm of target density function and of its gradient, as this is usually the most costly part of the algorithms. As well as a similarly calibrated expected square jumping distance. Above is one such illustration for a stochastic volatility model, the first axis representing the targeted acceptance probability in the Metropolis step. Some of the gains in either ESS or ESJD are by a factor of ten, which relates to our argument that NUTS somewhat wastes computation effort using a uniformly distributed proposal over the candidate set, instead of being close to its end-points, which automatically reduces the distance between the current position and the proposal.

## Hamiltonian tails

Posted in Books, Kids, R, Statistics, University life with tags , , , , , , on July 17, 2018 by xi'an

“We demonstrate HMC’s sensitivity to these parameters by sampling from a bivariate Gaussian with correlation coefficient 0.99. We consider three settings (ε,L) = {(0.16; 40); (0.16; 50); (0.15; 50)}” Ziyu Wang, Shakir Mohamed, and Nando De Freitas. 2013

In an experiment with my PhD student Changye Wu (who wrote all R codes used below), we looked back at a strange feature in an 2013 ICML paper by Wang, Mohamed, and De Freitas. Namely, a rather poor performance of an Hamiltonian Monte Carlo (leapfrog) algorithm on a two-dimensional strongly correlated Gaussian target, for very specific values of the parameters (ε,L) of the algorithm.

The Gaussian target associated with this sample stands right in the middle of the two clouds, as identified by Wang et al. And the leapfrog integration path for (ε,L)=(0.15,50)

keeps jumping between the two ridges (or tails) , with no stop in the middle. Changing ever so slightly (ε,L) to (ε,L)=(0.16,40) does not modify the path very much

but the HMC output is quite different since the cloud then sits right on top of the target

with no clear explanation except for a sort of periodicity in the leapfrog sequence associated with the velocity generated at the start of the code. Looking at the Hamiltonian values for (ε,L)=(0.15,50)

and for (ε,L)=(0.16,40)

does not help, except to point at a sequence located far in the tails of this Hamiltonian, surprisingly varying when supposed to be constant. At first, we thought the large value of ε was to blame but much smaller values still return poor convergence performances. As below for (ε,L)=(0.01,450)

## generalizing Hamiltonian Monte Carlo with neural networks

Posted in Statistics with tags , , , on April 25, 2018 by xi'an

Daniel Levy, Matthew Hoffman, and Jascha Sohl-Dickstein pointed out to me a recent paper of theirs submitted to and accepted by ICLR 2018, with the above title. This allowed me to discover the open source handling of paper reviews at ICLR, which I find quite convincing, except for not using MathJax or another medium for LaTeX formulas. And which provides a collection of comments besides mine’s. (Disclaimer: I was not involved in the processing of this paper for ICLR!)

“Ultimately our goal (and that of HMC) is to produce a proposal that mixes efficiently, not to simulate Hamiltonian dynamics accurately.”

The starting concept is the same as GANs (generative adversarial networks) discussed here a few weeks ago. Complemented by a new HMC that also uses deep neural networks to represent the HMC trajectory. (Also seen in earlier papers by e.g. Strathman.) The novelty in the HMC seems to be a binary direction indicator on top of the velocity. The leapfrog integrator is also modified, with a location scale generalisation for the velocity and a half-half location scale move for the original target x. The functions appearing in the location scale aspects are learned by neural nets. Towards minimising lag-one auto-correlation. Plus an extra penalty for not moving enough. Reflecting on the recent MCMC literature and in particular on the presentations at BayesComp last month, judging from comments of participants, this inclusion of neural tools in the tuning of MCMC algorithms sounds like a steady trend in the community. I am slightly at a loss about the adaptive aspects of the trend with regards to the Markovianity of the outcome.

“To compute the Metropolis-Hastings acceptance probability for a deterministic transition, the operator
must be invertible and have a tractable Jacobian.”

A remark (above) that seems to date back at least to Peter Green’s reversible jump. Duly mentioned in the paper. When reading about the performances of this new learning HMC, I could not see where the learning steps for the parameters of the leapfrog operators were accounted for, although the authors mention an identical number of gradient computations (which I take to mean the same thing). One evaluation of this method against earlier ones (Fig.2) checks successive values of the likelihood, which may be intuitive enough but does not necessarily qualify convergence to the right region since the posterior may concentrate away from the maximal likelihood.