## revised empirical HMC

Posted in Statistics, University life with tags , , , , , , , , on March 12, 2019 by xi'an

Following the informed and helpful comments from Matt Graham and Bob Carpenter on our eHMC paper [arXival] last month, we produced a revised and re-arXived version of the paper based on new experiments ran by Changye Wu and Julien Stoehr. Here are some quick replies to these comments, reproduced for convenience. (Warning: this is a loooong post, much longer than usual.) Continue reading

## scalable Metropolis-Hastings

Posted in Books, Statistics, Travel with tags , , , , , , , , , on February 12, 2019 by xi'an

Among the flury of arXived papers of last week (414!), including a fair chunk of papers submitted to ICML 2019, I spotted one entry by Cornish et al. on scalable Metropolis-Hastings, which Arnaud Doucet had mentioned to me yesterday when in Oxford. The paper builds on the delayed acceptance paper we wrote with Marco Banterlé, Clara Grazian and Anthony Lee, itself relying on a factorisation decomposition of the likelihood, combined with control variate accelerating techniques. The factorisation of both the target and the proposal allows for a (less efficient) Metropolis-Hastings acceptance ratio that is the product

$\prod_{i=1}^m \alpha_i(\theta,\theta')$

of individual Metropolis-Hastings acceptance ratios, but which allows for quicker rejection if one of the probabilities in the product is small, because the corresponding Bernoulli draw is zero with high probability. One advance made in Michel et al. (2017) [which I doubly missed] is that subsampling is achievable by thinning (as in PDMPs, where these authors have been quite active) through an algorithm of Shantikumar (1985) [described in Devroye’s bible]. Provided each Metropolis-Hastings probability can be lower bounded:

$\alpha_i(\theta,\theta') \ge \exp\{-\psi_i \phi(\theta,\theta')\}$

by a term where the transition φ does not depend on the index i in the product. The computing cost of the thinning process thus depends on the efficiency of the subsampling, namely whether or not the (Poisson) number of terms is much smaller than m, number of terms in the product. A neat trick in the current paper that extends the the Fukui-Todo procedure is to switch to the original Metropolis-Hastings when the overall lower bound is too small, recovering the geometric ergodicity of this original if it holds (Theorem 2.1). Another neat remark is that when using the naïve factorisation as the product of the n individual likelihoods, the resulting algorithm is sort of doomed as n grows, even with an optimal scaling of the proposals. To achieve scalability, the authors introduce a Taylor (i.e., Gaussian) approximation to each local target in the product and start the acceptance decomposition by using the resulting overall Gaussian approximation. Meaning that the remaining product is now made of ratios of targets over their local Taylor approximations, hence most likely close to one. And potentially lower-bounded by the remainder term in the Taylor expansion. Leading to the conclusion that, when everything goes well, meaning that the Taylor expansions can be conducted and the bounds derived for the appropriate expansion, the order of the Poisson scale is O(1/√n)..! The proposal for the Metropolis-Hastings move is actually tuned to the Gaussian approximation, appearing as a variant of the Langevin move or more exactly a discretization of an Hamiltonian move. Obviously, I cannot judge of the complexity in implementing this new scheme from just reading the paper, but this development on the split target is definitely an exciting prospect for handling huge datasets and their friends!

## accelerating HMC by learning the leapfrog scale

Posted in Books, Statistics with tags , , , , , , , , on October 12, 2018 by xi'an

In this new arXiv submission that was part of Changye Wu’s thesis [defended last week],  we try to reduce the high sensitivity of the HMC algorithm to its hand-tuned parameters, namely the step size ε  of the discretisation scheme, the number of steps L of the integrator, and the covariance matrix of the auxiliary variables. By calibrating the number of steps of the Leapfrog integrator towards avoiding both slow mixing chains and wasteful computation costs. We do so by learning from the No-U-Turn Sampler (NUTS) of Hoffman and Gelman (2014) which already automatically tunes both the step size and the number of leapfrogs.

The core idea behind NUTS is to pick the step size via primal-dual averaging in a burn-in (warmup, Andrew would say) phase and to build at each iteration a proposal based on following a locally longest path on a level set of the Hamiltonian. This is achieved by a recursive algorithm that, at each call to the leapfrog integrator, requires to evaluate both the gradient of the target distribution and the Hamiltonianitself. Roughly speaking an iteration of NUTS costs twice as much as regular HMC with the same number of calls to the integrator. Our approach is to learn from NUTS the scale of the leapfrog length and use the resulting empirical distribution of the longest leapfrog path to randomly pick the value of  L at each iteration of an HMC scheme. This obviously preserves the validity of the HMC algorithm.

While a theoretical comparison of the convergence performances of NUTS and this eHMC proposal seem beyond our reach, we ran a series of experiments to evaluate these performances, using as a criterion an ESS value that is calibrated by the evaluation cost of the logarithm of target density function and of its gradient, as this is usually the most costly part of the algorithms. As well as a similarly calibrated expected square jumping distance. Above is one such illustration for a stochastic volatility model, the first axis representing the targeted acceptance probability in the Metropolis step. Some of the gains in either ESS or ESJD are by a factor of ten, which relates to our argument that NUTS somewhat wastes computation effort using a uniformly distributed proposal over the candidate set, instead of being close to its end-points, which automatically reduces the distance between the current position and the proposal.

## Hamiltonian tails

Posted in Books, Kids, R, Statistics, University life with tags , , , , , , on July 17, 2018 by xi'an

“We demonstrate HMC’s sensitivity to these parameters by sampling from a bivariate Gaussian with correlation coefficient 0.99. We consider three settings (ε,L) = {(0.16; 40); (0.16; 50); (0.15; 50)}” Ziyu Wang, Shakir Mohamed, and Nando De Freitas. 2013

In an experiment with my PhD student Changye Wu (who wrote all R codes used below), we looked back at a strange feature in an 2013 ICML paper by Wang, Mohamed, and De Freitas. Namely, a rather poor performance of an Hamiltonian Monte Carlo (leapfrog) algorithm on a two-dimensional strongly correlated Gaussian target, for very specific values of the parameters (ε,L) of the algorithm.

The Gaussian target associated with this sample stands right in the middle of the two clouds, as identified by Wang et al. And the leapfrog integration path for (ε,L)=(0.15,50)

keeps jumping between the two ridges (or tails) , with no stop in the middle. Changing ever so slightly (ε,L) to (ε,L)=(0.16,40) does not modify the path very much

but the HMC output is quite different since the cloud then sits right on top of the target

with no clear explanation except for a sort of periodicity in the leapfrog sequence associated with the velocity generated at the start of the code. Looking at the Hamiltonian values for (ε,L)=(0.15,50)

and for (ε,L)=(0.16,40)

does not help, except to point at a sequence located far in the tails of this Hamiltonian, surprisingly varying when supposed to be constant. At first, we thought the large value of ε was to blame but much smaller values still return poor convergence performances. As below for (ε,L)=(0.01,450)

## generalizing Hamiltonian Monte Carlo with neural networks

Posted in Statistics with tags , , , on April 25, 2018 by xi'an

Daniel Levy, Matthew Hoffman, and Jascha Sohl-Dickstein pointed out to me a recent paper of theirs submitted to and accepted by ICLR 2018, with the above title. This allowed me to discover the open source handling of paper reviews at ICLR, which I find quite convincing, except for not using MathJax or another medium for LaTeX formulas. And which provides a collection of comments besides mine’s. (Disclaimer: I was not involved in the processing of this paper for ICLR!)

“Ultimately our goal (and that of HMC) is to produce a proposal that mixes efficiently, not to simulate Hamiltonian dynamics accurately.”

The starting concept is the same as GANs (generative adversarial networks) discussed here a few weeks ago. Complemented by a new HMC that also uses deep neural networks to represent the HMC trajectory. (Also seen in earlier papers by e.g. Strathman.) The novelty in the HMC seems to be a binary direction indicator on top of the velocity. The leapfrog integrator is also modified, with a location scale generalisation for the velocity and a half-half location scale move for the original target x. The functions appearing in the location scale aspects are learned by neural nets. Towards minimising lag-one auto-correlation. Plus an extra penalty for not moving enough. Reflecting on the recent MCMC literature and in particular on the presentations at BayesComp last month, judging from comments of participants, this inclusion of neural tools in the tuning of MCMC algorithms sounds like a steady trend in the community. I am slightly at a loss about the adaptive aspects of the trend with regards to the Markovianity of the outcome.

“To compute the Metropolis-Hastings acceptance probability for a deterministic transition, the operator
must be invertible and have a tractable Jacobian.”

A remark (above) that seems to date back at least to Peter Green’s reversible jump. Duly mentioned in the paper. When reading about the performances of this new learning HMC, I could not see where the learning steps for the parameters of the leapfrog operators were accounted for, although the authors mention an identical number of gradient computations (which I take to mean the same thing). One evaluation of this method against earlier ones (Fig.2) checks successive values of the likelihood, which may be intuitive enough but does not necessarily qualify convergence to the right region since the posterior may concentrate away from the maximal likelihood.

## unbiased HMC

Posted in Books, pictures, Statistics with tags , , , , , , , on September 25, 2017 by xi'an

Jeremy Heng and Pierre Jacob arXived last week a paper on unbiased Hamiltonian Monte Carlo by coupling, following the earlier paper of Pierre and co-authors on debiasing by coupling a few weeks ago. The coupling within the HMC amounts to running two HMC chains with common random numbers, plus subtleties!

“As with any other MCMC method, HMC estimators are justified in the limit of the number of iterations. Algorithms which rely on such asymptotics face the risk of becoming obsolete if computational power keeps increasing through the number of available processors and not through clock speed.”

The main difficulty here is to have both chains meet (exactly) with large probability, since coupled HMC can only bring these chain close to one another. The trick stands in using both coupled HMC and coupled Hastings-Metropolis kernels, since the coupled MH kernel allows for exact meetings when the chains are already close, after which they remain happily and forever together! The algorithm is implemented by choosing between the kernels at random at each iteration. (Unbiasedness follows by the Glynn-Rhee trick, which is eminently well-suited for coupling!) As pointed out from the start of the paper, the appeal of this unbiased version is that the algorithm can be (embarrassingly) parallelised since all processors in use return estimators that are iid copies of one another, hence easily merged into a better estimator.

## a conceptual introduction to HMC [reply from the author]

Posted in Statistics with tags , , , , , , , , on September 8, 2017 by xi'an

[Here is the reply on my post from Michael Bétancourt, detailed enough to be promoted from comment to post!]

As Dan notes this is meant as an introduction for those without a strong mathematical background, hence the focus on concepts rather than theorems! There’s plenty of maths deeper in the references. ;-)

I am not sure I get this sentence. Either it means that an expectation remains invariant under reparameterisation. Or something else and more profound that eludes me. In particular because Michael repeats later (p.25) that the canonical density does not depend on the parameterisation.

What I was trying to get at is that expectations and really all of measure theory are reparameteriztion invariant, but implementations of statistical algorithms that depend on parameterization-dependent representations, namely densities, are not. If your algorithm is sensitive to these parameterization dependencies then you end up with a tuning problem — which parameterization is best? — which makes it harder to utilize the algorithm in practice.

Exact implementations of HMC (i.e. without an integrator) are fully geometric and do not depend on any chosen parameterization, hence the canonical density and more importantly the Hamiltonian being an invariant objects. That said, there are some choices to be made in that construction, and those choices often look like parameter dependencies. See below!

“Every choice of kinetic energy and integration time yields a new Hamiltonian transition that will interact differently with a given target distribution (…) when poorly-chosen, however, the performance can suffer dramatically.”

This is exactly where it’s easy to get confused with what’s invariant and what’s not!

The target density gives rise to a potential energy, and the chosen density over momenta gives rise to a kinetic energy. The two energies transform in opposite ways under a reparameterization so their sum, the Hamiltonian, is invariant.

Really there’s a fully invariant, measure-theoretic construction where you use the target measure directly and add a “cotangent disintegration”.

In practice, however, we often choose a default kinetic energy, i.e. a log density, based on the parameterization of the target parameter space, for example an “identify mass matrix” kinetic energy. In other words, the algorithm itself is invariant but by selecting the algorithmic degrees of freedom based on the parameterization of the target parameter space we induce an implicit parameter dependence.

This all gets more complicated when we introducing the adaptation we use in Stan, which sets the elements of the mass matrix to marginal variances which means that the adapted algorithm is invariant to marginal transformations but not joint ones…

The explanation of the HMC move as a combination of uniform moves along isoclines of fixed energy level and of jumps between energy levels does not seem to translate into practical implementations, at least not as explained in the paper. Simulating directly the energy distribution for a complex target distribution does not seem more feasible than moving up likelihood levels in nested sampling.

Indeed, being able to simulate exactly from the energy distribution, which is equivalent to being able to quantify the density of states in statistical mechanics, is intractable for the same reason that marginal likelihoods are intractable. Which is a shame, because conditioned on those samples HMC could be made embarrassingly parallel!

Instead we draw correlated samples using momenta resamplings between each trajectory. As Dan noted this provides some intuition about Stan (it reduced random walk behavior to one dimension) but also motivates some powerful energy-based diagnostics that immediately indicate when the momentum resampling is limiting performance and we need to improve it by, say, changing the kinetic energy. Or per my previous comment, by keeping the kinetic energy the same but changing the parameterization of the target parameter space. :-)

In the end I cannot but agree with the concluding statement that the geometry of the target distribution holds the key to devising more efficient Monte Carlo methods.

Yes! That’s all I really want statisticians to take away from the paper. :-)