## a conceptual introduction to HMC

“…it has proven an empirical success on an incredibly diverse set of target distributions encountered in applied problems.”

**I**n January this year (!), Michael Betancourt posted on arXiv a detailed introduction to Hamiltonian Monte Carlo that overlapped with some talks of his I had attended, like the one in Boston two years ago. I have (re)read this introduction in order to include an HMC section in my accelerating-MCMC review for WIREs (the writing of which does not accelerate very much…)

“…this formal construction is often out of reach of theoretical and applied statisticians alike.”

With the relevant proviso that Michael is a friend and former colleague from Warwick, I appreciate the paper at least as much as I appreciated the highly intuitive approach to HMC in his talks. It is not very mathematical and does not provide theoretical arguments in defence of one solution over another, but it (still) provides engaging reasons for using HMC.

“One way to ensure computational inefficiency is to waste computational resources evaluating the target density and relevant functions in regions of parameter space that have negligible contribution to the desired expectation.”

The paper starts by insisting on the probabilistic importance of *the typical set*, which amounts to a ring (or thin shell) for Gaussian-like distributions. This means that in high dimensions the mode of the target is not a point that gets visited particularly often. I find this notion quite compelling and am at the same time [almost] flabbergasted that I have never heard of it before.
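The shell concentration is easy to see numerically. A minimal sketch using only NumPy (the √d radius is a standard fact about the χ distribution, not something taken from the paper):

```python
import numpy as np

# In d dimensions, draws from N(0, I_d) concentrate on a thin shell of
# radius about sqrt(d): the mode at the origin is almost never visited.
rng = np.random.default_rng(0)
for d in [1, 10, 100, 1000]:
    x = rng.standard_normal((10000, d))   # 10,000 draws from N(0, I_d)
    r = np.linalg.norm(x, axis=1)         # distance of each draw from the mode
    print(f"d={d:4d}  mean radius={r.mean():6.2f}  sqrt(d)={np.sqrt(d):6.2f}")
```

By d = 1000 every single draw sits dozens of standard deviations away from the mode, which is why evaluating the density near the mode wastes computation.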

“we will consider only a single parameterization for computing expectations, but we must be careful to ensure that any such computation does not depend on the irrelevant details of that parameterization, such as the particular shape of the probability density function.”

I am not sure I get this sentence. Either it means that an expectation remains invariant under reparameterisation. Or something else and more profound that eludes me. In particular because Michael repeats later (p.25) that the canonical density does not depend on the parameterisation.

“Every choice of kinetic energy and integration time yields a new Hamiltonian transition that will interact differently with a given target distribution (…) when poorly-chosen, however, the performance can suffer dramatically.”

When discussing HMC, Michael tends to go a wee bit overboard with superlatives, although he eventually points out the need for calibration, as in the above quote. The explanation of the HMC move as a combination of uniform moves along isoclines of fixed energy level and of jumps between energy levels does not seem to translate into practical implementations, at least not as explained in the paper. Directly simulating the energy distribution for a complex target distribution does not seem more feasible than moving up likelihood levels in nested sampling. (Unless I have forgotten something essential about HMC!) Similarly, when discussing symplectic integrators, the paper intuitively conveys the reason these integrators avoid Euler’s difficulties, even though one has to look elsewhere for rigorous explanations. In the end I cannot but agree with the concluding statement that the geometry of the target distribution holds the key to devising more efficient Monte Carlo methods.
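The contrast with Euler’s method can at least be illustrated on the simplest possible Hamiltonian. A hedged sketch of my own, not taken from the paper: for the harmonic oscillator H(q, p) = (q² + p²)/2, forward Euler inflates the energy geometrically, while the leapfrog stays bounded for arbitrarily long trajectories:

```python
import numpy as np

# Harmonic oscillator: H(q, p) = (q**2 + p**2) / 2.
# Forward Euler multiplies the energy by (1 + eps**2) every step; the
# leapfrog, being symplectic, exactly conserves a nearby "shadow"
# Hamiltonian and so keeps the true energy bounded.
def euler(q, p, eps, n):
    for _ in range(n):
        q, p = q + eps * p, p - eps * q
    return q, p

def leapfrog(q, p, eps, n):
    for _ in range(n):
        p -= 0.5 * eps * q   # half step for the momentum
        q += eps * p         # full step for the position
        p -= 0.5 * eps * q   # half step for the momentum
    return q, p

H = lambda q, p: 0.5 * (q ** 2 + p ** 2)
q0, p0, eps, n = 1.0, 0.0, 0.1, 1000
print("Euler    final energy:", H(*euler(q0, p0, eps, n)))     # blows up
print("leapfrog final energy:", H(*leapfrog(q0, p0, eps, n)))  # stays near 0.5
```

The rigorous statement (backward error analysis for symplectic integrators) is indeed elsewhere, but the numerics make the qualitative point.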

September 5, 2017 at 3:26 pm

Thanks for taking a look, X!

As Dan notes this is meant as an introduction for those without a strong mathematical background, hence the focus on concepts rather than theorems! There’s plenty of math deeper in the references. ;-)

> I am not sure I get this sentence. Either it means that an expectation remains invariant under reparameterisation. Or something else and more profound that eludes me. In particular because Michael repeats later (p.25) that the canonical density does not depend on the parameterisation.

What I was trying to get at is that expectations and really all of measure theory are reparameterization invariant, but implementations of statistical algorithms that depend on parameterization-dependent representations, namely densities, are not. If your algorithm is sensitive to these parameterization dependencies then you end up with a tuning problem — which parameterization is best? — which makes it harder to utilize the algorithm in practice.

Exact implementations of HMC (i.e. without an integrator) are fully geometric and do not depend on any chosen parameterization, hence the canonical density and, more importantly, the Hamiltonian being invariant objects. That said, there are some choices to be made in that construction, and those choices often look like parameterization dependencies. See below!

> “Every choice of kinetic energy and integration time yields a new Hamiltonian transition that will interact differently with a given target distribution (…) when poorly-chosen, however, the performance can suffer dramatically.”

This is exactly where it’s easy to get confused with what’s invariant and what’s not!

The target density gives rise to a potential energy, and the chosen density over momenta gives rise to a kinetic energy. The two energies transform in opposite ways under a reparameterization so their sum, the Hamiltonian, is invariant.
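A concrete check of this cancellation, as a numerical sketch of my own (not from the paper): under a linear reparameterization q′ = c·q the momentum transforms as a cotangent vector, p′ = p/c; the potential gains +log|c| from the density Jacobian, the kinetic term loses log|c|, and the sum is unchanged:

```python
import numpy as np

# Target: q ~ N(0, 4); momentum: p ~ N(0, 1).  Reparameterize q' = c*q,
# under which the momentum transforms as p' = p/c.  The potential
# (-log density) gains +log|c| from the Jacobian, the kinetic energy
# loses log|c|, and the Hamiltonian H = U + K is unchanged.
def neg_log_normal(x, var):
    return 0.5 * x ** 2 / var + 0.5 * np.log(2 * np.pi * var)

q, p, c = 1.3, -0.7, 2.5
H_qp = neg_log_normal(q, 4.0) + neg_log_normal(p, 1.0)

q2, p2 = c * q, p / c   # the same phase-space point in the new coordinates
H_q2p2 = neg_log_normal(q2, 4.0 * c ** 2) + neg_log_normal(p2, 1.0 / c ** 2)
print(H_qp, H_q2p2)     # identical up to floating point
```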

Really there’s a fully invariant, measure-theoretic construction where you use the target measure directly and add a “cotangent disintegration”.

In practice, however, we often choose a default kinetic energy, i.e. a log density, based on the parameterization of the target parameter space, for example an “identity mass matrix” kinetic energy. In other words, the algorithm itself is invariant, but by selecting the algorithmic degrees of freedom based on the parameterization of the target parameter space we induce an implicit parameterization dependence.

This all gets more complicated when we introduce the adaptation we use in Stan, which sets the elements of the mass matrix to the marginal variances, which means that the adapted algorithm is invariant to marginal transformations but not to joint ones…
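As a sketch of what that adaptation amounts to (a simplification of my own; Stan actually adapts in windows during warmup):

```python
import numpy as np

# Correlated warmup draws with very different marginal scales.
rng = np.random.default_rng(1)
warmup = rng.multivariate_normal(
    mean=[0.0, 0.0], cov=[[1.0, 4.0], [4.0, 25.0]], size=2000)

# Diagonal adaptation: inverse mass matrix = estimated marginal variances,
# so momenta are drawn as p ~ N(0, diag(1 / inv_mass)).
inv_mass = np.var(warmup, axis=0)
print(inv_mass)   # roughly [1, 25]; the correlation (a joint feature) is ignored
```

Rescaling one coordinate rescales its estimated variance and leaves the adapted sampler unchanged, while a rotation does not commute with this diagonal estimate — hence the marginal-but-not-joint invariance.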

> The explanation of the HMC move as a combination of uniform moves along isoclines of fixed energy level and of jumps between energy levels does not seem to translate into practical implementations, at least not as explained in the paper. Simulating directly the energy distribution for a complex target distribution does not seem more feasible than moving up likelihood levels in nested sampling.

Indeed, being able to simulate exactly from the energy distribution, which is equivalent to being able to quantify the density of states in statistical mechanics, is intractable for the same reason that marginal likelihoods are intractable. Which is a shame, because conditioned on those samples HMC could be made embarrassingly parallel!

Instead we draw correlated samples using momentum resamplings between each trajectory. As Dan noted, this provides some intuition about Stan (it reduces random-walk behavior to one dimension) but also motivates some powerful energy-based diagnostics that immediately indicate when the momentum resampling is limiting performance and we need to improve it by, say, changing the kinetic energy. Or, per my previous comment, by keeping the kinetic energy the same but changing the parameterization of the target parameter space. :-)
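For concreteness, a toy version of this scheme on a one-dimensional standard-normal target, with the momentum resampled before each trajectory and the energy recorded at every iteration (my own minimal sketch, not Stan’s implementation):

```python
import numpy as np

# Toy HMC on a standard-normal target: U(q) = q**2 / 2, K(p) = p**2 / 2.
rng = np.random.default_rng(0)
U, dU = (lambda q: 0.5 * q ** 2), (lambda q: q)

def hmc_step(q, eps=0.2, n_steps=20):
    p = rng.standard_normal()              # momentum resampling between trajectories
    H0 = U(q) + 0.5 * p ** 2
    qn, pn = q, p
    pn -= 0.5 * eps * dU(qn)               # leapfrog: initial half kick
    for _ in range(n_steps):
        qn += eps * pn
        pn -= eps * dU(qn)
    pn += 0.5 * eps * dU(qn)               # turn the last full kick into a half kick
    H1 = U(qn) + 0.5 * pn ** 2
    if np.log(rng.uniform()) < H0 - H1:    # Metropolis correction
        return qn, H1
    return q, H0

q, draws, energies = 0.0, [], []
for _ in range(5000):
    q, H = hmc_step(q)
    draws.append(q)
    energies.append(H)                     # the series the energy diagnostics use
print("mean, var:", np.mean(draws), np.var(draws))   # close to 0 and 1
```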

> In the end I cannot but agree with the concluding statement that the geometry of the target distribution holds the key to devising more efficient Monte Carlo methods.

Yes! That’s all I really want statisticians to take away from the paper. :-)

September 5, 2017 at 2:39 pm

> Either it means that an expectation remains invariant under reparameterisation.

That’s what Michael means.

To address Dan’s concern, the non-centered parameterization is a different distribution. Stan technically samples the parameters declared in the parameters block. The centered parameters are then defined as derived quantities.

I wrote Stan case studies on both of these topics (basically following Andrew’s strategy of writing something to get feedback from the experts):

For typical sets and curse of dimensionality:

R: http://mc-stan.org/users/documentation/case-studies/curse-dims.html

Python: http://mc-stan.org/users/documentation/case-studies/curse-dims-python.html

For how point estimates work differently than Bayesian expectations under reparameterization:

R: http://mc-stan.org/users/documentation/case-studies/mle-params.html

Michael and I are Americans; “awesome” is our baseline adjective, so we need to get creative with superlatives. (We did get dinged on an NSF grant proposal once for overselling; I find it usually works better than underselling ideas.)

Further addressing a point of Dan’s, NUTS tries to trace out the longest reversible trajectory it can without performing redundant work, but then it draws a single point along the trajectory, with a bias toward points near the end. This is true of Matt Hoffman’s original slice-sampled version and of Michael’s multinomial sampled update. Andrew was urging Matt Hoffman to figure out how to get the maximum expected squared “jumping” distance.

September 5, 2017 at 4:55 am

> It is not very mathematical and does not provide theoretical arguments for the defence of one solution versus another, but it (still) provides engaging reasons for using HMC.

The paper is a much less mathematical introduction to his maths papers, which do provide theoretical arguments, but also require some fairly hairy measure theory and differential geometry. For example, information about the effect symplectic integrators have is in his paper with Mark and Simon on optimising the step length.

>I find this notion quite compelling and am at the same time [almost] flabbergasted that I have never heard of it before.

It’s in a lot of papers, but it’s hiding in the maths!

> I am not sure I get this sentence.

I don’t either! We know that how well HMC converges changes under re-parameterisation: basically, if you parameterise the model so the statistical geometry is more like the geometry assumed by the mass matrix in HMC, then it works better. (This is the argument for why non-centred parameterisations sometimes work.)
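A quick numerical illustration of the non-centred idea on Neal’s funnel, the standard example (a sketch of the reparameterisation only, not of HMC itself):

```python
import numpy as np

# Neal's funnel: v ~ N(0, 3), x | v ~ N(0, exp(v/2)).
# Centred form: sample x directly; its conditional scale exp(v/2) varies
# over orders of magnitude with v, defeating any fixed mass matrix.
# Non-centred form: sample z ~ N(0, 1) and define x = exp(v/2) * z, so
# the variables actually sampled, (v, z), have a fixed unit-scale geometry.
rng = np.random.default_rng(0)
v = rng.normal(0.0, 3.0, size=100_000)
z = rng.normal(0.0, 1.0, size=100_000)   # non-centred variable: always unit scale
x = np.exp(v / 2) * z                    # centred variable, now a derived quantity

scales = np.exp(v / 2)
print("conditional sd of x spans", scales.min(), "to", scales.max())
print("sd of z:", z.std())               # about 1, whatever v is
```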

> Simulating directly the energy distribution for a complex target distribution does not seem more feasible than moving up likelihood levels in nested sampling.

Definitely. You choose the energy level implicitly when you sample new momenta. But one of the points that he makes is that after the fact you can compare the observed energy distribution with the implicit proposal kernel and compute a BFMI.
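The E-BFMI statistic itself is simple to compute from the chain’s energy values. A sketch (the ~0.3 warning level is the commonly quoted rule of thumb, not a theorem):

```python
import numpy as np

# E-BFMI: mean squared change in energy between successive iterations,
# divided by the variance of the energy.  Independent energy draws give
# a value near 2; values well below the usual ~0.3 warning threshold
# signal that momentum resampling explores the energy distribution slowly.
def e_bfmi(energy):
    energy = np.asarray(energy, dtype=float)
    return np.sum(np.diff(energy) ** 2) / np.sum((energy - energy.mean()) ** 2)

rng = np.random.default_rng(0)
good = rng.standard_normal(2000)                   # well-mixed energy chain
slow = 0.1 * np.cumsum(rng.standard_normal(2000))  # slowly drifting energy chain
print("well-mixed :", e_bfmi(good))   # near 2
print("slow-mixing:", e_bfmi(slow))   # much smaller
```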

It also explains, to me at least, why HMC works so well. Instead of being a random walk in p dimensions (like RWMH or MALA), it’s a random walk in one dimension and, in the NUTS case, a deterministic proposal that attempts to move as far away on the energy isocline as possible.

September 5, 2017 at 3:24 pm

Dan,

I don’t think doing a random walk in one dimension rather than p dimensions necessarily helps. Any plain Metropolis method will tend to suffer from bad scaling with dimensionality precisely because, even if it’s otherwise amazing, it does a random walk in energy levels, the range of which increases with dimension, while the change per update stays O(1). The crucial point, I think, is that HMC doesn’t do a random walk in energy levels with changes of O(1). When the momentum is replaced, the size of the change goes up with dimension, since the number of momentum variables goes up.
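The point about the momentum refresh can be checked directly: with p ~ N(0, I_d), the kinetic energy is χ²(d)/2, whose fluctuations grow like √d. A quick numerical sketch:

```python
import numpy as np

# With p ~ N(0, I_d), the kinetic energy |p|^2 / 2 is chi-squared(d)/2:
# mean d/2, standard deviation sqrt(d/2).  A momentum refresh therefore
# changes the energy by O(sqrt(d)), matching the O(sqrt(d)) width of the
# energy distribution, rather than taking O(1) random-walk steps.
rng = np.random.default_rng(0)
for d in [1, 100, 2500]:
    K = 0.5 * np.sum(rng.standard_normal((2000, d)) ** 2, axis=1)
    print(f"d={d:5d}  mean K={K.mean():8.1f}  sd K={K.std():6.2f}  "
          f"sqrt(d/2)={np.sqrt(d / 2):6.2f}")
```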