## geometry of Hamiltonian Monte Carlo

**T**he paper by Michael Betancourt and Leo Stein on the geometry of Hamiltonian Monte Carlo was one of the remaining ones on my pile of unread arXiv postings. I read it in the Paris métro yesterday (on my way to Jon Wellner’s *point de vue*), which is presumably the wrong place to learn about symplectic and differential geometry! In any case, I found the paper hard to relate to the implementation of Hamiltonian MCMC methods as illustrated by the Read Paper of Girolami and Calderhead (2011). Even though some notions looked familiar (e.g., “A*ny subsequent bias, however, can be avoided by treating the flow as a Metropolis proposal function*“, and “w*hile any reversible Hamiltonian defines both a probability distribution and respective Markov chain, practical applications demand the reverse construction*“), it felt like there was a big gap between the representation adopted therein (“*we have constructed the most general approach to Hamiltonian Monte Carlo*“) and a potential implementation. Again, presumably wrong place and wrong time to try to read an abstract paper! Especially for a reader who never had a strongly intuitive connection with Physics.

April 20, 2012 at 12:15 am

[…] of my black notebook (“bloc”) than in the past month!!! This morning started with an Hamiltonian session, Paul Fearnhead presenting recent developments in this area. I liked his coverage very […]

April 2, 2012 at 6:58 pm

casino…[…]geometry of Hamiltonian Monte Carlo « Xi'an's Og[…]…

February 7, 2012 at 7:27 pm

To be fair, we didn’t think anyone would actually read the paper…

The goal of the work was to derive the most general conditions for HMC, i.e. for what Hamiltonians will Hamilton’s equations give a Markov chain that converges to a specified target distribution. All of the math, which is admittedly terse, shows that the Hamiltonian decomposes into a potential and kinetic term and gives the necessary conditions for what the kinetic term could be. In the end, however, we note that the most practical choice is a quadratic kinetic energy which yields the RMHMC approach of Girolami and Calderhead.

Our proposed metric takes advantage of the geometric perspective used in the above derivation to avoid the need for any information geometry. Taking the background metric to be the identity metric has proved successful in initial tests, faster than using Fisher-Rao while being able to sample from notoriously difficult distributions such as the “funnel” from Neal’s slice sampling paper.

I’d be happy to discuss any further concepts with anyone interested.

February 7, 2012 at 10:39 pm

Thank you. (Great first sentence, to be recycled for sure!!!)

My lack of understanding of all things connected with physics makes me reluctant to get into a discussion with little intuition (for me), so I am a wee confused as to why the choice of metric would ever be restricted and in which sense one would be better than another.

February 8, 2012 at 12:17 am

We were never trying to restrict the metric of RMHMC (save for positive definiteness, of course). What we are restricting is the form of the kinetic energy itself — first we show that the general Hamiltonian function

H(p, q)factors intoH(p, q) = T(p, q) + V(q). AnyT(p, q)satisfying certain constraints would yield the desired Markov chain, butT(p, q) = 1/2 pleads to efficient resampling of the momenta. Again, we are just arguing that RMHMC is as general one would want to go for a practical HMC algorithm, at least at this point.^{T}Σ^{-1}(q) p + 1/2 log | Σ|As for the choice of metric, why –would– you want to use the Fisher-Rao metric? Even if you could perform all of the expectations analytically (unlikely in a complicated, non-linear model), why would you want to marginalize over the data? The local structure of the posterior depends on the specifics of the data. As an extreme, how do you define Fisher-Rao for a generic probability distribution that doesn’t have conditioning variables?

The metric we proposed requires no such expectations, utilizing only local information of the target distribution. The full Hessian is introduced into the dynamics providing more information than just the gradient, and as an added bonus the form of the metric allows for all relevant computations to be done at

O(N²)instead of theO(N³)needed for a generally dense metric.Of course it may very well be that our metric is ultimately flawed, but it’s certainly worth exploring it and other metrics that may provide more efficient sampling.

February 7, 2012 at 11:49 pm

The problem potential problem with the identity metric comes, for example, when you’ve got a hierarchical model. It can occur that the natural scaling is different for different parameters (by many orders of magnitude) and so some sort of “preconditioning” (to use the terminology) is necessary.

Not doing this can be catastrophic in two ways: you can either add a relatively large number to the diagonal and swamp the second order information, or you don’t add enough to the diagonal and the matrix remains indefinite.

If you were doing optimisation, one popular option is the Levenberg–Marquardt-type idea, which replaces the identity metric with a scalar times the diagonal of the Hessian (or some suitable approximation to it). But how to chose the scalar is an issue – the common option of adapting it based on the previous iterate won’t work here! (NB: Depending on how far you can take the analogy [the $1M question], the scaling parameter is strongly linked to a “trust region” radius, which suggests that it’s linked to the effective support of the proposal).

It’s also important to make this “adaptive” in some way because in areas where everything is ok, you’d expect that a metric based purely on the obseverved information will be a good choice, whereas you need to make sure that in difficult regions (where the observed information is [almost] indefinite) you don’t propose steps that are always rejected.

But how to do this while maintaining detailed balance…. (or, can you have a hamiltonian flow on a solution-dependent manifold that satisfies detailed balance? And can you simulate it!).

But Xi’an raises an important point that Girolami and Calderhead dodged: how do you tell a good metric from a bad metric? The obvious thing to do is to try to get a metric-dependent bound the spectral gap (or something similar) for Manifold HMC and minimise it over a class of “practical” metrics. But whether this is possible/practical/a good idea is anyone’s guess.

But I really hope some practical metric choices “magically appear”, because I want to use these methods, but the Fisher metric isn’t feasible for my models (if you parameterise a Gaussian by a sparse precision matrix [i.e. x ~ N(0, Q(\theta)^{-1})], you still seem to need all of the entries of the covariance matrix to compute the bit of the metric that corresponds to the \theta parameters).

February 8, 2012 at 4:17 pm

I’m not sure if you have the right metric in mind — our proposed metric is –not– the identity but rather a correction to the identity,

Σ. Using the Woodbury identity to invert the matrix (see the paper) you can see how this choice does indeed adaptively rotate and scale the space for more efficient sampling. Moreover, there is a sneaky feature that benefits complicated posteriors such as those from hierarchical models — the log determinant in the Hamiltonian. It turns out that this term effectively stores energy in the curvature of the distribution, allowing the chain to transition between regions of low potential and large curvature and regions of high potential and small curvature._{ij}= δ_{ij}+ ∂_{i}V ∂_{j}VThere’s even more suggestive properties if you look at the derivatives of the metric and log determinant (needed for the evolution) which are suspiciously similar to common non-linear optimization methods that take advantage of the Hessian.

This is clearly a young area of research and we are making no claims that our proposed metric is the best. We believe that we have generalized the conditions necessary for Hamiltonian evolution that produces a convergent Markov chain, as well as the conditions for the metric of a quadratic kinetic energy (namely that it’s a positive-definite, geometry object), and hope that this framework facilitates further study and proposed methods.

As for the criteria for choosing the best metric? I’m not sure there is enough understanding yet. We can shove more information into the metric (for example, explicit second derivatives if we use the Ricci tensor of our metric) but at a significantly greater computational cost. Any formal utility has to include both information content and computational cost, and in practice it’s likely that different metrics may be more appropriate for different problems.

February 6, 2012 at 7:32 am

Hmmm. Do you have any hints as to what this is saying that isn’t covered in different (saner!) notation by Girolami and Calderhead? Beyond the vague (interesting) promise in the second last paragraph? And the “new” metric, which hides a multitude of calibration parameters in ?

If my memory of trust region methods are correct, if , then this should approach a Langevin method as either goes to infinity (if the correspondence Langevin->Steepest Descent, HMC->Newton method holds) and so you will typically want to take “as small as possible”. This would mean choosing (and therefore ) adaptively, which makes that the base geometry is posterior dependent, which (at a guess) would break something.

February 6, 2012 at 7:50 am

Hmm, sounds like you are already ahead of me within a few minutes of perusing!!! Provided I grasped enough of its meaning, I thought the paper had a more general formulation for the kinetic energy. Obviously I would need to see a non-physical reformulation of the paper to understand whether or not this development has any applicability.