As Dan notes, this is meant as an introduction for those without a strong mathematical background, hence the focus on concepts rather than theorems! There’s plenty of math deeper in the references. ;-)

> I am not sure I get this sentence. Either it means that an expectation remains invariant under reparameterisation. Or something else and more profound that eludes me. In particular because Michael repeats later (p.25) that the canonical density does not depend on the parameterisation.

What I was trying to get at is that expectations, and really all of measure theory, are reparameterization invariant, but implementations of statistical algorithms that depend on parameterization-dependent representations, namely densities, are not. If your algorithm is sensitive to these parameterization dependencies then you end up with a tuning problem (which parameterization is best?), which makes it harder to use the algorithm in practice.
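A two-line simulation makes the distinction concrete (a minimal numpy sketch; the exponential target is chosen purely for illustration): the expectation survives the change of variables untouched, while the density of the transformed variable acquires a Jacobian factor and looks nothing like the original one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target: x ~ Exponential(1), so E[x] = 1.
x = rng.exponential(1.0, size=100_000)

# Reparameterize: y = log(x). The *expectation* is invariant -- we can
# compute E[x] from the x draws or from exp(y) -- but the *density* of y
# picks up a Jacobian factor, p(y) = exp(y - exp(y)), which is a very
# different function from the exponential density p(x) = exp(-x).
y = np.log(x)

print(np.mean(x))          # ~ 1.0
print(np.mean(np.exp(y)))  # same samples, same expectation, ~ 1.0
```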

Exact implementations of HMC (i.e. without an integrator) are fully geometric and do not depend on any chosen parameterization, hence the canonical density and, more importantly, the Hamiltonian are invariant objects. That said, there are some choices to be made in that construction, and those choices often look like parameterization dependencies. See below!

> “Every choice of kinetic energy and integration time yields a new Hamiltonian transition that will interact differently with a given target distribution (…) when poorly-chosen, however, the performance can suffer dramatically.”

This is exactly where it’s easy to get confused with what’s invariant and what’s not!

The target density gives rise to a potential energy, and the chosen density over momenta gives rise to a kinetic energy. The two energies transform in opposite ways under a reparameterization, so their sum, the Hamiltonian, is invariant.
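A small numerical check of this cancellation (a numpy sketch; the Gaussian target, the mass matrix, and the linear reparameterization `q' = A q` are all illustrative choices, not from any real model): the potential picks up a `log|det A|` from the Jacobian, the momenta transform covariantly as `p' = A^{-T} p`, and the Hamiltonian shifts only by a constant that is independent of the state, so the dynamics are unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 3

# Target: q ~ N(0, S); potential U(q) = -log pi(q) up to a constant.
S = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.0, 0.3],
              [0.0, 0.3, 1.5]])
S_inv = np.linalg.inv(S)
M_inv = np.eye(d)  # identity mass matrix kinetic energy

def U(q): return 0.5 * q @ S_inv @ q
def K(p): return 0.5 * p @ M_inv @ p

# An arbitrary invertible linear reparameterization q' = A q.
A = rng.normal(size=(d, d)) + 3.0 * np.eye(d)
A_inv = np.linalg.inv(A)
logdetA = np.linalg.slogdet(A)[1]

# The two energies transform in opposite ways:
#   potential: U'(q') = U(A^{-1} q') + log|det A|   (Jacobian of the density)
#   kinetic:   p' = A^{-T} p with M'^{-1} = A M^{-1} A^T
M_prime_inv = A @ M_inv @ A.T

def U_prime(qp): return U(A_inv @ qp) + logdetA
def K_prime(pp): return 0.5 * pp @ M_prime_inv @ pp

q, p = rng.normal(size=d), rng.normal(size=d)
qp, pp = A @ q, A_inv.T @ p

H = U(q) + K(p)
H_prime = U_prime(qp) + K_prime(pp)

# The Hamiltonians agree up to the constant log|det A|, whatever (q, p) is:
print(H_prime - H)
```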

Really there’s a fully invariant, measure-theoretic construction where you use the target measure directly and add a “cotangent disintegration”.

In practice, however, we often choose a default kinetic energy, i.e. a log density over the momenta, based on the parameterization of the target parameter space, for example an “identity mass matrix” kinetic energy. In other words, the algorithm itself is invariant, but by selecting the algorithmic degrees of freedom based on the parameterization of the target parameter space we induce an implicit parameterization dependence.

This all gets more complicated when we introduce the adaptation we use in Stan, which sets the elements of the mass matrix to marginal variances, which means that the adapted algorithm is invariant to marginal transformations but not joint ones…
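A quick sketch of that asymmetry (numpy; the correlated Gaussian and the specific scaling factors are illustrative): estimating the mass matrix from marginal variances exactly undoes any coordinate-wise rescaling of the target, but it cannot touch a joint feature like correlation, so the preconditioned target remains far from the unit Gaussian that an identity-mass-matrix kinetic energy matches best.

```python
import numpy as np

rng = np.random.default_rng(2)

# A strongly correlated 2-d Gaussian stands in for the target.
rho = 0.95
S = np.array([[1.0, rho], [rho, 1.0]])
draws = rng.multivariate_normal(np.zeros(2), S, size=50_000)

# Stan-style diagonal adaptation: mass matrix elements from marginal variances.
precond = draws / draws.std(axis=0)

# A *marginal* (coordinate-wise) rescaling of the target is undone exactly
# by this adaptation, so the adapted algorithm is invariant to it:
scaled = draws * np.array([10.0, 0.1])
print(np.corrcoef((scaled / scaled.std(axis=0)).T)[0, 1])  # same correlation

# ...but a *joint* feature, the correlation itself, is untouched: the
# preconditioned covariance is still badly conditioned, not the identity.
print(np.linalg.cond(np.cov(precond.T)))  # ~ (1 + rho) / (1 - rho) = 39
```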

> The explanation of the HMC move as a combination of uniform moves along isoclines of fixed energy level and of jumps between energy levels does not seem to translate into practical implementations, at least not as explained in the paper. Simulating directly the energy distribution for a complex target distribution does not seem more feasible than moving up likelihood levels in nested sampling.

Indeed, being able to simulate exactly from the energy distribution, which is equivalent to being able to quantify the density of states in statistical mechanics, is intractable for the same reason that marginal likelihoods are intractable. Which is a shame, because conditioned on those samples HMC could be made embarrassingly parallel!

Instead we draw correlated samples using momentum resamplings between trajectories. As Dan noted, this provides some intuition about Stan (it reduces random-walk behavior to one dimension) but also motivates some powerful energy-based diagnostics that immediately indicate when the momentum resampling is limiting performance and we need to improve it by, say, changing the kinetic energy. Or, per my previous comment, by keeping the kinetic energy the same but changing the parameterization of the target parameter space. :-)
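The main such diagnostic is the energy Bayesian fraction of missing information (E-BFMI): the variance of the energy *changes* between iterations relative to the variance of the energy itself. A minimal numpy sketch (the two synthetic "energy traces" are purely illustrative stand-ins for a healthy chain and a chain whose momentum resampling moves across energy levels too slowly):

```python
import numpy as np

def e_bfmi(energies):
    """Estimate the energy Bayesian fraction of missing information.

    Ratio of the variance of energy changes between iterations to the
    variance of the energy itself; values well below ~0.3 indicate that
    momentum resampling explores the energy distribution too slowly.
    """
    energies = np.asarray(energies)
    num = np.sum(np.diff(energies) ** 2)
    den = np.sum((energies - energies.mean()) ** 2)
    return num / den

rng = np.random.default_rng(3)
good = rng.normal(size=1000)                    # i.i.d. energies
slow = 0.1 * np.cumsum(rng.normal(size=1000))   # random walk: tiny changes

print(e_bfmi(good))  # near 2 for independent draws
print(e_bfmi(slow))  # << 1: resampling is the bottleneck
```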

> In the end I cannot but agree with the concluding statement that the geometry of the target distribution holds the key to devising more efficient Monte Carlo methods.

Yes! That’s all I really want statisticians to take away from the paper. :-)

Dan,

I don’t think doing a random walk in one dimension rather than p dimensions necessarily helps. Any plain Metropolis method will tend to suffer from bad scaling with dimensionality precisely because, even if it’s otherwise amazing, it does a random walk in energy levels, the range of which increases with dimension, while the change per update stays O(1). The crucial point, I think, is that HMC doesn’t do a random walk in energy levels with changes of O(1). When the momentum is replaced, the size of the change goes up with dimension, since the number of momentum variables goes up.
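This scaling is easy to see numerically (a numpy sketch with an identity mass matrix, so the kinetic energy is chi-squared with d degrees of freedom over 2): the typical energy change from a full momentum refresh grows like the square root of the dimension, rather than staying O(1).

```python
import numpy as np

rng = np.random.default_rng(4)

# Kinetic energy with an identity mass matrix: K(p) = |p|^2 / 2, so
# K ~ chi^2_d / 2, with mean d/2 and standard deviation sqrt(d/2).
for d in (1, 100, 2500):
    p_old = rng.normal(size=(2000, d))
    p_new = rng.normal(size=(2000, d))
    delta = 0.5 * (p_new ** 2).sum(axis=1) - 0.5 * (p_old ** 2).sum(axis=1)
    # Energy change from a full momentum refresh has sd sqrt(d):
    print(d, delta.std(), np.sqrt(d))
```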

That’s what Michael means.

To address Dan’s concern, the non-centered parameterization is a different distribution. Stan technically samples the parameters declared in the parameters block. The centered parameters are then defined as derived quantities.
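A numpy sketch of what "different distribution over the sampled parameters" means here (the lognormal scale prior and all names are illustrative, not any particular model): in the non-centered form the sampled quantity is a standardized `theta_raw`, and the centered `theta` is recovered deterministically; the marginal distribution of `theta` is identical either way, but the geometry of the sampled space is not.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 10_000

# Funnel-style hierarchy: tau ~ lognormal, theta | tau ~ N(0, tau^2).
log_tau = rng.normal(size=n)
tau = np.exp(log_tau)

# Centered: sample theta directly; its scale is coupled to tau.
theta_centered = tau * rng.normal(size=n)

# Non-centered: the *sampled* parameter is theta_raw ~ N(0, 1),
# independent of tau; theta is a derived (transformed) quantity.
theta_raw = rng.normal(size=n)
theta = tau * theta_raw

# Same distribution for theta either way, but the sampled spaces differ:
# (log_tau, theta_raw) is an uncorrelated unit-scale Gaussian, while
# (log_tau, theta_centered) has the notorious funnel geometry.
print(np.corrcoef(log_tau, np.abs(theta_centered))[0, 1])  # strong coupling
print(np.corrcoef(log_tau, np.abs(theta_raw))[0, 1])       # ~ 0
```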

I wrote Stan case studies on both of these topics (basically following Andrew’s strategy of writing something to get feedback from the experts):

For typical sets and curse of dimensionality:

R: http://mc-stan.org/users/documentation/case-studies/curse-dims.html

Python: http://mc-stan.org/users/documentation/case-studies/curse-dims-python.html

For how point estimates work differently than Bayesian expectations under reparameterization:

R: http://mc-stan.org/users/documentation/case-studies/mle-params.html

Michael and I are Americans; “awesome” is our baseline adjective, so we need to get creative with superlatives. (We did get dinged on an NSF grant proposal once for overselling; I find it usually works better than underselling ideas.)

Further addressing a point of Dan’s, NUTS tries to trace out the longest reversible trajectory it can without performing redundant work, but then it draws a single point from that trajectory, with a bias toward points near the end. This is true of both Matt Hoffman’s original slice-sampled version and Michael’s multinomial-sampled update. Andrew was urging Matt Hoffman to figure out how to get the maximum expected squared “jumping” distance.
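The end-bias comes from biased progressive sampling over the doublings. A toy sketch of just that idea (this ignores the random backwards/forwards doubling direction, the U-turn criterion, and the detailed-balance bookkeeping of the real algorithm; in practice the weights would be proportional to exp(-H) at each state):

```python
import numpy as np

rng = np.random.default_rng(6)

def biased_progressive_select(weights, rng):
    # Trajectory of length 2^K built by doublings: [x0], [x1], [x2, x3], ...
    # After each doubling, replace the current proposal with a point drawn
    # from the new half with probability min(1, W_new / W_old), biasing
    # the final selection toward the end of the trajectory.
    idx = 0
    w_old = weights[0]
    size = 1
    while size < len(weights):
        new = weights[size: 2 * size]
        w_new = new.sum()
        if rng.random() < min(1.0, w_new / w_old):
            idx = size + rng.choice(len(new), p=new / w_new)
        w_old += w_new
        size *= 2
    return idx

# With uniform weights over 16 states, each doubling's acceptance
# probability is 1, so the pick lands uniformly in the final half
# (indices 8-15) rather than uniformly over all 16 states.
picks = [biased_progressive_select(np.ones(16), rng) for _ in range(5000)]
print(np.mean(picks))  # near 11.5, well above the uniform mean of 7.5
```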

The paper is a much less mathematical introduction to his maths papers, which do provide theoretical arguments, but also require some fairly hairy measure theory and differential geometry. For example, information about the effect symplectic integrators have is in his paper with Mark and Simon on optimising the step length.

> I find this notion quite compelling and am at the same time [almost] flabbergasted that I have never heard of it before.

It’s in a lot of papers, but it’s hiding in the maths!

> I am not sure I get this sentence.

I don’t either! We know that how well HMC converges changes under re-parameterisation: basically, if you parameterise the model so that the statistical geometry is more like the geometry assumed by the mass matrix in HMC, then it works better. (This is the argument for why non-centred parameterisations sometimes work.)

> Simulating directly the energy distribution for a complex target distribution does not seem more feasible than moving up likelihood levels in nested sampling.

Definitely. You choose the energy level implicitly when you sample new momenta. But one of the points that he makes is that, after the fact, you can compare the observed energy distribution with the implicit proposal kernel and compute a BFMI (Bayesian fraction of missing information).

It also explains, to me at least, why HMC works so well. Instead of being a random walk in p dimensions (like RWMH or MALA), it’s a random walk in one dimension and, in the NUTS case, a deterministic proposal that attempts to move as far away along the energy isocline as possible.
