## unbiased Hamiltonian Monte Carlo with couplings

**I**n the June issue of Biometrika, which had been sitting for a few weeks on my desk under my teapot!, Jeremy Heng and Pierre Jacob published a paper on unbiased estimators for Hamiltonian Monte Carlo using couplings. (Disclaimer: I was not involved with the review or editing of this paper.) Which extends to HMC environments the earlier paper of Pierre Jacob, John O’Leary and Yves Atchadé, to be discussed soon at the Royal Statistical Society. The fundamentals are the same, namely that an unbiased estimator can be produced from a converging sequence of estimators and that it can be *de facto* computed if two Markov chains with the same marginal can be coupled. The issue with Hamiltonians is to figure out how to couple their dynamics. In the Gaussian case, it is relatively easy to see that two chains with the same initial momentum meet periodically. In general, there is contraction within a compact set (Lemma 1). The coupling extends to a time discretisation of the Hamiltonian flow by a leap-frog integrator, still using the same momentum. Which roughly amounts in using the same random numbers in both chains. When defining a relaxed meeting (!) where both chains are within δ of one another, the authors rely on a drift condition (8) that reminds me of the early days of MCMC convergence and seem to imply the existence of a small set “where the target distribution [density] is strongly log-concave”. And which makes me wonder if this small set could be used instead to create renewal events that would in turn ensure both stationarity and unbiasedness without the recourse to a second coupled chain. When compared on a Gaussian example with couplings on Metropolis-Hastings and MALA (Fig. 1), the coupled HMC sees hardly any impact of the dimension of the target (in the average coupling time), with a much lower value. However, I wonder at the relevance of the meeting time as an assessment of efficiency. In the sense that the coupling time is not a convergence time but reflects as well on the initial conditions. I acknowledge that this allows for an averaging over parallel implementations but I remain puzzled by the statement that this leads to “estimators that are consistent in the limit of the number of replicates, rather than in the usual limit of the number of Markov chain iterations”, since a particularly poor initial distribution could on principle lead to a mode of the target being never explored or on the coupling time being ever so rarely too large for the computing abilities at hand.

November 4, 2019 at 5:16 pm

Thanks for the comments, Christian!

“And which makes me wonder if this small set could be used instead to create renewal events that would in turn ensure both stationarity and unbiasedness without the recourse to a second coupled chain.”

For sure, in many cases you might be able to implement something like Brockwell & Kadane 2005, but that would require specifying something… e.g. a distribution supported on a certain set, a typical value of the target pdf over that set, or something of the sort. In contrast, the coupled chain approach does not require such a specification in practice.

“I wonder at the relevance of the meeting time as an assessment of efficiency. In the sense that the coupling time is not a convergence time but reflects as well on the initial conditions.”

On the relevance of monitoring meeting times: first, for unbiased MCMC estimators meeting times are very explicitly related to computing costs. Secondly, they are also related to the convergence of the marginal chains (from the initial distribution to the target), which is the topic of another paper (https://arxiv.org/abs/1905.09971).

Besides, how could an assessment of convergence *not* reflect on initial conditions? Maybe you’re using “convergence” for something else than the phenomenon by which chains evolve from initial conditions toward stationarity?

“I remain puzzled by the statement that this leads to “estimators that are consistent in the limit of the number of replicates, rather than in the usual limit of the number of Markov chain iterations”, since a particularly poor initial distribution could on principle lead to a mode of the target being never explored or on the coupling time being ever so rarely too large for the computing abilities at hand.”

I am not sure what’s puzzling about the statement, as long as “consistent” is understood in the usual sense of some law of large numbers holding. Consistency never meant “works great in all cases”. The unbiased MCMC paper (https://arxiv.org/abs/1708.03625v5) might contain clearer discussions of what can go wrong.

I genuinely believe users would typically gather a lot more knowledge about the target by running lots of coupled chains (or even lots of uncoupled chains), and monitoring where they go, where they don’t go and whether they meet, rather than (or on top of) very few very long chains, as is currently done by default. Even (especially) in cases where the chains struggle to move efficiently across the support of the target. At least that has been my personal experience so far.