generalizing Hamiltonian Monte Carlo with neural networks

Daniel Levy, Matthew Hoffman, and Jascha Sohl-Dickstein pointed out to me a recent paper of theirs, submitted to and accepted by ICLR 2018, with the above title. This allowed me to discover the open source handling of paper reviews at ICLR, which I find quite convincing, except for not using MathJax or another medium for LaTeX formulas, and which provides a collection of comments besides mine. (Disclaimer: I was not involved in the processing of this paper for ICLR!)

“Ultimately our goal (and that of HMC) is to produce a proposal that mixes efficiently, not to simulate Hamiltonian dynamics accurately.”

The starting concept is the same as GANs (generative adversarial networks), discussed here a few weeks ago, complemented by a new HMC that also uses deep neural networks to represent the HMC trajectory. (Also seen in earlier papers by e.g. Strathmann.) The novelty in the HMC seems to be a binary direction indicator on top of the velocity. The leapfrog integrator is also modified, with a location-scale generalisation for the velocity and a half-half location-scale move for the original target x. The functions appearing in the location-scale aspects are learned by neural nets, towards minimising lag-one auto-correlation, plus an extra penalty for not moving enough. Reflecting on the recent MCMC literature and in particular on the presentations at BayesComp last month, and judging from comments of participants, this inclusion of neural tools in the tuning of MCMC algorithms sounds like a steady trend in the community. I am slightly at a loss about the adaptive aspects of the trend with regard to the Markovianity of the outcome.
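To make the location-scale modification of the leapfrog step more concrete, here is a minimal NumPy sketch of one forward step. It is my own reading of the description above and not the authors' code: the names `generalised_leapfrog`, `nets_v`, `nets_x` and `zero_nets` are hypothetical, the "networks" are plain Python callables returning a scale S, a gradient or velocity scale Q and a translation T, the target is assumed standard Gaussian, and the binary direction indicator is left out (forward direction only). The scale terms break volume preservation, hence the log-Jacobian that is accumulated along the way.

```python
import numpy as np

def grad_U(x):
    # Gradient of the potential U(x) = 0.5 * ||x||^2 (assumed Gaussian target).
    return x

def zero_nets(a, b):
    # Hypothetical stand-in for the learned networks: S = Q = T = 0.
    d = a.shape[0]
    return np.zeros(d), np.zeros(d), np.zeros(d)

def generalised_leapfrog(x, v, eps, mask, nets_v=zero_nets, nets_x=zero_nets):
    """One forward step; returns (x_new, v_new, log_abs_det_jacobian)."""
    log_jac = 0.0

    # First half-step on the velocity: location-scale move whose
    # coefficients depend on x and grad U(x) only, never on v itself,
    # so that the move can be inverted in closed form.
    g = grad_U(x)
    S, Q, T = nets_v(x, g)
    v = v * np.exp(0.5 * eps * S) - 0.5 * eps * (g * np.exp(eps * Q) + T)
    log_jac += np.sum(0.5 * eps * S)

    # Half-half move on x: update the coordinates where m == 0 using only
    # the coordinates where m == 1 (and v), then swap the two roles.
    for m in (mask, 1.0 - mask):
        S, Q, T = nets_x(m * x, v)
        x = m * x + (1.0 - m) * (x * np.exp(eps * S)
                                 + eps * (v * np.exp(eps * Q) + T))
        log_jac += np.sum((1.0 - m) * eps * S)

    # Second half-step on the velocity, at the new position.
    g = grad_U(x)
    S, Q, T = nets_v(x, g)
    v = v * np.exp(0.5 * eps * S) - 0.5 * eps * (g * np.exp(eps * Q) + T)
    log_jac += np.sum(0.5 * eps * S)

    return x, v, log_jac
```

With all three functions set to zero, as in the default `zero_nets`, the step reduces to the standard leapfrog move and the log-Jacobian is zero, which is a convenient sanity check before plugging in anything learned.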

“To compute the Metropolis-Hastings acceptance probability for a deterministic transition, the operator must be invertible and have a tractable Jacobian.”

A remark (above) that seems to date back at least to Peter Green’s reversible jump, as duly mentioned in the paper. When reading about the performance of this new learning HMC, I could not see where the learning steps for the parameters of the leapfrog operators were accounted for, although the authors mention an identical number of gradient computations (which I take to mean the same thing). One evaluation of this method against earlier ones (Fig.2) checks successive values of the likelihood, which may be intuitive enough but does not necessarily establish convergence to the right region, since the posterior may concentrate away from the maximum likelihood.
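As for the remark on invertibility and the Jacobian quoted above, the correction can be written in a couple of lines. The sketch below is a generic illustration rather than the paper's implementation: `mh_accept_prob` is a hypothetical helper, and it assumes the deterministic map comes with an available reverse move (for instance through a direction or velocity flip), so that the acceptance ratio only involves the two log-densities and the log absolute determinant of the Jacobian.

```python
import math

def mh_accept_prob(log_p_current, log_p_proposed, log_abs_det_jacobian):
    """Acceptance probability min(1, p(new) / p(old) * |det J|), on the log scale."""
    log_ratio = log_p_proposed - log_p_current + log_abs_det_jacobian
    # Cap at zero before exponentiating to avoid overflow for large ratios.
    return math.exp(min(0.0, log_ratio))
```

For a volume-preserving map such as the standard leapfrog integrator the last argument is zero and the usual HMC acceptance ratio is recovered; with learned scale terms it has to be accumulated along the trajectory.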
