## Andrew gone NUTS!

**M**atthew Hoffman and Andrew Gelman have posted a paper on arXiv entitled “The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo”, developing an improvement on the Hamiltonian Monte Carlo algorithm called NUTS (!). Here is the abstract:

Hamiltonian Monte Carlo (HMC) is a Markov chain Monte Carlo (MCMC) algorithm that avoids the random walk behavior and sensitivity to correlated parameters that plague many MCMC methods by taking a series of steps informed by first-order gradient information. These features allow it to converge to high-dimensional target distributions much more quickly than simpler methods such as random walk Metropolis or Gibbs sampling. However, HMC’s performance is highly sensitive to two user-specified parameters: a step size ε and a desired number of steps L. In particular, if L is too small then the algorithm exhibits undesirable random walk behavior, while if L is too large the algorithm wastes computation. We introduce the No-U-Turn Sampler (NUTS), an extension to HMC that eliminates the need to set a number of steps L. NUTS uses a recursive algorithm to build a set of likely candidate points that spans a wide swath of the target distribution, stopping automatically when it starts to double back and retrace its steps. Empirically, NUTS performs at least as efficiently as and sometimes more efficiently than a well-tuned standard HMC method, without requiring user intervention or costly tuning runs. We also derive a method for adapting the step size parameter ε on the fly based on primal-dual averaging. NUTS can thus be used with no hand-tuning at all. NUTS is also suitable for applications such as BUGS-style automatic inference engines that require efficient “turnkey” sampling algorithms.

**N**ow, my suspicious and pessimistic nature always makes me wary of universality claims! I completely acknowledge the difficulty in picking the number of leapfrog steps *L* in the Hamiltonian algorithm, since the underlying theory does not say anything useful about *L*. And the paper is quite convincing in its description of the NUTS algorithm, being moreover beautifully written. As indicated in the paper, the “doubling” solution adopted by NUTS reminds me of Radford Neal’s procedure in his *Annals of Statistics* paper on slice sampling. (The NUTS algorithm also relies on a slice sampling step.) However, besides its ability to work as an automatic Hamiltonian methodology, I wonder about the computing time (and the “unacceptably large amount of memory”, p.12) required by the doubling procedure: 2^{j} grows fast with j! (If my intuition is right, the computing time should increase rather quickly with the dimension. And I do not get the argument in the paper that the costly part is the gradient computation: it seems to me the gradient must be computed at all of the 2^{j} points.) The authors also mention delayed rejection à la Tierney and Mira (1999), and the scheme reminded me a wee bit of the pinball sampler we devised a while ago with Kerrie Mengersen.

Choosing the discretisation step ε is more “traditional”, using the stochastic approximation approach we set out in our unpublished-yet-often-quoted tech report with Christophe Andrieu. (I do not think I got the crux of the “dual averaging” for optimal calibration on p.11.) The illustration through four benchmarks [incl. a stochastic volatility model!] is quite convincing as well, with (unsurprisingly) great graphical tools. A final grumble: the code is “only” available in the proprietary language Matlab! Now, I bet some kind of Rao-Blackwellisation is feasible with all the intermediate simulations!
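To fix ideas about where the computing time goes, here is a minimal generic sketch of the leapfrog integrator at the core of HMC (in Python, not the authors’ Matlab code, and with a toy Gaussian target of my own choosing): each leapfrog step requires one evaluation of the gradient of the log target, so a trajectory of 2^{j} steps costs on the order of 2^{j} gradient calls.

```python
def leapfrog(theta, r, grad_log_p, eps, L):
    """Simulate L leapfrog steps of size eps for a 1-D Hamiltonian.
    Each step evaluates grad_log_p once (L + 1 calls in total),
    which is where the bulk of the computing time goes."""
    r = r + 0.5 * eps * grad_log_p(theta)   # initial half-step on momentum
    for _ in range(L - 1):
        theta = theta + eps * r             # full step on position
        r = r + eps * grad_log_p(theta)     # full step on momentum
    theta = theta + eps * r                 # last position step
    r = r + 0.5 * eps * grad_log_p(theta)   # final half-step on momentum
    return theta, r

# toy target: standard Gaussian, log p(theta) = -theta^2 / 2
theta, r = leapfrog(1.0, 0.5, lambda th: -th, eps=0.1, L=20)
```

Because leapfrog is symplectic, the Hamiltonian ½θ² + ½r² is nearly conserved along the trajectory (up to O(ε²) error), which is what keeps the acceptance probability high.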


November 25, 2011 at 10:38 pm

I’m learning Octave because of the Stanford online Machine Learning course. After some initial struggle, it worked very well. It has vectorization like R…

November 25, 2011 at 6:10 pm

The 2**j term is the total number of leapfrog steps in simulating the Hamiltonian. The major advantage of NUTS over basic HMC is that it lets you stop before 2**j gets too big!
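The “stop before 2**j gets too big” test is easy to state: stop doubling once further integration at either end of the trajectory would start shrinking the distance between the two endpoints. A small sketch of that U-turn criterion (variable names are mine, not from the paper’s code):

```python
def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def u_turn(theta_minus, theta_plus, r_minus, r_plus):
    """NUTS stopping criterion: True once the trajectory folds back,
    i.e. the momentum at either endpoint points back toward the
    other endpoint along the chord joining them."""
    d = [p - m for p, m in zip(theta_plus, theta_minus)]
    return bool(dot(d, r_minus) < 0 or dot(d, r_plus) < 0)

# endpoints still moving apart: keep doubling
print(u_turn([0, 0], [1, 1], [1, 1], [1, 1]))    # False
# right endpoint's momentum points back toward the left one: stop
print(u_turn([0, 0], [1, 1], [1, 1], [-1, -1]))  # True
```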

The costly part in both NUTS and basic HMC is computing the (unnormalized) log density (typically a Bayesian posterior) and its gradient for each of the 2**j leapfrog steps. Using the adjoint method for dynamic programming symbolic gradient calculations, we can bound the time needed for the gradient by the number of expressions evaluated in the log density function. For the models we’ve looked at, the gradient takes a bit longer than the log density itself to evaluate.
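To illustrate the adjoint idea (this is a toy, emphatically not Stan’s implementation): record the expression graph during the forward evaluation, then sweep derivatives backward through the recorded sub-expressions, so the gradient costs a small multiple of the forward pass. The naive recursion below propagates along each path, which is fine for this tiny example; production systems traverse each node once in topological order.

```python
class Var:
    """Minimal reverse-mode AD node over + and *."""
    def __init__(self, val, children=()):
        self.val = val
        self.children = children  # (child, local partial derivative)
        self.grad = 0.0
    def __add__(self, other):
        return Var(self.val + other.val, [(self, 1.0), (other, 1.0)])
    def __mul__(self, other):
        return Var(self.val * other.val,
                   [(self, other.val), (other, self.val)])
    def backward(self, seed=1.0):
        """Accumulate the adjoint and push it to the children."""
        self.grad += seed
        for child, local in self.children:
            child.backward(seed * local)

# d/dx of (x*x + x) at x = 3 is 2x + 1 = 7
x = Var(3.0)
y = x * x + x
y.backward()
print(x.grad)  # 7.0
```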

The use of dual averaging isn’t critical. It just worked a bit better than Robbins-Monro.
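For readers who want to see the recursion, here is a rough sketch of the dual-averaging adaptation of log ε in the spirit of the paper, driving the average acceptance statistic toward a target δ (the shrinkage constants γ, t0, κ follow the values reported in the paper; the acceptance statistics fed in below are made up for illustration):

```python
import math

def dual_averaging_eps(accept_stats, delta=0.65, mu=0.0,
                       gamma=0.05, t0=10, kappa=0.75):
    """Dual averaging of log(step size): if acceptance runs above
    delta the step size is pushed up, if below it is pushed down.
    Returns the averaged (final) step size."""
    h_bar = 0.0          # running average of (delta - acceptance)
    log_eps_bar = 0.0    # averaged iterates of log(eps)
    for t, alpha in enumerate(accept_stats, start=1):
        eta = 1.0 / (t + t0)
        h_bar = (1 - eta) * h_bar + eta * (delta - alpha)
        log_eps = mu - math.sqrt(t) / gamma * h_bar   # shrunk toward mu
        w = t ** (-kappa)
        log_eps_bar = w * log_eps + (1 - w) * log_eps_bar
    return math.exp(log_eps_bar)

# acceptance consistently above target -> adapted step size grows past 1
print(dual_averaging_eps([0.95] * 100) > 1.0)  # True
```

Compared with Robbins-Monro, the averaging of the iterates and the aggressive early shrinkage toward μ are what make the adaptation both fast and stable.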

Andrew had the same intuition about using the intermediate leapfrog steps somehow in basic HMC.

NUTS is also available in (relatively, not universally) portable C++ as part of our soon-to-be-released package Stan (named after Stanislaw Ulam). Stan also includes a directed graphical model compiler that converts BUGS-like model specifications into C++ for computing unnormalized densities and their gradients (the latter using algorithmic differentiation, which implements the adjoint method based on a templated density function). We’re testing with the BUGS example models and some others now and plan to release a stable version as soon as we finish this testing, do some more performance tuning, write an R2jags-like R interface, and write user manuals.

November 25, 2011 at 5:52 pm

Octave is to Matlab as R is to S-PLUS.

November 25, 2011 at 6:12 pm

I never tried Octave. Having to use Scilab for my teaching at Polytechnique was enough of a nightmare…