accelerating HMC by learning the leapfrog scale

In this new arXiv submission, part of Changye Wu’s thesis [defended last week], we try to reduce the high sensitivity of the HMC algorithm to its hand-tuned parameters, namely the step size ε of the discretisation scheme, the number of steps L of the integrator, and the covariance matrix of the auxiliary variables, by calibrating the number of steps of the leapfrog integrator towards avoiding both slow-mixing chains and wasteful computation costs. We do so by learning from the No-U-Turn Sampler (NUTS) of Hoffman and Gelman (2014), which already automatically tunes both the step size and the number of leapfrog steps.

The core idea behind NUTS is to pick the step size via primal-dual averaging in a burn-in (warmup, Andrew would say) phase and to build at each iteration a proposal based on following a locally longest path on a level set of the Hamiltonian. This is achieved by a recursive algorithm that, at each call to the leapfrog integrator, requires evaluating both the gradient of the target distribution and the Hamiltonian itself. Roughly speaking, an iteration of NUTS costs twice as much as regular HMC with the same number of calls to the integrator. Our approach is to learn from NUTS the scale of the leapfrog length and to use the resulting empirical distribution of the longest leapfrog path to randomly pick the value of L at each iteration of an HMC scheme. This obviously preserves the validity of the HMC algorithm.
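For the curious, here is a minimal R sketch, with purely illustrative names rather than the paper’s actual code, of what one such eHMC iteration could look like once an empirical collection L_emp of leapfrog lengths has been gathered during a NUTS-type warmup; U and grad_U stand for the potential (negative log target) and its gradient, and an identity mass matrix is assumed:

```
# one eHMC-style iteration: draw the number of leapfrog steps L from the
# empirical distribution L_emp learned during warmup, then perform a
# standard HMC update with that L (identity mass matrix assumed)
ehmc_step <- function(x, eps, L_emp, U, grad_U) {
  L <- sample(L_emp, 1)                 # random leapfrog length
  p0 <- rnorm(length(x))                # momentum refreshment
  x_new <- x
  p <- p0 - 0.5 * eps * grad_U(x)       # initial half step for the momentum
  for (l in seq_len(L)) {
    x_new <- x_new + eps * p            # full position step
    g <- grad_U(x_new)
    p <- p - eps * g * (if (l < L) 1 else 0.5)  # half momentum step at the end
  }
  log_accept <- U(x) - U(x_new) + 0.5 * (sum(p0^2) - sum(p^2))
  if (log(runif(1)) < log_accept) x_new else x
}
```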

While a theoretical comparison of the convergence performances of NUTS and this eHMC proposal seems beyond our reach, we ran a series of experiments to evaluate these performances, using as criteria an ESS value calibrated by the evaluation cost of the logarithm of the target density function and of its gradient, as this is usually the most costly part of the algorithms, as well as a similarly calibrated expected squared jumping distance. Above is one such illustration for a stochastic volatility model, the first axis representing the targeted acceptance probability in the Metropolis step. Some of the gains in either ESS or ESJD are by a factor of ten, which relates to our argument that NUTS somewhat wastes computation effort by proposing uniformly over the candidate set, instead of close to its end-points, which automatically reduces the distance between the current position and the proposal.
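For clarity, here is a minimal R sketch, not taken from the paper’s code, of how such cost-calibrated criteria could be computed from a chain stored as a matrix draws (one row per iteration) together with a count n_evals of evaluations of the log target and its gradient; the ESS part relies on the coda package:

```
library(coda)

# minimum effective sample size per evaluation of the log target (and gradient)
cost_normalised_ess <- function(draws, n_evals) {
  min(effectiveSize(as.mcmc(draws))) / n_evals
}

# expected squared jumping distance per evaluation
cost_normalised_esjd <- function(draws, n_evals) {
  jumps <- diff(draws)               # differences between consecutive states
  mean(rowSums(jumps^2)) / n_evals
}
```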

7 Responses to “accelerating HMC by learning the leapfrog scale”

  1. mattmgraham Says:

    Thank you Christian and your co-authors for the interesting paper!

    I have a few comments about the experiments and evaluation. Apologies in advance if I’ve made any errors in what I say below.

    It is argued in the paper that each NUTS integration step is twice as costly as in standard HMC, as it requires evaluating both the potential energy (negative log target density) and its gradient at each step, rather than just the potential energy gradient. I think this is not generally true: in most applications of HMC the potential energy gradient will be calculated using reverse-mode automatic differentiation, which generally requires the original function (here the potential energy) to be calculated in a forward pass before the gradients are calculated in the backward pass. Each gradient evaluation therefore also gives the original function value for ‘free’ – for example this seems to be the case in Stan from the docstring of the Stan Math library `gradient` function (https://github.com/stan-dev/math/blob/develop/stan/math/rev/mat/functor/gradient.hpp).

    Even if the gradients are manually coded, there will typically be a lot of shared computation between the original function and gradient calculations, so even with a more optimised implementation the cost of evaluating the value and gradient of a (scalar) function together is unlikely to be twice that of evaluating just the gradient. For example, in the logistic regression example in the paper the potential energy function and gradient are defined as

    ```
    U <- function(x)
    {
      val <- as.vector(R %*% x)
      return(sum(log(1 + exp(val)) - S * val))
    }

    grad_U <- function(x)
    {
      val <- as.vector(R %*% x)
      val <- 1 / (1 + exp(-val)) - S
      val <- as.vector(t(R) %*% val)
      return(val)
    }
    ```

    however we could compute both together at a small additional overhead over the cost of the gradient computation alone, e.g. something like

    ```
    val_and_grad_U <- function(x)
    {
      R_x <- as.vector(R %*% x)
      exp_R_x <- exp(R_x)
      # value: sum of log(1 + exp(R x)) - S * (R x)
      val <- sum(log1p(exp_R_x) - S * R_x)
      # gradient: t(R) %*% (sigmoid(R x) - S), reusing exp(R x)
      grad <- as.vector(t(R) %*% (exp_R_x / (1 + exp_R_x) - S))
      return(c(grad, val))
    }
    ```

    Further, in the evaluations of run-time it is assumed that the cost of each NUTS transition is `[number of leapfrog steps] * 2 + 1` and that of a standard HMC transition is `[number of leapfrog steps] + 2`. I'm assuming the additive constants account for the potential energy evaluations at the final / initial states for the Metropolis accept step. In reality, however, we have always already calculated the gradient at the initial and final states in the process of running the integrator (or will calculate it in the next iteration). Since we can evaluate both gradient and value together for the same cost as just the gradient, if we always cache the potential energy values when we calculate the gradient (which I think Stan does), then the number of additional potential energy and gradient evaluations per transition, for both NUTS and standard HMC, is just the number of leapfrog steps.
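    To make the accounting explicit, here is a small R sketch (illustrative only, not from the paper's code) of the three cost conventions in question, in units of one combined value-and-gradient evaluation:

    ```
    cost_nuts_paper <- function(n_leapfrog) 2 * n_leapfrog + 1  # as assumed in the paper
    cost_hmc_paper  <- function(n_leapfrog) n_leapfrog + 2      # as assumed in the paper
    cost_cached     <- function(n_leapfrog) n_leapfrog          # value cached with each gradient

    # e.g. for a trajectory of 10 leapfrog steps
    c(cost_nuts_paper(10), cost_hmc_paper(10), cost_cached(10))
    ```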

    On a separate note, the implementations of the eHMC* methods in the accompanying code (https://github.com/wcythh/eHMC) have a potentially incorrect treatment of HMC transitions that produce `NaN` values for the Metropolis acceptance ratio. Rather than rejecting in these cases, the samplers appear to 'retry': the iteration counter `i` is not incremented and the sampled state (i.e. equal to the previous value) is not added to the chain. I'm not sure this defines a valid MCMC algorithm, as we are 'downweighting' the states at which such rejections should occur (possibly wrong here though?). More importantly from the evaluation perspective, these failed transitions are also not included in the computational cost. Given that, as implemented, the integrator still runs the full number of leapfrog steps even when a divergence in an earlier step generates `NaN` values, the computed costs do not reflect the actual number of gradient evaluations required. In Stan and other NUTS implementations such as PyMC3, such integrator divergences lead to early termination of the trajectory building, saving the cost of continuing to integrate once we know we will reject. This could be implemented similarly for the eHMC methods, but even so the cost of such transitions would still be non-zero (such shortened trajectories are reflected in the Stan `n_leapfrog__` statistics used to calculate the NUTS computational cost in the code).
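    As an illustration of the early-termination point, here is a minimal R sketch (not taken from the repository) of a leapfrog loop that stops as soon as a non-finite value appears, while still counting the gradient evaluations already spent; `grad_U` is assumed to return the potential energy gradient:

    ```
    leapfrog_with_divergence_check <- function(x, p, eps, L, grad_U) {
      n_grad <- 1
      p <- p - 0.5 * eps * grad_U(x)          # initial half step for the momentum
      for (l in seq_len(L)) {
        x <- x + eps * p                      # full position step
        g <- grad_U(x)
        n_grad <- n_grad + 1
        if (any(!is.finite(x)) || any(!is.finite(g))) {
          # divergence: stop integrating, the proposal will be rejected,
          # but the evaluations already performed must still be counted
          return(list(x = x, p = p, diverged = TRUE, n_grad = n_grad))
        }
        p <- p - eps * g * (if (l < L) 1 else 0.5)  # half momentum step at the end
      }
      list(x = x, p = p, diverged = FALSE, n_grad = n_grad)
    }
    ```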

    I made a fork of the Github repository linked to in the paper with my attempts to fix the above two issues (or at least what I see as issues!) for the German-Credit logistic regression experiment at https://github.com/matt-graham/eHMC/tree/updated-lr-exp

    Running the adjusted code there for a single target acceptance rate of 0.75 and just 4 independent repeats (running the original script with 40 repeats in parallel wasn't feasible on my laptop as it required too much memory!), I get the following cost-normalised minimum ESS estimates (mean over 4 runs +/- standard error) for the four methods:

    NUTS : 0.088 +/- 0.012
    eHMC : 0.090 +/- 0.019
    eHMCq: 0.085 +/- 0.012
    eHMCu: 0.097 +/- 0.011

    The German-credit `.Rdata` files used in the original script were not available in the repository, so I created new versions (with the feature normalisation specified in the paper) using the Python script `download_and_preprocess_data.py` I added to the `GermanCredit` directory in the same Github fork. The data file for the fixed chain start state used in the script was also not available, so I used the standard Stan random initialisation.

    Although this is just for a single target accept rate and the standard errors are quite high due to the small number of repeats, this suggests a much smaller difference in performance between NUTS and the eHMC* methods than shown in the paper for this Bayesian logistic regression case at least.

    In the conclusion it is claimed that NUTS samples uniformly from the trajectory of states generated, which is suggested to be part of the reason for its relatively poorer performance. Although this is the case for the 'naive' implementation of NUTS given in Algorithm 2 of Hoffman and Gelman (2014), which is used to motivate the initial discussion of the algorithm, in practice the more efficient implementation detailed in Algorithm 3 is used. Rather than sampling *independently* from the uniform distribution on the set of candidate states from the trajectory which are within the slice, this variant uses a Markov transition kernel which leaves this uniform distribution invariant while favouring moves towards states near the end-points of the trajectory. The NUTS implementation used in Stan (and I think PyMC3) actually no longer uses a slice sampling step, but instead the multinomial / Rao-Blackwellised version described by Michael Betancourt in 'A conceptual introduction to Hamiltonian Monte Carlo' (https://arxiv.org/abs/1701.02434), and, as in the slice-sampling case, an efficient 'progressive' sampling implementation which leaves the multinomial distribution over the candidate states invariant while favouring moves closer to the trajectory end-points.
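    For readers unfamiliar with the mechanism, here is a schematic R sketch (illustrative only) of the biased progressive update applied when a newly built subtree is appended to the current trajectory; this rule pushes the running proposal towards the trajectory end-points while keeping the overall multinomial weighting over candidate states correct:

    ```
    # z_old / w_old: current proposal and total weight of the trajectory so far
    # z_new / w_new: proposal and total weight of the newly built subtree
    combine_proposals <- function(z_old, w_old, z_new, w_new) {
      # take the new subtree's proposal with probability min(1, w_new / w_old),
      # favouring states from the freshly added (outer) part of the trajectory
      z <- if (runif(1) < min(1, w_new / w_old)) z_new else z_old
      list(z = z, w = w_old + w_new)
    }
    ```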

    • Thank you Matt for checking our [as of this] morning paper so thoroughly and so quickly! We’ll get back to you when we have assessed how a less naïve programming of NUTS impacts the comparison.

    • A quick point from Changye: Sorry for having omitted to upload the dataset, which is now on https://github.com/wcythh/eHMC. Actually, I reran your modified code with:

      R1 <- MixSampler(1)
      R2 <- MixSampler(2)
      R3 <- MixSampler(3)
      R4 <- MixSampler(4)
      R1_4 <- rbind(R1$SummaryResult[,1], R2$SummaryResult[,1],
                    R3$SummaryResult[,1], R4$SummaryResult[,1])
      apply(R1_4,2,mean)
      apply(R1_4,2,sd)

      and the result is different from yours. I tested the programme twice and I hope that you can rerun it to check the result. Note that apply(R1_4,1,mean); apply(R1_4,1,sd) does not seem correct, and gives a result similar to yours. For the above,

      NUTS: 0.05427967 +/- 0.007149331
      eHMCq: 0.09918268 +/- 0.005367541
      eHMC: 0.1121522 +/- 0.006003432
      eHMCu: 0.08351558 +/- 0.00560957

      the improvement is by a factor of about 1.5 to 2, which is similar to the one in our paper.

      • mattmgraham Says:

        My apologies Changye and Christian, you were completely correct to get me to recheck the figures. I misunderstood the ordering of the results when computing the summary statistics: as I had both 4 independent runs and 4 different methods, the figures I provided were incorrectly averaging over the method dimension rather than over the runs. Sorry for this silly mistake; not really a valid excuse, but I have rarely used R before so I am a bit unfamiliar with the data structures! Computing the summaries using the code you provided on a new run of 10 repeats this morning, I get minimum ESS figures very close to those in your comment:

        NUTS: 0.056 +/- 0.0017
        eHMCq: 0.096 +/- 0.0034
        eHMC: 0.1070 +/- 0.0038
        eHMCu: 0.083 +/- 0.0031

        After accounting for the adjusted computational cost calculations I proposed in my code, as you say, the ~1.5 to 2 times improvement in these figures is concordant with the ~3-4 times improvement in the corresponding values in figure 2 of your paper.

        It’s interesting that the relative ordering of the performance of the eHMC* methods is quite different when comparing on the mean ESS (and similarly for the median / max):

        Mean, Median, Max ESS (mean +/- standard error):

        NUTS : 0.1379 +/- 0.0009, 0.1482 +/- 0.0022, 0.2189 +/- 0.0050
        eHMCq: 0.1348 +/- 0.0025, 0.1334 +/- 0.0029, 0.1837 +/- 0.0054
        eHMC : 0.1727 +/- 0.0023, 0.1697 +/- 0.0024, 0.2664 +/- 0.0097
        eHMCu: 0.1825 +/- 0.0013, 0.1879 +/- 0.0027, 0.2759 +/- 0.0068

        From this it seems that the relatively poorer performance of eHMCu is due to lower ESSs in just a few components / dimensions. This might also explain some of the relatively poorer performance of NUTS, as the performance improvements of the eHMC* methods are also a bit lower when comparing on statistics other than the minimum, and, intuitively at least, eHMCu seems likely to give the most similar distribution of integration times to NUTS.

        This might partly be explained by the use of an identity mass matrix, as this means there is no adjustment for the different relative scaling along the different dimensions. The lower minimum ESS for NUTS / eHMCu may then be due to using integration times smaller than required to efficiently explore the dimension(s) with the largest scale while still exploring the smaller-scale dimensions well (as we may end up doing multiple traverses of these dimensions for even a partial traverse of the larger ones). It would be interesting to see whether / how, for example, using an adaptively tuned diagonal mass matrix changes things.
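        A common way to obtain such a matrix, and presumably close to what Stan's warmup does, is to set the inverse mass matrix to the per-dimension posterior variances estimated from warmup draws; a minimal R sketch with illustrative names only:

        ```
        # warmup_draws: matrix with one row per warmup iteration, one column per dimension
        estimate_diag_mass_matrix <- function(warmup_draws) {
          inv_mass <- apply(warmup_draws, 2, var)  # inverse mass = estimated marginal variances
          mass <- 1 / inv_mass                     # momenta then drawn with variance mass[i]
          list(mass = mass, inv_mass = inv_mass)
        }
        ```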

      • Thanks for the new experiments, Matt. It is indeed a good point that the metric used in the exploration may impact the comparison, which suggests investigating the parameterisation itself!

      • mattmgraham Says:

        As a follow-up to my previous comment, I re-ran the same logistic regression experiment with an adaptively tuned diagonal mass matrix in the Stan NUTS runs and used this same diagonal mass matrix in the eHMC runs. Averaging over 10 independent runs gives the following ESS statistics (first table: means, second table: standard errors of the mean):

        Min Mean Median Max
        0.0595 0.0838 0.0847 0.1097 (NUTS)
        0.1226 0.1578 0.1570 0.2068 (eHMCq)
        0.1550 0.2135 0.2067 0.3051 (eHMC)
        0.1090 0.1653 0.1656 0.2331 (eHMCu)

        Min Mean Median Max
        0.0031 0.0031 0.0031 0.0032 (NUTS)
        0.0020 0.0013 0.0017 0.0057 (eHMCq)
        0.0024 0.0019 0.0020 0.0078 (eHMC)
        0.0025 0.0023 0.0033 0.0055 (eHMCu)

        The range of the per-dimension ESS estimates for the eHMCu runs has narrowed from [0.083, 0.2759] to [0.1090, 0.2331], and similarly for NUTS from [0.056, 0.2189] to [0.0595, 0.1097]. Interestingly, however, the performance improvement is larger for the eHMC method, which now gives a roughly 2.5-fold gain in efficiency over NUTS.

      • Here are further points from Changye: (1) About the “Retry” triggered by reaching “NaN” (divergence) in the leapfrog steps: as far as we can tell, such divergence events come from large step sizes, which make the leapfrog integrator unstable. While this could induce a resampling step instead, in practice, if such divergence warnings occur too often, the resulting samples will not be reliable and users need to switch to a smaller step size. Since we keep the fraction of “Retry” to a minimum, the corresponding computation cost of “Retry” is negligible, and it should not impact the stationary distribution in the end. Compared with the first two examples, the values of the targeted accept probabilities are larger than 0.45 in the last three examples, values chosen towards preventing too many “Retry”, at a negligible added cost. In our experiments, “Retry” is used to smooth out the code.

        (2) Stan uses multinomial sampling, instead of slice sampling, together with biased progressive sampling (if our understanding is correct), towards favouring candidates close to the endpoints. Thus, Stan favours large ESJD (and maybe large ESS) and shows better performance than the original efficient NUTS of Hoffman and Gelman (2014). However, even though these techniques alleviate the autocorrelation in the samples, they cannot ensure that the current position and the proposal correspond to the endpoints. According to our experiments, NUTS still wastes more computation than eHMC*.
