In Stan, we just treat divergences (NaN values or other numerical failures in the Hamiltonian) as rejections, like Metropolis rejections. In NUTS, you only have to reject the last doubling.
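For intuition, here is what "divergence as rejection" looks like in plain Metropolis (a toy sketch, not Stan's NUTS machinery; all names here are made up for illustration):

```python
import numpy as np

def mh_step(rng, x, logp, proposal_scale=1.0):
    """One Metropolis step that treats a divergent (non-finite) proposal
    log density as an automatic rejection, staying at the current state."""
    x_new = x + proposal_scale * rng.normal(size=x.shape)
    lp_new = logp(x_new)
    if not np.isfinite(lp_new):          # divergence (NaN/inf) -> reject
        return x
    if np.log(rng.uniform()) < lp_new - logp(x):
        return x_new
    return x

rng = np.random.default_rng(1)
# toy target that returns NaN outside its support
logp = lambda q: float(-0.5 * q @ q) if np.all(np.abs(q) < 3.0) else np.nan
x = np.zeros(2)
for _ in range(200):
    x = mh_step(rng, x, logp)
# NaN proposals are always rejected, so the chain never leaves the support
```

The point is just that a divergence is handled like any other rejection: the chain stays put rather than crashing or recording garbage.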

The biasing toward the last doubling is a really important step in NUTS, and is indeed part of the off-the-shelf version of Stan. It’s not Rao-Blackwellized, though—we just draw a single next state rather than averaging over the intermediates. Michael Betancourt evaluated the Rao-Blackwellized approach and it doesn’t gain you much at the cost of completely changing the interfaces to only compute expectations (as you no longer get draws).

The advantage you get from a diagonal metric is limited, but it’s important if the variables aren’t unit scaled in the posterior.
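A diagonal metric amounts to a per-coordinate rescaling. A quick numerical check of that equivalence (hypothetical names and target; unit Gaussian momenta after rescaling): one leapfrog step under a diagonal metric matches a unit-metric step on rescaled coordinates.

```python
import numpy as np

sigma = np.array([1.0, 100.0])        # assumed posterior scales (not unit scaled)

def grad_U(q):
    # toy target: independent Gaussians with scales sigma
    return q / sigma**2

def leapfrog(q, p, eps, m, grad):
    """One leapfrog step with diagonal metric m (momentum p ~ N(0, diag(m)))."""
    p = p - 0.5 * eps * grad(q)
    q = q + eps * p / m
    p = p - 0.5 * eps * grad(q)
    return q, p

m = 1.0 / sigma**2                    # metric set to inverse posterior variances
q0 = np.array([0.5, 50.0])
p0 = np.array([0.3, -0.001])

# diagonal-metric step on the original coordinates
q1, p1 = leapfrog(q0, p0, 0.9, m, grad_U)

# unit-metric step on rescaled coordinates qs = sqrt(m) * q
s = np.sqrt(m)
grad_s = lambda qs: grad_U(qs / s) / s
qs1, ps1 = leapfrog(s * q0, p0 / s, 0.9, np.ones(2), grad_s)

# the two trajectories agree exactly, step by step
assert np.allclose(q1, qs1 / s) and np.allclose(p1, ps1 * s)
```

So the diagonal metric buys you exactly what standardizing the parameters would, which is why it matters when the posterior scales are far from unit.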

The reason I like using Stan’s ESS calculations is that they discount non-convergence by using cross-chain information. Standard ESS calculations that treat chains independently overestimate ESS in cases where there isn’t good convergence.
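The cross-chain discounting can be sketched as follows. This is a simplified version of the BDA-style calculation (no rank normalization, crude truncation of the autocorrelation sum), not Stan’s actual implementation, but it shows the mechanism: between-chain variance inflates the variance estimate, which deflates ESS for chains that haven’t mixed.

```python
import numpy as np

def cross_chain_ess(chains):
    """Rough cross-chain ESS sketch for an (m_chains, n_draws) array.
    Unmixed chains inflate var_plus via between-chain variance B,
    which drives the lag correlations up and the ESS down."""
    m, n = chains.shape
    means = chains.mean(axis=1)
    W = chains.var(axis=1, ddof=1).mean()        # within-chain variance
    B = n * means.var(ddof=1)                    # between-chain variance
    var_plus = W * (n - 1) / n + B / n           # overestimates var if unmixed

    rho_sum = 0.0
    for t in range(1, n):
        # average within-chain autocovariance at lag t
        acov = np.mean([np.mean((c[:-t] - c.mean()) * (c[t:] - c.mean()))
                        for c in chains])
        rho = 1.0 - (W - acov) / var_plus
        if rho < 0:                              # crude truncation
            break
        rho_sum += rho
    return m * n / (1.0 + 2.0 * rho_sum)

rng = np.random.default_rng(0)
mixed = rng.normal(size=(4, 1000))                     # well-mixed iid chains
stuck = mixed + np.array([0., 0., 0., 10.])[:, None]   # one chain off on its own
# cross_chain_ess(mixed) is near 4000; cross_chain_ess(stuck) collapses
```

Per-chain ESS estimates would report the `stuck` ensemble as perfectly healthy, since each chain looks iid on its own; only the cross-chain term catches it.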

I wouldn’t say we change recommendations from computation to inference. We have a model we want to fit, which determines the inferences we want. We often have to reparameterize the model to make that inference feasible. We almost never want to do inference with models with “objective” or “weak” priors, and prefer models with at least weakly informative priors that determine the scales of the variables. This is independent of computational concerns, though Andrew Gelman’s folk theorem is relevant here: the models we don’t like for statistical reasons are often also the ones that behave badly computationally.

I don’t know what you mean when you say you’re perturbed by divergences. Floating-point arithmetic is hard. Our notion of divergence is the Hamiltonian diverging by more than 1000 (this is on the log scale!). That means the combination of floating-point arithmetic and first-order approximations has failed. We want to diagnose these rather than pretend they don’t exist. When the algorithm diverges, it’s a signal that you’re getting biased estimates, especially if the divergences all arise in the same place (like the neck of the funnel in Neal’s example).
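A minimal sketch of that check on Neal’s funnel (unit metric, made-up names; Stan’s actual criterion lives inside its tree building): run leapfrog and flag a divergence when the Hamiltonian error exceeds 1000 on the log scale or goes non-finite.

```python
import numpy as np

def U(z):
    """Negative log density of Neal's funnel: v ~ N(0, 3^2), x ~ N(0, exp(v))."""
    v, x = z[0], z[1:]
    return v**2 / 18.0 + 0.5 * len(x) * v + 0.5 * np.exp(-v) * np.sum(x**2)

def grad_U(z):
    v, x = z[0], z[1:]
    g = np.empty_like(z)
    g[0] = v / 9.0 + 0.5 * len(x) - 0.5 * np.exp(-v) * np.sum(x**2)
    g[1:] = np.exp(-v) * x
    return g

def diverged(z, p, eps, L, max_delta=1000.0):
    """Leapfrog with unit metric; flag a divergence when the Hamiltonian
    drifts by more than max_delta (log scale) or becomes non-finite."""
    H0 = U(z) + 0.5 * p @ p
    for _ in range(L):
        p = p - 0.5 * eps * grad_U(z)
        z = z + eps * p
        p = p - 0.5 * eps * grad_U(z)
        H = U(z) + 0.5 * p @ p
        if not np.isfinite(H) or H - H0 > max_delta:
            return True
    return False

# the same step size is stable in the body of the funnel...
body = diverged(np.array([0.0, 1.0]), np.zeros(2), 0.2, 10)
# ...but blows up in the narrow neck (v very negative), where the
# conditional scale exp(v/2) is tiny and the dynamics are very stiff
neck = diverged(np.array([-8.0, 0.01]), np.zeros(2), 0.2, 10)
```

The divergences clustering at the neck are exactly the diagnostic signal: the integrator can’t follow the geometry there, so the sampler never explores it and the estimates are biased.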

I’m reproducing this work and can share my code when I’m done.

Thanks, Matt, the new code is on https://github.com/jstoehr/eHMC instead.