non-reversible MCMC [comments]
[Here are comments made by Matt Graham that I thought would be more readable in a post format. The beautiful picture of the Alps above is his as well. I do not
truly understand what Matt’s point is, as I did not cover continuous time processes in my discussion…]
In terms of interpretation of the diffusion with non-reversible drift component, I think this can be generalised from the Gaussian invariant density case by
dx = [ – (∂E/∂x) dt + √2 dw ] + S’ (∂E/∂x) dt
where ∂E/∂x is the usual gradient of the negative log (unnormalised) density / energy and S=-S’ is a skew symmetric matrix. In this form it seems the dynamic can be interpreted as the composition of an energy and volume conserving (non-canonical) Hamiltonian dynamic
dx/dt = S’ ∂E/∂x
and a (non-preconditioned) Langevin diffusion
dx = – (∂E/∂x) dt + √2 dw
As an alternative to discretising the combined dynamic, it might be interesting to compare to sequential alternation between ‘Hamiltonian’ steps either using a simple Euler discretisation
x’ = x + h S’ ∂E/∂x
or a symplectic method like implicit midpoint to maintain reversibility / volume preservation and then a standard MALA step
x’ = x – h (∂E/∂x) + √2 h w, w ~ N(0, I)
plus MH accept. If only one final MH accept step is done this overall dynamic will be reversible, however if a an intermediary MH accept was done after each Hamiltonian step (flipping the sign / transposing S on a rejection to maintain reversibility), the composed dynamic would in general be non-longer reversible and it would be interesting to compare performance in this case to that using a non-reversible MH acceptance on the combined dynamic (this alternative sidestepping the issues with finding an appropriate scale ε to maintain the non-negativity condition on the sum of the vorticity density and joint density on a proposed and current state).
With regards to your point on the positivity of g(x,y)+π(y)q(y,x), I’m not sure if I have fully understood your notation correctly or not, but I think you may have misread the definition of g(x,y) for the discretised Ornstein-Uhlenbeck case (apologies if instead the misinterpretation is mine!). The vorticity density is defined as the skew symmetric component of the density f of F(dx, dy) = µ(dx) Q(x, dy) with respect to the Lebesgue measure, where µ(dx) is the true invariant distribution of the Euler-Maruyama discretised diffusion based proposal density Q(x, dy) rather than g(x, y) being defined in terms of the skew-symmetric component of π(dx) Q(x, dy) which in general would lead to a vorticity density which does not meet the zero integral requirement as the target density π is not invariant in general with respect to Q.
May 26, 2015 at 2:37 pm
@Xi’an:
Thanks for following this up and for giving it such prominence (more than it probably deserves!).
The first part of my comment was more in response to Dan Simpson’s comment on an intuition ‘that the skew-symmetric vorticity operator is inducing some sort of “momentum flow” on the space of orthogonal transformations that can “push” you across an energy barrier’. That was where my suggestion of thinking of how the continuous time dynamic can be decomposed in to Langevin and Hamiltonian updates might be interesting came from.
Due to the non-canonical nature of the Hamiltonian dynamic component, there are no momentum variables as such (though if we were to make a change of variables we could bring the system in to a canonical form by some linear transform), the Hamiltonian energy exchange happening within the original system. However it is introducing a coherent flow to the system, without the Langevin dynamic component, the trajectories under the dynamic would follow paths along the isopotential surfaces – so rather than the vorticity pushing across energy barriers, to me it seems more like pushing around energy barriers. This is obviously a simplification as the vorticity kernel does not only represent the non-reversibility introduced by the extra drift component introduced by S, but also (I think) the inherent non-reversibility of the Euler discretisation of the Langevin dynamic, but I thought it might be a partially useful intuition as to what might be going on.
@Alex:
Again thanks for the comments on my comment, some interesting points!
My experience with using implicit-midpoint for integrating a Hamiltonian dynamic like this tallies with what you said regarding the need to set the step-size quite small to get consistent convergence (also using fixed-point iterations). I’m not sure about any theoretical guarantees on convergence, but I’ve found in practice that it’s possible to get consistent convergence without too much trouble even if the energy gradients have quite a complex non-linear form, and in this case due to the symplecticness of implicit mid-point you can get high acceptance rates (and so limited sign flipping) even when integrating over large number of steps due to the bounded changes in the Hamiltonian.
Obviously there is a high computational cost for using a small step size and iterative implicit updates so such a dynamic is probably unlikely to be of benefit when simpler methods or conventional HMC works well, but in some cases integrating paths along isopotential surfaces like this seems to help when the energy function is non-convex as standard HMC seems to have trouble finding a way through narrow bottlenecks / ‘valleys’ in the energy surface particularly in high-dimensions. RM-HMC can help with this as well, though again requires implicit updates and evaluating the (inverse) metric and derivatives is generally costly.
May 26, 2015 at 2:04 am
Thanks Matt for these interesting thoughts!
While playing a few months ago with more or less the same approach, I have found quite difficult in practice to:
– choose a sufficiently small time step \epsilon for the implicit-midpoint update that ensures that the implicit equation can be solved (I was simply iterating the fixed point equation — btw, I suspect that similar problems can be encountered while implementing the Manifold-HMC of Girolami-Calderhead). Indeed, in general (without strong assumptions on the target distribution) there is no guarantee that such an implicit update has one, and only one, solution. Also, I have observed that for a large-ish value of \epsilon, even if the implicit equation can sometimes be solved, this leads to a large number of rejections (because of the sign flip in these cases) and thus to a lot of random-walk behaviour (which is precisely what one is trying to avoid).
– in practice, it is also quite important to add an update step for the skew-symmetric matrix S so that at equilibrium the pair (x,S) has a well defined invariant distribution. For example, one can do a full update of S every “N” steps. Choosing “N” efficiently is quite tricky.
– indeed, one can mix the implicit-midpoint update with any other MCMC update i.e. not necessarily a MALA update. I was using a simple RWM, and that was not working too badly.
– overall, the above strategy was not doing better than a standard HMC on the examples I was playing with (i.e. not really worth the trouble of the implicit update, etc…) but I am sure, though, that one can cook up examples where this may be worth it…