## an independent sampler that maximizes the acceptance rate of the MH algorithm

Posted in Books, Kids, Statistics, University life with tags , , , , , , , , , , , , , on September 3, 2019 by xi'an

An ICLR 2019 paper by Neklyudov, Egorov and Vetrov on an optimal choice of the proposal in an independent Metropolis algorithm I discovered via an X validated question. Namely whether or not the expected Metropolis-Hastings acceptance ratio is always one (which it is not when the support of the proposal is restricted). The paper mentions the domination of the Accept-Reject algorithm by the associated independent Metropolis-Hastings algorithm, which has actually been stated in our Monte Carlo Statistical Methods (1999, Lemma 6.3.2) and may prove even older. The authors also note that the expected acceptance probability is equal to one minus the total variation distance between the joint defined as target x Metropolis-Hastings proposal distribution and its time-reversed version. Which seems to suffer from the same difficulty as the one mentioned in the X validated question. Namely that it only holds when the support of the Metropolis-Hastings proposal is at least the support of the target (or else when the support of the joint defined as target x Metropolis-Hastings proposal distribution is somewhat symmetric. Replacing total variation with Kullback-Leibler then leads to a manageable optimisation target if the proposal is a parameterised independent distribution. With a GAN version when the proposal is not explicitly available. I find it rather strange that one still seeks independent proposals for running Metropolis-Hastings algorithms as the result will depend on the family of proposals considered and as performances will deteriorate with dimension (the authors mention a 10% acceptance rate, which sounds quite low). [As an aside, ICLR 2020 will take part in Addis Abeba next April.]

## MCMC importance samplers for intractable likelihoods

Posted in Books, pictures, Statistics with tags , , , , , , , , , , , on May 3, 2019 by xi'an

Jordan Franks just posted on arXiv his PhD dissertation at the University of Jyväskylä, where he discuses several of his works:

1. M. Vihola, J. Helske, and J. Franks. Importance sampling type estimators based on approximate marginal MCMC. Preprint arXiv:1609.02541v5, 2016.
2. J. Franks and M. Vihola. Importance sampling correction versus standard averages of reversible MCMCs in terms of the asymptotic variance. Preprint arXiv:1706.09873v4, 2017.
3. J. Franks, A. Jasra, K. J. H. Law and M. Vihola.Unbiased inference for discretely observed hidden Markov model diffusions. Preprint arXiv:1807.10259v4, 2018.
4. M. Vihola and J. Franks. On the use of ABC-MCMC with inflated tolerance and post-correction. Preprint arXiv:1902.00412, 2019

focusing on accelerated approximate MCMC (in the sense of pseudo-marginal MCMC) and delayed acceptance (as in our recently accepted paper). Comparing delayed acceptance with MCMC importance sampling to the advantage of the later. And discussing the choice of the tolerance sequence for ABC-MCMC. (Although I did not get from the thesis itself the target of the improvement discussed.)

## non-reversible Langevin samplers

Posted in Books, pictures, Statistics, Travel, University life with tags , , , , , , , , , , on February 6, 2017 by xi'an

In the train to Oxford yesterday night, I read through the recently arXived Duncan et al.’s Nonreversible Langevin Samplers: Splitting Schemes, Analysis and Implementation. Standing up the whole trip in the great tradition of British trains.

The paper is fairly theoretical and full of Foster-Lyapunov assumptions but aims at defending an approach based on a non-reversible diffusion. One idea is that the diffusion based on the drift {∇ log π(x) + γ(x)} is associated with the target π provided

∇ . {π(x)γ(x)} = 0

which holds for the Langevin diffusion when γ(x)=0, but produces a non-reversible process in the alternative. The Langevin choice γ(x)=0 happens to be the worst possible when considering the asymptotic variance. In practice however the diffusion need be discretised, which induces an approximation that may be catastrophic for convergence if not corrected, and a relapse into reversibility if corrected by Metropolis. The proposal in the paper is to use a Lie-Trotter splitting I had never heard of before to split between reversible [∇ log π(x)] and non-reversible [γ(x)] parts of the process. The deterministic part is chosen as γ(x)=∇ log π(x) [but then what is the point since this is Langevin?] or as the gradient of a power of π(x). Although I was mostly lost by that stage, the paper then considers the error induced by a numerical integrator related with this deterministic part, towards deriving asymptotic mean and variance for the splitting scheme. On the unit hypercube. Although the paper includes a numerical example for the warped normal target, I find it hard to visualise the implementation of this scheme. Having obviously not heeded Nicolas’ and James’ advice, the authors also analyse the Pima Indian dataset by a logistic regression!)

Posted in Books, Mountains, pictures, Statistics, University life with tags , , , , , on May 26, 2015 by xi'an

[Here are comments made by Matt Graham that I thought would be more readable in a post format. The beautiful picture of the Alps above is his as well. I do not truly understand what Matt’s point is, as I did not cover continuous time processes in my discussion…]

In terms of interpretation of the diffusion with non-reversible drift component, I think this can be generalised from the Gaussian invariant density case by

dx = [ – (∂E/∂x) dt + √2 dw ] + S’ (∂E/∂x) dt

where ∂E/∂x is the usual gradient of the negative log (unnormalised) density / energy and S=-S’ is a skew symmetric matrix. In this form it seems the dynamic can be interpreted as the composition of an energy and volume conserving (non-canonical) Hamiltonian dynamic

dx/dt = S’ ∂E/∂x

and a (non-preconditioned) Langevin diffusion

dx = – (∂E/∂x) dt + √2 dw

As an alternative to discretising the combined dynamic, it might be interesting to compare to sequential alternation between ‘Hamiltonian’ steps either using a simple Euler discretisation

x’ = x + h S’ ∂E/∂x

or a symplectic method like implicit midpoint to maintain reversibility / volume preservation and then a standard MALA step

x’ = x – h (∂E/∂x) + √2 h w, w ~ N(0, I)

plus MH accept. If only one final MH accept step is done this overall dynamic will be reversible, however if a an intermediary MH accept was done after each Hamiltonian step (flipping the sign / transposing S on a rejection to maintain reversibility), the composed dynamic would in general be non-longer reversible and it would be interesting to compare performance in this case to that using a non-reversible MH acceptance on the combined dynamic (this alternative sidestepping the issues with finding an appropriate scale ε to maintain the non-negativity condition on the sum of the vorticity density and joint density on a proposed and current state).

With regards to your point on the positivity of g(x,y)+π(y)q(y,x), I’m not sure if I have fully understood your notation correctly or not, but I think you may have misread the definition of g(x,y) for the discretised Ornstein-Uhlenbeck case (apologies if instead the misinterpretation is mine!). The vorticity density is defined as the skew symmetric component of the density f of F(dx, dy) = µ(dx) Q(x, dy) with respect to the Lebesgue measure, where µ(dx) is the true invariant distribution of the Euler-Maruyama discretised diffusion based proposal density Q(x, dy) rather than g(x, y) being defined in terms of the skew-symmetric component of π(dx) Q(x, dy) which in general would lead to a vorticity density which does not meet the zero integral requirement as the target density π is not invariant in general with respect to Q.

## non-reversible MCMC

Posted in Books, Statistics, University life with tags , , , , , , on May 21, 2015 by xi'an

While visiting Dauphine, Natesh Pillai and Aaron Smith pointed out this interesting paper of Joris Bierkens (Warwick) that had escaped my arXiv watch/monitoring. The paper is about turning Metropolis-Hastings algorithms into non-reversible versions, towards improving mixing.

In a discrete setting, a way to produce a non-reversible move is to mix the proposal kernel Q with its time-reversed version Q’ and use an acceptance probability of the form

$\epsilon\pi(y)Q(y,x)+(1-\epsilon)\pi(x)Q(x,y) \big/ \pi(x)Q(x,y)$

where ε is any weight. This construction is generalised in the paper to any vorticity (skew-symmetric with zero sum rows) matrix Γ, with the acceptance probability

$\epsilon\Gamma(x,y)+\pi(y)Q(y,x)\big/\pi(x)Q(x,y)$

where ε is small enough to ensure all numerator values are non-negative. This is a rather annoying assumption in that, except for the special case derived from the time-reversed kernel, it has to be checked over all pairs (x,y). (I first thought it also implied the normalising constant of π but everything can be set in terms of the unormalised version of π, Γ or ε included.) The paper establishes that the new acceptance probability preserves π as its stationary distribution. An alternative construction is to make the proposal change from Q in H such that H(x,y)=Q(x,y)+εΓ(x,y)/π(x). Which seems more pertinent as not changing the proposal cannot improve that much the mixing behaviour of the chain. Still, the move to the non-reversible versions has the noticeable plus of decreasing the asymptotic variance of the Monte Carlo estimate for any integrable function. Any. (Those results are found in the physics literature of the 2000’s.)

The extension to the continuous case is a wee bit more delicate. One needs to find an anti-symmetric vortex function g with zero integral [equivalent to the row sums being zero] such that g(x,y)+π(y)q(y,x)>0 and with same support as π(x)q(x,y) so that the acceptance probability of g(x,y)+π(y)q(y,x)/π(x)q(x,y) leads to π being the stationary distribution. Once again g(x,y)=ε(π(y)q(y,x)-π(x)q(x,y)) is a natural candidate but it is unclear to me why it should work. As the paper only contains one illustration for the discretised Ornstein-Uhlenbeck model, with the above choice of g for a small enough ε (a point I fail to understand since any ε<1 should provide a positive g(x,y)+π(y)q(y,x)), it is also unclear to me that this modification (i) is widely applicable and (ii) is relevant for genuine MCMC settings.

## delayed acceptance [alternative]

Posted in Books, Kids, Statistics, University life with tags , , , , , , on October 22, 2014 by xi'an

In a comment on our Accelerating Metropolis-Hastings algorithms: Delayed acceptance with prefetching paper, Philip commented that he had experimented with an alternative splitting technique retaining the right stationary measure: the idea behind his alternative acceleration is again (a) to divide the target into bits and (b) run the acceptance step by parts, towards a major reduction in computing time. The difference with our approach is to represent the  overall acceptance probability

$\min_{k=0,..,d}\left\{\prod_{j=1}^k \rho_j(\eta,\theta),1\right\}$

and, even more surprisingly than in our case, this representation remains associated with the right (posterior) target!!! Provided the ordering of the terms is random with a symmetric distribution on the permutation. This property can be directly checked via the detailed balance condition.

In a toy example, I compared the acceptance rates (acrat) for our delayed solution (letabin.R), for this alternative (letamin.R), and for a non-delayed reference (letabaz.R), when considering more and more fractured decompositions of a Bernoulli likelihood.

> system.time(source("letabin.R"))
user system elapsed
225.918 0.444 227.200
> acrat
[1] 0.3195 0.2424 0.2154 0.1917 0.1305 0.0958
> system.time(source("letamin.R"))
user system elapsed
340.677 0.512 345.389
> acrat
[1] 0.4045 0.4138 0.4194 0.4003 0.3998 0.4145
> system.time(source("letabaz.R"))
user system elapsed
49.271 0.080 49.862
> acrat
[1] 0.6078 0.6068 0.6103 0.6086 0.6040 0.6158


A very interesting outcome since the acceptance rate does not change with the number of terms in the decomposition for the alternative delayed acceptance method… Even though it logically takes longer than our solution. However, the drawback is that detailed balance implies picking the order at random, hence loosing on the gain in computing the cheap terms first. If reversibility could be bypassed, then this alternative would definitely get very appealing!