## love-hate Metropolis algorithm

**H**yungsuk Tak, Xiao-Li Meng and David van Dyk just arXived a paper on a multiple choice proposal in Metropolis-Hastings algorithms towards dealing with multimodal targets. Called “A repulsive-attractive Metropolis algorithm for multimodality” *[although I wonder why XXL did not jump at the opportunity to use the “love-hate” denomination!]*. The proposal distribution includes a [forced] downward Metropolis-Hastings move that uses the inverse of the target density π as its own target, namely 1/{π(x)+ε}. Followed by a [forced] Metropolis-Hastings upward move which target is {π(x)+ε}. The +ε is just there to avoid handling ratios of zeroes (although I wonder why using the convention 0/0=1 would not work). And chosen as 10⁻³²³ by default in connection with R smallest positive number. Whether or not the “downward” move is truly downwards and the “upward” move is truly upwards obviously depends on the generating distribution: I find it rather surprising that the authors consider the *same* random walk density in both cases as I would have imagined relying on a more dispersed distribution for the downward move in order to reach more easily other modes. For instance, the downward move could have been based on an *anti*-Langevin proposal, relying on the gradient to proceed further down…

This special choice of a single proposal however simplifies the acceptance ratio (and keeps the overall proposal symmetric). The final acceptance ratio still requires a ratio of intractable normalising constants that the authors bypass by Møller et al. (2006) auxiliary variable trick. While the authors mention the alternative pseudo-marginal approach of Andrieu and Roberts (2009), they do not try to implement it, although this would be straightforward here: since the normalising constants are the probabilities of accepting a downward and an upward move, respectively. Those can easily be evaluated at a cost similar to the use of the auxiliary variables. That is,

– generate a few moves from the current value and record the proportion *p* of accepted downward moves;

– generate a few moves from the final proposed value and record the proportion *q* of accepted downward moves;

and replace the ratio of intractable normalising constants with *p/q*. It is not even clear that one needs those extra moves since the algorithm requires an acceptance in the downward and upward moves, hence generate Geometric variates associated with those probabilities p and q, variates that can be used for estimating them. From a theoretical perspective, I also wonder if forcing the downward and upward moves truly leads to an improved convergence speed. Considering the case when the random walk is poorly calibrated for either the downward or upward move, the number of failed attempts before an acceptance may get beyond the reasonable.

As XXL and David pointed out to me, the unusual aspect of the approach is that here the proposal density is intractable, rather than the target density itself. This makes using Andrieu and Roberts (2009) seemingly less straightforward. However, as I was reminded this afternoon at the statistics and probability seminar in Bristol, the argument for the pseudo-marginal based on an unbiased estimator is that w Q(w|x) has a marginal in x equal to π(x) when the expectation of w is π(x). In thecurrent problem, the proposal in x can extended into a proposal in (x,w), w P(w|x) whose marginal is the proposal on x.

If we complement the target π(x) with the conditional P(w|x), the acceptance probability would then involve

{π(x’) P(w’|x’) / π(x) P(w|x)} / {w’ P(w’|x’) / w P(w|x)} = {π(x’) / π(x)} {w/w’}

so it seems the pseudo-marginal (or auxiliary variable) argument also extends to the proposal. Here is a short experiment that shows no discrepancy between target and histogram:

nozero=1e-300 #love-hate move move<-function(x){ bacwa=1;prop1=prop2=rnorm(1,x,2) while (runif(1)>{pi(x)+nozero}/{pi(prop1)+nozero}){ prop1=rnorm(1,x,2);bacwa=bacwa+1} while (runif(1)>{pi(prop2)+nozero}/{pi(prop1)+nozero}) prop2=rnorm(1,prop1,2) y=x if (runif(1)<pi(prop2)*bacwa/pi(x)/fowa){ y=prop2;assign("fowa",bacwa)} return(y)} #arbitrary bimodal target pi<-function(x){.25*dnorm(x)+.75*dnorm(x,mean=5)} #running the chain T=1e5 x=5*rnorm(1);luv8=rep(x,T) fowa=1;prop1=rnorm(1,x,2) #initial estimate while (runif(1)>{pi(x)+nozero}/{pi(prop1)+nozero}){ fowa=fowa+1;prop1=rnorm(1,x,2)} for (t in 2:T) luv8[t]=move(luv8[t-1])

September 4, 2016 at 12:59 pm

I don’t understand the method so well. Does this not just correspond to a proposal distribution that is the sum of two Gaussians. One Gaussian centered at the current location of the MCMC sampler. The other at a distance from it.

Why not then just a single Gaussian with a nonzero mean and rely on rejections to sample the local mode?

September 4, 2016 at 3:47 pm

I do not think the method simplifies into a mixture of two Gaussian as the inverse part is anything but Gaussian. Or a Gaussian with a big hole in the middle. I would rather see it as an auxiliary variable model where the intermediate value is associated with as a target.

September 4, 2016 at 5:51 pm

Okay, yes, they use the underlying distribution.

February 1, 2016 at 6:28 pm

[…] love-hate Metropolis algorithm Hyungsuk Tak, Xiao-Li Meng and David van Dyk just arXived a paper on a multiple choice proposal in Metropolis-Hastings algorithms towards dealing with multimodal targets. Called “A repulsive-attractive Metropolis algorithm for multimodality” [although I wonder why XXL did not jump at the opportunity to use the “love-hate” denomination!]. The proposal distribution includes a [forced] downward Metropolis-Hastings move that uses the inverse of the target density π as its own target, namely 1/{π(x)+ε}. Followed by a [forced] Metropolis-Hastings upward move which target is {π(x)+ε}. The +ε is just there to avoid handling ratios of zeroes (although I wonder why using the convention 0/0=1 would not work). And chosen as 10⁻³²³ by default in connection with R smallest positive number. Whether or not the “downward” move is truly downwards and the “upward” move is truly upwards obviously depends on the generating distribution: I find it rather surprising that the authors consider the same random walk density in both cases as I would have imagined relying on a more dispersed distribution for the downward move in order to reach more easily other modes. For instance, the downward move could have been based on an anti-Langevin proposal, relying on the gradient to proceed further down… […]

January 28, 2016 at 5:33 pm

This can be viewed as a negative temperature. So you can also use a whole ladder of multiple positive and negative temperatures. At very negative temperatures, the chain should move away from modes even faster than at -1.

Negative temperatures (Kelvin) have almost no place in the physical world. They mean that high energy states are more probable than low, so in a sense they are hotter than positive temperatures. [From what I recall, if such a system comes into contact with a positive temperature one, they immediately become a positive temp system. They might just be a weird theoretical corner case.] As the article shows there are difficulties around energy = 0.

It will be interesting to see what happens when you alternate climbing and descending the posterior probability. Ideally you move away from one mode to then ascend a different one. Probably some landscapes are favorable to that while others might keep sending you back to the first mode.

January 28, 2016 at 10:07 pm

Thanks for the comments, Art! Indeed, the ability to switch modes with this algorithm is rather aleatory in that it only use local values of the target, without accounting for other modes or even for the gradient or Hessian. I would be more interested in a Langevin version, except that its construction gets quite involved…