debunking a (minor and personal) myth

For quite a while, I entertained the idea that Beta and Dirichlet proposals were more adequate than (log-)normal random-walk proposals for parameters on (0,1) and on simplices, respectively, when running an MCMC algorithm. For instance, with p in (0,1) the value of the Markov chain at time t-1, the proposal at time t could be a Be(εp,ε(1-p)) generator, since its mean is equal to p and its variance is proportional to 1/(1+ε). (Although I cannot find a trace of this notion in my books.) The parameter ε can then be calibrated towards a given acceptance rate, like the golden number 0.234 of Gelman, Gilks and Roberts (1996). However, when using this proposal on a mixture model, Kaniav Kamari and I realised today that there is a catch, namely that pushing ε down to achieve an acceptance rate near 0.234 may end in disaster, since the parameters of the Beta or of the Dirichlet proposal may then fall below one, which implies an infinite density spike at some boundaries of the parameter space. A spike that gets more and more pronounced as ε decreases to zero, hence more and more likely to lower the acceptance rate, thus to reduce ε further, which in turn concentrates even more mass near the boundary: a vicious circle with no convergence to the target acceptance rate… I tried to implement this approach to illustrate my point with the following R code:

target=function(x,a=1,b=1){dbeta(x,a,b)}
T=1e4   #number of MCMC iterations
bsch=50 #batch size
mkv=rep(runif(1),T) #subsampled MCMC chain
eps=rep(1/runif(1,10,100),T) #adapted proposal scale
ace=rep(0,T) #acceptance rate per batch
for (t in 2:T){
  batsch=mkv[t-1] #current value of the chain
  for (d in 1:bsch){
    #Beta proposal with mean batsch
    prop=rbeta(1,eps[t-1]*batsch,eps[t-1]*(1-batsch))
    #Metropolis-Hastings ratio
    mh=target(prop)*dbeta(batsch,eps[t-1]*prop,eps[t-1]*(1-prop))/
      dbeta(prop,eps[t-1]*batsch,eps[t-1]*(1-batsch))/
      target(batsch)
    if (runif(1)<mh){ace[t-1]=ace[t-1]+1;batsch=prop}}
  ace[t-1]=ace[t-1]/bsch
  mkv[t]=batsch #subsampled Markov chain
  #adaptation of epsilon towards the target acceptance rate
  if (ace[t-1]<.456){eps[t]=eps[t-1]*1.01
    }else{eps[t]=eps[t-1]/1.01}
} #end of the MCMC loop

Indeed, when using a Be(a,b) target with either a<1 or b<1, the above MCMC algorithm eventually and inevitably runs into difficulties, because the Beta generator ends up producing either 0 or 1, thanks to rounding effects, hence an NA value for the Metropolis-Hastings ratio. For a Be(a,b) target with both coefficients no smaller than one, there was no convergence issue and no difficulty with the behaviour of ε, as shown by the picture above when the target is uniform, or the one below when it is a Be(4,8).
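As a quick numerical check of the rounding effect (a minimal sketch, with eps and p set to illustrative values of mine rather than taken from the run above):

eps=.01 #illustrative small proposal scale
p=.9    #illustrative current value of the chain
prop=rbeta(1e5,eps*p,eps*(1-p))
mean((prop==0)|(prop==1)) #large fraction of exact 0's and 1's
dbeta(0,eps*p,eps*(1-p))  #Inf: proposal density at the boundary
dbeta(p,eps*0,eps*1)      #NaN: reverse density when prop=0

The last two quantities are precisely those entering the Metropolis-Hastings ratio once the generator returns a boundary value.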

Robin Ryder was sitting in front of me in our CREST office as I was experimenting with this and he suggested using the current value of the Markov chain as the mode of the proposal rather than as its mean. Which amounts to adding one (1) to both Beta coefficients. And which solves the problem altogether. Thanks, Robin!
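In R, Robin's fix is a drop-in replacement for the proposal and the Metropolis-Hastings ratio in the inner loop above (a sketch reusing the names batsch and eps from that code):

#mode-centred Beta proposal Be(1+eps*batsch,1+eps*(1-batsch)):
#its mode (a-1)/(a+b-2) equals batsch and both shape parameters
#stay above one, so the density remains bounded on (0,1)
prop=rbeta(1,1+eps[t-1]*batsch,1+eps[t-1]*(1-batsch))
mh=target(prop)*dbeta(batsch,1+eps[t-1]*prop,1+eps[t-1]*(1-prop))/
  dbeta(prop,1+eps[t-1]*batsch,1+eps[t-1]*(1-batsch))/
  target(batsch)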

4 Responses to “debunking a (minor and personal) myth”

  1. This point seemed immensely underexplored in the literature (or just handwaved) when I last ran into it.
    I don’t think it makes much sense to center on the mean, to be frank, due to the above issues. What we want is to stay close to our current location, with the probability of selecting any point in the vicinity decreasing with some distance measure (e.g., the Bregman divergence associated with an exponential family). So we should center proposals on the mode. These two quantities just happen to coincide in the classical Gaussian case. When we work with distributions that can lose unimodality under some parameter settings, we should be more careful.
    I made some arguments to this effect in a paper last year when I ran into this, as I hadn’t found any mention of the issue elsewhere.

  2. Dan Simpson Says:

    I remember talking to Matt Moores about proposals on a simplex, and he had a lot more luck getting it to move when he reparameterised the Dirichlet as a normalised sum of Gammas and did the random walk on the (maybe log) Gammas.

    • Yes! Although this does not directly solve our problem of achieving the targeted acceptance rate, the idea of sticking to the Gammas behind the Dirichlets is quite helpful in avoiding NAs. Some NAs only, as there remains a possibility that the Gamma generator returns zeros…

      • I thought it did, because an RW on the Gammas (or, more likely, on their logs) should be fairly straightforward to tune; see the sketch below.

        I won’t deny, however, that your solution is more elegant. And given that you’ve given it more than a second’s thought, probably works better :p
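        For illustration, a minimal sketch of what I have in mind, with illustrative values for alpha and sig: a Gaussian random walk on the logs of independent Gammas, whose normalisation is Dirichlet distributed. On the log scale every proposed value maps back to a strictly positive Gamma, so neither zeros nor boundary NAs can occur.

        alpha=c(4,8,2) #illustrative Dirichlet parameters
        sig=.5         #illustrative random-walk scale
        T=1e4
        #log-density of y=log(G), G~Gamma(alpha,1), Jacobian included:
        #pi(y) proportional to exp(sum(alpha*y-exp(y)))
        lpost=function(y) sum(alpha*y-exp(y))
        lgam=matrix(0,T,length(alpha))
        for (t in 2:T){
          prop=lgam[t-1,]+rnorm(length(alpha),0,sig) #symmetric proposal
          if (log(runif(1))<lpost(prop)-lpost(lgam[t-1,])){
            lgam[t,]=prop}else{lgam[t,]=lgam[t-1,]}
        } #end of the MCMC loop
        smplx=exp(lgam)/rowSums(exp(lgam)) #Dirichlet(alpha) chain on the simplex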
