## Non-reversible Markov Chains for Monte Carlo sampling

Posted in pictures, Statistics, Travel, University life with tags , , , , , , , , , , , , on September 24, 2015 by xi'an

This “week in Warwick” was not chosen at random as I was aware there is a workshop on non-reversible MCMC going on. (Even though CRiSM sponsored so many workshops in September that almost any week would have worked for the above sentence!) It has always been kind of a mystery to me that non-reversibility could make a massive difference in practice, even though I am quite aware that it does. And I can grasp some of the theoretical arguments why it does. So it was quite rewarding to sit in this Warwick amphitheatre and learn about overdamped Langevin algorithms and other non-reversible diffusions, to see results where convergence times moved from n to √n, and to grasp some of the appeal of lifting albeit in finite state spaces. Plus, the cartoon presentation of Hamiltonian Monte Carlo by Michael Betancourt was a great moment, not only because of the satellite bursting into flames on the screen but also because it gave a very welcome intuition about why reversibility was inefficient and HMC appealing. So I am grateful to my two colleagues, Joris Bierkens and Gareth Roberts, for organising this exciting workshop, with a most profitable scheduling favouring long and few talks. My next visit to Warwick will also coincide with a workshop on intractable likelihood, next November. This time part of the new Alan Turing Institute programme.

Posted in Books, Kids, R, Statistics, University life with tags , , , , , , , , , , , on May 27, 2015 by xi'an

Another trip in the métro today (to work with Pierre Jacob and Lawrence Murray in a Paris Anticafé!, as the University was closed) led me to infer—warning!, this is not the exact distribution!—the distribution of x, namely

$f(x|N) = \frac{4^p}{4^{\ell+2p}} {\ell+p \choose p}\,\mathbb{I}_{N=\ell+2p}$

since a path x of length l(x) will corresponds to N draws if N-l(x) is an even integer 2p and p undistinguishable annihilations in 4 possible directions have to be distributed over l(x)+1 possible locations, with Feller’s number of distinguishable distributions as a result. With a prior π(N)=1/N on N, hence on p, the posterior on p is given by

$\pi(p|x) \propto 4^{-p} {\ell+p \choose p} \frac{1}{\ell+2p}$

Now, given N and  x, the probability of no annihilation on the last round is 1 when l(x)=N and in general

$\frac{4^p}{4^{\ell+2p}}{\ell-1+p \choose p}\big/\frac{4^p}{4^{\ell+2p}}{\ell+p \choose p}=\frac{\ell}{\ell+p}=\frac{2\ell}{N+\ell}$

which can be integrated against the posterior. The numerical expectation is represented for a range of values of l(x) in the above graph. Interestingly, the posterior probability is constant for l(x) large  and equal to 0.8125 under a flat prior over N.

Getting back to Pierre Druilhet’s approach, he sets a flat prior on the length of the path θ and from there derives that the probability of annihilation is about 3/4. However, “the uniform prior on the paths of lengths lower or equal to M” used for this derivation which gives a probability of length l proportional to 3l is quite different from the distribution of l(θ) given a number of draws N. Which as shown above looks much more like a Binomial B(N,1/2).

However, being not quite certain about the reasoning involving Fieller’s trick, I ran an ABC experiment under a flat prior restricted to (l(x),4l(x)) and got the above, where the histogram is for a posterior sample associated with l(x)=195 and the gold curve is the potential posterior. Since ABC is exact in this case (i.e., I only picked N’s for which l(x)=195), ABC is not to blame for the discrepancy! I asked about the distribution on Stack Exchange maths forum (and a few colleagues here as well) but got no reply so far… Here is the R code that goes with the ABC implementation:

#observation:
elo=195
#ABC version
T=1e6
el=rep(NA,T)
N=sample(elo:(4*elo),T,rep=TRUE)
for (t in 1:T){
#generate a path
paz=sample(c(-(1:2),1:2),N[t],rep=TRUE)
#eliminate U-turns
uturn=paz[-N[t]]==-paz[-1]
while (sum(uturn>0)){
uturn[-1]=uturn[-1]*(1-
uturn[-(length(paz)-1)])
uturn=c((1:(length(paz)-1))[uturn==1],
(2:length(paz))[uturn==1])
paz=paz[-uturn]
uturn=paz[-length(paz)]==-paz[-1]
}
el[t]=length(paz)}
#subsample to get exact posterior
poster=N[abs(el-elo)==0]


## single variable transformation approach to MCMC

Posted in Books, Statistics, Travel with tags , , , , on September 9, 2014 by xi'an

I read the newly arXived paper “On Single Variable Transformation Approach to Markov Chain Monte Carlo” by Dey and Bhattacharya on the pleasant train ride between Bristol and Coventry last weekend. The paper actually follows several earlier papers by the authors that I have not read in detail. The notion of single variable transform is to add plus or minus the same random noise to all components of the current value of the Markov chain, instead of the standard d-dimensional random walk proposal of the reference Metropolis-Hastings algorithm, namely all proposals are of the form

$x_i'=x_i\pm \epsilon\ i=1,\cdots,d$

meaning the chain proceeds [after acceptance] along one and only one of the d diagonals. The authors’ arguments are that (a) the proposal is cheaper and (b) the acceptance rate is higher. What I find questionable in this argument is that this does not directly matter in the evaluation of the performances of the algorithm. For instance, higher acceptance in a Metropolis-Hasting algorithm does not imply faster convergence and smaller asymptotic variance. (This goes without mentioning the fact that the comparative Figure 1 is so variable with the dimension as to be of limited worth. Figure 1 and 2 are also found in an earlier arXived paper of the authors.) For instance, restricting the moves along the diagonals of the Euclidean space implies that there is a positive probability to make two successive proposals along the same diagonal, which is a waste of time. When considering the two-dimensional case, joining two arbitrary points using an everywhere positive density g upon ε means generating two successive values from g, which is equivalent cost-wise to generating a single noise from a two-dimensional proposal. Without the intermediate step of checking the one-dimensional move along one diagonal. So much for a gain. In fine, the proposal found in this paper sums up as being a one-at-a-time version of a standard random walk Metropolis-Hastings algorithm.

## MCMC on zero measure sets

Posted in R, Statistics with tags , , , , , , , on March 24, 2014 by xi'an

Simulating a bivariate normal under the constraint (or conditional to the fact) that x²-y²=1 (a non-linear zero measure curve in the 2-dimensional Euclidean space) is not that easy: if running a random walk along that curve (by running a random walk on y and deducing x as x²=y²+1 and accepting with a Metropolis-Hastings ratio based on the bivariate normal density), the outcome differs from the target predicted by a change of variable and the proper derivation of the conditional. The above graph resulting from the R code below illustrates the discrepancy!

targ=function(y){
exp(-y^2)/(1.52*sqrt(1+y^2))}

T=10^5
Eps=3
ys=xs=rep(runif(1),T)
xs[1]=sqrt(1+ys[1]^2)
for (t in 2:T){
propy=runif(1,-Eps,Eps)+ys[t-1]
propx=sqrt(1+propy^2)
ace=(runif(1)<(dnorm(propy)*dnorm(propx))/
(dnorm(ys[t-1])*dnorm(xs[t-1])))
if (ace){
ys[t]=propy;xs[t]=propx
}else{
ys[t]=ys[t-1];xs[t]=xs[t-1]}}


  ace=(runif(1)<(dnorm(propy)*dnorm(propx)/propx)/
(dnorm(ys[t-1])*dnorm(xs[t-1])/xs[t-1]))


the fit is there. My open question is how to make this derivation generic, i.e. without requiring the (dreaded) computation of the (dreadful) Jacobian.

## sampling from time-varying log-concave distributions

Posted in Statistics, University life with tags , , , , , on October 2, 2013 by xi'an

Sasha Rakhlin from Wharton sent me this paper he wrote (and arXived) with Hariharan Narayanan on a specific Markov chain algorithm that handles sequential Monte Carlo problems for log-concave targets. By relying on novel (by my standards) mathematical techniques, they manage to obtain geometric ergodicity results for random-walk based algorithms and log-concave targets. One of the new tools is the notion of self-concordant barrier, a sort of convex potential function F associated with a reference convex set and with Lipschitz properties. The second tool is a Gaussian distribution based on the metric induced by F. The third is the Dikin walk Markov chain, which uses this Gaussian as proposal and moves almost like the Metropolis-Hastings algorithm, except that it rejects with at least a probability of ½. The scale (or step size) of the Gaussian proposal is determined by the regularity of the log-concave target. In that setting, the total variation distance between the target at the t-th level and the distribution of the Markov chain can be fairly precisely approximated. Which leads in turn to a scaling of the number of random walk steps that are necessary to ensure convergence. Depending on the pace of the moving target, a single step of the random walk may be sufficient, which is quite an interesting feature.

## making a random walk geometrically ergodic

Posted in R, Statistics with tags , , , , , , , , , on March 2, 2013 by xi'an

While a random walk Metropolis-Hastings algorithm cannot be uniformly ergodic in a general setting (Mengersen and Tweedie, AoS, 1996), because it needs more energy to leave far away starting points, it can be geometrically ergodic depending on the target (and the proposal). In a recent Annals of Statistics paper, Leif Johnson and Charlie Geyer designed a trick to turn a random walk Metropolis-Hastings algorithm into a geometrically ergodic random walk Metropolis-Hastings algorithm by virtue of an isotropic transform (under the provision that the original target density has a moment generating function). This theoretical result is complemented by an R package called mcmc. (I have not tested it so far, having read the paper in the métro.) The examples included in the paper are however fairly academic and I wonder how the method performs in practice, on truly complex models, in particular because the change of variables relies on (a) an origin and (b) changing the curvature of space uniformly in all dimensions. Nonetheless, the idea is attractive and reminds me of a project of ours with Randal Douc,  started thanks to the ‘Og and still under completion.

## Random dive MH

Posted in R, Statistics with tags , , , , on September 2, 2010 by xi'an

A new Metropolis-Hastings algorithm that I would call “universal” was posted by Somak Dutta yesterday on arXiv. Multiplicative random walk Metropolis-Hastings on the real line contains a different Metropolis-Hastings algorithm called the random dive. The proposed new value x’ given the current value x is defined by

$\begin{cases} x\epsilon &\text{with probability } 1/2\\ x/\epsilon &\text{with probability } 1/2\end{cases}$

when $\epsilon$ is a random variable on $(-1,1)$. Thus, at each iteration, the current value is either shrunk or expanded by a random amount. When I read the paper in the Paris metro, while appreciating that this algorithm could be geometrically ergodic as opposed to the classical random walk algorithm, I was not convinced by the practical impact of the method, as I thought that the special role of zero in the proposal was not always reflected in the target. Especially when considering that the proposal is parameter-free. However, after running the following R program on a target centred away from zero, I found the proposal quite efficient.

f=function(x){.5*dnorm(x,mean=14)+.5*dnorm(x,mean=35)}
Nsim=10^5
x=rep(5,Nsim)
for (t in 2:Nsim){
coef=runif(1,min=-1)^sample(c(-1,1),1)
prop=x[t-1]*coef
prob=abs(coef)*f(prop)/f(x[t-1])
x[t]=x[t-1]
if (runif(1)<prob) x[t]=prop
}
hist(x,pro=T,nclass=113,col=”wheat2″)