non-reversible MCMC

Posted in Books, Statistics, University life on May 21, 2015 by xi'an

While visiting Dauphine, Natesh Pillai and Aaron Smith pointed out this interesting paper of Joris Bierkens (Warwick) that had escaped my arXiv watch/monitoring. The paper is about turning Metropolis-Hastings algorithms into non-reversible versions, towards improving mixing.

In a discrete setting, a way to produce a non-reversible move is to mix the proposal kernel Q with its time-reversed version Q’ and use an acceptance probability of the form

$\{\epsilon\pi(y)Q(y,x)+(1-\epsilon)\pi(x)Q(x,y)\} \big/ \pi(x)Q(x,y)$

where ε is any weight. This construction is generalised in the paper to any vorticity (skew-symmetric with zero sum rows) matrix Γ, with the acceptance probability

$\{\epsilon\Gamma(x,y)+\pi(y)Q(y,x)\}\big/\pi(x)Q(x,y)$

where ε is small enough to ensure all numerator values are non-negative. This is a rather annoying assumption in that, except for the special case derived from the time-reversed kernel, it has to be checked over all pairs (x,y). (I first thought it also involved the normalising constant of π but everything can be set in terms of the unnormalised version of π, Γ or ε included.) The paper establishes that the new acceptance probability preserves π as its stationary distribution. An alternative construction is to change the proposal from Q to H such that H(x,y)=Q(x,y)+εΓ(x,y)/π(x). Which seems more pertinent, as leaving the proposal unchanged cannot improve the mixing behaviour of the chain by much. Still, the move to the non-reversible versions has the noticeable plus of decreasing the asymptotic variance of the Monte Carlo estimate for any integrable function. Any. (Those results are found in the physics literature of the 2000’s.)
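The discrete construction can be checked numerically. Below is a minimal R sketch, assuming a hypothetical five-state target on a cycle, a symmetric nearest-neighbour proposal Q, and a Γ that circulates clockwise (+1) and anticlockwise (-1). None of these choices come from the paper; they are just the simplest vorticity example, with the acceptance probability additionally capped at one. The sketch builds the transition matrix and checks that π is stationary while detailed balance fails.

```r
# Toy setting (all values hypothetical): 5 states on a cycle
n <- 5
target <- c(0.1, 0.3, 0.2, 0.25, 0.15)      # pi
nxt <- function(i) i %% n + 1               # clockwise neighbour
prv <- function(i) (i - 2) %% n + 1         # anticlockwise neighbour

Q <- matrix(0, n, n)                        # symmetric nearest-neighbour proposal
Gam <- matrix(0, n, n)                      # vorticity: skew-symmetric, zero row sums
for (i in 1:n) {
  Q[i, nxt(i)] <- Q[i, prv(i)] <- 0.5
  Gam[i, nxt(i)] <- 1
  Gam[i, prv(i)] <- -1
}

eps <- 0.9 * min(target) / 2                # small enough: all numerators stay >= 0
num <- eps * Gam + t(target * Q)            # eps*Gamma(x,y) + pi(y)Q(y,x)
den <- target * Q                           # pi(x)Q(x,y)

A <- matrix(0, n, n)                        # acceptance probabilities, capped at 1
ok <- den > 0
A[ok] <- pmin(1, num[ok] / den[ok])

P <- Q * A                                  # off-diagonal transition probabilities
diag(P) <- 1 - rowSums(P)                   # rejected moves stay put

flow <- target * P                          # probability flux pi(x)P(x,y)
max(abs(drop(target %*% P) - target))       # numerically zero: pi is stationary
max(abs(flow - t(flow)))                    # equals eps: detailed balance fails
```

The net flux between neighbouring states is exactly εΓ(x,y), which is why the zero row sums of Γ are what keeps π invariant.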

The extension to the continuous case is a wee bit more delicate. One needs to find an anti-symmetric vortex function g with zero integral [equivalent to the row sums being zero] such that g(x,y)+π(y)q(y,x)>0 and with the same support as π(x)q(x,y) so that the acceptance probability of {g(x,y)+π(y)q(y,x)}/π(x)q(x,y) leads to π being the stationary distribution. Once again g(x,y)=ε(π(y)q(y,x)-π(x)q(x,y)) is a natural candidate but it is unclear to me why it should work. As the paper only contains one illustration for the discretised Ornstein-Uhlenbeck model, with the above choice of g for a small enough ε (a point I fail to understand since any ε<1 should provide a positive g(x,y)+π(y)q(y,x)), it is also unclear to me that this modification (i) is widely applicable and (ii) is relevant for genuine MCMC settings.

Cauchy Distribution: Evil or Angel?

Posted in Books, pictures, Running, Statistics, Travel, University life, Wines on May 19, 2015 by xi'an

Natesh Pillai and Xiao-Li Meng just arXived a short paper that solves the Cauchy conjecture of Drton and Xiao [I mentioned last year at JSM], namely that, when considering two normal vectors with generic variance matrix S, a weighted average of the ratios X/Y remains Cauchy(0,1), just as in the iid S=I case. Even when the weights are random. The fascinating side of this now resolved (!) conjecture is that the correlation between the terms does not seem to matter. Pushing the correlation to one [assuming it is meaningful, which is a suspension of belief!, since there is no standard correlation for Cauchy variates] leads to a paradox: all terms are equal and yet… it works: we recover a single term, which again is Cauchy(0,1). All that remains thus to prove is that it stays Cauchy(0,1) between those two extremes, a weird kind of intermediary values theorem!
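The claim is easy to probe by simulation. Here is a minimal R sketch, assuming the two vectors X and Y are independent draws from N(0,S) and that the weights are fixed, non-negative, and sum to one. The matrix S, the weights w, and the sample size are arbitrary illustrative choices.

```r
set.seed(1)
m <- 3; n <- 1e4
# Hypothetical positive definite covariance matrix with strong dependence
S <- matrix(c( 1,  .9,  .5,
              .9,   2, -.3,
              .5, -.3, 1.5), m, m)
R <- chol(S)                                  # upper triangular, t(R) %*% R = S
X <- matrix(rnorm(n * m), n, m) %*% R         # each row is one N(0, S) vector
Y <- matrix(rnorm(n * m), n, m) %*% R         # independent copy
w <- c(.5, .3, .2)                            # fixed weights summing to one
avg <- drop((X / Y) %*% w)                    # weighted average of the m ratios
ks <- ks.test(avg, "pcauchy")                 # compare with a standard Cauchy
ks$statistic
```

A Kolmogorov-Smirnov distance of the order of 1/√n is what a genuine Cauchy(0,1) sample would produce. This is of course a sanity check and not a proof.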

Actually, Natesh and XL further prove an inverse χ² theorem: the inverse of the normal vector, renormalised into a quadratic form, is an inverse χ² no matter what its covariance matrix. The proof of this amazing theorem relies on a spherical representation of the bivariate Gaussian (also underlying the Box-Muller algorithm). The angles are then jointly distributed as

$\exp\{-\sum_{i,j}\alpha_{ij}\cos(\theta_i-\theta_j)\}$

and from there follows the argument that conditional on the differences between the θ’s, all ratios are Cauchy distributed. Hence the conclusion!

A question that stems from reading this version of the paper is whether this property extends to other formats of non-independent Cauchy variates. Somewhat connected to my recent post about generating correlated variates from arbitrary distributions: using the inverse cdf transform of a Gaussian copula shows this is possibly the case: the following code is meaningless in that the empirical correlation has no connection with a “true” correlation, but nonetheless the experiment seems of interest…

> ro=.999999;x=matrix(rnorm(2e4),ncol=2);y=ro*x+sqrt(1-ro^2)*matrix(rnorm(2e4),ncol=2)
> cor(x[,1]/x[,2],y[,1]/y[,2])
[1] -0.1351967
> ro=.99999999;x=matrix(rnorm(2e4),ncol=2);y=ro*x+sqrt(1-ro^2)*matrix(rnorm(2e4),ncol=2)
> cor(x[,1]/x[,2],y[,1]/y[,2])
[1] 0.8622714
> ro=1-1e-5;x=matrix(rnorm(2e4),ncol=2);y=ro*x+sqrt(1-ro^2)*matrix(rnorm(2e4),ncol=2)
> z=qcauchy(pnorm(as.vector(x)));w=qcauchy(pnorm(as.vector(y)))
> cor(x=z,y=w)
[1] 0.9999732
> ks.test((z+w)/2,"pcauchy")

One-sample Kolmogorov-Smirnov test

data:  (z + w)/2
D = 0.0068, p-value = 0.3203
alternative hypothesis: two-sided
> ro=1-1e-3;x=matrix(rnorm(2e4),ncol=2);y=ro*x+sqrt(1-ro^2)*matrix(rnorm(2e4),ncol=2)
> z=qcauchy(pnorm(as.vector(x)));w=qcauchy(pnorm(as.vector(y)))
> cor(x=z,y=w)
[1] 0.9920858
> ks.test((z+w)/2,"pcauchy")

One-sample Kolmogorov-Smirnov test

data:  (z + w)/2
D = 0.0036, p-value = 0.9574
alternative hypothesis: two-sided


arbitrary distributions with set correlation

Posted in Books, Kids, pictures, R, Statistics, University life on May 11, 2015 by xi'an

A question recently posted on X Validated by Antoni Parrelada: given two arbitrary cdfs F and G, how can we simulate a pair (X,Y) with marginals F and G, and with set correlation ρ? The answer posted by Antoni Parrelada was to reproduce the Gaussian copula solution: produce (X’,Y’) as a Gaussian bivariate vector with correlation ρ and then turn it into (X,Y)=(F⁻¹(Φ(X’)),G⁻¹(Φ(Y’))). Unfortunately, this does not work, because the correlation is not preserved under the double transform. The graph above is part of my answer for a χ² and a log-Normal cdf for F and G: while corr(X’,Y’)=ρ, corr(X,Y) drifts quite a lot from the diagonal! Actually, by playing long enough with my function

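# correlation of (fx(U),fy(V)) when (U,V) follows a Gaussian copula with correlation rho
tacor=function(rho=0,nsim=1e4,fx=qnorm,fy=qnorm)
{
  x1=rnorm(nsim);x2=rnorm(nsim) # independent N(0,1) draws, reused for every rho
  coeur=rho # output vector, same length as rho
  rho2=sqrt(1-rho^2)
  for (t in seq_along(rho)){
    # pair of uniforms with Gaussian correlation rho[t]
    y=pnorm(cbind(x1,rho[t]*x1+rho2[t]*x2))
    coeur[t]=cor(fx(y[,1]),fy(y[,2]))}
  return(coeur)
}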


Playing further, I managed to get an almost flat correlation graph for the admittedly convoluted call

tacor(seq(-1,1,.01),
fx=function(x) qchisq(x^59,df=.01),
fy=function(x) qlogis(x^59))


Now, the most interesting question is how to produce correlated simulations. A pedestrian way is to start with a copula, e.g. the above Gaussian copula, and to twist the correlation coefficient ρ of the copula until the desired correlation is attained for the transformed pair. That is, to draw the above curve and invert it. (Note that, as clearly exhibited by the graph just above, not all desired correlations can be achieved for arbitrary cdfs F and G.) This is however very pedestrian and I wonder whether or not there is a generic and somewhat automated solution…
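The pedestrian inversion can be sketched in a few lines of R. This is only an illustration under stated assumptions: the helper `copula_cor`, the χ² degrees of freedom, the log-Normal parameters, and the target value 0.5 are all hypothetical choices. The normal draws are kept fixed across values of ρ so that the empirical correlation is a continuous function of ρ, which lets `uniroot` find the root.

```r
set.seed(42)
nsim <- 1e4
z1 <- rnorm(nsim); z2 <- rnorm(nsim)          # fixed draws, reused for every rho

# Empirical correlation of the transformed pair for a copula correlation rho
copula_cor <- function(rho, fx, fy) {
  u <- pnorm(z1)
  v <- pnorm(rho * z1 + sqrt(1 - rho^2) * z2)
  cor(fx(u), fy(v))
}

fx <- function(u) qchisq(u, df = 3)           # hypothetical chi-square marginal
fy <- function(u) qlnorm(u)                   # standard log-Normal marginal
goal <- 0.5                                   # desired correlation of (X, Y)

# Invert the curve rho -> corr(X, Y); fails if goal is outside the attainable range
sol <- uniroot(function(r) copula_cor(r, fx, fy) - goal,
               interval = c(-0.99, 0.99))
rho_star <- sol$root
achieved <- copula_cor(rho_star, fx, fy)
c(rho_star = rho_star, achieved = achieved)
```

Since the curve is estimated from a finite sample, the ρ returned is itself an estimate, and `uniroot` stops with an error when the target correlation cannot be bracketed on the interval.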

Posted in Books, Kids, R, Statistics, University life on April 30, 2015 by xi'an

Another occurrence [out of many!] of a question on X validated where the originator (primitivus petitor) was trying to get an explanation without the proper background. On either Bayesian statistics or simulation. The introductory sentence to the question was about “trying to understand how the choice of priors affects a Bayesian model estimated using MCMC” but the bulk of the question was in fact failing to understand an R code for a random-walk Metropolis-Hastings algorithm for a simple regression model provided in an introductory blog by Florian Hartig. And even more precisely about confusing the R code dnorm(b, sd = 5, log = T) in the prior with rnorm(1,mean=b, sd = 5) in the proposal…
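To make the distinction concrete, here is a minimal random-walk Metropolis sketch in R. It is not Florian Hartig's code: the toy data, the single slope parameter b, the unit noise variance, and the proposal scale 0.1 are all assumptions made for illustration. The point is only that `dnorm` evaluates a (log-)density at the current value, while `rnorm` draws a new candidate value.

```r
set.seed(1)
x <- seq(-5, 5, length.out = 31)              # hypothetical covariate
y <- 2 * x + rnorm(31)                        # toy data with true slope 2

log_post <- function(b) {
  sum(dnorm(y, mean = b * x, sd = 1, log = TRUE)) +   # log-likelihood
    dnorm(b, sd = 5, log = TRUE)                      # log-prior: density evaluated at b
}

niter <- 5000
chain <- numeric(niter)
b <- 0                                        # starting value
for (t in 1:niter) {
  prop <- rnorm(1, mean = b, sd = 0.1)        # proposal: random draw centred at b
  # accept with probability min(1, posterior ratio), computed on the log scale
  if (log(runif(1)) < log_post(prop) - log_post(b)) b <- prop
  chain[t] <- b
}
mean(chain[-(1:1000)])                        # posterior mean estimate after burn-in
```

The prior only enters through its density in `log_post`, and the proposal only enters through the simulated candidate, since a symmetric random walk cancels out of the acceptance ratio.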

“You should definitely invest some time in learning the bases of Bayesian statistics and MCMC methods from textbooks or on-line courses.” X

So I started my answer with the above warning. Which sums up my feelings about many of those X validated questions, namely that primitivi petitores lack the most basic background to consider such questions. Obviously, I should not have bothered with an answer, but it was late at night after a long day, a good meal at the pub in Kenilworth, and a broken toe still bothering me. So I got this reply from the primitivus petitor that it was a patronizing piece of advice and he prefers to learn from R code than from textbooks and on-line courses, having “looked through a number of textbooks”. Good luck with this endeavour then!

vertical likelihood Monte Carlo integration

Posted in Books, pictures, Running, Statistics, Travel, University life on April 17, 2015 by xi'an

A few months ago, Nick Polson and James Scott arXived a paper on one of my favourite problems, namely the approximation of normalising constants (and it went way under my radar, as I only became aware of it quite recently!, then it remained in my travel bag for an extra few weeks…). The method for approximating the constant Z draws from an analogy with the energy level sampling methods found in physics, like the Wang-Landau algorithm. The authors rely on a one-dimensional slice sampling representation of the posterior distribution and [main innovation in the paper] add a weight function on the auxiliary uniform. The choice of the weight function links the approach with the dreaded harmonic estimator (!), but also with power-posterior and bridge sampling. The paper recommends a specific weighting function, based on a “score-function heuristic” I do not get. Further, the optimal weight depends on intractable cumulative functions as in nested sampling. It would be fantastic if one could draw directly from the prior distribution of the likelihood function—rather than draw an x [from the prior or from something better, as suggested in our 2009 Biometrika paper] and transform it into L(x)—but as in all existing alternatives this alas is not the case. (Which is why I find the recommendations in the paper for practical implementation rather impractical, since, were the prior cdf of L(X) available, direct simulation of L(X) would be feasible. Maybe not the optimal choice though.)

“What is the distribution of the likelihood ordinates calculated via nested sampling? The answer is surprising: it is essentially the same as the distribution of likelihood ordinates by recommended weight function from Section 4.”

The approach is thus very much related to nested sampling, at least in spirit. As the authors later demonstrate, nested sampling is another case of weighting. Both versions require simulations under truncated likelihood values. Albeit with a possibility of going down [in likelihood values] with the current version. Actually, more weighting could prove [more] efficient as both the original nested and vertical sampling simulate from the prior under the likelihood constraint. Getting away from the prior should help. (I am quite curious to see how the method is received and applied.)

MCMskv, Lenzerheide, Jan. 5-7, 2016

Posted in Kids, Mountains, pictures, R, Statistics, Travel, University life on March 31, 2015 by xi'an

Following the highly successful [authorised opinion!, from objective sources] MCMski IV, in Chamonix last year, the BayesComp section of ISBA has decided in favour of a two-year period, which means the great item of news that next year we will meet again for MCMski V [or MCMskv for short], this time on the snowy slopes of the Swiss town of Lenzerheide, south of Zürich. The committees are headed by the indefatigable Antonietta Mira and Mark Girolami. The plenary speakers have already been contacted and Steve Scott (Google), Steve Fienberg (CMU), David Dunson (Duke), Krys Latuszynski (Warwick), and Tony Lelièvre (Mines, Paris), have agreed to talk. Similarly, the nine invited sessions have been selected and will include Hamiltonian Monte Carlo, Algorithms for Intractable Problems (ABC included!), Theory of (Ultra)High-Dimensional Bayesian Computation, Bayesian NonParametrics, Bayesian Econometrics, Quasi Monte Carlo, Statistics of Deep Learning, Uncertainty Quantification in Mathematical Models, and Biostatistics. There will be afternoon tutorials, including a practical session from the Stan team (tutorials for which the call is open), poster sessions, and a conference dinner at which we will be entertained by the unstoppable Imposteriors. The Richard Tweedie ski race is back as well, with a pair of Blossom skis for the winner!

As in Chamonix, there will be parallel sessions and hence the scientific committee has issued a call for proposals to organise contributed sessions, tutorials and the presentation of posters on particularly timely and exciting areas of research relevant and of current interest to Bayesian Computation. All proposals should be sent to Mark Girolami directly by May the 4th (be with him!).

Hamiltonian ABC

Posted in Books, pictures, Statistics, University life on March 13, 2015 by xi'an

On Monday, Ed Meeds, Robert Leenders, and Max Welling (from Amsterdam) arXived a paper entitled Hamiltonian ABC. Before looking at the paper in any detail, I got puzzled by this association of antagonistic terms, since ABC is intended for complex and mostly intractable likelihoods, while Hamiltonian Monte Carlo requires a lot from the target, in order to compute gradients and Hessians… [Warning: some graphs on pages 13-14 may be harmful to your printer!]

Somewhat obviously (ex-post!), the paper suggests using Hamiltonian dynamics on ABC approximations of the likelihood. They compare a Gaussian kernel version

$\frac{1}{S}\sum_{s=1}^S \varphi(y^\text{obs}-x_s(\theta);\epsilon^2)$

with the synthetic Gaussian likelihood version of Wood (2010)

$\varphi(y^\text{obs}-\mu(\theta);\sigma(\theta)^2+\epsilon^2)$

where both mean and variance are estimated from the simulated data. If ε is taken as an external quantity and driven to zero, the second approach is much more stable. But… ε is never driven to zero in ABC, or fixed at ε=0.37: it is instead considered as a kernel bandwidth and hence estimated from the simulated data. Hence ε is commensurable with σ(θ). And this makes me wonder at the relevance of the conclusion that synthetic is better than kernel for Hamiltonian ABC. More globally, I wonder at the relevance of better simulating from a still approximate target when the true goal is to better approximate the genuine posterior.

Some of the paper covers separate issues like handling gradient by finite differences à la Spall [if you can afford it!] and incorporating the random generator as part of the Markov chain. And using S common random numbers in computing the gradients for all values of θ. (Although I am not certain all random generators can be represented as a deterministic transform of a parameter θ and of a fixed number of random uniforms. But the authors may consider a random number of random uniforms when they represent their random generators as a deterministic transform of a parameter θ and of the random seed. I am also uncertain about the distinction between common, sticky, and persistent random numbers!)
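The two likelihood approximations compared above can be put side by side on a toy problem. This is a minimal R sketch and not the authors' implementation. It assumes a hypothetical simulator x_s(θ)=θ+z_s with standard normal noise, an observation y_obs=0.3, and a fixed bandwidth ε=0.5. The z_s are drawn once and reused for every θ, which is the common random numbers idea: the simulator becomes a deterministic transform of θ and both estimates are smooth in θ.

```r
set.seed(3)
S <- 1e4
z <- rnorm(S)                                 # common random numbers, drawn once
sim_data <- function(theta) theta + z         # hypothetical simulator x_s(theta)
y_obs <- 0.3                                  # hypothetical observation
eps <- 0.5                                    # fixed bandwidth, for illustration only

# Gaussian kernel version: (1/S) sum_s phi(y_obs - x_s(theta); eps^2)
kernel_lik <- function(theta)
  mean(dnorm(y_obs - sim_data(theta), sd = eps))

# Synthetic likelihood version: phi(y_obs - mu(theta); sigma(theta)^2 + eps^2)
synth_lik <- function(theta) {
  x <- sim_data(theta)
  dnorm(y_obs - mean(x), sd = sqrt(var(x) + eps^2))
}

# Exact value in this toy case, where the simulator is N(theta, 1)
exact_lik <- function(theta) dnorm(y_obs - theta, sd = sqrt(1 + eps^2))

theta <- c(-1, 0, 1)
cbind(theta,
      kernel = sapply(theta, kernel_lik),
      synthetic = sapply(theta, synth_lik),
      exact = exact_lik(theta))
```

In this Gaussian toy case both versions target the same ε-smoothed likelihood, so the comparison says nothing about which one behaves better when the simulator is far from normal.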