Ingmar Schuster, who visited Paris-Dauphine last Spring (and is soon to return here as a postdoc funded by Fondation des Sciences Mathématiques de Paris) has arXived last week a paper on gradient importance sampling. In this paper, he builds a sequential importance sampling (or population Monte Carlo) algorithm that exploits the additional information contained in the gradient of the target. The proposal or importance function being essentially the MALA move as its proposal, mixed across the elements of the previous population. When compared with our original PMC mixture of random walk proposals found in e.g. this paper, each term in the mixture thus involves an extra gradient, with a scale factor that decreases to zero as 1/t√t. Ingmar compares his proposal with an adaptive Metropolis, an adaptive MALTa and an HM algorithms, for two mixture distributions and the banana target of Haario et al. (1999) we also used in our paper. As well as a logistic regression. In each case, he finds both a smaller squared error and a smaller bias for the same computing time (evaluated as the number of likelihood evaluations). While we discussed this scheme when he visited, I remain intrigued as to why it works so well when compared with the other solutions. One possible explanation is that the use of the gradient drift is more efficient on a population of particles than on a single Markov chain, provided the population covers all modes of importance on the target surface: the “fatal” attraction of the local model is then much less of an issue…
Archive for Université Paris Dauphine
While visiting Dauphine, Natesh Pillai and Aaron Smith pointed out this interesting paper of Joris Bierkens (Warwick) that had escaped my arXiv watch/monitoring. The paper is about turning Metropolis-Hastings algorithms into non-reversible versions, towards improving mixing.
In a discrete setting, a way to produce a non-reversible move is to mix the proposal kernel Q with its time-reversed version Q’ and use an acceptance probability of the form
where ε is any weight. This construction is generalised in the paper to any vorticity (skew-symmetric with zero sum rows) matrix Γ, with the acceptance probability
where ε is small enough to ensure all numerator values are non-negative. This is a rather annoying assumption in that, except for the special case derived from the time-reversed kernel, it has to be checked over all pairs (x,y). (I first thought it also implied the normalising constant of π but everything can be set in terms of the unormalised version of π, Γ or ε included.) The paper establishes that the new acceptance probability preserves π as its stationary distribution. An alternative construction is to make the proposal change from Q in H such that H(x,y)=Q(x,y)+εΓ(x,y)/π(x). Which seems more pertinent as not changing the proposal cannot improve that much the mixing behaviour of the chain. Still, the move to the non-reversible versions has the noticeable plus of decreasing the asymptotic variance of the Monte Carlo estimate for any integrable function. Any. (Those results are found in the physics literature of the 2000’s.)
The extension to the continuous case is a wee bit more delicate. One needs to find an anti-symmetric vortex function g with zero integral [equivalent to the row sums being zero] such that g(x,y)+π(y)q(y,x)>0 and with same support as π(x)q(x,y) so that the acceptance probability of g(x,y)+π(y)q(y,x)/π(x)q(x,y) leads to π being the stationary distribution. Once again g(x,y)=ε(π(y)q(y,x)-π(x)q(x,y)) is a natural candidate but it is unclear to me why it should work. As the paper only contains one illustration for the discretised Ornstein-Uhlenbeck model, with the above choice of g for a small enough ε (a point I fail to understand since any ε<1 should provide a positive g(x,y)+π(y)q(y,x)), it is also unclear to me that this modification (i) is widely applicable and (ii) is relevant for genuine MCMC settings.
This morning, in the train to Dauphine (train that was even more delayed than usual!), I read a recent arXival of Brendon Brewer and Courtney Donovan. Entitled Fast Bayesian inference for exoplanet discovery in radial velocity data, the paper suggests to associate Matthew Stephens’ (2000) birth-and-death MCMC approach with nested sampling to infer about the number N of exoplanets in an exoplanetary system. The paper is somewhat sparse in its description of the suggested approach, but states that the birth-date moves involves adding a planet with parameters simulated from the prior and removing a planet at random, both being accepted under a likelihood constraint associated with nested sampling. I actually wonder if this actually is the birth-date version of Peter Green’s (1995) RJMCMC rather than the continuous time birth-and-death process version of Matthew…
“The traditional approach to inferring N also contradicts fundamental ideas in Bayesian computation. Imagine we are trying to compute the posterior distribution for a parameter a in the presence of a nuisance parameter b. This is usually solved by exploring the joint posterior for a and b, and then only looking at the generated values of a. Nobody would suggest the wasteful alternative of using a discrete grid of possible a values and doing an entire Nested Sampling run for each, to get the marginal likelihood as a function of a.”
This criticism is receivable when there is a huge number of possible values of N, even though I see no fundamental contradiction with my ideas about Bayesian computation. However, it is more debatable when there are a few possible values for N, given that the exploration of the augmented space by a RJMCMC algorithm is often very inefficient, in particular when the proposed parameters are generated from the prior. The more when nested sampling is involved and simulations are run under the likelihood constraint! In the astronomy examples given in the paper, N never exceeds 15… Furthermore, by merging all N’s together, it is unclear how the evidences associated with the various values of N can be computed. At least, those are not reported in the paper.
The paper also omits to provide the likelihood function so I do not completely understand where “label switching” occurs therein. My first impression is that this is not a mixture model. However if the observed signal (from an exoplanetary system) is the sum of N signals corresponding to N planets, this makes more sense.
Today was the final session of our Reading Classics Seminar for the academic year 2014-2015. I have not reported on this seminar much so far because it has had starting problems, namely hardly any student present on the first classes and therefore several re-starts until we reached a small group of interested students. And this is truly The End for this enjoyable experiment as this is the final year for my TSI Master at Paris-Dauphine, as it will become integrated within the new MASH Master next year.
As a last presentation for the entire series, my student picked John Skilling’s Nested Sampling, not that it was in my list of “classics”, but he had worked on the paper in a summer project and was thus reasonably fluent with the topic. As he did a good enough job (!), here are his slides.
Some of the questions that came to me during the talk were on how to run nested sampling sequentially, both in the data and in the number of simulated points, and on incorporating more deterministic moves in order to remove some of the Monte Carlo variability. I was about to ask about (!) the Hamiltonian version of nested sampling but then he mentioned his last summer internship on this very topic! I also realised during that talk that the formula (for positive random variables)
does not require absolute continuity of the distribution F.
Now my grading is over, I can reflect on the unexpected difficulties in the mathematical statistics exam. I knew that the first question in the multiple choice exercise, borrowed from Cross Validation, was going to be quasi-impossible and indeed only one student out of 118 managed to find the right solution. More surprisingly, most students did not manage to solve the (absence of) MLE when observing that n unobserved exponential Exp(λ) were larger than a fixed bound δ. I was also amazed that they did poorly on a N(0,σ²) setup, failing to see that
and determine an unbiased estimator that can be improved by Rao-Blackwellisation. No student reached the conditioning part. And a rather frequent mistake more understandable due to the limited exposure they had to Bayesian statistics: many confused parameter λ with observation x in the prior, writing
hence could not derive a proper posterior.