**I**ngmar Schuster, who visited Paris-Dauphine last Spring (and is soon to return here as a postdoc funded by Fondation des Sciences Mathématiques de Paris) has arXived last week a paper on gradient importance sampling. In this paper, he builds a sequential importance sampling (or population Monte Carlo) algorithm that exploits the additional information contained in the gradient of the target. The proposal or importance function being essentially the MALA move as its proposal, mixed across the elements of the previous population. When compared with our original PMC mixture of random walk proposals found in e.g. this paper, each term in the mixture thus involves an extra gradient, with a scale factor that decreases to zero as *1/t√t*. Ingmar compares his proposal with an adaptive Metropolis, an adaptive MALTa and an HM algorithms, for two mixture distributions and the banana target of Haario et al. (1999) we also used in our paper. As well as a logistic regression. In each case, he finds both a smaller squared error and a smaller bias for the same computing time (evaluated as the number of likelihood evaluations). While we discussed this scheme when he visited, I remain intrigued as to why it works so well when compared with the other solutions. One possible explanation is that the use of the gradient drift is more efficient on a population of particles than on a single Markov chain, provided the population covers all modes of importance on the target surface: the “fatal” attraction of the local model is then much less of an issue…

## Archive for Université Paris Dauphine

## gradient importance sampling

Posted in Books, pictures, Statistics, University life with tags adaptive importance sampling, Fondation Sciences Mathématiques de Paris, Langevin MCMC algorithm, Leipzig, population Monte Carlo, sequential Monte Carlo, Université Paris Dauphine on July 30, 2015 by xi'an## non-reversible MCMC

Posted in Books, Statistics, University life with tags arXiv, MCMC algorithms, Monte Carlo Statistical Methods, Ornstein-Uhlenbeck model, reversibility, Université Paris Dauphine, University of Warwick on May 21, 2015 by xi'an**W**hile visiting Dauphine, Natesh Pillai and Aaron Smith pointed out this interesting paper of Joris Bierkens (Warwick) that had escaped my arXiv watch/monitoring. The paper is about turning Metropolis-Hastings algorithms into non-reversible versions, towards improving mixing.

In a discrete setting, a way to produce a non-reversible move is to mix the proposal kernel Q with its time-reversed version Q’ and use an acceptance probability of the form

where ε is any weight. This construction is generalised in the paper to any vorticity (skew-symmetric with zero sum rows) matrix Γ, with the acceptance probability

where ε is small enough to ensure all numerator values are non-negative. This is a rather annoying assumption in that, except for the special case derived from the time-reversed kernel, it has to be checked over all pairs (x,y). (I first thought it also implied the normalising constant of π but everything can be set in terms of the unormalised version of π, Γ or ε included.) The paper establishes that the new acceptance probability preserves π as its stationary distribution. An alternative construction is to make the proposal change from Q in H such that H(x,y)=Q(x,y)+εΓ(x,y)/π(x). Which seems more pertinent as not changing the proposal cannot improve that much the mixing behaviour of the chain. Still, the move to the non-reversible versions has the noticeable plus of decreasing the asymptotic variance of the Monte Carlo estimate for any integrable function. Any. (Those results are found in the physics literature of the 2000’s.)

The extension to the continuous case is a wee bit more delicate. One needs to find an anti-symmetric vortex function g with zero integral [equivalent to the row sums being zero] such that g(x,y)+π(y)q(y,x)>0 and with same support as π(x)q(x,y) so that the acceptance probability of g(x,y)+π(y)q(y,x)/π(x)q(x,y) leads to π being the stationary distribution. Once again g(x,y)=ε(π(y)q(y,x)-π(x)q(x,y)) is a natural candidate but it is unclear to me why it should work. As the paper only contains one illustration for the discretised Ornstein-Uhlenbeck model, with the above choice of g for a small enough ε (a point I fail to understand since any ε<1 should provide a positive g(x,y)+π(y)q(y,x)), it is also unclear to me that this modification (i) is widely applicable and (ii) is relevant for genuine MCMC settings.

## from my office

Posted in pictures, Travel, University life with tags bois de Boulogne, Fondation Louis Vuiton, La Défense, Paris, Université Paris Dauphine on April 10, 2015 by xi'an## trans-dimensional nested sampling and a few planets

Posted in Books, Statistics, Travel, University life with tags birth-and-death process, Chamonix, exoplanet, label switching, métro, nested sampling, Paris, RER B, reversible jump, Université Paris Dauphine on March 2, 2015 by xi'an**T**his morning, in the train to Dauphine (train that was even more delayed than usual!), I read a recent arXival of Brendon Brewer and Courtney Donovan. Entitled Fast Bayesian inference for exoplanet discovery in radial velocity data, the paper suggests to associate Matthew Stephens’ (2000) birth-and-death MCMC approach with nested sampling to infer about the number N of exoplanets in an exoplanetary system. The paper is somewhat sparse in its description of the suggested approach, but states that the birth-date moves involves adding a planet with parameters simulated from the prior and removing a planet at random, both being accepted under a likelihood constraint associated with nested sampling. I actually wonder if this actually is the birth-date version of Peter Green’s (1995) RJMCMC rather than the continuous time birth-and-death process version of Matthew…

“The traditional approach to inferring N also contradicts fundamental ideas in Bayesian computation. Imagine we are trying to compute the posterior distribution for a parameter a in the presence of a nuisance parameter b. This is usually solved by exploring the joint posterior for a and b, and then only looking at the generated values of a. Nobody would suggest the wasteful alternative of using a discrete grid of possible a values and doing an entire Nested Sampling run for each, to get the marginal likelihood as a function of a.”

This criticism is receivable when there is a huge number of possible values of N, even though I see no fundamental contradiction with my ideas about Bayesian computation. However, it is more debatable when there are a few possible values for N, given that the exploration of the augmented space by a RJMCMC algorithm is often very inefficient, in particular when the proposed parameters are generated from the prior. The more when nested sampling is involved and simulations are run under the likelihood constraint! In the astronomy examples given in the paper, N never exceeds 15… Furthermore, by merging all N’s together, it is unclear how the evidences associated with the various values of N can be computed. At least, those are not reported in the paper.

The paper also omits to provide the likelihood function so I do not completely understand where “label switching” occurs therein. My first impression is that this is not a mixture model. However if the observed signal (from an exoplanetary system) is the sum of N signals corresponding to N planets, this makes more sense.

## post-grading weekend

Posted in Kids, pictures, Statistics, University life with tags Bayesian statistics, copies, final exam, grading, mathematical statistics, MLE, Université Paris Dauphine on January 19, 2015 by xi'an**N**ow my grading is over, I can reflect on the unexpected difficulties in the mathematical statistics exam. I knew that the first question in the multiple choice exercise, borrowed from Cross Validation, was going to be quasi-impossible and indeed only one student out of 118 managed to find the right solution. More surprisingly, most students did not manage to solve the (absence of) MLE when observing that n unobserved exponential Exp(λ) were larger than a fixed bound δ. I was also amazed that they did poorly on a N(0,σ²) setup, failing to see that

and determine an unbiased estimator that can be improved by Rao-Blackwellisation. No student reached the conditioning part. And a rather frequent mistake more understandable due to the limited exposure they had to Bayesian statistics: many confused parameter λ with observation x in the prior, writing

instead of

hence could not derive a proper posterior.