## independent Metropolis-Hastings

Posted in Books, Statistics with tags , , , , , , on November 24, 2015 by xi'an

“In this paper we have demonstrated the potential benefits, both theoretical and practical, of the independence sampler over the random walk Metropolis algorithm.”

Peter Neal and Tsun Man Clement Lee arXived a paper on optimising the independent Metropolis-Hastings algorithm. I was a bit surprised at this “return” of the independent sampler, which I hardly mention in my lectures, so I had a look at the paper. The goal is to produce an equivalent to what Gelman, Gilks and Wild (1996) obtained for random walk samplers.  In the formal setting when the target is a product of n identical densities f, the optimal number k of components to update in one Metropolis-Hastings (within Gibbs) round is approximately 2.835/I, where I is the symmetrised Kullback-Leibler distance between the (univariate) target f and the independent proposal q. When I is finite. The most surprising part is that the optimal acceptance rate is again 0.234, as in the random walk case. This is surprising in that I usually associate the independent Metropolis-Hastings algorithm with high acceptance rates. But this is of course when calibrating the proposal q, not the block size k of the Gibbs part. Hence, while this calibration of the independent Metropolis-within-Gibbs sampler is worth the study and almost automatically applicable, it remains that it only applies to a certain category of problems where blocking can take place. As in the disease models illustrating the paper. And requires an adequate choice of proposal distribution for, otherwise, the above quote becomes inappropriate.

## bouncy particle sampler

Posted in pictures, Statistics, Travel, University life with tags , , , , , , , on October 30, 2015 by xi'an

Alexandre Bouchard-Coté, Sebastian Vollmer and Arnaud Doucet just arXived a paper with the above title, which reminded me of a proposal Kerrie Mengersen and I made at Valencia 7, in Tenerife, the [short-lived!] pinball sampler. This sampler was a particle (MCMC) sampler where we used the location of the other particles to avoid their neighbourhood, by bouncing away from them according to a delayed rejection principle, with an overall Gibbs justification since the resulting target was the product of copies of the target distribution. The difficulty in implementing the (neat!) idea was in figuring out the amount of bouncing or, in more physical terms, the energy allocated to the move.

In the current paper, inspired from an earlier paper in physics, the Markov chain (or single particle) evolves by linear moves, changing directions according to a Poisson process, with intensity and direction depending on the target distribution. A local version takes advantage of a decomposition of the target into a product of terms involving only some components of the whole parameter to be simulated. And hence allowing for moves in subspaces. An extension proposed by the authors is to bounce along the Hamiltonian isoclines. The method is demonstrably ergodic and irreducible. In practice, I wonder at the level of calibration or preliminary testing required to facilitate the exploration of the parameter space, particularly in the local version that seems to multiply items to be calibrated.

## asynchronous distributed Gibbs sampling

Posted in Books, Statistics, Travel, University life with tags , , , , , , , on October 13, 2015 by xi'an

Alexander Terenin, Dan Simpson, and David Draper just arXived a paper on an alternative brand of Gibbs sampler, which they think can revolutionise the sampler and overcome its well-known bottlenecks. David had sent me the paper in advance and thus I had time to read it in the plane to Calgary. (This is also the very first paper I see acknowledging a pair of trousers..! With no connection whatsoever with bottlenecks!)

“Note that not all updates that are sent will be received by all other workers: because of network traffic congestion and other types of failures, a significant portion of the updates will be lost along the way.”

The approach is inherently parallel in that several “workers” (processors or graphical units) run Gibbs samplers in parallel, using their current knowledge of the system. This means they update a component of the model parameter, based on the information they have last received, and then send back this new value to the system. For physical reasons, there is not instantaneity in this transmission and thus all workers do not condition on the same “current” value, necessarily. Perceiving this algorithm as a “garden of forking paths” where each full conditional uses values picked at random from a collection of subchains (one for each worker), I can see why the algorithm should remain valid.

“Thus, the quality of this [ABC] method rises and falls with the ingenuity of the user in identifying nearly-sufficient statistics.”

It is also clear that this approach allows for any degree of parallelisation. However, it is less clear to me why this should constitute an improvement. With respect to the bottlenecks mentioned at the beginning of the paper, I do not truly see how the large data problem is bypassed. Except in cases where conditionals only depend on small parts of the data. Or why large dimensions can be more easily managed when compared with a plain Gibbs sampler or, better, parallel plain Gibbs samplers that would run on the same number of processors. (I do not think the paper runs the comparison in that manner, using instead a one-processor Gibbs sampler as its benchmark. Or less processors in the third example.) Since the forking paths repeatedly merge back at aperiodic stages, there is no multiplication or clear increase of the exploratory abilities of the sampler. Except for having competing proposed values [or even proposals] selected randomly. So maybe reaching a wee bit farther from time to time.

## adaptive rejection sampling with fixed number of nodes

Posted in Books, Statistics, University life with tags , , , , , on October 8, 2015 by xi'an

This paper was arXived today by Martino and Louzada. It starts from the adaptive rejection sampling algorithm of Gilks and Wild (1993), which provided an almost automatic random generator for univariate log-concave target probability densities. By creating an envelope of the true target based on convexity. This algorithm was at the core of BUGS in its early days. And makes a good example of accept reject algorithm that I often use in class. It is however not so popular nowadays because of the unidimensional constraint. And because it is not particularly adequate for simulating a single realisation from a given target. As is the case when used in a Gibbs sampler. The authors only consider here the issue of the growing cost of running the adaptive rejection sampling algorithm, which is due to the update of the envelope at each rejection. They however fail to quantify this increase, which involves (a) finding the interval where the current rejection lies, among n existing intervals, which is of order O(n), and (b) creating both modified envelopes on both new intervals, which is of order O(1). The proposal of the authors, called cheap adaptive rejection sampling, settles for a fixed number τ of support points (and intervals), solely swapping a rejected point with the closest support point when this improves the acceptance rate. The test for swapping involves first computing the alternative target (on a pair of intervals) and the corresponding normalising constant, while swapping the rejected point with the closest support point involves finding the closest point, which is of order O(log τ). I am rather surprised that the authors do not even mention the execution time orders, resorting instead to a simulation where the gain brought by their proposal is not overwhelming. There is also an additional cost for figuring out the fixed number τ of support points, another issue not mentioned in the paper.

## amazing Gibbs sampler

Posted in Books, pictures, R, Statistics, University life with tags , , , , , , on February 19, 2015 by xi'an

When playing with Peter Rossi’s bayesm R package during a visit of Jean-Michel Marin to Paris, last week, we came up with the above Gibbs outcome. The setting is a Gaussian mixture model with three components in dimension 5 and the prior distributions are standard conjugate. In this case, with 500 observations and 5000 Gibbs iterations, the Markov chain (for one component of one mean of the mixture) has two highly distinct regimes: one that revolves around the true value of the parameter, 2.5, and one that explores a much broader area (which is associated with a much smaller value of the component weight). What we found amazing is the Gibbs ability to entertain both regimes, simultaneously.

## aperiodic Gibbs sampler

Posted in Books, Kids, pictures, Statistics, Travel, University life with tags , , , , , , , on February 11, 2015 by xi'an

A question on Cross Validated led me to realise I had never truly considered the issue of periodic Gibbs samplers! In MCMC, non-aperiodic chains are a minor nuisance in that the skeleton trick of randomly subsampling the Markov chain leads to a aperiodic Markov chain. (The picture relates to the skeleton!)  Intuitively, while the systematic Gibbs sampler has a tendency to non-reversibility, it seems difficult to imagine a sequence of full conditionals that would force the chain away from the current value..!In the discrete case, given that the current state of the Markov chain has positive probability for the target distribution, the conditional probabilities are all positive as well and hence the Markov chain can stay at its current value after one Gibbs cycle, with positive probabilities, which means strong aperiodicity. In the continuous case, a similar argument applies by considering a neighbourhood of the current value. (Incidentally, the same person asked a question about the absolute continuity of the Gibbs kernel. Being confused by our chapter on the topic!!!)

## Bangalore workshop [ಬೆಂಗಳೂರು ಕಾರ್ಯಾಗಾರ]

Posted in pictures, R, Running, Statistics, Travel, University life, Wines with tags , , , , , , on July 31, 2014 by xi'an

Second day at the Indo-French Centre for Applied Mathematics and the workshop. Maybe not the most exciting day in terms of talks (as I missed the first two plenary sessions by (a) oversleeping and (b) running across the campus!). However I had a neat talk with another conference participant that led to [what I think are] interesting questions… (And a very good meal in a local restaurant as the guest house had not booked me for dinner!)

To wit: given a target like

$\lambda \exp(-\lambda) \prod_{i=1}^n \dfrac{1-\exp(-\lambda y_i)}{\lambda}\quad (*)$

the simulation of λ can be demarginalised into the simulation of

$\pi (\lambda,\mathbf{z})\propto \lambda \exp(-\lambda) \prod_{i=1}^n \exp(-\lambda z_i) \mathbb{I}(z_i\le y_i)$

where z is a latent (and artificial) variable. This means a Gibbs sampler simulating λ given z and z given λ can produce an outcome from the target (*). Interestingly, another completion is to consider that the zi‘s are U(0,yi) and to see the quantity

$\pi(\lambda,\mathbf{z}) \propto \lambda \exp(-\lambda) \prod_{i=1}^n \exp(-\lambda z_i) \mathbb{I}(z_i\le y_i)$

as an unbiased estimator of the target. What’s quite intriguing is that the quantity remains the same but with different motivations: (a) demarginalisation versus unbiasedness and (b) zi ∼ Exp(λ) versus zi ∼ U(0,yi). The stationary is the same, as shown by the graph below, the core distributions are [formally] the same, … but the reasoning deeply differs.

Obviously, since unbiased estimators of the likelihood can be justified by auxiliary variable arguments, this is not in fine a big surprise. Still, I had not thought of the analogy between demarginalisation and unbiased likelihood estimation previously. Continue reading