Archive for Gibbs sampler

slice sampling for Dirichlet mixture process

Posted in Books, Statistics, University life with tags , , , , , , , on June 21, 2017 by xi'an

When working with my PhD student Changye in Dauphine this morning I realised that slice sampling also applies to discrete support distributions and could even be of use in such settings. That it works is (now) straightforward in that the missing variable representation behind the slice sampler also applies to densities defined with respect to a discrete measure. That this is useful transpires from the short paper of Stephen Walker (2007) where we saw this, as Stephen relies on the slice sampler to sample from the Dirichlet mixture model by eliminating the tail problem associated with this distribution. (This paper appeared in Communications in Statistics and it is through Pati & Dunson (2014) taking advantage of this trick that Changye found about its very existence. I may have known about it in an earlier life, but I had clearly forgotten everything!)

While the prior distribution (of the weights) of the Dirichlet mixture process is easy to generate via the stick breaking representation, the posterior distribution is trickier as the weights are multiplied by the values of the sampling distribution (likelihood) at the corresponding parameter values and they cannot be normalised. Introducing a uniform to replace all weights in the mixture with an indicator that the uniform is less than those weights corresponds to a (latent variable) completion [or a demarginalisation as we called this trick in Monte Carlo Statistical Methods]. As elaborated in the paper, the Gibbs steps corresponding to this completion are easy to implement, involving only a finite number of components. Meaning the allocation to a component of the mixture can be operated rather efficiently. Or not when considering that the weights in the Dirichlet mixture are not monotone, hence that a large number of them may need to be computed before picking the next index in the mixture when the uniform draw happens to be quite small.

simulation under zero measure constraints

Posted in Books, Kids, R, Statistics, University life with tags , , , , , , , on November 17, 2016 by xi'an

A theme that comes up fairly regularly on X validated is the production of a sample with given moments, either for calibration motives or from a misunderstanding of the difference between a distribution mean and a sample average. Here are some entries on that topic:

In most of those questions, the constraint in on the sum or mean of the sample, which allows for an easy resolution by a change of variables. It however gets somewhat harder when the constraint involves more moments or, worse, an implicit solution to an equation. A good example of the later is the quest for a sample with a given maximum likelihood estimate in the case this MLE cannot be derived analytically. As for instance with a location-scale t sample…

Actually, even when the constraint is solely on the sum, a relevant question is the production of an efficient simulation mechanism. Using a Gibbs sampler that changes one component of the sample at each iteration does not qualify, even though it eventually produces the proper sample. Except for small samples. As in this example

s0=.5 #fixed average
for (t in 2:T){
 for (i in 1:(n-1)){

For very large samples, I figure that proposing from the unconstrained density can achieve a sufficient efficiency, but the in-between setting remains an interesting problem.

independent Metropolis-Hastings

Posted in Books, Statistics with tags , , , , , , on November 24, 2015 by xi'an

“In this paper we have demonstrated the potential benefits, both theoretical and practical, of the independence sampler over the random walk Metropolis algorithm.”

Peter Neal and Tsun Man Clement Lee arXived a paper on optimising the independent Metropolis-Hastings algorithm. I was a bit surprised at this “return” of the independent sampler, which I hardly mention in my lectures, so I had a look at the paper. The goal is to produce an equivalent to what Gelman, Gilks and Wild (1996) obtained for random walk samplers.  In the formal setting when the target is a product of n identical densities f, the optimal number k of components to update in one Metropolis-Hastings (within Gibbs) round is approximately 2.835/I, where I is the symmetrised Kullback-Leibler distance between the (univariate) target f and the independent proposal q. When I is finite. The most surprising part is that the optimal acceptance rate is again 0.234, as in the random walk case. This is surprising in that I usually associate the independent Metropolis-Hastings algorithm with high acceptance rates. But this is of course when calibrating the proposal q, not the block size k of the Gibbs part. Hence, while this calibration of the independent Metropolis-within-Gibbs sampler is worth the study and almost automatically applicable, it remains that it only applies to a certain category of problems where blocking can take place. As in the disease models illustrating the paper. And requires an adequate choice of proposal distribution for, otherwise, the above quote becomes inappropriate.

bouncy particle sampler

Posted in pictures, Statistics, Travel, University life with tags , , , , , , , on October 30, 2015 by xi'an

 Alexandre Bouchard-Coté, Sebastian Vollmer and Arnaud Doucet just arXived a paper with the above title, which reminded me of a proposal Kerrie Mengersen and I made at Valencia 7, in Tenerife, the [short-lived!] pinball sampler. This sampler was a particle (MCMC) sampler where we used the location of the other particles to avoid their neighbourhood, by bouncing away from them according to a delayed rejection principle, with an overall Gibbs justification since the resulting target was the product of copies of the target distribution. The difficulty in implementing the (neat!) idea was in figuring out the amount of bouncing or, in more physical terms, the energy allocated to the move.

In the current paper, inspired from an earlier paper in physics, the Markov chain (or single particle) evolves by linear moves, changing directions according to a Poisson process, with intensity and direction depending on the target distribution. A local version takes advantage of a decomposition of the target into a product of terms involving only some components of the whole parameter to be simulated. And hence allowing for moves in subspaces. An extension proposed by the authors is to bounce along the Hamiltonian isoclines. The method is demonstrably ergodic and irreducible. In practice, I wonder at the level of calibration or preliminary testing required to facilitate the exploration of the parameter space, particularly in the local version that seems to multiply items to be calibrated.

asynchronous distributed Gibbs sampling

Posted in Books, Statistics, Travel, University life with tags , , , , , , , on October 13, 2015 by xi'an

Alexander Terenin, Dan Simpson, and David Draper just arXived a paper on an alternative brand of Gibbs sampler, which they think can revolutionise the sampler and overcome its well-known bottlenecks. David had sent me the paper in advance and thus I had time to read it in the plane to Calgary. (This is also the very first paper I see acknowledging a pair of trousers..! With no connection whatsoever with bottlenecks!)

“Note that not all updates that are sent will be received by all other workers: because of network traffic congestion and other types of failures, a significant portion of the updates will be lost along the way.”

The approach is inherently parallel in that several “workers” (processors or graphical units) run Gibbs samplers in parallel, using their current knowledge of the system. This means they update a component of the model parameter, based on the information they have last received, and then send back this new value to the system. For physical reasons, there is not instantaneity in this transmission and thus all workers do not condition on the same “current” value, necessarily. Perceiving this algorithm as a “garden of forking paths” where each full conditional uses values picked at random from a collection of subchains (one for each worker), I can see why the algorithm should remain valid.

“Thus, the quality of this [ABC] method rises and falls with the ingenuity of the user in identifying nearly-sufficient statistics.”

It is also clear that this approach allows for any degree of parallelisation. However, it is less clear to me why this should constitute an improvement. With respect to the bottlenecks mentioned at the beginning of the paper, I do not truly see how the large data problem is bypassed. Except in cases where conditionals only depend on small parts of the data. Or why large dimensions can be more easily managed when compared with a plain Gibbs sampler or, better, parallel plain Gibbs samplers that would run on the same number of processors. (I do not think the paper runs the comparison in that manner, using instead a one-processor Gibbs sampler as its benchmark. Or less processors in the third example.) Since the forking paths repeatedly merge back at aperiodic stages, there is no multiplication or clear increase of the exploratory abilities of the sampler. Except for having competing proposed values [or even proposals] selected randomly. So maybe reaching a wee bit farther from time to time.

adaptive rejection sampling with fixed number of nodes

Posted in Books, Statistics, University life with tags , , , , , on October 8, 2015 by xi'an

This paper was arXived today by Martino and Louzada. It starts from the adaptive rejection sampling algorithm of Gilks and Wild (1993), which provided an almost automatic random generator for univariate log-concave target probability densities. By creating an envelope of the true target based on convexity. This algorithm was at the core of BUGS in its early days. And makes a good example of accept reject algorithm that I often use in class. It is however not so popular nowadays because of the unidimensional constraint. And because it is not particularly adequate for simulating a single realisation from a given target. As is the case when used in a Gibbs sampler. The authors only consider here the issue of the growing cost of running the adaptive rejection sampling algorithm, which is due to the update of the envelope at each rejection. They however fail to quantify this increase, which involves (a) finding the interval where the current rejection lies, among n existing intervals, which is of order O(n), and (b) creating both modified envelopes on both new intervals, which is of order O(1). The proposal of the authors, called cheap adaptive rejection sampling, settles for a fixed number τ of support points (and intervals), solely swapping a rejected point with the closest support point when this improves the acceptance rate. The test for swapping involves first computing the alternative target (on a pair of intervals) and the corresponding normalising constant, while swapping the rejected point with the closest support point involves finding the closest point, which is of order O(log τ). I am rather surprised that the authors do not even mention the execution time orders, resorting instead to a simulation where the gain brought by their proposal is not overwhelming. There is also an additional cost for figuring out the fixed number τ of support points, another issue not mentioned in the paper.

amazing Gibbs sampler

Posted in Books, pictures, R, Statistics, University life with tags , , , , , , on February 19, 2015 by xi'an

BayesmWhen playing with Peter Rossi’s bayesm R package during a visit of Jean-Michel Marin to Paris, last week, we came up with the above Gibbs outcome. The setting is a Gaussian mixture model with three components in dimension 5 and the prior distributions are standard conjugate. In this case, with 500 observations and 5000 Gibbs iterations, the Markov chain (for one component of one mean of the mixture) has two highly distinct regimes: one that revolves around the true value of the parameter, 2.5, and one that explores a much broader area (which is associated with a much smaller value of the component weight). What we found amazing is the Gibbs ability to entertain both regimes, simultaneously.