A question on Cross Validated led me to realise I had never truly considered the issue of periodic Gibbs samplers! In MCMC, non-aperiodic chains are a minor nuisance in that the skeleton trick of randomly subsampling the Markov chain leads to a aperiodic Markov chain. (The picture relates to the skeleton!) Intuitively, while the systematic Gibbs sampler has a tendency to non-reversibility, it seems difficult to imagine a sequence of full conditionals that would force the chain away from the current value..!In the discrete case, given that the current state of the Markov chain has positive probability for the target distribution, the conditional probabilities are all positive as well and hence the Markov chain can stay at its current value after one Gibbs cycle, with positive probabilities, which means strong aperiodicity. In the continuous case, a similar argument applies by considering a neighbourhood of the current value. (Incidentally, the same person asked a question about the absolute continuity of the Gibbs kernel. Being confused by our chapter on the topic!!!)
Archive for MCMC algorithms
With my friends Peter Green (Bristol), Krzysztof Łatuszyński (Warwick) and Marcello Pereyra (Bristol), we just arXived the first version of “Bayesian computation: a perspective on the current state, and sampling backwards and forwards”, which first title was the title of this post. This is a survey of our own perspective on Bayesian computation, from what occurred in the last 25 years [a lot!] to what could occur in the near future [a lot as well!]. Submitted to Statistics and Computing towards the special 25th anniversary issue, as announced in an earlier post.. Pulling strength and breadth from each other’s opinion, we have certainly attained more than the sum of our initial respective contributions, but we are welcoming comments about bits and pieces of importance that we miss and even more about promising new directions that are not posted in this survey. (A warning that is should go with most of my surveys is that my input in this paper will not differ by a large margin from ideas expressed here or in previous surveys.)
Following the previous post, I went and had a (long) look at Puolamäki and Kaski’s paper. I must acknowledge that, despite having several runs through the paper, I still have trouble with the approach… From what I understand, the authors use a Bernoulli mixture pseudo-model to reallocate the observations to components. That is, given an MCMC output with simulated allocations variables (a.k.a., hidden or latent variables), they create a (TxK)xn matrix of component binary indicators e.g., for a three component mixture,
0 1 0 0 1 0…
1 0 0 0 0 0…
0 0 1 1 0 1…
0 1 0 0 1 1…
and estimate a probability to be in component j for each of the n observations, according to the (pseudo-)likelihood
It took me a few days, between morning runs and those wee hours when I cannot get back to sleep (!), to make some sense of this Bernoulli modelling. The allocation vectors are used together to estimate the probabilities of being “in” component j together. However the data—which is the outcome of an MCMC simulation and de facto does not originate from that Bernoulli mixture—does not seem appropriate, both because it is produced by an MCMC simulation and is made of blocks of highly correlated rows [which sum up to one]. The Bernoulli likelihood above also defines a new model, with many more parameters than in the original mixture model. And I fail to see why perfect, partial or inexistent label switching [in the MCMC sequence] is not going to impact the estimation of the Bernoulli mixture. And why an argument based on a fixed parameter value (Theorem 3) extends to an MCMC outcome where parameters themselves are subjected to some degree of label switching. Bemused, I remain…
Today, I took a look at a recently arXived paper posted in physics, lifting – A non reversible MCMC algorithm by Marija Vucleja, but I simply could not understand the concept of lifting. Presumably because of the physics perspective. And also because the paper is mostly a review, referring to the author’s earlier work. The notion of lifting is to create a duplicate of a regular Markov chain with given stationary distribution towards cancelling reversibility and hence speeding up the exploration of the state space. The central innovation in the paper seems to be in imposing a lifted reversibility, which means using the reverse dynamics on the lifted version of the chain, that is, the dual proposal
However, the paper does not explicit how the resulting Markov transition matrix on the augmented space is derived from the original matrix. I now realise my description is most likely giving the impression of two coupled Markov chains, which is not the case: the new setting is made of a duplicated sample space, in the sense of Nummelin split chain (but without the specific meaning for the binary variable found in Nummelin!). In the case of the 1-d Ising model, the implementation of the method means for instance picking a site at random, proposing to change its spin value by a Metropolis acceptance step and then, if the proposal is rejected, possibly switching to the corresponding value in the dual part of the state. Given the elementary proposal in the first place, I fail to see where the improvement can occur… I’d be most interested in seeing a version of this lifting in a realistic statistical setting.
A referee of our paper on approximating evidence for mixture model with Jeong Eun Lee pointed out the recent paper by Carlos Rodríguez and Stephen Walker on label switching in Bayesian mixture models: deterministic relabelling strategies. Which appeared this year in JCGS and went beyond, below or above my radar.
Label switching is an issue with mixture estimation (and other latent variable models) because mixture models are ill-posed models where part of the parameter is not identifiable. Indeed, the density of a mixture being a sum of terms
the parameter (vector) of the ω’s and of the θ’s is at best identifiable up to an arbitrary permutation of the components of the above sum. In other words, “component #1 of the mixture” is not a meaningful concept. And hence cannot be estimated.
This problem has been known for quite a while, much prior to EM and MCMC algorithms for mixtures, but it is only since mixtures have become truly estimable by Bayesian approaches that the debate has grown on this issue. In the very early days, Jean Diebolt and I proposed ordering the components in a unique way to give them a meaning. For instant, “component #1″ would then be the component with the smallest mean or the smallest weight and so on… Later, in one of my favourite X papers, with Gilles Celeux and Merrilee Hurn, we exposed the convergence issues related with the non-identifiability of mixture models, namely that the posterior distributions were almost always multimodal, with a multiple of k! symmetric modes in the case of exchangeable priors, and therefore that Markov chains would have trouble to visit all those modes in a symmetric manner, despite the symmetry being guaranteed from the shape of the posterior. And we conclude with the slightly provocative statement that hardly any Markov chain inferring about mixture models had ever converged! In parallel, time-wise, Matthew Stephens had completed a thesis at Oxford on the same topic and proposed solutions for relabelling MCMC simulations in order to identify a single mode and hence produce meaningful estimators. Giving another meaning to the notion of “component #1″.
And then the topic began to attract more and more researchers, being both simple to describe and frustrating in its lack of definitive answer, both from simulation and inference perspectives. Rodriguez’s and Walker’s paper provides a survey on the label switching strategies in the Bayesian processing of mixtures, but its innovative part is in deriving a relabelling strategy. Which consists of finding the optimal permutation (at each iteration of the Markov chain) by minimising a loss function inspired from k-means clustering. Which is connected with both Stephens’ and our [JASA, 2000] loss functions. The performances of this new version are shown to be roughly comparable with those of other relabelling strategies, in the case of Gaussian mixtures. (Making me wonder if the choice of the loss function is not favourable to Gaussian mixtures.) And somehow faster than Stephens’ Kullback-Leibler loss approach.
“Hence, in an MCMC algorithm, the indices of the parameters can permute multiple times between iterations. As a result, we cannot identify the hidden groups that make [all] ergodic averages to estimate characteristics of the components useless.”
One section of the paper puzzles me, albeit it does not impact the methodology and the conclusions. In Section 2.1 (p.27), the authors consider the quantity
which is the marginal probability of allocating observation i to cluster or component j. Under an exchangeable prior, this quantity is uniformly equal to 1/k for all observations i and all components j, by virtue of the invariance under permutation of the indices… So at best this can serve as a control variate. Later in Section 2.2 (p.28), the above sentence does signal a problem with those averages but it seem to attribute it to MCMC behaviour rather than to the invariance of the posterior (or to the non-identifiability of the components per se). At last, the paper mentions that “given the allocations, the likelihood is invariant under permutations of the parameters and the allocations” (p.28), which is not correct, since eqn. (8)
does not hold when the two permutations σ and τ give different images of zi…
An email from one of my Master students who sent his problem sheet (taken from Monte Carlo Statistical Methods) late:
Je « suis » votre cours du mercredi dont le formalisme mathématique me fait froid partout
Avec beaucoup de difficulté je vous envoie mes exercices du premier chapitre de votre livre.
which translates as
Good evening Professor,
I “follow” your Wednesday class which mathematical formalism makes me cold all over. With much hardship, I send you the first batch of problems from your book.
I know that winter is coming, but, still, making students shudder from mathematical cold is not my primary goal when teaching Monte Carlo methods!
…already Thursday, our [early] departure day!, with an nth (!) non-parametric session that saw [the newly elected ISBA Fellow!] Judith Rousseau present an ongoing work with Chris Holmes on the convergence or non-convergence conditions for a Bayes factor of a non-parametric hypothesis against another non-parametric. I wondered at the applicability of this test as the selection criterion in ABC settings, even though having an iid sample to start with is a rather strong requirement.
Switching between a scalable computation session with Alex Beskos, who talked about adaptive Langevin algorithms for differential equations, and a non-local prior session, with David Rossell presenting a smoother way to handle point masses in order to accommodate frequentist coverage. Something we definitely need to discuss the next time I am in Warwick! Although this made me alas miss both the first talk of the non-local session by Shane Jensen the final talk of the scalable session by Doug Vandewrken where I happened to be quoted (!) for my warning about discretising Markov chains into non-Markov processes. In the 1998 JASA paper with Chantal Guihenneuc.
After a farewell meal of ceviche with friends in the sweltering humidity of a local restaurant, I attended [the newly elected ISBA Fellow!] Maria Vanucci’s talk on her deeply involved modelling of fMRI. The last talk before the airport shuttle was François Caron’s description of a joint work with Emily Fox on a sparser modelling of networks, along with an auxiliary variable approach that allowed for parallelisation of a Gibbs sampler. François mentioned an earlier alternative found in machine learning where all components of a vector are updated simultaneously conditional on the previous avatar of the other components, e.g. simulating (x’,y’) from π(x’|y) π(y’|x) which does not produce a convergent Markov chain. At least not convergent to the right stationary. However, running a quick [in-flight] check on a 2-d normal target did not show any divergent feature, when compared with the regular Gibbs sampler. I thus wonder at what can be said about the resulting target or which conditions are need for divergence. A few scribbles later, I realised that the 2-d case was the exception, namely that the stationary distribution of the chain is the product of the marginal. However, running a 3-d example with an auto-exponential distribution in the taxi back home, I still could not spot a difference in the outcome.