This (early) summer, a conference on missing data will be organised in Rennes, Brittany, with the support of the French Statistical Society [SFDS]. (Check the website if interested, Rennes is a mere two hours from Paris by fast train.)
A referee of our paper with Jeong Eun Lee on approximating evidence for mixture models pointed out the recent paper by Carlos Rodríguez and Stephen Walker on label switching in Bayesian mixture models: deterministic relabelling strategies, which appeared this year in JCGS and went beyond, below or above my radar.
Label switching is an issue with mixture estimation (and other latent variable models) because mixture models are ill-posed models where part of the parameter is not identifiable. Indeed, the density of a mixture being a sum of terms

$$\sum_{j=1}^k \omega_j f(y|\theta_j)$$
the parameter (vector) of the ω’s and of the θ’s is at best identifiable up to an arbitrary permutation of the components of the above sum. In other words, “component #1 of the mixture” is not a meaningful concept. And hence cannot be estimated.
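A minimal Python sketch of this invariance, assuming Gaussian components (the data, the weights and the component parameters below are made up for illustration): the mixture log-likelihood takes the same value under every one of the k! relabellings of the components.

```python
import itertools
import math


def normal_pdf(y, mean, sd):
    """Density of a N(mean, sd^2) distribution at y."""
    return math.exp(-0.5 * ((y - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def mixture_loglik(data, weights, means, sds):
    """Log-likelihood of a Gaussian mixture with the given component parameters."""
    total = 0.0
    for y in data:
        total += math.log(
            sum(w * normal_pdf(y, m, s) for w, m, s in zip(weights, means, sds))
        )
    return total


# Hypothetical toy data and parameter values (not from any paper).
data = [-1.2, 0.3, 0.8, 2.5, 3.1, 4.0]
weights = [0.2, 0.5, 0.3]
means = [-1.0, 1.0, 3.5]
sds = [0.5, 1.0, 0.8]

# Evaluate the log-likelihood under all 3! = 6 relabellings of the components.
logliks = []
for perm in itertools.permutations(range(3)):
    logliks.append(
        mixture_loglik(
            data,
            [weights[j] for j in perm],
            [means[j] for j in perm],
            [sds[j] for j in perm],
        )
    )

# All six values agree up to floating-point rounding.
print(logliks)
```

Since the data cannot tell those six parameter vectors apart, only the prior could, and an exchangeable prior does not.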
This problem has been known for quite a while, much prior to EM and MCMC algorithms for mixtures, but it is only since mixtures have become truly estimable by Bayesian approaches that the debate has grown on this issue. In the very early days, Jean Diebolt and I proposed ordering the components in a unique way to give them a meaning. For instance, “component #1” would then be the component with the smallest mean or the smallest weight and so on… Later, in one of my favourite X papers, with Gilles Celeux and Merrilee Hurn, we exposed the convergence issues related to the non-identifiability of mixture models, namely that the posterior distributions were almost always multimodal, with a multiple of k! symmetric modes in the case of exchangeable priors, and therefore that Markov chains would have trouble visiting all those modes in a symmetric manner, despite the symmetry being guaranteed by the shape of the posterior. And we concluded with the slightly provocative statement that hardly any Markov chain inferring about mixture models had ever converged! In parallel, time-wise, Matthew Stephens had completed a thesis at Oxford on the same topic and proposed solutions for relabelling MCMC simulations in order to identify a single mode and hence produce meaningful estimators. Giving another meaning to the notion of “component #1”.
And then the topic began to attract more and more researchers, being both simple to describe and frustrating in its lack of a definitive answer, both from simulation and inference perspectives. Rodríguez and Walker’s paper provides a survey of the label switching strategies in the Bayesian processing of mixtures, but its innovative part is in deriving a relabelling strategy. Which consists of finding the optimal permutation (at each iteration of the Markov chain) by minimising a loss function inspired by k-means clustering. Which is connected with both Stephens’ and our [JASA, 2000] loss functions. The performances of this new version are shown to be roughly comparable with those of other relabelling strategies, in the case of Gaussian mixtures. (Making me wonder if the choice of the loss function is not favourable to Gaussian mixtures.) And somewhat faster than Stephens’ Kullback-Leibler loss approach.
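To illustrate the general idea of relabelling each iteration by a loss-minimising permutation, here is a deliberately simplified Python sketch. It is not the loss function of Rodríguez and Walker (nor Stephens’ or ours): as a stand-in, it uses the squared distance between the permuted draw of the component means and a fixed reference vector, and it searches all k! permutations by brute force. The draws and the reference values are hypothetical.

```python
import itertools


def relabel(draws, reference):
    """Permute each draw so that it is closest (in squared distance) to `reference`.

    Stand-in loss for illustration only; brute force over all k! permutations.
    """
    k = len(reference)
    relabelled = []
    for draw in draws:
        best = min(
            itertools.permutations(range(k)),
            key=lambda perm: sum(
                (draw[perm[j]] - reference[j]) ** 2 for j in range(k)
            ),
        )
        relabelled.append([draw[j] for j in best])
    return relabelled


# Hypothetical MCMC output for three component means, with labels switching
# between iterations.
draws = [
    [0.1, 2.9, 6.2],
    [3.1, -0.2, 5.8],
    [6.1, 0.0, 3.0],
    [-0.1, 3.2, 5.9],
]
reference = [0.0, 3.0, 6.0]  # assumed pivot, e.g. a rough estimate of one mode

relabelled = relabel(draws, reference)

# Ergodic average of "component #1" before and after relabelling.
raw_mean_first = sum(d[0] for d in draws) / len(draws)
relabelled_mean_first = sum(d[0] for d in relabelled) / len(relabelled)
print(raw_mean_first, relabelled_mean_first)
```

The raw average mixes values from three different components, while the relabelled one stays near the first reference value. The brute-force search costs k! evaluations per iteration, which is only reasonable for small k.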
“Hence, in an MCMC algorithm, the indices of the parameters can permute multiple times between iterations. As a result, we cannot identify the hidden groups that make [all] ergodic averages to estimate characteristics of the components useless.”
One section of the paper puzzles me, albeit it does not impact the methodology and the conclusions. In Section 2.1 (p.27), the authors consider the quantity
which is the marginal probability of allocating observation i to cluster or component j. Under an exchangeable prior, this quantity is uniformly equal to 1/k for all observations i and all components j, by virtue of the invariance under permutation of the indices… So at best this can serve as a control variate. Later in Section 2.2 (p.28), the above sentence does signal a problem with those averages but it seems to attribute it to MCMC behaviour rather than to the invariance of the posterior (or to the non-identifiability of the components per se). Finally, the paper mentions that “given the allocations, the likelihood is invariant under permutations of the parameters and the allocations” (p.28), which is not correct, since eqn. (8)
does not hold when the two permutations σ and τ give different images of zi…
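The 1/k value of the marginal allocation probability can be checked on a toy example. The Python sketch below assumes an idealised sampler that visits all k! symmetric modes equally often, which I mimic by applying every permutation of the labels to a few hypothetical allocation vectors; the resulting frequency of allocating observation i to component j is then exactly 1/k, whatever the allocations.

```python
from fractions import Fraction
import itertools


def symmetrised_allocation_probs(allocations, k):
    """Frequency of z_i = j after applying every label permutation to each draw."""
    n = len(allocations[0])
    counts = [[0] * k for _ in range(n)]
    total = 0
    for z in allocations:
        for perm in itertools.permutations(range(k)):
            total += 1
            for i, label in enumerate(z):
                counts[i][perm[label]] += 1
    return [[Fraction(c, total) for c in row] for row in counts]


# Hypothetical allocation vectors for five observations and k = 3 components.
allocations = [
    [0, 0, 1, 2, 2],
    [0, 1, 1, 2, 2],
    [0, 0, 1, 1, 2],
]

probs = symmetrised_allocation_probs(allocations, 3)
print(probs)  # every entry is Fraction(1, 3)
```

Each label is sent to each index by exactly (k-1)! of the k! permutations, hence the uniform answer, which carries no information about the clustering.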
Jan Hannig kindly sent me this email about several difficulties with Chapters 3, Monte Carlo Integration, and 5, Monte Carlo Optimization, when teaching out of our book Monte Carlo Statistical Methods [my replies in italics between square brackets, apologies for the late reply and posting, as well as for the confusion thus created. Of course, the additional typos will soon be included in the typo lists on my book webpage.]:
Thanks a ton to Jan and to his UNC students (and apologies for leading them astray with those typos!!!)
Edward Kao is engaged in a detailed parallel reading of Monte Carlo Statistical Methods and of Introducing Monte Carlo Methods with R. He has pointed out several typos in Example 5.18 of Monte Carlo Statistical Methods which studies a missing data phone plan model and its EM resolution. First, the customers in area i should be double-indexed, i.e.
which implies in turn that
Then the summary T should be defined as
given that the first m customers have the fifth plan missing.
The first day at JSM is always a bit sluggish, as people slowly drip in and get their bearings. Similar to last year in Washington D.C., the meeting takes place in a huge conference centre and thus there is no feeling of overcrowding [so far]. It may also be that the peripheral and foreign location of the meeting put some regular attendees off (not to mention the expensive living costs!).
Nonetheless, the Sunday afternoon sessions started with a highly interesting How Fast Can We Compute? How Fast Will We Compute? session organised by Mike West and featuring Steve Scott, Marc Suchard and Quanli Wang. The topic was parallel processing, either via multiple processors or via GPUs, the latter relating to the exciting talk Chris Holmes gave at the Valencia meeting. Steve showed us some code in order to explain how feasible the jump to parallel programming was (a point demonstrated by Julien Cornebise and Pierre Jacob after they returned from Valencia), while stressing the fact that a lot of the processing in MCMC runs was open to parallelisation. For instance, data augmentation schemes can allocate the missing data in a parallel way in most problems and the same holds for independent data likelihood computation. Marc Suchard focussed on GPUs and phylogenetic trees, both of high interest to me!, and he stressed the huge gains (of the order of hundreds in the decrease in computing time) made possible by the exploitation of laptop [Macbook] GPUs. (If I got his example correctly, he seemed to be doing an exact computation of the phylogeny likelihood, not an ABC approximation… Which is quite interesting, if potentially killing one of my main areas of research!) Quanli Wang linked both previous talks with the example of mixtures with a huge number of components. Plenty of food for thought.
I completed the afternoon session with the Student Paper Competition: Bayesian Nonparametric and Semiparametric Methods which was discouragingly empty of participants, with two of the five speakers missing and fewer than twenty people in the room. (I did not get the point about the competition as to who was ranking those papers. Not the participants apparently!)