Archive for multimodality

label switching in Bayesian mixture models

Posted in Books, Statistics, University life with tags , , , , , , , , , , , on October 31, 2014 by xi'an

cover of Mixture Estimation and ApplicationsA referee of our paper on approximating evidence for mixture model with Jeong Eun Lee pointed out the recent paper by Carlos Rodríguez and Stephen Walker on label switching in Bayesian mixture models: deterministic relabelling strategies. Which appeared this year in JCGS and went beyond, below or above my radar.

Label switching is an issue with mixture estimation (and other latent variable models) because mixture models are ill-posed models where part of the parameter is not identifiable. Indeed, the density of a mixture being a sum of terms

\sum_{j=1}^k \omega_j f(y|\theta_i)

the parameter (vector) of the ω’s and of the θ’s is at best identifiable up to an arbitrary permutation of the components of the above sum. In other words, “component #1 of the mixture” is not a meaningful concept. And hence cannot be estimated.

This problem has been known for quite a while, much prior to EM and MCMC algorithms for mixtures, but it is only since mixtures have become truly estimable by Bayesian approaches that the debate has grown on this issue. In the very early days, Jean Diebolt and I proposed ordering the components in a unique way to give them a meaning. For instant, “component #1” would then be the component with the smallest mean or the smallest weight and so on… Later, in one of my favourite X papers, with Gilles Celeux and Merrilee Hurn, we exposed the convergence issues related with the non-identifiability of mixture models, namely that the posterior distributions were almost always multimodal, with a multiple of k! symmetric modes in the case of exchangeable priors, and therefore that Markov chains would have trouble to visit all those modes in a symmetric manner, despite the symmetry being guaranteed from the shape of the posterior. And we conclude with the slightly provocative statement that hardly any Markov chain inferring about mixture models had ever converged! In parallel, time-wise, Matthew Stephens had completed a thesis at Oxford on the same topic and proposed solutions for relabelling MCMC simulations in order to identify a single mode and hence produce meaningful estimators. Giving another meaning to the notion of “component #1”.

And then the topic began to attract more and more researchers, being both simple to describe and frustrating in its lack of definitive answer, both from simulation and inference perspectives. Rodriguez’s and Walker’s paper provides a survey on the label switching strategies in the Bayesian processing of mixtures, but its innovative part is in deriving a relabelling strategy. Which consists of finding the optimal permutation (at each iteration of the Markov chain) by minimising a loss function inspired from k-means clustering. Which is connected with both Stephens’ and our [JASA, 2000] loss functions. The performances of this new version are shown to be roughly comparable with those of other relabelling strategies, in the case of Gaussian mixtures. (Making me wonder if the choice of the loss function is not favourable to Gaussian mixtures.) And somehow faster than Stephens’ Kullback-Leibler loss approach.

“Hence, in an MCMC algorithm, the indices of the parameters can permute multiple times between iterations. As a result, we cannot identify the hidden groups that make [all] ergodic averages to estimate characteristics of the components useless.”

One section of the paper puzzles me, albeit it does not impact the methodology and the conclusions. In Section 2.1 (p.27), the authors consider the quantity

p(z_i=j|{\mathbf y})

which is the marginal probability of allocating observation i to cluster or component j. Under an exchangeable prior, this quantity is uniformly equal to 1/k for all observations i and all components j, by virtue of the invariance under permutation of the indices… So at best this can serve as a control variate. Later in Section 2.2 (p.28), the above sentence does signal a problem with those averages but it seem to attribute it to MCMC behaviour rather than to the invariance of the posterior (or to the non-identifiability of the components per se). At last, the paper mentions that “given the allocations, the likelihood is invariant under permutations of the parameters and the allocations” (p.28), which is not correct, since eqn. (8)

f(y_i|\theta_{\sigma(z_i)}) =f(y_i|\theta_{\tau(z_i)})

does not hold when the two permutations σ and τ give different images of zi

high-dimensional stochastic simulation and optimisation in image processing [day #3]

Posted in pictures, Statistics, Travel, University life with tags , , , , , , , , , , , , on August 31, 2014 by xi'an

Last and maybe most exciting day of the “High-dimensional Stochastic Simulation and Optimisation in Image Processing” in Bristol as it was exclusively about simulation (MCMC) methods. Except my own talk on ABC. And Peter Green’s on consistency of Bayesian inference in non-regular models. The talks today were indeed about using convex optimisation devices to speed up MCMC algorithms with tools that were entirely new to me, like the Moreau transform discussed by Marcelo Pereyra. Or using auxiliary variables à la RJMCMC to bypass expensive Choleski decompositions. Or optimisation steps from one dual space to the original space for the same reason. Or using pseudo-gradients on partly differentiable functions in the talk by Sylvain Lecorff on a paper commented earlier in the ‘Og. I particularly liked the notion of Moreau regularisation that leads to more efficient Langevin algorithms when the target is not regular enough. Actually, the discretised diffusion itself may be geometrically ergodic without the corrective step of the Metropolis-Hastings acceptance. This obviously begs the question of an extension to Hamiltonian Monte Carlo. And to multimodal targets, possibly requiring as many normalisation factors as there are modes. So, in fine, a highly informative workshop, with the perfect size and the perfect crowd (which happened to be predominantly French, albeit from a community I did not have the opportunity to practice previously). Massive kudos to Marcello for putting this workshop together, esp. on a week where family major happy events should have kept him at home!

As the workshop ended up in mid-afternoon, I had plenty of time for a long run with Florence Forbes down to the Avon river and back up among the deers of Ashton Court, avoiding most of the rain, all of the mountain bikes on a bike trail that sounded like trail running practice, and building enough of an appetite for the South Indian cooking of the nearby Thali Café. Brilliant!