label switching in Bayesian mixture models

cover of Mixture Estimation and ApplicationsA referee of our paper on approximating evidence for mixture model with Jeong Eun Lee pointed out the recent paper by Carlos Rodríguez and Stephen Walker on label switching in Bayesian mixture models: deterministic relabelling strategies. Which appeared this year in JCGS and went beyond, below or above my radar.

Label switching is an issue with mixture estimation (and other latent variable models) because mixture models are ill-posed models where part of the parameter is not identifiable. Indeed, the density of a mixture being a sum of terms

\sum_{j=1}^k \omega_j f(y|\theta_i)

the parameter (vector) of the ω’s and of the θ’s is at best identifiable up to an arbitrary permutation of the components of the above sum. In other words, “component #1 of the mixture” is not a meaningful concept. And hence cannot be estimated.

This problem has been known for quite a while, much prior to EM and MCMC algorithms for mixtures, but it is only since mixtures have become truly estimable by Bayesian approaches that the debate has grown on this issue. In the very early days, Jean Diebolt and I proposed ordering the components in a unique way to give them a meaning. For instant, “component #1” would then be the component with the smallest mean or the smallest weight and so on… Later, in one of my favourite X papers, with Gilles Celeux and Merrilee Hurn, we exposed the convergence issues related with the non-identifiability of mixture models, namely that the posterior distributions were almost always multimodal, with a multiple of k! symmetric modes in the case of exchangeable priors, and therefore that Markov chains would have trouble to visit all those modes in a symmetric manner, despite the symmetry being guaranteed from the shape of the posterior. And we conclude with the slightly provocative statement that hardly any Markov chain inferring about mixture models had ever converged! In parallel, time-wise, Matthew Stephens had completed a thesis at Oxford on the same topic and proposed solutions for relabelling MCMC simulations in order to identify a single mode and hence produce meaningful estimators. Giving another meaning to the notion of “component #1”.

And then the topic began to attract more and more researchers, being both simple to describe and frustrating in its lack of definitive answer, both from simulation and inference perspectives. Rodriguez’s and Walker’s paper provides a survey on the label switching strategies in the Bayesian processing of mixtures, but its innovative part is in deriving a relabelling strategy. Which consists of finding the optimal permutation (at each iteration of the Markov chain) by minimising a loss function inspired from k-means clustering. Which is connected with both Stephens’ and our [JASA, 2000] loss functions. The performances of this new version are shown to be roughly comparable with those of other relabelling strategies, in the case of Gaussian mixtures. (Making me wonder if the choice of the loss function is not favourable to Gaussian mixtures.) And somehow faster than Stephens’ Kullback-Leibler loss approach.

“Hence, in an MCMC algorithm, the indices of the parameters can permute multiple times between iterations. As a result, we cannot identify the hidden groups that make [all] ergodic averages to estimate characteristics of the components useless.”

One section of the paper puzzles me, albeit it does not impact the methodology and the conclusions. In Section 2.1 (p.27), the authors consider the quantity

p(z_i=j|{\mathbf y})

which is the marginal probability of allocating observation i to cluster or component j. Under an exchangeable prior, this quantity is uniformly equal to 1/k for all observations i and all components j, by virtue of the invariance under permutation of the indices… So at best this can serve as a control variate. Later in Section 2.2 (p.28), the above sentence does signal a problem with those averages but it seem to attribute it to MCMC behaviour rather than to the invariance of the posterior (or to the non-identifiability of the components per se). At last, the paper mentions that “given the allocations, the likelihood is invariant under permutations of the parameters and the allocations” (p.28), which is not correct, since eqn. (8)

f(y_i|\theta_{\sigma(z_i)}) =f(y_i|\theta_{\tau(z_i)})

does not hold when the two permutations σ and τ give different images of zi

15 Responses to “label switching in Bayesian mixture models”

  1. Dear Prof xi’an,
    I’m a newbie in Gaussian mixture model.
    Could you please advice me any fundamental notes / preferably sample notes in R to implement the latent class in Gaussian mixture.

    This is my snippet code to do the finite gaussian mixture:
    myData.obs= faithful[,1:2]; # the data matrix
    prob1 = pi1*dmvnorm(myData.obs,m1,Sigma1); # probabilities of sample points under model 1
    prob2 = pi2*dmvnorm(myData.obs,m2,Sigma2); # same for model 2
    Z<-rbinom(no,1,prob1/(prob1 + prob2 ))

    pi11/2) {

    I want to generate the Z values which is able to classify the data points into the particular Gaussian component such as 0 for the first Gaussian and 1 for the second Gaussian component with the given mean and sigma.
    However, if i run the code with 30 number of iterations, only the first and second iteration giving some changes on the data classification but does not properly classify the data. The next iteration is remain unchanged which is means that the data point are not properly assigned to the particular Gaussian component.
    I would feel grateful if you could give me any advice or comment to improve the solution.

    Thank you in advance.

  2. Better still, I would say that the unknown quantity (‘parameter’) in a finite mixture model is an atomic measure with finitely many atoms, and total mass 1. There’s no need (or even basis) to label anything.

    • But then how do you reconcile this utterly nihilist message with consistency results like those of Rousseau & Mengersen (2011)?

      • I don’t see the clash here. The labels are meaningless, the things they label aren’t. (And if I read Judith and Kerrie’s paper right, it says that the weight of un-necessary components -> 0 asymptotically)

  3. Peter Green Says:

    For what it’s worth, I agree with all that Radford has written above. On the computational point, I suggest that adding a ‘permute labels’ move to a MCMC sampler to achieve perfect balancing among the k! equal modes is a positively BAD thing to do (although advocated by some distinguished people) – it provides the illusion that your whole sampler is mixing fine, whereas a moment’s thought shows that a irredeemably bad sampler will be bad in many ways that have nothing to do with labels, so can’t be fixed by permuting labels.

    I agree with Christian that (un)identifiability is a statistical not computational problem – but I disagree with the implication that labelling in mixtures is to do with (un)identifiability. The problem is caused by misunderstanding the nature of the parameters. The set of (say) k means in a finite mixture representation is not a vector, it is a point process.

  4. Hi all and thanks for the interesting points.

    On Dan’s comment that he doesn’t “agree that MCMC convergence should be separated from the question of what is inferentially useful”, may I point you to a recent joint work in that direction with O. Cappé, G. Fort and B. Kégl
    One of the contributions is to solve relabeling online, in a way that makes sense both from an inferential point of view and from a sampling point of view. The bottom line is that if you allow a sweep over all permutations at each MCMC iteration, you can dynamically find the cost function that makes your relabeled posterior as unimodal as possible and favours good mixing of adaptive Metropolis-Hastings at the same time.

    On other comments, I agree that identifiability should be bypassed whenever possible. However, some applications require to get dirty and relabel. For example, we had this nice signal processing problem in particle physics, for the Auger experiment in particular:

  5. Exposing my ignorance, this is one of those things that has always confused me. My understanding of identifiably and MCMC convergence was always that if the only meaningful thing is some (nonlinear) combination of the components, then that’s all that needs to converge. Which, in the case of fully symmetric posteriors, surely removes the problems with label switching.

    Is this just a desire for exogenous meaning for the k labels that leads to non-convergence, whereas a “BNP” approach of considering the sum as an approximation to the required distribution would lead to everything being ok?

    Or am I naive?

    • To me, identifiability or lack thereof is a statistical concept, while convergence of an MCMC algorithm is a probabilistic one, hence the latter is separate from the former. While components have not a well-defined statistical meaning, the target density is a well-defined function in the parameter space and an MCMC algorithm targeting this function should reproduce all of its characteristics, including multimodality and symmetry. This was the essence of our 2000 JASA paper. So I would not link the convergence of MCMC with statistical properties, even when this is all that matters in the end. And this is John Geweke’s argument about label switching.

      • So I agree that identifiability vs convergence aren’t interchangeable concepts, the idea was that I would describe an MCMC chain that has converged on the identifiable components “converged”.

        I understand that you can ask for more (convergence of the posterior) and you can get it, but I’m uncertain about the gain from doing this.

        I scanned Geweke’s article and I think he said just that in the “permatuation invariant” cases. But the existence of the more sensitive cases probably means that I just don’t know what I’m talking about.

        But I don’t agree that MCMC convergence should be separated from the question of what is inferentially useful. MCMC is a means to an end, but not an end unto itself. If you desperately need to recover the posterior density in all its multimodal glory, then that’s a judgement about what is inferentially useful in the problem in question. But I would imagine that in lots of contexts you could get decent savings by being more targeted in your desires… Just perhaps not in mixture models :p

      • Well, this is a case for doubting MCMC convergence and again the starting point of our 2000 paper (sorry for bringing it out again and again!). If you do not see a feature of your target, namely this massive symmetry, on your MCMC output, what level of trust do you put on it (i.e., on it converging to the target)?! In other words if I restrict the MCMC setting to the vicinity of a single mode, what are my tools to be happy with convergence issues?

    • Radford Neal Says:

      I too find the attention to this topic puzzling. Yes, “component #1 of the mixure” is not a meaningful thing to estimate. So don’t! All the meaningful things are obtained without any problem from ordinary MCMC. You can add a special transition to your MCMC if you like, that just randomly permutes the labels, but once you understand the issue you realize that although this gets rid of the convergence “problem”, it’s also pointless.

      (Note that an exception is computation of marginal likelihoods, where you do have to pay attention to the k! modes, but you almost certainly shouldn’t be calculating marginal likelihoods of mixture models anyway…)

      • Thanks, Radford. Your point brings Dan and I together in that having a perfect sampler that reproduces every symmetry is not much of an help for conducting inference, except telling you your sampler is rather good and hence you can trust (more) that it has “converged” (with the usual provisions about this sentence).

        Now, may I ask you what is so irremediably damning about computing marginal likelihoods?!

      • Radford Neal Says:

        If your MCMC method moved among the k! symmetrical modes without any special transitions designed to ensure this, then you might indeed have increased confidence that it had also found all the non-symmetrical modes that exist. But failure to move among the k! symmetrical modes isn’t really much of a sign of any other problem (one that actually matters), since the k! symmetrical modes are often very far apart, so it’s no surprise that a generic MCMC method can’t move between them, even if it does move among non-symmetrical modes that aren’t so far apart.

        Marginal likelihoods for mixture models are extremely sensitive to the prior used for the parameters of each mixture component. As this prior becomes more diffuse, the model with only one component becomes more and more favoured. So computing marginal likelihoods (eg, for models with different numbers of components) makes sense only if you have a very carefully thought out prior for the parameters of the mixture components, based on real prior information, regarding components that correspond to real physical entities.

        In most cases, you should just be using a Dirichlet process mixture model (with an infinite number of components), and sampling from the posterior distribution of continuous hyperparameters such as the Dirichlet concentration parameter. The number of components used in a finite sample is still sensitive to the prior on component parameters (which ought to be controlled by hyperparameters), but fortunately you’ll be even less tempted to think individual components have some simple interpretation.

      • I find the suggestion in the last paragraph intriguing. I’d also add that the inconsistency of the posterior of a DPM for the number of components provides an extra psychological restraint against the simple interpretation of individual components

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.

%d bloggers like this: