Overfitting Bayesian mixture models with an unknown number of components
During my Czech vacations, Zoé van Havre, Nicole White, Judith Rousseau, and Kerrie Mengersen posted on arXiv a paper on overfitting mixture models to estimate the number of components. This is directly related to Judith and Kerrie’s 2011 paper and to Zoé’s PhD topic. The paper also returns to the vexing (?) issue of label switching! I very much like the paper, not only because the authors are good friends, but also because it brings a solution to an approach I briefly attempted with Marie-Anne Gruet in the early 1990s, just before finding out about the reversible jump MCMC algorithm of Peter Green at a workshop in Luminy and considering we were not going to “beat the competition”! Hence our not publishing the output of our over-fitted Gibbs samplers that were nicely emptying extra components… It also brings a rebuke to a later assertion of mine at an ICMS workshop on mixtures, where I defended the notion that over-fitted mixtures could not be detected, a notion that was severely disputed by David McKay…
What is so fantastic about Rousseau and Mengersen (2011) is that a simple constraint on the Dirichlet prior on the mixture weights suffices to guarantee that, asymptotically, superfluous components will empty out and signal they are truly superfluous! The authors here combine the over-fitted mixture with a tempering strategy, which seems somewhat redundant, the number of extra components being a sort of temperature, but which eliminates the need for fragile RJMCMC steps. Label switching is obviously even more of an issue with a larger number of components, and identifying empty components seems to require a lack of label switching for some components to remain empty!
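To make the emptying-out phenomenon concrete, here is a minimal sketch of an over-fitted Gibbs sampler. It is not the authors' code, and everything specific in it is an assumption made for illustration: unit-variance Gaussian components, a N(0, 10²) prior on the means, a symmetric Dirichlet prior with an arbitrarily small hyperparameter (0.01), and the function name overfitted_gibbs. It fits six components to data simulated from two and records how many observations each component holds at every iteration.

```python
import numpy as np

def overfitted_gibbs(y, K=6, alpha=0.01, n_iter=300, seed=0):
    """Gibbs sampler for a K-component Gaussian mixture with unit variances.

    Illustrative assumptions: N(0, 10^2) prior on each mean and a symmetric
    Dirichlet(alpha) prior on the weights. Returns an (n_iter, K) array of
    allocation counts, one row per iteration.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    mu = rng.choice(y, size=K, replace=False)  # start the means at data points
    w = np.full(K, 1.0 / K)
    counts_trace = np.zeros((n_iter, K), dtype=int)
    for t in range(n_iter):
        # Allocations: P(z_i = k) is proportional to w_k * N(y_i; mu_k, 1).
        # The weights are floored to avoid log(0) when a weight underflows.
        logp = np.log(np.maximum(w, 1e-300)) - 0.5 * (y[:, None] - mu[None, :]) ** 2
        p = np.exp(logp - logp.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        z = (p.cumsum(axis=1) < rng.random(n)[:, None]).sum(axis=1)
        z = np.minimum(z, K - 1)  # guard against rounding in the cumulative sum
        counts = np.bincount(z, minlength=K)
        sums = np.bincount(z, weights=y, minlength=K)
        # Means: conjugate normal update; an empty component is drawn from its prior.
        prec = counts + 1.0 / 100.0
        mu = rng.normal(sums / prec, 1.0 / np.sqrt(prec))
        # Weights: Dirichlet(alpha + counts), so a small alpha keeps
        # empty components at a negligible weight.
        w = rng.dirichlet(alpha + counts)
        counts_trace[t] = counts
    return counts_trace

# Simulated data from a two-component mixture, fitted with six components.
rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(-2.0, 1.0, 100), rng.normal(3.0, 1.0, 100)])
trace = overfitted_gibbs(y)
occupied = (trace > 0).sum(axis=1)
print("mean number of non-empty components:", occupied[100:].mean())
```

Looking at the rows of trace rather than at the posterior weights is a design choice: the allocation counts show directly which components are empty at each iteration, although which labels are empty may change along the chain because of label switching.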
When reading through the paper, I came upon the condition that only the priors on the weights are allowed to vary between temperatures. Distinguishing the weights from the other parameters does make perfect sense, as some representations of a mixture work without those weights. Still, I feel a bit uncertain about the fixed prior constraint, even though I can see the rationale in not allowing for complete freedom in picking those priors. More fundamentally, I am less and less happy with independent and identical, or exchangeable, priors on the components.
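The condition that only the weight priors vary between temperatures has a convenient consequence for swap moves, which can be sketched as follows. This assumes a standard parallel-tempering swap between two chains whose targets differ only through symmetric Dirichlet hyperparameters alpha_i and alpha_j; the scheme actually used in the paper may differ, and the function names are hypothetical. Under that assumption the likelihood, the priors on the other parameters, and the Dirichlet normalising constants all cancel, so the log acceptance ratio only involves the weights: (alpha_i - alpha_j) (Σ log w_j - Σ log w_i).

```python
import numpy as np

def log_swap_ratio(w_i, w_j, alpha_i, alpha_j):
    """Log Metropolis ratio for exchanging the states of chains i and j.

    Assumes both chains share the likelihood and all priors except the
    symmetric Dirichlet(alpha) prior on the weights, and that all weights
    are strictly positive.
    """
    s_i = np.sum(np.log(w_i))
    s_j = np.sum(np.log(w_j))
    return (alpha_i - alpha_j) * (s_j - s_i)

def accept_swap(w_i, w_j, alpha_i, alpha_j, rng):
    """Accept the swap with probability min(1, exp(log_swap_ratio))."""
    return np.log(rng.random()) < log_swap_ratio(w_i, w_j, alpha_i, alpha_j)

# Chain i has a small hyperparameter and chain j a larger one (arbitrary values).
w_i = np.array([0.50, 0.30, 0.20])
w_j = np.array([0.60, 0.39, 0.01])
ratio = log_swap_ratio(w_i, w_j, alpha_i=0.01, alpha_j=1.5)
print("log swap ratio:", ratio)
```

In this toy case the ratio is positive: the state with a nearly empty component sits in the chain with the larger hyperparameter, and the swap moves it to the chain whose prior favours emptying.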
Our own recent experience with mixtures with almost-zero weights (with Judith, Kaniav, and Kerrie) suggests not relying solely on a Gibbs sampler there, as it shows poor mixing and even poorer label switching. The current paper does not seem to meet the same difficulties, maybe thanks to (prior) tempering.
The paper proposes a strategy called Zswitch to resolve label switching, which amounts to identifying a MAP for each possible number of components, followed by a relabelling. I do not entirely understand the way the permutation is constructed, though, and I wonder in particular about the cost of the relabelling.
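To give an idea of where the cost of a relabelling can come from, here is a generic sketch of relabelling towards a reference point such as a MAP estimate. It is emphatically not the Zswitch construction, which is not detailed here: it simply assumes each draw is a vector of K component means and picks, by brute force over all K! permutations, the labelling closest in squared distance to the reference. The function name relabel_to_reference and the toy numbers are made up for illustration.

```python
from itertools import permutations
import numpy as np

def relabel_to_reference(draws, reference):
    """Permute each draw so that it is closest to a reference labelling.

    draws: (T, K) array, one row of component means per MCMC iteration.
    reference: (K,) array, e.g. a MAP estimate of the means.
    The search enumerates all K! permutations for every draw.
    """
    K = reference.shape[0]
    perms = np.array(list(permutations(range(K))))  # shape (K!, K)
    out = np.empty_like(draws)
    for t, d in enumerate(draws):
        # Squared distance to the reference under every candidate permutation.
        cost = ((d[perms] - reference) ** 2).sum(axis=1)
        out[t] = d[perms[cost.argmin()]]
    return out

reference = np.array([-2.0, 0.0, 3.0])
draws = np.array([[3.1, -1.9, 0.2],
                  [0.1, 2.8, -2.2]])
relabelled = relabel_to_reference(draws, reference)
print(relabelled)
```

With T draws this brute-force search costs on the order of T × K! distance evaluations, which is harmless for three components and quickly prohibitive for an over-fitted mixture with ten or more.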