## Finite mixture models do not reliably learn the number of components

When preparing my talk for Padova, I found that Diana Cai, Trevor Campbell, and Tamara Broderick wrote this ICML / PMLR paper last year on the impossibility of estimating the number of components in a mixture.

“A natural check on a Bayesian mixture analysis is to establish that the Bayesian posterior on the number of components increasingly concentrates near the truth as the number of data points becomes arbitrarily large.” Cai, Campbell & Broderick (2021)

Which seems to contradict [my formerly-Glaswegian friend] Agostino Nobile, who showed in his thesis that the posterior on the number of components does concentrate at the true number of components, provided the prior contains that number in its support. As well as numerous papers on the consistency of the Bayes factor, including the one against an infinite mixture alternative, as we discussed in our recent paper with Adrien and Judith. It also reminded me of the rebuke I got in 2001 from the late David MacKay when I mentioned that I did not believe in estimating the number of components, both because of the impact of the prior modelling and because of the tendency of the data to push for more clusters as the sample size increases. (This was at a most lively workshop Mike Titterington and I organised at ICMS in Edinburgh, where Radford Neal also delivered an impromptu talk to argue against using the Galaxy dataset as a benchmark!)

“In principle, the Bayes factor for the MFM versus the DPM could be used as an empirical criterion for choosing between the two models, and in fact, it is quite easy to compute an approximation to the Bayes factor using importance sampling” Miller & Harrison (2018)

Miller & Harrison (2018) had however already made the point that the estimation of k logically goes south when the data does not come from the assumed mixture model. In the current paper, Cai et al. demonstrate that the posterior on the number of components diverges, even when the prior depends on the sample size, or even on the sample itself, as in empirical Bayes solutions.
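As a toy illustration of the importance-sampling route to a Bayes factor mentioned in the quote (a minimal sketch of the generic idea, not Miller & Harrison's actual MFM-versus-DPM computation), here is a comparison of two priors on a Normal mean, chosen so that the marginal likelihoods are also available in closed form for checking:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def log_marginal_is(x, prior_mean, prior_sd, n_draws=100_000):
    """Importance-sampling estimate of the log marginal likelihood
    m(x) = ∫ N(x | θ, 1) N(θ | prior_mean, prior_sd²) dθ,
    using the prior itself as the (naive) proposal."""
    theta = rng.normal(prior_mean, prior_sd, n_draws)
    log_lik = -0.5 * np.log(2 * np.pi) - 0.5 * (x - theta) ** 2
    # log-mean-exp for numerical stability
    m = log_lik.max()
    return m + np.log(np.mean(np.exp(log_lik - m)))

x = 1.3
# Bayes factor of model 1 (prior N(0,1)) against model 2 (prior N(0,3²))
log_bf = log_marginal_is(x, 0.0, 1.0) - log_marginal_is(x, 0.0, 3.0)
# closed form check: m(x) = N(x | prior_mean, 1 + prior_sd²)
exact = norm.logpdf(x, 0, np.sqrt(2.0)) - norm.logpdf(x, 0, np.sqrt(10.0))
```

For the MFM-versus-DPM comparison the integrands are mixture marginals rather than a Normal likelihood, but the estimator has the same shape: average the likelihood ratio under draws from a tractable proposal.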

October 15, 2022 at 6:19 pm

There is no true number of mixture components. So the problem is not that estimating the true number of mixture components is difficult; the problem is that it is impossible, because it is not a well-defined problem, unless you have some sort of strong restriction on the sizes of the components or on how close they can be.

October 16, 2022 at 4:23 pm

Sounds like you belong to the tequila creed when it comes to mixtures! There still are consistency theorems that tell you the BF will eventually single out the true number when it exists. (Obviously, you can argue that real data almost never corresponds to a mixture model with a well defined number of components, I would not argue this point.)

October 15, 2022 at 5:23 pm

In low dimensional mixtures, like Gaussian location mixtures on the real line, the classical NPMLE of Kiefer and Wolfowitz is quite good at estimating the number of components when this number is finite. https://arxiv.org/abs/2008.08244.

Of course this is perhaps a more structured setting than you have in mind.
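For intuition, a grid-based EM approximation to the Kiefer–Wolfowitz NPMLE takes only a few lines (a minimal sketch; the grid size, iteration count, and simulated data are illustrative choices of mine, not those of the linked paper):

```python
import numpy as np

rng = np.random.default_rng(1)

# Data from a 3-component Gaussian location mixture on the real line
means = np.array([-3.0, 0.0, 3.0])
n = 2000
z = rng.integers(0, 3, n)
x = rng.normal(means[z], 1.0)

# Kiefer–Wolfowitz NPMLE approximated on a fixed grid of support points,
# fitted by EM: each grid point is a candidate atom of the mixing measure
grid = np.linspace(x.min(), x.max(), 300)
w = np.full(grid.size, 1.0 / grid.size)          # mixing weights
lik = np.exp(-0.5 * (x[:, None] - grid[None, :]) ** 2) / np.sqrt(2 * np.pi)

for _ in range(500):                             # EM iterations
    post = lik * w                               # n × grid responsibilities
    post /= post.sum(axis=1, keepdims=True)
    w = post.mean(axis=0)

# The fitted mixing distribution concentrates its mass near the true
# component locations
mass_near = {m: w[np.abs(grid - m) < 1.0].sum() for m in means}
```

The actual NPMLE is supported on a small number of atoms, so counting well-separated clumps of non-negligible weight gives an estimate of the number of components in this well-separated, unit-variance setting.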

October 16, 2022 at 4:26 pm

Thanks, this is true and probably connected with the consistency of the Bayes factor in this setting.

October 15, 2022 at 5:58 am

Hi Xi’an, clustering & mixture modeling is a topic that I’m very interested in and have worked on for quite a while. So here are my 2 cents.

From an application perspective, recovering the “true” number of clusters (non-empty components given data) is often too difficult, because nobody knows what exactly the true data-generating process is. For example, correct specification of the mixture components is nearly impossible when we deal with high-dimensional data such as images.

Nevertheless, one major purpose of clustering is to tackle the heterogeneity of the data, by making the partitioned data more interpretable & potentially reducing the confounding issues caused by heterogeneity. For that goal, a probability partition function that favors clustering parsimony is arguably more useful than, say, that of a DP(alpha=1), under which the number of clusters grows rapidly with the sample size.
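That growth rate is easy to check by simulation: under the Chinese restaurant process representation of DP(alpha), the expected number of occupied clusters after n observations is Σ alpha/(alpha+i) for i = 0,…,n−1, which is approximately alpha·log n (a quick sketch, with my own choice of n and number of replications):

```python
import numpy as np

rng = np.random.default_rng(2)

def crp_n_clusters(n, alpha):
    """Number of occupied clusters after n customers in a
    Chinese restaurant process with concentration alpha."""
    counts = []                          # customers per existing cluster
    for i in range(n):
        if rng.random() < alpha / (alpha + i):
            counts.append(1)             # open a new cluster
        else:
            # join an existing cluster proportionally to its size
            j = rng.choice(len(counts), p=np.array(counts) / i)
            counts[j] += 1
    return len(counts)

n = 10_000
sims = [crp_n_clusters(n, alpha=1.0) for _ in range(50)]
avg = np.mean(sims)
# For alpha = 1, E[#clusters] is the harmonic number H_n ≈ log n + 0.577,
# i.e. roughly 9.8 for n = 10,000
```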

From a theory perspective, this is an interesting math problem that has drawn a lot of attention in recent years. I guess one of the reasons is that this is a pathological problem that a lot of us believe to be fixable (in theory) — and we love solving challenging problems! For example, it has been conjectured that DP(alpha) with alpha decreasing as some function of the sample size may actually lead to consistency under a correct component specification (Ohn and Lin 2022 have a related result); more recently, cluster number consistency for infinite mixture models was shown to be achievable via some hyper-prior on alpha (Ascolani, Lijoi, Rebaudo and Zanella 2022), or via a soft truncation during stick breaking (shameless plug of our work: Zeng, Miller and Duan 2022). In the context of component model misspecification, consistency is of course out of the question; however, all of the above solutions provide some mechanism to calibrate the effect of the probability partition function being too generous in adding new clusters.

Therefore, based on the above & on how much energy our community is spending on this, I’m optimistic that there will be some principled solution giving a “better” answer on robust estimation of the cluster/component number.

October 16, 2022 at 4:34 pm

Thank you for the point and the references, the debate over components versus clusters has been raging for a while now. I am less optimistic than you are about achieving consistency for clustering unless strong assumptions are made (?), especially with infinite / NP mixtures.