finite mixture models [book review]

Here is a review of Finite Mixture Models (2000) by Geoff McLachlan & David Peel that I wrote aeons ago (circa 1999), supposedly for JASA, which lost first the files and second the will to publish it. As I was working with my student today, I mentioned the book to her and decided to publish the review here, if only because I think the book deserved a positive review, even after all those years! (Since then, Sylvia Frühwirth-Schnatter published Finite Mixture and Markov Switching Models (2006), which is closer to my perspective on the topic and which I would more naturally recommend.)

Mixture modeling, that is, the use of weighted sums of standard distributions as in

\sum_{i=1}^k p_i f({\mathbf y};{\mathbf \theta}_i)\,,

is a widespread and increasingly used technique to overcome the rigidity of standard parametric distributions such as f(y;θ), while retaining a parametric nature, as discussed in the introduction of my JASA review of Böhning’s (1998) book on non-parametric mixture estimation (Robert, 2000). That review pointed out that, while there are many books available on the topic of mixture estimation, the unsurpassed reference remained the book by Titterington, Smith and Makov (1985) [hereafter TSM]. I also suggested that a new edition of TSM would be quite timely, given the methodological and computational advances of the past 15 years. While it remains unclear whether this new edition will ever appear, the book by McLachlan and Peel gives an enjoyable and fairly exhaustive update on the topic, incorporating the most recent advances on mixtures and some related models.
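For readers who prefer code to formulas, here is a minimal Python sketch of the density above in the univariate normal case. The function names and the restriction to normal components are my own choices for illustration, not something taken from the book. Simulation goes through a latent component label, which is the representation that both EM and Gibbs sampling exploit.

```python
import math
import random


def normal_pdf(y, mu, sigma):
    """Density of N(mu, sigma^2) at y."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def mixture_pdf(y, weights, params):
    """Weighted sum of normal densities; params is a list of (mu, sigma) pairs."""
    return sum(p * normal_pdf(y, mu, sigma)
               for p, (mu, sigma) in zip(weights, params))


def mixture_sample(n, weights, params, rng):
    """Draw n points: pick a component label first, then draw from that component."""
    labels = rng.choices(range(len(weights)), weights=weights, k=n)
    return [rng.gauss(*params[i]) for i in labels]


if __name__ == "__main__":
    # Hypothetical two-component example: a bimodal density no single normal can fit.
    weights = [0.6, 0.4]
    params = [(-2.0, 1.0), (3.0, 1.0)]
    ys = mixture_sample(5, weights, params, random.Random(0))
    print([round(mixture_pdf(y, weights, params), 4) for y in ys])
```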

Geoff McLachlan has been a major actor in the field for at least 25 years, through papers, software (the book concludes with a review of existing software) and books: McLachlan (1992), McLachlan and Basford (1988), and McLachlan and Krishnan (1997). I refer the reader to Lindsay (1989) for a review of the second book, which is a forerunner of, and has much in common with, the present book.

A general introduction (Chapter 1) on mixture models includes a detailed survey of the book, which may appeal to the hurried reader, and a brief history of mixture estimation, as well as notation. A more in-depth treatment of identifiability issues can be found in TSM, as well as a catalogue of the applications of mixture modelling in various domains, which is missing here, although there are about 40 different datasets analysed.

Chapter 2 covers maximum likelihood [ML] estimation and, of course, the EM algorithm which revolutionised ML estimation for latent variable models and, in particular, mixtures. This chapter does not contain proofs or even theorem-like statements about the convergence of the ML estimators (or, more precisely, of some solutions to the ML equations); on the other hand, it gives a detailed study of the implementation of EM, of the role and choice of the starting value, and of the stochastic versions of EM (although simulated annealing techniques such as those of Celeux and Diebolt (1992) and Lavielle and Moulines (1997) are missing).
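To make the discussion of implementation and starting values concrete, here is a minimal sketch of EM for a univariate normal mixture. It is my own illustration rather than code from the book: the function names are hypothetical, and the floor `min_sigma` on the standard deviations is an ad hoc device of mine to keep a component from collapsing onto a single observation.

```python
import math
import random


def normal_pdf(y, mu, sigma):
    """Density of N(mu, sigma^2) at y."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def log_likelihood(data, p, mu, sigma):
    """Observed-data log-likelihood of the mixture."""
    k = len(p)
    return sum(math.log(sum(p[i] * normal_pdf(y, mu[i], sigma[i]) for i in range(k)))
               for y in data)


def em_normal_mixture(data, p, mu, sigma, n_iter=200, min_sigma=1e-3):
    """Run n_iter EM iterations from the starting values (p, mu, sigma)."""
    p, mu, sigma = list(p), list(mu), list(sigma)  # do not modify the caller's lists
    k, n = len(p), len(data)
    for _ in range(n_iter):
        # E-step: posterior probability that each observation comes from each component
        resp = []
        for y in data:
            w = [p[i] * normal_pdf(y, mu[i], sigma[i]) for i in range(k)]
            total = sum(w)
            resp.append([wi / total for wi in w])
        # M-step: weighted updates of the weights, means and standard deviations
        for i in range(k):
            n_i = sum(r[i] for r in resp)
            p[i] = n_i / n
            mu[i] = sum(r[i] * y for r, y in zip(resp, data)) / n_i
            var = sum(r[i] * (y - mu[i]) ** 2 for r, y in zip(resp, data)) / n_i
            sigma[i] = max(math.sqrt(var), min_sigma)  # assumed safeguard, see above
    return p, mu, sigma


if __name__ == "__main__":
    rng = random.Random(0)
    data = [rng.gauss(-2, 1) for _ in range(300)] + [rng.gauss(3, 1) for _ in range(200)]
    print(em_normal_mixture(data, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]))
```

Rerunning the sketch from different starting values is a cheap way to see how much the solution depends on them.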

A remark that applies to the whole book is that the choice of mentioning almost every reference published on the topic (there are 44 pages of references, 40% of which are from after 1995!) somehow gets in the way of clarity and readability: this level of exhaustiveness is fine for a reference book, but it makes the reading harder at the textbook level because sentences like “Celeux and Govaert (1995) have considered the equivalence of the classification ML approach to other clustering criteria under varying assumptions on the group densities” are far too elliptical to be useful.

Chapter 3 goes into more detail about the special case of normal mixtures, which are the number-one type of mixtures used in applications, with many examples, including some where spurious local maxima get in the way of EM. Chapter 12 presents some variations of the EM algorithm which may speed up the algorithm on large databases: IEM, lazy EM, sparse EM, scalable EM, and multiresolution EM.

To my eyes, Chapter 4 is what gives a strong appeal to the book as a current reference on mixtures. Indeed, it covers Bayesian inference for mixtures and obviously concentrates on MCMC techniques. These developments in MCMC methodology occurred in the late 80s with Tanner and Wong (1987) and Gelfand and Smith (1990), that is, after TSM was conceived: the Bayesian computation techniques available in pre-Gibbs days were, compared with their successors, often crude and not always reliable. (It is surprising that, given the impetus brought by EM to ML mixture estimation and the fact that EM is one half of a Gibbs sampler, especially in its stochastic versions, the idea of completing the other half by simulation did not impose itself earlier. But such insights are always much easier a posteriori!) As in earlier chapters, the focus of Chapter 4 is on implementation: some background on MCMC methods is thus assumed, because the half-page of presentation in §4.4.1 is not enough. The authors only present Gibbs sampling algorithms for conjugate priors, although Metropolis–Hastings alternatives can also be used (see, e.g., Celeux, Robert and Hurn, 2000). They also mention results on perfect sampling, improper priors, label switching and the estimation of the number of components by reversible jump techniques, but the interested reader needs to invest in the references provided in the text, as this chapter is not self-contained enough to allow for implementation.
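To show what "completing the other half by simulation" means, here is a minimal sketch of a Gibbs sampler for a normal mixture under conjugate priors. It is my own illustration, not the book's algorithm as printed: I assume a known common standard deviation `sigma`, a Dirichlet(alpha, ..., alpha) prior on the weights and independent N(prior_mean, prior_sd^2) priors on the means, and all names and default values are hypothetical.

```python
import math
import random


def gibbs_normal_mixture(data, k, n_sweeps=500, sigma=1.0,
                         prior_mean=0.0, prior_sd=10.0, alpha=1.0, rng=None):
    """Return a list of (weights, means) draws, one per Gibbs sweep."""
    if rng is None:
        rng = random.Random(0)
    lo, hi = min(data), max(data)
    # Starting values: means spread over the data range, equal weights
    mu = [lo + (i + 0.5) * (hi - lo) / k for i in range(k)]
    p = [1.0 / k] * k
    draws = []
    for _ in range(n_sweeps):
        # 1. Simulate the latent allocations (the counterpart of the E-step)
        counts = [0] * k
        sums = [0.0] * k
        for y in data:
            w = [p[i] * math.exp(-0.5 * ((y - mu[i]) / sigma) ** 2) for i in range(k)]
            z = rng.choices(range(k), weights=w)[0]
            counts[z] += 1
            sums[z] += y
        # 2. Weights given allocations: Dirichlet, drawn as normalised gamma variates
        g = [rng.gammavariate(alpha + counts[i], 1.0) for i in range(k)]
        total = sum(g)
        p = [gi / total for gi in g]
        # 3. Means given allocations: conjugate normal update
        for i in range(k):
            precision = counts[i] / sigma ** 2 + 1.0 / prior_sd ** 2
            mean = (sums[i] / sigma ** 2 + prior_mean / prior_sd ** 2) / precision
            mu[i] = rng.gauss(mean, 1.0 / math.sqrt(precision))
        draws.append((list(p), list(mu)))
    return draws


if __name__ == "__main__":
    rng = random.Random(1)
    data = [rng.gauss(-2, 1) for _ in range(150)] + [rng.gauss(3, 1) for _ in range(100)]
    draws = gibbs_normal_mixture(data, k=2, n_sweeps=300)
    print(draws[-1])
```

Nothing in this sketch addresses label switching: the component indices are only meaningful here because the two groups are well separated.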

The following chapter covers non-normal mixtures, focusing on the important case of mixtures of generalized linear models [GLM], also called mixtures-of-experts and switching regressions in different literatures. The focus is on ML estimation and the steps of the EM algorithm are provided (see Hurn, Justel and Robert, 1999, and Viele and Tong, 2000, for Bayesian solutions). Similarly, Chapter 7 covers the case of multivariate t distributions as robust alternatives to normal mixtures, with EM and ECEM estimation of the t parameters (including the degrees of freedom). Chapters 8 to 11 deal with other specific cases such as factor analysers (Chapter 8), which generalize principal component analysis and are estimated via the AECM alternative of Meng and van Dyk (1997); binned data (Chapter 9); failure time data (Chapter 10); and directional data (Chapter 11), based on the Kent distribution.

Chapter 6 concentrates on the very current and still open problem of assessing the number k of components in a mixture. Due to the weak identifiability of mixtures and to the complex geometry of the parameter space when considering several values of k at once, standard testing tools such as the likelihood ratio test [LRT] do not work as usual. The book recalls the recent works on the distribution of the LRT under the null hypothesis, both theoretical and simulation-based. The authors detail the use of bootstrapped LRTs, with words of caution, and also present Bayesian criteria such as the BIC and Laplace-based methods. This chapter is, unsurprisingly, inconclusive, because of the weak identifiability mentioned above: for arbitrarily large datasets, it is impossible to distinguish between

\sum_{i=1}^k p_i f({\mathbf y};{\mathbf \theta}_i) \quad \hbox{and} \quad \sum_{i=1}^{k+1} q_i f({\mathbf y};{\mathbf \xi}_i)\,,

because of the contiguity between both representations. Unless some separating constraint is imposed, either in the Ghosh and Sen (1985) format or through a penalisation factor, it seems to me the testing problem about the number of components is fundamentally meaningless. (The Bayesian solution of the estimation of k is much more satisfactory in that it incorporates the above penalisation in the prior distribution.)
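To spell out the degeneracy, note that any k-component mixture can be rewritten exactly as a (k+1)-component mixture, for instance by splitting its last component: for any 0 ≤ α ≤ 1,

\sum_{i=1}^k p_i f({\mathbf y};{\mathbf \theta}_i) = \sum_{i=1}^{k-1} p_i f({\mathbf y};{\mathbf \theta}_i) + \alpha p_k f({\mathbf y};{\mathbf \theta}_k) + (1-\alpha) p_k f({\mathbf y};{\mathbf \theta}_k)\,,

which corresponds to q_i = p_i and ξ_i = θ_i for i < k, q_k = α p_k, q_{k+1} = (1−α) p_k and ξ_k = ξ_{k+1} = θ_k. The k-component model thus sits inside the (k+1)-component model as a whole family of parameter values (including the boundary case q_{k+1} = 0 when α = 1) rather than as a single point, which is why the usual asymptotics of the LRT do not apply.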

Chapter 13 deals with one of the many possible extensions of a mixture model, namely the setup of hidden Markov models. Such models are of interest in many areas; besides, they were one of the first models to use the EM algorithm (Baum and Petrie, 1966). The setting is in addition very contemporary: applications in signal processing, finance or genetics abound, while theoretical results on the limiting properties of the ML estimate have been obtained only recently (Bickel, Ritov and Rydén, 1998; Douc and Matias, 2000; Douc, Moulines and Rydén, 2001). Given the scope of this field, which would call for a volume of its own (such as the future Andrieu and Doucet, 2001), the chapter only alludes to some possible areas, and misses others like stochastic volatility models (Kim, Shephard and Chib, 1998). But it is nonetheless a nice entry into this new domain.
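The link with mixtures can be made concrete: a hidden Markov model replaces the independent component labels of a mixture with a Markov chain. Here is a minimal sketch of the scaled forward recursion that computes the likelihood, written by me for illustration, with normal emissions and hypothetical function names as assumptions. When every row of the transition matrix equals the weight vector, the result reduces to the log-likelihood of an ordinary mixture.

```python
import math


def normal_pdf(y, mu, sigma):
    """Density of N(mu, sigma^2) at y."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def hmm_log_likelihood(data, init, trans, mus, sigmas):
    """Log-likelihood of a normal-emission HMM by the scaled forward recursion.

    init[j] is the initial probability of state j and trans[i][j] the
    probability of moving from state i to state j.
    """
    k = len(init)
    log_lik = 0.0
    filtered = None  # state probabilities given the observations so far
    for t, y in enumerate(data):
        if t == 0:
            predicted = list(init)
        else:
            # One-step-ahead state probabilities
            predicted = [sum(filtered[i] * trans[i][j] for i in range(k))
                         for j in range(k)]
        joint = [predicted[j] * normal_pdf(y, mus[j], sigmas[j]) for j in range(k)]
        norm = sum(joint)  # conditional density of y given the past observations
        log_lik += math.log(norm)
        filtered = [a / norm for a in joint]
    return log_lik


if __name__ == "__main__":
    p = [0.6, 0.4]
    data = [-1.5, 2.7, 0.3, -2.2]
    # Identical rows: the labels are independent, so this is a plain mixture
    print(hmm_log_likelihood(data, p, [p, p], [-2.0, 3.0], [1.0, 1.0]))
```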

To conclude, I hope it is clear that I consider this book a good monograph on the current trends in mixture estimation; from using it as support in graduate school, I can also add that it is only appropriate as a textbook for an advanced audience, since there are no exercises and readers are forced to get involved in the literature to get a clear picture of the finer details. Nonetheless, it is a welcome addition to the field that most people working on mixture analysis should consider buying.
