Archive for mixture estimation

finite mixture models [book review]

Posted in Books, Kids, Statistics, University life with tags , , , , , , , , , , , on February 17, 2014 by xi'an

Here is a review of Finite Mixture Models (2000) by Geoff McLachlan & David Peel that I wrote aeons ago (circa 1999), supposedly for JASA, which lost first the files and second the will to publish it. As I was working with my student today, I mentioned the book to her and decided to publish it here, if only because I think the book deserved a positive review, even after all those years! (Since then, Sylvia Frühwirth-Schnatter published Finite Mixture and Markov Switching Models (2004), which is closer to my perspective on the topic and that I would more naturally recommend.)

Mixture modeling, that is, the use of weighted sums of standard distributions as in

\sum_{i=1}^k p_i f({\mathbf y};{\mathbf \theta}_i)\,,

is a widespread and increasingly used technique to overcome the rigidity of standard parametric distributions such as f(y;θ), while retaining a parametric nature, as exposed in the introduction of my JASA review to Böhning’s (1998) book on non-parametric mixture estimation (Robert, 2000). This review pointed out that, while there are many books available on the topic of mixture estimation, the unsurpassed reference remained the book by Titterington, Smith and Makov (1985)  [hereafter TSM]. I also suggested that a new edition of TSM would be quite timely, given the methodological and computational advances that took place in the past 15 years: while it remains unclear whether or not this new edition will ever take place, the book by McLachlan and Peel gives an enjoyable and fairly exhaustive update on the topic, incorporating the most recent advances on mixtures and some related models.

Geoff McLachlan has been a major actor in the field for at least 25 years, through papers, software—the book concludes with a review of existing software—and books: McLachlan (1992), McLachlan and Basford (1988), and McLachlan and Krishnan (1997). I refer the reader to Lindsay (1989) for a review of the second book, which is a forerunner of, and has much in common with, the present book. Continue reading

Importance sampling schemes for evidence approximation in mixture models

Posted in R, Statistics, University life with tags , , , , , , , , , on November 27, 2013 by xi'an

boximpJeong Eun (Kate) Lee and I completed this paper, “Importance sampling schemes for evidence approximation in mixture models“, now posted on arXiv. (With the customary one-day lag for posting, making me bemoan the days of yore when arXiv would give a definitive arXiv number at the time of submission.) Kate came twice to Paris in the past years to work with me on this evaluation of Chib’s original marginal likelihood estimate (also called the candidate formula by Julian Besag). And on the improvement proposed by Berkhof, van Mechelen, and Gelman (2003), based on averaging over all permutations, idea that we rediscovered in an earlier paper with Jean-Michel Marin. (And that Andrew seemed to have completely forgotten. Despite being the very first one to publish [in English] a paper on a Gibbs sampler for mixtures.) Given that this averaging can get quite costly, we propose a preliminary step to reduce the number of relevant permutations to be considered in the averaging, removing far-away modes that do not contribute to the Rao-Blackwell estimate and called dual importance sampling. We also considered modelling the posterior as a product of k-component mixtures on the components, following a vague idea I had in the back of my mind for many years, but it did not help. In the above boxplot comparison of estimators, the marginal likelihood estimators are

  1. Chib’s method using T = 5000 samples with a permutation correction by multiplying by k!.
  2. Chib’s method (1), using T = 5000 samples which are randomly permuted.
  3. Importance sampling estimate (7), using the maximum likelihood estimate (MLE) of the latents as centre.
  4. Dual importance sampling using q in (8).
  5. Dual importance sampling using an approximate in (14).
  6. Bridge sampling (3). Here, label switching is imposed in hyperparameters.

AMOR at 5000ft in a water tank…

Posted in Mountains, pictures, Statistics, University life with tags , , , , , , , , , , , , , , on November 22, 2012 by xi'an

On Monday, I attended the thesis defence of Rémi Bardenet in Orsay as a member (referee) of his thesis committee. While this was a thesis in computer science, which took place in the Linear Accelerator Lab in Orsay, it was clearly rooted in computational statistics, hence justifying my presence in the committee. The justification (!) for the splashy headline of this post is that Rémi’s work was motivated by the Pierre-Auger experiment on ultra-high-energy cosmic rays, where particles are detected through a network of 1600 water tanks spread over the Argentinian Pampa Amarilla on an area the size of Rhode Island (where I am incidentally going next week).

The part of Rémi’s thesis presented during the defence concentrated on his AMOR algorithm, arXived in a paper written with Olivier Cappé and Gersende Fort. AMOR stands for adaptive Metropolis online relabelling and combines adaptive MCMC techniques with relabelling strategies to fight label-switching (e.g., in mixtures). I have been interested in mixtures for eons (starting in 1987 in Ottawa with applying Titterington, Smith, and Makov to chest radiographs) and in label switching for ages (starting at the COMPSTAT conférence in Bristol in 1998). Rémi’s approach to the label switching problem follows the relabelling path, namely a projection of the original parameter space into a smaller subspace (that is also a quotient space) to avoid permutation invariance and lack of identifiability. (In the survey I wrote with Kate Lee, Jean-Michel Marin and Kerrie Mengersen, we suggest using the mode as a pivot to determine which permutation to use on the components of the mixture.) The paper suggests using an Euclidean distance to a mean determined adaptively, μt, with a quadratic form Σt also determined on-the-go, minimising (Pθ-μt)TΣt(Pθ-μt) over all permutations P at each step of the algorithm. The intuition behind the method is that the posterior over the restricted space should look like a roughly elliptically symmetric distribution, or at least like a unimodal distribution, rather than borrowing bits and pieces from different modes. While I appreciate the technical tour de force represented by the proof of convergence of the AMOR algorithm, I remain somehow sceptical about the approach and voiced the following objections during the defence: first, the assumption that the posterior becomes unimodal under an appropriate restriction is not necessarily realistic. Secondary modes often pop in with real data (as in the counter-example we used in our paper with Alessandra Iacobucci and Jean-Michel Marin). Next, the whole apparatus of fighting multiple modes and non-identifiability, i.e. fighting label switching, is to fall back on posterior means as Bayes estimators. As stressed in our JASA paper with Gilles Celeux and Merrilee Hurn, there is no reason for doing so and there are several reasons for not doing so:

  • it breaks down under model specification, i.e., when the number of components is not correct
  • it does not improve the speed of convergence but, on the opposite, restricts the space visited by the Markov chain
  • it may fall victim to the fatal attraction of secondary modes by fitting too small an ellipse around one of those modes
  • it ultimately depends on the parameterisation of the model
  • there is no reason for using posterior means in mixture problems, posterior modes or cluster centres can be used instead

I am therefore very much more in favour of producing a posterior distribution that is as label switching as possible (since the true posterior is completely symmetric in this respect). Post-processing the resulting sample can be done by using off-the-shelf clustering in the component space, derived from the point process representation used by Matthew Stephens in his thesis and subsequent papers. It also allows for a direct estimation of the number of components.

In any case, this was a defence worth-attending that led me to think afresh about the label switching problem, with directions worth exploring next month while Kate Lee is visiting from Auckland. Rémi Bardenet is now headed for a postdoc in Oxford, a perfect location to discuss further label switching and to engage into new computational statistics research!

Number of components in a mixture

Posted in Books, R, Statistics, University life with tags , , , , , on August 6, 2011 by xi'an

I got a paper (unavailable online) to referee about testing for the order (i.e. the number of components) of a normal mixture. Although this is an easily spelled problem, namely estimate k in

\sum_{i=1}^k p_{ik} \mathcal{N}(\mu_{ik},\sigma^2_{ik}),

I came to the conclusion that it is a kind of ill-posed problem. Without a clear definition of what a component is, i.e. without a well-articulated prior distribution, I remain unconvinced that k can be at all estimated. Indeed, how can we distinguish between a k component mixture and a (k+1) component mixture with an extremely small (in the sense of the component weight) additional component? Solutions ending up with a convenient chi-square test thus sound unrealistic to me… I am not implying the maths are wrong in any way, simply that the meaning of the test and the nature of the null hypothesis are unclear from a practical and methodological perspective. In the case of normal (but also Laplace) mixtures, the difficulty is compounded by the fact that the likelihood function is unbounded, thus wide open to over-fitting (at least in a non-Bayesian setting). Since Ghosh and Sen (1985), authors have come up with various penalisation functions, but I remain openly atheistic about the approach! (I do not know whether or not this is related with the summer season, but I have received an unusual number of papers to referee lately, e.g., handling three papers last Friday, one on Saturday, and yet another one on Monday morning. Interestingly, about half of them are from  non-statistical journals!)

Improving convergence of Data Augmentation algorithms

Posted in Mountains, Statistics, University life with tags , , , , on June 7, 2011 by xi'an

Following an earlier submission to Statistical Science, we have now resubmitted and arXived the new version of our paper “Improving the convergence properties of the Data Augmentation algorithm with an application to Bayesian mixture modelling”, written with Jim Hobert (University of Florida), and Vivek Roy (Iowa State University). Given that both referees were quite positive about the earlier version, the changes are truly minor and overwhelmingly stylistic. Again, I am I am very glad to be part of this paper because of the results but also because it relates to a problem I discussed at length with Richard Tweedie when I visited him in Colorado in 1993… (The above picture of Richard, along with Gareth Roberts and Anto Mira, was taken during the TMR Workshop on Computational and Spatial Statistics I organised in Aussois in 1998, with unfortunately no remaining webpage! A pre-MCMC’ski of sorts, when I had not yet started skiing!)


Get every new post delivered to your Inbox.

Join 557 other followers