## Archive for Gaussian mixture

Posted in Mountains, Statistics, Travel, University life on April 21, 2014 by xi'an

As I was flying over Skye (with [maybe] a first if hazy perspective on the Cuillin ridge!) to Iceland, three long sets of replies to some of my posts appeared on the ‘Og:

Thanks to them for taking the time to answer my musings…

## MCMC for sampling from mixture models

Posted in Kids, Statistics, University life on April 17, 2014 by xi'an

Randal Douc, Florian Maire, and Jimmy Olsson recently arXived a paper on the use of Markov chain Monte Carlo methods for the sampling of mixture models, which resorts to Carlin and Chib (1995) pseudo-priors to simulate from a mixture distribution (and not from the posterior distribution associated with a mixture sampling model). As reported earlier, I was at the thesis defence of Florian Maire and this approach had already puzzled me at the time. In short, a mixture structure

$\pi(z)\propto\sum_{m=1}^k \tilde\pi(m,z)$

gives rise to as many auxiliary variables as there are components, minus one: namely, if a simulation z is generated from a given component i of the mixture, one can create pseudo-simulations u from all the other components, using pseudo-priors à la Carlin and Chib. A Gibbs sampler based on this augmented state-space can then be implemented: (a) simulate a new component index m given (z,u); (b) simulate a new value of (z,u) given m. One version (MCC) of the algorithm simulates z given m from the proper conditional posterior by a Metropolis step, while another one (FCC) only simulates the u’s. The paper shows that MCC has a smaller asymptotic variance than FCC.

I however fail to understand why a Carlin and Chib step is necessary in a mixture context: it seems (from the introduction) that the motivation is that a regular Gibbs sampler [simulating z by a Metropolis-Hastings proposal then m] has difficulties moving between components when those components are well-separated. This is correct but slightly moot, as each component of the mixture can be simulated separately and in advance in z, which leads to a natural construction of (a) the pseudo-priors used in the paper, (b) approximations to the weights of the mixture, and (c) a global mixture proposal, which can be used in an independent Metropolis-Hastings algorithm that [seems to me to] alleviate(s) the need to simulate the component index m. Both examples used in the paper, a toy two-component two-dimensional Gaussian mixture and another toy two-component one-dimensional Gaussian mixture observed with noise (and in absolute value), do not help in perceiving the definitive need for this Carlin and Chib version. Especially when considering the construction of the pseudo-priors.
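To illustrate the alternative in (c), here is a minimal sketch of an independent Metropolis-Hastings sampler with a global mixture proposal. Everything in it is an assumption made for illustration and none of it comes from the paper: the one-dimensional target `COMPONENTS`, the inflated proposal `PROPOSAL` with deliberately rough weights, and all function names.

```python
import math
import random

def normal_pdf(x, mu, sd):
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

# Hypothetical toy target: two well-separated components, (mean, sd, weight).
COMPONENTS = [(-5.0, 1.0, 0.3), (5.0, 0.5, 0.7)]

# Hypothetical global proposal: one inflated Gaussian per component, with
# rough (here equal) weights standing in for preliminary approximations.
PROPOSAL = [(-5.0, 1.5, 0.5), (5.0, 1.0, 0.5)]

def mixture_pdf(z, mixture):
    return sum(w * normal_pdf(z, mu, sd) for mu, sd, w in mixture)

def mixture_draw(rng, mixture):
    u, cum = rng.random(), 0.0
    for mu, sd, w in mixture:
        cum += w
        if u <= cum:
            return rng.gauss(mu, sd)
    mu, sd, _ = mixture[-1]  # guard against rounding in the cumulative sum
    return rng.gauss(mu, sd)

def independent_mh(n_iter, seed=0):
    """Independent MH: candidates never depend on the current state."""
    rng = random.Random(seed)
    z = mixture_draw(rng, PROPOSAL)
    ratio_z = mixture_pdf(z, COMPONENTS) / mixture_pdf(z, PROPOSAL)
    chain = []
    for _ in range(n_iter):
        cand = mixture_draw(rng, PROPOSAL)
        ratio_c = mixture_pdf(cand, COMPONENTS) / mixture_pdf(cand, PROPOSAL)
        # Accept with probability min(1, ratio_c / ratio_z).
        if rng.random() * ratio_z <= ratio_c:
            z, ratio_z = cand, ratio_c
        chain.append(z)
    return chain

chain = independent_mh(20000, seed=1)
share_right = sum(z > 0 for z in chain) / len(chain)
```

In this toy setting the chain jumps between the two modes without ever simulating a component index, and `share_right` should settle near the weight 0.7 of the right-hand component even though the proposal weights are wrong. Whether this carries over to harder targets depends on how well each component can be approximated in advance.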

Posted in Books, Kids, Statistics, University life on January 28, 2014 by xi'an

Today was the very last session of our Reading Classics Seminar for the academic year 2013-2014. We listened to two presentations: one on the Casella and Strawderman (1984) paper on the estimation of a bounded normal mean, and one on Hartigan and Wong’s (1979) K-Means Clustering Algorithm paper in JRSS C. The first presentation did not go well as my student had difficulties with the maths behind the paper. (As he did not come to ask me or others for help, it may well be that he put this talk together at the last minute, at a time busy with finals and project deliveries. He also failed to exploit those earlier presentations of the paper.) The innovative part in the talk was the presentation of several R simulations comparing the risk of the minimax Bayes estimator with that of the MLE, although the choice of simulating different samples of standard normals for different values of the parameter, and even for both estimators, made the curves (unnecessarily) all wiggly.
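The wiggliness can be avoided by reusing the same standard normal draws for every parameter value and for both estimators (common random numbers). The sketch below is in Python rather than the R used in the talk, and rests on assumptions that are not from the talk: a single observation x ~ N(θ,1) with |θ| ≤ m, the MLE taken as x truncated to [-m, m], and the Bayes estimator m·tanh(mx) for the two-point prior on ±m, which is only minimax when m is small enough (roughly m ≤ 1). Function names are hypothetical.

```python
import math
import random

def mle(x, m):
    # MLE of a normal mean restricted to [-m, m]: project x onto the interval.
    return max(-m, min(m, x))

def two_point_bayes(x, m):
    # Posterior mean under the prior putting mass 1/2 on each of -m and +m.
    return m * math.tanh(m * x)

def risk_curves(m=1.0, n_grid=21, n_sim=10000, seed=0):
    """Monte Carlo quadratic risks on a grid of theta, with common random numbers."""
    rng = random.Random(seed)
    # Drawn once, then reused for every theta and for both estimators.
    noise = [rng.gauss(0.0, 1.0) for _ in range(n_sim)]
    thetas = [-m + 2.0 * m * i / (n_grid - 1) for i in range(n_grid)]
    risk_mle, risk_bayes = [], []
    for theta in thetas:
        xs = [theta + e for e in noise]
        risk_mle.append(sum((mle(x, m) - theta) ** 2 for x in xs) / n_sim)
        risk_bayes.append(sum((two_point_bayes(x, m) - theta) ** 2 for x in xs) / n_sim)
    return thetas, risk_mle, risk_bayes

thetas, risk_mle, risk_bayes = risk_curves()
```

Since the Monte Carlo error is then shared across the grid, both curves come out smooth in θ and their difference is much less noisy than with independent samples.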

By contrast, the second presentation was very well-designed, with great Beamer slides, interactive features and a software-oriented focus. My student Mouna Berrada started from the existing R function kmeans to explain the principles of the algorithm, recycling the interactive presentation of last year as well (with my permission), and creating a dynamic flowchart that was most helpful. So she made the best of this very short paper! Just (predictably) missing the question of the statistical model behind the procedure. During the discussion, I wondered why k-medians clustering was not more popular as it offered higher robustness guarantees, albeit further away from a genuine statistical model. And why k-means clustering was not more systematically compared with mixture (EM) estimation.
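As a toy illustration of the robustness point, here is a minimal sketch contrasting k-means and k-medians on one-dimensional data with a single outlier. It uses a plain Lloyd-type alternation, which is neither the Hartigan and Wong algorithm nor R's kmeans; the data, the starting centres and the function name `lloyd_1d` are all made up for the example.

```python
import statistics

def lloyd_1d(points, centres, centre_fn, n_iter=50):
    """Alternate nearest-centre assignment and centre updates (1-D only)."""
    centres = list(centres)
    for _ in range(n_iter):
        clusters = [[] for _ in centres]
        for x in points:
            # In one dimension, squared and absolute distances pick the same centre.
            j = min(range(len(centres)), key=lambda i: abs(x - centres[i]))
            clusters[j].append(x)
        # Keep the old centre if a cluster ends up empty.
        new = [centre_fn(c) if c else centres[i] for i, c in enumerate(clusters)]
        if new == centres:
            break
        centres = new
    return centres

# Two tight groups near 0.5 and 9.5, plus one gross outlier at 100.
data = [0.0, 0.5, 1.0, 9.0, 9.5, 10.0, 100.0]
kmeans_centres = lloyd_1d(data, [0.0, 10.0], statistics.mean)
kmedians_centres = lloyd_1d(data, [0.0, 10.0], statistics.median)
```

From these starting values, the mean-based version ends up giving the outlier its own cluster and merging the two genuine groups, while the median-based version keeps a centre within each group. This is a single hand-picked configuration, so it shows the mechanism without being evidence of a general ranking.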

Here are the slides for the second talk

## back to moments

Posted in Statistics, University life on March 23, 2012 by xi'an

A recent paper posted on arXiv considers afresh the method of moments for mixtures of distributions. (“Afresh”, because the method was introduced by Karl Pearson in the 1890s…) The authors (Animashree Anandkumar, Daniel Hsu, and Sham Kakade) estimate the parameters of a mixture of multinomial distributions (motivated as a “bag of words document topic” model) via the moment representation of pairwise and triplewise probabilities. The estimate is obtained by a simple matrix formula using the empirical frequencies for pairs and triplets. The principle also applies to non-multinomial mixtures with components that are defined/parameterised by their mean (or rather first moments?), like Gaussian mixtures.
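A possible reading of this moment representation, in notation that is assumed here and not necessarily the paper's: take k components with weights w, let M be the d×k matrix whose j-th column $\mu_j$ holds the word probabilities of component j, and let $x_1,x_2,x_3$ be three words from the same document, coded as indicator vectors in $\mathbb{R}^d$ and independent given the component. Then, for any vector $\eta$ in $\mathbb{R}^d$,

$\text{Pairs}=\mathbb{E}[x_1x_2^\top]=\sum_{j=1}^k w_j\,\mu_j\mu_j^\top=M\,\text{diag}(w)\,M^\top$

$\text{Triples}(\eta)=\mathbb{E}[x_1x_2^\top\,(\eta^\top x_3)]=\sum_{j=1}^k w_j\,(\eta^\top\mu_j)\,\mu_j\mu_j^\top=M\,\text{diag}(M^\top\eta)\,\text{diag}(w)\,M^\top$

In the special case where k=d, M is invertible and all weights are positive, this gives

$\text{Triples}(\eta)\,\text{Pairs}^{-1}=M\,\text{diag}(M^\top\eta)\,M^{-1}$

so the columns of M are eigenvectors of a matrix that only involves pairwise and triplewise moments, which can be replaced with empirical frequencies. The case k<d would call for a preliminary projection on a k-dimensional subspace, which this sketch leaves out.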

This is neat, but there are a few caveats: (1) contrary to standard mixtures, the paper assumes that ℓ observations are made at once from a given component: in other words, components are drawn at random according to a multinomial distribution, then ℓ observations are generated from this given component. (This is rather unusual, esp. given that ℓ is the same across all samples. It should be feasible to extend the results in the paper to varying ℓ’s…) (2) while the pairwise and triplewise statistics remain low-order moments, avoiding the criticism raised against Pearson’s original estimator, those pairwise and even more triplewise frequency estimators quickly deteriorate as the number d of words in the vocabulary/dimension of the parameter increases, since there should be more and more zeros. (For a D-dimensional Gaussian mixture with both mean and covariance matrix unknown, the authors consider the dimension is D/ℓ but this seems strange given the D+D²/2 parameters to estimate for each component…)
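The growth in the number of zeros in caveat (2) can be checked with a small simulation. The sketch below is only indicative and relies on simplifying assumptions that are not from the paper: a single component, words drawn uniformly from a vocabulary of size d, and exactly three words per document. The function name `zero_fractions` is hypothetical.

```python
import random

def zero_fractions(d, n_docs, seed=0):
    """Fractions of empty cells in the empirical pair and triple tables."""
    rng = random.Random(seed)
    pairs, triples = set(), set()
    for _ in range(n_docs):
        a, b, c = (rng.randrange(d) for _ in range(3))
        pairs.add((a, b))         # non-empty cell of the d x d table
        triples.add((a, b, c))    # non-empty cell of the d x d x d table
    return 1.0 - len(pairs) / d ** 2, 1.0 - len(triples) / d ** 3

for d in (5, 20, 50):
    zero_pairs, zero_triples = zero_fractions(d, n_docs=1000)
    print(d, round(zero_pairs, 3), round(zero_triples, 3))
```

With n documents, at most n cells can be filled in either table, so the fraction of zeros is at least 1 - n/d² for pairs and 1 - n/d³ for triples, whatever the word distribution.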