a Nice talk

Posted in Books, Statistics, Travel, University life on February 20, 2015 by xi'an

Today, I give a talk on our testing paper in Nice, in a workshop run in connection with our Calibration ANR grant.

The slides are directly extracted from the paper, but it still took me quite a while to distil the paper into them, during the early hours of our Czech break this week.

One added perk of travelling to Nice is the flight there, as it parallels the entire French Alps, a terrific view in nice weather!

relabelling mixtures (#2)

Posted in Statistics, Travel, University life on February 5, 2015 by xi'an

Following the previous post, I went and had a (long) look at Puolamäki and Kaski’s paper. I must acknowledge that, despite several passes through the paper, I still have trouble with the approach… From what I understand, the authors use a Bernoulli mixture pseudo-model to reallocate the observations to components. That is, given an MCMC output with simulated allocation variables (a.k.a., hidden or latent variables), they create a (T×K)×n matrix of component binary indicators, e.g., for a three-component mixture,

0 1 0 0 1 0…
1 0 0 0 0 0…
0 0 1 1 0 1…
0 1 0 0 1 1…

and estimate, for each of the n observations, a probability of belonging to component j, according to the (pseudo-)likelihood

$\prod_{r=1}^R \sum_{j=1}^K \prod_{i=1}^n \beta_{i,j}^{z_{i,r}}(1-\beta_{i,j})^{1-z_{i,r}}$
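For concreteness, here is a minimal sketch of how this pseudo-likelihood could be evaluated (my own illustration with made-up names, not code from the paper; note that, as written, the sum over j carries no mixture weights):

```python
import numpy as np
from scipy.special import logsumexp

def log_pseudo_likelihood(Z, beta):
    """Log of the Bernoulli-mixture pseudo-likelihood above.

    Z    : (R, n) binary matrix, one row per (iteration, component)
           indicator vector taken from the MCMC output
    beta : (n, K) matrix, beta[i, j] = probability that observation i
           is "in" component j
    """
    # log-density of each row r under each component j: an (R, K) array
    logp = Z @ np.log(beta) + (1 - Z) @ np.log1p(-beta)
    # sum over j inside the product over the rows r
    return logsumexp(logp, axis=1).sum()
```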

It took me a few days, between morning runs and those wee hours when I cannot get back to sleep (!), to make some sense of this Bernoulli modelling. The allocation vectors are pooled to estimate the probabilities of being “in” component j. However the data does not seem appropriate for such a model: it is the outcome of an MCMC simulation, hence de facto does not originate from that Bernoulli mixture, and it is made of blocks of highly correlated rows [which sum up to one]. The Bernoulli likelihood above also defines a new model, with many more parameters than in the original mixture model. And I fail to see why perfect, partial or inexistent label switching [in the MCMC sequence] would not impact the estimation of the Bernoulli mixture. Or why an argument based on a fixed parameter value (Theorem 3) extends to an MCMC output where the parameters themselves are subject to some degree of label switching. Bemused, I remain…

relabelling mixtures

Posted in Books, Statistics on January 30, 2015 by xi'an

Another short paper about relabelling in mixtures was arXived last week by Pauli and Torelli. They refer rather extensively to a previous paper by Puolamäki and Kaski (2009), of which I was not aware, a paper attempting to build an unswitching sampler, that is, one that does not exhibit any label switching. A concept I find most curious, as I see no rigorous way to state that a sampler is not switching! This would imply spotting the low posterior probability regions that the chain would have to cross in order to switch. But I should check the paper nonetheless.

Because the G-component mixture posterior is invariant under the G! possible permutations of the labels, I am somewhat undecided as to what the authors of the current paper mean by estimating the difference between two means, like μ₁-μ₂, since they object to using the output of a perfectly mixing MCMC algorithm and seem to prefer the one associated with a non-switching chain. Or by estimating the probability that a given observation is from a given component, since this is exactly 1/G by the permutation invariance property. In order to identify a partition of the data, they introduce a loss function on the joint allocations of pairs of observations, a loss function that sounds quite similar to the one we used in our 2000 JASA paper on the label switching deficiencies of MCMC algorithms. (Which makes me wonder why this work of ours is not deemed relevant for the approach advocated in the paper!) Still, having read this paper, which I find rather poorly written, I have no clear understanding of how the authors give a precise meaning to a specific component of the mixture distribution. Or of how the relabelling has to be conducted to avoid switching. That is, of how the authors define their parameter space. Or their loss function. Unless one falls back onto the ordering of the means or of the weights, which has the drawback of not connecting with the level sets of a particular mode of the posterior distribution, meaning that imposing the constraints results in a region that contains bits of several modes.
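To make the pairwise idea concrete, here is a small label-invariant summary in the same spirit, the posterior co-allocation matrix (my own toy illustration, not the exact loss function of either paper):

```python
import numpy as np

def coallocation_matrix(alloc):
    """Label-invariant pairwise summary: entry (a, b) is the proportion
    of MCMC iterations in which observations a and b share a component.

    alloc : (T, n) integer matrix of sampled component labels
    """
    # broadcast the comparison over all pairs, average over iterations
    return (alloc[:, :, None] == alloc[:, None, :]).mean(axis=0)
```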

At some point the authors assume the data can be partitioned into K≤G groups such that there is a representative observation within each group never sharing a component (across MCMC iterations) with any of the other representatives. While this notion is label-invariant, I wonder (a) whether this is possible for any MCMC output; (b) whether it indicates a positive or negative feature of the MCMC sampler; and (c) what prevents the representatives from switching in harmony from one component to the next while preserving their perfect mutual exclusion… This nonetheless constitutes the advance in the paper, namely that component-dependent quantities are estimated as those associated with a particular representative. Note that the paper contains no illustration, hence that the method may prove hard or even impossible to implement!
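Since the paper offers no illustration, here is a toy check of that mutual-exclusion condition on a matrix of simulated allocations (my own sketch, with made-up names):

```python
import numpy as np
from itertools import combinations

def valid_representatives(alloc, reps):
    """Check that no two representative observations ever share a
    component at any MCMC iteration.

    alloc : (T, n) integer matrix of sampled component labels
    reps  : indices of the K candidate representative observations
    """
    return all(np.all(alloc[:, a] != alloc[:, b])
               for a, b in combinations(reps, 2))
```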

Challis Lectures

Posted in Books, pictures, Statistics, Travel, University life, Wines on November 23, 2014 by xi'an

I had a great time during this short visit to the Department of Statistics, University of Florida, Gainesville. First, it was a major honour to be the 2014 recipient of the George H. Challis Award, and I considerably enjoyed delivering my lectures on mixtures and on ABC with random forests, and chatting with members of the audience about their contents afterwards. I brought the physical award back to my office.

More as a piece of trivia, here is all the information about the George H. Challis Award that I could find on the UF website:

This fund was established in 2000 by Jack M. and Linda Challis Gill and the Gill Foundation of Texas, in memory of Linda’s father, to support faculty and student conference travel awards and the George Challis Biostatistics Lecture Series. George H. Challis was born on December 8, 1911 and was raised in Italy and Indiana. He was the first cousin of Indiana composer Cole Porter. George earned a degree in 1933 from the School of Business at Indiana University in Bloomington. George passed away on May 6, 2000. His wife, Madeline, passed away on December 14, 2009.

Cole Porter, indeed!

On top of this lecturing activity, I had a full academic agenda, discussing our respective research themes with most faculty members and PhD students of the Department over the two days I was there, and it felt like there was not enough time! And then, during the few remaining hours when I did not try to stay on French time (!), I had a great time with my friends Jim and Maria in Gainesville, tasting a fantastic local IPA from Cigar City Brewery and several great (non-local) red wines… Adding to that a pile of new books, a smooth trip both ways, and a chance encounter with Alicia in Atlanta airport, it was a brilliant extended weekend!

importance sampling schemes for evidence approximation [revised]

Posted in Statistics, University life on November 18, 2014 by xi'an

After a rather intense period of new simulations and versions, Jeong Eun (Kate) Lee and I have now resubmitted our paper on (some) importance sampling schemes for evidence approximation in mixture models to Bayesian Analysis. There is no fundamental change in the new version, but rather a more detailed description of what those importance schemes mean in practice. The original idea in the paper is to improve upon the Rao-Blackwellisation solution proposed by Berkoff et al. (2002) and later by Marin et al. (2005) to avoid the impact of label switching on Chib’s formula. The Rao-Blackwellisation consists in averaging over all permutations of the labels, while the improvement relies on the elimination of useless permutations, namely those that produce a negligible conditional density in Chib’s (candidate’s) formula. While the improvement implies truncating the overall sum and hence induces a potential bias (which was the concern of one referee), the determination of the irrelevant permutations after relabelling next to a single mode does not appear to cause any bias, while reducing the computational burden. Referees also made us aware of many recent proposals that lead to different evidence approximations, albeit not directly related to our purpose. (One was Rodríguez and Walker, 2014, discussed and commented in a recent post.)
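For the curious reader, here is a schematic sketch of the truncation idea (my own rendering with made-up array names, not the code of the paper): permutations whose conditional density is uniformly negligible over the whole chain are removed before the Rao-Blackwellised average.

```python
import numpy as np
from scipy.special import logsumexp

def chib_denominator(log_cond, tol=50.0):
    """Rao-Blackwellised estimate of the posterior density appearing in
    Chib's (candidate's) formula, averaged over MCMC iterations and
    over component permutations, dropping useless permutations.

    log_cond : (T, P) array of log conditional densities of the plug-in
               value, one column per each of the P = k! permutations
    tol      : drop a permutation whose best log-density over the chain
               lies more than tol below the overall maximum
    """
    keep = log_cond.max(axis=0) > log_cond.max() - tol
    T, P = log_cond.shape
    # dropped permutations contribute (numerically) nothing, so the
    # average is still taken over all T * P terms
    return logsumexp(log_cond[:, keep]) - np.log(T * P)
```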

back in Gainesville (FL)

Posted in pictures, Running, Statistics, Travel, University life, Wines on November 12, 2014 by xi'an

Today, I am flying to Gainesville, Florida, for the rest of the week, to give a couple of lectures. More precisely, I have been named the 2014 Challis lecturer by the Department of Statistics there, following an impressive series of top statisticians (most of them close friends, is there a correlation there?!). I am quite excited to meet again with old friends and to be back at George’s University, if only for a little less than three days. (There is a certain trend in those Fall trips, as I have been travelling to the USA or Canada for a few days and two talks in each of the past three Falls: to Ames and Chicago in 2012, to Pittsburgh (CMU) and Toronto in 2013…)

label switching in Bayesian mixture models

Posted in Books, Statistics, University life on October 31, 2014 by xi'an

A referee of our paper with Jeong Eun Lee on approximating the evidence for mixture models pointed out the recent paper by Carlos Rodríguez and Stephen Walker on label switching in Bayesian mixture models: deterministic relabelling strategies. Which appeared this year in JCGS and went beyond, below or above my radar.

Label switching is an issue with mixture estimation (and other latent variable models) because mixture models are ill-posed models where part of the parameter is not identifiable. Indeed, the density of a mixture being a sum of terms

$\sum_{j=1}^k \omega_j f(y|\theta_j)$

the parameter (vector) of the ω’s and of the θ’s is at best identifiable up to an arbitrary permutation of the components of the above sum. In other words, “component #1 of the mixture” is not a meaningful concept. And hence cannot be estimated.
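A quick numerical check of this invariance (my own illustration, with a two-component Gaussian mixture and arbitrary parameter values): permuting weights and component parameters together leaves the likelihood strictly unchanged.

```python
import numpy as np
from scipy.stats import norm

def mixture_loglik(y, w, mu, sd):
    """Log-likelihood of a k-component Gaussian mixture."""
    return np.log((w * norm.pdf(y[:, None], mu, sd)).sum(axis=1)).sum()

rng = np.random.default_rng(0)
y = rng.normal(size=100)
w, mu, sd = np.array([.3, .7]), np.array([-1., 2.]), np.array([1., 1.])
p = [1, 0]                                  # swap the two components
assert np.isclose(mixture_loglik(y, w, mu, sd),
                  mixture_loglik(y, w[p], mu[p], sd[p]))
```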

This problem has been known for quite a while, long before EM and MCMC algorithms for mixtures, but it is only since mixtures have become truly estimable by Bayesian approaches that the debate on this issue has grown. In the very early days, Jean Diebolt and I proposed ordering the components in a unique way to give them a meaning. For instance, “component #1” would then be the component with the smallest mean or the smallest weight, and so on… Later, in one of my favourite X papers, with Gilles Celeux and Merrilee Hurn, we exposed the convergence issues related to the non-identifiability of mixture models, namely that the posterior distributions are almost always multimodal, with a multiple of k! symmetric modes in the case of exchangeable priors, and that Markov chains therefore have trouble visiting all those modes in a symmetric manner, despite the symmetry being guaranteed by the shape of the posterior. And we concluded with the slightly provocative statement that hardly any Markov chain inferring about mixture models had ever converged! In parallel, time-wise, Matthew Stephens had completed a thesis at Oxford on the same topic and proposed solutions for relabelling MCMC simulations in order to identify a single mode and hence produce meaningful estimators. Giving another meaning to the notion of “component #1”.

And then the topic began to attract more and more researchers, being both simple to describe and frustrating in its lack of definitive answer, from both simulation and inference perspectives. Rodríguez’s and Walker’s paper provides a survey of the label switching strategies in the Bayesian processing of mixtures, but its innovative part lies in deriving a relabelling strategy. Which consists of finding the optimal permutation (at each iteration of the Markov chain) by minimising a loss function inspired by k-means clustering. Which is connected with both Stephens’ and our [JASA, 2000] loss functions. The performances of this new version are shown to be roughly comparable with those of other relabelling strategies in the case of Gaussian mixtures. (Making me wonder whether the choice of the loss function is not favourable to Gaussian mixtures.) And somewhat faster than Stephens’ Kullback-Leibler loss approach.
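To fix ideas, here is a toy version of such an iteration-by-iteration relabelling under a squared-distance loss (my own simplification, restricted to component means; the paper’s actual criterion is more elaborate):

```python
import numpy as np
from itertools import permutations

def relabel(mu_draws):
    """Permute each iteration's component means so as to minimise a
    squared distance to a running reference (k-means-flavoured loss).

    mu_draws : (T, k) array of sampled component means
    """
    out = np.empty_like(mu_draws)
    out[0] = mu_draws[0]
    for t in range(1, len(mu_draws)):
        ref = out[:t].mean(axis=0)          # current reference centres
        best = min(permutations(range(mu_draws.shape[1])),
                   key=lambda p: ((mu_draws[t, list(p)] - ref)**2).sum())
        out[t] = mu_draws[t, list(best)]
    return out
```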

“Hence, in an MCMC algorithm, the indices of the parameters can permute multiple times between iterations. As a result, we cannot identify the hidden groups that make [all] ergodic averages to estimate characteristics of the components useless.”

One section of the paper puzzles me, although it does not impact the methodology or the conclusions. In Section 2.1 (p.27), the authors consider the quantity

$p(z_i=j|{\mathbf y})$

which is the marginal probability of allocating observation i to cluster or component j. Under an exchangeable prior, this quantity is uniformly equal to 1/k for all observations i and all components j, by virtue of the invariance under permutation of the indices… So at best this can serve as a control variate. Later, in Section 2.2 (p.28), the above quote does signal a problem with those averages, but it seems to attribute it to MCMC behaviour rather than to the invariance of the posterior (or to the non-identifiability of the components per se). Lastly, the paper mentions that “given the allocations, the likelihood is invariant under permutations of the parameters and the allocations” (p.28), which is not correct, since eqn. (8)

$f(y_i|\theta_{\sigma(z_i)}) = f(y_i|\theta_{\tau(z_i)})$

does not hold when the two permutations σ and τ give different images of zᵢ.
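A small numerical counterexample (my own toy values):

```python
import numpy as np
from scipy.stats import norm

# With the allocation z_i held fixed, permuting the parameters changes
# the likelihood whenever the two permutations send z_i to different
# components.
theta = np.array([-2.0, 3.0])  # two component means (made-up values)
y_i, z_i = 0.5, 0              # one observation, allocated to component 0
sigma, tau = [0, 1], [1, 0]    # identity permutation vs swap
lhs = norm.pdf(y_i, theta[sigma[z_i]], 1.0)  # f(y_i | theta_sigma(z_i))
rhs = norm.pdf(y_i, theta[tau[z_i]], 1.0)    # f(y_i | theta_tau(z_i))
print(lhs, rhs)                # two clearly different densities
```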