I am off to New York City for two days, giving a seminar at Columbia tomorrow and visiting Andrew Gelman there. My talk will be about testing as mixture estimation, with slides similar to the Nice ones below if slightly upgraded and augmented during the flight to JFK. Looking at the past seminar speakers, I noticed we were three speakers from Paris in the last fortnight, with Ismael Castillo and Paul Doukhan (in the Applied Probability seminar) preceding me. Is there a significant bias there?!
Archive for finite mixtures
The slides are directly extracted from the paper but it still took me quite a while to translate the paper into those, during the early hours of our Czech break this week.
One added perk of travelling to Nice is the flight there, as it parallels the entire French Alps, a terrific view in nice weather!
Following the previous post, I went and had a (long) look at Puolamäki and Kaski’s paper. I must acknowledge that, despite having several runs through the paper, I still have trouble with the approach… From what I understand, the authors use a Bernoulli mixture pseudo-model to reallocate the observations to components. That is, given an MCMC output with simulated allocations variables (a.k.a., hidden or latent variables), they create a (TxK)xn matrix of component binary indicators e.g., for a three component mixture,
0 1 0 0 1 0…
1 0 0 0 0 0…
0 0 1 1 0 1…
0 1 0 0 1 1…
and estimate a probability to be in component j for each of the n observations, according to the (pseudo-)likelihood
It took me a few days, between morning runs and those wee hours when I cannot get back to sleep (!), to make some sense of this Bernoulli modelling. The allocation vectors are used together to estimate the probabilities of being “in” component j together. However the data—which is the outcome of an MCMC simulation and de facto does not originate from that Bernoulli mixture—does not seem appropriate, both because it is produced by an MCMC simulation and is made of blocks of highly correlated rows [which sum up to one]. The Bernoulli likelihood above also defines a new model, with many more parameters than in the original mixture model. And I fail to see why perfect, partial or inexistent label switching [in the MCMC sequence] is not going to impact the estimation of the Bernoulli mixture. And why an argument based on a fixed parameter value (Theorem 3) extends to an MCMC outcome where parameters themselves are subjected to some degree of label switching. Bemused, I remain…
Another short paper about relabelling in mixtures was arXived last week by Pauli and Torelli. They refer rather extensively to a previous paper by Puolamäki and Kaski (2009) of which I was not aware, paper attempting to get an unswitching sampler that does not exhibit any label switching, a concept I find most curious as I see no rigorous way to state that a sampler is not switching! This would imply spotting low posterior probability regions that the chain would cross. But I should check the paper nonetheless.
Because the G component mixture posterior is invariant under the G! possible permutations, I am somewhat undeciced as to what the authors of the current paper mean by estimating the difference between two means, like μ1-μ2. Since they object to using the output of a perfectly mixing MCMC algorithm and seem to prefer the one associated with a non-switching chain. Or by estimating the probability that a given observation is from a given component, since this is exactly 1/G by the permutation invariance property. In order to identify a partition of the data, they introduce a loss function on the joint allocations of pairs of observations, loss function that sounds quite similar to the one we used in our 2000 JASA paper on the label switching deficiencies of MCMC algorithms. (And makes me wonder why this work of us is not deemed relevant for the approach advocated in the paper!) Still, having read this paper, which I find rather poorly written, I have no clear understanding of how the authors give a precise meaning to a specific component of the mixture distribution. Or how the relabelling has to be conducted to avoid switching. That is, how the authors define their parameter space. Or their loss function. Unless one falls back onto the ordering of the means or the weights which has the drawback of not connecting with the levels sets of a particular mode of the posterior distribution, meaning that imposing the constraints result in a region that contains bits of several modes.
At some point the authors assume the data can be partitioned into K≤G groups such that there is a representative observation within each group never sharing a component (across MCMC iterations) with any of the other representatives. While this notion is label invariant, I wonder whether (a) this is possible on any MCMC outcome; (b) it indicates a positive or negative feature of the MCMC sampler.; and (c) what prevents the representatives to switch in harmony from one component to the next while preserving their perfect mutual exclusion… This however constitutes the advance in the paper, namely that component dependent quantities as estimated as those associated with a particular representative. Note that the paper contains no illustration, hence that the method may prove hard to impossible to implement!
I had a great time during this short visit in the Department of Statistics, University of Florida, Gainesville. First, it was a major honour to be the 2014 recipient of the George H. Challis Award and I considerably enjoyed delivering my lectures on mixtures and on ABC with random forests, And chatting with members of the audience about the contents afterwards. Here is the physical award I brought back to my office:
More as a piece of trivia, here is the amount of information about the George H. Challis Award I found on the UF website:
This fund was established in 2000 by Jack M. and Linda Challis Gill and the Gill Foundation of Texas, in memory of Linda’s father, to support faculty and student conference travel awards and the George Challis Biostatistics Lecture Series. George H. Challis was born on December 8, 1911 and was raised in Italy and Indiana. He was the first cousin of Indiana composer Cole Porter. George earned a degree in 1933 from the School of Business at Indiana University in Bloomington. George passed away on May 6, 2000. His wife, Madeline, passed away on December 14, 2009.
Cole Porter, indeed!
On top of this lecturing activity, I had a full academic agenda, discussing with most faculty members and PhD students of the Department, on our respective research themes over the two days I was there and it felt like there was not enough time! And then, during the few remaining hours where I did not try to stay on French time (!), I had a great time with my friends Jim and Maria in Gainesville, tasting a fantastic local IPA beer from Cigar City Brewery and several great (non-local) red wines… Adding to that a pile of new books, a smooth trip both ways, and a chance encounter with Alicia in Atlanta airport, it was a brilliant extended weekend!
After a rather intense period of new simulations and versions, Juong Een (Kate) Lee and I have now resubmitted our paper on (some) importance sampling schemes for evidence approximation in mixture models to Bayesian Analysis. There is no fundamental change in the new version but rather a more detailed description of what those importance schemes mean in practice. The original idea in the paper is to improve upon the Rao-Blackwellisation solution proposed by Berkoff et al. (2002) and later by Marin et al. (2005) to avoid the impact of label switching on Chib’s formula. The Rao-Blackwellisation consists in averaging over all permutations of the labels while the improvement relies on the elimination of useless permutations, namely those that produce a negligible conditional density in Chib’s (candidate’s) formula. While the improvement implies truncated the overall sum and hence induces a potential bias (which was the concern of one referee), the determination of the irrelevant permutations after relabelling next to a single mode does not appear to cause any bias, while reducing the computational overload. Referees also made us aware of many recent proposals that conduct to different evidence approximations, albeit not directly related with our purpose. (One was Rodrigues and Walker, 2014, discussed and commented in a recent post.)
Today, I am flying to Gainesville, Florida, for the rest of the week, to give a couple of lectures. More precisely, I have actually been nominated the 2014 Challis lecturer by the Department of Statistics there, following an impressive series of top statisticians (most of them close friends, is there a correlation there?!). I am quite excited to meet again with old friends and to be back at George’s University, if only for a little less than three days. (There is a certain trend in those Fall trips as I have been going for a few days and two talks to the USA or Canada for the past three Falls: to Ames and Chicago in 2012, to Pittsburgh (CMU) and Toronto in 2013…)