## label switching in Bayesian mixture models

Posted in Books, Statistics, University life with tags , , , , , , , , , , , on October 31, 2014 by xi'an

A referee of our paper on approximating evidence for mixture model with Jeong Eun Lee pointed out the recent paper by Carlos Rodríguez and Stephen Walker on label switching in Bayesian mixture models: deterministic relabelling strategies. Which appeared this year in JCGS and went beyond, below or above my radar.

Label switching is an issue with mixture estimation (and other latent variable models) because mixture models are ill-posed models where part of the parameter is not identifiable. Indeed, the density of a mixture being a sum of terms

$\sum_{j=1}^k \omega_j f(y|\theta_i)$

the parameter (vector) of the ω’s and of the θ’s is at best identifiable up to an arbitrary permutation of the components of the above sum. In other words, “component #1 of the mixture” is not a meaningful concept. And hence cannot be estimated.

This problem has been known for quite a while, much prior to EM and MCMC algorithms for mixtures, but it is only since mixtures have become truly estimable by Bayesian approaches that the debate has grown on this issue. In the very early days, Jean Diebolt and I proposed ordering the components in a unique way to give them a meaning. For instant, “component #1″ would then be the component with the smallest mean or the smallest weight and so on… Later, in one of my favourite X papers, with Gilles Celeux and Merrilee Hurn, we exposed the convergence issues related with the non-identifiability of mixture models, namely that the posterior distributions were almost always multimodal, with a multiple of k! symmetric modes in the case of exchangeable priors, and therefore that Markov chains would have trouble to visit all those modes in a symmetric manner, despite the symmetry being guaranteed from the shape of the posterior. And we conclude with the slightly provocative statement that hardly any Markov chain inferring about mixture models had ever converged! In parallel, time-wise, Matthew Stephens had completed a thesis at Oxford on the same topic and proposed solutions for relabelling MCMC simulations in order to identify a single mode and hence produce meaningful estimators. Giving another meaning to the notion of “component #1″.

And then the topic began to attract more and more researchers, being both simple to describe and frustrating in its lack of definitive answer, both from simulation and inference perspectives. Rodriguez’s and Walker’s paper provides a survey on the label switching strategies in the Bayesian processing of mixtures, but its innovative part is in deriving a relabelling strategy. Which consists of finding the optimal permutation (at each iteration of the Markov chain) by minimising a loss function inspired from k-means clustering. Which is connected with both Stephens’ and our [JASA, 2000] loss functions. The performances of this new version are shown to be roughly comparable with those of other relabelling strategies, in the case of Gaussian mixtures. (Making me wonder if the choice of the loss function is not favourable to Gaussian mixtures.) And somehow faster than Stephens’ Kullback-Leibler loss approach.

“Hence, in an MCMC algorithm, the indices of the parameters can permute multiple times between iterations. As a result, we cannot identify the hidden groups that make [all] ergodic averages to estimate characteristics of the components useless.”

One section of the paper puzzles me, albeit it does not impact the methodology and the conclusions. In Section 2.1 (p.27), the authors consider the quantity

$p(z_i=j|{\mathbf y})$

which is the marginal probability of allocating observation i to cluster or component j. Under an exchangeable prior, this quantity is uniformly equal to 1/k for all observations i and all components j, by virtue of the invariance under permutation of the indices… So at best this can serve as a control variate. Later in Section 2.2 (p.28), the above sentence does signal a problem with those averages but it seem to attribute it to MCMC behaviour rather than to the invariance of the posterior (or to the non-identifiability of the components per se). At last, the paper mentions that “given the allocations, the likelihood is invariant under permutations of the parameters and the allocations” (p.28), which is not correct, since eqn. (8)

$f(y_i|\theta_{\sigma(z_i)}) =f(y_i|\theta_{\tau(z_i)})$

does not hold when the two permutations σ and τ give different images of zi

## posterior likelihood ratio is back

Posted in Statistics, University life with tags , , , , , , , , , on June 10, 2014 by xi'an

“The PLR turns out to be a natural Bayesian measure of evidence of the studied hypotheses.”

Isabelle Smith and André Ferrari just arXived a paper on the posterior distribution of the likelihood ratio. This is in line with Murray Aitkin’s notion of considering the likelihood ratio

$f(x|\theta_0) / f(x|\theta)$

as a prior quantity, when contemplating the null hypothesis that θ is equal to θ0. (Also advanced by Alan Birnbaum and Arthur Dempster.) A concept we criticised (rather strongly) in our Statistics and Risk Modelling paper with Andrew Gelman and Judith Rousseau.  The arguments found in the current paper in defence of the posterior likelihood ratio are quite similar to Aitkin’s:

• defined for (some) improper priors;
• invariant under observation or parameter transforms;
• more informative than tthe posterior mean of the posterior likelihood ratio, not-so-incidentally equal to the Bayes factor;
• avoiding using the posterior mean for an asymmetric posterior distribution;
• achieving some degree of reconciliation between Bayesian and frequentist perspectives, e.g. by being equal to some p-values;
• easily computed by MCMC means (if need be).

One generalisation found in the paper handles the case of composite versus composite hypotheses, of the form

$\int\mathbb{I}\left( p(x|\theta_1)

which brings back an earlier criticism I raised (in Edinburgh, at ICMS, where as one-of-those-coincidences, I read this paper!), namely that using the product of the marginals rather than the joint posterior is no more a standard Bayesian practice than using the data in a prior quantity. And leads to multiple uses of the data. Hence, having already delivered my perspective on this approach in the past, I do not feel the urge to “raise the flag” once again about a paper that is otherwise well-documented and mathematically rich.

## Bayesian variable selection redux

Posted in Statistics, University life with tags , , , , , on July 11, 2011 by xi'an

After a rather long interlude, and just in time for the six month deadline!, we (Gilles Celeux, Mohammed El Anbari, Jean-Michel Marin and myself) have resubmitted (and rearXived) our comparative study of Bayesian and non-Bayesian variable selections procedures to Bayesian Analysis. Why it took us so long is a combination of good and bad reasons: besides being far apart, between Morocco, Paris and Montpellier, and running too many projects at once with Jean-Michel (including the Bayesian Core revision that did not move much since last summer!), we came to realise that my earlier strong stance that invariance on the intercept did not matter was not right and that the (kind) reviewers were correct about the asymptotic impact of the scale of the intercept on the variable selection, so we had first to reconvene and think about it, before running another large round of simulations. We hope the picture is now clearer.

## Thesis defense in València

Posted in Statistics, Travel, University life, Wines with tags , , , , , , on February 25, 2011 by xi'an

On Monday, I took part in the jury of the PhD thesis of Anabel Forte Deltel, in the department of statistics of the Universitat de València. The topic of the thesis was variable selection in Gaussian linear models using an objective Bayes approach. Completely on my own research agenda! I had already discussed with Anabel in Zürich, where she gave a poster and gave me a copy of her thesis, so could concentrate on the fundamentals of her approach during the defense. Her approach extends Liang et al. (2008, JASA) hyper-g prior in a complete analysis of the conditions set by Jeffreys in his book for constructing such priors. She is therefore able to motivate a precise value for most hyperparameters (although some choices were mainly based on computational reasons opposing 2F1 with Appell’s F1 hypergeometric functions). She also defends the use of an improper prior by an invariance argument that leads to the standard Jeffreys’ prior on location-scale. (This is where I prefer the approach in Bayesian Core that does not discriminate between a subset of the covariates including the intercept and the other covariates. Even though it is not invariant by location-scale transforms.) After the defence, Jim Berger pointed out to me that the modelling allowed for the subset to be empty, which would then cancel my above objection! In conclusion, this thesis could well set a reference prior (if not in José Bernardo’s sense of the term!) for Bayesian linear model analysis in the coming years.

## Back from Philly

Posted in R, Statistics, Travel, University life with tags , , , , , , , , , on December 21, 2010 by xi'an

The conference in honour of Larry Brown was quite exciting, with lots of old friends gathered in Philadelphia and lots of great talks either recollecting major works of Larry and coauthors or presenting fairly interesting new works. Unsurprisingly, a large chunk of the talks was about admissibility and minimaxity, with John Hartigan starting the day re-reading Larry masterpiece 1971 paper linking admissibility and recurrence of associated processes, a paper I always had trouble studying because of both its depth and its breadth! Bill Strawderman presented a new if classical minimaxity result on matrix estimation and Anirban DasGupta some large dimension consistency results where the choice of the distance (total variation versus Kullback deviance) was irrelevant. Ed George and Susie Bayarri both presented their recent work on g-priors and their generalisation, which directly relate to our recent paper on that topic. On the afternoon, Holger Dette showed some impressive mathematics based on Elfving’s representation and used in building optimal designs. I particularly appreciated the results of a joint work with Larry presented by Robert Wolpert where they classified all Markov stationary infinitely divisible time-reversible integer-valued processes. It produced a surprisingly small list of four cases, two being trivial.. The final talk of the day was about homology, which sounded a priori rebutting, but Robert Adler made it extremely entertaining, so much that I even failed to resent the powerpoint tricks! The next morning, Mark Low gave a very emotional but also quite illuminating about the first results he got during his PhD thesis at Cornell (completing the thesis when I was using Larry’s office!). Brenda McGibbon went back to the three truncated Poisson papers she wrote with Ian Johnstone (via gruesome 13 hour bus rides from Montréal to Ithaca!) and produced an illuminating explanation of the maths at work for moving from the Gaussian to the Poisson case in a most pedagogical and enjoyable fashion. Larry Wasserman explained the concepts at work behind the lasso for graphs, entertaining us with witty acronyms on the side!, and leaving out about 3/4 of his slides! (The research group involved in this project produced an R package called huge.) Joe Eaton ended up the morning with a very interesting result showing that using the right Haar measure as a prior leads to a matching prior, then showing why the consequences of the result are limited by invariance itself. Unfortunately, it was then time for me to leave and I will miss (in both meanings of the term) the other half of the talks. Especially missing Steve Fienberg’s talk for the third time in three weeks! Again, what I appreciated most during those two days (besides the fact that we were all reunited on the very day of Larry’s birthday!) was the pain most speakers went to to expose older results in a most synthetic and intuitive manner… I also got new ideas about generalising our parallel computing paper for random walk Metropolis-Hastings algorithms and for optimising across permutation transforms.

## Versions of Benford’s Law

Posted in Books, Statistics with tags , , , , on May 20, 2010 by xi'an

A new arXived note by Berger and Hill discusses how [my favourite probability introduction] Feller’s Introduction to Probability Theory (volume 2) gets Benford’s Law “wrong”. While my interest in Benford’s Law is rather superficial, I find the paper of interest as it shows a confusion between different folk theorems! My interpretation of Benford’s Law is that the first significant digit of a random variable (in a basis 10 representation) is distributed as

$f(i) \propto \log_{10}(1+\frac{1}{i})$

and not that $\log(X) \,(\text{mod}\,1)$ is uniform, which is the presentation given in the arXived note…. The former is also the interpretation of William Feller (page 63, Introduction to Probability Theory), contrary to what the arXived note seems to imply on page 2, but Feller indeed mentioned as an informal/heuristic argument in favour of Benford’s Law that when the spread of the rv X is large,  $\log(X)$ is approximately uniformly distributed. (I would no call this a “fundamental flaw“.) The arXived note is then right in pointing out the lack of foundation for Feller’s heuristic, if muddling the issue by defining several non-equivalent versions of Benford’s Law. It is also funny that this arXived note picks at the scale-invariant characterisation of Benford’s Law when Terry Tao’s entry represents it as a special case of Haar measure!

## More on Benford’s Law

Posted in Statistics with tags , , , , on July 10, 2009 by xi'an

In connection with an earlier post on Benford’s Law, i.e. the probability that the first digit of a random variable X is$1\le k\le 9$is approximately$\log\{(k+1)/k\}$—you can easily check that the sum of those probabilities is 1—, I want to signal a recent entry on Terry Tiao’s impressive blog. Terry points out that Benford’s Law is the Haar measure in that setting, but he also highlights a very peculiar absorbing property which is that, if$X$follows Benford’s Law, then$XY$also follows Benford’s Law for any random variable$Y$that is independent from$X$… Now, the funny thing is that, if you take a normal sample$x_1,\ldots,x_n$and check whether or not Benford’s Law applies to this sample, it does not. But if you take a second normal sample$y_1,\ldots,y_n$and consider the product sample$x_1\times y_1,\ldots,x_n\times y_n$, then Benford’s Law applies almost exactly. If you repeat the process one more time, it is difficult to spot the difference. Here is the [rudimentary—there must be a more elegant way to get the first significant digit!] R code to check this:

x=abs(rnorm(10^6))
b=trunc(log10(x)) -(log(x)<0)
plot(hist(trunc(x/10^b),breaks=(0:9)+.5)$den,log10((2:10)/(1:9)), xlab="Frequency",ylab="Benford's Law",pch=19,col="steelblue") abline(a=0,b=1,col="tomato",lwd=2) x=abs(rnorm(10^6)*x) b=trunc(log10(x)) -(log(x)<0) points(hist(trunc(x/10^b),breaks=(0:9)+.5,plot=F)$den,log10((2:10)/(1:9)),
pch=19,col="steelblue2")
x=abs(rnorm(10^6)*x)
b=trunc(log10(x)) -(log(x)<0)
points(hist(trunc(x/10^b),breaks=(0:9)+.5,plot=F)\$den,log10((2:10)/(1:9)),
pch=19,col="steelblue3")

Even better, if you change rnorm to another generator like rcauchy or rexp at any of the three stages, the same pattern occurs.