## Overfitting Bayesian mixture models with an unknown number of components

Posted in Statistics with tags , , , , , , , , on March 4, 2015 by xi'an

During my Czech vacations, Zoé van Havre, Nicole White, Judith Rousseau, and Kerrie Mengersen1 posted on arXiv a paper on overfitting mixture models to estimate the number of components. This is directly related with Judith and Kerrie’s 2011 paper and with Zoé’s PhD topic. The paper also returns to the vexing (?) issue of label switching! I very much like the paper and not only because the author are good friends!, but also because it brings a solution to an approach I briefly attempted with Marie-Anne Gruet in the early 1990’s, just before finding about the reversible jump MCMC algorithm of Peter Green at a workshop in Luminy and considering we were not going to “beat the competition”! Hence not publishing the output of our over-fitted Gibbs samplers that were nicely emptying extra components… It also brings a rebuke about a later assertion of mine’s at an ICMS workshop on mixtures, where I defended the notion that over-fitted mixtures could not be detected, a notion that was severely disputed by David McKay…

What is so fantastic in Rousseau and Mengersen (2011) is that a simple constraint on the Dirichlet prior on the mixture weights suffices to guarantee that asymptotically superfluous components will empty out and signal they are truly superfluous! The authors here cumulate the over-fitted mixture with a tempering strategy, which seems somewhat redundant, the number of extra components being a sort of temperature, but eliminates the need for fragile RJMCMC steps. Label switching is obviously even more of an issue with a larger number of components and identifying empty components seems to require a lack of label switching for some components to remain empty!

When reading through the paper, I came upon the condition that only the priors of the weights are allowed to vary between temperatures. Distinguishing the weights from the other parameters does make perfect sense, as some representations of a mixture work without those weights. Still I feel a bit uncertain about the fixed prior constraint, even though I can see the rationale in not allowing for complete freedom in picking those priors. More fundamentally, I am less and less happy with independent identical or exchangeable priors on the components.

Our own recent experience with almost zero weights mixtures (and with Judith, Kaniav, and Kerrie) suggests not using solely a Gibbs sampler there as it shows poor mixing. And even poorer label switching. The current paper does not seem to meet the same difficulties, maybe thanks to (prior) tempering.

The paper proposes a strategy called Zswitch to resolve label switching, which amounts to identify a MAP for each possible number of components and a subsequent relabelling. Even though I do not entirely understand the way the permutation is constructed. I wonder in particular at the cost of the relabelling.

## the density that did not exist…

Posted in Kids, R, Statistics, University life with tags , , , , on January 27, 2015 by xi'an

On Cross Validated, I had a rather extended discussion with a user about a probability density

$f(x_1,x_2)=\left(\dfrac{x_1}{x_2}\right)\left(\dfrac{\alpha}{x_2}\right)^{x_1-1}\exp\left\{-\left(\dfrac{\alpha}{x_2}\right)^{x_1} \right\}\mathbb{I}_{\mathbb{R}^*_+}(x_1,x_2)$

as I thought it could be decomposed in two manageable conditionals and simulated by Gibbs sampling. The first component led to a Gumbel like density

$g(y|x_2)\propto ye^{-y-e^{-y}} \quad\text{with}\quad y=\left(\alpha/x_2 \right)^{x_1}\stackrel{\text{def}}{=}\beta^{x_1}$

wirh y being restricted to either (0,1) or (1,∞) depending on β. The density is bounded and can be easily simulated by an accept-reject step. The second component leads to

$g(t|x_1)\propto \exp\{-\gamma ~ t \}~t^{-{1}/{x_1}} \quad\text{with}\quad t=\dfrac{1}{{x_2}^{x_1}}$

which offers the slight difficulty that it is not integrable when the first component is less than 1! So the above density does not exist (as a probability density).

What I found interesting in this question was that, for once, the Gibbs sampler was the solution rather than the problem, i.e., that it pointed out the lack of integrability of the joint. (What I found less interesting was that the user did not acknowledge a lengthy discussion that we had previously about the Gibbs implementation and that he erased, that he lost interest in the question by not following up on my answer, a seemingly common feature of his‘, and that he did not provide neither source nor motivation for this zombie density.)

## testing MCMC code

Posted in Books, Statistics, University life with tags , , , , , , , , on December 26, 2014 by xi'an

A title that caught my attention on arXiv: testing MCMC code by Roger Grosse and David Duvenaud. The paper is in fact a tutorial adapted from blog posts written by Grosse and Duvenaud, on the blog of the Harvard Intelligent Probabilistic Systems group. The purpose is to write code in such a modular way that (some) conditional probability computations can be tested. Using my favourite Gibbs sampler for the mixture model, they advocate computing the ratios

$\dfrac{p(x'|z)}{p(x|z)}\quad\text{and}\quad \dfrac{p(x',z)}{p(x,z)}$

to make sure they are exactly identical. (Where x denotes the part of the parameter being simulated and z anything else.) The paper also mentions an older paper by John Geweke—of which I was curiously unaware!—leading to another test: consider iterating the following two steps:

1. update the parameter θ given the current data x by an MCMC step that preserves the posterior p(θ|x);
2. update the data x given the current parameter value θ from the sampling distribution p(x|θ).

Since both steps preserve the joint distribution p(x,θ), values simulated from those steps should exhibit the same properties as a forward production of (x,θ), i.e., simulating from p(θ) and then from p(x|θ). So with enough simulations, comparison tests can be run. (Andrew has a very similar proposal at about the same time.) There are potential limitations to the first approach, obviously, from being unable to write the full conditionals [an ABC version anyone?!] to making a programming mistake that keep both ratios equal [as it would occur if a Metropolis-within-Gibbs was run by using the ratio of the joints in the acceptance probability]. Further, as noted by the authors it only addresses the mathematical correctness of the code, rather than the issue of whether the MCMC algorithm mixes well enough to provide a pseudo-iid-sample from p(θ|x). (Lack of mixing that could be spotted by Geweke’s test.) But it is so immediately available that it can indeed be added to every and all simulations involving a conditional step. While Geweke’s test requires re-running the MCMC algorithm altogether. Although clear divergence between an iid sampling from p(x,θ) and the Gibbs version above could appear fast enough for a stopping rule to be used. In fine, a worthwhile addition to the collection of checkings and tests built across the years for MCMC algorithms! (Of which the trick proposed by my friend Tobias Rydén to run first the MCMC code with n=0 observations in order to recover the prior p(θ) remains my favourite!)

## an ABC experiment

Posted in Books, pictures, R, Statistics, University life with tags , , , , , , , , on November 24, 2014 by xi'an

In a cross-validated forum exchange, I used the code below to illustrate the working of an ABC algorithm:

#normal data with 100 observations
n=100
x=rnorm(n)
#observed summaries

#normal x gamma prior
priori=function(N){
return(cbind(rnorm(N,sd=10),
1/sqrt(rgamma(N,shape=2,scale=5))))
}

ABC=function(N,alpha=.05){

prior=priori(N) #reference table

#pseudo-data
summ=matrix(0,N,2)
for (i in 1:N){
xi=rnorm(n)*prior[i,2]+prior[i,1]
}

#normalisation factor for the distance
#distance
#selection
posterior=prior[dist<quantile(dist,alpha),]}


Hence I used the median and the mad as my summary statistics. And the outcome is rather surprising, for two reasons: the first one is that the posterior on the mean μ is much wider than when using the mean and the variance as summary statistics. This is not completely surprising in that the latter are sufficient, while the former are not. Still, the (-10,10) range on the mean is way larger… The second reason for surprise is that the true posterior distribution cannot be derived since the joint density of med and mad is unavailable.

After thinking about this for a while, I went back to my workbench to check the difference with using mean and variance. To my greater surprise, I found hardly any difference! Using the almost exact ABC with 10⁶ simulations and a 5% subsampling rate returns exactly the same outcome. (The first row above is for the sufficient statistics (mean,standard deviation) while the second row is for the (median,mad) pair.) Playing with the distance does not help. The genuine posterior output is quite different, as exposed on the last row of the above, using a basic Gibbs sampler since the posterior is not truly conjugate.

## my life as a mixture [BAYSM 2014, Wien]

Posted in Books, Kids, Mountains, pictures, Statistics, Travel, University life with tags , , , , , , , , , , on September 12, 2014 by xi'an

Next week I am giving a talk at BAYSM in Vienna. BAYSM is the Bayesian Young Statisticians meeting so one may wonder why, but with Chris Holmes and Mike West, we got invited as more… erm… senior speakers! So I decided to give a definitely senior talk on a thread pursued throughout my career so far, namely mixtures. Plus it also relates to works of the other senior speakers. Here is the abstract for the talk:

Mixtures of distributions are fascinating objects for statisticians in that they both constitute a straightforward extension of standard distributions and offer a complex benchmark for evaluating statistical procedures, with a likelihood both computable in a linear time and enjoying an exponential number of local models (and sometimes infinite modes). This fruitful playground appeals in particular to Bayesians as it constitutes an easily understood challenge to the use of improper priors and of objective Bayes solutions. This talk will review some ancient and some more recent works of mine on mixtures of distributions, from the 1990 Gibbs sampler to the 2000 label switching and to later studies of Bayes factor approximations, nested sampling performances, improper priors, improved importance samplers, ABC, and a inverse perspective on the Bayesian approach to testing of hypotheses.

I am very grateful to the scientific committee for this invitation, as it will give me the opportunity to meet the new generation, learn from them and in addition discover Vienna where I have never been, despite several visits to Austria. Including its top, the Großglockner. I will also give a seminar in Linz the day before. In the Institut für Angewandte Statistik.

## recycling accept-reject rejections (#2)

Posted in R, Statistics, University life with tags , , , , , , on July 2, 2014 by xi'an

Following yesterday’s post on Rao’s, Liu’s, and Dunson’s paper on a new approach to intractable normalising constants, and taking advantage of being in Warwick, I tested the method on a toy model, namely the posterior associated with n Student’s t observations with unknown location parameter μ and a flat prior,

$x_1,\ldots,x_n \sim p(x|\mu) \propto \left[ 1+(x-\mu)^2/\nu \right]^{-(\nu+1)/2}$

which is “naturally” bounded by a Cauchy density with scale √ν. The constant M is then easily derived and running the new algorithm follows from a normal random walk proposal targeting the augmented likelihood (R code below).

As shown by the above graph, the completion-by-rejection scheme produces a similar outcome (tomato) as the one based on the sole observations (steelblue). With a similar acceptance rate. However, the computing time is much much degraded:

> system.time(g8())
user  system elapsed
53.751   0.056  54.103
> system.time(g9())
user  system elapsed
1.156   0.000   1.161


when compared with the no-completion version. Here is the entire R code that produced both MCMC samples: Continue reading

## noninformative priors for mixtures

Posted in Books, Statistics, University life with tags , , , , , , , , on May 26, 2014 by xi'an

“A novel formulation of the mixture model is introduced, which includes the prior constraint that each Gaussian component is always assigned a minimal number of data points. This enables noninformative improper priors such as the Jeffreys prior to be placed on the component parameters. We demonstrate difficulties involved in specifying a prior for the standard Gaussian mixture model, and show how the new model can be used to overcome these. MCMC methods are given for efficient sampling from the posterior of this model.” C. Stoneking

Following in the theme of the Jeffreys’ post of two weeks ago, I spotted today a newly arXived paper about using improper priors for mixtures…and surviving it! It is entitled “Bayesian inference of Gaussian mixture models with noninformative priors” and written by Colin Stoneking at ETH Zürich. As mentioned in the previous post, one specificity of our 1990-1994 paper on mixture with Jean Diebolt was to allow for improper priors by imposing at least two observations per component. The above abstract thus puzzled me until I found on page 3 that the paper was indeed related to ours (and Larry’s 2000 validation)! Actually, I should not complain about citations of my earlier works on mixtures as they cover seven different papers, but the bibliography is somewhat missing the paper we wrote with George Casella and Marty Wells in Statistical Methodology in 2004 (this was actually the very first paper of this new journal!), where we show that conjugate priors allow for the integration of the weights, resulting in a close-form expression for the distribution of the partition vector. (This was also extended in the chapter “Exact Bayesian Analysis of Mixtures” I wrote with Kerrie Mengersen in our book Mixtures: Estimation and Applications.)

“There is no well-founded, general method to choose the parameters of a given prior to make it weakly informative for Gaussian mixtures.” C. Stoneking

The first part of the paper shows why looking for weakly informative priors is doomed to fail in this mixture setting: there is no stabilisation as hyperparameters get towards the border (between proper-ness and improper-ness), and on the opposite the frequency of appearances of empty components grows steadily to 100%…  The second part gets to the reassessment of our 1990 exclusion trick, first considering that it is not producing a true posterior, then criticising Larry’s 2000 analysis as building a data-dependent “prior”, and at last proposing a reformulation where the exclusion of the empty components and those with one allocated observation becomes part of the “prior” (albeit a prior on the allocation vector). In fine, the posterior thus constructed remains the same as ours, with a message that if we start our model as the likelihood of the sample excluding empty or single-observation terms, we can produce a proper Bayesian analysis. (Except for a missing if minor renormalisation.) This leads me to wonder about the conclusion that inference about the (unknown) number of components in the mixture being impossible from this perspective. For instance, we could define fractional Bayes factors à la O’Hagan (1995) this way, i.e. starting from the restricted likelihood and taking a fraction of the likelihood to make the posterior proper, then using the remaining fraction to compute a Bayes factor. (Fractional Bayes factors do not work for the regular likelihood of a Gaussian mixture, irrespective of the sample size.)