## a mistake in a 1990 paper

Posted in Kids, Statistics, University life with tags , , , , , , , , on August 7, 2016 by xi'an

As we were working on the Handbook of mixture analysis with Sylvia Früwirth-Schnatter and Gilles Celeux today, near Saint-Germain des Près, I realised that there was a mistake in our 1990 mixture paper with Jean Diebolt [published in 1994], in that when we are proposing to use improper “Jeffreys” priors under the restriction that no component of the Gaussian mixture is “empty”, meaning that there are at least two observations generated from each component, the likelihood needs to be renormalised to be a density for the sample. This normalisation constant only depends on the weights of the mixture, which means that, when simulating from the full conditional distribution of the weights, there should be an extra-acceptance step to account for this correction. Of course, the term is essentially equal to one for a large enough sample but this remains a mistake nonetheless! It is funny that it remained undetected for so long in my most cited paper. Checking on Larry’s 1999 paper exploring the idea of excluding terms from the likelihood to allow for improper priors, I did not spot him using a correction either.

## using mixtures towards Bayes factor approximation

Posted in Statistics, Travel, University life with tags , , , , , , on December 11, 2014 by xi'an

Phil O’Neill and Theodore Kypraios from the University of Nottingham have arXived last week a paper on “Bayesian model choice via mixture distributions with application to epidemics and population process models”. Since we discussed this paper during my visit there earlier this year, I was definitely looking forward the completed version of their work. Especially because there are some superficial similarities with our most recent work on… Bayesian model choice via mixtures! (To the point that I misunderstood at the beginning their proposal for ours…)

The central idea in the paper is that, by considering the mixture likelihood

$\alpha\ell_1(\theta_1|\mathbf{x})+(1-\alpha)\ell_2(\theta_2|\mathbf{x})$

where x corresponds to the entire sample, it is straighforward to relate the moments of α with the Bayes factor, namely

$\mathfrak{B}_{12}=\dfrac{\mathbb{E}[\alpha]-\mathbb{E}[\alpha^2]-\mathbb{E}[\alpha|\mathbf{x}](1-\mathbb{E}[\alpha])}{\mathbb{E}[\alpha]\mathbb{E}[\alpha|\mathbf{x}]-\mathbb{E}[\alpha^2]}$

which means that estimating the mixture weight α by MCMC is equivalent to estimating the Bayes factor.

What puzzled me at first was that the mixture weight is in fine estimated with a single “datapoint”, made of the entire sample. So the posterior distribution on α is hardly different from the prior, since it solely varies by one unit! But I came to realise that this is a numerical tool and that the estimator of α is not meaningful  from a statistical viewpoint (thus differing completely from our perspective). This explains why the Beta prior on α can be freely chosen so that the mixing and stability of the Markov chain is improved: This parameter is solely an algorithmic entity.

There are similarities between this approach and the pseudo-prior encompassing perspective of Carlin and Chib (1995), even though the current version does not require pseudo-priors, using true priors instead. But thinking of weakly informative priors and of the MCMC consequence (see below) leads me to wonder if pseudo-priors would not help in this setting…

Another aspect of the paper that still puzzles me is that the MCMC algorithm mixes at all: indeed, depending on the value of the binary latent variable z, one of the two parameters is updated from the true posterior while the other is updated from the prior. It thus seems unlikely that the value of z would change quickly. Creating a huge imbalance in the prior can counteract this difference, but the same problem occurs once z has moved from 0 to 1 or from 1 to 0. It seems to me that resorting to a common parameter [if possible] and using as a proposal the model-based posteriors for both parameters is the only way out of this conundrum. (We do certainly insist on this common parametrisation in our approach as it is paramount to the use of improper priors.)

“In contrast, we consider the case where there is only one datum.”

The idea in the paper is therefore fully computational and relates to other linkage methods that create bridges between two models. It differs from our new notion of Bayesian testing in that we consider estimating the mixture between the two models in comparison, hence considering instead the mixture

$\prod_{i=1}^n\alpha f_1(x_i|\theta_1)+(1-\alpha) f_2(x_i|\theta_2)$

which is another model altogether and does not recover the original Bayes factor (Bayes factor that we altogether dismiss in favour of the posterior median of α and its entire distribution).

## noninformative priors for mixtures

Posted in Books, Statistics, University life with tags , , , , , , , , on May 26, 2014 by xi'an

“A novel formulation of the mixture model is introduced, which includes the prior constraint that each Gaussian component is always assigned a minimal number of data points. This enables noninformative improper priors such as the Jeffreys prior to be placed on the component parameters. We demonstrate difficulties involved in specifying a prior for the standard Gaussian mixture model, and show how the new model can be used to overcome these. MCMC methods are given for efficient sampling from the posterior of this model.” C. Stoneking

Following in the theme of the Jeffreys’ post of two weeks ago, I spotted today a newly arXived paper about using improper priors for mixtures…and surviving it! It is entitled “Bayesian inference of Gaussian mixture models with noninformative priors” and written by Colin Stoneking at ETH Zürich. As mentioned in the previous post, one specificity of our 1990-1994 paper on mixture with Jean Diebolt was to allow for improper priors by imposing at least two observations per component. The above abstract thus puzzled me until I found on page 3 that the paper was indeed related to ours (and Larry’s 2000 validation)! Actually, I should not complain about citations of my earlier works on mixtures as they cover seven different papers, but the bibliography is somewhat missing the paper we wrote with George Casella and Marty Wells in Statistical Methodology in 2004 (this was actually the very first paper of this new journal!), where we show that conjugate priors allow for the integration of the weights, resulting in a close-form expression for the distribution of the partition vector. (This was also extended in the chapter “Exact Bayesian Analysis of Mixtures” I wrote with Kerrie Mengersen in our book Mixtures: Estimation and Applications.)

“There is no well-founded, general method to choose the parameters of a given prior to make it weakly informative for Gaussian mixtures.” C. Stoneking

The first part of the paper shows why looking for weakly informative priors is doomed to fail in this mixture setting: there is no stabilisation as hyperparameters get towards the border (between proper-ness and improper-ness), and on the opposite the frequency of appearances of empty components grows steadily to 100%…  The second part gets to the reassessment of our 1990 exclusion trick, first considering that it is not producing a true posterior, then criticising Larry’s 2000 analysis as building a data-dependent “prior”, and at last proposing a reformulation where the exclusion of the empty components and those with one allocated observation becomes part of the “prior” (albeit a prior on the allocation vector). In fine, the posterior thus constructed remains the same as ours, with a message that if we start our model as the likelihood of the sample excluding empty or single-observation terms, we can produce a proper Bayesian analysis. (Except for a missing if minor renormalisation.) This leads me to wonder about the conclusion that inference about the (unknown) number of components in the mixture being impossible from this perspective. For instance, we could define fractional Bayes factors à la O’Hagan (1995) this way, i.e. starting from the restricted likelihood and taking a fraction of the likelihood to make the posterior proper, then using the remaining fraction to compute a Bayes factor. (Fractional Bayes factors do not work for the regular likelihood of a Gaussian mixture, irrespective of the sample size.)

## Jeffreys prior with improper posterior

Posted in Books, Statistics, University life with tags , , , , , , , , , , on May 12, 2014 by xi'an

In a complete coincidence with my visit to Warwick this week, I became aware of the paper “Inference in two-piece location-scale models with Jeffreys priors” recently published in Bayesian Analysis by Francisco Rubio and Mark Steel, both from Warwick. Paper where they exhibit a closed-form Jeffreys prior for the skewed distribution

$\dfrac{2\epsilon}{\sigma_1}f(\{x-\mu\}/\sigma_1)\mathbb{I}_{x<\mu}+\dfrac{2(1-\epsilon)}{\sigma_2}f(\{x-\mu\}/\sigma_2) \mathbb{I}_{x>\mu}$

where f is a symmetric density, namely

$\pi(\mu,\sigma_1,\sigma_2) \propto 1 \big/ \sigma_1\sigma_2\{\sigma_1+\sigma_2\}\,,$

where

$\epsilon=\sigma_1/\{\sigma_1+\sigma_2\}\,.$

only to show  immediately after that this prior does not allow for a proper posterior, no matter what the sample size is. While the above skewed distribution can always be interpreted as a mixture, being a weighted sum of two terms, it is not strictly speaking a mixture, if only because the “component” can be identified from the observation (depending on which side of μ is stands). The likelihood is therefore a product of simple terms rather than a product of a sum of two terms.

As a solution to this conundrum, the authors consider the alternative of the “independent Jeffreys priors”, which are made of a product of conditional Jeffreys priors, i.e., by computing the Jeffreys prior one parameter at a time with all other parameters considered to be fixed. Which differs from the reference prior, of course, but would have been my second choice as well. Despite criticisms expressed by José Bernardo in the discussion of the paper… The difficulty (in my opinion) resides in the choice (and difficulty) of the parameterisation of the model, since those priors are not parameterisation-invariant. (Xinyi Xu makes the important comment that even those priors incorporate strong if hidden information. Which relates to our earlier discussion with Kaniav Kamari on the “dangers” of prior modelling.)

Although the outcome is puzzling, I remain just slightly sceptical of the income, namely Jeffreys prior and the corresponding Fisher information: the fact that the density involves an indicator function and is thus discontinuous in the location μ at the observation x makes the likelihood function not differentiable and hence the derivation of the Fisher information not strictly valid. Since the indicator part cannot be differentiated. Not that I am seeing the Jeffreys prior as the ultimate grail for non-informative priors, far from it, but there is definitely something specific in the discontinuity in the density. (In connection with the later point, Weiss and Suchard deliver a highly critical commentary on the non-need for reference priors and the preference given to a non-parametric Bayes primary analysis. Maybe making the point towards a greater convergence of the two perspectives, objective Bayes and non-parametric Bayes.)

This paper and the ensuing discussion about the properness of the Jeffreys posterior reminded me of our earliest paper on the topic with Jean Diebolt. Where we used improper priors on location and scale parameters but prohibited allocations (in the Gibbs sampler) that would lead to less than two observations per components, thereby ensuring that the (truncated) posterior was well-defined. (This feature also remained in the Series B paper, submitted at the same time, namely mid-1990, but only published in 1994!)  Larry Wasserman proved ten years later that this truncation led to consistent estimators, but I had not thought about it in very long while. I still like this notion of forcing some (enough) datapoints into each component for an allocation (of the latent indicator variables) to be an acceptable Gibbs move. This is obviously not compatible with the iid representation of a mixture model, but it expresses the requirement that components all have a meaning in terms of the data, namely that all components contributed to generating a part of the data. This translates as a form of weak prior information on how much we trust the model and how meaningful each component is (in opposition to adding meaningless extra-components with almost zero weights or almost identical parameters).

As a marginalia, the insistence in Rubio and Steel’s paper that all observations in the sample be different also reminded me of a discussion I wrote for one of the Valencia proceedings (Valencia 6 in 1998) where Mark presented a paper with Carmen Fernández on this issue of handling duplicated observations modelled by absolutely continuous distributions. (I am afraid my discussion is not worth the \$250 price tag given by amazon!)