## using mixtures towards Bayes factor approximation

Posted in Statistics, Travel, University life with tags , , , , , , on December 11, 2014 by xi'an

Phil O’Neill and Theodore Kypraios from the University of Nottingham have arXived last week a paper on “Bayesian model choice via mixture distributions with application to epidemics and population process models”. Since we discussed this paper during my visit there earlier this year, I was definitely looking forward the completed version of their work. Especially because there are some superficial similarities with our most recent work on… Bayesian model choice via mixtures! (To the point that I misunderstood at the beginning their proposal for ours…)

The central idea in the paper is that, by considering the mixture likelihood

$\alpha\ell_1(\theta_1|\mathbf{x})+(1-\alpha)\ell_2(\theta_2|\mathbf{x})$

where x corresponds to the entire sample, it is straighforward to relate the moments of α with the Bayes factor, namely

$\mathfrak{B}_{12}=\dfrac{\mathbb{E}[\alpha]-\mathbb{E}[\alpha^2]-\mathbb{E}[\alpha|\mathbf{x}](1-\mathbb{E}[\alpha])}{\mathbb{E}[\alpha]\mathbb{E}[\alpha|\mathbf{x}]-\mathbb{E}[\alpha^2]}$

which means that estimating the mixture weight α by MCMC is equivalent to estimating the Bayes factor.

What puzzled me at first was that the mixture weight is in fine estimated with a single “datapoint”, made of the entire sample. So the posterior distribution on α is hardly different from the prior, since it solely varies by one unit! But I came to realise that this is a numerical tool and that the estimator of α is not meaningful  from a statistical viewpoint (thus differing completely from our perspective). This explains why the Beta prior on α can be freely chosen so that the mixing and stability of the Markov chain is improved: This parameter is solely an algorithmic entity.

There are similarities between this approach and the pseudo-prior encompassing perspective of Carlin and Chib (1995), even though the current version does not require pseudo-priors, using true priors instead. But thinking of weakly informative priors and of the MCMC consequence (see below) leads me to wonder if pseudo-priors would not help in this setting…

Another aspect of the paper that still puzzles me is that the MCMC algorithm mixes at all: indeed, depending on the value of the binary latent variable z, one of the two parameters is updated from the true posterior while the other is updated from the prior. It thus seems unlikely that the value of z would change quickly. Creating a huge imbalance in the prior can counteract this difference, but the same problem occurs once z has moved from 0 to 1 or from 1 to 0. It seems to me that resorting to a common parameter [if possible] and using as a proposal the model-based posteriors for both parameters is the only way out of this conundrum. (We do certainly insist on this common parametrisation in our approach as it is paramount to the use of improper priors.)

“In contrast, we consider the case where there is only one datum.”

The idea in the paper is therefore fully computational and relates to other linkage methods that create bridges between two models. It differs from our new notion of Bayesian testing in that we consider estimating the mixture between the two models in comparison, hence considering instead the mixture

$\prod_{i=1}^n\alpha f_1(x_i|\theta_1)+(1-\alpha) f_2(x_i|\theta_2)$

which is another model altogether and does not recover the original Bayes factor (Bayes factor that we altogether dismiss in favour of the posterior median of α and its entire distribution).

## my life as a mixture [slides]

Posted in pictures, Statistics, Travel, University life with tags , , , , , , , , on September 18, 2014 by xi'an

Here are the slides of my talk today at the BAYSM’14 conference in Vienna. Mostly an overview of some of my papers on mixtures, with the most recent stuff…

## my life as a mixture [BAYSM 2014, Wien]

Posted in Books, Kids, Mountains, pictures, Statistics, Travel, University life with tags , , , , , , , , , , on September 12, 2014 by xi'an

Next week I am giving a talk at BAYSM in Vienna. BAYSM is the Bayesian Young Statisticians meeting so one may wonder why, but with Chris Holmes and Mike West, we got invited as more… erm… senior speakers! So I decided to give a definitely senior talk on a thread pursued throughout my career so far, namely mixtures. Plus it also relates to works of the other senior speakers. Here is the abstract for the talk:

Mixtures of distributions are fascinating objects for statisticians in that they both constitute a straightforward extension of standard distributions and offer a complex benchmark for evaluating statistical procedures, with a likelihood both computable in a linear time and enjoying an exponential number of local models (and sometimes infinite modes). This fruitful playground appeals in particular to Bayesians as it constitutes an easily understood challenge to the use of improper priors and of objective Bayes solutions. This talk will review some ancient and some more recent works of mine on mixtures of distributions, from the 1990 Gibbs sampler to the 2000 label switching and to later studies of Bayes factor approximations, nested sampling performances, improper priors, improved importance samplers, ABC, and a inverse perspective on the Bayesian approach to testing of hypotheses.

I am very grateful to the scientific committee for this invitation, as it will give me the opportunity to meet the new generation, learn from them and in addition discover Vienna where I have never been, despite several visits to Austria. Including its top, the Großglockner. I will also give a seminar in Linz the day before. In the Institut für Angewandte Statistik.

## a day for comments

Posted in Mountains, Statistics, Travel, University life with tags , , , , , , , , , , , , , , , , , , , , , , , , , on April 21, 2014 by xi'an

As I was flying over Skye (with [maybe] a first if hazy perspective on the Cuillin ridge!) to Iceland, three long sets of replies to some of my posts appeared on the ‘Og:

Thanks to them for taking the time to answer my musings…

## MCMC for sampling from mixture models

Posted in Kids, Statistics, University life with tags , , on April 17, 2014 by xi'an

Randal Douc, Florian Maire, and Jimmy Olsson recently arXived a paper on the use of Markov chain Monte Carlo methods for the sampling of mixture models, which contains the recourse to Carlin and Chib (1995) pseudo-priors to simulate from a mixture distribution (and not from the posterior distribution associated with a mixture sampling model). As reported earlier, I was in the thesis defence of Florian Maire and this approach had already puzzled me at the time. In short, a mixture structure

$\pi(z)\propto\sum_{m=1}^k \tilde\pi(m,z)$

gives rises to as many auxiliary variables as there are components, minus one: namely, if a simulation z is generated from a given component i of the mixture, one can create pseudo-simulations u from all the other components, using pseudo-priors à la Carlin and Chib. A Gibbs sampler based on this augmented state-space can then be implemented:  (a) simulate a new component index m given (z,u);  (b) simulate a new value of (z,u) given m. One version (MCC) of the algorithm simulates z given m from the proper conditional posterior by a Metropolis step, while another one (FCC) only simulate the u‘s. The paper shows that MCC has a smaller asymptotic variance than FCC. I however fail to understand why a Carlin and Chib is necessary in a mixture context: it seems (from the introduction) that the motivation is that a regular Gibbs sampler [simulating z by a Metropolis-Hastings proposal then m] has difficulties moving between components when those components are well-separated. This is correct but slightly moot, as each component of the mixture can be simulated separately and in advance in z, which leads to a natural construction of (a) the pseudo-priors used in the paper, (b) approximations to the weights of the mixture, and (c) a global mixture independent proposal, which can be used in an independent Metropolis-Hastings mixture proposal that [seems to me to] alleviate(s) the need to simulate the component index m. Both examples used in the paper, a toy two-component two-dimensional Gaussian mixture and another toy two-component one-dimensional Gaussian mixture observed with noise (and in absolute value), do not help in perceiving the definitive need for this Carlin and Chib version. Especially when considering the construction of the pseudo-priors.

## On the use of marginal posteriors in marginal likelihood estimation via importance-sampling

Posted in R, Statistics, University life with tags , , , , , , , , , , , , , on November 20, 2013 by xi'an

Perrakis, Ntzoufras, and Tsionas just arXived a paper on marginal likelihood (evidence) approximation (with the above title). The idea behind the paper is to base importance sampling for the evidence on simulations from the product of the (block) marginal posterior distributions. Those simulations can be directly derived from an MCMC output by randomly permuting the components. The only critical issue is to find good approximations to the marginal posterior densities. This is handled in the paper either by normal approximations or by Rao-Blackwell estimates. the latter being rather costly since one importance weight involves B.L computations, where B is the number of blocks and L the number of samples used in the Rao-Blackwell estimates. The time factor does not seem to be included in the comparison studies run by the authors, although it would seem necessary when comparing scenarii.

After a standard regression example (that did not include Chib’s solution in the comparison), the paper considers  2- and 3-component mixtures. The discussion centres around label switching (of course) and the deficiencies of Chib’s solution against the current method and Neal’s reference. The study does not include averaging Chib’s solution over permutations as in Berkoff et al. (2003) and Marin et al. (2005), an approach that does eliminate the bias. Especially for a small number of components. Instead, the authors stick to the log(k!) correction, despite it being known for being quite unreliable (depending on the amount of overlap between modes). The final example is Diggle et al. (1995) longitudinal Poisson regression with random effects on epileptic patients. The appeal of this model is the unavailability of the integrated likelihood which implies either estimating it by Rao-Blackwellisation or including the 58 latent variables in the analysis.  (There is no comparison with other methods.)

As a side note, among the many references provided by this paper, I did not find trace of Skilling’s nested sampling or of safe harmonic means (as exposed in our own survey on the topic).

## open problem

Posted in R, Statistics with tags , , , , , on October 24, 2013 by xi'an

On the plane back from Warwick, I was reading an ABC arXived paper by Umberto Picchini and Julie Forman, “Accelerating inference for diffusions observed with measurement error and large sample sizes using Approximate Bayesian Computation: A case study” and came upon this open problem:

“A closed-form expression for generating percentiles from a fi nite-components Gaussian mixture is unavailable.” (p.5)

which means solving

$\alpha\Phi(x)+(1-\alpha)\Phi(\{x-\mu)/\sigma) = \beta$

is not possible in closed form. (Of course it could also be argued that the equation Φ(x)=β is unavailable in closed-form ie that the analytic solution x=Φ-1(β) is formal…) While I can think of several numerical approaches, a few minutes with a sheet of paper let me convinced that indeed this is not solvable (hence not an open problem, contrary to the title of the post!).

Just for R practice (and my R course students!), here is a basic R code:

mixant=function(alpha=0.5,beta=0.95,mu,sig=1,prec=1/10^4){
onmal=1-alpha
qbeta=qnorm(beta)

# initial bounds
omb=min(qbeta,mu+sig*qbeta)
omB=max(qbeta,mu+sig*qbeta)
if (beta<alpha){
omB=min(omB,qnorm(beta/alpha))
}else{
omb=max(omb,mu+sig*qnorm((beta-alpha)/onmal))}
if (beta<onmal){
omB=min(omB,mu+sig*qnorm(beta/onmal))
}else{
omb=max(omb,qnorm((beta-onmal)/alpha))}

# iterations
for (t in 1:5){
ranj=seq(omb,omB,len=17)
cfs=alpha*pnorm(ranj)+onmal*pnorm((ranj-mu)/sig)
omb=max(ranj[cfs<=beta])
omB=min(ranj[cfs>=beta])

if ((omB-omb)<prec)
break()}
return(.5*(omb+omB))}