## On the use of marginal posteriors in marginal likelihood estimation via importance-sampling

Posted in R, Statistics, University life with tags , , , , , , , , , , , , , on November 20, 2013 by xi'an

Perrakis, Ntzoufras, and Tsionas just arXived a paper on marginal likelihood (evidence) approximation (with the above title). The idea behind the paper is to base importance sampling for the evidence on simulations from the product of the (block) marginal posterior distributions. Those simulations can be directly derived from an MCMC output by randomly permuting the components. The only critical issue is to find good approximations to the marginal posterior densities. This is handled in the paper either by normal approximations or by Rao-Blackwell estimates. the latter being rather costly since one importance weight involves B.L computations, where B is the number of blocks and L the number of samples used in the Rao-Blackwell estimates. The time factor does not seem to be included in the comparison studies run by the authors, although it would seem necessary when comparing scenarii.

After a standard regression example (that did not include Chib’s solution in the comparison), the paper considers  2- and 3-component mixtures. The discussion centres around label switching (of course) and the deficiencies of Chib’s solution against the current method and Neal’s reference. The study does not include averaging Chib’s solution over permutations as in Berkoff et al. (2003) and Marin et al. (2005), an approach that does eliminate the bias. Especially for a small number of components. Instead, the authors stick to the log(k!) correction, despite it being known for being quite unreliable (depending on the amount of overlap between modes). The final example is Diggle et al. (1995) longitudinal Poisson regression with random effects on epileptic patients. The appeal of this model is the unavailability of the integrated likelihood which implies either estimating it by Rao-Blackwellisation or including the 58 latent variables in the analysis.  (There is no comparison with other methods.)

As a side note, among the many references provided by this paper, I did not find trace of Skilling’s nested sampling or of safe harmonic means (as exposed in our own survey on the topic).

## open problem

Posted in R, Statistics with tags , , , , , on October 24, 2013 by xi'an

On the plane back from Warwick, I was reading an ABC arXived paper by Umberto Picchini and Julie Forman, “Accelerating inference for diffusions observed with measurement error and large sample sizes using Approximate Bayesian Computation: A case study” and came upon this open problem:

“A closed-form expression for generating percentiles from a fi nite-components Gaussian mixture is unavailable.” (p.5)

which means solving

$\alpha\Phi(x)+(1-\alpha)\Phi(\{x-\mu)/\sigma) = \beta$

is not possible in closed form. (Of course it could also be argued that the equation Φ(x)=β is unavailable in closed-form ie that the analytic solution x=Φ-1(β) is formal…) While I can think of several numerical approaches, a few minutes with a sheet of paper let me convinced that indeed this is not solvable (hence not an open problem, contrary to the title of the post!).

Just for R practice (and my R course students!), here is a basic R code:

mixant=function(alpha=0.5,beta=0.95,mu,sig=1,prec=1/10^4){
onmal=1-alpha
qbeta=qnorm(beta)

# initial bounds
omb=min(qbeta,mu+sig*qbeta)
omB=max(qbeta,mu+sig*qbeta)
if (beta<alpha){
omB=min(omB,qnorm(beta/alpha))
}else{
omb=max(omb,mu+sig*qnorm((beta-alpha)/onmal))}
if (beta<onmal){
omB=min(omB,mu+sig*qnorm(beta/onmal))
}else{
omb=max(omb,qnorm((beta-onmal)/alpha))}

# iterations
for (t in 1:5){
ranj=seq(omb,omB,len=17)
cfs=alpha*pnorm(ranj)+onmal*pnorm((ranj-mu)/sig)
omb=max(ranj[cfs<=beta])
omB=min(ranj[cfs>=beta])

if ((omB-omb)<prec)
break()}
return(.5*(omb+omB))}


## Dear Sir, I am unable to understand…

Posted in Statistics, University life with tags , , , , , , on January 30, 2013 by xi'an

Here is an email I received a few days ago, similar to many other emails I/we receive on a regular basis:

I am working on Markov Chain Monte Carlo methods as part of my Masters project. I have to estimate mean, variance from a Gaussian mixture using metropolis method.  I came across your paper ‘Bayesian Modelling and Inference on Mixtures of Distributions’. I am unable to understand how to obtain the new sample for mean, variance etc… I am using uniform distribution as proposal distribution. Should it be random numbers for the proposal distribution.
I have been working and trying to understand this for a long time. I would be grateful for any help.

While I felt sorry for the Master student, I consider it is the responsibility of his/her advisor to give her/him the proper directions for understanding the paper. (Given the contents of the email, it sounds as if the student would require proper training in both Bayesian statistics [uniform priors on unbounded parameters?] and simulation [the question about random numbers does not make sense]…) This is what I replied to the student, hopefully in a positive tone.

## estimating the measure and hence the constant

Posted in pictures, Running, Statistics, University life with tags , , , , , , , on December 6, 2012 by xi'an

As mentioned on my post about the final day of the ICERM workshop, Xiao-Li Meng addresses this issue of “estimating the constant” in his talk. It is even his central theme. Here are his (2011) slides as he sent them to me (with permission to post them!):

He therefore points out in slide #5 why the likelihood cannot be expressed in terms of the normalising constant because this is not a free parameter. Right! His explanation for the approximation of the unknown constant is then to replace the known but intractable dominating measure—in the sense that it cannot compute the integral—with a discrete (or non-parametric) measure supported by the sample. Because the measure is defined up to a constant, this leads to sample weights being proportional to the inverse density. Of course, this representation of the problem is open to criticism: why focus only on measures supported by the sample? The fact that it is the MLE is used as an argument in Xiao-Li’s talk, but this can alternatively be seen as a drawback: I remember reviewing Dankmar Böhning’s Computer-Assisted Analysis of Mixtures and being horrified when discovering this feature! I am currently more agnostic since this appears as an alternative version of empirical likelihood. There are still questions about the measure estimation principle: for instance, when handling several samples from several distributions, why should they all contribute to a single estimate of μ rather than to a product of measures? (Maybe because their models are all dominated by the same measure μ.) Now, getting back to my earlier remark, and as a possible answer to Larry’s quesiton, there could well be a Bayesian version of the above, avoiding the rough empirical likelihood via Gaussian or Drichlet process prior modelling.

## AMOR at 5000ft in a water tank…

Posted in Mountains, pictures, Statistics, University life with tags , , , , , , , , , , , , , , on November 22, 2012 by xi'an

On Monday, I attended the thesis defence of Rémi Bardenet in Orsay as a member (referee) of his thesis committee. While this was a thesis in computer science, which took place in the Linear Accelerator Lab in Orsay, it was clearly rooted in computational statistics, hence justifying my presence in the committee. The justification (!) for the splashy headline of this post is that Rémi’s work was motivated by the Pierre-Auger experiment on ultra-high-energy cosmic rays, where particles are detected through a network of 1600 water tanks spread over the Argentinian Pampa Amarilla on an area the size of Rhode Island (where I am incidentally going next week).

The part of Rémi’s thesis presented during the defence concentrated on his AMOR algorithm, arXived in a paper written with Olivier Cappé and Gersende Fort. AMOR stands for adaptive Metropolis online relabelling and combines adaptive MCMC techniques with relabelling strategies to fight label-switching (e.g., in mixtures). I have been interested in mixtures for eons (starting in 1987 in Ottawa with applying Titterington, Smith, and Makov to chest radiographs) and in label switching for ages (starting at the COMPSTAT conférence in Bristol in 1998). Rémi’s approach to the label switching problem follows the relabelling path, namely a projection of the original parameter space into a smaller subspace (that is also a quotient space) to avoid permutation invariance and lack of identifiability. (In the survey I wrote with Kate Lee, Jean-Michel Marin and Kerrie Mengersen, we suggest using the mode as a pivot to determine which permutation to use on the components of the mixture.) The paper suggests using an Euclidean distance to a mean determined adaptively, μt, with a quadratic form Σt also determined on-the-go, minimising (Pθ-μt)TΣt(Pθ-μt) over all permutations P at each step of the algorithm. The intuition behind the method is that the posterior over the restricted space should look like a roughly elliptically symmetric distribution, or at least like a unimodal distribution, rather than borrowing bits and pieces from different modes. While I appreciate the technical tour de force represented by the proof of convergence of the AMOR algorithm, I remain somehow sceptical about the approach and voiced the following objections during the defence: first, the assumption that the posterior becomes unimodal under an appropriate restriction is not necessarily realistic. Secondary modes often pop in with real data (as in the counter-example we used in our paper with Alessandra Iacobucci and Jean-Michel Marin). Next, the whole apparatus of fighting multiple modes and non-identifiability, i.e. fighting label switching, is to fall back on posterior means as Bayes estimators. As stressed in our JASA paper with Gilles Celeux and Merrilee Hurn, there is no reason for doing so and there are several reasons for not doing so:

• it breaks down under model specification, i.e., when the number of components is not correct
• it does not improve the speed of convergence but, on the opposite, restricts the space visited by the Markov chain
• it may fall victim to the fatal attraction of secondary modes by fitting too small an ellipse around one of those modes
• it ultimately depends on the parameterisation of the model
• there is no reason for using posterior means in mixture problems, posterior modes or cluster centres can be used instead

I am therefore very much more in favour of producing a posterior distribution that is as label switching as possible (since the true posterior is completely symmetric in this respect). Post-processing the resulting sample can be done by using off-the-shelf clustering in the component space, derived from the point process representation used by Matthew Stephens in his thesis and subsequent papers. It also allows for a direct estimation of the number of components.

In any case, this was a defence worth-attending that led me to think afresh about the label switching problem, with directions worth exploring next month while Kate Lee is visiting from Auckland. Rémi Bardenet is now headed for a postdoc in Oxford, a perfect location to discuss further label switching and to engage into new computational statistics research!