## using mixtures towards Bayes factor approximation

Posted in Statistics, Travel, University life with tags , , , , , , on December 11, 2014 by xi'an

Phil O’Neill and Theodore Kypraios from the University of Nottingham have arXived last week a paper on “Bayesian model choice via mixture distributions with application to epidemics and population process models”. Since we discussed this paper during my visit there earlier this year, I was definitely looking forward the completed version of their work. Especially because there are some superficial similarities with our most recent work on… Bayesian model choice via mixtures! (To the point that I misunderstood at the beginning their proposal for ours…)

The central idea in the paper is that, by considering the mixture likelihood

$\alpha\ell_1(\theta_1|\mathbf{x})+(1-\alpha)\ell_2(\theta_2|\mathbf{x})$

where x corresponds to the entire sample, it is straighforward to relate the moments of α with the Bayes factor, namely

$\mathfrak{B}_{12}=\dfrac{\mathbb{E}[\alpha]-\mathbb{E}[\alpha^2]-\mathbb{E}[\alpha|\mathbf{x}](1-\mathbb{E}[\alpha])}{\mathbb{E}[\alpha]\mathbb{E}[\alpha|\mathbf{x}]-\mathbb{E}[\alpha^2]}$

which means that estimating the mixture weight α by MCMC is equivalent to estimating the Bayes factor.

What puzzled me at first was that the mixture weight is in fine estimated with a single “datapoint”, made of the entire sample. So the posterior distribution on α is hardly different from the prior, since it solely varies by one unit! But I came to realise that this is a numerical tool and that the estimator of α is not meaningful  from a statistical viewpoint (thus differing completely from our perspective). This explains why the Beta prior on α can be freely chosen so that the mixing and stability of the Markov chain is improved: This parameter is solely an algorithmic entity.

There are similarities between this approach and the pseudo-prior encompassing perspective of Carlin and Chib (1995), even though the current version does not require pseudo-priors, using true priors instead. But thinking of weakly informative priors and of the MCMC consequence (see below) leads me to wonder if pseudo-priors would not help in this setting…

Another aspect of the paper that still puzzles me is that the MCMC algorithm mixes at all: indeed, depending on the value of the binary latent variable z, one of the two parameters is updated from the true posterior while the other is updated from the prior. It thus seems unlikely that the value of z would change quickly. Creating a huge imbalance in the prior can counteract this difference, but the same problem occurs once z has moved from 0 to 1 or from 1 to 0. It seems to me that resorting to a common parameter [if possible] and using as a proposal the model-based posteriors for both parameters is the only way out of this conundrum. (We do certainly insist on this common parametrisation in our approach as it is paramount to the use of improper priors.)

“In contrast, we consider the case where there is only one datum.”

The idea in the paper is therefore fully computational and relates to other linkage methods that create bridges between two models. It differs from our new notion of Bayesian testing in that we consider estimating the mixture between the two models in comparison, hence considering instead the mixture

$\prod_{i=1}^n\alpha f_1(x_i|\theta_1)+(1-\alpha) f_2(x_i|\theta_2)$

which is another model altogether and does not recover the original Bayes factor (Bayes factor that we altogether dismiss in favour of the posterior median of α and its entire distribution).

## amazonish thanks (& repeated warning)

Posted in Books, Kids, R, Statistics with tags , , , , , on December 9, 2014 by xi'an

Once again, books I reviewed, positively or negatively, were among the top purchases… Like a dozen Monte Carlo simulation and resampling methods for social science , a few copies of Naked Statistics. And again a few of The Cartoon Introduction to Statistics. (Despite a most critical review.) Thanks to all of you using those links and feeding further my book addiction, with the drawback of inducing even more fantasy book reviews.

## an ABC experiment

Posted in Books, pictures, R, Statistics, University life with tags , , , , , , , , on November 24, 2014 by xi'an

In a cross-validated forum exchange, I used the code below to illustrate the working of an ABC algorithm:

#normal data with 100 observations
n=100
x=rnorm(n)
#observed summaries

#normal x gamma prior
priori=function(N){
return(cbind(rnorm(N,sd=10),
1/sqrt(rgamma(N,shape=2,scale=5))))
}

ABC=function(N,alpha=.05){

prior=priori(N) #reference table

#pseudo-data
summ=matrix(0,N,2)
for (i in 1:N){
xi=rnorm(n)*prior[i,2]+prior[i,1]
}

#normalisation factor for the distance
#distance
#selection
posterior=prior[dist<quantile(dist,alpha),]}


Hence I used the median and the mad as my summary statistics. And the outcome is rather surprising, for two reasons: the first one is that the posterior on the mean μ is much wider than when using the mean and the variance as summary statistics. This is not completely surprising in that the latter are sufficient, while the former are not. Still, the (-10,10) range on the mean is way larger… The second reason for surprise is that the true posterior distribution cannot be derived since the joint density of med and mad is unavailable.

After thinking about this for a while, I went back to my workbench to check the difference with using mean and variance. To my greater surprise, I found hardly any difference! Using the almost exact ABC with 10⁶ simulations and a 5% subsampling rate returns exactly the same outcome. (The first row above is for the sufficient statistics (mean,standard deviation) while the second row is for the (median,mad) pair.) Playing with the distance does not help. The genuine posterior output is quite different, as exposed on the last row of the above, using a basic Gibbs sampler since the posterior is not truly conjugate.

## Sequentially Constrained Monte Carlo

Posted in Books, Mountains, pictures, Statistics, University life with tags , , , , , , , , , , on November 7, 2014 by xi'an

This newly arXived paper by S. Golchi and D. Campbell from Vancouver (hence the above picture) considers the (quite) interesting problem of simulating from a target distribution defined by a constraint. This is a question that have bothered me for a long while as I could not come up with a satisfactory solution all those years… Namely, when considering a hard constraint on a density, how can we find a sequence of targets that end up with the restricted density? This is of course connected with the zero measure case posted a few months ago. For instance, how do we efficiently simulate a sample from a Student’s t distribution with a fixed sample mean and a fixed sample variance?

“The key component of SMC is the filtering sequence of distributions through which the particles evolve towards the target distribution.” (p.3)

This is indeed the main issue! The paper considers using a sequence of intermediate targets hardening progressively the constraint(s), along with an SMC sampler, but this recommendation remains rather vague and hence I am at loss as to how to make it work when the exact constraint implies a change of measure. The first example is monotone regression where y has mean f(x) and f is monotone. (Everything is unidimensional here.) The sequence is then defined by adding a multiplicative term that is a function of ∂f/∂x, for instance

Φ(τ∂f/∂x),

with τ growing to infinity to make the constraint moving from soft to hard. An interesting introduction, even though the hard constraint does not imply a change of parameter space or of measure. The second example is about estimating the parameters of an ODE, with the constraint being the ODE being satisfied exactly. Again, not exactly what I was looking for. But with an exotic application to deaths from the 1666 Black (Death) plague.

And then the third example is about ABC and the choice of summary statistics! The sequence of constraints is designed to keep observed and simulated summary statistics close enough when the dimension of those summaries increases, which means they are considered simultaneously rather than jointly. (In the sense of Ratmann et al., 2009. That is, with a multidimensional distance.) The model used for the application of the SMC is the dynamic model of Wood (2010, Nature). The outcome of this specific implementation is not that clear compared with alternatives… And again sadly does not deal with the/my zero measure issue.

## density normalization for MCMC algorithms

Posted in Statistics, University life with tags , , , , , , , , on November 6, 2014 by xi'an

Another paper addressing the estimation of the normalising constant and the wealth of available solutions just came out on arXiv, with the full title of “Target density normalization for Markov chain Monte Carlo algorithms“, written by Allen Caldwell and Chang Liu. (I became aware of it by courtesy of Ewan Cameron, as it appeared in the physics section of arXiv. It is actually a wee bit annoying that papers in the subcategory “Data Analysis, Statistics and Probability” of physics do not get an automated reposting on the statistics lists…)

In this paper, the authors compare three approaches to the problem of finding

$\mathfrak{I} = \int_\Omega f(\lambda)\,\text{d}\lambda$

when the density f is unormalised, i.e., in more formal terms, when f is proportional to a probability density (and available):

1. an “arithmetic mean”, which is an importance sampler based on (a) reducing the integration volume to a neighbourhood ω of the global mode. This neighbourhood is chosen as an hypercube and the importance function turns out to be the uniform over this hypercube. The corresponding estimator is then a rescaled version of the average of f over uniform simulations in ω.
2.  an “harmonic mean”, of all choices!, with again an integration over the neighbourhood ω of the global mode in order to avoid the almost sure infinite variance of harmonic mean estimators.
3. a Laplace approximation, using the target at the mode and the Hessian at the mode as well.

The paper then goes to comparing those three solutions on a few examples, demonstrating how the diameter of the hypercube can be calibrated towards a minimum (estimated) uncertainty. The rather anticlimactic conclusion is that the arithmetic mean is the most reliable solution as harmonic means may fail in larger dimension and more importantly fail to signal its failure, while Laplace approximations only approximate well quasi-Gaussian densities…

What I find most interesting in this paper is the idea of using only one part of the integration space to compute the integral, even though it is not exactly new. Focussing on a specific region ω has pros and cons, the pros being that the reduction to a modal region reduces needs for absolute MCMC convergence and helps in selecting alternative proposals and also prevents from the worst consequences of using a dreaded harmonic mean, the cons being that the region needs be well-identified, which means requirements on the MCMC kernel, and that the estimate is a product of two estimates, the frequency being driven by a Binomial noise.  I also like very much the idea of calibrating the diameter Δof the hypercube ex-post by estimating the uncertainty.

As an aside, the paper mentions most of the alternative solutions I just presented in my Monte Carlo graduate course two days ago (like nested or bridge or Rao-Blackwellised sampling, including our proposal with Darren Wraith), but dismisses them as not “directly applicable in an MCMC setting”, i.e., without modifying this setting. I unsurprisingly dispute this labelling, both because something like the Laplace approximation requires extra-work on the MCMC output (and once done this work can lead to advanced Laplace methods like INLA) and because other methods could be considered as well (for instance, bridge sampling over several hypercubes). As shown in the recent paper by Mathieu Gerber and Nicolas Chopin (soon to be discussed at the RSS!), MCqMC has also become a feasible alternative that would compete well with the methods studied in this paper.

Overall, this is a paper that comes in a long list of papers on constant approximations. I do not find the Markov chain of MCMC aspect particularly compelling or specific, once the effective sample size is accounted for. It would be nice to find generic ways of optimising the visit to the hypercube ω and to estimate efficiently the weight of ω. The comparison is solely run over examples, but they all rely on a proper characterisation of the hypercube and the ability to simulate efficiently f over that hypercube.

## postdoc in Paris?

Posted in Kids, Statistics, Travel, University life with tags , , , , , , , on November 4, 2014 by xi'an

There is an open call of the Fondation Sciences Mathématiques de Paris (FSMP) about a postdoctoral funding program with 18 position-years available for staying in Université Paris-Dauphine (and other participating universities). The net support is quite decent  (wrt French terms and academic salaries) and the application form easy to fill. So, if you are interested in coming to Paris to work on ABC, MCMC, Bayesian model choice, &tc., feel free to contact me (or another Parisian statistician) and to apply! The deadline is December 01, 2014.  And the decision will be made by January 15, 2015. The starting date for the postdoc is October 01, 2015.

## I am cold all over…

Posted in Books, Kids, Statistics, University life with tags , , , , , , , on October 29, 2014 by xi'an

An email from one of my Master students who sent his problem sheet (taken from Monte Carlo Statistical Methods) late:

Bonsoir Professeur
Je « suis » votre cours du mercredi dont le formalisme mathématique me fait froid partout
Avec beaucoup de difficulté je vous envoie mes exercices du premier chapitre de votre livre.

which translates as

Good evening Professor,
I “follow” your Wednesday class which mathematical formalism makes me cold all over. With much hardship, I send you the first batch of problems from your book.

I know that winter is coming, but, still, making students shudder from mathematical cold is not my primary goal when teaching Monte Carlo methods!