## latest interviews on the philosophy of religion(s)

Posted in Books, Kids with tags , , , , , , , , , , on November 1, 2014 by xi'an

“But is the existence of God just a philosophical question, like, say, the definition of knowledge or the existence of Plato’s forms?” Gary Gutting, NYT

Although I stopped following The Stone‘s interviews of philosophers about their views on religion, six more took place and Gary Gutting has now closed the series he started a while ago with a self-interview. On this occasion, I went quickly through the last interviews, which had the same variability in depth and appeal as the earlier ones. A lot of them were somewhat misplaced in trying to understand or justify the reasons for believing in a god (a.k.a., God), which sounds more appropriate for a psychology or sociology perspective. I presume that what I was expecting from the series was more a “science vs. religion” debate, rather than entries into the metaphysics of various religions… Continue reading

## snapshot from Salem

Posted in Kids, pictures, Travel with tags , , , , , , , on October 31, 2014 by xi'an

## label switching in Bayesian mixture models

Posted in Books, Statistics, University life with tags , , , , , , , , , , , on October 31, 2014 by xi'an

A referee of our paper on approximating evidence for mixture model with Jeong Eun Lee pointed out the recent paper by Carlos Rodríguez and Stephen Walker on label switching in Bayesian mixture models: deterministic relabelling strategies. Which appeared this year in JCGS and went beyond, below or above my radar.

Label switching is an issue with mixture estimation (and other latent variable models) because mixture models are ill-posed models where part of the parameter is not identifiable. Indeed, the density of a mixture being a sum of terms

$\sum_{j=1}^k \omega_j f(y|\theta_i)$

the parameter (vector) of the ω’s and of the θ’s is at best identifiable up to an arbitrary permutation of the components of the above sum. In other words, “component #1 of the mixture” is not a meaningful concept. And hence cannot be estimated.

This problem has been known for quite a while, much prior to EM and MCMC algorithms for mixtures, but it is only since mixtures have become truly estimable by Bayesian approaches that the debate has grown on this issue. In the very early days, Jean Diebolt and I proposed ordering the components in a unique way to give them a meaning. For instant, “component #1″ would then be the component with the smallest mean or the smallest weight and so on… Later, in one of my favourite X papers, with Gilles Celeux and Merrilee Hurn, we exposed the convergence issues related with the non-identifiability of mixture models, namely that the posterior distributions were almost always multimodal, with a multiple of k! symmetric modes in the case of exchangeable priors, and therefore that Markov chains would have trouble to visit all those modes in a symmetric manner, despite the symmetry being guaranteed from the shape of the posterior. And we conclude with the slightly provocative statement that hardly any Markov chain inferring about mixture models had ever converged! In parallel, time-wise, Matthew Stephens had completed a thesis at Oxford on the same topic and proposed solutions for relabelling MCMC simulations in order to identify a single mode and hence produce meaningful estimators. Giving another meaning to the notion of “component #1″.

And then the topic began to attract more and more researchers, being both simple to describe and frustrating in its lack of definitive answer, both from simulation and inference perspectives. Rodriguez’s and Walker’s paper provides a survey on the label switching strategies in the Bayesian processing of mixtures, but its innovative part is in deriving a relabelling strategy. Which consists of finding the optimal permutation (at each iteration of the Markov chain) by minimising a loss function inspired from k-means clustering. Which is connected with both Stephens’ and our [JASA, 2000] loss functions. The performances of this new version are shown to be roughly comparable with those of other relabelling strategies, in the case of Gaussian mixtures. (Making me wonder if the choice of the loss function is not favourable to Gaussian mixtures.) And somehow faster than Stephens’ Kullback-Leibler loss approach.

“Hence, in an MCMC algorithm, the indices of the parameters can permute multiple times between iterations. As a result, we cannot identify the hidden groups that make [all] ergodic averages to estimate characteristics of the components useless.”

One section of the paper puzzles me, albeit it does not impact the methodology and the conclusions. In Section 2.1 (p.27), the authors consider the quantity

$p(z_i=j|{\mathbf y})$

which is the marginal probability of allocating observation i to cluster or component j. Under an exchangeable prior, this quantity is uniformly equal to 1/k for all observations i and all components j, by virtue of the invariance under permutation of the indices… So at best this can serve as a control variate. Later in Section 2.2 (p.28), the above sentence does signal a problem with those averages but it seem to attribute it to MCMC behaviour rather than to the invariance of the posterior (or to the non-identifiability of the components per se). At last, the paper mentions that “given the allocations, the likelihood is invariant under permutations of the parameters and the allocations” (p.28), which is not correct, since eqn. (8)

$f(y_i|\theta_{\sigma(z_i)}) =f(y_i|\theta_{\tau(z_i)})$

does not hold when the two permutations σ and τ give different images of zi

## Relevant statistics for Bayesian model choice [hot off the press!]

Posted in Books, Statistics, University life with tags , , , , , , on October 30, 2014 by xi'an

Our paper about evaluating statistics used for ABC model choice has just appeared in Series B! It somewhat paradoxical that it comes out just a few days after we submitted our paper on using random forests for Bayesian model choice, thus bypassing the need for selecting those summary statistics by incorporating all statistics available and letting the trees automatically rank those statistics in term of their discriminating power. Nonetheless, this paper remains an exciting piece of work (!) as it addresses the more general and pressing question of the validity of running a Bayesian analysis with only part of the information contained in the data. Quite usefull in my (biased) opinion when considering the emergence of approximate inference already discussed on this ‘Og…

[As a trivial aside, I had first used fresh from the press(es) as the bracketted comment, before I realised the meaning was not necessarily the same in English and in French.]

## I am cold all over…

Posted in Books, Kids, Statistics, University life with tags , , , , , , , on October 29, 2014 by xi'an

An email from one of my Master students who sent his problem sheet (taken from Monte Carlo Statistical Methods) late:

Bonsoir Professeur
Je « suis » votre cours du mercredi dont le formalisme mathématique me fait froid partout
Avec beaucoup de difficulté je vous envoie mes exercices du premier chapitre de votre livre.

which translates as

Good evening Professor,
I “follow” your Wednesday class which mathematical formalism makes me cold all over. With much hardship, I send you the first batch of problems from your book.

I know that winter is coming, but, still, making students shudder from mathematical cold is not my primary goal when teaching Monte Carlo methods!

## reliable ABC model choice via random forests

Posted in pictures, R, Statistics, University life with tags , , , , , , , on October 29, 2014 by xi'an

After a somewhat prolonged labour (!), we have at last completed our paper on ABC model choice with random forests and submitted it to PNAS for possible publication. While the paper is entirely methodological, the primary domain of application of ABC model choice methods remains population genetics and the diffusion of this new methodology to the users is thus more likely via a media like PNAS than via a machine learning or statistics journal.

When compared with our recent update of the arXived paper, there is not much different in contents, as it is mostly an issue of fitting the PNAS publication canons. (Which makes the paper less readable in the posted version [in my opinion!] as it needs to fit the main document within the compulsory six pages, relegated part of the experiments and of the explanations to the Supplementary Information section.)

## projective covariate selection

Posted in Mountains, pictures, Statistics, Travel, University life with tags , , , , , , , , , , , , , , on October 28, 2014 by xi'an

While I was in Warwick, Dan Simpson [newly arrived from Norway on a postdoc position] mentioned to me he had attended a talk by Aki Vehtari in Norway where my early work with Jérôme Dupuis on projective priors was used. He gave me the link to this paper by Peltola, Havulinna, Salomaa and Vehtari that indeed refers to the idea that a prior on a given Euclidean space defines priors by projections on all subspaces, despite the zero measure of all those subspaces. (This notion first appeared in a joint paper with my friend Costas Goutis, who alas died in a diving accident a few months later.) The projection further allowed for a simple expression of the Kullback-Leibler deviance between the corresponding models and for a Pythagorean theorem on the additivity of the deviances between embedded models. The weakest spot of this approach of ours was, in my opinion and unsurprisingly, about deciding when a submodel was too far from the full model. The lack of explanatory power introduced therein had no absolute scale and later discussions led me to think that the bound should depend on the sample size to ensure consistency. (The recent paper by Nott and Leng that was expanding on this projection has now appeared in CSDA.)

“Specifically, the models with subsets of covariates are found by maximizing the similarity of their predictions to this reference as proposed by Dupuis and Robert [12]. Notably, this approach does not require specifying priors for the submodels and one can instead focus on building a good reference model. Dupuis and Robert (2003) suggest choosing the size of the covariate subset based on an acceptable loss of explanatory power compared to the reference model. We examine using cross-validation based estimates of predictive performance as an alternative.” T. Peltola et al.

The paper also connects with the Bayesian Lasso literature, concluding on the horseshoe prior being more informative than the Laplace prior. It applies the selection approach to identify biomarkers with predictive performances in a study of diabetic patients. The authors rank model according to their (log) predictive density at the observed data, using cross-validation to avoid exploiting the data twice. On the MCMC front, the paper implements the NUTS version of HMC with STAN.