## ABC model choice via random forests [and no fire]

Posted in Books, pictures, R, Statistics, University life with tags , , , , , , , , , on September 4, 2015 by xi'an

While my arXiv newspage today had a puzzling entry about modelling UFOs sightings in France, it also broadcast our revision of Reliable ABC model choice via random forests, version that we resubmitted today to Bioinformatics after a quite thorough upgrade, the most dramatic one being the realisation we could also approximate the posterior probability of the selected model via another random forest. (With no connection with the recent post on forest fires!) As discussed a little while ago on the ‘Og. And also in conjunction with our creating the abcrf R package for running ABC model choice out of a reference table. While it has been an excruciatingly slow process (the initial version of the arXived document dates from June 2014, the PNAS submission was rejected for not being enough Bayesian, and the latest revision took the whole summer), the slow maturation of our thoughts on the model choice issues led us to modify the role of random forests in the ABC approach to model choice, in that we reverted our earlier assessment that they could only be trusted for selecting the most likely model, by realising this summer the corresponding posterior could be expressed as a posterior loss and estimated by a secondary forest. As first considered in Stoehr et al. (2014). (In retrospect, this brings an answer to one of the earlier referee’s comments.) Next goal is to incorporate those changes in DIYABC (and wait for the next version of the software to appear). Another best-selling innovation due to Arnaud: we added a practical implementation section in the format of FAQ for issues related with the calibration of the algorithms.

## SPA 2015 Oxford

Posted in pictures, Statistics, Travel, University life with tags , , , , , , , , , , , , , , on July 14, 2015 by xi'an

Today I gave a talk on Approximate Bayesian model choice via random forests at the yearly SPA (Stochastic Processes and their Applications) 2015 conference, taking place in Oxford (a nice town near Warwick) this year. In Keble College more precisely. The slides are below and while they are mostly repetitions of earlier slides, there is a not inconsequential novelty in the presentation, namely that I included our most recent and current perspective on ABC model choice. Indeed, when travelling to Montpellier two weeks ago, we realised that there was a way to solve our posterior probability conundrum!

Despite the heat wave that rolled all over France that week, we indeed figured out a way to estimate the posterior probability of the selected (MAP) model, way that we had deemed beyond our reach in previous versions of the talk and of the paper. The fact that we could not provide an estimate of this posterior probability and had to rely instead on a posterior expected loss was one of the arguments used by the PNAS reviewers in rejecting the paper. While the posterior expected loss remains a quantity worth approximating and reporting, the idea that stemmed from meeting together in Montpellier is that (i) the posterior probability of the MAP is actually related to another posterior loss, when conditioning on the observed summary statistics and (ii) this loss can be itself estimated via a random forest, since it is another function of the summary statistics. A posteriori, this sounds trivial but we had to have a new look at the problem to realise that using ABC samples was not the only way to produce an estimate of the posterior probability! (We are now working on the revision of the paper for resubmission within a few week… Hopefully before JSM!)

## the maths of Jeffreys-Lindley paradox

Posted in Books, Kids, Statistics with tags , , , , , , , on March 26, 2015 by xi'an

Cristiano Villa and Stephen Walker arXived on last Friday a paper entitled On the mathematics of the Jeffreys-Lindley paradox. Following the philosophical papers of last year, by Ari Spanos, Jan Sprenger, Guillaume Rochefort-Maranda, and myself, this provides a more statistical view on the paradox. Or “paradox”… Even though I strongly disagree with the conclusion, namely that a finite (prior) variance σ² should be used in the Gaussian prior. And fall back on classical Type I and Type II errors. So, in that sense, the authors avoid the Jeffreys-Lindley paradox altogether!

The argument against considering a limiting value for the posterior probability is that it converges to 0, 21, or an intermediate value. In the first two cases it is useless. In the medium case. achieved when the prior probability of the null and alternative hypotheses depend on variance σ². While I do not want to argue in favour of my 1993 solution

$\rho(\sigma) = 1\big/ 1+\sqrt{2\pi}\sigma$

since it is ill-defined in measure theoretic terms, I do not buy the coherence argument that, since this prior probability converges to zero when σ² goes to infinity, the posterior probability should also go to zero. In the limit, probabilistic reasoning fails since the prior under the alternative is a measure not a probability distribution… We should thus abstain from over-interpreting improper priors. (A sin sometimes committed by Jeffreys himself in his book!)

## the demise of the Bayes factor

Posted in Books, Kids, Statistics, Travel, University life with tags , , , , , , , , on December 8, 2014 by xi'an

With Kaniav Kamary, Kerrie Mengersen, and Judith Rousseau, we have just arXived (and submitted) a paper entitled “Testing hypotheses via a mixture model”. (We actually presented some earlier version of this work in Cancũn, Vienna, and Gainesville, so you may have heard of it already.) The notion we advocate in this paper is to replace the posterior probability of a model or an hypothesis with the posterior distribution of the weights of a mixture of the models under comparison. That is, given two models under comparison,

$\mathfrak{M}_1:x\sim f_1(x|\theta_1) \text{ versus } \mathfrak{M}_2:x\sim f_2(x|\theta_2)$

we propose to estimate the (artificial) mixture model

$\mathfrak{M}_{\alpha}:x\sim\alpha f_1(x|\theta_1) + (1-\alpha) f_2(x|\theta_2)$

and in particular derive the posterior distribution of α. One may object that the mixture model is neither of the two models under comparison but this is the case at the boundary, i.e., when α=0,1. Thus, if we use prior distributions on α that favour the neighbourhoods of 0 and 1, we should be able to see the posterior concentrate near 0 or 1, depending on which model is true. And indeed this is the case: for any given Beta prior on α, we observe a higher and higher concentration at the right boundary as the sample size increases. And establish a convergence result to this effect. Furthermore, the mixture approach offers numerous advantages, among which [verbatim from the paper]:

## reliable ABC model choice via random forests

Posted in pictures, R, Statistics, University life with tags , , , , , , , on October 29, 2014 by xi'an

After a somewhat prolonged labour (!), we have at last completed our paper on ABC model choice with random forests and submitted it to PNAS for possible publication. While the paper is entirely methodological, the primary domain of application of ABC model choice methods remains population genetics and the diffusion of this new methodology to the users is thus more likely via a media like PNAS than via a machine learning or statistics journal.

When compared with our recent update of the arXived paper, there is not much different in contents, as it is mostly an issue of fitting the PNAS publication canons. (Which makes the paper less readable in the posted version [in my opinion!] as it needs to fit the main document within the compulsory six pages, relegated part of the experiments and of the explanations to the Supplementary Information section.)