**W**hile my arXiv newspage today had a puzzling entry about modelling UFOs sightings in France, it also broadcast our revision of Reliable ABC model choice via random forests, version that we resubmitted today to Bioinformatics after a quite thorough upgrade, the most dramatic one being the realisation we could also approximate the posterior probability of the selected model via another random forest. (With no connection with the recent post on forest fires!) As discussed a little while ago on the ‘Og. And also in conjunction with our creating the abcrf R package for running ABC model choice out of a reference table. While it has been an excruciatingly slow process (the initial version of the arXived document dates from June 2014, the PNAS submission was rejected for not being enough Bayesian, and the latest revision took the whole summer), the slow maturation of our thoughts on the model choice issues led us to modify the role of random forests in the ABC approach to model choice, in that we reverted our earlier assessment that they could only be trusted for selecting the most likely model, by realising this summer the corresponding posterior could be expressed as a posterior loss and estimated by a secondary forest. As first considered in Stoehr et al. (2014). (In retrospect, this brings an answer to one of the earlier referee’s comments.) Next goal is to incorporate those changes in DIYABC (and wait for the next version of the software to appear). Another best-selling innovation due to Arnaud: we added a practical implementation section in the format of FAQ for issues related with the calibration of the algorithms.

## Archive for DIYABC

## ABC model choice via random forests [and no fire]

Posted in Books, pictures, R, Statistics, University life with tags ABC model choice, abcrf, Bayesian model choice, DIYABC, France, model posterior probabilities, PNAS, R, random forests, UFOs on September 4, 2015 by xi'an## ABC model choice via random forests [expanded]

Posted in Statistics, University life with tags 1000 Genomes Project, ABC, ABC model choice, admixture, bootstrap, DIYABC, Human evolution, logistic regression, out-of -Africa scenario, posterior predictive, prior predictive, random forests on October 1, 2014 by xi'an**T**oday, we arXived a second version of our paper on ABC model choice with random forests. Or maybe [A]BC model choice with random forests. Since the random forest is built on a simulation from the prior predictive and no further approximation is used in the process. Except for the computation of the posterior [predictive] error rate. The update wrt the earlier version is that we ran massive simulations throughout the summer, on existing and new datasets. In particular, we have included a Human dataset extracted from the 1000 Genomes Project. Made of 51,250 SNP loci. While this dataset is not used to test new evolution scenarios, we compared six out-of-Africa scenarios, with a possible admixture for Americans of African ancestry. The scenario selected by a random forest procedure posits a single out-of-Africa colonization event with a secondary split into a European and an East Asian population lineages, and a recent genetic admixture between African and European lineages, for Americans of African origin. The procedure reported a high level of confidence since the estimated posterior error rate is equal to zero. The SNP loci were carefully selected using the following criteria: (i) all individuals have a genotype characterized by a quality score (GQ)>10, (ii) polymorphism is present in at least one of the individuals in order to fit the SNP simulation algorithm of Hudson (2002) used in DIYABC V2 (Cornuet et al., 2014), (iii) the minimum distance between two consecutive SNPs is 1 kb in order to minimize linkage disequilibrium between SNP, and (iv) SNP loci showing significant deviation from Hardy-Weinberg equilibrium at a 1% threshold in at least one of the four populations have been removed.

In terms of random forests, we optimised the size of the bootstrap subsamples for all of our datasets. While this optimisation requires extra computing time, it is negligible when compared with the enormous time taken by a logistic regression, which is [yet] the standard ABC model choice approach. Now the data has been gathered, it is only a matter of days before we can send the paper to a journal

## ABC in Cancún

Posted in Books, Kids, R, Statistics, Travel, University life with tags ABC, abc package, abctools package, Bayesian methodology, Cancún, DIYABC, ISBA 2014, Mexico, pygmies on July 11, 2014 by xi'an**H**ere are our slides for the ABC [very] short course Jean-Michel and I give at ISBA 2014 in Cancún next Monday (if your browser can manage Slideshare…) Although I may switch the pictures from Iceland to Mexico, on Sunday, there will be not much change on those slides we both have previously used in previous short courses. (With a few extra slides borrowed from Richard Wilkinson’s tutorial at NIPS 2013!) Jean-Michel will focus his share of the course on software implementations, from R packages like abc and abctools and our population genetics software DIYABC. With an illustration on SNPs data from pygmies populations.

## ABC model choice by random forests

Posted in pictures, R, Statistics, Travel, University life with tags ABC, ABC model choice, arXiv, Asian lady beetle, CART, classification, DIYABC, machine learning, model posterior probabilities, Montpellier, posterior predictive, random forests, SNPs, using the data twice on June 25, 2014 by xi'an**A**fter more than a year of collaboration, meetings, simulations, delays, switches, visits, more delays, more simulations, discussions, and a final marathon wrapping day last Friday, Jean-Michel Marin, Pierre Pudlo, and I at last completed our latest collaboration on ABC, with the central arguments that (a) using random forests is a good tool for choosing the most appropriate model and (b) evaluating the posterior misclassification error rather than the posterior probability of a model is an appropriate paradigm shift. The paper has been co-signed with our population genetics colleagues, Jean-Marie Cornuet and Arnaud Estoup, as they provided helpful advice on the tools and on the genetic illustrations and as they plan to include those new tools in their future analyses and DIYABC software. ABC model choice via random forests is now arXived and very soon to be submitted…

**O**ne scientific reason for this fairly long conception is that it took us several iterations to understand the intrinsic nature of the random forest tool and how it could be most naturally embedded in ABC schemes. We first imagined it as a filter from a set of summary statistics to a subset of significant statistics (hence the automated ABC advertised in some of my past or future talks!), with the additional appeal of an associated distance induced by the forest. However, we later realised that (a) further ABC steps were counterproductive once the model was selected by the random forest and (b) including more summary statistics was always beneficial to the performances of the forest and (c) the connections between (i) the true posterior probability of a model, (ii) the ABC version of this probability, (iii) the random forest version of the above, were at best very loose. The above picture is taken from the paper: it shows how the true and the ABC probabilities (do not) relate in the example of an MA(q) model… We thus had another round of discussions and experiments before deciding the unthinkable, namely to give up the attempts to approximate the posterior probability in this setting and to come up with another assessment of the uncertainty associated with the decision. This led us to propose to compute a posterior predictive error as the error assessment for ABC model choice. This is mostly a classification error but (a) it is based on the ABC posterior distribution rather than on the prior and (b) it does not require extra-computations when compared with other empirical measures such as cross-validation, while avoiding the sin of using the data twice!

## R.I.P. Emile…

Posted in Mountains, pictures, Running, Statistics, Travel, University life, Wines with tags ABC, ANR, DIYABC, EMILE, INLA, Languedoc wines, Mauguio, MCMC, Montpellier, Mortiès, Pic Saint Loup, SNPs on July 5, 2013 by xi'an**I** was thus in Montpellier for a few days, working with Jean-Michel Marin and attending the very final meeting of our ANR research group called Emile… The very same group that introduced us to ABC in 2005. We had a great time, discussing about DIYABC.2, ABC for SNPs, and other extensions with our friend Arnaud Estoup, enjoying an outdoor dinner on the slopes of Pic Saint-Loup and a wine tasting on the way there, listening to ecological modelling this morning from elephant tracking [using INLA] to shell decoration in snails [using massive MCMC], running around Crès lake in the warm rain, and barely escaping the Tour de France on my way to the airport!!!

## checking ABC convergence via coverage

Posted in pictures, Statistics, Travel, University life with tags ABC, arXiv, Bayesian calibration, confidence sets, credible intervals, DIYABC, p-values, uniformity test on January 24, 2013 by xi'an**D**ennis Prangle, Michael Blum, G. Popovic and Scott Sisson just arXived a paper on diagnostics for ABC validation via coverage diagnostics. Getting valid approximation diagnostics for ABC is clearly and badly needed and this was the last slide of my talk yesterday at the Winter Workshop in Gainesville. When simulation time is not an issue (!), our DIYABC software does implement a limited coverage assessment by computing the type I error, i.e. by simulating data under the null model and evaluating the number of time it is rejected at the 5% level (see sections 2.11.3 and 3.8 in the documentation). The current paper builds on a similar perspective.

**T**he idea in the paper is that a (Bayesian) credible interval at a given credible level α should have a similar confidence level (at least asymptotically and even more for matching priors) and that simulating pseudo-data with a known parameter value allows for a Monte-Carlo evaluation of the credible interval “true” coverage, hence for a calibration of the tolerance. The delicate issue is about the generation of those “known” parameters. For instance, if the pair (θ_{,} y) is generated from the joint distribution prior x likelihood, and if the credible region is also based on the true posterior, the average coverage is the nominal one. On the other hand, if the credible interval is based on a poor (ABC) approximation to the posterior, the average coverage should differ from the nominal one. Given that ABC is *always* wrong, however, this may fail to be a powerful diagnostic. In particular, when using *insufficient* (summary) statistics, the discrepancy should make testing for uniformity harder, shouldn’t it? Continue reading