## ABC model choice by random forests

**A**fter more than a year of collaboration, meetings, simulations, delays, switches, visits, more delays, more simulations, discussions, and a final marathon wrapping day last Friday, Jean-Michel Marin, Pierre Pudlo, and I at last completed our latest collaboration on ABC, with the central arguments that (a) random forests are a good tool for choosing the most appropriate model and (b) evaluating the posterior misclassification error rather than the posterior probability of a model is an appropriate paradigm shift. The paper has been co-signed with our population genetics colleagues, Jean-Marie Cornuet and Arnaud Estoup, as they provided helpful advice on the tools and on the genetic illustrations, and as they plan to include those new tools in their future analyses and in the DIYABC software. ABC model choice via random forests is now arXived and very soon to be submitted…

**O**ne scientific reason for this fairly long conception is that it took us several iterations to understand the intrinsic nature of the random forest tool and how it could be most naturally embedded in ABC schemes. We first imagined it as a filter from a set of summary statistics to a subset of significant statistics (hence the automated ABC advertised in some of my past or future talks!), with the additional appeal of an associated distance induced by the forest. However, we later realised that (a) further ABC steps were counterproductive once the model was selected by the random forest, (b) including more summary statistics was always beneficial to the performance of the forest, and (c) the connections between (i) the true posterior probability of a model, (ii) the ABC version of this probability, and (iii) the random forest version of the above were at best very loose. The above picture is taken from the paper: it shows how the true and the ABC probabilities (do not) relate in the example of an MA(q) model… We thus had another round of discussions and experiments before deciding the unthinkable, namely to give up the attempts to approximate the posterior probability in this setting and to come up with another assessment of the uncertainty associated with the decision. This led us to propose computing a posterior predictive error as the error assessment for ABC model choice. This is mostly a classification error, but (a) it is based on the ABC posterior distribution rather than on the prior and (b) it does not require extra computations when compared with other empirical measures such as cross-validation, while avoiding the sin of using the data twice!

September 22, 2014 at 12:07 am

Hello Xi and congrats for this paper! I hope it will have a strong impact.

I am neither a statistician nor a mathematician, just a biologist interested in ABC for practical issues, and I feel like I have been doing it wrong until now…

I am convinced that random forests are more relevant and less arbitrary than previous approaches for model choice.

However, I have a stupid question… Once I have all of my simulations, meaning a huge array of (iterations) x (statistics), how can I apply the good practices you recommend to it?

Sorry for the very pragmatic question.

All the best, and congrats again for your contribution to ABC.

T.

September 22, 2014 at 6:19 pm

Thanks T. Once you have your learning or reference table made of model index x parameter x statistics, you can call an off-the-shelf random forest algorithm to explain the model index by the statistics. Once the algorithm has built the forest, you can (a) enter the observed statistics, which will return a most frequent model index, which will be the estimated model, and (b) derive the simulated statistics closest to the observed ones, i.e., those most often in the same leaf as the observed statistics across the trees of the forest, which will give you an ABC posterior sample based on the random forest. I hope this helps. X
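For readers who prefer to see the two steps (a) and (b) spelled out, here is a minimal sketch in plain Python. It is not the code behind the paper: the two toy models, the three summary statistics, the observed values, and every function name (`simulate_reference_table`, `grow_forest`, `find_leaf`, …) are assumptions made up for the illustration, and the hand-rolled forest is only a dependency-free stand-in for an off-the-shelf implementation.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of model indices."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def grow_tree(idx, stats, models, rng, mtry, min_size=5):
    """Grow one classification tree on the (bootstrapped) rows `idx`."""
    labels = [models[i] for i in idx]
    leaf = {"rows": sorted(set(idx)), "vote": Counter(labels).most_common(1)[0][0]}
    if len(idx) <= min_size or len(set(labels)) == 1:
        return leaf
    best = None
    for j in rng.sample(range(len(stats[0])), mtry):  # random subset of statistics
        values = sorted({stats[i][j] for i in idx})
        for cut in values[:-1]:  # both sides of each cut are non-empty
            left = [i for i in idx if stats[i][j] <= cut]
            right = [i for i in idx if stats[i][j] > cut]
            score = (len(left) * gini([models[i] for i in left])
                     + len(right) * gini([models[i] for i in right]))
            if best is None or score < best[0]:
                best = (score, j, cut, left, right)
    if best is None:  # every candidate statistic is constant on this node
        return leaf
    _, j, cut, left, right = best
    return {"feature": j, "cut": cut,
            "left": grow_tree(left, stats, models, rng, mtry, min_size),
            "right": grow_tree(right, stats, models, rng, mtry, min_size)}

def grow_forest(stats, models, n_trees=25, seed=0):
    """Bagged trees, each split looking at about sqrt(p) of the p statistics."""
    rng = random.Random(seed)
    n, mtry = len(stats), max(1, round(len(stats[0]) ** 0.5))
    return [grow_tree([rng.randrange(n) for _ in range(n)], stats, models, rng, mtry)
            for _ in range(n_trees)]

def find_leaf(tree, x):
    """Drop a vector of statistics down one tree and return the leaf it reaches."""
    while "feature" in tree:
        tree = tree["left"] if x[tree["feature"]] <= tree["cut"] else tree["right"]
    return tree

def simulate_reference_table(n, rng):
    """Toy reference table of (model index, parameter, statistics) rows.
    ASSUMPTION: both models and all three statistics are invented for this sketch."""
    table = []
    for _ in range(n):
        m = rng.randrange(2)                      # model index drawn from its prior
        theta = rng.uniform(-1.0, 1.0)            # parameter drawn from its prior
        centre = theta + (6.0 if m == 1 else 0.0)
        s = [rng.gauss(centre, 0.5), rng.gauss(0, 1), rng.gauss(0, 1)]
        table.append((m, theta, s))
    return table

table = simulate_reference_table(200, random.Random(1))
models = [row[0] for row in table]
stats = [row[2] for row in table]
forest = grow_forest(stats, models)

observed = [6.2, 0.1, -0.3]                       # hypothetical observed statistics
leaves = [find_leaf(tree, observed) for tree in forest]

# (a) the estimated model is the most frequent vote across the trees
votes = Counter(leaf["vote"] for leaf in leaves)
estimated_model = votes.most_common(1)[0][0]

# (b) reference rows that most often share a leaf with the observed statistics
shared = Counter(i for leaf in leaves for i in leaf["rows"])
closest = [i for i, _ in shared.most_common(20)]
posterior_sample = [table[i][1] for i in closest if models[i] == estimated_model]
```

With a real reference table you would replace `simulate_reference_table` by your own (iterations) x (statistics) array plus the column of model indices, and most likely swap the toy forest for an existing implementation such as the `randomForest` package in R or `RandomForestClassifier` in scikit-learn. The cutoff of 20 closest rows is arbitrary here.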

June 25, 2014 at 5:17 am

From the abstract: “We thus strongly alters (sic) how Bayesian model selection is both understood and operated, since we advocate completely abandoning the use of posterior probabilities of the models under comparison as evidence tools.”

Does this mean you have now come full circle, Christian? :-)

June 25, 2014 at 8:47 am

Full circle??? Not really, as I started from decision theory with strong Bayesian inclinations. This move is induced by the realisation that we cannot currently produce a trustworthy approximation to posterior probabilities, while a machine-learning tool can rank models with (a) some degree of efficiency and (b) still a Bayesian flavour. Plus, I am having more and more qualms about using posterior probabilities per se. Thanks for pointing out the typo. And you are welcome to write a guest post if you feel like it. Best wishes for ABC in Sydney!

June 25, 2014 at 2:41 am

Great, I look forward to reading this (once I’ve finished the delayed acceptance/prefetching paper!) By the way the paper links above point to http://arxiv.org/list/stat/new and I found the paper at http://arxiv.org/abs/1406.6288 instead.

June 25, 2014 at 8:49 am

Thanks Dennis. Explanation: I wrote this post before the paper was arXived. And the arXiv team no longer provides the arXiv number until acceptance. Hence I put the link to the new postings… Feel free to write a guest post, by the way!!