## ABC model choice by random forests [guest post]

Posted in Statistics, University life, pictures, R with tags , , , , , , , , , , on August 11, 2014 by xi'an

[Dennis Prangle sent me his comments on our ABC model choice by random forests paper. Here they are! And I appreciate very much contributors commenting on my paper or others, so please feel free to join.]

This paper proposes a new approach to likelihood-free model choice based on random forest classifiers. These are fit to simulated model/data pairs and then run on the observed data to produce a predicted model. A novel “posterior predictive error rate” is proposed to quantify the degree of uncertainty placed on this prediction. Another interesting use of this is to tune the threshold of the standard ABC rejection approach, which is outperformed by random forests.

The paper has lots of thought-provoking new ideas and was an enjoyable read, as well as giving me the encouragement I needed to read another chapter of the indispensable Elements of Statistical Learning However I’m not fully convinced by the approach yet for a few reasons which are below along with other comments.

Alternative schemes

The paper shows that random forests outperform rejection based ABC. I’d like to see a comparison to more efficient ABC model choice algorithms such as that of Toni et al 2009. Also I’d like to see if the output of random forests could be used as summary statistics within ABC rather than as a separate inference method.

Posterior predictive error rate (PPER)

This is proposed to quantify the performance of a classifier given a particular data set. The PPER is the proportion of times the classifier’s most favoured model is incorrect for simulated model/data pairs drawn from an approximation to the posterior predictive. The approximation is produced by a standard ABC analysis.

Misclassification could be due to (a) a poor classifier or (b) uninformative data, so the PPER aggregrates these two sources of uncertainty. I think it is still very desirable to have an estimate of the uncertainty due to (b) only i.e. a posterior weight estimate. However the PPER is useful. Firstly end users may sometimes only care about the aggregated uncertainty. Secondly relative PPER values for a fixed dataset are a useful measure of uncertainty due to (a), for example in tuning the ABC threshold. Finally, one drawback of the PPER is the dependence on an ABC estimate of the posterior: how robust are the results to the details of how this is obtained?

Classification

This paper illustrates an important link between ABC and machine learning classification methods: model choice can be viewed as a classification problem. There are some other links: some classifiers make good model choice summary statistics (Prangle et al 2014) or good estimates of ABC-MCMC acceptance ratios for parameter inference problems (Pham et al 2014). So the good performance random forests makes them seem a generally useful tool for ABC (indeed they are used in the Pham et al al paper).

## ABC model choice by random forests

Posted in pictures, R, Statistics, Travel, University life with tags , , , , , , , , , , , , , on June 25, 2014 by xi'an

After more than a year of collaboration, meetings, simulations, delays, switches,  visits, more delays, more simulations, discussions, and a final marathon wrapping day last Friday, Jean-Michel Marin, Pierre Pudlo,  and I at last completed our latest collaboration on ABC, with the central arguments that (a) using random forests is a good tool for choosing the most appropriate model and (b) evaluating the posterior misclassification error rather than the posterior probability of a model is an appropriate paradigm shift. The paper has been co-signed with our population genetics colleagues, Jean-Marie Cornuet and Arnaud Estoup, as they provided helpful advice on the tools and on the genetic illustrations and as they plan to include those new tools in their future analyses and DIYABC software.  ABC model choice via random forests is now arXived and very soon to be submitted…

One scientific reason for this fairly long conception is that it took us several iterations to understand the intrinsic nature of the random forest tool and how it could be most naturally embedded in ABC schemes. We first imagined it as a filter from a set of summary statistics to a subset of significant statistics (hence the automated ABC advertised in some of my past or future talks!), with the additional appeal of an associated distance induced by the forest. However, we later realised that (a) further ABC steps were counterproductive once the model was selected by the random forest and (b) including more summary statistics was always beneficial to the performances of the forest and (c) the connections between (i) the true posterior probability of a model, (ii) the ABC version of this probability, (iii) the random forest version of the above, were at best very loose. The above picture is taken from the paper: it shows how the true and the ABC probabilities (do not) relate in the example of an MA(q) model… We thus had another round of discussions and experiments before deciding the unthinkable, namely to give up the attempts to approximate the posterior probability in this setting and to come up with another assessment of the uncertainty associated with the decision. This led us to propose to compute a posterior predictive error as the error assessment for ABC model choice. This is mostly a classification error but (a) it is based on the ABC posterior distribution rather than on the prior and (b) it does not require extra-computations when compared with other empirical measures such as cross-validation, while avoiding the sin of using the data twice!

## posterior predictive p-values

Posted in Books, Statistics, Travel, University life with tags , , , , , , , , , on February 4, 2014 by xi'an

Bayesian Data Analysis advocates in Chapter 6 using posterior predictive checks as a way of evaluating the fit of a potential model to the observed data. There is a no-nonsense feeling to it:

“If the model fits, then replicated data generated under the model should look similar to observed data. To put it another way, the observed data should look plausible under the posterior predictive distribution.”

And it aims at providing an answer to the frustrating (frustrating to me, at least) issue of Bayesian goodness-of-fit tests. There are however issues with the implementation, from deciding on which aspect of the data or of the model is to be examined, to the “use of the data twice” sin. Obviously, this is an exploratory tool with little decisional backup and it should be understood as a qualitative rather than quantitative assessment. As mentioned in my tutorial on Sunday (I wrote this post in Duke during O’Bayes 2013), it reminded me of Ratmann et al.’s ABCμ in that they both give reference distributions against which to calibrate the observed data. Most likely with a multidimensional representation. And the “use of the data twice” can be argued for or against, once a data-dependent loss function is built.

“One might worry about interpreting the significance levels of multiple tests or of tests chosen by inspection of the data (…) We do not make [a multiple test] adjustment, because we use predictive checks to see how particular aspects of the data would be expected to appear in replications. If we examine several test variables, we would not be surprised for some of them not to be fitted by the model-but if we are planning to apply the model, we might be interested in those aspects of the data that do not appear typical.”

The natural objection that having a multivariate measure of discrepancy runs into multiple testing is answered within the book with the reply that the idea is not to run formal tests. I still wonder how one should behave when faced with a vector of posterior predictive p-values (ppp).

The above picture is based on a normal mean/normal prior experiment I ran where the ratio prior-to-sampling variance increases from 100 to 10⁴. The ppp is based on the Bayes factor against a zero mean as a discrepancy. It thus grows away from zero very quickly and then levels up around 0.5, reaching only values close to 1 for very large values of x (i.e. never in practice). I find the graph interesting because if instead of the Bayes factor I use the marginal (numerator of the Bayes factor) then the picture is the exact opposite. Which, I presume, does not make a difference for Bayesian Data Analysis, since both extremes are considered as equally toxic… Still, still, still, we are is the same quandary as when using any kind of p-value: what is extreme? what is significant? Do we have again to select the dreaded 0.05?! To see how things are going, I then simulated the behaviour of the ppp under the “true” model for the pair (θ,x). And ended up with the histograms below:

which shows that under the true model the ppp does concentrate around .5 (surprisingly the range of ppp’s hardly exceeds .5 and I have no explanation for this). While the corresponding ppp does not necessarily pick any wrong model, discrepancies may be spotted by getting away from 0.5…

“The p-value is to the u-value as the posterior interval is to the confidence interval. Just as posterior intervals are not, in general, classical confidence intervals, Bayesian p-values are not generally u-values.”

Now, Bayesian Data Analysis also has this warning about ppp’s being not uniform under the true model (u-values), which is just as well considering the above example, but I cannot help wondering if the authors had intended a sort of subliminal message that they were not that far from uniform. And this brings back to the forefront the difficult interpretation of the numerical value of a ppp. That is, of its calibration. For evaluation of the fit of a model. Or for decision-making…

## JSM [4]

Posted in Books, pictures, Running, Statistics, Travel, University life with tags , , , , , , , on August 3, 2011 by xi'an

A new day at JSM 2011, admittedly not as tense as Monday, but still full. After a long run in the early hours when I took this picture, I started the day with the Controversies in the philosophy of Bayesian statistics with Jim Berger and Andrew Gelman, Rob Kass and Cosma Shalizi being unable to make it. From my point of view it was a fun session, even though I wish I had been more incisive! But I agreed with most of Jim said, so… It is too bad we could not cover his last point about the Bayesian procedures that were not Bayesianly justified (like posterior predictives) as I was quite interested in the potential discussion in this matter (incl. the position of the room on ABC!). Anyway, I am quite thankful to Andrew for setting up this session.As Jum said, we should have those more often, especially when the attendance was large enough to fill a double room at 8:30am.

Incidentally, I managed to have a glaring typo in my slides, pointed out by Susie Bayarri: Bayes theorem was written as

$\pi(\theta) \propto \pi(\theta) f(x|\theta)$

Aie, aie, aie! Short of better scapegoats, I will blame the AF plane for this… (This was a good way to start a controversy, however no one raised to the bait!) A more serious question reminded me of the debate surrounding A Search for Certainty: It was whether frequentist and subjective Bayes approaches had more justifications than the objective Bayes approach, in the light of von Mises‘ and personalistic (read, de Finetti) interpretations of probability.

While there were many possible alternatives for the next session, I went to attend Sylvia Richardson’s Medallion Lecture. This made sense on many levels, the primary one being that Sylvia and I worked and are working on rather close topics, from mixtures of distributions, to variable selection, to ABC. So I was looking forward the global picture she would provide on those topics. I particularly enjoyed the way she linked mixtures with more general modelling structures, through extensions in the distribution of the latent variables. (This is also why I am attending Chris Holmes’ Memorial Lecture tomorrow, with the exciting title of Loss, Actions, Decisions: Bayesian Analysis in High-Throughput Genomics.)

In the afternoon, I only attended one talk by David Nott, Efficient MCMC Schemes for Computationally Expensive Posterior Distribution, which involved hybrid Monte Carlo on complex likelihoods. This was quite interesting, as hybrid Monte Carlo is indeed the solution to diminish the number of likelihood evaluations, since it moves along iso-density slices… After this, we went working on ABC model choice with Jean-Michel Marin and Natesh Pillai. Before joining the fun at the Section for Bayesian statistical mixer, where the Savage and Mitchell and student awards were presented. This was the opportunity to see friends, meet new Bayesians, and congratulate the winners, including Julien Cornebise and Robin Ryder of course.

## About Fig. 4 of Fagundes et al. (2007)

Posted in R, Statistics, University life with tags , , , , , , , , on July 13, 2011 by xi'an

Yesterday, we had a meeting of our EMILE network on statistics for population genetics (in Montpellier) and we were discussing our respective recent advances in ABC model choice. One of our colleagues mentioned the constant request (from referees) to include the post-ABC processing devised by Fagundes et al. in their 2007 ABC paper. (This paper contains a wealth of statistical innovations, but I only focus here on this post-checking device.)

The method centres around the above figure, with the attached caption

Fig. 4. Empirical distributions of the estimated relative probabilities of the AFREG model when the AFREG (solid line), MREBIG (dashed line), and ASEG (dotted line) models are the true models. Here, we simulated 1,000 data sets under the AFREG, MREBIG, and ASEG models by drawing random parameter values from the priors. The density estimates of the three models at the AFREG posterior probability = 0.781 (vertical line) were used to compute the probability that AFREG is the correct model given our observation that PAFREG = 0.781. This probability is equal to 0.817.

which aims at computing a p-value based on the ABC estimate of the posterior probability of a model.

I am somehow uncertain about the added value of this computation and about the paradox of the sentence “the probability that AFREG is the correct model [given] the AFREG posterior probability (..) is equal to 0.817″… If I understand correctly the approach followed by Fagundes et al., they simulate samples from the joint distribution over parameter and (pseudo-)data conditional on each model, then approximate the density of the [ABC estimated] posterior probabilities of the AFREG model by a non parametric density estimate, presumably density(), which means in Bayesian terms the marginal likelihoods (or evidences) of the posterior probability of  the AFREG model under each of the models under comparison. The “probability that AFREG is the correct model given our observation that PAFREG = 0.781″ is then completely correct in the sense that it is truly a posterior probability for this model based on the sole observation of the transform (or statistic) of the data x equal to PAFREG(x). However, if we only look at the Bayesian perspective and do not consider the computational aspects, there is no rationale in moving from the data (or from the summary statistics) to a single statistic equal to PAFREG(x), as this induces a loss of information. (Furthermore, it seems to me that the answer is not invariant against the choice of the model whose posterior probability is computed, if more than two models are compared. In other words, the posterior probability of the AFREG model given the sole observation of PAFREG(x). is not necessarily the same as the posterior probability of the AFREG model given the sole observation of PASEG(x)…) Although this is not at all advised by the paper, it seems to me that some users of this processing opt instead for simulations of the parameter taken from the ABC posterior, which amounts to using the “data twice“, i.e. the squared likelihood instead of the likelihood…  So, while the procedure is formally correct (despite Templeton’s arguments against it), it has no added value. Obviously, one could alternatively argue that the computational precision in approximating the marginal likelihoods is higher with the (non-parametric) solution based on PAFREG(x) than the (ABC) solution based on x, but this is yet to be demonstrated (and weighted against the information loss).

Just as a side remark on the polychotomous logistic regression approximation to the posterior probabilities introduced in Fagundes et al.: the idea is quite enticing, as a statistical regularisation of ABC simulations. It could be exploited further by using a standard model selection strategy in order to pick the summary statistics that are truly contributed to explain the model index.

## Bayesian model selection

Posted in Books, R, Statistics with tags , , , , , , , , , on December 8, 2010 by xi'an

Last week, I received a box of books from the International Statistical Review, for reviewing them. I thus grabbed the one whose title was most appealing to me, namely Bayesian Model Selection and Statistical Modeling by Tomohiro Ando. I am indeed interested in both the nature of testing hypotheses or more accurately of assessing models, as discussed in both my talk at the Seminar of philosophy of mathematics at Université Paris Diderot a few days ago and the post on Murray Aitkin’s alternative, and the computational aspects of the resulting Bayesian procedures, including evidence, the Savage-Dickey paradox, nested sampling, harmonic mean estimators, and more…

After reading through the book, I am alas rather disappointed. What I consider to be innovative or at least “novel” parts with comparison with existing books (like Chen, Shao and Ibrahim, 2000, which remains a reference on this topic) is based on papers written by the author over the past five years and it is mostly a sort of asymptotic Bayes analysis that I do not see as particularly Bayesian, because involving the “true” distribution of the data. The coverage of the existing literature on Bayesian model choice is often incomplete and sometimes misses the point, as discussed below. This is especially true for the computational aspects that are generally mistreated or at least not treated in a way from which a newcomer to the field would benefit. The author often takes complex econometric examples for illustration, which is nice; however, he does not pursue the details far enough for the reader to be able to replicate the study without further reading. (An example is given by the coverage of stochastic volatility in Section 4.5.1, pages 83-84.) The few exercises at the end of each chapter are rather unhelpful, often sounding rather like notes than true problems (an extreme case is Exercise 6 pages 196-197 which introduces the Metropolis-Hastings algorithm within the exercise (although it has already been defined on pages 66-67) and then asks to derive the marginal likelihood estimator. Another such exercise on page 164-165 introduces the theory of DNA microarrays and gene expression in ten lines (which are later repeated verbatim on page 227), then asks to identify marker genes responsible for a certain trait.) The overall feeling after reading this book is thus that the contribution to the field of Bayesian Model Selection and Statistical Modeling is too limited and disorganised for the book to be recommended as “helping you choose the right Bayesian model” (backcover).