## ABC model choice by random forests [guest post]

Posted in pictures, R, Statistics, University life on August 11, 2014 by xi'an

[Dennis Prangle sent me his comments on our ABC model choice by random forests paper. Here they are! And I appreciate very much contributors commenting on my paper or others, so please feel free to join.]

This paper proposes a new approach to likelihood-free model choice based on random forest classifiers. These are fit to simulated model/data pairs and then run on the observed data to produce a predicted model. A novel “posterior predictive error rate” is proposed to quantify the degree of uncertainty placed on this prediction. Another interesting use of this is to tune the threshold of the standard ABC rejection approach, which is outperformed by random forests.
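The "simulate model/data pairs, fit a classifier, predict the observed data" pipeline can be sketched on a toy problem. Everything below is an assumption made for illustration and is not taken from the paper: the two models (Gaussian versus Laplace noise), the uniform prior on the scale, the two summary statistics, and all function names. A k-nearest-neighbour vote, which is essentially the standard ABC rejection scheme, stands in for the random forest so that the sketch needs only the Python standard library; a random forest would replace `predict_model`.

```python
import math
import random
import statistics

def simulate(model, sigma, n, rng):
    """n draws with standard deviation sigma: Gaussian (model 0) or Laplace (model 1)."""
    if model == 0:
        return [rng.gauss(0.0, sigma) for _ in range(n)]
    b = sigma / math.sqrt(2.0)  # Laplace scale giving standard deviation sigma
    # the difference of two independent Exp(1) draws is a standard Laplace draw
    return [b * (rng.expovariate(1.0) - rng.expovariate(1.0)) for _ in range(n)]

def summaries(x):
    """Two summaries: log standard deviation and log median absolute deviation."""
    med = statistics.median(x)
    mad = statistics.median([abs(v - med) for v in x])
    return (math.log(statistics.stdev(x)), math.log(mad))

def reference_table(size, n, rng):
    """Simulated (model, summaries) pairs, with model and sigma drawn from the prior."""
    table = []
    for _ in range(size):
        model = rng.randrange(2)       # uniform prior on the two models
        sigma = rng.uniform(0.5, 2.0)  # assumed prior on the scale
        table.append((model, summaries(simulate(model, sigma, n, rng))))
    return table

def predict_model(table, s_obs, k):
    """Majority vote among the k simulations closest to the observed summaries."""
    nearest = sorted(table, key=lambda row: math.dist(row[1], s_obs))[:k]
    votes_for_1 = sum(model for model, _ in nearest)
    return 1 if 2 * votes_for_1 > k else 0

rng = random.Random(2014)
table = reference_table(2000, 100, rng)
x_obs = simulate(1, 1.3, 100, rng)  # pretend observed data (Laplace, sigma = 1.3)
print("predicted model:", predict_model(table, summaries(x_obs), k=50))
```

The number of neighbours `k` plays the role of the ABC tolerance, which is the quantity the paper suggests tuning through the error rate.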

The paper has lots of thought-provoking new ideas and was an enjoyable read, as well as giving me the encouragement I needed to read another chapter of the indispensable Elements of Statistical Learning. However, I’m not fully convinced by the approach yet, for a few reasons which are listed below along with other comments.

Alternative schemes

The paper shows that random forests outperform rejection-based ABC. I’d like to see a comparison to more efficient ABC model choice algorithms such as that of Toni et al 2009. I’d also like to see whether the output of random forests could be used as summary statistics within ABC rather than as a separate inference method.

Posterior predictive error rate (PPER)

This is proposed to quantify the performance of a classifier given a particular data set. The PPER is the proportion of times the classifier’s most favoured model is incorrect for simulated model/data pairs drawn from an approximation to the posterior predictive. The approximation is produced by a standard ABC analysis.

Misclassification could be due to (a) a poor classifier or (b) uninformative data, so the PPER aggregates these two sources of uncertainty. I think it is still very desirable to have an estimate of the uncertainty due to (b) only, i.e. a posterior weight estimate. However the PPER is useful. Firstly, end users may sometimes only care about the aggregated uncertainty. Secondly, relative PPER values for a fixed dataset are a useful measure of uncertainty due to (a), for example in tuning the ABC threshold. Finally, one drawback of the PPER is the dependence on an ABC estimate of the posterior: how robust are the results to the details of how this is obtained?
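One naive way to approximate such an error rate is sketched below, under loudly stated assumptions: the toy models (Gaussian versus Laplace noise with a uniform prior on the scale), the summaries, the nearest-neighbour classifier and every function name are hypothetical choices for illustration, and the paper's own estimator may be computed differently. The ABC step keeps the `k` reference simulations closest to the observed summaries; each accepted (model, parameter) pair then produces a fresh dataset, and the PPER is the share of these datasets that the classifier assigns to the wrong model.

```python
import math
import random
import statistics

def draw(model, sigma, n, rng):
    """n draws with standard deviation sigma: Gaussian (model 0) or Laplace (model 1)."""
    if model == 0:
        return [rng.gauss(0.0, sigma) for _ in range(n)]
    b = sigma / math.sqrt(2.0)
    return [b * (rng.expovariate(1.0) - rng.expovariate(1.0)) for _ in range(n)]

def summaries(x):
    """Log standard deviation and log median absolute deviation."""
    med = statistics.median(x)
    mad = statistics.median([abs(v - med) for v in x])
    return (math.log(statistics.stdev(x)), math.log(mad))

def build_table(size, n, rng):
    """Reference table of (model, sigma, summaries) rows drawn from the prior."""
    rows = []
    for _ in range(size):
        model, sigma = rng.randrange(2), rng.uniform(0.5, 2.0)  # assumed priors
        rows.append((model, sigma, summaries(draw(model, sigma, n, rng))))
    return rows

def nearest(table, s, k):
    """The k rows whose summaries are closest to s (ABC rejection step)."""
    return sorted(table, key=lambda row: math.dist(row[2], s))[:k]

def classify(table, s, k):
    """Stand-in classifier: majority vote among the k nearest rows."""
    votes_for_1 = sum(row[0] for row in nearest(table, s, k))
    return 1 if 2 * votes_for_1 > k else 0

def pper(table, s_obs, n, k, rng):
    """Share of posterior predictive datasets assigned to the wrong model."""
    accepted = nearest(table, s_obs, k)  # approximate posterior draws of (model, sigma)
    errors = 0
    for model, sigma, _ in accepted:
        s_new = summaries(draw(model, sigma, n, rng))  # posterior predictive dataset
        errors += classify(table, s_new, k) != model
    return errors / len(accepted)

rng = random.Random(11)
table = build_table(2000, 100, rng)
s_obs = summaries(draw(0, 1.0, 100, rng))  # pretend observed data
print("PPER estimate:", pper(table, s_obs, n=100, k=50, rng=rng))
```

Because the same `accepted` set feeds both the posterior approximation and the error estimate, the result inherits the sensitivity to the ABC tolerance raised in the last question above.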

Classification

This paper illustrates an important link between ABC and machine learning classification methods: model choice can be viewed as a classification problem. There are some other links: some classifiers make good model choice summary statistics (Prangle et al 2014) or good estimates of ABC-MCMC acceptance ratios for parameter inference problems (Pham et al 2014). So the good performance of random forests makes them seem a generally useful tool for ABC (indeed they are used in the Pham et al paper).

## Bangalore workshop [ಬೆಂಗಳೂರು ಕಾರ್ಯಾಗಾರ]

Posted in pictures, Running, Statistics, Travel, University life, Wines on July 30, 2014 by xi'an

First day at the Indo-French Centre for Applied Mathematics and the get-together (or speed-dating!) workshop. The campus of the Indian Institute of Science of Bangalore where we all stay is very pleasant, with plenty of greenery in the middle of a very busy city. Plus, being at about 1000m means the temperature remains tolerable for me, to the point of letting me run in the morning. Plus, staying in a guest house on the campus also means genuine and enjoyable south Indian food.

The workshop is a mix of statisticians and of mathematicians of neurosciences, from both India and France, and we are few enough to have a lot of opportunities for discussion and potential joint projects. I gave the first talk this morning (hence a fairly short run!) on ABC model choice with random forests and, given the mixed audience, may have launched too quickly into the technicalities of the forests. Even though I think I kept the statisticians on-board for most of the talk. While the mathematical biology talks mostly went over my head (esp. when I could not resist dozing!), I enjoyed the presentation of Francis Bach of a fast stochastic gradient algorithm, where the stochastic average is only updated one term at a time, for apparently much faster convergence results. This is related to a joint work with Éric Moulines that both Éric and Francis presented in the past month. And makes me wonder at the intuition behind the major speed-up. Shrinkage to the mean maybe?

## posterior predictive checks for admixture models

Posted in pictures, Statistics, Travel, University life on July 8, 2014 by xi'an

In a posting coincidence, just a few days after we arXived our paper on ABC model choice with random forests, where we use posterior predictive errors for assessing the variability of the random forest procedure, David Mimno, David Blei, and Barbara Engelhardt arXived a paper on posterior predictive checks to quantify lack-of-fit in admixture models of latent population structure, which deals with similar data and models, while also using the posterior predictive as a central tool. (Marginalia: the paper is a wee bit difficult to read [esp. with France-Germany playing in the airport bar!] as the modelling is only clearly described at the very end. I suspect this arXived version was put together out of a submission to a journal like Nature or PNAS, with mentions of a Methods section that does not appear here and of Supplementary Material that turned into subsections of the Discussion section.)

The datasets are genomic datasets made of SNPs (single nucleotide polymorphisms). For instance, the first (HapMap) dataset corresponds to 1,043 individuals and 468,167 SNPs. The model is simpler than Kingman’s coalescent, hence its likelihood does not require ABC steps to run inference. The admixture model in the paper is essentially a mixture model over ancestry indices, with individual-dependent weights and Bernoulli observations, hence resulting in a completed likelihood of the form

$\prod_{i=1}^n\prod_{\ell=1}^L\prod_j \phi_{\ell,z_{i,\ell,j}}^{x_{i,\ell,j}}(1-\phi_{\ell,z_{i,\ell,j}})^{1-x_{i,\ell,j}}\theta_{i,z_{i,\ell,j}}$

(which looks more formidable than it truly is!). Regular Bayesian inference is thus possible in this setting, implementing e.g. Gibbs sampling. The authors chose instead to rely on EM and thus derived the maximum likelihood estimators of the (many) parameters of the admixture. And of the latent variables z. Their posterior predictive check is based on the simulation of pseudo-observations (as in ABC!) from the above likelihood, with parameters and latent variables replaced with their EM estimates (unlike ABC). There is obviously some computational reason for doing this instead of simulating from the posterior, albeit implicit in the paper. I am however slightly puzzled by the conditioning on the latent variable estimate $\hat{z}$, as its simulation is straightforward and as a latent variable is more a missing observation than a parameter. Given those 30 to 100 replications of the data, an empirical distribution of a discrepancy function is used to assess whether or not the equivalent discrepancy for the observation is an outlier. If so, the model is not appropriate for the data. (Interestingly, the discrepancy is measured via the Bayes factor of z-scores.)
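The plug-in predictive check described above can be sketched on a deliberately simplified problem. The assumptions are mine and not the paper's: a single-population Bernoulli model replaces the admixture model, per-locus frequencies estimated by column means replace the EM estimates, and the discrepancy (variance across individuals of the allele count) is a placeholder chosen for illustration. The logic is the same: simulate replicates from the fitted model, compute the discrepancy on each, and check whether the observed discrepancy falls in the tail.

```python
import random
import statistics

def discrepancy(x):
    """Placeholder discrepancy: variance across individuals of their allele counts."""
    return statistics.pvariance([sum(row) for row in x])

def fit_frequencies(x):
    """Plug-in estimate: per-locus allele frequency under a single-population model."""
    n = len(x)
    return [sum(row[l] for row in x) / n for l in range(len(x[0]))]

def replicate(phi, n, rng):
    """Simulate n individuals with independent Bernoulli(phi[l]) alleles."""
    return [[int(rng.random() < p) for p in phi] for _ in range(n)]

def predictive_check(x, reps, rng):
    """Observed discrepancy, replicated discrepancies, and upper-tail proportion."""
    phi = fit_frequencies(x)
    d_obs = discrepancy(x)
    d_rep = [discrepancy(replicate(phi, len(x), rng)) for _ in range(reps)]
    tail = sum(d >= d_obs for d in d_rep) / reps
    return d_obs, d_rep, tail

# Pretend data: two sub-populations with very different allele frequencies,
# so the single-population model should be flagged as inappropriate.
rng = random.Random(0)
loci = 50
x_obs = ([[int(rng.random() < 0.2) for _ in range(loci)] for _ in range(20)]
         + [[int(rng.random() < 0.8) for _ in range(loci)] for _ in range(20)])
d_obs, d_rep, tail = predictive_check(x_obs, reps=50, rng=rng)
print("observed:", d_obs, "largest replicate:", max(d_rep), "tail proportion:", tail)
```

A tail proportion near zero marks the observed discrepancy as an outlier relative to the replicates, which is the signal used to declare lack of fit.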

The connection with our own work is that the construction of discrepancy measures proposed in this paper could be added to our already large collection of summary statistics to check their potential impact on model comparison, i.e. for a significant contribution to the random forest nodes. Conversely, the most significant summary statistics could then be tested as discrepancy measures. Or, more in tune with our Series B paper on the proper selection of summary variables, the distribution of those discrepancy measures could be compared across potential models. Assuming this does not take too much computing power…

## ABC model choice by random forests

Posted in pictures, R, Statistics, Travel, University life on June 25, 2014 by xi'an

After more than a year of collaboration, meetings, simulations, delays, switches, visits, more delays, more simulations, discussions, and a final marathon wrapping day last Friday, Jean-Michel Marin, Pierre Pudlo, and I at last completed our latest collaboration on ABC, with the central arguments that (a) using random forests is a good tool for choosing the most appropriate model and (b) evaluating the posterior misclassification error rather than the posterior probability of a model is an appropriate paradigm shift. The paper has been co-signed with our population genetics colleagues, Jean-Marie Cornuet and Arnaud Estoup, as they provided helpful advice on the tools and on the genetic illustrations and as they plan to include those new tools in their future analyses and DIYABC software. ABC model choice via random forests is now arXived and very soon to be submitted…

One scientific reason for this fairly long conception is that it took us several iterations to understand the intrinsic nature of the random forest tool and how it could be most naturally embedded in ABC schemes. We first imagined it as a filter from a set of summary statistics to a subset of significant statistics (hence the automated ABC advertised in some of my past or future talks!), with the additional appeal of an associated distance induced by the forest. However, we later realised that (a) further ABC steps were counterproductive once the model was selected by the random forest, (b) including more summary statistics was always beneficial to the performance of the forest, and (c) the connections between (i) the true posterior probability of a model, (ii) the ABC version of this probability, and (iii) the random forest version of the above, were at best very loose. The above picture is taken from the paper: it shows how the true and the ABC probabilities (do not) relate in the example of an MA(q) model… We thus had another round of discussions and experiments before deciding on the unthinkable, namely to give up the attempts to approximate the posterior probability in this setting and to come up with another assessment of the uncertainty associated with the decision. This led us to propose to compute a posterior predictive error as the error assessment for ABC model choice. This is mostly a classification error but (a) it is based on the ABC posterior distribution rather than on the prior and (b) it does not require extra computations when compared with other empirical measures such as cross-validation, while avoiding the sin of using the data twice!

## model selection by likelihood-free Bayesian methods

Posted in Books, pictures, Running, Statistics, University life on May 29, 2014 by xi'an

Just glanced at the introduction of this arXived paper over breakfast, back from my morning run: the exact title is “Model Selection for Likelihood-free Bayesian Methods Based on Moment Conditions: Theory and Numerical Examples” by Cheng Li and Wenxin Jiang. (The paper is 81 pages long.) I selected the paper for its title as it connected with an interrogation of ours on the manner to extend our empirical likelihood [A]BC work to model choice. We looked at this issue with Kerrie Mengersen and Judith Rousseau the last time Kerrie visited Paris but could not spot a satisfying entry… The current paper is of a theoretical nature, considering a moment defined model

$\mathbb{E}[g(D,\theta)]=0,$

where D denotes the data, as the dimension p of the parameter θ grows with n, the sample size. The approximate model is derived from a prior on the parameter θ and from a Gaussian quasi-likelihood on the moment estimating function g(D,θ). Examples include single index longitudinal data, quantile regression and partial correlation selection. The model selection setting is one of variable selection, resulting in $2^p$ models to compare, with p growing to infinity… Which makes the practical implementation rather delicate to conceive. And the probability one of hitting the right model a fairly asymptotic concept. (At least after a cursory read from my breakfast table!)
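For readers unfamiliar with the construction, a common generic form of such a Gaussian quasi-likelihood is recalled here. The notation is an assumption and not copied from the paper: the data D is taken to split into n observations $D_1,\ldots,D_n$, $\bar g_n(\theta)$ denotes the empirical moment, and $V_n$ is some estimate of the covariance matrix of g. The paper's exact version may differ, for instance through a normalising determinant term.

$\bar g_n(\theta)=\frac{1}{n}\sum_{i=1}^n g(D_i,\theta),\qquad q(D\mid\theta)\propto\exp\left\{-\frac{n}{2}\,\bar g_n(\theta)^{\mathsf T}V_n^{-1}\,\bar g_n(\theta)\right\},\qquad \pi(\theta\mid D)\propto\pi(\theta)\,q(D\mid\theta)$

The quasi-posterior on the right then replaces the genuine posterior, which is why the approach is likelihood-free in the sense that only the moment condition is specified.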

## improved approximate-Bayesian model-choice method for estimating shared evolutionary history

Posted in Books, Statistics, University life on May 14, 2014 by xi'an

“An appealing approach would be a comparative, Bayesian model-choice method for inferring the probability of competing divergence histories while integrating over uncertainty in mutational and ancestral processes via models of nucleotide substitution and lineage coalescence.” (p.2)

Jamie Oaks arXived (a few months ago now) a rather extensive Monte Carlo study on the impact of prior modelling on the performances of ABC model choice. (Of which I only became aware recently.) As in the earlier paper I commented on the ‘Og, the issue here has much more to do with prior assessment and calibration than with ABC implementation per se. For instance, the above quote recaps the whole point of conducting Bayesian model choice. (As missed by Templeton.)

“This causes divergence models with more divergence-time parameters to integrate over a much greater parameter space with low likelihood yet high prior density, resulting in small marginal likelihoods relative to models with fewer divergence-time parameters.” (p.2)

This second quote is essentially stressing the point with Occam’s razor argument. Which I deem [to be] a rather positive feature of Bayesian model choice. A reflection on the determination of the prior distribution, getting away from uniform priors, thus sounds most timely! The current paper takes place within a rather extensive exchange between Oaks’ group and Hickerson’s group on what makes Bayesian model choice (and the associated software msBayes) pick or not the correct model. Oaks and coauthors objected to the use of “narrow, empirically informed uniform priors”, arguing that this leads to a bias towards models with fewer parameters, a “statistical issue” in their words, while Hickerson et al. (2014) think this is due to msBayes’ way of selecting models and their parameters at random. However, the current paper refrains from reproducing earlier criticisms of or replies to Hickerson et al.
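The Occam's razor effect in the second quote can be made concrete on a toy problem, which is an illustration only and unrelated to the divergence models of the paper. Assume n observations from N(μ,1) with sample mean `xbar`, a null model M0 with μ = 0, and an alternative M1 with a uniform prior μ ~ U(-a, a); the names `xbar`, `n` and `a` are hypothetical. Once (-a, a) covers the region where the likelihood is non-negligible, widening the prior only adds low-likelihood parameter space, and the marginal likelihood of M1 decreases like 1/(2a).

```python
import math

def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def bayes_factor(xbar, n, a):
    """Bayes factor of M1 (mu ~ U(-a, a)) against M0 (mu = 0) for N(mu, 1) data.

    Both marginal likelihoods share the factor that does not depend on mu,
    so only the terms in exp(-n (mu - xbar)^2 / 2) are kept.
    """
    root_n = math.sqrt(n)
    # integral of exp(-n (mu - xbar)^2 / 2) over (-a, a), in closed form
    mass = norm_cdf(root_n * (a - xbar)) - norm_cdf(root_n * (-a - xbar))
    marginal_1 = math.sqrt(2.0 * math.pi / n) * mass / (2.0 * a)
    marginal_0 = math.exp(-n * xbar ** 2 / 2.0)
    return marginal_1 / marginal_0

for a in (1.0, 10.0, 100.0):
    print("prior half-width", a, "Bayes factor", bayes_factor(0.3, 20, a))
```

The same data thus support M1 less and less as the uniform prior gets wider, which is the penalty on the larger models that the quote describes.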

The current paper claims to have reached a satisfactory prior modelling with “improved robustness, accuracy, and power” (p.3). If I understand correctly, the changes are in replacing a uniform distribution with a Gamma or a Dirichlet prior. Which means introducing a seriously large and potentially crippling number of hyperparameters into the picture. Having a lot of flexibility in the prior also means a lot of variability in the resulting inference… In other words, with more flexibility comes more responsibility, to paraphrase Voltaire.

“I have introduced a new approximate-Bayesian model choice method.” (p.21)

The ABC part is rather standard, except for the strange feature that the divergence times are used to construct summary statistics (p.10). Strange because these times are not observed for the actual data. So I must be missing something. (And I object to the above quote and to the title of the paper since there is no new ABC technique there, simply a different form of prior.)

“ABC methods in general are known to be biased for model choice.” (p.21)

I do not understand much the part about (reshuffling) introducing bias as detailed on p.11: every approximate method gives a “biased” answer in the sense this answer is not the true and proper posterior distribution. Using a different (re-ordered) vector of statistics provides a different ABC outcome, hence a different approximate posterior, for which it seems truly impossible to check whether or not it increases the discrepancy from the true posterior, compared with the other version. I must admit I always find it annoying to see the word bias used with a vague meaning and esp. within a Bayesian setting. All Bayesian methods are biased. End of the story. Quoting our PNAS paper as concluding that ABC model choice is biased is equally misleading: the intended warning represented by the paper was that Bayes factors and posterior probabilities based on summary statistics could be quite unrelated to those based on the whole dataset. That the proper choice of summary statistics leads to a consistent model choice shows ABC model choice is not necessarily “biased”… Furthermore, I also fail to understand why the posterior probability of model i should be distributed as a uniform (“If the method is unbiased, the points should fall near the identity line”) when the data is from model i: this is not a p-value but a posterior probability and the posterior probability is not the frequentist coverage…

My overall problem is that, all in all, this is a single if elaborate Monte Carlo study and, as such, it does not carry enough weight to validate an approach that remains highly subjective in the selection of its hyperparameters. Without raising any doubt about a hypothetical “fixing” of those hyperparameters, I think this remains a controlled experiment with simulated data where the true parameters are known and the prior is “true”. This obviously helps in getting better performances.

“With improving numerical methods (…), advances in Monte Carlo techniques and increasing efficiency of likelihood calculations, analyzing rich comparative phylo-geographical models in a full-likelihood Bayesian framework is becoming computationally feasible.” (p.21)

This conclusion of the paper sounds over-optimistic and rather premature. I do not know of any significant advance in computing the observed likelihood for the population genetics models ABC is currently handling. (The SMC algorithm of Bouchard-Côté, Sankaraman and Jordan, 2012, does not apply to Kingman’s coalescent, as far as I can tell.) This is certainly a goal worth pursuing and borrowing strength from multiple techniques cannot hurt, but it remains so far a lofty goal, still beyond our reach… I thus think the major message of the paper is to reinforce our own and earlier calls for caution when interpreting the output of an ABC model choice (p.20), or even of a regular Bayesian analysis, agreeing that we should aim at seeing “a large amount of posterior uncertainty” rather than posterior probability values close to 0 and 1.

## AISTATS 2014 / MLSS tutorial

Posted in Mountains, R, Statistics, University life on April 26, 2014 by xi'an

Here are the slides of the tutorial on ABC methods I gave yesterday at both AISTATS 2014 and MLSS. (I actually gave a tutorial at another MLSS a few years ago, on the pretty island of Berder in Brittany, next to Vannes.) They are definitely similar to previous talks and tutorials I delivered on this topic of ABC algorithms, with only the last part being original (if unpublished yet). And even then: as Michael Gutmann from the University of Helsinki pointed out to me at the end of my talk, there are similarities between the classification method he exposed at MCMSki 4 in Chamonix and our use of random forests. Before my talk, I attended the tutorial of Roderick Murray-Smith from the University of Glasgow, on Machine learning and Human Computer Interaction, which was just stunning in its breadth, range of applications, and mastery of multimedia tools. Making me feel like a perfectly inadequate follower…