While my arXiv newspage today had a puzzling entry about modelling UFOs sightings in France, it also broadcast our revision of Reliable ABC model choice via random forests, version that we resubmitted today to Bioinformatics after a quite thorough upgrade, the most dramatic one being the realisation we could also approximate the posterior probability of the selected model via another random forest. (With no connection with the recent post on forest fires!) As discussed a little while ago on the ‘Og. And also in conjunction with our creating the abcrf R package for running ABC model choice out of a reference table. While it has been an excruciatingly slow process (the initial version of the arXived document dates from June 2014, the PNAS submission was rejected for not being enough Bayesian, and the latest revision took the whole summer), the slow maturation of our thoughts on the model choice issues led us to modify the role of random forests in the ABC approach to model choice, in that we reverted our earlier assessment that they could only be trusted for selecting the most likely model, by realising this summer the corresponding posterior could be expressed as a posterior loss and estimated by a secondary forest. As first considered in Stoehr et al. (2014). (In retrospect, this brings an answer to one of the earlier referee’s comments.) Next goal is to incorporate those changes in DIYABC (and wait for the next version of the software to appear). Another best-selling innovation due to Arnaud: we added a practical implementation section in the format of FAQ for issues related with the calibration of the algorithms.
Archive for ABC model choice
In conjunction with our reliable ABC model choice via random forest paper, about to be resubmitted to Bioinformatics, we have contributed an R package called abcrf that produces a most likely model and its posterior probability out of an ABC reference table. In conjunction with the realisation that we could devise an approximation to the (ABC) posterior probability using a secondary random forest. “We” meaning Jean-Michel Marin and Pierre Pudlo, as I only acted as a beta tester!
The package abcrf consists of three functions:
- abcrf, which constructs a random forest from a reference table and returns an object of class `abc-rf’;
- plot.abcrf, which gives both variable importance plot of a model choice abc-rf object and the projection of the reference table on the LDA axes;
- predict.abcrf, which predict the model for new data and evaluate the posterior probability of the MAP.
An illustration from the manual:
data(snp) data(snp.obs) mc.rf <- abcrf(snp[1:1e3, 1], snp[1:1e3, -1]) predict(mc.rf, snp[1:1e3, -1], snp.obs)
Today I gave a talk on Approximate Bayesian model choice via random forests at the yearly SPA (Stochastic Processes and their Applications) 2015 conference, taking place in Oxford (a nice town near Warwick) this year. In Keble College more precisely. The slides are below and while they are mostly repetitions of earlier slides, there is a not inconsequential novelty in the presentation, namely that I included our most recent and current perspective on ABC model choice. Indeed, when travelling to Montpellier two weeks ago, we realised that there was a way to solve our posterior probability conundrum!
Despite the heat wave that rolled all over France that week, we indeed figured out a way to estimate the posterior probability of the selected (MAP) model, way that we had deemed beyond our reach in previous versions of the talk and of the paper. The fact that we could not provide an estimate of this posterior probability and had to rely instead on a posterior expected loss was one of the arguments used by the PNAS reviewers in rejecting the paper. While the posterior expected loss remains a quantity worth approximating and reporting, the idea that stemmed from meeting together in Montpellier is that (i) the posterior probability of the MAP is actually related to another posterior loss, when conditioning on the observed summary statistics and (ii) this loss can be itself estimated via a random forest, since it is another function of the summary statistics. A posteriori, this sounds trivial but we had to have a new look at the problem to realise that using ABC samples was not the only way to produce an estimate of the posterior probability! (We are now working on the revision of the paper for resubmission within a few week… Hopefully before JSM!)
Today, I am taking part in a meeting in Paris, for an exotic change!, at the Bureau international des poids et mesures (BIPM), which looks after a universal reference for measurements. For instance, here is its definition of the kilogram:
The unit of mass, the kilogram, is the mass of the international prototype of the kilogram kept in air under three bell jars at the BIPM. It is a cylinder made of an alloy for which the mass fraction of platinum is 90 % and the mass fraction of iridium is 10 %.
And the BIPM is thus interested in the uncertainty associated with such measurements. Hence the workshop on measurement uncertainties. Tony O’Hagan will also be giving a talk in a session that opposes frequentist and Bayesian approaches, even though I decided to introduce ABC as it seems to me to be a natural notion for measurement problems (as far as I can tell from my prior on measurement problems).
Jean-Michel Marin, Pierre Pudlo and I just arXived a short review on ABC model choice, first version of a chapter for the incoming Handbook of Approximate Bayesian computation edited by Scott Sisson, Yannan Fan, and Mark Beaumont. Except for a new analysis of a Human evolution scenario, this survey mostly argues for the proposal made in our recent paper on the use of random forests and [also argues] about the lack of reliable approximations to posterior probabilities. (Paper that was rejected by PNAS and that is about to be resubmitted. Hopefully with a more positive outcome.) The conclusion of the survey is that
The presumably most pessimistic conclusion of this study is that the connections between (i) the true posterior probability of a model, (ii) the ABC version of this probability, and (iii) the random forest version of the above, are at best very loose. This leaves open queries for acceptable approximations of (i), since the posterior predictive error is instead an error assessment for the ABC RF model choice procedure. While a Bayesian quantity that can be computed at little extra cost, it does not necessarily compete with the posterior probability of a model.
reflecting my hope that we can eventually come up with a proper approximation to the “true” posterior probability…
Yesterday, I was all too briefly in Edinburgh for a few hours, to give a seminar in the School of Mathematics, on the random forests approach to ABC model choice (that was earlier rejected). (The slides are almost surely identical to those used at the NIPS workshop.) One interesting question at the end of the talk was on the potential bias in the posterior predictive expected loss, bias against some model from the collection of models being evaluated for selection. In the sense that the array of summaries used by the random forest could fail to capture features of a particular model and hence discriminate against it. While this is correct, there is no fundamental difference with implementing a posterior probability based on the same summaries. And the posterior predictive expected loss offers the advantage of testing, that is, for representative simulations from each model, of returning the corresponding model prediction error to highlight poor performances on some models. A further discussion over tea led me to ponder whether or not we could expand the use of random forests to Bayesian quantile regression. However, this would imply a monotonicity structure on a collection of random forests, which sounds daunting…
My stay in Edinburgh was quite brief as I drove to the Highlands after the seminar, heading to Fort William, Although the weather was rather ghastly, the traffic was fairly light and I managed to get there unscathed, without hitting any of the deer of Rannoch Mor (saw one dead by the side of the road though…) or the snow banks of the narrow roads along Loch Lubnaig. And, as usual, it still was a pleasant feeling to drive through those places associated with climbs and hikes, Crianlarich, Tyndrum, Bridge of Orchy, and Glencoe. And to get in town early enough to enjoy a quick dinner at The Grog & Gruel, reflecting I must have had half a dozen dinners there with friends (or not) over the years. And drinking a great heather ale to them!
Our random forest paper was alas rejected last week. Alas because I think the approach is a significant advance in ABC methodology when implemented for model choice, avoiding the delicate selection of summary statistics and the report of shaky posterior probability approximation. Alas also because the referees somewhat missed the point, apparently perceiving random forests as a way to project a large collection of summary statistics on a limited dimensional vector as in the Read Paper of Paul Fearnhead and Dennis Prarngle, while the central point in using random forests is the avoidance of a selection or projection of summary statistics. They also dismissed ou approach based on the argument that the reduction in error rate brought by random forests over LDA or standard (k-nn) ABC is “marginal”, which indicates a degree of misunderstanding of what the classification error stand for in machine learning: the maximum possible gain in supervised learning with a large number of classes cannot be brought arbitrarily close to zero. Last but not least, the referees did not appreciate why we mostly cannot trust posterior probabilities produced by ABC model choice and hence why the posterior error loss is a valuable and almost inevitable machine learning alternative, dismissing the posterior expected loss as being not Bayesian enough (or at all), for “averaging over hypothetical datasets” (which is a replicate of Jeffreys‘ famous criticism of p-values)! Certainly a first time for me to be rejected based on this argument!