Archive for random forests

ABC variable selection

Posted in Books, Mountains, pictures, Running, Statistics, Travel, University life on July 18, 2018 by xi'an

Prior to the ISBA 2018 meeting, Yi Liu, Veronika Ročková, and Yuexi Wang arXived a paper on relying on ABC for finding relevant variables, which is a very original approach in that ABC is not as much the object as it is a tool. And which Veronika considered during her Susie Bayarri lecture at ISBA 2018. In other words, it is not about selecting summary variables for running ABC but quite the opposite, selecting variables in a non-linear model through an ABC step. I was going to separate the two selections into algorithmic and statistical selections, but it is more like projections in the observation and covariate spaces. With ABC still providing an appealing approach to approximate the marginal likelihood. Now, one may wonder at the relevance of ABC for variable selection, aka model choice, given our warning call of a few years ago. But the current paper does not require low-dimension summary statistics, hence avoids the difficulty with the “other” Bayes factor.

In the paper, the authors consider a spike-and… forest prior!, where the Bayesian CART selection of active covariates proceeds through a regression tree, selected covariates appearing in the tree and others not appearing. With a sparsity prior on the tree partitions and this new ABC approach to select the subset of active covariates. A specific feature is in splitting the data, one part to learn about the regression function, simulating from this function and comparing with the remainder of the data. The paper further establishes that ABC Bayesian Forests are consistent for variable selection.

“…we observe a curious empirical connection between π(θ|x,ε), obtained with ABC Bayesian Forests and rescaled variable importances obtained with Random Forests.”

The difference with our ABC-RF model choice paper is that we select summary statistics [for classification] rather than covariates. For instance, in the current paper, simulation of pseudo-data will depend on the selected subset of covariates, meaning simulating a model index, and then generating the pseudo-data, acceptance being a function of the L² distance between data and pseudo-data. And then relying on all ABC simulations to find which variables are in more often than not to derive the median probability model of Barbieri and Berger (2004). Which does not work very well if implemented naïvely. Because of the immense size of the model space, it is quite hard to find pseudo-data close to actual data, resulting in either very high tolerance or very low acceptance. The authors get over this difficulty by a neat device that reminds me of fractional or intrinsic (pseudo-)Bayes factors in that the dataset is split into two parts, one that learns about the posterior given the model index and another one that simulates from this posterior to compare with the left-over data. Bringing simulations closer to the data. I do not remember seeing this trick before in ABC settings, but it is very neat, assuming the small data posterior can be simulated (which may be a fundamental reason for the trick to remain unused!). Note that the split varies at each iteration, which means there is no impact of ordering the observations.
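To make the splitting device above more concrete, here is a minimal Python sketch under loud assumptions: a conjugate Gaussian linear model stands in for the Bayesian CART/forest posterior of the paper, the tolerance is set as a quantile of the simulated L² distances, and all names (`abc_variable_selection`, `prior_incl`, `tau2`, `sigma2`, `keep_frac`) are hypothetical. It is not the authors' algorithm, only an illustration of the steps: simulate a model index, learn on one random half, simulate pseudo-data for the other half, accept on the L² distance, and read the median probability model off the accepted subsets.

```python
import numpy as np


def abc_variable_selection(X, y, n_iter=4000, keep_frac=0.05, prior_incl=0.5,
                           tau2=10.0, sigma2=1.0, seed=0):
    """Toy ABC variable selection with a fresh data split at each iteration.

    Assumption: a Gaussian linear model with known noise variance sigma2 and
    a N(0, tau2 I) prior on active coefficients replaces the tree posterior.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    subsets = np.zeros((n_iter, p), dtype=bool)
    dists = np.empty(n_iter)

    for t in range(n_iter):
        # 1. Simulate a model index: each covariate is active w.p. prior_incl.
        gamma = rng.random(p) < prior_incl
        subsets[t] = gamma

        # 2. Random split, redrawn at every iteration.
        perm = rng.permutation(n)
        train, test = perm[: n // 2], perm[n // 2:]

        # 3. Draw coefficients from the posterior given the training half.
        mean_test = np.zeros(test.size)
        k = int(gamma.sum())
        if k > 0:
            Xa = X[train][:, gamma]
            prec = Xa.T @ Xa / sigma2 + np.eye(k) / tau2
            mean = np.linalg.solve(prec, Xa.T @ y[train] / sigma2)
            chol = np.linalg.cholesky(prec)  # prec = chol @ chol.T
            beta = mean + np.linalg.solve(chol.T, rng.standard_normal(k))
            mean_test = X[test][:, gamma] @ beta

        # 4. Pseudo-data for the held-out half and L2 distance to the real one.
        y_sim = mean_test + rng.normal(0.0, np.sqrt(sigma2), test.size)
        dists[t] = np.linalg.norm(y[test] - y_sim)

    # 5. Accept the closest simulations (tolerance = empirical quantile).
    eps = np.quantile(dists, keep_frac)
    accepted = subsets[dists <= eps]

    # 6. Inclusion frequencies and median probability model.
    incl = accepted.mean(axis=0)
    return incl, incl > 0.5


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    X = rng.standard_normal((200, 6))
    y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.standard_normal(200)
    incl, selected = abc_variable_selection(X, y)
    print("inclusion frequencies:", np.round(incl, 2))
    print("median probability model:", np.flatnonzero(selected))
```

In this toy version the strong covariates end up in almost every accepted subset, while how sharply the irrelevant ones are excluded presumably depends on the sparsity prior (here the crude `prior_incl`) and on the tolerance.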

machine learning methods are useful for ABC [or my first PCI Evol Biol!]

Posted in Books, Kids, pictures, Statistics, University life on November 23, 2017 by xi'an

While I am still working on setting a PCI [peer community in] Comput Stats, having secured sponsorship of some societies (ASA, KSS, RSS, SFdS, and hopefully ISBA), my coauthors Jean-Michel Marin and Louis Raynal submitted our paper ABC random forests for Bayesian parameter inference to PCI Evol Biol. And after a few months of review, including a revision accounting for the reviewers’ requests, our paper stood the test and the recommendation by Michael Blum and Dennis Prangle got published there. Great news, and hopefully helpful for our submission within the coming days!

Astrostatistics school

Posted in Mountains, pictures, R, Statistics, Travel, University life on October 17, 2017 by xi'an

What a wonderful week at the Astrostat [Indian] summer school in Autrans! The setting was superb, on the high Vercors plateau overlooking both Grenoble [north] and Valence [west], with the colours of the Fall at their brightest on the foliage of the forests rising on both sides of the valley and a perfect green on the fields at the centre, with sun all along, sharp mornings and warm afternoons worthy of a late Indian summer, too many running trails [turning into X country ski trails in the Winter] to contemplate for a single week [even with three hours of running over two days], many climbing sites on the numerous chalk cliffs all around [but a single afternoon for that, more later in another post!]. And of course a group of participants eager to learn about Bayesian methodology and computational algorithms, from diverse [astronomy, cosmology and more] backgrounds, trainings and countries. I was surprised at the dedication of the participants travelling all the way from Chile, Peru, and Hong Kong for the sole purpose of attending the school. David van Dyk gave the first part of the school on Bayesian concepts and MCMC methods, Roberto Trotta the second part on Bayesian model choice and hierarchical models, and myself a third part on, surprise, surprise!, approximate Bayesian computation. Plus practicals on R.

As it happens Roberto had to cancel his participation and I turned for a session into Christian Roberto, presenting his slides in the most objective possible fashion!, as a significant part covered nested sampling and Savage-Dickey ratios, not exactly my favourites for estimating constants. David joked that he was considering postponing his flight to see me talk about these, but I hope I refrained from engaging in controversy and criticisms… If only because this was not of interest for the participants. Indeed when I started presenting ABC through what I thought was a pedestrian example, namely Rasmus Baath’s socks, I found that the main concern was not running an MCMC sampler or a substitute ABC algorithm but rather a healthy questioning of the construction of the informative prior in that artificial setting, which made me quite glad I had planned to cover this example rather than an advanced model [as, e.g., one of those covered in the packages abc, abctools, or abcrf]. Because it generated those questions about the prior [why a Negative Binomial? why these hyperparameters? &tc.] and showed how programming ABC turned into a difficult exercise even in this toy setting. And while I wanted to give my usual warning about ABC model choice and argue for random forests as a summary selection tool, I feel I should have focussed instead on another example, as this exercise brings out so clearly the conceptual difficulties with what is taught. Making me quite sorry I had to leave one day earlier. [As did missing an extra run!] Coming back by train through the sunny and grape-covered slopes of Burgundy hills was an extra reward [and no one in the train commented about the local cheese travelling in my bag!]
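For readers who have not met the socks example mentioned above, here is a rough Python sketch of rejection ABC in that toy setting. Everything specific is an assumption recalled from Rasmus Bååth's blog post rather than taken from the school material: the observation (11 socks picked, all of them distinct), the Negative Binomial prior on the total number of socks (mean 30, standard deviation 15), and the Beta(15, 2) prior on the proportion of paired socks. The function names are hypothetical.

```python
import numpy as np


def simulate_socks(rng, n_picked=11, prior_mu=30.0, prior_sd=15.0,
                   a=15.0, b=2.0):
    """Draw from the prior, then simulate picking socks from the laundry.

    Assumed hyperparameters: NegBin(mean=30, sd=15) on the number of socks
    and Beta(15, 2) on the proportion of socks that come in pairs.
    """
    # Convert (mean, sd) into numpy's (n, p) Negative Binomial parameters.
    size = prior_mu**2 / (prior_sd**2 - prior_mu)
    n_socks = int(rng.negative_binomial(size, size / (size + prior_mu)))
    prop_pairs = rng.beta(a, b)

    n_pairs = int(round((n_socks // 2) * prop_pairs))
    n_odd = n_socks - 2 * n_pairs

    # Label the socks: both socks of a pair share a label, odd socks do not.
    socks = np.concatenate([np.repeat(np.arange(n_pairs), 2),
                            np.arange(n_pairs, n_pairs + n_odd)])
    picked = rng.permutation(socks)[:n_picked]

    _, counts = np.unique(picked, return_counts=True)
    n_unique = int(np.sum(counts == 1))   # picked socks without their match
    n_matched = int(np.sum(counts == 2))  # complete pairs among the picks
    return n_socks, n_unique, n_matched


def abc_socks(n_sims=20000, n_picked=11, seed=0):
    """Rejection ABC: keep prior draws that reproduce the observation exactly."""
    rng = np.random.default_rng(seed)
    draws = np.array([simulate_socks(rng, n_picked) for _ in range(n_sims)])
    keep = (draws[:, 1] == n_picked) & (draws[:, 2] == 0)
    return draws[keep, 0]


if __name__ == "__main__":
    post = abc_socks()
    print("accepted draws:", post.size)
    print("posterior median of the number of socks:", np.median(post))
```

Since the data are discrete, acceptance here requires an exact match with the observation, so there is no tolerance to tune, and every modelling question falls back on the prior, which is exactly what the participants were questioning.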


MCM 2017 snapshots [#2]

Posted in Books, pictures, Running, Statistics, University life on July 7, 2017 by xi'an

On the second day of MCM 2017, Emmanuel Gobet (from Polytechnique) gave the morning plenary talk on regression Monte Carlo methods, where he presented several ways of estimating conditional means of rv’s in nested problems where conditioning involves other conditional expectations. While interested in such problems in connection with ABC, I could not see how the techniques developed therein could apply to said problems.

By some random chance, I ended up attending a hard-core random generation session where the speakers were discussing discrepancies between GNU library generators [I could not understand the target of interest and using MCMC till convergence seemed prone to false positives!], and failed statistical tests of some 64-bit Mersenne Twisters, and low discrepancy on-line subsamples of Uniform samples. Most exciting of all, Josef Leydold gave a talk on ratio-of-uniforms, on which I spent some time a while ago (till ending up reinventing the wheel!), with highly refined cuts of the original box.

My own 180 slides [for a 50mn talk] somewhat worried my chairman, Art Owen, who kindly enquired the day before at the likelihood I could go through all 184 of them!!! I had appended the ABC convergence slides to an earlier set of slides on ABC with random forests in case of questions about that aspect, although I did not plan to go through those slides [and I mostly covered the 64 other slides]. As the talk was in fine more about an inference method than a genuine Monte Carlo technique, plus involved random forests that sounded unfamiliar to many, I did not get many questions from the audience but had several deep discussions with people after the talk. Incidentally, we have just reposted our paper on ABC estimation via random forests, updated the abcrf R package, and submitted it to Peer Community in Evolutionary Biology!

MCM 2017

Posted in Statistics on July 3, 2017 by xi'an

And thus I am back in Montréal, for MCM 2017, located in HEC Montréal, on the campus of Université de Montréal, for three days. My talk is predictably about ABC, what else?!, gathering diverse threads from different talks and papers:

ODOF, not Hodor [statlearn 2017]

Posted in Books, Kids, pictures, Statistics, University life on April 15, 2017 by xi'an

random forests [reading group]

Posted in Books, Kids, Statistics, University life on March 14, 2017 by xi'an

Here are the slides I prepared (and recycled) over the weekend for the reading group on machine-learning that recently started in Warwick. Where I am for two consecutive weeks.