## likelihood-free inference via classification

**L**ast week, Michael Gutmann, Ritabrata Dutta, Samuel Kaski, and Jukka Corander posted on arXiv the latest version of the paper they had presented at MCMSki 4. As indicated by its title (above), it suggests implementing ABC based on classification tools, which makes it somewhat connected to our recent random forest paper.

**T**he starting idea in the paper is that datasets generated from distributions with different parameters should be easier to classify than datasets generated from distributions with similar parameters, and that classification accuracy naturally induces a distance between datasets and between the parameters behind those datasets. We had followed some of the same track when we started using random forests, before realising that, for our model choice setting, going the entire ABC way once the random forest had been constructed was counter-productive. Random forests are just too deadly as efficient model choice machines to try to compete with them through an ABC postprocessing. Performances are just… Not. As. Good!
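To make the starting idea concrete, here is a minimal sketch (not the authors' code) of classification accuracy used as a discrepancy between two datasets. Everything in it is an assumption made for illustration: the toy Gaussian model `simulate`, the function name `classification_accuracy`, the midpoint-threshold classifier, and the five interleaved cross-validation folds.

```python
import random


def simulate(theta, n, rng):
    """Hypothetical toy model (not from the paper): n i.i.d. draws from N(theta, 1)."""
    return [rng.gauss(theta, 1.0) for _ in range(n)]


def classification_accuracy(x, y, n_folds=5):
    """Cross-validated accuracy of a midpoint-threshold rule separating x from y.

    For scalar data with equal class sizes and a common variance, this rule
    matches 1-D linear discriminant analysis. Values near 0.5 mean the two
    datasets are hard to tell apart; values near 1 mean they are easy to
    tell apart.
    """
    data = [(v, 0) for v in x] + [(v, 1) for v in y]
    correct = 0
    for k in range(n_folds):
        # Fold k holds every n_folds-th point; the rest is used for training.
        train = [d for i, d in enumerate(data) if i % n_folds != k]
        test = [d for i, d in enumerate(data) if i % n_folds == k]
        x_train = [v for v, label in train if label == 0]
        y_train = [v for v, label in train if label == 1]
        m0 = sum(x_train) / len(x_train)
        m1 = sum(y_train) / len(y_train)
        midpoint = 0.5 * (m0 + m1)
        for v, label in test:
            # Predict class 1 when v lies on the same side of the midpoint as m1.
            predicted = int((v > midpoint) == (m1 > m0))
            correct += int(predicted == label)
    return correct / len(data)


rng = random.Random(0)
observed = simulate(0.0, 200, rng)
far = simulate(3.0, 200, rng)   # parameter far from the one behind `observed`
near = simulate(0.0, 200, rng)  # same parameter as `observed`
acc_far = classification_accuracy(observed, far)
acc_near = classification_accuracy(observed, near)
```

In this toy setting `acc_far` should come out well above `acc_near`, which hovers around chance level. Since the accuracy is computed from a finite simulated dataset, it is itself a noisy quantity.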

**A** side question: I have obviously never thought about that before but why is the naïve Bayes classification rule so called?! It never sounded very Bayesian to me to (a) use the true value of the parameter and (b) average the classification performances. Interestingly, the authors show (i) identical performances of other classification methods (Fig. 2) and (ii) an exception for MA time series: when we first experimented with random forests, raw data from an MA(2) model was tested to select between MA(1) and MA(2) models, and the performances of the resulting random forest were quite poor.

**N**ow, an opposition between our two approaches is that Michael and his coauthors also include point estimation within the range of classification-based ABC inference. As we stressed in our paper, we restrict the range to classification and model choice because we do not think those machine learning tools are stable and powerful enough to perform regression and posterior probability approximation. I also see a practical weakness in the estimation scheme proposed in this new paper, namely that the Monte Carlo approximation gets in the way of the consistency theorem, and possibly of the simulation method itself. Another remark is that, while the authors compare the fit produced by different classification methods, there should be a way to aggregate them towards higher efficiency. Returning once more to our random forest paper, we saw improved performances each time we included a reference method, from LDA to SVMs. It would be interesting to see a (summary) variable selection version of the proposed method. A final remark is that computing time and effort do not seem to get mentioned in the paper (unless Indian jetlag confuses me more than usual). I wonder how fast the computing effort grows with the sample size to reach parametric and quadratic convergence rates.
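As an illustration of the aggregation remark, one simple option is to run several classification rules on the same pair of datasets and keep the largest cross-validated accuracy. The sketch below is only that, a sketch: the two rules in `RULES`, the helper names, and the Gaussian toy data are hypothetical choices, not the construction used in either paper.

```python
import random


def cv_accuracy(x, y, feature, n_folds=5):
    """Cross-validated accuracy of a midpoint-threshold rule applied to feature(value)."""
    data = [(feature(v), 0) for v in x] + [(feature(v), 1) for v in y]
    correct = 0
    for k in range(n_folds):
        # Fold k holds every n_folds-th point; the rest is used for training.
        train = [d for i, d in enumerate(data) if i % n_folds != k]
        test = [d for i, d in enumerate(data) if i % n_folds == k]
        f0 = [f for f, label in train if label == 0]
        f1 = [f for f, label in train if label == 1]
        m0, m1 = sum(f0) / len(f0), sum(f1) / len(f1)
        midpoint = 0.5 * (m0 + m1)
        for f, label in test:
            # Predict class 1 when f lies on the same side of the midpoint as m1.
            predicted = int((f > midpoint) == (m1 > m0))
            correct += int(predicted == label)
    return correct / len(data)


# Hypothetical pool of rules: one sensitive to location, one to scale.
RULES = {
    "mean": lambda v: v,
    "scale": lambda v: v * v,
}


def aggregated_accuracy(x, y):
    """Return the name and accuracy of the best-performing rule in RULES."""
    scores = {name: cv_accuracy(x, y, feature) for name, feature in RULES.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


rng = random.Random(1)
x = [rng.gauss(0.0, 1.0) for _ in range(300)]
y = [rng.gauss(0.0, 3.0) for _ in range(300)]  # same mean, larger spread
best_rule, best_acc = aggregated_accuracy(x, y)
```

Here the two datasets share a mean and differ only in spread, so a rule thresholding the raw values performs near chance while the rule on squared values separates them. Taking a maximum over noisy accuracies is biased upwards, which is one more place where Monte Carlo variability enters.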

August 19, 2014 at 9:31 am

On the topic of aggregation of classification rules and the computing time:

The paper actually contains an aggregated classification rule. We call it the “Max-Rule” (see, for example, Section 2.1, and Figures 2 and 3, among others).

We comment on computational aspects in Sections 2.3, 2.4, and 3. In our epidemiology application (Section 2.4), the real bottleneck was sampling from the model so that “the required computation time [of classifier ABC] was about the same as for the expert method” (for the numbers, see p.8, 2nd paragraph).

There is likely room to improve the computational performance. Some ideas are presented in the Discussion.

Please note that there is a small but important typo on line 3 of the second paragraph above: It should be “similar parameters” not “different parameters”, i.e. “datasets generated from distributions with different parameters should be easier to classify than datasets generated from distributions with _similar_ parameters.”

Michael