## Bayesian optimization for likelihood-free inference of simulator-based statistical models [guest post]

Posted in Books, Statistics, University life on February 17, 2015 by xi'an

[The following comments are from Dennis Prangle, about the second half of the paper by Gutmann and Corander I commented last week.]

Here are some comments on the paper of Gutmann and Corander. My brief skim through it concentrated on the second half of the paper, the applied methodology. So my comments should be quite complementary to Christian’s on the theoretical part!

ABC algorithms generally follow the template of proposing parameter values, simulating datasets and accepting/rejecting/weighting the results based on similarity to the observations. The output is a Monte Carlo sample from a target distribution, an approximation to the posterior. The most naive proposal distribution for the parameters is simply the prior, but this is inefficient if the prior is highly diffuse compared to the posterior. MCMC and SMC methods can be used to provide better proposal distributions. Nevertheless they often still seem quite inefficient, requiring repeated simulations in parts of parameter space which have already been well explored.
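To make the template concrete, here is a minimal sketch of the most naive version, rejection ABC with proposals from the prior. The toy Gaussian model, the uniform prior, the sample mean as summary statistic and the threshold `eps` are all assumptions made for illustration; none of them comes from the paper.

```python
import numpy as np

def rejection_abc(observed, simulate, prior_sample, summary, n_proposals, eps, rng):
    """Keep the proposed parameters whose simulated summary lies within eps of the observed one."""
    s_obs = summary(observed)
    accepted = []
    for _ in range(n_proposals):
        theta = prior_sample(rng)               # propose a parameter value from the prior
        s_sim = summary(simulate(theta, rng))   # simulate a dataset and summarise it
        if abs(s_sim - s_obs) <= eps:           # accept if close enough to the observations
            accepted.append(theta)
    return np.array(accepted)

# Toy example (hypothetical): Gaussian data with unknown mean and a diffuse uniform prior.
rng = np.random.default_rng(0)
observed = rng.normal(2.0, 1.0, size=50)
sample = rejection_abc(
    observed,
    simulate=lambda theta, rng: rng.normal(theta, 1.0, size=50),
    prior_sample=lambda rng: rng.uniform(-10.0, 10.0),
    summary=np.mean,
    n_proposals=20000,
    eps=0.1,
    rng=rng,
)
print(len(sample), sample.mean())
```

With a prior this diffuse, only a small fraction of the proposals is accepted, which is exactly the inefficiency described above.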

The strategy of this paper is to instead attempt to fit a non-parametric model to the target distribution (or in fact to a slight variation of it). Hopefully this will require many fewer simulations. This approach is quite similar to Richard Wilkinson’s recent paper. Richard fitted a Gaussian process to the ABC analogue of the log-likelihood. Gutmann and Corander introduce two main novelties:

1. They model the expected discrepancy (i.e. distance) Δθ between the simulated and observed summary statistics, which is then transformed to estimate the likelihood. This is in contrast to Richard, who transformed the discrepancy before modelling, following the standard ABC approach of weighting the discrepancy depending on how close to 0 it is. The drawback of this standard approach is that it requires picking a tuning parameter (the ABC acceptance threshold or bandwidth) in advance of the algorithm. The new approach still requires a tuning parameter, but its choice can be delayed until the transformation is performed.
2. They generate the θ values on-line using “Bayesian optimisation”. The idea is to pick θ to concentrate on the region near the minimum of the objective function, and also to reduce uncertainty in the Gaussian process. Thus well-explored regions can usually be neglected. This is in contrast to Richard, who chose θs using a space-filling design prior to performing any simulations.
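The two ideas can be sketched together in a few lines. This is only a rough illustration under stated assumptions: the Gaussian process uses a squared exponential kernel with hyperparameters fixed by hand, the next θ is chosen by a lower confidence bound rule (which is not claimed to be the criterion of Equation (45)), the simulator is a toy Gaussian model, and the final transformation with bandwidth `h` is a simple Gaussian kernel stand-in rather than the paper's own construction.

```python
import numpy as np

def rbf(a, b, length=1.0, var=1.0):
    """Squared exponential kernel between two 1-d arrays of points."""
    d = a[:, None] - b[None, :]
    return var * np.exp(-0.5 * (d / length) ** 2)

def gp_posterior(x_train, y_train, x_test, noise=0.05, prior_var=1.0):
    """Posterior mean and standard deviation of a GP with constant mean and fixed hyperparameters."""
    y_mean = y_train.mean()
    K = rbf(x_train, x_train, var=prior_var) + noise * np.eye(len(x_train))
    K_s = rbf(x_train, x_test, var=prior_var)
    mean = y_mean + K_s.T @ np.linalg.solve(K, y_train - y_mean)
    var = prior_var - np.sum(K_s * np.linalg.solve(K, K_s), axis=0)
    return mean, np.sqrt(np.clip(var, 0.0, None))

# Toy simulator (hypothetical): Gaussian data with unknown mean theta.
rng = np.random.default_rng(1)
observed = rng.normal(2.0, 1.0, size=50)

def discrepancy(theta):
    """Distance between simulated and observed summary statistics (here, sample means)."""
    return abs(rng.normal(theta, 1.0, size=50).mean() - observed.mean())

grid = np.linspace(-5.0, 5.0, 201)        # candidate theta values
thetas = np.array([-4.0, 0.0, 4.0])       # small initial design
deltas = np.array([discrepancy(t) for t in thetas])

for _ in range(20):
    mean, sd = gp_posterior(thetas, deltas, grid)
    # Lower confidence bound: favour low predicted discrepancy or high uncertainty.
    next_theta = grid[np.argmin(mean - 2.0 * sd)]
    thetas = np.append(thetas, next_theta)
    deltas = np.append(deltas, discrepancy(next_theta))

mean, sd = gp_posterior(thetas, deltas, grid)
theta_hat = grid[np.argmin(mean)]         # minimiser of the modelled discrepancy

# The tuning parameter h is only needed now, after all simulations have been run.
h = 0.5
approx_lik = np.exp(-0.5 * (mean / h) ** 2)
print(theta_hat)
```

Only 23 simulations are run in total, and changing `h` afterwards costs no further simulation, which is the practical appeal of delaying the transformation.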

I didn’t read the paper’s theory closely enough to decide whether (1) is a good idea. Certainly the results for the paper’s examples look convincing. Also, one issue with Richard’s approach was that because the log-likelihood varied over such a wide range of magnitudes, he needed to fit several “waves” of GPs. It would be nice to know if the approach of modelling the discrepancy has removed this problem, or if a single GP is still sometimes an insufficiently flexible model.

Novelty (2) is a very nice and natural approach to take here. I did wonder why the particular criterion in Equation (45) was used to decide on the next θ. Does this correspond to optimising some information theoretic quantity? Other practical questions were whether it’s possible to parallelise the method (I seem to remember talking to Michael Gutmann about this at NIPS but can’t remember his answer!), and how well the approach scales up with the dimension of the parameters.

## MCqMC 2014 [day #3]

Posted in pictures, Running, Statistics, Travel, University life, Wines on April 10, 2014 by xi'an

As the second day at MCqMC 2014 was mostly on multi-level Monte Carlo and quasi-Monte Carlo methods, I did not attend many talks but had a long run in the countryside (even saw a pheasant and a heron), worked at “home” on pressing recruiting evaluations and had a long working session with Pierre Jacob. Plus an evening out sampling (just) a few Belgian beers in the shade of the city hall…

Today was more in my ballpark as there were MCMC talks the whole day! The plenary talk was not about MCMC, though, as Erich Novak presented a survey on the many available results bounding the complexity of approximating an integral based on a fixed number of evaluations of the integrand, some involving the dimension (and its curse), some not, some as fast as √n and some not as fast, all this depending on the regularity and the size of the classes of integrands considered. In some cases, the solution was importance sampling, in other cases, quasi-Monte Carlo, and yet other cases were still unsolved. Then Yves Atchadé gave a new perspective on computing the asymptotic variance in the central limit theorem on Markov chains when truncating the autocovariance. Matti Vihola talked about theoretical orderings of Markov chains that transmuted into the very practical consequence that using more simulations in a pseudo-marginal likelihood approximation improved acceptance rate and asymptotic variances (and this applies to ABC-MCMC as well). Radu Craiu proposed a novel processing of adaptive MCMC by treating various approximations to the true target as food for a multiple-try Metropolis algorithm. And Luca Martino had a go at resuscitating the ARMS algorithm of Gilks and Wild (used for a while in BUGS), although the talk did not dissipate all of my misgivings about the multidimensional version! I had more difficulty following the “Warwick session”, which was made of four talks by current or former students from Warwick, although I appreciated the complexity of the results in infinite dimensional settings and novel approximations to diffusion based Metropolis algorithms. No further session this afternoon as the “social” activity was to visit the nearby Stella Artois brewery! This activity made us very social, for certain, even though there was hardly a soul around in this massively automated factory. (Maybe an ‘Og post to come one of those days…)

## Le Monde puzzle [#851]

Posted in Books, Kids, Statistics, University life on February 6, 2014 by xi'an

A more unusual Le Monde mathematical puzzle:

Fifty black and white tokens are set on an equilateral triangle of side 9, black on top and white on bottom. If they can only be turned three by three, determine whether it is possible to produce a triangle with all white sides on top, under each of the following constraints:

• the three tokens must stand on a line;
• the three tokens must stand on a line and be contiguous;
• the three tokens must stand on the summits of an equilateral triangle;
• the three tokens must stand on the summits of an equilateral triangle of side one.
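One possible line of attack, offered as a sketch rather than a solution: since turning the same triple twice changes nothing and the order of moves is irrelevant, each constraint reduces to asking whether the all-ones vector is a sum modulo 2 of the allowed triples, which Gaussian elimination over GF(2) answers. The code below covers only the second and fourth constraints. The layout (row r holding r+1 tokens), the helper names and the choice of leaving the number of rows as a parameter are all assumptions of mine, since the token count and side length of the statement still have to be matched to a layout.

```python
def index(r, c):
    """Position of token (row r, column c) in a flat numbering of the triangle."""
    return r * (r + 1) // 2 + c

def mask(cells):
    """Bitmask with one bit set for each token in cells."""
    m = 0
    for r, c in cells:
        m |= 1 << index(r, c)
    return m

def contiguous_line_moves(n_rows):
    """All triples of adjacent tokens along the three line directions."""
    moves = []
    for r in range(n_rows):
        for c in range(r + 1):
            if c + 2 <= r:          # along the row
                moves.append(mask([(r, c), (r, c + 1), (r, c + 2)]))
            if r + 2 < n_rows:      # along the two slanted directions
                moves.append(mask([(r, c), (r + 1, c), (r + 2, c)]))
                moves.append(mask([(r, c), (r + 1, c + 1), (r + 2, c + 2)]))
    return moves

def unit_triangle_moves(n_rows):
    """All equilateral triangles of side one, pointing up or down."""
    moves = []
    for r in range(n_rows - 1):
        for c in range(r + 1):
            moves.append(mask([(r, c), (r + 1, c), (r + 1, c + 1)]))       # pointing up
            if c + 1 <= r:
                moves.append(mask([(r, c), (r, c + 1), (r + 1, c + 1)]))   # pointing down
    return moves

def can_turn_all(moves, n_rows):
    """True if turning some subset of the moves flips every token exactly an odd number of times."""
    basis = {}  # highest set bit -> basis vector (XOR basis over GF(2))

    def reduce_vec(v):
        while v:
            top = v.bit_length() - 1
            if top not in basis:
                return v
            v ^= basis[top]
        return 0

    for m in moves:
        rest = reduce_vec(m)
        if rest:
            basis[rest.bit_length() - 1] = rest
    n_tokens = n_rows * (n_rows + 1) // 2
    return reduce_vec((1 << n_tokens) - 1) == 0

# Small sanity check: three tokens forming a single unit triangle.
print(can_turn_all(unit_triangle_moves(2), 2))
```

Calling `can_turn_all(contiguous_line_moves(n), n)` or `can_turn_all(unit_triangle_moves(n), n)` with the appropriate number of rows `n` would then settle those two cases; the non-contiguous and larger-triangle constraints would only need their own move generators.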

I could not think of a quick fix with an R code so leave it to the interested ‘Og reader… In the next issue of the Science&Médecine leaflet (Jan. 29), which appeared while I was in Warwick, there were a few entries of interest. First, the central article was about Big Data (again), but, for a change, the journalist took pains to include French statisticians and machine learners in the picture, like Stéphan Clémençon, Aurélien Garivier, Jean-Michel Loubes, and Nicolas Vayatis. (In a typical French approach, the subtitle was “A challenge for maths”, rather than statistics!) Ignoring the (minor) confusion therein of “small n, large p” with the plague of dimensionality, the article does mention a few important issues like distributed computing, inhomogeneous datasets, overfitting and learning. There are also links to the new masters in data sciences at ENSAE, Telecom-ParisTech, and Paris 6-Pierre et Marie Curie. (The one in Paris-Dauphine is still under construction and will not open next year.) As a side column, the journal also wonders about the “end of Science” due to massive data influx and “Big Data” techniques that could predict and explain without requiring theories and deductive or scientific thinking. Somewhat paradoxically, the column ends with a quote from Jean-Michel Loubes, who states that one could think “our” methods start from effects to end up with causes, but that in fact the models are highly dependent on the data. And on the opinion of experts. Doesn’t that suggest some Bayesian principles at work there?!

Another column is dedicated to Edward Teller’s “dream” of using nuclear bombs for civil engineering, as in the Chariot project in Alaska. And the last entry is against Kelvin’s “to measure is to know”, with the title “To know is not to measure”, although it does not aim at a general philosophical level but rather objects to the unrestricted intrusion of bibliometrics and other indices brought from marketing. Written by a mathematician, this column is not directed against statistics and the Big Data revolution, but rather the myth that everything can be measured and quantified. (There was also a pointer to a tribune against the pseudo-recruiting of top researchers by Saudi universities in order to improve their Shanghai ranking but I do not have time to discuss it here. And now. Maybe later.)

## dimension reduction in ABC [a review’s review]

Posted in Statistics, University life on February 27, 2012 by xi'an

What is very apparent from this study is that there is no single ‘best’ method of dimension reduction for ABC.

Michael Blum, Matt Nunes, Dennis Prangle and Scott Sisson just posted on arXiv a rather long review of dimension reduction methods in ABC, along with a comparison on three specific models. Given that the choice of the vector of summary statistics is presumably the most important single step in an ABC algorithm and as selecting too large a vector is bound to fall victim to the dimension curse, this is a fairly relevant review! Therein, the authors compare regression adjustments à la Beaumont et al. (2002), subset selection methods, as in Joyce and Marjoram (2008), and projection techniques, as in Fearnhead and Prangle (2012). They add to this impressive battery of methods the potential use of AIC and BIC. (Last year after ABC in London I reported here on the use of the alternative DIC by François and Laval, but the paper is not in the bibliography, I wonder why.) An argument (page 22) for using AIC/BIC is that either provides indirect information about the approximation of p(θ|y) by p(θ|s); this does not seem obvious to me.

The paper also suggests a further regularisation of Beaumont et al. (2002) by ridge regression, although an L1 penalty à la Lasso would be more appropriate in my opinion for removing extraneous summary statistics. (I must acknowledge never being a big fan of ridge regression, esp. in the ad hoc version à la Hoerl and Kennard, i.e. in a non-decision theoretic approach where the hyperparameter λ is derived from the data by cross-validation, since it then sounds like a poor man’s Bayes/Stein estimate, just like BIC is a first order approximation to regular Bayes factors… Why pay for the copy when you can afford the original?!) Unsurprisingly, ridge regression does better than plain regression in the comparison experiment when there are many almost collinear summary statistics, but an alternative conclusion could be that regression analysis is not that appropriate with many summary statistics. Indeed, summary statistics are not quantities of interest but data summarising tools towards a better approximation of the posterior at a given computational cost… (I do not get the final comment, page 36, about the relevance of summary statistics for MCMC or SMC algorithms: the criterion should be the best approximation of p(θ|y) which does not depend on the type of algorithm.)
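For readers who have not met the regression adjustment, here is a minimal sketch of a linear adjustment with a ridge penalty on the slopes. It is a simplification under assumptions of mine: the regression is unweighted (no kernel weights on the accepted simulations), the penalty `lam` is set by hand rather than by cross-validation, and the toy data with two almost collinear summary statistics are invented for the illustration.

```python
import numpy as np

def ridge_adjust(thetas, summaries, s_obs, lam):
    """Shift each theta by the ridge-fitted linear effect of (s - s_obs)."""
    X = summaries - s_obs                  # summaries centred at the observed values
    Xc = X - X.mean(axis=0)                # centre again so the intercept is not penalised
    yc = thetas - thetas.mean()
    p = X.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    return thetas - X @ beta               # adjusted values, as if simulated at s_obs

# Toy data (hypothetical): two almost collinear summaries of a scalar parameter.
rng = np.random.default_rng(2)
thetas = rng.normal(0.0, 1.0, size=500)
s1 = thetas + rng.normal(0.0, 0.3, size=500)
s2 = s1 + rng.normal(0.0, 0.01, size=500)
summaries = np.column_stack([s1, s2])
s_obs = np.array([1.0, 1.0])

adjusted = ridge_adjust(thetas, summaries, s_obs, lam=1.0)
print(thetas.var(), adjusted.var())
```

The penalty keeps the two nearly identical slopes from blowing up in opposite directions, which is where plain least squares struggles; an L1 penalty would instead try to drop one of the two summaries altogether.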

I find it quite exciting to see the development of a new range of ABC papers like this review dedicated to a better derivation of summary statistics in ABC, each with different perspectives and desiderata, as it will help us to understand where ABC works and where it fails, and how we could get beyond ABC…