Bayesian optimization for likelihood-free inference of simulator-based statistical models [guest post]
Here are some comments on the paper of Gutmann and Corander. I only skimmed it briefly, concentrating on the second half of the paper, the applied methodology. So my comments should be quite complementary to Christian’s on the theoretical part!
ABC algorithms generally follow the template of proposing parameter values, simulating datasets and accepting/rejecting/weighting the results based on similarity to the observations. The output is a Monte Carlo sample from a target distribution, an approximation to the posterior. The most naive proposal distribution for the parameters is simply the prior, but this is inefficient if the prior is highly diffuse compared to the posterior. MCMC and SMC methods can be used to provide better proposal distributions. Nevertheless they often still seem quite inefficient, requiring repeated simulations in parts of parameter space which have already been well explored.
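To make the template concrete, here is a minimal sketch of plain rejection ABC with the prior as proposal. The toy model (20 draws from N(θ, 1), summarised by their mean), the uniform prior, the threshold of 0.2 and all function names are my own assumptions for illustration, not anything taken from the paper.

```python
import numpy as np

def rejection_abc(observed_summary, simulate, prior_sample, n_proposals, threshold, rng):
    """Propose from the prior, simulate, and keep the draws whose simulated
    summary lies within `threshold` of the observed summary."""
    accepted = []
    for _ in range(n_proposals):
        theta = prior_sample(rng)
        summary = simulate(theta, rng)
        if abs(summary - observed_summary) <= threshold:
            accepted.append(theta)
    return np.array(accepted)

# Hypothetical toy problem: the summary statistic is the mean of 20 N(theta, 1) draws.
rng = np.random.default_rng(0)
simulate = lambda theta, rng: rng.normal(theta, 1.0, size=20).mean()
prior_sample = lambda rng: rng.uniform(-10.0, 10.0)   # deliberately diffuse prior

n_proposals = 5000
samples = rejection_abc(2.0, simulate, prior_sample, n_proposals, 0.2, rng)
acceptance_rate = len(samples) / n_proposals
```

With a prior this diffuse only a small fraction of the proposals is accepted, which is exactly the inefficiency described above: most simulations are spent on parameter values that were never plausible.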
The strategy of this paper is to instead attempt to fit a non-parametric model to the target distribution (or in fact to a slight variation of it). Hopefully this will require far fewer simulations. This approach is quite similar to Richard Wilkinson’s recent paper. Richard fitted a Gaussian process to the ABC analogue of the log-likelihood. Gutmann and Corander introduce two main novelties:
1. They model the expected discrepancy (i.e. distance) Δθ between the simulated and observed summary statistics. This is then transformed to estimate the likelihood. This is in contrast to Richard, who transformed the discrepancy before modelling. Richard's transformation is the standard ABC one of weighting the discrepancy depending on how close to 0 it is. The drawback of that approach is that it requires picking a tuning parameter (the ABC acceptance threshold or bandwidth) in advance of the algorithm. The new approach still requires a tuning parameter, but its choice can be delayed until the transformation is performed.
2. They generate the θ values on-line using “Bayesian optimisation”. The idea is to pick θ to concentrate on the region near the minimum of the objective function, and also to reduce uncertainty in the Gaussian process. Thus well explored regions can usually be neglected. This is in contrast to Richard, who chose the θs using a space-filling design prior to performing any simulations.
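The two novelties can be sketched together on a toy problem. Everything below is an assumption made for illustration: the simulator, the hand-rolled Gaussian process with fixed hyperparameters, and in particular the lower-confidence-bound rule for choosing the next θ, which is a common Bayesian optimisation choice and not necessarily the paper's criterion. The final function shows one way the threshold h might be applied only after the discrepancy model has been fitted.

```python
import math
import numpy as np

def rbf_kernel(a, b, length_scale=1.0, variance=10.0):
    """Squared-exponential covariance between 1-D input arrays a and b."""
    d = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior(x_train, y_train, x_test, noise_var=0.05):
    """Posterior mean and variance of the latent function under a GP whose
    constant mean is set to the average of y_train (hyperparameters fixed)."""
    y_mean = y_train.mean()
    K = rbf_kernel(x_train, x_train) + noise_var * np.eye(len(x_train))
    K_s = rbf_kernel(x_train, x_test)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train - y_mean))
    v = np.linalg.solve(L, K_s)
    mean = y_mean + K_s.T @ alpha
    var = rbf_kernel(x_test, x_test).diagonal() - np.sum(v ** 2, axis=0)
    return mean, np.maximum(var, 0.0)

def discrepancy(theta, rng, observed=2.0):
    """Hypothetical simulator: distance between the mean of 20 N(theta, 1)
    draws and the observed mean."""
    return abs(rng.normal(theta, 1.0, size=20).mean() - observed)

rng = np.random.default_rng(1)
grid = np.linspace(-5.0, 5.0, 201)
thetas = list(np.linspace(-5.0, 5.0, 5))       # small initial design
deltas = [discrepancy(t, rng) for t in thetas]

for _ in range(25):
    mean, var = gp_posterior(np.array(thetas), np.array(deltas), grid)
    lcb = mean - 2.0 * np.sqrt(var)            # favours low mean or high uncertainty
    theta_next = grid[np.argmin(lcb)]
    thetas.append(theta_next)
    deltas.append(discrepancy(theta_next, rng))

# Final model of the discrepancy and the location of its estimated minimum.
mean, var = gp_posterior(np.array(thetas), np.array(deltas), grid)
theta_hat = grid[np.argmin(mean)]

def likelihood_proxy(h, noise_var=0.05):
    """P(discrepancy <= h) under the Gaussian predictive at each grid point,
    i.e. an unnormalised likelihood estimate for threshold h."""
    z = (h - mean) / np.sqrt(var + noise_var)
    return 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
```

Calling `likelihood_proxy(0.5)` and then `likelihood_proxy(0.2)` reuses the same 30 simulations, whereas a method that transforms the discrepancy before modelling would have to commit to one threshold up front.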
I didn’t read the paper’s theory closely enough to decide whether (1) is a good idea. Certainly the results for the paper’s examples look convincing. Also, one issue with Richard’s approach was that because the log-likelihood varied over such a wide range of magnitudes, he needed to fit several “waves” of GPs. It would be nice to know if the approach of modelling the discrepancy has removed this problem, or if a single GP is still sometimes an insufficiently flexible model.
Novelty (2) is a very nice and natural approach to take here. I did wonder why the particular criterion in Equation (45) was used to decide on the next θ. Does this correspond to optimising some information-theoretic quantity? Other practical questions were whether it’s possible to parallelise the method (I seem to remember talking to Michael Gutmann about this at NIPS but can’t remember his answer!), and how well the approach scales with the dimension of the parameters.