Marco Banterle, Clara Grazian, Anthony Lee, and myself just arXived our paper “Accelerating Metropolis-Hastings algorithms by delayed acceptance”, which is a major revision and upgrade of our “Delayed acceptance with prefetching” paper of last June, a paper we submitted at the last minute to NIPS but which did not get accepted. The difference with this earlier version is the inclusion of convergence results, in particular that, while the original Metropolis-Hastings algorithm dominates the delayed version in the Peskun ordering, the latter can improve upon the original for an appropriate choice of the early-stage acceptance step. We thus included a new section on optimising the design of the delayed step, by picking the optimal scaling à la Roberts, Gelman and Gilks (1997) in the first step and by proposing a ranking of the factors in the Metropolis-Hastings acceptance ratio that speeds up the algorithm. The algorithm thus becomes adaptive. Compared with the earlier version, we have not pursued the second thread of prefetching as much, simply mentioning that prefetching and delayed acceptance could be merged. We have also included a section on the alternative suggested by Philip Nutzman on the ’Og of using a growing ratio rather than individual terms, the advantage being that the acceptance probability stabilises as the number of terms grows, with the drawback that expensive terms are not always computed last. In addition to our logistic and mixture examples, we also study in this version the MALA algorithm, since we can postpone computing the ratio of the proposals until the second step. The gain observed in one experiment is of the order of a ten-fold increase in efficiency.
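In code, the two-stage accept step can be sketched as follows: a minimal, non-adaptive version assuming the acceptance ratio factors into a cheap first term and an expensive second term, so that the costly term is only evaluated for proposals surviving the first stage. All function names are illustrative, not from the paper, and the toy target simply splits a standard normal log-density into two equal halves:

```python
import numpy as np

rng = np.random.default_rng(0)

def delayed_acceptance_mh(x0, propose, log_r1, log_r2, n_iter=1000):
    """Delayed-acceptance Metropolis-Hastings sketch.

    The log acceptance ratio is split as log r = log_r1 + log_r2:
    log_r1 is a cheap first-stage factor (e.g. a subset of likelihood
    terms plus the proposal ratio) and log_r2 the expensive remainder,
    computed only when the first stage accepts.
    """
    x = x0
    chain = [x]
    for _ in range(n_iter):
        y = propose(x)
        # Stage 1: cheap screen; reject early without touching log_r2.
        if np.log(rng.uniform()) < log_r1(y, x):
            # Stage 2: expensive factor, evaluated only for survivors.
            if np.log(rng.uniform()) < log_r2(y, x):
                x = y
        chain.append(x)
    return np.array(chain)

# Toy target: standard normal, each factor carrying half of log pi.
logf = lambda x: -0.25 * x**2
chain = delayed_acceptance_mh(
    x0=0.0,
    propose=lambda x: x + rng.normal(0.0, 1.5),  # symmetric random walk
    log_r1=lambda y, x: logf(y) - logf(x),       # cheap factor
    log_r2=lambda y, x: logf(y) - logf(x),       # "expensive" factor
    n_iter=20000,
)
```

Detailed balance holds because the product of the two stage-wise acceptance probabilities, min(1, r₁)·min(1, r₂), satisfies the usual reversibility ratio, at the price of a lower overall acceptance rate than plain Metropolis-Hastings.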
By comparison, and in answer to one comment on Andrew’s blog, we did not cover the HMC algorithm, since the preliminary acceptance step would require the construction of a proxy to the acceptance ratio, in order to avoid computing a costly number of derivatives in the discretised Hamiltonian integration.
Archive for NIPS
Bayesian optimization for likelihood-free inference of simulator-based statistical models [guest post]
Posted in Books, Statistics, University life with tags ABC, arXiv, Dennis Prangle, dimension curse, Gaussian processes, guest post, NIPS, nonparametric probability density estimation on February 17, 2015 by xi'an
Here are some comments on the paper of Gutmann and Corander. My brief read-through concentrated on the second half of the paper, the applied methodology, so my comments should be quite complementary to Christian’s on the theoretical part!
ABC algorithms generally follow the template of proposing parameter values, simulating datasets and accepting/rejecting/weighting the results based on similarity to the observations. The output is a Monte Carlo sample from a target distribution, an approximation to the posterior. The most naive proposal distribution for the parameters is simply the prior, but this is inefficient if the prior is highly diffuse compared to the posterior. MCMC and SMC methods can be used to provide better proposal distributions. Nevertheless they often still seem quite inefficient, requiring repeated simulations in parts of parameter space which have already been well explored.
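The naive version of this template can be sketched in a few lines of Python, here on a toy normal-mean example with a deliberately diffuse prior, using the sample mean as a sufficient summary statistic (all names and settings are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def abc_rejection(observed, prior_sample, simulate, distance,
                  n_draws=10000, eps=0.1):
    """Naive ABC rejection sampler: propose from the prior, simulate a
    dataset, keep draws whose summary lies within eps of the observed one."""
    accepted = []
    for _ in range(n_draws):
        theta = prior_sample()
        if distance(simulate(theta), observed) < eps:
            accepted.append(theta)
    return np.array(accepted)

# Toy example: infer the mean of a normal with known unit variance.
obs = 0.5  # observed summary (sample mean)
post = abc_rejection(
    observed=obs,
    prior_sample=lambda: rng.normal(0.0, 2.0),           # diffuse prior
    simulate=lambda th: rng.normal(th, 1.0, 50).mean(),  # summary of n=50 sample
    distance=lambda a, b: abs(a - b),
)
```

The inefficiency mentioned above is visible directly in the acceptance rate: with this diffuse prior only a few percent of the 10,000 simulations are kept, which is exactly the waste that MCMC, SMC, or surrogate-model proposals aim to reduce.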
The strategy of this paper is to instead attempt to fit a non-parametric model to the target distribution (or in fact to a slight variation of it). Hopefully this will require many fewer simulations. This approach is quite similar to Richard Wilkinson’s recent paper. Richard fitted a Gaussian process to the ABC analogue of the log-likelihood. Gutmann and Corander introduce two main novelties:
1. They model the expected discrepancy (i.e. distance) Δθ between the simulated and observed summary statistics, which is then transformed to estimate the likelihood. This is in contrast to Richard, who transformed the discrepancy before modelling, following the standard ABC approach of weighting the discrepancy by how close to 0 it is. The drawback of the latter approach is that it requires picking a tuning parameter (the ABC acceptance threshold or bandwidth) in advance of the algorithm. The new approach still requires a tuning parameter, but its choice can be delayed until the transformation is performed.
2. They generate the θ values on-line using “Bayesian optimisation”. The idea is to pick θ to concentrate on the region near the minimum of the objective function, and also to reduce uncertainty in the Gaussian process, so that well explored regions can usually be neglected. This is in contrast to Richard, who chose θs using a space-filling design prior to performing any simulations.
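As a rough illustration of novelty (2), here is a numpy-only sketch of Bayesian optimisation over a one-dimensional discrepancy: a Gaussian-process surrogate combined with a lower-confidence-bound acquisition rule. The acquisition is a generic stand-in, not the paper's Equation (45) criterion, and the kernel, lengthscale, and toy simulator are all assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

def rbf(a, b, scale=1.0):
    # squared-exponential kernel on 1-d inputs
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / scale**2)

def gp_posterior(theta_tr, y_tr, theta_grid, noise=1e-3):
    """GP posterior mean/sd on a grid, centring on the training mean."""
    ymean = y_tr.mean()
    K = rbf(theta_tr, theta_tr) + noise * np.eye(len(theta_tr))
    Ks = rbf(theta_grid, theta_tr)
    mean = Ks @ np.linalg.solve(K, y_tr - ymean) + ymean
    var = 1.0 - np.einsum("ij,ij->i", Ks, np.linalg.solve(K, Ks.T).T)
    return mean, np.sqrt(np.clip(var, 0.0, None))

def discrepancy(theta):
    # stand-in for an expensive simulator: noisy distance between
    # simulated and observed summaries, minimised near theta = 1
    return (theta - 1.0) ** 2 + 0.05 * rng.normal()

grid = np.linspace(-3.0, 3.0, 200)
thetas = list(rng.uniform(-3.0, 3.0, 3))  # a few initial design points
ys = [discrepancy(t) for t in thetas]
for _ in range(15):
    mean, sd = gp_posterior(np.array(thetas), np.array(ys), grid)
    # lower-confidence-bound acquisition: trade off small predicted
    # discrepancy against large posterior uncertainty
    t_next = grid[np.argmin(mean - 2.0 * sd)]
    thetas.append(t_next)
    ys.append(discrepancy(t_next))
```

After a handful of acquisitions the sampled θs cluster around the discrepancy minimum, which is the point of the approach: the expensive simulator is called mostly where it is informative, rather than on a prior or space-filling design.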
I didn’t read the paper’s theory closely enough to decide whether (1) is a good idea. Certainly the results for the paper’s examples look convincing. Also, one issue with Richard’s approach was that, because the log-likelihood varied over such a wide range of magnitudes, he needed to fit several “waves” of GPs. It would be nice to know whether modelling the discrepancy has removed this problem, or if a single GP is still sometimes an insufficiently flexible model.
Novelty (2) is a very nice and natural approach to take here. I did wonder why the particular criterion in Equation (45) was used to decide on the next θ. Does this correspond to optimising some information theoretic quantity? Other practical questions were whether it’s possible to parallelise the method (I seem to remember talking to Michael Gutmann about this at NIPS but can’t remember his answer!), and how well the approach scales up with the dimension of the parameters.
“We could choose to not meet at a huge annual meeting in which we take over a city. Every year, each participant going to the meeting uses a quantum of carbon that is more than considerable. Air travel, staying in hotels, all of this creates a way of living on the Earth that is carbon intensive. It could be otherwise.”
While I am not in the least interested in the conference or in the topics covered by this society, or yet in the benevolent religious activities suggested as a substitute, the notion of cancelling the behemoths that are our national and international academic meetings holds some appeal. I have posted several times on the topic, especially about JSM, and I have no clear and definitive answer to the question. Still, there is a lack of efficiency, on top of the environmental impact, that we could and should try to address. As I was thinking of those issues in the past week, I made another of my numerous “carbon footprints” by attending NIPS across the Atlantic for two workshops that ran in parallel with about twenty others. And which hence could have taken place in twenty different places. Albeit without the same exciting feeling of constant intellectual simmering. And without the same mix of highly interactive scholars from all over the planet. (Although the ABC in Montréal workshop seemed predominantly European!) Since workshops are in my opinion the most profitable type of meeting, I would like to experiment with a large meeting made of those (focussed and intense) workshops in such a way that academics would benefit without travelling long distances across the World. One idea would be to have local nodes where a large enough group of researchers could gather to attend video-conferences given from any of the other nodes and to interact locally in terms of discussions and poster presentations. This should even increase the feedback on selected papers, as small groups would more readily engage in discussing and criticising papers than a huge conference room. If we could build a World-wide web (!) of such nodes, we could then dream of a non-stop conference, with no central node, no gigantic conference centre, no terrifying beach-resort…
On Thursday, I will travel to Montréal for the two days of NIPS workshops there. On Friday, there is the ABC in Montréal workshop that I cannot but attend! (First occurrence of an “ABC in…” in North America! Sponsored by ISBA as well.) And on Saturday, there is the 3rd NIPS Workshop on Probabilistic Programming where I am invited to give a talk on… ABC! And maybe I will manage to sneak a peek at the nearby workshop on Advances in variational inference… (On a very personal side, I wonder if the weather will remain warm enough to go running in the early morning.)
Today in Warwick, I had a very nice discussion with Michael Betancourt on many statistical and computational issues, but at one point in the conversation we came upon the trouble of bridging the gap between the machine learning and statistics communities. While a conference like AISTATS is certainly contributing to this, it does not reach the main bulk of the statistics community. In Reykjavik, we had discussed the corresponding difficulty of publishing a longer and “more” statistical paper in a “more” statistical journal once the central idea had appeared in a machine learning conference proceeding like NIPS or AISTATS. We had this idea that creating a special fast-track in a mainstream statistics journal for a subset of those papers, using for instance a tailor-made committee from the original conference, or creating an annual survey of the top machine learning conference proceedings rewritten in a “more” statistical way (and once again selected by an ad hoc committee), would help, at not too high a cost, in inducing machine learners to make the extra effort of switching to another style. From there, we enlarged the suggestion to enlist a sufficient number of (diverse) bloggers at each major conference towards producing quick but sufficiently informative entries on their epiphany talks (if any), possibly supported by the conference organisers or the sponsoring societies. (I am always happy to welcome any guest blogger at conferences I attend!)