In connection with the official launch of the Alan Turing Institute (or ATI, of which Warwick is a partner), the Institute funded an ATI scoping workshop, held a week ago yesterday in Warwick, around the notion(s) of intractable likelihood(s) and how this could/should fit within the themes of the Institute [hence the scoping]. This is one among many such scoping workshops taking place at all partners, as reported on the ATI website. The workshop was quite relaxed and great fun, if only for getting together with most people (and friends) in the UK interested in the topic. But it also pointed out some new themes I had not previously thought of as related to ilike. For instance, questioning the relevance of likelihood for inference and putting forward decision theory under model misspecification, connecting with privacy and ethics [hence making intractable “good”!], introducing uncertain likelihood, getting more into network models, RKHS as a natural summary statistic, swarm of solutions for consensus inference… (And thanks to Mark Girolami for this homage to the iconic LP of the Sex Pistols!, that I played maniacally all over 1978…) My own two cents in the discussion were mostly variations on other discussions, borrowing from ABC (and ABC slides) to call for a novel approach to approximate inference.
This recently arXived paper by Weixuan Zhu, Juan Miguel Marín, and Fabrizio Leisen proposes an alternative to our empirical likelihood ABC paper of 2013, or BCel. Besides the mostly personal appeal for me to report on a Juan Miguel Marín working [in Madrid] on ABC topics, alongside my friend Jean-Michel Marin!, this paper is another entry on ABC that connects with yet another statistical perspective, namely the bootstrap. The proposal, called BCbl, is based on a reference paper by Davison, Hinkley and Worton (1992) which defines a bootstrap likelihood, a notion that relies on a double-bootstrap step to produce a non-parametric estimate of the distribution of a given estimator of the parameter θ. This estimate includes a smooth curve-fitting step, for which little description is available in the current paper. The bootstrap non-parametric substitute then plays the role of the actual likelihood, with no correction for the substitution, just as in our BCel. Both approaches are convergent, with Monte Carlo simulations exhibiting similar or even identical convergence speeds although [unsurprisingly!] no deep theory is available on the comparative advantage.
An important issue from my perspective is that, while the empirical likelihood approach relies on a choice of identifying constraints that strongly impact the numerical value of the likelihood approximation, the bootstrap version starts directly from a subjectively chosen estimator of θ, which may also impact the numerical value of the likelihood approximation. In some ABC settings, finding a primary estimator of θ may be a real issue or a computational burden. Except when using a preliminary ABC step as in semi-automatic ABC. This would be an interesting crash-test for the BCbl proposal! (This would not necessarily increase the computational cost by a large amount.) In addition, I am not sure the method easily extends to larger collections of summary statistics such as those used in ABC, in particular because it necessarily relies on non-parametric estimates, which only operate in small enough dimensions where smooth curve-fitting algorithms can be used. Critically, the paper only processes examples with a few parameters.
The comparisons between BCel and BCbl that are produced in the paper show some gain in favour of BCbl. Obviously, this depends on the respective calibrations of the non-parametric methods and of regular ABC, as well as on the available computing time. I find the population genetic example somewhat puzzling: the paper refers to our composite likelihood to set the moment equations. Since this is a pseudo-likelihood, I wonder how the authors select their parameter estimates in the double-bootstrap experiment. And for the Ising model, it is not straightforward to conceive of a bootstrap algorithm: (a) how does one subsample pixels? and (b) what are the validity guarantees for the estimation procedure?
“The main task of this article is to construct low-dimensional and informative summary statistics for ABC methods.”
The idea in the paper “Learning Summary Statistic for ABC via Deep Neural Network”, arXived a few days ago, is to start from the raw data and build a “deep neural network” (meaning a multiple layer neural network), whose calibration never seems an issue, to provide a non-linear regression of the parameters over the data. (There is a rather militant tone to the justification of the approach, not that unusual with proponents of deep learning approaches, I must add…) The neural construct is called to produce an estimator (function) of θ, θ(x), which is then used as the summary statistic. Meaning, if Theorem 1 is to be taken as the proposal, that a different ABC needs to be run for every function of interest. Or, in other words, that the method is not reparameterisation invariant.
The paper claims to achieve the same optimality properties as in Fearnhead and Prangle (2012). These are however moderate optimalities in that they are obtained for the tolerance ε equal to zero. And using the exact posterior expectation as a summary statistic, instead of a non-parametric estimate. And an infinite functional basis in Theorem 2. I thus see little added value in results like Theorem 2 and no real optimality: that the ABC distribution can be arbitrarily close to the exact posterior is not a helpful statement when implementing the method.
The first example in the paper is the posterior distribution associated with the Ising model, which enjoys a sufficient statistic of dimension one. The issue of generating pseudo-data from the Ising model is sidestepped by a call to a Gibbs sampler, but remains an intrinsic problem as the convergence of the Gibbs sampler depends on the value of the parameter θ and especially its location with respect to the critical point. Both ABC posteriors are shown to be quite close.
The second example is the posterior distribution associated with an MA(2) model, apparently turning into a benchmark in the ABC literature. The comparison between an ABC based on the first two autocorrelations, an ABC based on the semi-automatic solution of Fearnhead and Prangle (2012) [for which collection of summaries?], and the neural network proposal, leads to the dismissal of the semi-automatic solution, with the neural net being closest to the exact posterior [with the same tolerance quantile ε for all approaches].
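For readers unfamiliar with this benchmark, here is a minimal sketch of the simplest of the three competitors, namely rejection ABC on an MA(2) model with the first two empirical autocorrelations as summaries and a tolerance set as a quantile of the simulated distances. It is not the code of the paper: the function names, the uniform prior over the invertibility triangle, the Euclidean distance, and the simulation sizes are all assumptions made for the sake of illustration.

```python
import random


def simulate_ma2(theta1, theta2, n, rng):
    """Simulate y_t = e_t + theta1 * e_{t-1} + theta2 * e_{t-2} with N(0,1) noise."""
    e = [rng.gauss(0.0, 1.0) for _ in range(n + 2)]
    return [e[t + 2] + theta1 * e[t + 1] + theta2 * e[t] for t in range(n)]


def autocorrelations(y):
    """First two empirical autocorrelations (no centring, the model has mean zero)."""
    n = len(y)
    c0 = sum(v * v for v in y) / n
    c1 = sum(y[t] * y[t + 1] for t in range(n - 1)) / n
    c2 = sum(y[t] * y[t + 2] for t in range(n - 2)) / n
    return (c1 / c0, c2 / c0)


def sample_prior(rng):
    """Uniform draw over the triangle -2 < t1 < 2, t1 + t2 > -1, t1 - t2 < 1 (assumed prior)."""
    while True:
        t1 = rng.uniform(-2.0, 2.0)
        t2 = rng.uniform(-1.0, 1.0)
        if t1 + t2 > -1.0 and t1 - t2 < 1.0:
            return (t1, t2)


def abc_rejection(y_obs, n_sims, keep_fraction, rng):
    """Keep the parameter draws whose simulated summaries are closest to the observed ones."""
    s_obs = autocorrelations(y_obs)
    table = []
    for _ in range(n_sims):
        theta = sample_prior(rng)
        s = autocorrelations(simulate_ma2(theta[0], theta[1], len(y_obs), rng))
        dist = sum((a - b) ** 2 for a, b in zip(s, s_obs)) ** 0.5
        table.append((dist, theta))
    table.sort(key=lambda pair: pair[0])
    n_keep = max(1, round(keep_fraction * n_sims))
    return [theta for _, theta in table[:n_keep]]


if __name__ == "__main__":
    rng = random.Random(1)
    y_obs = simulate_ma2(0.6, 0.2, 100, rng)  # pseudo-observed series
    sample = abc_rejection(y_obs, n_sims=2000, keep_fraction=0.05, rng=rng)
    print(len(sample), "draws kept")
```

Swapping `autocorrelations` for any other function of the series, a fitted regression or neural net included, changes the summary without touching the rest, which is why the comparison only makes sense at a common tolerance quantile and on a common reference table.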
A discussion crucially missing from the paper, from my perspective, is an accounting for size: First, what is the computing cost of fitting and calibrating and storing a neural network for the sole purpose of constructing a summary statistic? Once the neural net is constructed, I would assume most users would see little need in pursuing the experiment any further. (This was also why we stopped at our random forest output rather than using it as a summary statistic.) Second, how do cost and performance evolve as the dimension of the parameter θ grows? I would deem it necessary to understand when the method fails. As for instance in latent variable models such as HMMs. Third, how does the size of the sample impact cost and performance? In many realistic cases when ABC applies, it is not possible to use the raw data, given its size, and summary statistics are a given. For such examples, neural networks should be compared with other ABC solutions, using the same reference table.
Today I gave a talk on Approximate Bayesian model choice via random forests at the yearly SPA (Stochastic Processes and their Applications) 2015 conference, taking place in Oxford (a nice town near Warwick) this year. In Keble College more precisely. The slides are below and while they are mostly repetitions of earlier slides, there is a not inconsequential novelty in the presentation, namely that I included our most recent and current perspective on ABC model choice. Indeed, when travelling to Montpellier two weeks ago, we realised that there was a way to solve our posterior probability conundrum!
Despite the heat wave that rolled all over France that week, we indeed figured out a way to estimate the posterior probability of the selected (MAP) model, a way that we had deemed beyond our reach in previous versions of the talk and of the paper. The fact that we could not provide an estimate of this posterior probability and had to rely instead on a posterior expected loss was one of the arguments used by the PNAS reviewers in rejecting the paper. While the posterior expected loss remains a quantity worth approximating and reporting, the idea that stemmed from meeting together in Montpellier is that (i) the posterior probability of the MAP is actually related to another posterior loss, when conditioning on the observed summary statistics and (ii) this loss can itself be estimated via a random forest, since it is another function of the summary statistics. A posteriori, this sounds trivial but we had to have a new look at the problem to realise that using ABC samples was not the only way to produce an estimate of the posterior probability! (We are now working on the revision of the paper for resubmission within a few weeks… Hopefully before JSM!)
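In symbols, and with a notation assumed for this illustration rather than lifted from the paper (M for the model index, s_obs for the observed summary statistics, and M̂(s) for the model selected by the classification forest), point (i) reads as:

```latex
\mathbb{P}\bigl(M = \hat{M}(s_{\text{obs}}) \mid s_{\text{obs}}\bigr)
  \;=\; 1 - \mathbb{E}\bigl[\mathbb{1}\{M \neq \hat{M}(s_{\text{obs}})\} \mid s_{\text{obs}}\bigr]
```

The conditional expectation on the right-hand side is a function of the summary statistics alone, which is what makes point (ii) possible: it can be learned by regressing the misclassification indicator on the summaries over the reference table, presumably with some care to avoid reusing the simulations that trained the classifier.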
“By accepting of having obtained a poor approximation to the posterior, except for the location of its main mode, we switch to maximum likelihood estimation.”
Presumably the first paper ever quoting from the ‘Og! Indeed, Umberto Picchini arXived a paper about a technique merging ABC with prior feedback (rechristened data cloning by S. Lele), where a maximum likelihood estimate is produced by an ABC-MCMC algorithm. For state-space models. This relates to an earlier paper by Fabio Rubio and Adam Johansen (Warwick), who also suggested using ABC to approximate the maximum likelihood estimate. Here, the idea is to use an increasing number of replicates of the latent variables, as in our SAME algorithm, to spike the posterior around the maximum of the (observed) likelihood. An ABC version of this posterior returns a mean value as an approximate maximum likelihood estimate.
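To see why replicating the data (or the latent variables) K times spikes the posterior around the maximum likelihood estimate, note that the cloned posterior is proportional to the prior times the likelihood raised to the power K. Here is a minimal sketch on a toy conjugate case where everything is available in closed form, a Normal mean with known variance and a Normal prior. The model, the function name and the default values are assumptions of mine for illustration, with no connection to the state-space examples of the paper.

```python
def cloned_posterior(y, k, prior_mean=0.0, prior_var=10.0, noise_var=1.0):
    """Posterior mean and variance of theta when the likelihood is raised to the power k.

    Toy model (assumed): y_i ~ N(theta, noise_var) independently, theta ~ N(prior_mean, prior_var).
    Raising the likelihood to the power k amounts to observing k copies of the sample.
    """
    n = len(y)
    y_bar = sum(y) / n
    precision = 1.0 / prior_var + k * n / noise_var
    mean = (prior_mean / prior_var + k * n * y_bar / noise_var) / precision
    return mean, 1.0 / precision


if __name__ == "__main__":
    y = [1.2, 0.8, 1.5, 0.9]  # the MLE is the sample mean
    for k in (1, 10, 100):
        mean, var = cloned_posterior(y, k)
        print(k, mean, var)
```

As K grows, the posterior mean moves to the sample mean (the MLE) and the posterior variance shrinks at rate 1/K, whatever the prior. In the ABC version, this concentration is only approximate since the likelihood is itself replaced by its ABC surrogate.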
“This is a so-called “likelihood-free” approach [Sisson and Fan, 2011], meaning that knowledge of the complete expression for the likelihood function is not required.”
The above remark is sort of inappropriate in that it applies to a non-ABC setting where the latent variables are simulated from the exact marginal distributions, that is, unconditional on the data, and hence their density cancels in the Metropolis-Hastings ratio. This pre-dates ABC by a few years, since this was an early version of the particle filter.
“In this work we are explicitly avoiding the most typical usage of ABC, where the posterior is conditional on summary statistics of data S(y), rather than y.”
Another point I find rather negative, in that, for state-space models, using the entire time-series as a “summary statistic” is unlikely to produce a good approximation.
The discussion on the respective choices of the ABC tolerance δ and on the prior feedback number of copies K is quite interesting, in that Umberto Picchini suggests setting δ first before increasing the number of copies. However, since the posterior gets more and more peaked as K increases, the consequences on the acceptance rate of the related ABC algorithm are unclear. Another interesting feature is that the underlying MCMC proposal on the parameter θ is an independent proposal, tuned during the warm-up stage of the algorithm. Since the tuning is repeated at each temperature, there are some loose ends as to whether or not it is a genuine Markov chain method. The same question arises when considering that additional past replicas need to be simulated when K increases. (Although they can be considered as virtual components of a vector made of an infinite number of replicas, to be used when needed.)
The simulation study involves a regular regression with 101 observations, a stochastic Gompertz model studied by Sophie Donnet, Jean-Louis Foulley, and Adeline Samson in 2010. With 12 points. And a simple Markov model. Again with 12 points. While the ABC-DC solutions are close enough to the true MLEs whenever available, a comparison with the cheaper ABC Bayes estimates would have been of interest as well.
Our random forest paper was alas rejected last week. Alas because I think the approach is a significant advance in ABC methodology when implemented for model choice, avoiding the delicate selection of summary statistics and the report of shaky posterior probability approximations. Alas also because the referees somewhat missed the point, apparently perceiving random forests as a way to project a large collection of summary statistics on a limited dimensional vector as in the Read Paper of Paul Fearnhead and Dennis Prangle, while the central point in using random forests is the avoidance of a selection or projection of summary statistics. They also dismissed our approach based on the argument that the reduction in error rate brought by random forests over LDA or standard (k-nn) ABC is “marginal”, which indicates a degree of misunderstanding of what the classification error stands for in machine learning: in supervised learning with a large number of classes, the error rate cannot be brought arbitrarily close to zero, which bounds the maximum possible gain. Last but not least, the referees did not appreciate why we mostly cannot trust posterior probabilities produced by ABC model choice and hence why the posterior error loss is a valuable and almost inevitable machine learning alternative, dismissing the posterior expected loss as being not Bayesian enough (or at all), for “averaging over hypothetical datasets” (which is a replicate of Jeffreys’ famous criticism of p-values)! Certainly a first time for me to be rejected based on this argument!
A paper on the comparison of emulation methods for Approximate Bayesian Computation was recently arXived by Jabot et al. The idea is to bypass costly simulations of pseudo-data by running cheaper simulations from a pseudo-model or emulator constructed via a preliminary run of the original and costly model. To borrow from the paper's introduction, ABC-Emulation runs as follows:
- design a small number n of parameter values covering the parameter space;
- generate n corresponding realisations from the model and store the corresponding summary statistics;
- build an emulator (model) based on those n values;
- run ABC using the emulator in lieu of the original model.
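The four steps above can be sketched as follows, with a one-dimensional toy simulator standing in for the costly model and a local linear regression of the summary statistic on the parameter as the emulator. Everything here is an assumption for illustration (the stand-in `costly_model`, the regular design, the Epanechnikov-type weights, the bandwidth, the uniform prior and the tolerance), not the implementation of Jabot et al.

```python
import random


def costly_model(theta, rng, m=50):
    """Stand-in for an expensive simulator, returning a single summary statistic."""
    return sum(rng.gauss(theta, 1.0) for _ in range(m)) / m


def build_emulator(design, values, bandwidth):
    """Return a function predicting the summary at theta0 by local linear regression."""
    def emulator(theta0):
        # Epanechnikov-type weights: zero beyond one bandwidth from theta0
        weights = [max(0.0, 1.0 - ((t - theta0) / bandwidth) ** 2) for t in design]
        total = sum(weights)
        if total == 0.0:
            raise ValueError("no design point within one bandwidth of theta0")
        t_bar = sum(w * t for w, t in zip(weights, design)) / total
        s_bar = sum(w * s for w, s in zip(weights, values)) / total
        spread = sum(w * (t - t_bar) ** 2 for w, t in zip(weights, design))
        if spread == 0.0:
            return s_bar  # a single design point carries weight: local constant fit
        slope = sum(w * (t - t_bar) * (s - s_bar)
                    for w, t, s in zip(weights, design, values)) / spread
        return s_bar + slope * (theta0 - t_bar)
    return emulator


def abc_with_emulator(s_obs, emulator, n_draws, eps, rng, lower=-5.0, upper=5.0):
    """Rejection ABC under a uniform prior, calling the emulator instead of the model."""
    accepted = []
    for _ in range(n_draws):
        theta = rng.uniform(lower, upper)
        if abs(emulator(theta) - s_obs) < eps:
            accepted.append(theta)
    return accepted


if __name__ == "__main__":
    rng = random.Random(0)
    design = [-5.0 + 0.25 * i for i in range(41)]        # step 1: cover the parameter space
    values = [costly_model(t, rng) for t in design]      # step 2: n costly simulations
    emulator = build_emulator(design, values, 1.0)       # step 3: fit the emulator
    sample = abc_with_emulator(1.0, emulator, 5000, 0.1, rng)  # step 4: ABC on the emulator
    print(len(sample), "draws accepted")
```

Note that this emulator returns a point prediction of the summary and thus ignores its sampling variability given the parameter, so the resulting ABC sample is more concentrated than one based on the original model.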
A first emulator proposed in the paper is to use local regression, as in Beaumont et al. (2002), except that it goes the reverse way: the regression model predicts a summary statistic given the parameter value. The second and last emulator relies on Gaussian processes, as in Richard Wilkinson’s as well as Ted Meeds’s and Max Welling’s recent work [also quoted in the paper]. The comparison of the above emulators is based on an ecological community dynamics model. The results are that the stochastic version is superior to the deterministic one, but overall not very useful when implementing the Beaumont et al. (2002) correction. The paper however does not define what deterministic and what stochastic mean…
“We therefore recommend the use of local regressions instead of Gaussian processes.”
While I find the conclusions of the paper somewhat over-optimistic given the range of the experiment and the limitations of the emulator options (like non-parametric conditional density estimation), it seems to me that this is a direction to be pursued as we need to be able to simulate directly a vector of summary statistics instead of the entire data process, even when considering an approximation to the distribution of those summaries.