Matt Moores, Tony Pettitt, and Kerrie Mengersen arXived a paper yesterday comparing different computational approaches to the processing of hidden Potts models and of the intractable normalising constant in the Potts model. This is a very interesting paper, first because it provides a comprehensive survey of the main methods used in handling this annoying normalising constant Z(β), namely pseudo-likelihood, the exchange algorithm, path sampling (a.k.a., thermal integration), and ABC. A massive simulation experiment with individual simulation times up to 400 hours leads to select path sampling (what else?!) as the (XL) method of choice. Thanks to a pre-computation of the expectation of the sufficient statistic E[S(Z)|β]. I just wonder why the same was not done for ABC, as in the recent Statistics and Computing paper we wrote with Matt and Kerrie. As it happens, I was actually discussing yesterday in Columbia of potential if huge improvements in processing Ising and Potts models by approximating first the distribution of S(X) for some or all β before launching ABC or the exchange algorithm. (In fact, this is a more generic desiderata for all ABC methods that simulating directly if approximately the summary statistics would being huge gains in computing time, thus possible in final precision.) Simulating the distribution of the summary and sufficient Potts statistic S(X) reduces to simulating this distribution with a null correlation, as exploited in Cucala and Marin (2013, JCGS, Special ICMS issue). However, there does not seem to be an efficient way to do so, i.e. without reverting to simulating the entire grid X…
Archive for Approximate Bayesian computation
A paper on the comparison of emulation methods for Approximate Bayesian Computation was recently arXived by Jabot et al. The idea is to bypass costly simulations of pseudo-data by running cheaper simulation from a pseudo-model or emulator constructed via a preliminary run of the original and costly model. To borrow from the paper introduction, ABC-Emulation runs as follows:
- design a small number n of parameter values covering the parameter space;
- generate n corresponding realisations from the model and store the corresponding summary statistics;
- build an emulator (model) based on those n values;
- run ABC using the emulator in lieu of the original model.
A first emulator proposed in the paper is to use local regression, as in Beaumont et al. (2002), except that it goes the reverse way: the regression model predicts a summary statistics given the parameter value. The second and last emulator relies on Gaussian processes, as in Richard Wilkinson‘s as well as Ted Meeds’s and Max Welling‘s recent work [also quoted in the paper]. The comparison of the above emulators is based on an ecological community dynamics model. The results are that the stochastic version is superior to the deterministic one, but overall not very useful when implementing the Beaumont et al. (2002) correction. The paper however does not define what deterministic and what stochastic mean…
“We therefore recommend the use of local regressions instead of Gaussian processes.”
While I find the conclusions of the paper somewhat over-optimistic given the range of the experiment and the limitations of the emulator options (like non-parametric conditional density estimation), it seems to me that this is a direction to be pursued as we need to be able to simulate directly a vector of summary statistics instead of the entire data process, even when considering an approximation to the distribution of those summaries.
Our paper about evaluating statistics used for ABC model choice has just appeared in Series B! It somewhat paradoxical that it comes out just a few days after we submitted our paper on using random forests for Bayesian model choice, thus bypassing the need for selecting those summary statistics by incorporating all statistics available and letting the trees automatically rank those statistics in term of their discriminating power. Nonetheless, this paper remains an exciting piece of work (!) as it addresses the more general and pressing question of the validity of running a Bayesian analysis with only part of the information contained in the data. Quite usefull in my (biased) opinion when considering the emergence of approximate inference already discussed on this ‘Og…
[As a trivial aside, I had first used fresh from the press(es) as the bracketted comment, before I realised the meaning was not necessarily the same in English and in French.]
Here are comments by Olli following my post:
I think we found a general means to obtain accurate ABC in the sense of matching the posterior mean or MAP exactly, and then minimising the KL distance between the true posterior and its ABC approximation subject to this condition. The construction works on an auxiliary probability space, much like indirect inference. Now, we construct this probability space empirically, this is where our approach differs first from indirect inference and this is where we need the “summary values” (>1 data points on a summary level; see Figure 1 for clarification). Without replication, we cannot model the distribution of summary values but doing so is essential to construct this space. Now, lets focus on the auxiliary space. We can fiddle with the tolerances (on a population level) and m so that on this space, the ABC approximation has the aforesaid properties. All the heavy technical work is in this part. Intuitively, as m increases, the power increases for sufficiently regular tests (see Figure 2) and consequently, for calibrated tolerances, the ABC approximation on the auxiliary space goes tighter. This offsets the broadening effect of the tolerances, so having non-identical lower and upper tolerances is fine and does not hurt the approximation. Now, we need to transport the close-to-exact ABC approximation on the auxiliary space back to the original space. We need some assumptions here, and given our time series example, it seems these are not unreasonable. We can reconstruct the link between the auxiliary space and the original parameter space as we accept/reject. This helps us understand (with the videos!) the behaviour of the transformation and to judge if its properties satisfy the assumptions of Theorems 2-4. While we offer some tools to understand the behaviour of the link function, yes, we think more work could be done here to improve on our first attempt to accurate ABC.
Now some more specific comments:
“The paper also insists over and over on sufficiency, which I fear is a lost cause.” To clarify, all we say is that on the simple auxiliary space, sufficient summaries are easily found. For example, if the summary values are normally distributed, the sample mean and the sample variance are sufficient statistics. Of course, this is not the original parameter space and we only transform the sufficiency problem into a change of variable problem. This is why we think that inspecting and understanding the link function is important.
“Another worry is that the … test(s) rel(y) on an elaborate calibration”. We provide some code here for everyone to try out. In our examples, this did not slow down ABC considerably. We generally suppose that the distribution of the summary values is simple, like Gaussian, Exponential, Gamma, ChiSquare, Lognormal. In these cases, the ABC approximation takes on an easy-enough-to-calibrate-fast functional form on the auxiliary space.
“This Theorem 3 sounds fantastic but makes me uneasy: unbiasedness is a sparse property that is rarely found in statistical problems. … Witness the use of “essentially unbiased” in Fig. 4.” What Theorem 3 says is that if unbiasedness can be achieved on the simple auxiliary space, then there are regularity conditions under which these properties can be transported back to the original parameter space. We hope to illustrate these conditions with our examples, and to show that they hold in quite general cases such as the time series application. The thing in Figure 4 is that the sample autocorrelation is not an unbiased estimator of the population autocorrelation. So unbiasedness does not quite hold on the auxiliary space and the conditions of Theorem 3 are not satisfied. Nevertheless, we found this bias to be rather negligible in our example and the bigger concern was the effect of the link function.
And here are Olli’s slides:
As posted in the previous entry, Olli Ratman, Anton Camacho, Adam Meijer, and Gé Donker arXived their paper on accurate ABC. A paper which [not whose!] avatars I was privy to in the past six months! While I acknowledge the cleverness of the reformulation of the core ABC accept/reject step as a statistical test, and while we discussed the core ideas with Olli and Anton when I visited Gatsby, the paper still eludes me to some respect… Here is why. (Obviously, you should read this rich & challenging paper first for the comments to make any sense! And even then they may make little sense…)
The central idea of this accurate ABC [aABC? A²BC?] is that, if the distribution of the summary statistics is known and if replicas of those summary statistics are available for the true data (and less problematically for the generated data), then a classical statistical test can be turned into a natural distance measure for each statistics and even “natural” bounds can be found on that distance, to the point of recovering most properties of the original posterior distribution… A first worry is this notion that the statistical distribution of a collection of summary statistics is available in closed form: this sounds unrealistic even though it may not constitute a major contention issue. Indeed, replacing a tailored test with a distribution-free test of identical location parameter could not hurt that much. [Just the power. If that matters… See bellow.] The paper also insists over and over on sufficiency, which I fear is a lost cause. In my current understanding of ABC, the loss of some amount of information contained in the data should be acknowledged and given a write-off as a Big Data casualty. (See, e.g., Lemma 1.)
Another worry is that the rephrasing of the acceptance distance as the maximal difference for a particular test relies on an elaborate calibration, incl. α, c+, τ+, &tc. (I am not particularly convinced by the calibration in terms of the power of the test being maximised at the point null value. Power?! See bellow, once again.) When cumulating tests and aiming at a nominal α level, the orthogonality of the test statistics in Theorem 1(iii) is puzzling and I think unrealistic.
The notion of accuracy that is central to the paper and its title corresponds to the power of every test being maximal at the true value of the parameter. And somehow to the ABC approximation being maximises at the true parameter, even though I am lost by then [i.e. around eqn (18)] about the meaning of ρ*… The major result in the paper is however that, under the collection of assumptions made therein, the ABC MLE and MAP versions are equal to their exact counterparts. And that these versions are also unbiased. This Theorem 3 sounds fantastic but makes me uneasy: unbiasedness is a sparse property that is rarely found in statistical problems. Change the parameterisation and you loose unbiasedness. And even the possibility to find an unbiased estimator. Since this difficulty does not appear in the paper, I would conclude that either the assumptions are quite constraining or the result holds in a weaker sense… (Witness the use of “essentially unbiased” in Fig. 4.)
This may be a wee rude comment (even for a Frenchman) but I also felt the paper could be better written in that notations pop in unannounced. For instance, on page 2, x [the data] becomes x1:n becomes sk1:n. This seems to imply that the summary statistics are observed repeatedly over the true sample. Unless n=1, this does not seem realistic. (I do not understand everything in Example 1, in particular the complaint that the ABC solutions were biased for finite values of n. That sounds like an odd criticism of Bayesian estimators. Now, it seems the paper is very intent on achieving unbiasedness! So maybe it should be called the aAnsBC algorithm for “not-so-Bayes!) I am also puzzled by the distinction between summary values and summary statistics. This sounds like insisting on having a large enough iid dataset. Or, on page 5, the discussion that the summary parameters are replaced by estimates seems out of context because this adds an additional layer of notation to the existing summary “stuff”… With the additional difficulty that Lemma 1 assumes reparameterisation of the model in terms of those summary parameters. I also object to the point null hypotheses being written in terms of a point estimate, i.e. of a quantity depending on the data x: it sounds like confusing the test [procedure] with the test [problem]. Another example: I read several times Lemma 5 about the calibration of the number of ABC simulations m but cannot fathom what this m is calibrated against. It seems only a certain value of m achieves the accurate correspondence with the genuine posterior, which sounds counter-intuitive. Last counter-example: some pictures seemed to be missing in the Appendix, but as it happened, it is only my tablet being unable to process them! S2 is actually a movie about link functions, really cool!
In conclusion, this is indeed a rich and challenging paper. I am certain I will get a better understanding by listening to Olli’s talk in Roma. And discussing with him.
Anoop Korattikara, Yutian Chen and Max Welling recently arXived a paper on the appeal of using only part of the data to speed up MCMC. This is different from the growing literature on unbiased estimators of the likelihood exemplified by Andrieu & Roberts (2009). Here, the approximation to the true target is akin to the approximation in ABC algorithms in that a value of the parameter is accepted if the difference in the likelihoods is larger than a given bound. Expressing this perspective as a test on the mean of the log likelihood leads the authors to use instead a subsample from the whole sample. (The approximation level ε is then a bound on the p-value.) While this idea only applies to iid settings, it is quite interesting and sounds a wee bit like a bootstrapped version of MCMC. Especially since it sounds as if it could provide an auto-evaluation of its error.
I came to Venezia today to give a series of two talks on recent advances in ABC, within a workshop on Likelihood, approximate likelihood and nonparametric statistical methods for complex applications, organised by the Ca’ Foscari University, and marking the end of two 2008 research projects.
It is always the same pleasure to stay in Venezia, to roam the back streets away from the crowds, and to discover hidden passageways, magnificent buildings, and unsuspected trattorias with the most exquisite sea foods (being with Italian friends obviously helping a lot!)… I actually did a short run this morning, spending more time waiting for the sun to come behind San Giorggio Maggiore. Here are the slides of my tutorial, mostly for the benefit of the workshop attendees, as they replicate those I gave in Bristol last week. (Plus the two sections on ABC model choice.)