## checking ABC convergence via coverage

**D**ennis Prangle, Michael Blum, G. Popovic and Scott Sisson just arXived a paper on diagnostics for ABC validation via coverage diagnostics. Getting valid approximation diagnostics for ABC is clearly and badly needed and this was the last slide of my talk yesterday at the Winter Workshop in Gainesville. When simulation time is not an issue (!), our DIYABC software does implement a limited coverage assessment by computing the type I error, i.e. by simulating data under the null model and evaluating the number of times it is rejected at the 5% level (see sections 2.11.3 and 3.8 in the documentation). The current paper builds on a similar perspective.

**T**he idea in the paper is that a (Bayesian) credible interval at a given credible level α should have a similar confidence level (at least asymptotically and even more for matching priors) and that simulating pseudo-data with a known parameter value allows for a Monte-Carlo evaluation of the “true” coverage of the credible interval, hence for a calibration of the tolerance. The delicate issue is about the generation of those “known” parameters. For instance, if the pair (θ, y) is generated from the joint distribution prior × likelihood, and if the credible region is also based on the true posterior, the average coverage is the nominal one. On the other hand, if the credible interval is based on a poor (ABC) approximation to the posterior, the average coverage should differ from the nominal one. Given that ABC is *always* wrong, however, this may fail to be a powerful diagnostic. In particular, when using *insufficient* (summary) statistics, the discrepancy should make testing for uniformity harder, shouldn’t it?
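To make the Monte-Carlo evaluation concrete, here is a minimal sketch on a toy model that is not taken from the paper: θ ~ N(0,1), a single observation y|θ ~ N(θ,1), and an "ABC" posterior obtained with a Gaussian kernel of tolerance `eps`, which in this conjugate setting is available in closed form. The function name `coverage` and all numerical settings are assumptions made for illustration only.

```python
import random
from statistics import NormalDist


def coverage(eps, alpha=0.9, n_rep=20000, seed=1):
    """Monte Carlo coverage of the central alpha-interval of an approximate posterior.

    Toy model (an assumption, not the paper's example):
    theta ~ N(0,1), y | theta ~ N(theta,1), Gaussian ABC kernel with tolerance eps,
    so the approximate posterior is the exact posterior for y | theta ~ N(theta, 1 + eps^2).
    """
    rng = random.Random(seed)
    lo, hi = (1 - alpha) / 2, (1 + alpha) / 2
    s2 = 1.0 + eps ** 2  # variance of the kernel-inflated likelihood
    hits = 0
    for _ in range(n_rep):
        # draw the "known" parameter and its pseudo-data from prior x likelihood
        theta0 = rng.gauss(0.0, 1.0)
        y0 = rng.gauss(theta0, 1.0)
        # approximate posterior G_y: N(y / (1 + s2), s2 / (1 + s2))
        approx = NormalDist(mu=y0 / (1.0 + s2), sigma=(s2 / (1.0 + s2)) ** 0.5)
        # theta0 is inside the central interval iff G_y(theta0) lies in [lo, hi]
        p = approx.cdf(theta0)
        hits += lo <= p <= hi
    return hits / n_rep


if __name__ == "__main__":
    for eps in (0.0, 1.0, 1000.0):
        print(eps, coverage(eps))
```

In this toy setting, `eps = 0` recovers the true posterior and the estimate should sit near the nominal level, while an intermediate tolerance such as `eps = 1` gives intervals that are too wide and hence over-cover. A huge tolerance returns the prior, which again covers at the nominal level when θ is drawn from the prior, so average coverage alone cannot flag that case.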

**I** was a wee puzzled by the coverage property found on page 7:

> Let g(θ|y) be a density approximating the univariate posterior π(θ|y), and G_{y} be the corresponding distribution function. Consider a function B(α) [taking values in the set of Borel sets of [0,1]] defined on [0,1] such that the resulting set has Lebesgue measure α. Let C(y,α) = G_{y}^{-1}(B(α)) and H(θ_{0}, y_{0}) be the distribution function for (θ_{0}, y_{0}). We say g satisfies the coverage property with respect to distribution H(θ_{0}, y_{0}) if for every function B and every α in [0,1], the probability that θ_{0} is in C(y_{0}, α) is α.

as the probability that θ_{0} belongs to C(y_{0}, α) is the probability that G_{y_{0}}(θ_{0}) belongs to B(α), which means the conditional of H(θ_{0}, y_{0}) has to be G_{y_{0}} if the probability is conditional on y_{0}. However, I then realised the paper does consider coverage in frequentist terms, which means that the probability is on the pair (θ_{0}, y_{0}). In this case, the coverage property will be satisfied for *any* distribution on y_{0} if the conditional is g(θ|y). This covers both Result 1 and Result 2 (and it seems to relate [strongly?!] to ABC being “well-calibrated” for every value of the tolerance, even infinity). I actually find the whole section 2.1 vaguely confusing both because of the use of double indexing ((θ_{0}, y_{0}) vs. (θ, y)) and because of the apparent lack of relevance of the posterior π(θ|y) in the discussion (all that matters is the connection between G and H)… In their implementation (p.12), the authors end up approximating the p-value P(θ_{0}<θ) and checking for uniformity.
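The frequentist reading can be spelled out in one line. As a sketch, assuming G_{y} is continuous (so that the probability integral transform applies) and writing, as a notational shortcut of mine, H(y_{0}) for the marginal distribution of y_{0} and λ for Lebesgue measure:

$$
P\big(\theta_0 \in C(y_0,\alpha)\big)
= \int P\big(G_{y_0}(\theta_0) \in B(\alpha) \,\big|\, y_0\big)\,\mathrm{d}H(y_0)
= \int \lambda\big(B(\alpha)\big)\,\mathrm{d}H(y_0)
= \alpha
$$

The middle equality holds because, conditional on y_{0}, G_{y_{0}}(θ_{0}) is uniform on [0,1] whenever θ_{0} given y_{0} has distribution function G_{y_{0}}, and B(α) has Lebesgue measure α by construction. The marginal of y_{0} plays no role in the argument.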

**A**s duly noted in the paper (p.9), things get more delicate when *m*, the model index itself, is involved in this assessment. When integrating the parameters out, the posterior distribution on the model index is a mixture of point masses, giving e.g. masses .7, .2, and .1 to the three possible values of *m*. I thus fail to understand how this translates into [non-degenerate] intervals: I would not derive from these figures that the posterior gives a “70% credible interval that m=1” (p.9) as there is no interval involved. The posterior probability is a number, uniquely defined given the data, without an associated variability in the Bayesian sense. Now, the definition found in the paper (p.9) is once again one of *calibration* of the ABC distributions, already discussed in Fearnhead and Prangle (2012). (The paper actually makes no mention of calibration.) Lastly, I am also lost as to why the calibration condition (5) on the posterior distribution of the model index is a testable one: there is a zero probability to observe again a given value of the posterior probability g(*m*|y) when generating a new y. In the following diagnostic, the authors use instead (p.13) a test that the (generated) model index is an outcome from a Bernoulli with parameter the posterior probability.
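One way to see how a Bernoulli-type check can get around the zero-probability issue is to pool simulations whose posterior probabilities fall in the same bin. The sketch below is a plain binned calibration check on a made-up two-model example (equal prior weights, y ~ N(0,1) under m=1 and y ~ N(2,1) under m=2). It is not the paper's U, V or W statistic, and every name and setting in it is hypothetical.

```python
import math
import random


def posterior_m1(y):
    """Exact P(m=1 | y) in the toy example: equal priors, N(0,1) versus N(2,1)."""
    # log density ratio of N(2,1) to N(0,1) at y is 2y - 2
    return 1.0 / (1.0 + math.exp(2.0 * y - 2.0))


def overconfident_m1(y):
    """Deliberately miscalibrated stand-in: doubles the exact posterior log-odds."""
    return 1.0 / (1.0 + math.exp(4.0 * y - 4.0))


def calibration_gap(prob_fn, n_rep=50000, n_bins=5, seed=2):
    """Largest gap, over bins of the reported probability, between the mean
    reported P(m=1 | y) and the observed frequency of m=1 in that bin."""
    rng = random.Random(seed)
    # per bin: [sum of reported probabilities, number of draws with m=1, count]
    bins = [[0.0, 0, 0] for _ in range(n_bins)]
    for _ in range(n_rep):
        # draw (m, y) from the joint: model prior, then data given the model
        m = 1 if rng.random() < 0.5 else 2
        y = rng.gauss(0.0 if m == 1 else 2.0, 1.0)
        p = prob_fn(y)
        b = min(int(p * n_bins), n_bins - 1)
        bins[b][0] += p
        bins[b][1] += (m == 1)
        bins[b][2] += 1
    return max(abs(sp - s1) / n for sp, s1, n in bins if n > 0)


if __name__ == "__main__":
    print("exact posterior:", calibration_gap(posterior_m1))
    print("overconfident:  ", calibration_gap(overconfident_m1))
```

With the exact posterior probabilities the gap should be down at Monte Carlo noise level, while the overconfident version shows a clear mismatch in the intermediate bins. The binning is a crude device and its resolution (here five bins) is an arbitrary choice.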

January 24, 2013 at 6:49 pm

Thanks for taking the time to read and comment on our paper, Christian, much appreciated! Some responses are:

“Given that ABC is always wrong, however, this may fail to be a powerful diagnostic” This is an interesting point. Our method can often detect ABC approximation error, which is why we promote it as a useful technique. However, it doesn’t estimate how much error remains if our diagnostics do not reject.

We only look at approximation error based on the ABC threshold, rather than that due to summary statistics. In other words we compare the ABC posterior approximation with pi(theta | S(yobs)) rather than with pi(theta | yobs). Possible extensions to consider the effect of summary statistics are mentioned in the discussion, but there would clearly be a lot of extra challenges.

The coverage property and calibration are closely linked. We didn’t explore this in the literature review to keep it brief. Calibration as in Fearnhead and Prangle (2012) effectively uses pi(theta0,y0)=prior*likelihood so tolerances zero and infinity both give calibration, but usually not intermediate values (except in the case of noisy ABC, which we don’t consider in the current paper). One contribution of this paper is to eliminate the “false negative” of the prior.

It’s correct that our definition of coverage does not involve the posterior. The link to the posterior only comes out through Result 1.

The comment about a “70% credible interval that m=1” does refer to a degenerate interval. The statement is trying to be an informal (technically incorrect!) link to the earlier definition of coverage.

In going from equation (5) to the test on page 13, the idea is that we have estimated model probabilities from the various data sets, and now investigate if the true models are consistent with this distribution. Statistic W does this through the log likelihood whereas statistics U and V simplify to a Bernoulli setting, by choosing one model i and comparing only the cases “i” and “not i”.

January 25, 2013 at 2:06 pm

Thank you for the detailed comments, Dennis! My worry about the summary statistics is that, while the overall simulation is on the joint for the parameters and for the (full) observations, the ABC is built upon summary statistics, so there seems to be a discrepancy there.

February 8, 2013 at 5:10 pm

It took a while and a trip to Warwick and a call to Fubini to understand, but I now see why there is no specific issue with the insufficiency of the summary statistics!