Incoherent inference

“The probability of the nested special case must be less than or equal to the probability of the general model within which the special case is nested. Any statistic that assigns greater probability to the special case is incoherent. An example of incoherence is shown for the ABC method.” Alan Templeton, PNAS, 2010

Alan Templeton just published an article in PNAS about “coherent and incoherent inference” (with applications to phylogeography and human evolution). While his (misguided) arguments are mostly those found in an earlier paper of his, discussed in this post as well as in the defence of model-based inference that twenty-two of us published in Molecular Ecology a few months ago, the paper contains a more general critical perspective on Bayesian model comparison, lining up argument after argument about the incoherence of the Bayesian approach (and not of ABC, as presented there). The notion of coherence is borrowed from the 1991 (Bayesian) paper of Lavine and Schervish on Bayes factors, which shows that Bayes factors may be non-monotonic in the alternative hypothesis (but also that posterior probabilities are not!). Templeton’s first argument proceeds from the quote above, namely that larger models should have larger probabilities or else this violates logic and coherence! The author presents the reader with a Venn diagram to explain why a larger set should have a larger measure. Obviously, he does not account for the fact that, in model choice, different models induce different parameter spaces and that those spaces are endowed with orthogonal measures, especially if the spaces are of different dimensions. In the larger space, P(\theta_1=0)=0. (This point does not even touch on the issue of defining “the” probability over the collection of models, which Templeton seems to take for granted but which does not make sense outside a Bayesian framework.) Talking of nested models having a smaller probability than the encompassing model, or of “partially overlapping models”, therefore does not make sense from a measure-theoretic (hence mathematical) perspective. (The fifty-one occurrences of coherent/incoherent in the paper do not bring additional weight to the argument!)
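
One way to spell out the structure behind this point, in notation introduced here purely for illustration (the model index M, the prior weights \Pi_i, the within-model priors \pi_i and the likelihoods f_i are assumed labels, not Templeton's): the prior is a distribution on the pair (model index, parameter), so the models are mutually exclusive values of the index even when one parameter space is embedded in the other.

\mathbb{P}(M=i,\ \theta_i \in A) = \Pi_i \int_A \pi_i(\theta_i)\,\text{d}\theta_i, \qquad \sum_i \Pi_i = 1

\mathbb{P}(M=i|x) = \dfrac{\Pi_i \int f_i(x|\theta_i)\,\pi_i(\theta_i)\,\text{d}\theta_i}{\sum_j \Pi_j \int f_j(x|\theta_j)\,\pi_j(\theta_j)\,\text{d}\theta_j}

Under this sketch the events \{M=1\} and \{M=2\} are disjoint by construction, so nothing forces \mathbb{P}(M=1|x) to be smaller than \mathbb{P}(M=2|x), and within the larger model the prior \pi_2 gives zero mass to \{\theta_1=0\} as soon as it has a density with respect to the Lebesgue measure.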

“Approximate Bayesian computation (ABC) is presented as allowing statistical comparisons among models. ABC assigns posterior probabilities to a finite set of simulated a priori models.” Alan Templeton, PNAS, 2010

An issue common to all recent criticisms by Templeton is the misleading or misled confusion between the ABC method and the resulting Bayesian inference. For instance, Templeton distinguishes the incoherence in the ABC model choice procedure from the incoherence in the Bayes factor, whereas ABC is used as a computational device to approximate the Bayes factor. There is therefore no inferential aspect linked with ABC per se: it is simply a numerical tool to approximate Bayesian procedures and, with enough computing power, the approximation can get as precise as one wishes. In this paper, Templeton also reiterates the earlier criticism that marginal likelihoods are not comparable across models, because they “are not adjusted for the dimensionality of the data or the models” (sic!). This point misses the whole purpose of using marginal likelihoods, namely that they account for the dimensionality of the parameter by providing a natural Ockham’s razor that penalises the larger model without requiring a penalty term to be specified. (If necessary, BIC, which is so successful, provides an approximation to this penalty, as does the alternative DIC.) The second criticism of ABC (i.e. of the Bayesian approach) is that model choice requires a collection of models and cannot decide outside this collection. This is indeed the purpose of Bayesian model choice, and studies like Berger and Sellke (1987, JASA) have shown the difficulty of reasoning within a single model. Furthermore, Templeton advocates the use of a likelihood ratio test, which necessarily implies using two models. Another Venn diagram is meant to explain why Bayes’ formula, when used for model choice, is “mathematically and logically incorrect” because marginal likelihoods cannot be added up when models “overlap”: according to him, “there can be no universal denominator, because a simple sum always violates the constraints of logic when logically overlapping models are tested”. Once more, this simply shows a poor understanding of the probabilistic modelling involved in model choice.
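
To illustrate how a marginal likelihood penalises the larger model on its own, here is a minimal Python sketch on a toy example that is entirely assumed for illustration (it is not taken from Templeton's paper or from any of the analyses he criticises): a single observation x ~ N(\theta, 1), with M_1 fixing \theta=0 and M_2 putting a N(0, \tau^2) prior on \theta, so that both marginal likelihoods are available in closed form.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a N(mean, var) distribution at x."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def marginal_m1(x):
    # Toy model M1 (assumed): theta is fixed at 0, so x ~ N(0, 1).
    return normal_pdf(x, 0.0, 1.0)


def marginal_m2(x, tau2):
    # Toy model M2 (assumed): theta ~ N(0, tau2) and x | theta ~ N(theta, 1).
    # Integrating theta out gives x ~ N(0, 1 + tau2).
    return normal_pdf(x, 0.0, 1.0 + tau2)


def posterior_prob_m1(x, tau2, prior_m1=0.5):
    """Posterior probability of M1 given prior weights on the two models."""
    num = prior_m1 * marginal_m1(x)
    den = num + (1.0 - prior_m1) * marginal_m2(x, tau2)
    return num / den


# Data close to 0 favour the smaller model, data far from 0 favour the larger one.
for x in (0.0, 1.0, 3.0):
    print(x, round(posterior_prob_m1(x, tau2=4.0), 3))
```

In this sketch no penalty term is specified anywhere: the larger model spreads its prior predictive mass over a wider range of x, which is what lowers its marginal likelihood when the observation sits near zero.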

“The central equation of ABC is inherently incoherent for three separate reasons, two of which are applicable in every case that deals with overlapping hypotheses.” Alan Templeton, PNAS, 2010

This argument relies on the representation of the “ABC equation” (sic!)

P(H_i|H,S^*) = \dfrac{G_i(||S_i-S^*||) \Pi_i}{\sum_{j=1}^n G_j(||S_j-S^*||) \Pi_j}

where S^* is the observed summary statistic, S_i is “the vector of expected (simulated) summary statistics under model i” and “G_i is a goodness-of-fit measure”. Templeton states that this “fundamental equation is mathematically incorrect in every instance (..) of overlap.” This representation of the ABC approximation is again misleading or misled, in that the ABC simulation algorithm produces an approximation to a posterior sample from \pi_i(\theta_i|S^*). The resulting approximation to the marginal likelihood under model M_i is a regular Monte Carlo step that replaces an integral with a weighted sum, not a “goodness-of-fit measure.” Templeton’s subsequent argument about the goodness-of-fit measures being “not adjusted for the dimensionality of the data” (re-sic!) and the resulting incoherence is therefore void of substance. The following argument repeats an earlier misunderstanding of the probabilistic model involved in Bayesian model choice: the reasoning that, if

\sum_j \Pi_j = 1

“the constraints of logic are violated [and] the prior probabilities used in the very first step of their Bayesian analysis are incoherent”, does not take into account the issue of measures over mutually exclusive spaces.
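
To make concrete the claim that ABC is a Monte Carlo approximation rather than a “goodness-of-fit” formula, here is a minimal Python sketch of rejection-ABC model choice. The toy models (M_1: S ~ N(0,1); M_2: \theta ~ N(0,\tau^2), S ~ N(\theta,1)), the tolerance eps, the number of simulations and all function names are assumptions made for illustration; they are not the models or settings of any paper discussed here.

```python
import math
import random


def abc_model_choice(s_obs, n_sims=200_000, eps=0.1, tau=2.0, prior_m1=0.5, seed=0):
    """Rejection ABC: accepted frequencies of the model index approximate
    the posterior model probabilities given the summary statistic."""
    rng = random.Random(seed)
    accepted = {1: 0, 2: 0}
    for _ in range(n_sims):
        # 1. Draw the model index from the prior weights on the models.
        model = 1 if rng.random() < prior_m1 else 2
        # 2. Draw the parameter from that model's prior (M1 has no free parameter).
        theta = 0.0 if model == 1 else rng.gauss(0.0, tau)
        # 3. Simulate the summary statistic from the selected model.
        s_sim = rng.gauss(theta, 1.0)
        # 4. Keep the draw only if it falls within eps of the observed summary.
        if abs(s_sim - s_obs) <= eps:
            accepted[model] += 1
    total = accepted[1] + accepted[2]
    if total == 0:
        raise ValueError("no simulation accepted; increase eps or n_sims")
    return {m: accepted[m] / total for m in accepted}


def exact_posterior_m1(s_obs, tau=2.0, prior_m1=0.5):
    """Closed-form posterior probability of M1 for this toy pair of models."""
    def pdf(x, var):
        return math.exp(-x * x / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)
    m1 = prior_m1 * pdf(s_obs, 1.0)
    m2 = (1.0 - prior_m1) * pdf(s_obs, 1.0 + tau * tau)
    return m1 / (m1 + m2)


approx = abc_model_choice(0.5)
print("ABC approximation:", approx[1], "exact:", exact_posterior_m1(0.5))
```

In this toy setting the exact answer is known, so the ABC frequencies can be checked against it; shrinking eps and increasing n_sims brings the two closer, which is the sense in which ABC is only a numerical device.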

“ABC is used for parameter estimation in addition to hypothesis testing and another source of incoherence is suggested from the internal discrepancy between the posterior probabilities generated by ABC and the parameter estimates found by ABC.” Alan Templeton, PNAS, 2010

The point corresponding to the above quote is that, while the posterior probability that \theta_1=0 (model M_1) is much higher than the posterior probability of the opposite (model M_2), the Bayes estimate of \theta_1 under model M_2 is “significantly different from zero”. Again, this reflects both a misunderstanding of the probability model, namely that \theta_1=0 is impossible [has measure zero] under model M_2, and a confusion between confidence intervals (that are model-specific) and posterior probabilities (that work across models). The concluding message that “ABC is a deeply flawed Bayesian procedure in which ignorance overwhelms data to create massive incoherence” is thus unsubstantiated.
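
The two statements can hold together without any contradiction, as a small Python sketch shows on an assumed toy configuration of the Jeffreys-Lindley type (the sample size n, the sample mean xbar, the prior variance tau2 and the function names are illustrative choices, not figures from Templeton's example): n observations from N(\theta, 1), with M_1 fixing \theta=0 and M_2 using a N(0, \tau^2) prior.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a N(mean, var) distribution at x."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def credible_interval_m2(xbar, n, tau2, z=1.96):
    """Approximate 95% credible interval for theta *within* model M2."""
    post_var = 1.0 / (1.0 / tau2 + n)   # conjugate normal update
    post_mean = post_var * n * xbar
    half_width = z * math.sqrt(post_var)
    return post_mean - half_width, post_mean + half_width


def posterior_prob_m1(xbar, n, tau2, prior_m1=0.5):
    """Posterior probability of M1, a statement *across* the two models."""
    m1 = normal_pdf(xbar, 0.0, 1.0 / n)          # xbar ~ N(0, 1/n) under M1
    m2 = normal_pdf(xbar, 0.0, tau2 + 1.0 / n)   # theta integrated out under M2
    return prior_m1 * m1 / (prior_m1 * m1 + (1.0 - prior_m1) * m2)


# Assumed illustrative values: a sample mean 2.5 standard errors away from 0.
n, xbar, tau2 = 10_000, 0.025, 1.0
low, high = credible_interval_m2(xbar, n, tau2)
print("interval under M2:", (low, high))
print("P(M1 | data):", posterior_prob_m1(xbar, n, tau2))
```

With these assumed values the interval computed inside M_2 excludes zero while M_1 still receives most of the posterior probability: the interval conditions on M_2 being the model, whereas the posterior probability compares the two models.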

“Incoherent methods, such as ABC, Bayes factor, or any simulation approach that treats all hypotheses as mutually exclusive, should never be used with logically overlapping hypotheses.” Alan Templeton, PNAS, 2010

In conclusion, I am quite surprised at this controversial piece of work being published in PNAS, as the mathematical and statistical arguments of Professor Templeton should have been assessed by referees who are mathematicians and statisticians, in which case they would have spotted the obvious inconsistencies!

23 Responses to “Incoherent inference”

  1. […] q(.), for the prior distribution…  Furthermore, it reproduces the argument found in Templeton that larger evidence should be attributed to larger hypotheses. And it misses our 1992 analysis of […]

  2. […] reply. This reply is unfortunately missing any novelty element compared with the original paper. First, he maintains that the criticism is about ABC (which is, in case you do not know, a […]

  3. […] inference [accepted] The letter we submitted to PNAS about Templeton’s surprising diatribe on Bayesian inference has now been accepted: Title: “Incoherent Phylogeographic […]

  4. […] the astounding publication of Templeton’s pamphlet against Bayesian inference in PNAS last March, Jim Berger, Steve Fienberg, Adrian Raftery and […]

  5. Thank you so much for taking the time to explain that, I appreciate it.

    My description is clumsy because of my lack of familiarity with terms, but I do understand the algorithm. Your answer that this is not an apples-oranges comparison at the level of different (“large” and “small”) models is the key point for me. I need to do some more reading to understand mathematically why you are correct — the Molecular Ecology paper is unequivocal in this claim, but unfortunately doesn’t make it clear which references would help to explain why it is so. The references you cited in your review of Sober seem promising and I will consult those.

    I have my own hypothesis on the last point — the “special” cases fail to “win” in the algorithmic treatment of the model that seems to contain them, because the authors assumed boundaries to the parameters that excluded the special cases. This is a problem with assumptions, not methods, and fortunately it’s testable.

  6. (b) In Fagundes et al. (2007) the number of parameters varies among the eight models, so the prior distributions on those parameters cannot be the same. (A uniform prior on is not the same as a uniform prior on .) Putting the same prior weights on all eight models is another thing (which amounts to using the Bayes factor).

    Thanks for taking the time to explain. I think this is the point that confuses me.

    Each of their differently-parametered models is run through simulations to find the parameter values that maximize the posterior probability, assuming uniform (or log-uniform) priors within that parameterization. That seems properly Bayesian to me, although I believe they constrained their parameter ranges in ways that systematically excluded relevant regions of the parameter space.

    Now they have eight models, each representing the maximum posterior, given the data, from a particular parameterization.

    So far, so good. But how to choose which of the eight parameterizations we should accept? The authors chose the one with the highest posterior probability. These are apples-oranges comparisons, to which the authors apply no priors at all, which is to say identical priors.

    I could imagine that the calculation of the posteriors had already penalized the models with more parameters. But this can’t be true in general, because some of the fewer-parameter models are logically special cases of the more-parameter models — so if a fewer-parameter model really had a higher posterior under the same priors, it should already have been chosen in the first round of the analysis!

    • John: I reread the paper this morning to make sure I was not misrepresenting its methodology. The methodology sounds correct to me. Like Templeton, you may object to the Bayesian approach as a whole, in which case there is nothing else I can add (!), but this paper by Fagundes et al. truly follows standard Bayesian model choice methodology (Berger, 1985; Robert, 2001).
      Fagundes et al. jointly simulate parameters and summary statistics from the predictive distribution (modulo the approximation effect due to the ABC algorithm, which should be minor because of the large simulation size). From those simulations, they derive a convergent approximation to \mathbb{P}_m(S=S^*), which is the marginal likelihood of the summary statistic. From there, a proper Bayesian reasoning derives the posterior probability of the model, \mathbb{P}(M=m|S=S^*). Therefore, I do not find in the paper a step where Fagundes et al. find the parameter values that maximize the posterior probability, because they integrate out the parameters in a completely kosher Bayesian fashion. The effect of the parameterisation that you mention is furthermore only felt through the choice of the priors, as a change of parameters does not modify the value of \mathbb{P}_m(S=S^*).
      The posterior probabilities of the model or equivalently the marginal likelihoods of the summary statistic integrate the whole Bayesian modelling, including the priors. As explained in our Molecular Ecology paper, this means they are (a) statistically comparable, not an apple-orange comparison! and (b) already penalised for differences in parameter spaces, incl. dimensions.
      As to the latest of your points, about fewer parameter models [being] logically special cases of the more-parameter models, I have addressed it in the post. This is one of Popper’s misunderstandings about model choice, as also addressed in the review of Sober’s Evidence and Evolution: The Logic Behind the Science.

  7. Stuart J.E. Baird Says:

    There are those at PNAS who have been trying to shut down the non-peer reviewed channels by which members or their friends can submit pieces.

    I would hope that this submission by Templeton will finally convince PNAS or its contributors/readers that the journal cannot be held in high regard while great peer-reviewed work is forced to appear side by side with misleading/misled nonscience.
