## over-confident about mis-specified models?

**Z**iheng Yang and Tianqi Zhu published a paper in PNAS last year that criticises Bayesian posterior probabilities used in the comparison of models under misspecification as “overconfident”. The paper is written from a phylogeneticist’s point of view, rather than from a statistician’s perspective, as shown by the Editor in charge of the paper [although I thought that, after Steve Fienberg’s intervention, a statistician had to be involved in a submission relying on statistics!], but the analysis is rather problematic, at least seen through my own lenses… with no statistical novelty, apart from looking at the distribution of posterior probabilities in toy examples. The starting argument is that Bayesian model comparison often reports posterior probabilities in favour of a particular model that are close or even equal to 1.

“The Bayesian method is widely used to estimate species phylogenies using molecular sequence data. While it has long been noted to produce spuriously high posterior probabilities for trees or clades, the precise reasons for this overconfidence are unknown. Here we characterize the behavior of Bayesian model selection when the compared models are misspecified and demonstrate that when the models are nearly equally wrong, the method exhibits unpleasant polarized behaviors, supporting one model with high confidence while rejecting others. This provides an explanation for the empirical observation of spuriously high posterior probabilities in molecular phylogenetics.”

The paper focuses on the tendency of posterior probabilities to strongly support one model against the others when the sample size is large enough, “even when” all models are wrong, the argument being apparently that the correct output should be one of equal probability between models, or maybe a uniform distribution of these model probabilities over the probability simplex. Why should it be so?! The construction of the posterior probabilities is based on a meta-model that assumes the generating model to be part of a list of mutually exclusive models. It does not account for cases where “all models are wrong” or cases where “all models are right”. The reported probability is furthermore epistemic, in that it is relative to the measure defined by the prior modelling, not to a promise of a frequentist stabilisation in an ill-defined asymptotia. By which I mean that a 99.3% probability of model M¹ being “true” does not have a universal and objective meaning. (*Moderation note:* the high polarisation of posterior probabilities was instrumental in our investigation of model choice with ABC tools and in proposing instead error rates in ABC random forests.)
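The polarisation at stake is easy to reproduce in a toy simulation of my own making (not the paper’s code): take two fixed models M¹ = N(−ε, 1) and M² = N(+ε, 1), both at the same Kullback-Leibler distance from data actually drawn from N(0, 1). The log Bayes factor then reduces to −2ε Σxᵢ, whose spread grows like √n, so the posterior probability of M¹ drifts towards 0 or 1 as the sample size increases:

```python
# Toy sketch (my own, not the paper's): two equally wrong fixed models,
# M1 = N(-eps, 1) and M2 = N(+eps, 1), with data drawn from N(0, 1).
import numpy as np

rng = np.random.default_rng(42)
eps = 0.1

def post_prob_m1(n, reps=1000):
    """Posterior P(M1 | data) under equal prior weights, over `reps` datasets."""
    x = rng.normal(0.0, 1.0, size=(reps, n))
    # with unit variances, the log Bayes factor of M1 vs M2 is -2*eps*sum(x)
    log_bf = -2 * eps * x.sum(axis=1)
    return 1.0 / (1.0 + np.exp(-log_bf))

for n in (10, 100, 1000, 10000):
    p = post_prob_m1(n)
    extreme = np.mean((p > 0.99) | (p < 0.01))
    print(f"n={n:6d}  fraction of datasets with P(M1) outside (0.01, 0.99): {extreme:.2f}")
```

Since log BF ~ N(0, 4ε²n) under the generating N(0, 1), the fraction of datasets giving a near-0 or near-1 posterior probability climbs towards one with n, even though neither model is closer to the truth.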

The notion that two models are equally wrong because they are both exactly at the same Kullback-Leibler distance from the generating process (when optimised over the parameter) is such a formal [or cartoonesque] notion that it does not make much sense. There is always one model that is slightly closer and eventually takes over. It is also bizarre that the argument does not account for the complexity of each model and the resulting (Occam’s razor) penalty. Even two models with a single parameter are not necessarily of intrinsic dimension one, as shown by DIC. And thus it is no surprise if the posterior probability mostly favours one over the other. In any case, a healthily sceptical approach to Bayesian model choice means looking at the behaviour of the procedure (Bayes factor, posterior probability, posterior predictive, mixture weight, &tc.) under various assumptions (model M¹, M², &tc.) to calibrate the numerical value, rather than taking it at face value. By which I do not mean a frequentist evaluation of this procedure. Actually, it is rather surprising that the authors of the PNAS paper do not jump on the case when the posterior probability of, say, model M¹ is uniformly distributed, since this would be a perfect setting where the posterior probability acts as a p-value. (This is also what happens to the bootstrapped version, see the last paragraph of the paper on p.1859, the year Darwin published his Origin of Species.)
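The calibration I have in mind can itself be sketched by simulation, again with a hypothetical pair of fixed models of my own choosing rather than anything from the paper: simulate the posterior probability of M¹ under each candidate model in turn, and compare the observed value to those reference distributions instead of reading it at face value.

```python
# Hypothetical calibration exercise: what values does P(M1 | data) take
# when the data actually come from M1, and when they come from M2?
# Fixed models M1 = N(-eps, 1) and M2 = N(+eps, 1), equal prior weights.
import numpy as np

rng = np.random.default_rng(0)
eps, n, reps = 0.1, 500, 2000

def p_m1(x):
    """Posterior P(M1 | data) for each dataset (rows of x)."""
    log_bf = -2 * eps * x.sum(axis=-1)  # log Bayes factor of M1 vs M2
    return 1.0 / (1.0 + np.exp(-log_bf))

under_m1 = p_m1(rng.normal(-eps, 1.0, size=(reps, n)))
under_m2 = p_m1(rng.normal(+eps, 1.0, size=(reps, n)))

print("median P(M1) when data ~ M1:", np.median(under_m1))
print("median P(M1) when data ~ M2:", np.median(under_m2))
```

An observed posterior probability of, say, 0.993 then gets read against these two simulated distributions, which is a different exercise from trusting the number as an absolute degree of belief.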

May 2, 2019 at 3:09 pm

I don’t know the authors of this paper. But I have read it pretty carefully and I have to disagree on a number of points.

> no statistical novelty, apart from looking at the distribution of posterior probabilities in toy examples

There are a number of quite general asymptotic results buried in the supplementary materials. So I think there is some statistical novelty here.

> It is also bizarre that the argument does not account for the complexity of each model and the resulting (Occam’s razor) penalty

Their general theory in the supplementary material does cover the case of models of different dimensionality and there is in fact an Occam’s razor penalty. It’s summarized in Table S.3.

> The notion that two models are equally wrong because they are both exactly at the same Kullback-Leibler distance from the generating process (when optimised over the parameter) is such a formal [or cartoonesque] notion that it does not make much sense.

But that’s exactly what determines which models could be selected asymptotically. It’s hardly an arbitrary choice they are making!

> There is always one model that is slightly closer and eventually takes over.

The authors seem well aware of this reality, which is why they back up the asymptotic theory with simulation experiments to demonstrate the same behavior occurs non-asymptotically when multiple models have similar KL divergence (see Fig 3).

Stepping back: in his Bayesian Analysis paper on misspecification, Peter Grunwald talks about “benign” vs “bad” misspecification for Bayesian inference in the predictive setting. Yang and Zhu are investigating a similar phenomenon in the model selection setting. In both the Grunwald and Yang/Zhu settings, bad misspecification leads to overconfident posteriors/posterior predictives. Understanding when this can happen and how to fix it seem like worthwhile endeavors to me!

May 2, 2019 at 9:36 pm

I am sure a lot of people have a positive impression of this paper; it was actually sent to me by my friend and colleague Judith, who clearly appreciated it.

April 30, 2019 at 4:35 am

The Bayesian paradigm cannot be criticised for failing to place adequate post-data/posterior probabilities (summing to 1) on models in a set not containing the true model, because the problem is fundamentally ill-posed(!)

However, it is not really designed to decide what models should be included in a reasonable set of wrong but potentially useful models (Gelman and many others appear to concede that ground to empiricism or frequentism) and hence possibly facilitate a sensitivity analysis over such a set of models.