## Bayesian p-values

“The posterior probability of the null hypothesis is a point probability and thus not appropriate to use.”

This morning, during my “lazy Sunday” breakfast due to the extended May first weekend, I was already done with the newspapers of the week and thus moved to the pile of unread Statistics journals. One intriguing paper was **Bayesian p-values** by Lin et al. in the latest issue of JRSS Series C. The very concept of a Bayesian *p*-value is challenging, and the very idea would make many (most?) Bayesians cringe! In the most standard perspective, a *p*-value is a frequentist concept that is incredibly easy to misuse as *“the probability that the null hypothesis is true”*, but that is also intrinsically misguided. There are a lot of papers and books dealing with this issue, including Berger and Wolpert’s (1988) *Likelihood Principle* (available for free!), and I do not want to repeat here the arguments about the lack of validity of this pervasive quantification of statistical tests. In Lin et al.’s paper, the Bayesian *p*-values were simply defined as tail posterior probabilities and thus were not *p*-values per se… The goal of the paper was not to define a “new” methodology but to evaluate the impact of the Dirichlet prior parameters on those posterior probabilities. I still do not understand the above quote since, while there is some degree of controversy about using Bayes factors with improper priors, point null hypotheses may still be properly defined…

There is, however, an interesting larger issue related to this question, namely Bayesian model evaluation: in standard Bayesian modelling, both sides of a hypothesis are part of the model, and the Bayes factor evaluates the likelihood of one side versus the other, bypassing the prior probabilities of each side (and hence missing the decision-theoretic goal of maximising a utility function). There is nonetheless a debate about whether or not this should always be the case, and about the relevance of deciding on the “truth” of a hypothesis simply by looking at how far in the tails the observation falls, as shown for instance by the recent paper of Templeton, whose main criticism was on this point. Following this track may lead to different kinds of *p*-values, as for instance in Verdinelli and Wasserman’s (1998) paper and in our unpublished tech report with Judith Rousseau… Templeton argues that seeing a value that is too far in the tails (whether in a frequentist or a Bayesian sense) is enough to conclude about the falsity of a theory (calling, guess who?!, on Popper himself!). But, even in a scientific perspective, the rejection of a hypothesis must be followed by some kind of action, whose impact should be included within the test itself.

**P.S.** As a coincidence, the paper *Allocation of Resources* by Metcalf et al. in the same issue of Series C also contains some reflections about Bayesian *p*-values, using a predictive posterior probability of divergence as the central quantity,

$$\mathbb{P}\big(D(y^{obs}, y^{rep}) \ge 0 \mid y^{obs}\big),$$

where *obs* and *rep* denote the observed and replicated data and where D is the discrepancy function, equal to zero when the replicated data equals the observed data. The result they obtain on their dataset is a perfect symmetry, with a Bayesian *p*-value of 0.5. They conclude (rightly) that *“despite being potentially useful as an informal diagnostic, there is little formal (decision theoretic) justification for the use of Bayesian p-values”* (p.169).
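Such a predictive posterior probability is straightforward to approximate by simulation: draw the parameter from the posterior, replicate the data, and average the event that the discrepancy is nonnegative. Here is a minimal sketch, assuming (hypothetically, this is not Metcalf et al.’s model) a normal sample with known unit variance, a flat prior on the mean, and a signed mean discrepancy:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: n observations from N(mu, 1) with a flat prior on mu,
# so the posterior for mu is N(ybar, 1/n).
y_obs = rng.normal(loc=0.3, scale=1.0, size=50)
n, ybar = y_obs.size, y_obs.mean()

def discrepancy(obs, rep):
    # A signed discrepancy, equal to zero when the replicated data matches
    # the observed data (the datasets are compared through their means).
    return rep.mean() - obs.mean()

draws = 10_000
mu = rng.normal(ybar, 1.0 / np.sqrt(n), size=draws)   # posterior draws of mu
d = np.empty(draws)
for i in range(draws):
    y_rep = rng.normal(mu[i], 1.0, size=n)            # replicated dataset
    d[i] = discrepancy(y_obs, y_rep)

# Monte Carlo estimate of P( D(obs, rep) >= 0 | obs ); the symmetry of the
# signed discrepancy under a flat prior makes this close to 0.5 here.
bayes_p = np.mean(d >= 0)
print(bayes_p)
```

With this choice of discrepancy and prior, the posterior distribution of the discrepancy is symmetric around zero whatever the data, which reproduces in miniature the “perfect symmetry” at 0.5 noted above.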

May 4, 2009 at 6:25 am

Andrew, I would be indeed interested in a deeper debate on the nature of predictive checking within a Bayesian framework. It seems to me this is an important missing item in the Bayesian toolbox and one easy entry for critics of the Bayesian paradigm.

May 4, 2009 at 3:04 am

Christian:

Hmmm . . . I actually think the whole p-value thing is a distraction. When checking models, graphs are the way to go: my own changing perspective can be seen by comparing chapter 6 of the first edition of BDA and my 1996 paper with Meng and Stern to chapter 6 of the second edition of BDA.

Regarding p-values in particular, I think one problem here is that people don’t always define them carefully. I think it’s helpful to distinguish p-values (posterior probabilities comparing replicated to observed data) and u-values (uniformly-distributed statistics); see, for example, section 3 of this article (or my 2003 International Statistical Review article for more on this).
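[The p-value/u-value distinction can be made concrete with a small simulation, in a hypothetical normal-mean setup of my own choosing rather than one from the cited articles: the tail probability of a pivotal statistic is uniform over repeated datasets (a u-value), while the posterior predictive probability attached to a signed mean discrepancy under a flat prior stays at 0.5 whatever the data.]

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(1)

def norm_cdf(x):
    # Standard normal cdf via the error function (stdlib only).
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

n, sims = 30, 1000
u_values, p_values = [], []
for _ in range(sims):
    y = rng.normal(0.0, 1.0, size=n)          # data from the null model N(0, 1)
    ybar = y.mean()
    # u-value: tail probability of the pivotal statistic sqrt(n)*ybar under
    # H0: mu = 0 -- uniformly distributed across repeated datasets.
    u_values.append(1.0 - norm_cdf(sqrt(n) * ybar))
    # Posterior predictive p-value with a flat prior on mu and the signed
    # mean discrepancy: ybar_rep - ybar | y is N(0, 2/n) whatever y is,
    # so the probability of a nonnegative discrepancy is always 1/2.
    mu_draws = rng.normal(ybar, 1.0 / sqrt(n), size=4000)  # posterior draws
    ybar_rep = rng.normal(mu_draws, 1.0 / sqrt(n))         # replicated means
    p_values.append(np.mean(ybar_rep - ybar >= 0.0))

u, p = np.array(u_values), np.array(p_values)
print(u.min(), u.max())   # u-values spread over (0, 1)
print(p.min(), p.max())   # posterior predictive p-values stick near 0.5
```

The simulated posterior predictive p-values are therefore anything but uniform: useful for locating misfit, but not calibrated the way a u-value is.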

The point is that the goal of these checks is not, and should not be, to reject models. I can reject my models before I see a single data point. All my models are false! The point is to understand ways in which the model does not fit the data. It’s not about rejection.

This is not to dismiss all the mathematical results that you refer to, but I question some of their claimed relevance to statistical practice. I have no problem with people computing Bayes factors–they make sense within their (limited) domain. But I do get annoyed when people (not you!) go around telling people _not_ to do predictive checks and _not_ to check their models. The argument seems to be: (1) Posterior predictive checks have low power, (2) So don’t use posterior predictive checks, (3) So don’t seriously check your model. Thus, “low power” is used as an excuse to be uncritical of models!

Umm . . . . I guess I should write something on my own blog about this. I occasionally write these articles for statistics journals (as noted above) but they don’t seem to convince all the orthodox Bayesians. Maybe there’s another way to say the same thing? Or I should be happy knowing that people actually do posterior predictive checks even if they’re somehow viewed as illegitimate?