“The posterior probability of the null hypothesis is a point probability and thus not appropriate to use.”
This morning, during my “lazy Sunday” breakfast thanks to the extended May first weekend, I was already done with the week’s newspapers and thus moved to the pile of unread Statistics journals. One intriguing paper was Bayesian p-values by Lin et al. in the latest issue of JRSS Series C. The very concept of a Bayesian p-value is challenging, and the idea would make many (most?) Bayesians cringe! In the most standard perspective, a p-value is a frequentist concept that is not only incredibly easy to misuse as “the probability that the null hypothesis is true” but also intrinsically misguided. There are a lot of papers and books dealing with this issue, including Berger and Wolpert’s (1988) Likelihood Principle (available for free!), and I do not want to repeat here the arguments about the lack of validity of this pervasive quantification of statistical tests. In the case of this paper, the Bayesian p-values are simply defined as tail posterior probabilities and thus are not p-values per se… The goal of the paper was not to define a “new” methodology but to evaluate the impact of the Dirichlet prior parameters on those posterior probabilities. I still do not understand the opening quote since, while there is some degree of controversy about using Bayes factors with improper priors, point null hypotheses may still be properly defined…
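To make the object concrete, here is a small Python sketch of such a tail posterior probability in a Beta-Binomial model, a one-dimensional stand-in for the Dirichlet-multinomial setting of the paper (the data and the prior parameters below are entirely made up for illustration):

```python
import random

random.seed(1)

def tail_posterior_prob(successes, failures, a, b, theta0=0.5, n_sim=100_000):
    """Monte Carlo estimate of the tail posterior probability
    P(theta > theta0 | data) under a Beta(a, b) prior and a Binomial
    likelihood, i.e. a "Bayesian p-value" of the tail-probability kind."""
    post_a, post_b = a + successes, b + failures  # conjugate posterior update
    hits = sum(random.betavariate(post_a, post_b) > theta0 for _ in range(n_sim))
    return hits / n_sim

# Same (hypothetical) data, three different priors: the tail posterior
# probability moves with (a, b), which is the kind of prior sensitivity
# the paper investigates for Dirichlet parameters.
for a, b in [(1, 1), (0.5, 0.5), (10, 10)]:
    print((a, b), tail_posterior_prob(7, 3, a, b))
```

The point of the loop is that the “p-value” is not a fixed function of the data: a concentrated Beta(10, 10) prior pulls the posterior towards 1/2 and deflates the tail probability relative to the uniform prior.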
There is however a larger and more interesting issue lurking behind this one, namely Bayesian model evaluation: in standard Bayesian modelling, both sides of an hypothesis are part of the model, and the Bayes factor evaluates the likelihood of one side versus the other, bypassing the prior probabilities of each side (and hence missing the decision-theoretic goal of maximising a utility function). Yet there is a debate about whether or not this should always be the case, and about the relevance of deciding on the “truth” of an hypothesis by simply looking at how far in the tails the observation falls, as exemplified by the recent paper of Templeton, whose main criticism bears on this very point. Following this track may lead to different kinds of p-values, as for instance in Verdinelli and Wasserman’s (1998) paper and in our unpublished tech report with Judith Rousseau… Templeton argues that seeing a value too far in the tails (whether in a frequentist or a Bayesian sense) is enough to conclude that a theory is false (calling, guess who?!, on Popper himself!). But, even from a scientific perspective, the rejection of an hypothesis must be followed by some kind of action, whose impact should be included within the test itself.
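Since the opening quote opposes point probabilities to proper Bayesian answers, it may help to recall that a point null poses no definitional problem for a Bayes factor: with a proper prior under the alternative, both marginal likelihoods are well defined. A toy Binomial example of my own, chosen because everything is in closed form:

```python
from math import comb

def bayes_factor_01(k, n):
    """Bayes factor B01 for the point null H0: theta = 1/2 against
    H1: theta ~ Uniform(0, 1), in a Binomial(n, theta) model.
    Under H0 the marginal likelihood is C(n,k) (1/2)^n; under H1 it
    integrates to C(n,k) * Beta(k+1, n-k+1) = 1/(n+1)."""
    return comb(n, k) * 0.5 ** n * (n + 1)

print(bayes_factor_01(7, 10))  # → 1.2890625, mild support for H0
```

Note that B01 exceeds one here: the data mildly favour the point null even though the observed frequency is 0.7, a standard illustration of the gap between Bayes factors and tail-area reasoning.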
PS: As a coincidence, the paper Allocation of Resources by Metcalf et al. in the same issue of Series C also contains some reflections about Bayesian p-values, using as central quantity the predictive posterior probability of divergence

Pr( D(rep, obs) ≥ 0 | obs ),
where obs and rep denote the observed and replicated data and where D is the discrepancy function, equal to zero when the replicated data coincide with the observed data. The result they obtain on their dataset is perfectly symmetric, with a Bayesian p-value of 0.5. They (rightly) conclude that “despite being potentially useful as an informal diagnostic, there is little formal (decision theoretic) justification for the use of Bayesian p-values” (p. 169).
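Their perfectly symmetric 0.5 is indeed what a signed discrepancy tends to deliver when the model fits, since the posterior predictive distribution of D is then symmetric about zero. A quick simulation with a toy Normal model of my own (flat prior on the mean, known variance; nothing below is taken from Metcalf et al.):

```python
import random

random.seed(0)

def posterior_predictive_pvalue(obs, sigma=1.0, n_sim=20_000):
    """Estimate Pr(D(rep, obs) >= 0 | obs) for a Normal(mu, sigma^2)
    model with a flat prior on mu, using the signed discrepancy
    D(rep, obs) = mean(rep) - mean(obs), zero when rep reproduces obs."""
    n = len(obs)
    xbar = sum(obs) / n
    hits = 0
    for _ in range(n_sim):
        mu = random.gauss(xbar, sigma / n ** 0.5)          # posterior draw of mu
        rep = [random.gauss(mu, sigma) for _ in range(n)]  # replicated dataset
        if sum(rep) / n - xbar >= 0:
            hits += 1
    return hits / n_sim

obs = [random.gauss(0.0, 1.0) for _ in range(30)]  # synthetic dataset
print(posterior_predictive_pvalue(obs))  # close to 0.5 by symmetry
```

Whatever the data, the answer hovers around 0.5 with this discrepancy, which illustrates their point that such quantities are informal diagnostics rather than decision-theoretically justified tests.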