Bayesian p-values (2)
“What the use of P implies, therefore, is that a hypothesis that may be true may be rejected because it has not predicted observable results that have not occurred.” H. Jeffreys, Theory of Probability
Looking a bit further into the literature about Bayesian p-values, I read the Statistical Science paper of Bayarri and Castellanos, “Bayesian Checking of the Second Levels of Hierarchical Models”, which considers extensions of the surprise measures of Bayarri and Berger (2000, JASA, see preprint) in the setting of a normal hierarchical model. (Here is a link to a very preliminary version of the paper.) While I quite appreciate the advances contained in those papers, I am still rather reluctant to put forward those measures of surprise as decision tools and, given the propensity of users to hijack tools towards their own use, I fear they would end up being used exactly as p-values, namely interpreted as probabilities of the null hypothesis.
Here are some comments on the paper that I made during my trip back from Montpellier yesterday evening. The most natural tool in building a Bayesian p-value seems to me to be the probability of a tail event under the predictive,

$$\mathbb{P}\big(t(X)\ge t(x)\,\big|\,x\big)=\int_{\Theta}\mathbb{P}_{\theta}\big(t(X)\ge t(x)\big)\,\pi(\theta\mid x)\,\mathrm{d}\theta\,,$$

(a Monte Carlo sketch of this computation is given right after the list below) and using the event $\{t(X)\ge t(x)\}$ to define the “surprise” means that
- the tail event is evaluated under a distribution that is “most” favourable to $x$, since it is based on the posterior distribution of $\theta$ given $x$. [This point somehow relates to the “using the data twice” argument that I do not really understand in this setting: conditional on $x$, this is a proper Bayesian way of evaluating the probability of the event $\{t(X)\ge t(x)\}$. (What one does with this evaluation is another issue!)]
- following Andrew Gelman’s discussion, there is no accounting for the fact that “all models are wrong” and that we are working from within a model when trying to judge the adequacy of this very model, in a Münchhausen-like way of pulling oneself up to the Moon. Again, there is a fair danger in using $\mathbb{P}\big(t(X)\ge t(x)\,\big|\,x\big)$ as the posterior probability of the model being “true”…
- I think the approach does not account for the (decisional) uses of the numerical evaluation, hence is lacking calibration: is a value of $10^{-2}$ small?! Is a value of $10^{-3}$ very small?! Are those absolute or relative values?! And if the value is used to decide for or against a model, what are the consequences of this decision?
- the choice of the summary statistic t(x) is quite relevant for the value of the surprise measure and there is no intrinsic choice. For instance, I first thought using the marginal likelihood m(x) would be a relevant choice, but alas it is not invariant under a change of variables: if $y=g(x)$ is a reparameterisation of the observation, the marginal density transforms as $m_Y(y)=m_X(x)/|g'(x)|$, so the Jacobian factor may reorder the surprise of different samples depending on the representation.
- another [connected] point that is often neglected in model comparison and model evaluation is that sufficient statistics are only sufficient within a given model, not for comparing or evaluating this model. For instance, when comparing a Poisson model with a negative binomial model, the sum of the observations is sufficient in both cases but not in the comparison! (See the second sketch below for a numerical illustration.)
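To make the tail-event computation above concrete, here is a minimal Monte Carlo sketch of the posterior predictive p-value $\mathbb{P}(t(X)\ge t(x)\mid x)$. The normal model, the conjugate N(0, 10) prior, and the choice $t(x)=\max_i x_i$ are all illustrative assumptions of mine, not taken from Bayarri and Castellanos:

```python
# Minimal sketch: Monte Carlo evaluation of the posterior predictive
# p-value P(t(X_rep) >= t(x) | x), under an assumed N(theta, 1) model
# with a conjugate N(0, 10) prior on theta. All settings illustrative.
import numpy as np

rng = np.random.default_rng(0)

# "observed" data, simulated here for the sake of the example
n = 20
x = rng.normal(loc=1.5, scale=1.0, size=n)

# conjugate posterior for theta: N(mu_n, tau_n2)
prior_var = 10.0
tau_n2 = 1.0 / (n + 1.0 / prior_var)
mu_n = tau_n2 * x.sum()

def t(sample):
    # illustrative summary statistic; its choice matters (see above)
    return sample.max()

# simulate theta from the posterior, replicate datasets from the
# sampling model, and estimate the tail probability of t(X) >= t(x)
M = 10_000
theta_rep = rng.normal(mu_n, np.sqrt(tau_n2), size=M)
x_rep = rng.normal(theta_rep[:, None], 1.0, size=(M, n))
t_rep = x_rep.max(axis=1)
print("posterior predictive p-value:", (t_rep >= t(x)).mean())
```

Note how the replicated datasets are drawn from the very posterior computed on $x$, which is precisely the “most favourable to $x$” point of the first item above.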
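And here is a small numerical illustration of the last point, again under assumptions of my own (the two datasets, the parameterisation of the negative binomial, and the profiling over its size parameter): two datasets with the same sum lead to wildly different Poisson versus negative binomial fits, since only the second one is overdispersed.

```python
# Sketch: the sum of the observations is sufficient within the Poisson
# and the negative binomial models, yet does not settle their comparison.
import numpy as np
from scipy import stats
from scipy.optimize import minimize_scalar

even = np.array([5, 5, 5, 5, 5, 5])       # sum = 30, equidispersed
spread = np.array([0, 0, 0, 0, 0, 30])    # sum = 30, overdispersed

def poisson_loglik(data):
    lam = data.mean()                      # Poisson MLE of the rate
    return stats.poisson.logpmf(data, lam).sum()

def negbin_loglik(data):
    # profile likelihood in the size parameter r, with the success
    # probability p chosen so that the NB mean equals the sample mean
    m = data.mean()
    def neg_ll(log_r):
        r = np.exp(log_r)
        p = r / (r + m)
        return -stats.nbinom.logpmf(data, r, p).sum()
    res = minimize_scalar(neg_ll, bounds=(-5, 10), method="bounded")
    return -res.fun

for data in (even, spread):
    print(data, "Poisson:", round(poisson_loglik(data), 2),
          "NegBin:", round(negbin_loglik(data), 2))
```

Both datasets give the same value of the “sufficient” statistic, yet the likelihood comparison between the two models differs enormously between them.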
September 11, 2009 at 12:28 am
[…] twice”, which is a recurring argument in the criticisms of predictive Bayes inference. I have difficulties with the concept in general and, in the present case, there is no difficulty with using […] to predict […]
May 9, 2009 at 3:27 am
Yes, I think the “using the data twice” argument is obnoxious. What irritates me is that it is not a mathematical argument at all! But it is sometimes used by people who seem to think of themselves as mathematically rigorous.
I think the distinction between p-values and u-values is helpful here. Although, as I’ve noted elsewhere, in recent years I’ve been less interested in the p-value as a summary of a model check.