posterior predictive p-values

Bayesian Data Analysis advocates in Chapter 6 using posterior predictive checks as a way of evaluating the fit of a potential model to the observed data. There is a no-nonsense feeling to it:

“If the model fits, then replicated data generated under the model should look similar to observed data. To put it another way, the observed data should look plausible under the posterior predictive distribution.”

And it aims at providing an answer to the frustrating (frustrating to me, at least) issue of Bayesian goodness-of-fit tests. There are, however, issues with the implementation, from deciding which aspect of the data or of the model is to be examined, to the “use of the data twice” sin. Obviously, this is an exploratory tool with little decisional backup, and it should be understood as a qualitative rather than quantitative assessment. As mentioned in my tutorial on Sunday (I wrote this post at Duke during O’Bayes 2013), it reminded me of Ratmann et al.’s ABCμ in that both approaches provide reference distributions against which to calibrate the observed data, most likely through a multidimensional representation. And the “use of the data twice” can be argued for or against, once a data-dependent loss function is built.

“One might worry about interpreting the significance levels of multiple tests or of tests chosen by inspection of the data (…) We do not make [a multiple test] adjustment, because we use predictive checks to see how particular aspects of the data would be expected to appear in replications. If we examine several test variables, we would not be surprised for some of them not to be fitted by the model, but if we are planning to apply the model, we might be interested in those aspects of the data that do not appear typical.”

The natural objection that a multivariate measure of discrepancy runs into multiple testing is met in the book with the reply that the idea is not to run formal tests. I still wonder how one should behave when faced with a vector of posterior predictive p-values (ppp’s).
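To fix ideas, here is a minimal sketch of the mechanics on a toy conjugate model of my own choosing (Poisson data with a Gamma prior and three illustrative test quantities, none of it taken from the book); running it returns exactly such a vector of ppp’s, one per test quantity.

```python
# Minimal sketch of posterior predictive checks with several test quantities;
# the Poisson-Gamma model and the test quantities are illustrative choices,
# not taken from Bayesian Data Analysis.
import numpy as np

rng = np.random.default_rng(0)
y = rng.poisson(4.0, size=50)              # "observed" data, simulated here
a0, b0 = 1.0, 1.0                          # Gamma(a0, b0) prior on the rate

# conjugate posterior: Gamma(a0 + sum(y), b0 + n)
a_post, b_post = a0 + y.sum(), b0 + len(y)

tests = {"mean": np.mean, "var": lambda d: d.var(ddof=1), "max": np.max}

n_rep = 5000
exceed = {name: 0 for name in tests}
for _ in range(n_rep):
    lam = rng.gamma(a_post, 1.0 / b_post)  # draw the rate from the posterior
    y_rep = rng.poisson(lam, size=len(y))  # replicate the data under that draw
    for name, T in tests.items():
        exceed[name] += T(y_rep) >= T(y)

ppp = {name: count / n_rep for name, count in exceed.items()}
print(ppp)                                 # one ppp per test quantity
```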

[Figure: evolution of the ppp as a function of the observation x]

The above picture is based on a normal mean/normal prior experiment I ran, where the ratio of prior to sampling variance increases from 100 to 10⁴. The ppp is based on the Bayes factor against a zero mean as a discrepancy. It thus grows away from zero very quickly, then levels off around 0.5, and only reaches values close to 1 for very large values of x (i.e., never in practice). I find the graph interesting because, if I use the marginal (the numerator of the Bayes factor) instead of the Bayes factor, the picture is the exact opposite. Which, I presume, does not make a difference for Bayesian Data Analysis, since both extremes are considered equally toxic… Still, still, still, we are in the same quandary as with any kind of p-value: what is extreme? what is significant? Do we again have to select the dreaded 0.05?! To see how things are going, I then simulated the behaviour of the ppp under the “true” model for the pair (θ,x), and ended up with the histograms below:

[Figure: histograms of the ppp under the true model]

These show that, under the true model, the ppp does concentrate around 0.5 (surprisingly, the range of the ppp’s hardly exceeds 0.5, and I have no explanation for this). While the ppp does not necessarily single out a wrong model, discrepancies may be spotted by values getting away from 0.5…
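For the record, here is a minimal sketch of an experiment along these lines (not the code behind the figures): a single observation x ~ N(θ,1), a prior θ ~ N(0,τ²), and the Bayes factor in favour of the zero mean as discrepancy. The exact conventions are assumptions on my part, and flipping the Bayes factor, or replacing it with a marginal, mirrors the curve.

```python
# Sketch of the normal mean/normal prior experiment (my reconstruction, not
# the code behind the figures): one observation x ~ N(theta, sigma2), prior
# theta ~ N(0, tau2), discrepancy = Bayes factor for the zero-mean null.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
sigma2, tau2 = 1.0, 100.0            # prior-to-sampling variance ratio of 100

def bf01(x):
    """Bayes factor m0(x)/m1(x) of H0: theta = 0 against theta ~ N(0, tau2)."""
    m0 = norm.pdf(x, 0.0, np.sqrt(sigma2))
    m1 = norm.pdf(x, 0.0, np.sqrt(sigma2 + tau2))
    return m0 / m1

def ppp(x_obs, n_rep=10_000):
    """P(T(x_rep) >= T(x_obs) | x_obs) under the posterior predictive."""
    w = tau2 / (tau2 + sigma2)                   # shrinkage weight
    post_mean, post_var = w * x_obs, w * sigma2  # posterior of theta given x
    x_rep = rng.normal(post_mean, np.sqrt(post_var + sigma2), size=n_rep)
    return np.mean(bf01(x_rep) >= bf01(x_obs))

# ppp as a function of x: climbs away from 0 quickly, then hovers near 0.5
for x in (0.0, 1.0, 2.0, 5.0, 20.0):
    print(f"x = {x:5.1f}   ppp = {ppp(x):.3f}")

# behaviour under the "true" model: (theta, x) drawn from prior and likelihood
thetas = rng.normal(0.0, np.sqrt(tau2), size=2000)
xs = rng.normal(thetas, np.sqrt(sigma2))
ppps = np.array([ppp(x, n_rep=2000) for x in xs])
print("ppp quantiles under the true model:",
      np.round(np.quantile(ppps, [0.1, 0.5, 0.9]), 3))
```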

“The p-value is to the u-value as the posterior interval is to the confidence interval. Just as posterior intervals are not, in general, classical confidence intervals, Bayesian p-values are not generally u-values.”

Now, Bayesian Data Analysis also carries a warning about ppp’s not being uniform under the true model (that is, not being u-values), which is just as well considering the above example, but I cannot help wondering whether the authors intended a sort of subliminal message that they are not that far from uniform. And this brings back to the forefront the difficult interpretation of the numerical value of a ppp. That is, of its calibration. For evaluating the fit of a model. Or for decision-making…
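As a point of comparison, in the same toy normal setting a genuine u-value is easy to exhibit: the prior predictive p-value P(|X| ≥ |x|), with X ~ N(0, σ²+τ²), is exactly uniform when the pair (θ,x) comes from the model, unlike the ppp simulated above. The setting and numbers below are, again, illustrative assumptions of mine.

```python
# A genuine u-value in the same toy setting: the prior predictive p-value
# P(|X| >= |x|) with X ~ N(0, sigma2 + tau2) is exactly Uniform(0,1) when
# (theta, x) is drawn from the model, unlike the ppp above.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
sigma2, tau2 = 1.0, 100.0
n_sim = 5000

thetas = rng.normal(0.0, np.sqrt(tau2), size=n_sim)  # theta from the prior
xs = rng.normal(thetas, np.sqrt(sigma2))             # x from the likelihood

u = 2.0 * (1.0 - norm.cdf(np.abs(xs), 0.0, np.sqrt(sigma2 + tau2)))
print("u-value deciles:",
      np.round(np.quantile(u, np.linspace(0.1, 0.9, 9)), 2))
# deciles close to 0.1, 0.2, ..., 0.9, as expected from a Uniform(0,1) sample,
# whereas the ppp's above pile up below (and around) 0.5
```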

8 Responses to “posterior predictive p-values”

  1. akismet-6f1b5b2aca640e3dd43860b48158ce9c Says:

    I am coming in months after the post, but I have a mundane thought: think about the simulation process as separate from evaluating the model. Simulating a dataset based on the posterior values is a step to elucidate a model: what are the observable consequences if the posterior distribution is true?

    If you’re very lucky, insights (or Enlightenment) will jump out at you from the simulation. But it may not, and measures based on the simulated dataset(s) can help with evaluating the model (after being elucidated by simulation). To the extent that you’re making judgments based on those add-on measures, you’re not really evaluating how well the model predicts new data, but you are evaluating a model based on the help you get in thinking through the “what if this is absolutely true?” consequences.

  2. Dan Simpson Says:

    It was my understanding that the cross-validated versions of these perform better, in the sense that they are uniformly distributed and can therefore give some descriptive evidence of systematic bias (if the histogram is skewed left/right, U-shaped, or bell-shaped); a toy version is sketched after this exchange.

    There’s a nice description (albeit in the context of INLA) in here http://www.math.ntnu.no/%7Ehrue/r-inla.org/papers/festschrift.pdf

    I don’t have a copy of BDA, so I don’t know if Gelman et al. have a good reason for preferring the “raw” versions…

    • Dan:

      It depends what your goal is. If your goal is to estimate out-of-sample prediction error, then cross-validation is typically better (except for its higher computational cost); we discuss this in Chapter 7 of BDA3. If your goal is to examine discrepancies between fitted model and data, then I think it makes sense to use the full posterior distribution. See this recent paper for some discussion of this point.
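    A toy version of the cross-validated p-values mentioned above (my own sketch for a normal model with known variance and a vague conjugate prior, not taken from the linked paper): the leave-one-out predictive p-value P(y_i,rep ≤ y_i | y_-i) is uniform when the model holds, so a skewed or U-shaped histogram signals misfit.

```python
# Toy cross-validated (leave-one-out) predictive p-values for a normal model
# with known variance and a vague conjugate prior on the mean; the model and
# numbers are illustrative assumptions.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
sigma, tau2 = 1.0, 100.0
y = rng.normal(0.0, sigma, size=200)          # data generated under the model

def loo_ppp(y, i):
    y_minus = np.delete(y, i)
    n = len(y_minus)
    post_var = 1.0 / (n / sigma**2 + 1.0 / tau2)      # posterior of the mean
    post_mean = post_var * y_minus.sum() / sigma**2   # given y_{-i}
    pred_sd = np.sqrt(post_var + sigma**2)            # predictive sd for y_i
    return norm.cdf(y[i], post_mean, pred_sd)         # P(y_rep <= y_i | y_{-i})

p = np.array([loo_ppp(y, i) for i in range(len(y))])
print(np.histogram(p, bins=10, range=(0, 1))[0])      # roughly flat counts
```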

  3. X: Thanks for the comments. I will need to digest your example. But, in the meantime, I’ve written a bit on the “using the data twice” issue; in particular, see here and here.

    S: You write, “I find that the posterior distributions are usually not representative of the data. What does this mean? If we take the idea of posterior predictive checks seriously, much published research will fall by the wayside.” No no no no no! We make this clear in the book (but not clear enough, I suppose): the point of the check is to understand where the model is not fitting the data, with an eye toward improving the model. There is no obligation or expectation that a model should be rejected, just because it does not fit! Rather, aspects of lack of fit are useful in helping us understand the limitations of our model. All our models are wrong. If you wanted to reject a model just because it was wrong, you could reject on theoretical grounds alone, before seeing any data at all. Posterior predictive checks are not about rejection, they’re about understanding the limitations of a fitted model.

    • Hi Andrew, Thanks for the clarification. I guess I’m still kind of lost with the following point: once I know the limitations, now what? For example, in the Newcomb data one can create a better model as you suggest in the book. This reflects a key property of the data, the generation of a few extreme values.

      My point (which I didn’t make clear) is that this will probably change your inference (the yes/no decision) process. E.g., if I use a Cauchy distribution as the generating distribution, a “significant” effect is often no longer going to be “significant”. Maybe I am wrong about this in general; I’m still in the process of testing this out with real data.

      I know, I know: you don’t like yes/no decisions. I understand your position, and it makes a lot of sense. But the fact is that in psychology-type areas we do planned experiments to make yes/no decisions, and that is not going to change; the statisticians have to adapt to that, not the researchers! :)

      So, if I write up models in Stan that reflect the properties of the data more faithfully, my inferences are probably going to change. That’s what I meant that many results would fall by the wayside.

      • I should add that it’s really great that people like Andrew and Christian are willing to engage with outsiders like me (who are end-users of statistics, so to speak). These conversations are impossible to have normally, and I have learnt a lot over the last three years as a result of following these discussions!

  4. Working with real data from psycholinguistics (reading data and the like), and carrying out pp checks of the type advocated in BDA, I find that the posterior distributions are usually not representative of the data. What does this mean? If we take the idea of posterior predictive checks seriously, much published research will fall by the wayside. I know that Andrew knows that, but it’s not clear how to proceed if the goal is a yes/no decision for theory evaluation (and whether Andrew likes it or not, that *is* the goal).

    I’m thinking of just giving up on experimental work and just concentrating on building computational models :). Worst comes to worst, we can just massage our data to confirm whatever theoretical position we like ;)
