## Bayes, reproducibility, and the quest for truth

“Avoid opinion priors, you could be held legally or otherwise responsible.”

**D**on Fraser, Mylène Bedard, Augustine Wong, Wei Lin, and Ailana Fraser wrote a paper, to appear in Statistical Science, with the above title. This paper is a continuation of Don’s assessment of Bayes procedures in earlier Statistical Science [which I discussed] and Science 2013 papers, which I would qualify, with all due respect, as a demolition enterprise [against the Bayesian approach to statistics]… The argument is similar in that “reproducibility” is to be understood as providing frequentist confidence assessment, and the authors also use “accuracy” in this sense. (As far as I can tell, there is no definition of *reproducibility* to be found in the paper.) Some priors are *matching* priors, in the (restricted) sense that they give second-order accurate frequentist coverage. Most are not matching, and none is third-order accurate, a level that may be attained by alternative approaches. Judging from the abstract, this seems to be the crux of the paper. Which is fine, but does not in my opinion qualify as a criticism of the Bayesian paradigm, given that (a) it makes no claim to frequentist coverage and (b) I see no reason why proper coverage should be connected with “truth” or “accuracy”. It truly makes no sense to me to attempt either to put a frequentist hat on posterior distributions or to check whether or not the posterior is “valid”, “true” or “actual”. I similarly consider that Efron’s “genuine priors” do not belong to the Bayesian paradigm but are, quite the opposite, anti-Bayesian, in that they suggest all priors should stem from frequency modelling, to borrow the terms of the current paper. (This is also the position of the authors, who consider they have “no Bayes content”.)

Among their arguments, the authors refer to two tragic real cases: the earthquake at L’Aquila, where seismologists were charged (and then discharged) with manslaughter for asserting there was little risk of a major earthquake, and the indictment of the pharmaceutical company Merck for the deadly side-effects of its drug Vioxx. The paper however never returns to those cases and fails to explain in which sense they are connected with the lack of reproducibility or of truth(fulness) of Bayesian procedures. If anything, the moral of the L’Aquila story is that statisticians should not draw definitive conclusions such as that there is no risk of a major earthquake, or that one is improbable. There is a strange if human tendency for experts to reach definitive conclusions and to omit the many layers of uncertainty in their models and analyses. In the earthquake case, seismologists do not know how to predict major quakes from previous activity, and that should have been the [non-]conclusion of the experts. Which could possibly have been reached by a Bayesian modelling that always includes uncertainty. But the current paper does not operate at this (epistemic?) level at all, as it never questions the impact of the choice of a likelihood function or of a statistical model on the reproducibility framework. First, third or 47th order accuracy nonetheless operates strictly within the referential of the chosen model, and providing the data to another group of scientists, experts or statisticians will invariably produce a different statistical modelling. So much for reproducibility or truth.

September 5, 2016 at 1:10 am

I agree with Dan’s point above; but what *is* worth doing is the same computations for a few realistic models that are hard to do inference for.

September 6, 2016 at 12:35 am

Computing the prior requires you to compute a 1D profile likelihood for the (1D) quantity of interest and then construct a Jeffreys prior for that profile likelihood.
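As a toy illustration of that construction (my own hypothetical sketch, not code from the paper): take i.i.d. N(mu, sigma²) data with mu the 1D quantity of interest, profile out sigma, and build a Jeffreys-style prior from the curvature of the log profile likelihood. The data set, grid, and normalisation here are all made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.5, size=20)  # toy data set
n = len(y)

def log_profile_lik(mu):
    # profile out the nuisance sigma: for fixed mu, the MLE of
    # sigma^2 is mean((y - mu)^2), giving the profile log likelihood
    s2_hat = np.mean((y - mu) ** 2)
    return -0.5 * n * (np.log(s2_hat) + 1.0)

# evaluate the 1D profile likelihood on a grid for mu
mus = np.linspace(0.0, 4.0, 401)
lp = np.array([log_profile_lik(m) for m in mus])

# Jeffreys-style prior from the profile likelihood: proportional to the
# square root of its (negative) curvature, via finite differences
h = mus[1] - mus[0]
curvature = -np.gradient(np.gradient(lp, h), h)
prior = np.sqrt(np.clip(curvature, 0.0, None))
prior /= prior.sum() * h  # normalise on the grid

# note: the prior is built from the observed y, i.e. it is data-dependent
```

Even in this simplest of settings the “prior” is a function of the observed sample, which is exactly what makes such constructions uncomfortable.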

Everything about that procedure says that it’s not relevant to realistic models.

(Or, let’s face it, a realistic method of prior construction. I have a very catholic view of what is and isn’t Bayesian, but even I’m nervous about such unbelievably data-dependent prior constructions!)

September 8, 2016 at 2:31 am

At first blush, and from a geometric point of view, I think Fraser’s basic approach could make a lot of sense for more complex problems. Can you think of a nice ‘intermediate-level’ difficulty problem that it might make sense to compare them on?

September 4, 2016 at 4:08 am

In Fraser’s classification of ‘subjective’, ‘mathematical’ and ‘frequency’ priors, where would ‘regularisation’ priors fit? Eg for stabilising small sample estimates, constraining model complexity etc. Gelman’s example of a single observation y=1 from a N(theta,1) comes to mind. In this case the Jeffreys prior matches the confidence approach example, but is either desirable in this case?
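For concreteness, here is a quick numerical check of that example (my own sketch): with a single observation y = 1 from N(theta, 1), the flat Jeffreys prior gives posterior N(1, 1), and the central 95% credible interval coincides with the standard 95% confidence interval y ± 1.96.

```python
from statistics import NormalDist

y = 1.0  # single observation from N(theta, 1)

# Jeffreys (flat) prior for a location parameter: posterior is N(y, 1);
# invert the posterior cdf for the central 95% credible interval
post = NormalDist(mu=y, sigma=1.0)
cred = (post.inv_cdf(0.025), post.inv_cdf(0.975))

# standard frequentist 95% confidence interval: y +/- 1.96
z = NormalDist().inv_cdf(0.975)
conf = (y - z, y + z)

# the two intervals coincide in this example
```

Which is precisely the matching the paper celebrates, while leaving open whether either interval is a sensible answer here.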

September 4, 2016 at 4:09 am

Oops, meant ‘matches exactly’ not ‘matches example’

September 4, 2016 at 10:21 am

I really cannot say, I presume they belong to mathematical priors. But those categories are not particularly meaningful, are they?!

September 4, 2016 at 2:47 pm

Yes, there’s a lot that is quite ambiguous to me (though I find some of the results in the paper interesting).

And, as you say, reproducibility is not really defined. So the criteria of evaluation are somewhat unclear.

If you just want to record where the data are relative to a model, a p-curve (observed cdf as a function of the parameter) seems fine. So finding different representations of this (e.g. sample space vs parameter space integrals) might (or might not!) be useful.
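As a hypothetical sketch of such a p-curve (my notation, continuing the single-observation N(theta, 1) example): tabulate the observed cdf P(Y ≤ y_obs; theta) as theta varies.

```python
from statistics import NormalDist

y_obs = 1.0  # observed data point, model Y ~ N(theta, 1)

def p_curve(theta):
    # observed cdf as a function of the parameter: P(Y <= y_obs | theta)
    return NormalDist(mu=theta, sigma=1.0).cdf(y_obs)

# the curve sweeps from 1 down to 0 as theta moves past the data;
# p-values near 0 or 1 flag thetas for which y_obs sits in the tails,
# and p = 0.5 exactly at theta = y_obs
```

The sample-space vs parameter-space representations mentioned above would be different ways of computing or displaying this same curve.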

But if you want to go further and ‘draw an inference’, then this typically carries some implications about ‘true’ parameter values or repetition properties or updating states of information etc etc. I guess I then want to know what assumptions you are making about available contextual information and how these impact your inferences. (And how you interpret your inferences.)

That Fisher’s fiducial or Neyman’s confidence match Jeffreys’ ‘objective’ Bayes in simple cases, and perhaps even make similar assumptions under these circumstances – for Fisher in particular e.g. no recognisable subsets etc – is interesting, but perhaps none of these is appropriate for a particular problem.

For example, what is the desired outcome of a method applied to Gelman’s example? Is it a well-defined problem as stated? What else would need to be added?

Again, this gets to what is meant by ‘reproducibility’ and other ‘desirable’ criteria, and in the more ‘general’ sense of the terms (Fraser appears to freely equate particular technical notions with general ‘Good Science’ notions that use similar words).

When methods disagree on more complex problems, what are the criteria on which to decide? Say for argument’s sake that Fraser’s approach preserves something more akin to what Jeffreys intended but applicable to more complex problems. Is that a good thing? I think there is still too much question begging and focus on ‘internal consistency’ (re: Dan’s comment) here, with a corresponding lack of ‘external motivation’ and connection of some of the technical terms to their scientific or philosophical motivations.

Now, if Fraser had just presented these results (which have also largely appeared elsewhere in his and others’ previous work) in a fairly straightforward way, i.e. ‘here is when these sorts of procedures agree, here is when they don’t’, then fine, and even interesting.

But by raising the more philosophical questions and calling one approach ‘reproducible’ (a Good Thing) and one not (a Bad Thing) I think more needs to be done to explain why the technical frequency reproducibility condition, given a particular model and a single dataset, corresponds to the scientific or philosophical concept, which does not occur in such a vacuum.

Phew, long rambling comment, sorry.

September 3, 2016 at 1:19 am

So I don’t think this article is very good. I think it’s internally consistent and mathematically correct (which is under most circumstances my bar for “publishable”), but my main thought when reading it is that people should be able to see the problems with it. Seriously: third order (ish) matching won’t fix a crappy model, and the examples are so simplistic that they’re essentially useless. There’s no serious Bayesian critique here; at best it’s a critique of bad statistics. And that rubbish about the earthquakes was a non-sequitur that wasn’t even remotely supported by the scientific content of the paper.

I’m increasingly of the opinion that a lot of good Bayesian models should have some frequency properties (in the sense that the quality of repeated inferences is often important in practice), but in the cases where this matters and you’re using a Bayesian method, it’s probably impossible to verify those properties with any classical frequentist method. So asking for matching to something that doesn’t exist is silly. And in the incredibly contrived examples in the paper, who on earth cares?

September 5, 2016 at 6:05 am

I most respectfully disagree with the frequency part. What matters to me is that a procedure can be calibrated and its behaviour evaluated under several scenarios. Given that I will never see the same data [and its generating mechanism] again, frequency does not matter. So to speak.

September 6, 2016 at 12:29 am

That’s true. I just view frequentist methods (and learning-theory tail bounds) as a way of generating a set of “calibration” measures. They’re not always appropriate, but they should always be at hand. And when they’re not appropriate, they can still be a useful jumping-off point for deciding which scenarios to evaluate the data under.

Just like testing procedures, frequency methods give very detailed answers to very specific questions. Just because they’re not the questions you necessarily want to ask doesn’t mean that they’re not useful exploratory tools.