revised (lower?) standards for statistical evidence

Valen Johnson published a follow-up paper to his Annals of Statistics paper on uniformly most powerful Bayesian tests (UMPBTs). This one aims at reaching a wider audience by (a) publishing in PNAS, (b) linking the lack of reproducibility in scientific research with the improper use of significance levels, and (c) proposing to move from 0.05 to 0.005 or even 0.001. (As noted in a previous post, the clarity of the proposal was bound to attract journalists.) The criticism of the significance level and of the sacrosanct status of 0.05 (not so much in fields like Physics or Astronomy) is not novel; see for instance the extreme The cult of significance, reviewed here two years ago. But most of the PNAS paper is dedicated to the technical derivation of UMPBTs, rather than to providing strong arguments in their favour. There is no discussion of the analysis of true small effects, i.e. whether or not significance tests should be used in such cases. (Not!) The overall messages of the paper are thus unsubstantiated, except the obvious one that lowering the significance level will lower the number of false positives.

“Modifications of common standards of evidence are proposed to reduce the rate of nonreproducibility of scientific research by a factor of 5 or greater.”

My first reaction to the proposal is that moving from one reference significance level to another reference significance level does not change the existing criticisms one iota. Namely, adopting another standard for blind rejection of the null remains blindly rejecting the null. (The same criticism applies to Jeffreys’ scale, mind you.) Furthermore, as explained in my earlier criticism, whose points obviously apply here, the resolution proposed by Valen is only Bayesian by the terms it uses, as it relies on least favourable priors in a minimax sense. It also almost removes the notion of alternative hypotheses from the picture, a notion that is in my opinion the strongest appeal of the Bayesian perspective on decisional model selection. What Valen built there is a goodness of fit procedure with a Bayesianish flavour. Not a Bayesian test. As a marginal note, I fail to understand the point of Fig. 1 and 2. The “strong curvilinear relation between” p-values and UMPBT “Bayes” factors is a consequence of both of them depending on the … data. It would actually have been slightly more pertinent to compare p-values and posterior probabilities (under “equipoise”).
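To make the remark about Fig. 1 and 2 concrete, here is a minimal sketch, assuming the one-sided z test with known variance, where the UMPBT alternative sits at a standardized distance δ = √(2 log γ) from the null and the Bayes factor is exp(zδ − δ²/2). The function names are hypothetical and the formula is my reading of the z-test case, not code from the paper. Both the p-value and the Bayes factor are monotone functions of the same statistic z, hence of each other.

```python
from math import exp, log, sqrt
from statistics import NormalDist

GAMMA = 3.87                  # evidence threshold quoted in the paper
DELTA = sqrt(2 * log(GAMMA))  # standardized UMPBT alternative (assumed z-test form)


def p_value(z):
    """One-sided p-value of the z statistic."""
    return 1 - NormalDist().cdf(z)


def umpbt_bayes_factor(z, delta=DELTA):
    """Bayes factor of the point alternative at delta against the null (assumed form)."""
    return exp(z * delta - delta ** 2 / 2)


def posterior_null(z, delta=DELTA):
    """Posterior probability of the null under equipoise, P(H0) = 1/2."""
    return 1 / (1 + umpbt_bayes_factor(z, delta))


if __name__ == "__main__":
    for z in (1.0, 1.645, 2.0, 2.576, 3.0):
        print(f"z={z:5.3f}  p={p_value(z):.4f}  "
              f"BF={umpbt_bayes_factor(z):7.2f}  P(H0|z)={posterior_null(z):.3f}")
```

Plotting one column against another can only produce a smooth curve, since each is a deterministic transform of z.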

Second, Valen’s proposal depends upon a choice of an evidence threshold γ that seems to be calibrated against the standard significance test, as illustrated on page 2 with the z test. The value γ = 3.87 is chosen so that “the rejection region of the resulting test exactly matches the rejection region of a one-sided 5% significance test”. This means further that γ also seems to be calibrated against the uniformly most powerful (or least favourable) Bayes factor constructed for this purpose. Thus, γ is relative, as opposed to absolute, for two reasons. This means that the whole notion of uniformly most powerful tests is somehow tautological, rather than induced by the real problem, and that comparing another Bayes factor (i.e., for another alternative) to the same threshold is meaningless.
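The calibration can be checked numerically. The sketch below assumes the one-sided z test with known variance, for which the UMPBT rejects when z > √(2 log γ), so that matching a level-α test amounts to γ = exp(z_α²/2). That relation is my reading of the z-test case and the function names are hypothetical.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def gamma_for_alpha(alpha):
    """Evidence threshold whose rejection region matches a one-sided level-alpha z test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    return exp(z_alpha ** 2 / 2)


def alpha_for_gamma(gamma):
    """Significance level of the one-sided z test matched by threshold gamma."""
    return 1 - NormalDist().cdf(sqrt(2 * log(gamma)))


if __name__ == "__main__":
    for alpha in (0.05, 0.005, 0.001):
        print(f"alpha={alpha}: gamma={gamma_for_alpha(alpha):.2f}")
    print(f"gamma=3.87: alpha={alpha_for_gamma(3.87):.4f}")
```

Under this assumption, γ is nothing but a relabelling of α, which is the sense in which the threshold is relative rather than absolute.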

“Although it is difficult to assess the proportion of all tested null hypotheses that are actually true, if one assumes that this proportion is approximately one-half, then these results suggest that between 17% and 25% of marginally significant scientific findings are false.”

Third, the discussion that revolves around the above quote, while very attractive for the media and plain enough for the general public, is unrelated to any statistical ground in the paper, be it frequentist or Bayesian, even though it is in step with Jim Berger’s earlier evaluations. First, I dispute the assumption that the proportion is a half: considering that only borderline hypotheses are run through statistical tests, I would think the proportion is much higher. Second, I do not see what “these results” refer to. The only evidence found therein is the “distribution” of p-values in Fig. 3, which amalgamates reported p-values from 855 t-tests found in the literature, with no correction for sample size, truncation, censoring, and a myriad of other possible variations from the formal property that “the nominal distribution of p-values is uniformly distributed on the range (0.0,0.05)”.
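To see how much the quoted percentages hinge on the one-half assumption, here is a minimal sketch of the posterior probability of the null given a Bayes factor in favour of the alternative and a prior proportion of true nulls. The Bayes factors of 3 and 5 are an assumption on my part for marginally significant results (they happen to reproduce the 17% and 25% of the quote), and the function name is hypothetical.

```python
def posterior_null_probability(bayes_factor, prior_null=0.5):
    """P(H0 | data) given the Bayes factor of H1 against H0 and the prior P(H0)."""
    prior_alt = 1 - prior_null
    return prior_null / (prior_null + prior_alt * bayes_factor)


if __name__ == "__main__":
    for prior_null in (0.5, 0.8, 0.9):
        for bayes_factor in (3.0, 3.87, 5.0):
            post = posterior_null_probability(bayes_factor, prior_null)
            print(f"P(H0)={prior_null:.1f}  BF={bayes_factor:4.2f}  P(H0|data)={post:.3f}")
```

With equipoise the output stays between one sixth and one quarter, but as soon as the proportion of true nulls is raised to 0.8 the same Bayes factors leave the null with a posterior probability around one half.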
