## Adaptive revised standards for statistical evidence [guest post]

Posted in Books, Statistics, University life on March 25, 2014 by xi'an

[Here is a discussion of Valen Johnson’s PNAS paper written by Luis Pericchi, Carlos Pereira, and María-Eglée Pérez, in conjunction with an arXived paper of theirs I never came to discuss. This has been accepted by PNAS along with a large number of other letters. Our discussion permuting the terms of the original title also got accepted.]

Johnson argues for decreasing the bar of statistical significance from 0.05 and 0.01 to 0.005 and 0.001, respectively. There is growing evidence that the canonical fixed standards of significance are inappropriate. However, the author simply proposes other fixed standards. The essence of the problem with classical significance testing lies in its goal of minimizing type II error (false negatives) for a fixed type I error (false positives). A real departure would instead be to minimize a weighted sum of the two errors, as proposed by Jeffreys. Significance levels that are constant with respect to sample size do not balance the errors. Levels of 0.005 and 0.001 will certainly lower false positives (type I error) at the expense of increasing type II error, unless the study is carefully designed, which is not always the case or even possible. If the sample size is small, the type II error can become unacceptably large. On the other hand, for large sample sizes, 0.005 and 0.001 may still be too high.

Consider the Psychokinetic data (Good): the null hypothesis is that individuals cannot change by mental concentration the proportion of 1’s in a sequence of n = 104,490,000 0’s and 1’s, generated originally with a proportion of 1/2. The proportion of 1’s recorded was 0.5001768. The observed p-value is p = 0.0003, so even under the proposed revision of standards the null hypothesis is rejected and a Psychokinetic effect is claimed. This is contrary to intuition and to virtually any Bayes factor. To make the standards adaptive to the amount of information (see also Raftery), Pérez and Pericchi approximate the behavior of Bayes factors by

$\alpha_{\mathrm{ref}}(n)=\alpha\,\dfrac{\sqrt{n_0(\log(n_0)+\chi^2_\alpha(1))}}{\sqrt{n(\log(n)+\chi^2_\alpha(1))}}$
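As a quick sanity check of the quoted p = 0.0003, one can reproduce it with the usual normal approximation to the binomial under the null proportion 1/2 (a sketch, not part of the original analysis):

```python
import math
from scipy.stats import norm

# Psychokinetic data (Good): n Bernoulli(1/2) trials under the null
n = 104_490_000
p_hat = 0.5001768  # observed proportion of 1's

# Two-sided p-value via the normal approximation to the binomial
z = (p_hat - 0.5) / math.sqrt(0.25 / n)
p_value = 2 * norm.sf(z)  # roughly 0.0003, matching the quoted value
```

Since 0.0003 is below both 0.005 and 0.001, the proposed fixed standards would still reject the null here.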

This formula establishes a bridge between carefully designed tests and the adaptive behavior of Bayesian tests: the value n0 comes from a theoretical design for which values of both errors have been specified, and n is the actual (larger) sample size. In the Psychokinetic data, n0 = 44,529 for a type I error of 0.01 and a type II error of 0.05 to detect a difference of 0.01. Then αref(104,490,000) = 0.00017, and the null of no Psychokinetic effect is accepted.
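The adaptive level is straightforward to compute; here is a minimal sketch that plugs the Psychokinetic numbers into the Pérez–Pericchi formula above, taking χ²α(1) as the upper-α quantile of a chi-square with 1 degree of freedom:

```python
import math
from scipy.stats import chi2

def alpha_ref(n, n0, alpha):
    """Adaptive significance level alpha_ref(n) of Perez and Pericchi:
    alpha * sqrt(n0 (log n0 + chi2_alpha(1))) / sqrt(n (log n + chi2_alpha(1)))."""
    q = chi2.ppf(1 - alpha, df=1)  # upper-alpha chi-square quantile, 1 d.f.
    return alpha * math.sqrt(n0 * (math.log(n0) + q)) / math.sqrt(n * (math.log(n) + q))

# Psychokinetic example: design size n0 = 44,529 for alpha = 0.01
a = alpha_ref(n=104_490_000, n0=44_529, alpha=0.01)  # about 0.00017
```

With the observed p = 0.0003 above this adaptive threshold of about 0.00017, the null of no Psychokinetic effect is retained, in line with the Bayes factor.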

A simple constant recipe is not the solution to the problem. The standard by which evidence is judged should be a function of the amount of information. Johnson’s main message is to toughen the standards and design experiments accordingly. This is welcome whenever possible. But it does not balance type I and type II errors: it would be misleading to pass the message “use the old standards divided by ten”, regardless of type II errors and sample sizes. This would move the problem without solving it.