## statistical significance as explained by The Economist

**T**here is a long article in The Economist of this week (also making the front cover), which discusses how and why many published research papers have unreproducible and most often “wrong” results. Nothing immensely new there, esp. if you read Andrew’s blog on a regular basis, but the (anonymous) writer(s) take(s) pains to explain how this relates to statistics and in particular to the statistical testing of hypotheses. The above is an illustration from this introduction to statistical tests (and their interpretation).

“First, the statistics, which if perhaps off-putting are quite crucial.”

It is not the first time I have spotted a statistics-backed article in this journal, so I assume it has either journalists with a statistics background or links with (UK?) statisticians. The description of why statistical tests can err is fairly (Type I – Type II) classical. Incidentally, it reports a finding of Ioannidis that, when reporting a positive at level 0.05, the expectation of a false positive rate of one out of 20 is “highly optimistic”. An evaluation opposed to, e.g., Berger and Sellke (1987), who reported a too-early rejection in a large number of cases. More interestingly, the paper stresses that this classical approach ignores “the unlikeliness of the hypothesis being tested”, which I interpret as the prior probability of the hypothesis under test.
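To see why this prior probability matters, here is a minimal sketch of the bookkeeping. The inputs are illustrative assumptions of mine (1,000 tested hypotheses, a level of 0.05, and a single common power of 0.8 for every true effect), and `screening_counts` is a hypothetical helper name, not anything taken from the article:

```python
def screening_counts(n_true, n_null, power, alpha):
    """Expected outcomes when n_true real effects and n_null true nulls are all tested."""
    true_pos = power * n_true        # real effects correctly detected
    false_neg = n_true - true_pos    # real effects missed
    false_pos = alpha * n_null       # true nulls wrongly rejected
    true_neg = n_null - false_pos    # true nulls correctly retained
    # Share of the reported "positives" that are in fact true nulls
    false_discovery = false_pos / (true_pos + false_pos)
    return {
        "true_pos": true_pos,
        "false_neg": false_neg,
        "false_pos": false_pos,
        "true_neg": true_neg,
        "false_discovery": false_discovery,
    }


# Assumed scenario 1: 100 of the 1,000 tested hypotheses are true
common = screening_counts(n_true=100, n_null=900, power=0.8, alpha=0.05)
# Assumed scenario 2: same tests, but only 10 of the 1,000 are true
rare = screening_counts(n_true=10, n_null=990, power=0.8, alpha=0.05)

print(common["false_discovery"], rare["false_discovery"])
```

Under these assumed inputs, about 36% of the rejections are false in the first scenario, well above the nominal one in 20, and the share rises to about 86% in the second. The expected number of missed effects in the first scenario is 20, that is, 20% of the 100 true hypotheses and not of the 1,000 tested.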

“Statisticians have ways to deal with such problems. But most scientists are not statisticians.”

The paper also reports on the lack of power in most studies, a report that I find a bit bizarre and even meaningless in its attempt to compute an overall power across studies, researchers, and even fields. Even in a single study, the alternative to “no effect” is composite, hence has a power that depends on the unknown value of the parameter. Seeking a single value for the power requires some prior distribution on the alternative.
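As a hedged illustration of this point, consider the simplest textbook case, which is my choice and not the article's: a one-sided z-test of "no effect" (mean zero) with known standard deviation. The power is then a function of the unknown effect `delta`, and collapsing it into one number means averaging over some weights on `delta`. The function names and the discrete three-point prior below are hypothetical choices made only for the sketch:

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, mean 0 and sd 1


def power_one_sided_z(delta, n, alpha=0.05, sigma=1.0):
    """Power of the level-alpha z-test of H0: mu = 0 against mu > 0 when the true mean is delta."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha)  # rejection threshold for the z statistic
    return 1 - STD_NORMAL.cdf(z_crit - delta * n ** 0.5 / sigma)


def average_power(prior_points, n, alpha=0.05, sigma=1.0):
    """Power averaged over a discrete prior given as (delta, weight) pairs with weights summing to 1."""
    return sum(w * power_one_sided_z(d, n, alpha, sigma) for d, w in prior_points)


# Power changes with the (unknown) effect size; at delta = 0 it equals alpha
for delta in (0.0, 0.1, 0.3, 0.5):
    print(delta, round(power_one_sided_z(delta, n=25), 3))

# Assumed prior on the alternative: three effect sizes with arbitrary weights
prior = [(0.1, 0.5), (0.3, 0.3), (0.5, 0.2)]
print(round(average_power(prior, n=25), 3))
```

Changing the weights in `prior` changes the single "power" value, even though the test and the sample size stay the same.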

“Peer review’s multiple failings would matter less if science’s self-correction mechanism—replication—was in working order.”

The next part of the paper covers the failings of peer review, which I discussed in the ISBA Bulletin, but it seems to me too easy to blame the referee for failing to spot statistical or experimental errors when lacking access to the data or to the full experimental methodology, and when under pressure to return (for free) a report within a short time window. The best that can be expected is that a referee detects the implausibility of a claim or an obvious methodological or statistical mistake. These are not math papers! And, as pointed out repeatedly, not all referees are statistically numerate…

“Budding scientists must be taught technical skills, including statistics.”

The last part discusses possible solutions to achieve reproducibility and hence higher confidence in experimental results. Paying for independent replication is the proposed solution, but it can obviously only apply to a small fraction of all published results. And having control bodies testing labs and teams at random following a major publication seems rather unrealistic, if only because of the difficulty of staffing such bodies with able controllers… An interesting if pessimistic debate, *in fine*. And fit for the International Year of Statistics.

November 13, 2013 at 8:22 am

Is the 20 false negatives correct? If it’s 80% power, wouldn’t it be 200 out of 1000 false negatives? Confused.

November 12, 2013 at 4:41 pm

[…] believed the crises in science will abate if we only educate everyone on the correct interpretation of […]

November 11, 2013 at 5:24 am

My blogpost on this: http://errorstatistics.com/2013/11/09/beware-of-questionable-front-page-articles-warning-you-to-beware-of-questionable-front-page-articles-i/

November 8, 2013 at 3:28 pm

I had a bit of fun at the whole “we can save classical statistics if we just emphasize ‘fail to reject’ is not the same as ‘accept’ strongly enough” here:

http://www.entsophy.net/blog/?p=195

November 7, 2013 at 1:12 am

Just from looking at your post, not the article, it seems to me another example of a crass use of tests as interested in buckets of nulls and alternatives (as opposed to the case at hand). A power of .8 doesn’t say the test confirms 80% of the true hypotheses, as they assert. It means the probability that the test would correctly reject the null, under the assumption that the discrepancy from the null were of a given magnitude d, is .8. (I don’t know if it’s 1- or 2-sided.) But never mind the paper, which I don’t want to discuss out of context. Just a remark on something you say. The power doesn’t depend on a prior or on knowing the alternative any more than interpreting the result of a measuring tool requires knowing the true measurement ahead of time (or its probability). I can evaluate which discrepancies a test will detect with high power (I prefer to consider P(a p-value less than observed; values of the discrepancy)). If it has a high probability of producing so small a p-value (or smaller) even with an underlying discrepancy no larger than d’, then the observed p-value is a poor indication of a discrepancy even larger than d’.

And, of course, Berger and Sellke’s “too early” allegation depends on a spiked prior on the null, leading to results scarcely desirable for an error statistician. But you know I’ve discussed all this elsewhere.