Next Fall, on 15-16 September, I will take part in a CRiSM workshop on hypothesis testing, in our department in Warwick. The registration is now open [until Sept 2] with a moderate registration fee of £40 and a call for posters. Jim Berger and Joris Mulder will both deliver a plenary talk there, while Andrew Gelman will alas give a remote talk from New York. (A terrific poster by the way!)
Archive for statistical tests
There is a long article in The Economist of this week (also making the front cover), which discusses how and why many published research papers have unreproducible and most often “wrong” results. Nothing immensely new there, esp. if you read Andrew’s blog on a regular basis, but the (anonymous) writer(s) take(s) pains to explain how this relates to statistics and in particular to statistical testing of hypotheses. The above is an illustration from this introduction to statistical tests (and their interpretation).
“First, the statistics, which if perhaps off-putting are quite crucial.”
It is not the first time I spot a statistics-backed article in this journal and so I assume it has either journalists with a statistics background or links with (UK?) statisticians. The description of why statistical tests can err is fairly (Type I – Type II) classical. Incidentally, it reports a finding of Ioannidis that, when reporting a positive at level 0.05, the expectation of a false positive rate of one out of 20 is “highly optimistic”. An evaluation opposed to, e.g., Berger and Sellke (1987) who reported a too-early rejection in a large number of cases. More interestingly, the paper stresses that this classical approach ignores “the unlikeliness of the hypothesis being tested”, which I interpret as the prior probability of the hypothesis under test.
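To make the role of this prior probability concrete, here is a minimal sketch of the share of “significant” findings that are false positives, as a function of the prior probability that a tested hypothesis is true. The function name, the level of 0.05 and the power of 0.8 are illustrative assumptions of mine, not figures taken from the article.

```python
def false_positive_share(prior_true, alpha=0.05, power=0.8):
    """Share of rejections of the null that are false positives.

    prior_true: assumed prior probability that the tested effect is real.
    alpha:      Type I error rate of the test (illustrative default).
    power:      probability of detecting a real effect (illustrative default).
    """
    true_positives = prior_true * power
    false_positives = (1.0 - prior_true) * alpha
    return false_positives / (true_positives + false_positives)


# The rarer the true effects, the larger the share of false positives.
for prior in (0.5, 0.1, 0.01):
    share = false_positive_share(prior)
    print(f"P(effect is real) = {prior:.2f} -> false positive share = {share:.3f}")
```

Under these assumed values, the share only comes close to the nominal one-in-twenty when about half of the tested hypotheses are true, and it grows quickly as true effects become rarer.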
“Statisticians have ways to deal with such problems. But most scientists are not statisticians.”
The paper also reports on the lack of power in most studies, a report that I find a bit bizarre and even meaningless in its claim to compute an overall power, all across studies and researchers and even fields. Even in a single study, the alternative to “no effect” is composite, hence has a power that depends on the unknown value of the parameter. Seeking a single value for the power requires some prior distribution on the alternative.
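As an illustration of this point, the sketch below computes the power of a one-sided z-test of “no effect” for several values of the true effect, and then a single averaged power under a discrete prior on the alternative. The unit variance, the sample size, the candidate effects and the prior weights are all assumptions chosen for the example.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, from the Python standard library


def power_one_sided_z(effect, n, alpha=0.05, sigma=1.0):
    """Power of the z-test of H0: mu = 0 against mu > 0 when the true mean is `effect`."""
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha)
    shift = effect * n ** 0.5 / sigma
    return 1.0 - STD_NORMAL.cdf(z_crit - shift)


def prior_averaged_power(effects, weights, n, alpha=0.05):
    """Single 'overall' power, obtained by averaging over an assumed prior on the effect."""
    total = sum(weights)
    return sum(w * power_one_sided_z(e, n, alpha) for e, w in zip(effects, weights)) / total


effects = [0.1, 0.3, 0.5]   # hypothetical effect sizes under the alternative
weights = [0.5, 0.3, 0.2]   # hypothetical prior weights on those effects

for e in effects:
    print(f"effect = {e:.1f} -> power = {power_one_sided_z(e, n=20):.3f}")
print(f"prior-averaged power = {prior_averaged_power(effects, weights, n=20):.3f}")
```

The power is a function of the effect, not a number, and the averaged value changes whenever the weights do, which is the sense in which a single figure presupposes a prior.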
“Peer review’s multiple failings would matter less if science’s self-correction mechanism—replication—was in working order.”
The next part of the paper covers the failings of peer review, which I discussed in the ISBA Bulletin, but it seems to me too easy to blame the referee for failing to spot statistical or experimental errors, when lacking access to the data or to the full experimental methodology and when under pressure to return (for free) a report within a short time window. The best that can be expected is that a referee detects the implausibility of a claim or an obvious methodological or statistical mistake. These are not math papers! And, as pointed out repeatedly, not all referees are statistically numerate….
“Budding scientists must be taught technical skills, including statistics.”
The last part discusses possible solutions to achieve reproducibility and hence higher confidence in experimental results. Paying for independent replication is the proposed solution but it can obviously only apply to a small fraction of all published results. And having control bodies testing at random labs and teams following a major publication seems rather unrealistic, if only for filling the teams of such bodies with able controllers… An interesting if pessimistic debate, in fine. And fit for the International Year of Statistics.
Here are comments by Olli following my post:
I think we found a general means to obtain accurate ABC in the sense of matching the posterior mean or MAP exactly, and then minimising the KL distance between the true posterior and its ABC approximation subject to this condition. The construction works on an auxiliary probability space, much like indirect inference. Now, we construct this probability space empirically, this is where our approach differs first from indirect inference and this is where we need the “summary values” (>1 data points on a summary level; see Figure 1 for clarification). Without replication, we cannot model the distribution of summary values but doing so is essential to construct this space. Now, let’s focus on the auxiliary space. We can fiddle with the tolerances (on a population level) and m so that on this space, the ABC approximation has the aforesaid properties. All the heavy technical work is in this part. Intuitively, as m increases, the power increases for sufficiently regular tests (see Figure 2) and consequently, for calibrated tolerances, the ABC approximation on the auxiliary space gets tighter. This offsets the broadening effect of the tolerances, so having non-identical lower and upper tolerances is fine and does not hurt the approximation. Now, we need to transport the close-to-exact ABC approximation on the auxiliary space back to the original space. We need some assumptions here, and given our time series example, it seems these are not unreasonable. We can reconstruct the link between the auxiliary space and the original parameter space as we accept/reject. This helps us understand (with the videos!) the behaviour of the transformation and to judge if its properties satisfy the assumptions of Theorems 2-4. While we offer some tools to understand the behaviour of the link function, yes, we think more work could be done here to improve on our first attempt at accurate ABC.
Now some more specific comments:
“The paper also insists over and over on sufficiency, which I fear is a lost cause.” To clarify, all we say is that on the simple auxiliary space, sufficient summaries are easily found. For example, if the summary values are normally distributed, the sample mean and the sample variance are sufficient statistics. Of course, this is not the original parameter space and we only transform the sufficiency problem into a change of variable problem. This is why we think that inspecting and understanding the link function is important.
“Another worry is that the … test(s) rel(y) on an elaborate calibration”. We provide some code here for everyone to try out. In our examples, this did not slow down ABC considerably. We generally suppose that the distribution of the summary values is simple, like Gaussian, Exponential, Gamma, ChiSquare, Lognormal. In these cases, the ABC approximation takes on an easy-enough-to-calibrate-fast functional form on the auxiliary space.
“This Theorem 3 sounds fantastic but makes me uneasy: unbiasedness is a sparse property that is rarely found in statistical problems. … Witness the use of “essentially unbiased” in Fig. 4.” What Theorem 3 says is that if unbiasedness can be achieved on the simple auxiliary space, then there are regularity conditions under which these properties can be transported back to the original parameter space. We hope to illustrate these conditions with our examples, and to show that they hold in quite general cases such as the time series application. The thing in Figure 4 is that the sample autocorrelation is not an unbiased estimator of the population autocorrelation. So unbiasedness does not quite hold on the auxiliary space and the conditions of Theorem 3 are not satisfied. Nevertheless, we found this bias to be rather negligible in our example and the bigger concern was the effect of the link function.
And here are Olli’s slides:
As posted in the previous entry, Olli Ratmann, Anton Camacho, Adam Meijer, and Gé Donker arXived their paper on accurate ABC. A paper which [not whose!] avatars I was privy to in the past six months! While I acknowledge the cleverness of the reformulation of the core ABC accept/reject step as a statistical test, and while we discussed the core ideas with Olli and Anton when I visited Gatsby, the paper still eludes me in some respects… Here is why. (Obviously, you should read this rich & challenging paper first for the comments to make any sense! And even then they may make little sense…)
The central idea of this accurate ABC [aABC? A²BC?] is that, if the distribution of the summary statistics is known and if replicas of those summary statistics are available for the true data (and less problematically for the generated data), then a classical statistical test can be turned into a natural distance measure for each statistic and even “natural” bounds can be found on that distance, to the point of recovering most properties of the original posterior distribution… A first worry is this notion that the statistical distribution of a collection of summary statistics is available in closed form: this sounds unrealistic even though it may not constitute a major issue of contention. Indeed, replacing a tailored test with a distribution-free test of identical location parameter could not hurt that much. [Just the power. If that matters… See below.] The paper also insists over and over on sufficiency, which I fear is a lost cause. In my current understanding of ABC, the loss of some amount of information contained in the data should be acknowledged and given a write-off as a Big Data casualty. (See, e.g., Lemma 1.)
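For readers less familiar with the accept/reject step being reformulated, here is a minimal sketch of plain rejection ABC on a toy normal-mean model. This is the standard algorithm and not the calibrated procedure of the paper: the model, the uniform prior, the sample mean as summary, and the tolerance of 0.1 are all assumptions made for illustration. The acceptance condition on the distance between summaries is the piece that can be read as the acceptance region of a test.

```python
import random
import statistics


def abc_rejection(observed, prior_draw, simulate, tolerance, n_draws, rng):
    """Plain rejection ABC: keep theta when simulated and observed summaries are close."""
    s_obs = statistics.fmean(observed)  # summary statistic of the observed data
    accepted = []
    for _ in range(n_draws):
        theta = prior_draw(rng)
        s_sim = statistics.fmean(simulate(theta, len(observed), rng))
        # Accept/reject step: this comparison plays the role of a test's acceptance region.
        if abs(s_sim - s_obs) <= tolerance:
            accepted.append(theta)
    return accepted


rng = random.Random(1)  # fixed seed so that the toy run is reproducible
observed = [rng.gauss(2.0, 1.0) for _ in range(50)]  # pseudo-data with true mean 2

sample = abc_rejection(
    observed,
    prior_draw=lambda r: r.uniform(-5.0, 5.0),  # assumed flat prior on the mean
    simulate=lambda theta, n, r: [r.gauss(theta, 1.0) for _ in range(n)],
    tolerance=0.1,
    n_draws=20000,
    rng=rng,
)
print(f"accepted {len(sample)} draws, ABC posterior mean = {statistics.fmean(sample):.3f}")
```

In this plain version the tolerance is an arbitrary tuning constant, which is precisely the quantity the paper proposes to calibrate.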
Another worry is that the rephrasing of the acceptance distance as the maximal difference for a particular test relies on an elaborate calibration, incl. α, c+, τ+, &tc. (I am not particularly convinced by the calibration in terms of the power of the test being maximised at the point null value. Power?! See below, once again.) When cumulating tests and aiming at a nominal α level, the orthogonality of the test statistics in Theorem 1(iii) is puzzling and I think unrealistic.
The notion of accuracy that is central to the paper and its title corresponds to the power of every test being maximal at the true value of the parameter. And somehow to the ABC approximation being maximised at the true parameter, even though I am lost by then [i.e. around eqn (18)] about the meaning of ρ*… The major result in the paper is however that, under the collection of assumptions made therein, the ABC MLE and MAP versions are equal to their exact counterparts. And that these versions are also unbiased. This Theorem 3 sounds fantastic but makes me uneasy: unbiasedness is a sparse property that is rarely found in statistical problems. Change the parameterisation and you lose unbiasedness. And even the possibility of finding an unbiased estimator. Since this difficulty does not appear in the paper, I would conclude that either the assumptions are quite constraining or the result holds in a weaker sense… (Witness the use of “essentially unbiased” in Fig. 4.)
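A textbook example of unbiasedness being lost under reparameterisation, sketched here under assumptions of my own (normal data, parameter of interest exp(μ)): the sample mean is unbiased for μ, but its exponential is biased for exp(μ), since the expectation of exp(X̄) is exp(μ + σ²/2n) when X̄ is normal with mean μ and variance σ²/n.

```python
import math
import random


def expected_exp_of_mean(mu, sigma, n):
    """Exact E[exp(Xbar)] when Xbar ~ N(mu, sigma^2 / n) (lognormal mean formula)."""
    return math.exp(mu + sigma ** 2 / (2 * n))


def monte_carlo_exp_of_mean(mu, sigma, n, n_rep, rng):
    """Monte Carlo check, drawing the sample mean directly from its normal distribution."""
    sd_of_mean = sigma / math.sqrt(n)
    return sum(math.exp(rng.gauss(mu, sd_of_mean)) for _ in range(n_rep)) / n_rep


mu, sigma, n = 0.0, 1.0, 10          # illustrative values
target = math.exp(mu)                # the reparameterised quantity exp(mu)
exact = expected_exp_of_mean(mu, sigma, n)
approx = monte_carlo_exp_of_mean(mu, sigma, n, n_rep=200_000, rng=random.Random(0))

print(f"target exp(mu)      = {target:.4f}")
print(f"exact E[exp(Xbar)]  = {exact:.4f}")
print(f"Monte Carlo average = {approx:.4f}")
```

The bias vanishes as n grows, but for any finite n the transformed estimator is biased even though the original one is not.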
This may be a wee rude comment (even for a Frenchman) but I also felt the paper could be better written in that notations pop in unannounced. For instance, on page 2, x [the data] becomes x1:n becomes sk1:n. This seems to imply that the summary statistics are observed repeatedly over the true sample. Unless n=1, this does not seem realistic. (I do not understand everything in Example 1, in particular the complaint that the ABC solutions were biased for finite values of n. That sounds like an odd criticism of Bayesian estimators. Now, it seems the paper is very intent on achieving unbiasedness! So maybe it should be called the aAnsBC algorithm for “not-so-Bayes”!) I am also puzzled by the distinction between summary values and summary statistics. This sounds like insisting on having a large enough iid dataset. Or, on page 5, the discussion that the summary parameters are replaced by estimates seems out of context because this adds an additional layer of notation to the existing summary “stuff”… With the additional difficulty that Lemma 1 assumes reparameterisation of the model in terms of those summary parameters. I also object to the point null hypotheses being written in terms of a point estimate, i.e. of a quantity depending on the data x: it sounds like confusing the test [procedure] with the test [problem]. Another example: I read several times Lemma 5 about the calibration of the number of ABC simulations m but cannot fathom what this m is calibrated against. It seems only a certain value of m achieves the accurate correspondence with the genuine posterior, which sounds counter-intuitive. Last counter-example: some pictures seemed to be missing in the Appendix, but as it happened, it is only my tablet being unable to process them! S2 is actually a movie about link functions, really cool!
In conclusion, this is indeed a rich and challenging paper. I am certain I will get a better understanding by listening to Olli’s talk in Roma. And discussing with him.
A few months ago, I received a puzzling advertisement for this book, Randomness through Computation, and I eventually ordered it, despite getting a rather negative impression from reading the chapter written by Tommaso Toffoli… The book as a whole is definitely perplexing (even when correcting for this initial bias) and I would not recommend it to readers interested in simulation, in computational statistics or even in the philosophy of randomness. My overall feeling is indeed that, while there are genuinely informative and innovative chapters in this book, some chapters read more like newspeak than scientific material (mixing the Second Law of Thermodynamics, Gödel’s incompleteness theorem, quantum physics, and NP completeness within the same sentence) and do not provide a useful entry on the issue of randomness. Hence, the book is not contributing in a significant manner to my understanding of the notion. (This post also appeared on the Statistics Forum.)
“The measurement outputs contain at the 99% confidence level 42 new random bits. This is a much stronger statement than passing or not passing statistical tests, which merely indicate that no obvious non-random patterns are present.” arXiv:0911.3427
As often, I bought La Recherche in the station newsagent for the wrong reason! The cover of the December issue was about “God and Science” and I thought this issue would bring some interesting and deep arguments in connection with my math and realism post. The debate is very short, does not go into any depth, reproduces the Hawking quote that started the earlier post, and recycles the same graph about cosmology I used last summer in Vancouver! However, there are alternative interesting entries about probabilistic proof checking in Mathematics and truly random numbers… The first part is on an ACM paper on the PCP theorem by Irit Dinur, but is too terse as is (while the theory behind presumably escapes my abilities!). The second part is about a paper in Nature published by Pironio et al. and arXived as well. It is entitled “Random numbers certified by Bell’s Theorem” and is also one of the laureates of the La Recherche prize this year. I was first annoyed by the French coverage of the paper, mentioning that “a number was random with a probability of 99%” (?!) and that “a sequence of numbers is perfectly random” (re-?!). The original paper is however stating the same thing, hence stressing the different meaning associated with randomness by those physicists, “the unpredictable character of the outcomes” and “universally-composable security”. The above “probability of randomness” is actually a p-value (associated with the null hypothesis that Bell’s inequality is not violated) that is equal to 0.00077. (So the above quote is somehow paradoxical!) The huge apparatus used to produce those random events is not very efficient: on average, 7 binary random numbers are detected per hour… A far cry from the “truly random” generator produced by Intel!
PS: As a coincidence, Julien Cornebise pointed out to me that there is a supplement in the journal about “Le Savoir du Corps” which is in fact handled by the pharmaceutical company Servier, currently under investigation for its drug Mediator… A very annoying breach of basic journalistic ethics in my opinion!