## statistical significance as explained by The Economist

Posted in Books, Statistics, University life on November 7, 2013 by xi'an

There is a long article in this week's The Economist (also making the front cover), which discusses how and why many published research papers have unreproducible and most often “wrong” results. Nothing immensely new there, esp. if you read Andrew’s blog on a regular basis, but the (anonymous) writer(s) take(s) pains to explain how this relates to statistics and in particular to the statistical testing of hypotheses. The above is an illustration from this introduction to statistical tests (and their interpretation).

“First, the statistics, which if perhaps off-putting are quite crucial.”

It is not the first time I have spotted a statistics-backed article in this journal, so I assume it has either journalists with a statistics background or links with (UK?) statisticians. The description of why statistical tests can err is fairly (Type I – Type II) classical. Incidentally, it reports a finding of Ioannidis that, when reporting a positive at level 0.05, the expectation of a false positive rate of one out of 20 is “highly optimistic”. An evaluation opposed to, e.g., Berger and Sellke (1987), who reported a too-early rejection in a large number of cases. More interestingly, the paper stresses that this classical approach ignores “the unlikeliness of the hypothesis being tested”, which I interpret as the prior probability of the hypothesis under test.
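To see why one out of 20 can be “highly optimistic”, here is a minimal arithmetic sketch of how the prior probability of the tested hypothesis enters the picture. The numbers (10% of tested hypotheses being true, a power of 0.8, a level of 0.05) are illustrative assumptions of mine, and the function name is hypothetical.

```python
def false_discovery_proportion(prior_true, power, alpha):
    """Share of 'significant' findings that are false positives.

    prior_true: assumed proportion of tested hypotheses that are real effects
    power:      assumed probability of detecting a real effect
    alpha:      significance level of the test
    """
    true_positives = prior_true * power
    false_positives = (1.0 - prior_true) * alpha
    return false_positives / (true_positives + false_positives)


# Illustrative values only (assumptions, not figures from the article).
fdp = false_discovery_proportion(prior_true=0.10, power=0.80, alpha=0.05)
print(f"proportion of false positives among positives: {fdp:.2f}")
```

Under these assumed values, about 36% of the reported positives are false, far from 5%, and the proportion worsens as the tested hypotheses become less likely a priori or the power drops.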

“Statisticians have ways to deal with such problems. But most scientists are not statisticians.”

The paper also reports on the lack of power in most studies, a report that I find a bit bizarre and even meaningless in its claim to compute an overall power, across all studies and researchers and even fields. Even in a single study, the alternative to “no effect” is composite, hence has a power that depends on the unknown value of the parameter. Seeking a single value for the power requires some prior distribution on the alternative.
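As a minimal sketch of this point, consider a one-sided z-test of “no effect” with known variance (a simplifying assumption of mine): the power is a function of the effect size, and a single number only appears once a (here discrete and entirely hypothetical) prior on the effect is averaged over.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power_one_sided_z(delta, n, alpha=0.05, sigma=1.0):
    """Power of the one-sided z-test of H0: mu = 0 when the true mean is delta."""
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha)
    return 1.0 - STD_NORMAL.cdf(z_crit - delta * n ** 0.5 / sigma)


def average_power(prior, n, alpha=0.05, sigma=1.0):
    """Power averaged over a discrete prior given as (delta, weight) pairs.

    The weights are assumed to sum to one.
    """
    return sum(w * power_one_sided_z(d, n, alpha, sigma) for d, w in prior)


# The power changes with the unknown effect size...
for delta in (0.1, 0.3, 0.5):
    print(delta, round(power_one_sided_z(delta, n=30), 3))

# ...and collapses to one value only under a (hypothetical) prior on delta.
hypothetical_prior = [(0.1, 0.5), (0.3, 0.3), (0.5, 0.2)]
print(round(average_power(hypothetical_prior, n=30), 3))
```

Changing the prior weights changes the “overall power”, which is the reason why a single figure across studies says as much about the implicit prior as about the studies.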

“Peer review’s multiple failings would matter less if science’s self-correction mechanism—replication—was in working order.”

The next part of the paper covers the failings of peer review, which I discussed in the ISBA Bulletin, but it seems to me too easy to blame the referee for failing to spot statistical or experimental errors, when lacking access to the data or to the full experimental methodology and when under pressure to return (for free) a report within a short time window. The best that can be expected is that a referee detects the implausibility of a claim or an obvious methodological or statistical mistake. These are not math papers! And, as pointed out repeatedly, not all referees are statistically numerate…

“Budding scientists must be taught technical skills, including statistics.”

The last part discusses possible solutions to achieve reproducibility and hence higher confidence in experimental results. Paying for independent replication is the proposed solution, but it can obviously only apply to a small fraction of all published results. And having control bodies test labs and teams at random following a major publication seems rather unrealistic, if only because of the difficulty of filling such bodies with able controllers… An interesting if pessimistic debate, in fine. And fit for the International Year of Statistics.

## informative hypotheses (book review)

Posted in Books, R, Statistics on September 19, 2013 by xi'an

The title of this book, Informative Hypotheses, somehow put me off from the start: the author, Herbert Hoijtink, seems to distinguish between informative and uninformative (deformative? disinformative?) hypotheses. Namely, something like

H0: μ1=μ2=μ3=μ4

is “very informative” and unrealistic, and the alternative Ha is completely uninformative, while the “alternative null”

H1: μ1<μ2=μ3<μ4

is informative. (Hence the < signs on the cover. One of my book review idiosyncrasies is to find hidden meaning behind the cover design…) The idea is thus to have the researcher give some input in the construction of the null hypothesis (as if hypothesis tests usually were not about questions that mattered…).

In fact, this distinction put me off so much that I only ended up reading chapters 1 (an introduction), 3 (an introduction [to the Bayesian processing of such hypotheses]) and 10 (on Bayesian foundations of testing informative hypotheses). Hence the very biased review of Informative Hypotheses that follows…

Given an existing (but out of print?) reference like Robertson, Wright and Dykstra (1988), which I particularly enjoyed when working on isotonic regression in the mid 90’s, I do not see much added value in the present book. The important references are mostly centred on works by the author and his co-authors or students (often Unpublished or In Press), which gives me the impression the book was hurriedly gathered from those papers.

“The Bayes factor (…) is default, objective, based on an appropriate quantification of complexity.” (p.197)

The first chapter of Informative Hypotheses is a motivation for the study of those informative hypotheses, with a focus on ANOVA models. There is not much in the chapter that explains what is so special about those ordering (null) hypotheses and why a whole book is required to cover their processing. A noteworthy specificity of the approach, nonetheless, is that point null hypotheses seem to be replaced with “about equality constraints” (p.9), |μ2−μ3|<d, where d is specified by the researcher as significant. This chapter also gives illustrations of ordered (or informative) hypotheses in the settings of analysis of covariance (ANCOVA) and regression models, but does not indicate (yet) how to run the tests. The concluding section is about the epistemological focus of the book, quoting Popper, Sober and Carnap, although I do not see much support in those quotes.

“Objective means that Bayes factors based on this prior distribution are essentially independent of this prior distribution.” (p.53)

Chapter 3 starts the introduction to Bayesian statistics with the strange idea of calling the likelihood the “density of the data”. It is indeed the probability density of the model evaluated at the data but… it conveys a confusing meaning since it is not a density when plotted against the parameters (as in Figure 1, p. 44, where, incidentally, the exact probability model is not specified). The prior distribution is defined as a normal × inverse chi-square distribution on the vector of the means (in the ANOVA model) and the common variance. Due to the classification of the variance as a nuisance parameter, the author can get away with putting an improper prior on this parameter (p.46). The normal prior is chosen to be “neutral”, i.e. to give the same prior weight to the null and the alternative hypotheses. This seems logical at some initial level, but constructing such a prior for convoluted hypotheses may simply be impossible… Because the null hypothesis has a positive mass (maybe .5) under the “unconstrained prior” (p.48), the author can also get away with projecting this prior onto the constrained space of the null hypothesis. Even when setting the prior variance to ∞ (p.50). The Bayes factor is then the ratio of the (posterior and prior) normalising constants over the constrained parameter space. The book still mentions the Lindley-Bartlett paradox (p.60) in the case of the about equality hypotheses. The appendix to this chapter mentions the issue of improper priors and the need for accommodating infinite mass with training samples, providing a minimum training sample solution using mixtures that sounds fairly ad hoc to me.
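To make the ratio of normalising constants concrete, here is a rough Monte Carlo sketch. It is not the book's implementation: I assume a known common variance, independent N(0, τ²) priors on four means, the strict ordering μ1<μ2<μ3<μ4 as the constraint, and function names of my own choosing. The Bayes factor of the constrained against the unconstrained model is then the posterior proportion of draws satisfying the constraint divided by the prior proportion.

```python
import random


def proportion_ordered(draw_means, n_draws, rng):
    """Fraction of simulated mean vectors satisfying mu1 < mu2 < ... < muk."""
    hits = 0
    for _ in range(n_draws):
        mu = draw_means(rng)
        if all(a < b for a, b in zip(mu, mu[1:])):
            hits += 1
    return hits / n_draws


def bayes_factor_ordering(group_means, n_per_group, sigma=1.0, tau=10.0,
                          n_draws=50_000, seed=1):
    """Estimate f / c for the ordering constraint (known sigma is an assumption)."""
    rng = random.Random(seed)
    # Conjugate normal update for each mean, with prior N(0, tau^2).
    shrink = tau ** 2 / (tau ** 2 + sigma ** 2 / n_per_group)
    post_sd = (shrink * sigma ** 2 / n_per_group) ** 0.5

    def draw_prior(r):
        return [r.gauss(0.0, tau) for _ in group_means]

    def draw_posterior(r):
        return [r.gauss(shrink * m, post_sd) for m in group_means]

    c = proportion_ordered(draw_prior, n_draws, rng)      # prior mass of the constraint
    f = proportion_ordered(draw_posterior, n_draws, rng)  # posterior mass of the constraint
    return f / c, f, c


# Hypothetical group means, chosen to agree with the ordering.
bf, f, c = bayes_factor_ordering([0.0, 1.0, 2.0, 3.0], n_per_group=20)
print(f"f = {f:.3f}, c = {c:.3f}, Bayes factor = {bf:.1f}")
```

With an exchangeable prior, c is close to 1/4! = 1/24 whatever the value of τ, which is presumably why the prior variance can be sent to infinity without harm in this ordered case, and the Bayes factor can never exceed 1/c.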

“Bayes factors for the evaluation of informative hypotheses have a simple form.” (p. 193)

Chapter 10 is the final chapter of Informative Hypotheses, on “Foundations of Bayesian evaluation of informative hypotheses”, and I was expecting a more in-depth analysis of those special hypotheses, but it is mostly a repetition of what is found in Chapter 3, the wider generality never being exploited to a useful depth. There is also this gem quoted above that, because Bayes factors are the ratio of two (normalising) constants, fm/cm, they have a “simple form”. The reference to Carlin and Chib (1995) for computing other cases then sounds pretty obscure. (Another tiny gem is that I spotted the R software contingency spelled in three different ways.) The book mentions the Savage-Dickey representation of the Bayes factor, but I could not spot the connection from the few lines (p.193) dedicated to this ratio. More generally, I do not find the generality of this chapter particularly convincing, most of it replicating the notions found in Chapter 3, like the use of posterior priors. The numerical approximation of Bayes factors is proposed via simulation from the unconstrained prior and posterior (p.207), then via a stepwise decomposition of the Bayes factor (p.208) and a Gibbs sampler that relies on inverse cdf sampling.
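For readers unfamiliar with the last step, here is a small sketch of what inverse cdf sampling within a Gibbs sampler may look like under an ordering constraint. This is my own illustration, not the book's code: I assume independent normal posteriors with a common standard deviation on the means, so that each full conditional is a normal truncated to the interval between its neighbours.

```python
import random
from statistics import NormalDist


def truncated_normal_inverse_cdf(mean, sd, lower, upper, rng):
    """Draw from N(mean, sd^2) restricted to (lower, upper) by inverting the cdf.

    Numerically fragile when the interval sits far in a tail of the normal,
    in which case another method would be needed.
    """
    dist = NormalDist(mean, sd)
    p_low, p_high = dist.cdf(lower), dist.cdf(upper)
    u = rng.uniform(p_low, p_high)
    # inv_cdf requires 0 < u < 1, so keep u away from the endpoints.
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    return dist.inv_cdf(u)


def gibbs_sweep_ordered(mu, post_means, post_sd, rng):
    """One Gibbs sweep keeping mu[0] < mu[1] < ... (assumed setting, see above)."""
    mu = list(mu)
    for j in range(len(mu)):
        lower = mu[j - 1] if j > 0 else float("-inf")
        upper = mu[j + 1] if j < len(mu) - 1 else float("inf")
        mu[j] = truncated_normal_inverse_cdf(post_means[j], post_sd, lower, upper, rng)
    return mu


rng = random.Random(42)
state = [0.0, 1.0, 2.0, 3.0]  # hypothetical starting point satisfying the ordering
for _ in range(1000):
    state = gibbs_sweep_ordered(state, post_means=[0.2, 0.1, 0.4, 0.3], post_sd=0.5, rng=rng)
print(state)
```

Each update costs one uniform draw and one inverse cdf evaluation, with no rejection step, which is the appeal of the approach when the constraint has a small probability under the unconstrained distribution.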

Overall, I feel that this book came out too early, without a proper basis and dissemination of the ideas of the author: to wit, a large number of references are connected to the author, some In Press, others Unpublished (which leads to a rather abstract “see Hoijtink (Unpublished) for a related theorem” (p.195)). From my incomplete reading, I did not gather a sense of novel perspective but rather of a topic that seemed too narrow for a whole book.

## Statistics, with interactions

Posted in pictures, Statistics, Travel, University life on June 6, 2013 by xi'an

Yesterday, I also had a short discussion with Paul Minh, who presented a talk on a general regenerative device for MCMC algorithms, using a bound on the target density rather than on the Markov transition in order to achieve easier regeneration. While a neat idea, this method requires the construction of a lower bound that can easily be simulated. Furthermore, if the regeneration probability is low, the mixing speed may remain similar to that of the original MCMC sampler, as the method resorts to a standard MCMC step on the remaining part of the target density.

## who’s afraid of the big B wolf?

Posted in Books, Statistics, University life on March 13, 2013 by xi'an

Aris Spanos just published a paper entitled “Who should be afraid of the Jeffreys-Lindley paradox?” in the journal Philosophy of Science. This piece is a continuation of the debate about frequentist versus likelihoodist versus Bayesian (should it be Bayesianist?! or Laplacist?!) testing approaches, exposed in Mayo and Spanos’ Error and Inference, and discussed in several posts of the ‘Og. I started reading the paper in conjunction with a paper I am currently writing for a special volume in honour of Dennis Lindley, a paper that I will discuss later on the ‘Og…

“…the postdata severity evaluation (…) addresses the key problem with Fisherian p-values in the sense that the severity evaluation provides the “magnitude” of the warranted discrepancy from the null by taking into account the generic capacity of the test (that includes n) in question as it relates to the observed data”(p.88)

First, the antagonistic style of the paper reminds me of Spanos’ previous works in that it relies on repeated value judgements (such as “Bayesian charge”, “blatant misinterpretation”, “Bayesian allegations that have undermined the credibility of frequentist statistics”, “both approaches are far from immune to fallacious interpretations”, “only crude rules of thumbs”, &tc.) and rhetorical sleights of hand. (See, e.g., “In contrast, the severity account ensures learning from data by employing trustworthy evidence (…), the reliability of evidence being calibrated in terms of the relevant error probabilities” [my stress].) Connectedly, Spanos often resorts to an unusual [at least for statisticians] vocabulary that amounts to newspeak. Here are some illustrations: “summoning the generic capacity of the test”, “substantively significant”, “custom tailoring the generic capacity of the test”, “the fallacy of acceptance”, “the relevance of the generic capacity of the particular test”. Yes, the term “generic capacity” occurs there with a truly high frequency.

## reading classics (#10 and #10bis)

Posted in Books, Statistics, University life on February 28, 2013 by xi'an

Today’s classics seminar was rather special as two students were scheduled to talk. It was even more special as both students had picked (without informing me) the very same article by Berger and Sellke (1987), Testing a point-null hypothesis: the irreconcilability of p-values and evidence, on the (deep?) discrepancies between frequentist p-values and Bayesian posterior probabilities. In connection with the Lindley-Jeffreys paradox. Here are Amira Mziou’s slides:

and Jiahuan Li’s slides:

for comparison.

It was a good exercise to listen to both talks, seeing two perspectives on the same paper, and I hope the students in the class got the idea(s) behind the paper. As you can see, there were obviously repetitions between the talks, including the presentation of the lower bounds for all classes considered by Jim Berger and Tom Sellke, and the overall motivation for the comparison. Maybe as a consequence of my criticisms on the previous talk, both Amira and Jiahuan put some stress on the definitions to formally define the background of the paper. (I love the poetic line: “To prevent having a non-Bayesian reality”, although I am not sure what Amira meant by this…)

I like the connection made therein with the Lindley-Jeffreys paradox since this is the core idea behind the paper. And because I am currently writing a note about the paradox. Obviously, it was hard for the students to take a more remote stand on the reason for the comparison, from questioning the relevance of testing point null hypotheses and of comparing the numerical values of a p-value with a posterior probability, to expecting asymptotic agreement between a p-value and a Bayes factor when both are convergent quantities, to setting the same weight on both hypotheses, to the ad-hocquery of using a drift on one to equate the p-value with the Bayes factor, to using specific priors like Jeffreys’s (which has the nice feature that it corresponds to g=n in the g-prior, as discussed in the new edition of Bayesian Core). The students also failed to remark on the fact that the developments were only for real parameters, as the phenomenon (that the lower bound on the posterior probabilities is larger than the p-value) does not happen so universally in larger dimensions. I would have expected more discussion from the floor, but we still got good questions and comments on a) why 0.05 matters and b) why comparing p-values and posterior probabilities is relevant. The next paper to be discussed will be Tukey’s piece on the future of statistics.
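As a reminder of the size of the discrepancy, here is a short sketch of the most extreme of those lower bounds, namely the one over all prior distributions on the alternative, in the two-sided normal test of a point null with known variance. In that case the Bayes factor in favour of the null is bounded from below by exp(−z²/2), where z is the observed standardised statistic. The function name and the default of equal prior weights are my own choices.

```python
from math import exp
from statistics import NormalDist


def lower_bound_posterior_null(p_value, prior_null=0.5):
    """Lower bound on P(H0 | x) over all priors on the alternative.

    Setting: two-sided test of a normal mean with known variance, where the
    Bayes factor in favour of H0 is at least exp(-z^2 / 2).
    """
    z = NormalDist().inv_cdf(1.0 - p_value / 2.0)  # |z| matching the two-sided p-value
    bayes_factor_bound = exp(-0.5 * z * z)
    prior_odds_against = (1.0 - prior_null) / prior_null
    return 1.0 / (1.0 + prior_odds_against / bayes_factor_bound)


for p in (0.10, 0.05, 0.01):
    print(p, round(lower_bound_posterior_null(p), 3))
```

For a p-value of 0.05 the bound is about 0.128, so even the prior most favourable to the alternative leaves the null with a posterior probability more than twice the p-value. This covers the one-dimensional case only, in line with the caveat above about larger dimensions.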

## Error and Inference [arXived]

Posted in Books, Statistics, University life on November 29, 2011 by xi'an

Following my never-ending series of posts on the book Error and Inference, (edited) by Deborah Mayo and Aris Spanos (and kindly sent to me by Deborah), I decided to edit those posts into a (slightly) more coherent document, now posted on arXiv. And to submit it as a book review to SIAM Review, even though I did not have high expectations that it would fit the purpose of the journal: the review was rejected between the submission to arXiv and the publication of this post!

## the cult of significance

Posted in Books, Statistics, University life on October 18, 2011 by xi'an

“Statistical significance is not a scientific test. It is a philosophical, qualitative test. It asks “whether”. Existence, the question of whether, is interesting. But it is not scientific.” S. Ziliak and D. McCloskey, p.5

The book, written by economists Stephen Ziliak and Deirdre McCloskey, has a theme bound to attract Bayesians and all those puzzled by the absolute and automatised faith in significance tests. The main argument of the authors is indeed that an overwhelming majority of papers stop at rejecting variables (“coefficients”) on the sole and unsupported basis of non-significance at the 5% level. Hence the subtitle “How the standard error costs us jobs, justice, and lives”… This is an argument I completely agree with. However, the aggressive style of the book truly put me off! As with Error and Inference, which also addresses a non-Bayesian issue, I could have let the matter go, but I feel the book may in the end be counter-productive and thus endeavour to explain why through this review. (I wrote the following review in batches, before and during my trip to Dublin, so the going is rather broken, I am afraid…)