Archive for p-values

no publication without confirmation

Posted in Books, Statistics, University life on March 15, 2017 by xi'an

“Our proposal is a new type of paper for animal studies (…) that incorporates an independent, statistically rigorous confirmation of a researcher’s central hypothesis.” (p.409)

A comment tribune in Nature of Feb 23, 2017, suggests running clinical trials in three stages towards meeting higher standards in statistical validation. The idea is to impose a preclinical trial run by an independent team following initial research showing some potential for some new treatment. The three stages are thus (i) to generate hypotheses; (ii) to test hypotheses; (iii) to test broader application of hypotheses (p.410). While I am skeptical of the chances of this proposal reaching adoption (for various reasons, like, what would the incentive of the second team be [of the B team be?!], especially if the hypothesis is disproved, how would both teams share the authorship and presumably patenting rights of the final study?, and how could independence be certain were the B team contracted by the A team?), the statistical arguments put forward in the tribune are rather weak (in my opinion). Repeating experiments with a larger sample size and a hypothesis set a priori rather than cherry-picked is obviously positive, but moving from a p-value boundary of 0.05 to one of 0.01 and to a power of 80% is more a cosmetic than a foundational change, as Andrew and I pointed out in our PNAS discussion of Johnson two years ago.

“the earlier experiments would not need to be held to the same rigid standards.” (p.410)

The article contains a vignette on “the maths of predictive value” that makes intuitive sense but only superficially. First, “the positive predictive value is the probability that a positive result is truly positive” (p.411), a statement that implies a probability distribution on the space of hypotheses, although I see no Bayesian hint throughout the paper. Second, this (ersatz of a) probability is computed as the ratio of the number of positive results under the hypothesis to the total number of positive results. Which does not make much sense outside a Bayesian framework and even then cannot be assessed experimentally or by simulation without defining a distribution of the output under both hypotheses. Simplistic pictures like the one above are not necessarily meaningful. And Nature should certainly invest in a statistical editor!
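To make the dependence on a prior explicit, here is a minimal sketch of the usual predictive-value arithmetic. The function name `positive_predictive_value` and the argument `prior_true` (an assumed proportion of tested hypotheses that are actually true) are illustrative labels of mine, not the tribune's, and the values plugged in are arbitrary. The point is only that the ratio cannot be computed unless such a proportion is supplied.

```python
def positive_predictive_value(alpha, power, prior_true):
    """Share of positive results that are true positives.

    alpha      -- false positive rate of the test (significance level)
    power      -- probability of a positive result when the effect is real
    prior_true -- ASSUMED proportion of tested hypotheses that are true,
                  i.e. the prior weight the vignette leaves implicit
    """
    true_pos = power * prior_true
    false_pos = alpha * (1.0 - prior_true)
    return true_pos / (true_pos + false_pos)


# Arbitrary illustrative values: two significance levels, two prior proportions.
for alpha in (0.05, 0.01):
    for prior_true in (0.1, 0.5):
        ppv = positive_predictive_value(alpha, 0.8, prior_true)
        print(f"alpha={alpha}, power=0.8, prior={prior_true}: PPV={ppv:.3f}")
```

Changing `prior_true` moves the answer at least as much as changing `alpha` does, which is one way of seeing why the 0.05 versus 0.01 debate is secondary.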

contemporary issues in hypothesis testing

Posted in Statistics on September 26, 2016 by xi'an

This week [at Warwick], among other things, I attended the CRiSM workshop on hypothesis testing, giving the same talk as at ISBA last June. There was a most interesting and unusual talk by Nick Chater (from Warwick) about the psychological aspects of hypothesis testing, namely about the unnatural features of a hypothesis in everyday life, i.e., how far this formalism stands from human psychological functioning. Or what we know about it. And then my Warwick colleague Tom Nichols explained how his recent work on permutation tests for fMRIs, published in PNAS, testing hypotheses on real data that should be null and getting a high rate of false positives, got the medical imaging community all up in arms due to over-simplified reports in the media questioning the validity of 15 years of research on fMRI and the related 40,000 papers! For instance, some of the headlines questioned the entire research in the area. Or transformed a software bug missing the boundary effects into a major flaw. (See this podcast on Not So Standard Deviations for a thoughtful discussion on the issue.) One conclusion of this story is to be wary of assertions when submitting a hot story to journals with a substantial non-scientific readership! The afternoon talks were equally exciting, with Andrew explaining to us live from New York why he hates hypothesis testing and prefers model building. With the birthday model as an example. And David Draper gave an encompassing talk about the distinctions between inference and decision, proposing a Jaynes information criterion and illustrating it on Mendel’s historical [and massaged!] pea dataset. The next morning, Jim Berger gave an overview on the frequentist properties of the Bayes factor, with in particular a novel [to me] upper bound on the Bayes factor associated with a p-value (Sellke, Bayarri and Berger, 2001)

$$B^{10}(p) \le \frac{1}{-e\,p\,\log p}$$

with the specificity that B¹⁰(p) is not testing the original hypothesis [problem] but a substitute where the null is the hypothesis that p is uniformly distributed, versus a non-parametric alternative that p is more concentrated near zero. This reminded me of our PNAS paper on the impact of summary statistics upon Bayes factors. And of some forgotten reference studying Bayesian inference based solely on the p-value… It is too bad I had to rush back to Paris, as this made me miss the last talks of this fantastic workshop centred on maybe the most important aspect of statistics!
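For a sense of scale, the bound is easy to tabulate. This is a minimal sketch: the function name `bayes_factor_upper_bound` is mine, and the restriction to 0 < p < 1/e is the range over which the Sellke, Bayarri and Berger calibration is stated.

```python
import math


def bayes_factor_upper_bound(p):
    """Upper bound 1 / (-e p log p) on the Bayes factor B10 attached to a p-value p.

    The calibration only applies for 0 < p < 1/e.
    """
    if not 0.0 < p < 1.0 / math.e:
        raise ValueError("the bound requires 0 < p < 1/e")
    return 1.0 / (-math.e * p * math.log(p))


for p in (0.05, 0.01, 0.005, 0.001):
    print(f"p = {p}: B10 <= {bayes_factor_upper_bound(p):.2f}")
```

At p = 0.05 the bound is about 2.5, and at p = 0.01 about 8, so a "significant" p-value never translates into overwhelming evidence against this substitute null.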

Validity and the foundations of statistical inference

Posted in Statistics on July 29, 2016 by xi'an

Natesh pointed out to me this recent arXival with a somewhat grandiose abstract:

In this paper, we argue that the primary goal of the foundations of statistics is to provide data analysts with a set of guiding principles that are guaranteed to lead to valid statistical inference. This leads to two new questions: “what is valid statistical inference?” and “do existing methods achieve this?” Towards answering these questions, this paper makes three contributions. First, we express statistical inference as a process of converting observations into degrees of belief, and we give a clear mathematical definition of what it means for statistical inference to be valid. Second, we evaluate existing approaches Bayesian and frequentist approaches relative to this definition and conclude that, in general, these fail to provide valid statistical inference. This motivates a new way of thinking, and our third contribution is a demonstration that the inferential model framework meets the proposed criteria for valid and prior-free statistical inference, thereby solving perhaps the most important unsolved problem in statistics.

Since solving the “most important unsolved problem in statistics” sounds worth pursuing, I went and checked the paper’s contents.

“To us, the primary goal of the foundations of statistics is to provide a set of guiding principles that, if followed, will guarantee validity of the resulting inference. Our motivation for writing this paper is to be clear about what is meant by valid inference and to provide the necessary principles to help data analysts achieve validity.”

Which can be interpreted in so many ways that it is somewhat meaningless…

“…if real subjective prior information is available, we recommend using it. However, there is an expanding collection of work (e.g., machine learning, etc) that takes the perspective that no real prior information is available. Even a large part of the literature claiming to be Bayesian has abandoned the interpretation of the prior as a serious part of the model, opting for “default” prior that “works.” Our choice to omit a prior from the model is not for the (misleading) purpose of being “objective”—subjectivity is necessary—but, rather, for the purpose of exploring what can be done in cases where a fully satisfactory prior is not available, to see what improvements can be made over the status quo.”

This is a pretty traditional criticism of the Bayesian approach, namely that if a “true” prior is provided (by whom?) then it is optimal to use it. But this amounts to turning the prior into another piece of the sampling distribution and is not in my opinion a Bayesian argument! Most of the criticisms in the paper are directed at objective Bayes approaches, with the surprising conclusion that, because there exist cases where no matching prior is available, “the objective Bayesian approach [cannot] be considered as a general framework for scientific inference.” (p.9)

Another section argues that a Bayesian modelling cannot describe a state of total ignorance. This is formally correct, which is why there is no such thing as a non-informative or the non-informative prior, as often discussed here, but is this truly relevant, in that the inference problem contains one way or another information about the parameter, for instance through a loss function or a pseudo-likelihood?

“This is a desirable property that most existing methods lack.”

The proposal central to the paper’s thesis is to replace posterior probabilities by belief functions b(.|X), called statistical inference, that are interpreted as measures of evidence about subsets A of the parameter space. If not necessarily as probabilities. This is not very novel, witness the works of Dempster, Shafer and subsequent researchers. And not very much used outside Bayesian and fiducial statistics because of the mostly impossible task of defining a function over all subsets of the parameter space. Because of the subjectivity of such “beliefs”, they will be “valid” only if they are well-calibrated in the sense of b(A|X) being sub-uniform, that is, more concentrated near zero than a uniform variate (i.e., small) under the alternative, i.e. when θ is not in A. At this stage, since this is a mix of a minimax and proper coverage condition, my interest started to quickly wane… Especially because the sub-uniformity condition is highly demanding, if leading to controls over the Type I error and the frequentist coverage. As often, I wonder at the meaning of a calibration property obtained over all realisations of the random variable and all values of the parameter. So for me stability is neither “desirable” nor “essential”. Overall, I have increasing difficulties in perceiving proper coverage as a relevant property. Which has no stronger or weaker meaning than the coverage derived from a Bayesian construction.
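Written out, the sub-uniformity requirement described above reads roughly as follows. This is my own transcription of the condition as I understand it, not a quotation from the paper, and the symbols Θ for the parameter space and α for the level are notation I am adding:

```latex
% Validity as sub-uniformity: when the assertion A is false (\theta \notin A),
% the belief b(A \mid X) must be stochastically no larger than a uniform variate.
\[
  \sup_{\theta \notin A} \;
  \mathbb{P}_{\theta}\bigl\{\, b(A \mid X) \ge 1 - \alpha \,\bigr\} \;\le\; \alpha
  \qquad \text{for all } \alpha \in (0,1) \text{ and all } A \subseteq \Theta .
\]
```

The supremum over all θ outside A and the requirement for every subset A are what make the condition both minimax in flavour and highly demanding.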

“…frequentism does not provide any guidance for selecting a particular rule or procedure.”

I agree with this assessment, which means that there is no such thing as frequentist inference, but rather a philosophy for assessing procedures. That the Gleser-Hwang paradox invalidates this philosophy sounds a bit excessive, however. Especially when the bounded nature of Bayesian credible intervals is also analysed as a failure. A more relevant criticism is the lack of directives for picking procedures.

“…we are the first to recognize that the belief function’s properties are necessary in order for the inferential output to satisfy the required validity property”

The construction of the “inferential model” proposed by the authors offers similarities with fiducial inference, in that it builds upon the representation of the observable X as X=a(θ,U). With further constraints on the function a() to ensure the validity condition holds… An interesting point is that the functional connection X=a(θ,U) means that the nature of U changes once X is observed, albeit in a delicate manner outside a Bayesian framework. When illustrated on the Gleser-Hwang paradox, the resolution proceeds from an arbitrary choice of a one-dimensional summary, though. (As I am reading the paper, I realise it builds on other and earlier papers by the authors, papers that I cannot read for lack of time. I must have listened to a talk by one of the authors last year at JSM as this rings a bell. Somewhat.) In conclusion of a quick Sunday afternoon read, I am not convinced by the arguments in the paper and even less by the impression of a remaining arbitrariness in setting the resulting procedure.

not an ASA’s statement on p-values

Posted in Books, Kids, Statistics, University life on March 18, 2016 by xi'an


This may be a coincidence, but a few days after the ASA statement got published, Yuri Gurevich and Vladimir Vovk arXived a note on the Fundamentals of p-values. Which actually does not contribute to the debate. The paper is written in a Q&A manner. And defines a sort of peculiar logic related to [some] p-values. A second and more general paper is in the making, which may shed more light on the potential appeal of this formalism…

ASA’s statement on p-values [#2]

Posted in Books, Kids, Statistics, University life on March 9, 2016 by xi'an


It took a visit to FiveThirtyEight to realise the ASA statement I mentioned yesterday was followed by individual entries from most members of the panel, much more diverse and deeper than the statement itself! Without discussing each and all comments, here are some points I subscribe to:

  • it does not make sense to try to replace the p-value and the 5% boundary by something else but of the same nature. This was the main line of our criticism of Valen Johnson’s PNAS paper with Andrew.
  • nor does it make sense to try to come up with a hard-set answer about whether or not a certain parameter satisfies a certain constraint. A comparison of predictive performances at or around the observed data sounds much more sensible, if less definitive.
  • the Bayes factor is often advanced as a viable alternative to the p-value in those comments, but it suffers from difficulties exposed in our recent testing by mixture paper, one being the lack of absolute scale.
  • we seem unable to escape the landscape set by Neyman and Pearson when constructing their testing formalism, including the highly unrealistic 0-1 loss function. And the grossly asymmetric opposition between null and alternative hypotheses.
  • the behaviour of any procedure of choice should be evaluated under different scenarios, most likely by simulation, including some accounting for misspecified models. Which may require an extra bit of non-parametrics. And we should abstain from going further than evaluating whether or not the data looks compatible with each of the scenarios. Or how much so, through the mixture representation.
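As an illustration of the last point, here is a minimal simulation sketch in plain Python. Everything in it is a hypothetical choice of mine: the procedure (a two-sided z-test of a zero mean that assumes unit variance), the sample size, the three scenarios and the function name `rejection_rate`. It only shows the kind of check meant above, namely running one procedure under a well-specified scenario and under misspecified ones.

```python
import math
import random
from statistics import NormalDist


def rejection_rate(draw, n=20, n_rep=2000, level=0.05, seed=1):
    """Monte Carlo frequency with which a two-sided z-test of 'mean = 0'
    (ASSUMING unit variance) rejects, when observations come from draw(rng)."""
    rng = random.Random(seed)
    cutoff = NormalDist().inv_cdf(1.0 - level / 2.0)
    rejections = 0
    for _ in range(n_rep):
        mean = sum(draw(rng) for _ in range(n)) / n
        if abs(math.sqrt(n) * mean) > cutoff:
            rejections += 1
    return rejections / n_rep


# Hypothetical scenarios, all with true mean zero, so every rejection is an error.
scenarios = {
    "normal, variance 1 (well specified)": lambda rng: rng.gauss(0.0, 1.0),
    "centred exponential (skewed, variance 1)": lambda rng: rng.expovariate(1.0) - 1.0,
    "normal, sd 1.5 (variance misspecified)": lambda rng: rng.gauss(0.0, 1.5),
}

for name, draw in scenarios.items():
    print(f"{name}: rejection rate = {rejection_rate(draw):.3f}")
```

The rejection rate should sit near the nominal 5% in the first scenario and well above it in the last, where the assumed variance is wrong.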

ASA’s statement on p-values

Posted in Books, Statistics, University life on March 8, 2016 by xi'an


Last night I received an email from the ASA signed by Jessica Utts and Ron Wasserstein with the following sentence

“Widespread use of ‘statistical significance’ (generally interpreted as ‘p

In short, we envision a new era, in which the broad scientific community recognizes what statisticians have been advocating for many years. In this “post p

Is such an era beyond reach? We think not, but we need your help in making sure this opportunity is not lost.”

which is obviously missing important bits. The email was pointing out a free access American Statistician article warning about the misuses and over-interpretations of p-values. Which contains rather basic “principles” that p-values are not probabilities that the null is true, that there is no golden level against which to compare the p-value, that nominal p-values may be far from actual p-values, that they do not provide a measure of evidence per se, &tc. As written in the conclusion, “Nothing in the ASA statement is new”. But, besides calling for caution and the cumulative use of different assessments of evidence, this statement may leave the non-statistician completely nonplussed about how to proceed when testing hypotheses or comparing models. And make the decision of Basic and Applied Social Psychology of rejecting all arguments based on p-values sound sensible.

Incidentally, the article contains the completion of the first sentence [in red below], if not of the second:

“Widespread use of ‘statistical significance’ (generally interpreted as ‘p ≤ 0.05’) as a license for making a claim of a scientific finding (or implied truth) leads to considerable distortion of the scientific process.”


It’s the selection’s fault not the p-values’… [seminar]

Posted in pictures, Statistics, University life on February 5, 2016 by xi'an

Yoav Benjamini will give a seminar talk in Paris next Monday on the above (full title: “The replicability crisis in science: It’s the selection’s fault not the p-values’”). (That I will miss for being in Warwick at the time.) With a fairly terse abstract:

I shall discuss the problem of lack of replicability of results in science, and point at selective inference as a statistical root cause. I shall then present a few strategies for addressing selective inference, and their application in genomics, brain research and earlier phases of clinical trials where both primary and secondary endpoints are being used.

Details: February 8, 2016, 16h, Université Pierre & Marie Curie, campus Jussieu, salle 15-16-101.