Archive for statistical significance

ASA’s statement on p-values [#2]

Posted in Books, Kids, Statistics, University life on March 9, 2016 by xi'an


It took a visit to FiveThirtyEight to realise that the ASA statement I mentioned yesterday was followed by individual entries from most members of the panel, much more diverse and deeper than the statement itself! Without discussing each and every comment, here are some points I subscribe to:

  • it does not make sense to try to replace the p-value and the 5% boundary by something else but of the same nature. This was the main line of our criticism of Valen Johnson’s PNAS paper with Andrew.
  • nor does it make sense to try to come up with a hard-set answer about whether or not a certain parameter satisfies a certain constraint. A comparison of predictive performances at or around the observed data sounds much more sensible, if less definitive.
  • the Bayes factor is often advanced as a viable alternative to the p-value in those comments, but it suffers from difficulties exposed in our recent testing by mixture paper, one being the lack of absolute scale.
  • we seem unable to escape the landscape set by Neyman and Pearson when constructing their testing formalism, including the highly unrealistic 0-1 loss function. And the grossly asymmetric opposition between null and alternative hypotheses.
  • the behaviour of any procedure of choice should be evaluated under different scenarios, most likely by simulation, including some accounting for misspecified models. Which may require an extra bit of non-parametrics. And we should abstain from going further than evaluating whether or not the data looks compatible with each of the scenarios. Or how much, through the mixture representation.
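
To make the last point concrete, here is a minimal Python sketch of what evaluating a procedure by simulation could look like. It is not taken from any of the panel comments: the procedure (a two-sided test of a zero mean, using the normal critical value 1.96 as an approximation), the three scenarios, the sample size and the number of replications are all assumptions chosen for illustration.

```python
import math
import random


def rejects(sample, crit=1.96):
    """Two-sided test of H0: mean = 0, with a normal approximation to the t statistic."""
    n = len(sample)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    return abs(mean / math.sqrt(var / n)) > crit


def rejection_rate(draw, n=50, reps=2000, seed=1):
    """Fraction of simulated datasets, each of size n, on which the test rejects."""
    rng = random.Random(seed)
    hits = sum(rejects([draw(rng) for _ in range(n)]) for _ in range(reps))
    return hits / reps


# Hypothetical scenarios: a well-specified null, a null with skewed
# (misspecified) data whose mean is still zero, and a shifted alternative.
scenarios = {
    "null, normal data": lambda rng: rng.gauss(0.0, 1.0),
    "null, skewed data (misspecified)": lambda rng: rng.expovariate(1.0) - 1.0,
    "alternative, mean 0.5": lambda rng: rng.gauss(0.5, 1.0),
}

rates = {name: rejection_rate(draw) for name, draw in scenarios.items()}
for name, rate in rates.items():
    print(f"{name}: rejection rate {rate:.3f}")
```

Comparing the rejection rates across scenarios shows how far the nominal 5% level can be trusted when the model is wrong, which is all this sketch is meant to convey.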

ASA’s statement on p-values

Posted in Books, Statistics, University life on March 8, 2016 by xi'an


Last night I received an email from the ASA signed by Jessica Utts and Ron Wasserstein with the following sentence

“Widespread use of ‘statistical significance’ (generally interpreted as ‘p

In short, we envision a new era, in which the broad scientific community recognizes what statisticians have been advocating for many years. In this “post p

Is such an era beyond reach? We think not, but we need your help in making sure this opportunity is not lost.”

which is obviously missing important bits. The email was pointing out a free access American Statistician article warning about the misuses and over-interpretations of p-values. Which contains rather basic “principles” that p-values are not probabilities that the null is true, that there is no golden level against which to compare the p-value, that nominal p-values may be far from actual p-values, that they do not provide a measure of evidence per se, &tc. As written in the conclusion, “Nothing in the ASA statement is new”. But, besides calling for caution and the cumulative use of different assessments of evidence, this statement may leave the non-statistician completely nonplussed about how to proceed when testing hypotheses or comparing models. And make the decision of Basic and Applied Social Psychology of rejecting all arguments based on p-values sound sensible.

Incidentally, the article contains the completion of the first sentence [in red below], if not of the second:

“Widespread use of ‘statistical significance’ (generally interpreted as ‘p ≤ 0.05’) as a license for making a claim of a scientific finding (or implied truth) leads to considerable distortion of the scientific process.”


should I run less?!

Posted in Running, Statistics on February 10, 2015 by xi'an

A study [re]published three days ago in both The New York Times and the BBC The Guardian reproduced the conclusion of an article in the Journal of the American College of Cardiology that strenuous and long-distance jogging (or more appropriately running) could have a negative impact on longevity! And that the best pace is around 8km/h, just above a brisk walk! Quite depressing… However, this was quickly followed by other articles, including this one in The New York Times, pointing out the lack of statistical validation in the study and the ridiculously small number of runners in the study. I am already feeling better (and ready for my long run tomorrow morning!), but appalled all the same by the lack of standards of journals publishing statistically void studies. I know, nothing new there…

independent component analysis and p-values

Posted in pictures, Running, Statistics, Travel, University life on September 8, 2014 by xi'an

Last morning at the neuroscience workshop Jean-François Cardoso presented independent component analysis through a highly pedagogical and enjoyable tutorial that stressed the geometric meaning of the approach, summarised by the notion that the (ICA) decomposition


X = AS

of the data X seeks both independence between the columns of S and non-Gaussianity. That is, getting as far away from Gaussianity as possible. The geometric bits came from looking at the Kullback-Leibler decomposition of the log likelihood

-\mathbb{E}[\log L(\theta|X)] = KL(P,Q_\theta) + \mathfrak{E}(P)

where the expectation is computed under the true distribution P of the data X. And Qθ is the hypothesised distribution. A fine property of this decomposition is a statistical version of Pythagoras’ theorem, namely that when the family of Qθ’s is an exponential family, the Kullback-Leibler distance decomposes into

KL(P,Q_\theta) = KL(P,Q_{\theta^0}) + KL(Q_{\theta^0},Q_\theta)

where θ⁰ is the expected maximum likelihood estimator of θ. (We also noticed this possibility of a decomposition in our Kullback-projection variable-selection paper with Jérôme Dupuis.) The talk by Aapo Hyvärinen this morning was related to Jean-François’ in that it used ICA all the way to a three-level representation, if oriented towards natural vision modelling, in connection with his book and the paper on unnormalised models recently discussed on the ‘Og.
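
The Pythagorean identity above can be checked numerically on a toy case. The sketch below is not from the talk: it assumes a Binomial(n, θ) family for Qθ (an exponential family), an arbitrary non-binomial distribution P on {0,…,4}, and takes θ⁰ as the value maximising the expected log-likelihood under P, which for the binomial reduces to matching the mean.

```python
import math


def kl(p, q):
    """Kullback-Leibler divergence KL(p, q) between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def binomial_pmf(n, theta):
    """Probabilities of 0..n successes under Binomial(n, theta)."""
    return [math.comb(n, k) * theta**k * (1 - theta) ** (n - k) for k in range(n + 1)]


n = 4
# Arbitrary "true" distribution P on {0,...,4}; deliberately not binomial.
p = [0.10, 0.30, 0.15, 0.05, 0.40]

# theta0 maximises E_P[log Q_theta(X)], i.e. it matches the mean of P.
theta0 = sum(k * pk for k, pk in enumerate(p)) / n
theta = 0.3  # any other member of the family

lhs = kl(p, binomial_pmf(n, theta))
rhs = kl(p, binomial_pmf(n, theta0)) + kl(binomial_pmf(n, theta0), binomial_pmf(n, theta))
print(f"KL(P, Q_theta)                          = {lhs:.10f}")
print(f"KL(P, Q_theta0) + KL(Q_theta0, Q_theta) = {rhs:.10f}")
```

The two printed values agree up to floating-point error for any choice of θ in (0,1), and stop agreeing if θ⁰ is replaced by a value that does not match the mean of P.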

In the afternoon, Eric-Jan Wagenmakers [who persistently and rationally fights the (ab)use of p-values and who frequently figures on Andrew’s blog] gave a warning tutorial talk about the dangers of trusting p-values and going fishing for significance in existing studies, much in the spirit of Andrew’s blog (except for the defence of Bayes factors). Arguing in favour of preregistration. The talk was full of illustrations from psychology. And included the line that ESP testing is the jester of academia, meaning that testing for whatever form of ESP should be encouraged as a way to check testing procedures. If a procedure finds a significant departure from the null in this setting, there is something wrong with it! I was then reminded that Eric-Jan was one of the authors having analysed Bem’s controversial (!) paper on the “anomalous processes of information or energy transfer that are currently unexplained in terms of known physical or biological mechanisms”… (And of the shocking talk by Jessica Utts on the same topic I attended in Australia two years ago.)
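
The danger of fishing for significance can be illustrated with a small simulation in the same spirit as the ESP check: generate data for which the null is true by construction and see how often a "significant" result turns up. This is a hypothetical sketch, not material from the talk; it assumes independent outcomes, a z-test with known unit variance, and arbitrary choices of sample size and number of replications.

```python
import math
import random
from statistics import NormalDist


def two_sided_p(sample):
    """Two-sided z-test p-value for H0: mean = 0, with known unit variance."""
    z = sum(sample) / math.sqrt(len(sample))
    return 2 * NormalDist().cdf(-abs(z))


def fishing_rate(n_outcomes, n=20, reps=1000, alpha=0.05, seed=2014):
    """Share of null studies in which the smallest of n_outcomes p-values is below alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        pvals = [
            two_sided_p([rng.gauss(0.0, 1.0) for _ in range(n)])
            for _ in range(n_outcomes)
        ]
        hits += min(pvals) < alpha
    return hits / reps


fishing = {k: fishing_rate(k) for k in (1, 5, 20)}
for k, rate in fishing.items():
    # For independent tests the theoretical rate is 1 - (1 - alpha)^k.
    print(f"{k:2d} outcomes: simulated {rate:.3f}, theory {1 - 0.95**k:.3f}")
```

With a single pre-specified outcome the false positive rate stays near the nominal level, while reporting the best of several outcomes inflates it quickly, which is one argument for preregistration.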

MCMSki IV [day 3]

Posted in Mountains, pictures, R, Statistics, Travel, University life on January 9, 2014 by xi'an

Already on the final day..! And still this frustration in being unable to attend three sessions at once… Andrew Gelman started the day with a non-computational talk that broached themes that are familiar to readers of his blog, on the misuse of significance tests and on recommendations for better practice. I then picked the Scaling and optimisation of MCMC algorithms session organised by Gareth Roberts, with optimal scaling talks by Tony Lelièvre, Alex Théry and Chris Sherlock, while Jochen Voss spoke about the convergence rate of ABC, a paper I already discussed on the blog. A fairly exciting session showing that MCMC’ory (name of a workshop I ran in Paris in the late 90’s!) is still alive and well!

After the break (sadly without the ski race!), the software round-table session was something I was looking forward to. The four software packages covered by this round-table were BUGS, JAGS, STAN, and BiiPS, each presented according to the same pattern. I would have liked to see a “battle of the bands”, illustrating pros & cons for each language on a couple of models & datasets. STAN got the unofficial prize for cool tee-shirts (we should have asked the STAN team for poster prize tee-shirts). And I had to skip the final session for a flu-related doctor appointment…

I called for a BayesComp meeting at 7:30, hoping for current and future members to show up and discuss the format of the future MCMSki meetings, maybe even proposing new locations on other “sides of the Italian Alps”! But (workshop fatigue syndrome?!), no-one showed up. So anyone interested in discussing this issue is welcome to contact me or David van Dyk, the new BayesComp program chair.

Statistical evidence for revised standards

Posted in Statistics, University life on December 30, 2013 by xi'an

In yet another permutation of the original title (!), Andrew Gelman posted the answer Val Johnson sent him after our (submitted) letter to PNAS. As Val did not send me a copy (although Andrew did!), I will not reproduce it here and I would rather refer the interested readers to Andrew’s blog… In addition to Andrew’s (sensible) points, here are a few idle (post-X’mas and pre-skiing) reflections:

  • “evidence against a false null hypothesis accrues exponentially fast” makes me wonder in which metric this exponential rate (in γ?) occurs;
  • that “most decision-theoretic analyses of the optimal threshold to use for declaring a significant finding would lead to evidence thresholds that are substantially greater than 5 (and probably also greater 25)” is difficult to accept as an argument since there is no trace of a decision-theoretic argument in the whole paper;
  • Val rejects our minimaxity argument on the basis that “[UMPBTs] do not involve minimization of maximum loss” but the prior that corresponds to those tests is minimising the integrated probability of not rejecting at threshold level γ, a loss function integrated against parameter and observation, a Bayes risk in other words… Point masses or spike priors are clearly characteristics of minimax priors. Furthermore, the additional argument that “in most applications, however, a unique loss function/prior distribution combination does not exist” has been used by many to refute the Bayesian perspective and makes me wonder what are the arguments left in using a (pseudo-)Bayesian approach;
  • the next paragraph is pure tautology: the fact that “no other test, based on either a subjectively or objectively specified alternative hypothesis, is as likely to produce a Bayes factor that exceeds the specified evidence threshold” is a paraphrase of the definition of UMPBTs, not an argument. I do not see why we should solely “worry about false negatives”, since minimising those should lead to a point mass on the null (or, more seriously, should not lead to the minimax-like selection of the prior under the alternative).

Shravan’s comments on “Valen in Le Monde” [guest post]

Posted in Books, Statistics, University life on November 22, 2013 by xi'an

[Those are comments sent yesterday by Shravan Vasishth in connection with my post. Since they are rather lengthy, I made them into a post. Shravan is also the author of The foundations of Statistics and we got in touch through my review of the book. I may address some of his points later, but, for now, I find the perspective of a psycholinguist quite interesting to hear.]

Christian, is the problem for you that the p-value, however low, is only going to tell you the probability of your data (roughly speaking) assuming the null is true, and is not going to tell you anything about the probability of the alternative hypothesis, which is the real hypothesis of interest?

However, limiting the discussion to (Bayesian) hierarchical models (linear mixed models), which is the type of model people often fit in repeated measures studies in psychology (or at least in psycholinguistics), as long as the problem is about figuring out P(θ>0) or P(θ<0), the decision (to act as if θ>0) is going to be the same regardless of whether one uses p-values or a fully Bayesian approach. This is because the likelihood is going to dominate in the Bayesian model.
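
The claim that the likelihood dominates can be illustrated in a much simpler setting than a hierarchical model. The following sketch is an assumption-laden stand-in, not Shravan's analysis: a normal mean θ with known standard deviation, a N(0, τ²) prior, and an observed sample mean sitting two standard errors above zero. It compares the one-sided p-value with the posterior probability P(θ ≤ 0 | data) as the sample size grows.

```python
import math
from statistics import NormalDist

std_normal = NormalDist()


def one_sided_p(xbar, n, sigma=1.0):
    """One-sided p-value for H0: theta <= 0, given the sample mean xbar."""
    return std_normal.cdf(-xbar * math.sqrt(n) / sigma)


def posterior_prob_nonpositive(xbar, n, sigma=1.0, tau=1.0):
    """P(theta <= 0 | data) under a N(0, tau^2) prior and a normal likelihood."""
    precision = n / sigma**2 + 1 / tau**2
    post_mean = (n * xbar / sigma**2) / precision
    post_sd = math.sqrt(1 / precision)
    return std_normal.cdf(-post_mean / post_sd)


z = 2.0  # observed mean fixed at two standard errors above zero
gaps = {}
for n in (10, 100, 1000):
    xbar = z / math.sqrt(n)
    p_value = one_sided_p(xbar, n)
    post = posterior_prob_nonpositive(xbar, n)
    gaps[n] = abs(post - p_value)
    print(f"n={n:4d}  p-value={p_value:.5f}  P(theta<=0 | data)={post:.5f}")
```

In this toy model the two numbers coincide exactly under a flat prior and get closer as n grows under the informative prior, so a decision rule based on either would agree for large samples. Whether the same holds in a given hierarchical model depends on how much data informs each parameter.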

Andrew has objected to this line of reasoning by saying that making a decision like θ>0 is not a reasonable one in the first place. That is true in some cases, where the result of one experiment never replicates because of study effects or whatever. But there are a lot of effects which are robust and replicable, and where it makes sense to ask these types of questions.

One central issue for me is: in situations like these, using a low p-value to make such a decision is going to yield pretty similar outcomes compared to doing inference using the posterior distribution. The machinery needed to do a fully Bayesian analysis is very intimidating; you need to know a lot, and you need to do a lot more coding and checking than when you fit an lmer type of model.

It took me 1.5 to 2 years of hard work (=evenings spent not reading novels) to get to the point that I knew roughly what I was doing when fitting Bayesian models. I don’t blame anyone for not wanting to put their life on hold to get to such a point. I find the Bayesian method attractive because it actually answers the question I really asked, namely is θ>0 or θ<0? This is really great, I don’t have to beat around the bush any more! (there; I just used an exclamation mark). But for the researcher unwilling (or more likely: unable) to invest the time into the maths and probability theory and the world of BUGS, the distance between a heuristic like a low p-value and the more sensible Bayesian approach is not that large.