Archive for testing of hypotheses

severe testing: beyond the Statistics wars?!

Posted in Books, pictures, Statistics, Travel, University life on January 7, 2019 by xi'an

A timely start to my reading of Deborah Mayo's [properly printed] Statistical Inference as Severe Testing (How to get beyond the Statistics Wars) on Armistice Day, as it seems to call for just this, an armistice! And the opportunity of a long flight to Oaxaca in addition… However, this was only the start and it took me several further weeks to peruse the book (SIST) seriously enough before writing the (light) comments below. (I received a free copy from CUP and then a second one directly from Deborah after I mentioned the severe sabotage!)

Indeed, I sort of expected a different content when taking the subtitle How to get beyond the Statistics Wars at face value. But, on the contrary, the book is actually very severely attacking anything not in line with the Cox-Mayo severe-testing approach. Mostly Bayesian approach(es) to the issue! For instance, Jim Berger's construction of a reconciliation between Fisher, Neyman, and Jeffreys is surgically deconstructed over five pages and exposed as a Bayesian ploy. Similarly, the warnings from Dennis Lindley and other Bayesians that the p-values attached to the Higgs boson experiment are not probabilities that the particle does not exist are met with ridicule. (Another go at Jim's Objective Bayes credentials is found in the squared myth of objectivity chapter. Maybe more strongly than against staunch subjectivists like Jay Kadane. And yet another go when criticising the Berger and Sellke 1987 lower bound results. Which even extends to Valen Johnson's UMP-type Bayesian tests.)

“Inference should provide posterior probabilities, final degrees of support, belief, probability (…) not provided by Bayes factors.” (p.443)

Another subtitle of the book could have been testing in Flatland, given the limited scope of the models considered, with one or at best two parameters and almost always a Normal setting. I have no idea whatsoever how the severity principle would apply in more complex models, with e.g. numerous nuisance parameters. By sticking to the simplest possible models, the book can carry on with the optimality concepts of the early days, like sufficiency (p.147), monotonicity, and uniformly most powerful procedures, which only make sense in a tiny universe.

“The estimate is really a hypothesis about the value of the parameter.  The same data warrant the hypothesis constructed!” (p.92)

There is an entire section on the lack of difference between confidence intervals and the dual acceptance regions, although the lack of uniqueness in defining either of them should come as a bother. Especially outside Flatland. Actually the following section, from p.193 onward, reminds me of fiducial arguments, all the more because Schweder and Hjort are cited there. (With a curve like Fig. 3.3 operating like a cdf on the parameter μ but no dominating measure!)

“The Fisher-Neyman dispute is pathological: there’s no disinterring the truth of the matter (…) Fisher grew to renounce performance goals he himself had held when it was found that fiducial solutions disagreed with them.” (p.390)

Similarly the chapter on the myth of “the myth of objectivity” (p.221) is mostly and predictably targeting Bayesian arguments. The dismissal of Frank Lad’s arguments for subjectivity ends up [or down] with the rather cheap shot that it “may actually reflect their inability to do the math” (p.228). [CoI: I once enjoyed a fantastic dinner cooked by Frank in Christchurch!] And the dismissal of loss function requirements in Ziliak and McCloskey is similarly terse, if reminding me of Aris Spanos’ own arguments against decision theory. (And of the arguments about the Jeffreys-Lindley paradox as well.)

“It’s not clear how much of the current Bayesian revolution is obviously Bayesian.” (p.405)

The section (Tour IV) on model uncertainty (or against “all models are wrong”) is somewhat limited in that it is unclear what constitutes an adequate (if wrong) model. And calling for the CLT cavalry as backup (p.299) is not particularly convincing.

It is not that everything is controversial in SIST (!) and I found agreement in many (isolated) statements. Especially in the early chapters. Another interesting point made in the book is to question whether or not the likelihood principle makes sense at all within a testing setting. When two models (rather than a point null hypothesis) are X-examined, it is a rare occurrence that the likelihood factorises any further than the invariance by permutation of iid observations. Which reminded me of our earlier warning on the dangers of running ABC for model choice based on (model specific) sufficient statistics. Plus a nice sprinkling of historical anecdotes, esp. about Neyman’s life, from Poland, to Britain, to California, with some time in Paris to attend Borel’s and Lebesgue’s lectures. Which is used as a background for a play involving Bertrand, Borel, Neyman and (Egon) Pearson, under the title “Les Miserables Citations” [pardon my French but it should be Les Misérables if Hugo is involved! Or maybe les gilets jaunes…]. I also enjoyed the sections on reuniting Neyman-Pearson with Fisher, while appreciating that Deborah Mayo wants to stay away from the “minefields” of fiducial inference. With, most interestingly, Neyman himself trying in 1956 to convince Fisher of the fallacy of the duality between frequentist and fiducial statements (p.390). Wisely quoting Nancy Reid at BFF4 stating the unclear state of affairs on confidence distributions. And the final pages reawakened an impression I had at an earlier stage of the book, namely that the ABC interpretation of Bayesian inference in Rubin (1984) could come closer to Deborah Mayo’s quest for comparative inference (p.441) than she thinks, in that producing parameter values that generate pseudo-observations agreeing with the actual observations is an “ability to test accordance with a single model or hypothesis”.

“Although most Bayesians these days disavow classic subjective Bayesian foundations, even the most hard-nosed, “we’re not squishy” Bayesians retain the view that a prior distribution is an important if not the best way to bring in background information.” (p.413)

A special mention to Einstein’s café (p.156), which reminded me of this picture of Einstein’s relative café I took while staying in Melbourne in 2016… (Not to be confused with the Markov bar in the same city.) And a fairly minor concern that I find myself quoted in the sections priors: a gallimaufry (!) and… Bad faith Bayesianism (!!), with the above qualification. Even though I later reappear as a pragmatic Bayesian (p.428), if a priori as a counter-example!

severe testing or severe sabotage? [not a book review]

Posted in Books, pictures, Statistics, University life on October 16, 2018 by xi'an

Last week, I received this new book by Deborah Mayo, which I was looking forward to reading and annotating!, but thrice alas, the book had been sabotaged: except for the preface and acknowledgements, the entire book is printed upside down [a minor issue since the entire book is concerned] and with part of the text cut off on each side [a few letters each time, but enough to make reading a chore!]. I am thus waiting for a tested copy of the book to start reading it in earnest!

 

relativity is the keyword

Posted in Books, Statistics, University life on February 1, 2017 by xi'an

[St John's College, Oxford, Feb. 23, 2012]

As I was teaching my introduction to Bayesian Statistics this morning, ending up with the chapter on tests of hypotheses, I found myself reflecting [out loud] on the relative nature of posterior quantities. Just like when I introduced the role of priors in Bayesian analysis the day before, I stressed the relativity of quantities coming out of the BBB [Big Bayesian Black Box], namely that whatever comes out of a Bayesian procedure is to be understood, scaled, and relativised against the prior equivalent, i.e., that the reference measure or gauge is the prior. This is sort of obvious, clearly, but bringing the argument forward from the start avoids all sorts of misunderstanding and disagreement, in that it excludes the claims of absoluteness and certainty that may come with the production of a posterior distribution. It also removes the endless debate about the determination of the prior, by making each prior a reference of its own. With an additional possibility of calibration by simulation under the assumed model. Or an alternative. Again nothing new there, but I got rather excited by this presentation choice, as it seems to clarify the path to Bayesian modelling and avoid misapprehensions.
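
As a minimal illustration of what such a calibration by simulation could look like, here is a sketch under a toy conjugate Normal-Normal model (all settings purely illustrative, not taken from the lecture): parameters are drawn from the prior, data from the model, and the posterior quantile of the "true" parameter should then be uniform, so that credible intervals reach their nominal coverage relative to the prior-cum-model gauge.

```python
import numpy as np
from scipy import stats

# Illustrative calibration check under an assumed Normal-Normal model:
# draw theta from the prior, data from the model, and record where the
# "true" theta falls within its own posterior.  If prior, model and
# posterior computation are coherent, these quantiles are U(0,1).
rng = np.random.default_rng(0)
tau, sigma, n, n_rep = 2.0, 1.0, 10, 10_000

quantiles = np.empty(n_rep)
for r in range(n_rep):
    theta = rng.normal(0.0, tau)                    # prior draw
    xbar = rng.normal(theta, sigma / np.sqrt(n))    # data under the model
    post_var = 1.0 / (1.0 / tau**2 + n / sigma**2)  # conjugate posterior
    post_mean = post_var * (n / sigma**2) * xbar
    quantiles[r] = stats.norm.cdf(theta, post_mean, np.sqrt(post_var))

# Nominal 90% credible intervals should cover close to 90% of the time
coverage = np.mean((quantiles > 0.05) & (quantiles < 0.95))
print(f"coverage of 90% credible intervals: {coverage:.3f}")
```

Swapping the prior used to simulate theta for an alternative reference distribution is then just another choice of gauge.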

Further, the curious case of the Bayes factor (or of the posterior probability) could possibly be resolved most satisfactorily in this framework, as the [dreaded] dependence on the model prior probabilities then becomes a matter of relativity! Those posterior probabilities depend directly and almost linearly on the prior probabilities, but they should not be interpreted in an absolute sense as the ultimate and unique probability of the hypothesis (which anyway does not mean anything in terms of the observed experiment). In other words, this posterior probability does not need to be scaled against a U(0,1) distribution. Or against the p-value if anyone wishes to do so. By the end of the lecture, I was even wondering [not so loudly] whether or not this perspective allowed for a resolution of the Lindley-Jeffreys paradox, as the resulting number could be set relative to the choice of the [arbitrary] normalising constant.
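
To make the almost-linear dependence concrete, here is a tiny numerical sketch (the Bayes factor value is purely illustrative): for a fixed Bayes factor B01 in favour of the null, the posterior probability of H0 follows mechanically from its prior probability p0, and is roughly proportional to p0 when p0 is small.

```python
# Posterior probability of H0 as a function of its prior probability p0,
# for a fixed Bayes factor B01 = m0(x)/m1(x).  The value of B01 below is
# made up for illustration only.
def posterior_prob_H0(p0, B01):
    # P(H0 | x) = p0 m0 / (p0 m0 + (1 - p0) m1) = 1 / (1 + (1 - p0)/(p0 B01))
    return 1.0 / (1.0 + (1.0 - p0) / (p0 * B01))

B01 = 3.0   # hypothetical evidence "in favour" of H0
for p0 in (0.1, 0.25, 0.5, 0.75, 0.9):
    print(f"p0 = {p0:.2f} -> P(H0 | x) = {posterior_prob_H0(p0, B01):.3f}")
```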

not an ASA’s statement on p-values

Posted in Books, Kids, Statistics, University life on March 18, 2016 by xi'an

 

This may be a coincidence, but a few days after the ASA statement got published, Yuri Gurevich and Vladimir Vovk arXived a note on the Fundamentals of p-values. Which actually does not contribute to the debate. The paper is written in a Q&A manner. And defines a sort of peculiar logic related to [some] p-values. A second and more general paper is in the making, which may shed more light on the potential appeal of this formalism…

ASA’s statement on p-values [#2]

Posted in Books, Kids, Statistics, University life on March 9, 2016 by xi'an

 

It took a visit to FiveThirtyEight to realise the ASA statement I mentioned yesterday was followed by individual entries from most members of the panel, much more diverse and deeper than the statement itself! Without discussing each and every comment, here are some points I subscribe to:

  • it does not make sense to try to replace the p-value and the 5% boundary by something else of the very same nature. This was the main line of our criticism, with Andrew, of Valen Johnson’s PNAS paper.
  • it does not make sense either to try to come up with a hard-and-fast answer about whether or not a certain parameter satisfies a certain constraint. A comparison of predictive performances at or around the observed data sounds much more sensible, if less definitive.
  • the Bayes factor is often advanced as a viable alternative to the p-value in those comments, but it suffers from difficulties exposed in our recent testing-by-mixture paper, one being the lack of an absolute scale.
  • we seem unable to escape the landscape set by Neyman and Pearson when constructing their testing formalism, including the highly unrealistic 0-1 loss function. And the grossly asymmetric opposition between null and alternative hypotheses.
  • the behaviour of any procedure of choice should be evaluated under different scenarios, most likely by simulation, including some accounting for misspecified models. Which may require an extra bit of non-parametrics. And we should abstain from going further than evaluating whether or not the data look compatible with each of the scenarios. Or how much so, through the mixture representation.

ASA’s statement on p-values

Posted in Books, Statistics, University life on March 8, 2016 by xi'an

 

Last night I received an email from the ASA signed by Jessica Utts and Ron Wasserstein with the following sentence

“Widespread use of ‘statistical significance’ (generally interpreted as ‘p

In short, we envision a new era, in which the broad scientific community recognizes what statisticians have been advocating for many years. In this “post p

Is such an era beyond reach? We think not, but we need your help in making sure this opportunity is not lost.”

which is obviously missing important bits. The email was pointing out a free-access American Statistician article warning about the misuses and over-interpretations of p-values. Which contains rather basic “principles”: that p-values are not probabilities that the null is true, that there is no golden level against which to compare the p-value, that nominal p-values may be far from actual p-values, that they do not provide a measure of evidence per se, &tc. As written in the conclusion, “Nothing in the ASA statement is new”. But, besides calling for caution and the cumulative use of different assessments of evidence, this statement may leave the non-statistician completely nonplussed about how to proceed when testing hypotheses or comparing models. And make the decision of Basic and Applied Social Psychology to reject all arguments based on p-values sound sensible.

Incidentally, the article contains the completion of the first sentence [in red below], if not of the second:

“Widespread use of ‘statistical significance’ (generally interpreted as ‘p ≤ 0.05’) as a license for making a claim of a scientific finding (or implied truth) leads to considerable distortion of the scientific process.”

 

statistical significance as explained by The Economist

Posted in Books, Statistics, University life on November 7, 2013 by xi'an

There is a long article in The Economist this week (also making the front cover), which discusses how and why many published research papers have unreproducible and most often “wrong” results. Nothing immensely new there, esp. if you read Andrew’s blog on a regular basis, but the (anonymous) writer(s) take(s) pains to explain how this relates to statistics and in particular statistical testing of hypotheses. The above is an illustration from this introduction to statistical tests (and their interpretation).

“First, the statistics, which if perhaps off-putting are quite crucial.”

It is not the first time I spot a statistics-backed article in this journal and so I assume it has either journalists with a statistics background or links with (UK?) statisticians. The description of why statistical tests can err is fairly (Type I / Type II) classical. Incidentally, it reports a finding of Ioannidis that, when reporting a positive at level 0.05, the expectation of a false positive rate of one out of twenty is “highly optimistic”. An evaluation opposed to, e.g., Berger and Sellke (1987), who reported too-early rejections in a large number of cases. More interestingly, the paper stresses that this classical approach ignores “the unlikeliness of the hypothesis being tested”, which I interpret as the prior probability of the hypothesis under test.
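
To see why one-in-twenty can indeed be optimistic, here is a back-of-the-envelope, Ioannidis-style computation (all figures made up for illustration): the share of false positives among reported positives depends on the prior plausibility of the tested effects, not just on the 5% level.

```python
# Illustrative computation: among *reported positives*, the proportion of
# false positives depends on how plausible the tested effects are a priori.
# alpha, power and the prior probabilities below are all made up.
alpha, power = 0.05, 0.80
for prior_true in (0.5, 0.1, 0.01):
    true_pos = power * prior_true             # real effects correctly flagged
    false_pos = alpha * (1.0 - prior_true)    # null effects wrongly flagged
    fdr = false_pos / (true_pos + false_pos)
    print(f"P(effect real) = {prior_true:4.2f} -> "
          f"false positives among positives = {fdr:.2f}")
```

With only one tested hypothesis in ten being a real effect, well over a third of the “positives” are false, far from one in twenty.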

“Statisticians have ways to deal with such problems. But most scientists are not statisticians.”

The paper also reports on the lack of power in most studies, a report I find a bit bizarre and even meaningless in its ability to compute an overall power, across studies and researchers and even fields. Even in a single study, the alternative to “no effect” is composite, hence has a power that depends on the unknown value of the parameter. Seeking a single value for the power requires some prior distribution on the alternative.
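
For illustration, the following sketch (a one-sided z-test with made-up settings) shows how the power varies with the unknown effect size, and how a single “power of the study” only emerges after averaging against some arbitrary distribution on the alternative.

```python
import numpy as np
from scipy import stats

# Power of a one-sided z-test of H0: mu = 0 vs mu > 0 at level 5%, for n = 30
# observations with known unit variance.  The power is a function of the
# unknown mu; a single number only appears after averaging against some
# distribution on the alternative (here an arbitrary N(0.3, 0.2^2)).
n, alpha = 30, 0.05
z_crit = stats.norm.ppf(1 - alpha)

def power(mu):
    return 1 - stats.norm.cdf(z_crit - mu * np.sqrt(n))

for mu in (0.1, 0.3, 0.5):
    print(f"power at mu = {mu}: {power(mu):.2f}")

rng = np.random.default_rng(1)
mus = rng.normal(0.3, 0.2, size=100_000)
print(f"prior-averaged power: {power(mus).mean():.2f}")
```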

“Peer review’s multiple failings would matter less if science’s self-correction mechanism—replication—was in working order.”

The next part of the paper covers the failings of peer review, which I discussed in the ISBA Bulletin, but it seems to me too easy to blame the referee for failing to spot statistical or experimental errors, when lacking access to the data or to the full experimental methodology and when under pressure to return (for free) a report within a short time window. The best that can be expected is that a referee detects the implausibility of a claim or an obvious methodological or statistical mistake. These are not math papers! And, as pointed out repeatedly, not all referees are statistically numerate…

“Budding scientists must be taught technical skills, including statistics.”

The last part discusses possible solutions to achieve reproducibility and hence higher confidence in experimental results. Paying for independent replication is the proposed solution, but it can obviously only apply to a small fraction of all published results. And having control bodies testing labs and teams at random following a major publication seems rather unrealistic, if only because of the difficulty of staffing such bodies with able controllers… An interesting if pessimistic debate, in fine. And fit for the International Year of Statistics.