Archive for uniformly most powerful tests

Bayesian spectacles

Posted in Books, pictures, Statistics, University life on October 4, 2017 by xi'an

E.J. Wagenmakers and his enthusiastic team of collaborators at the University of Amsterdam and in the JASP software design team have started a blog called Bayesian spectacles, which I find a fantastic title. And not only because I wear glasses. Plus, they got their own illustrator, Viktor Beekman, which sounds like the epitome of sophistication! (Compared with resorting to vacation or cat pictures…)

In a recent post they addressed the criticisms we made of the 72-author paper on p-values, one of the co-authors being E.J.! Andrew already re-addressed some of the address, but here is a disagreement he left me to chew on my own [and where the Abandoners are us!]:

Disagreement 2. The Abandoners critique the UMPBTs – the uniformly most powerful Bayesian tests – that feature in the original paper. This is their right (see also the discussion of the 2013 Valen Johnson PNAS paper), but they ignore the fact that the original paper presented a series of other procedures that all point to the same conclusion: p-just-below-.05 results are evidentially weak. For instance, a cartoon on the JASP blog explains the Vovk-Sellke bound. A similar result is obtained using the upper bounds discussed in Berger & Sellke (1987) and Edwards, Lindman, & Savage (1963). We suspect that the Abandoners' dislike of Bayes factors (and perhaps their upper bounds) is driven by a disdain for the point-null hypothesis. That is understandable, but the two critiques should not be mixed up. The first question is: Given that we wish to test a point-null hypothesis, do the Bayes factor upper bounds demonstrate that the evidence is weak for p-just-below-.05 results? We believe they do, and in this series of blog posts we have provided concrete demonstrations.
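As an aside, the Vovk-Sellke bound invoked above is a one-liner: for a p-value below 1/e, the Bayes factor in favour of the alternative cannot exceed 1/(−e·p·log p), whatever the decreasing (beta) prior placed on the p-value under the alternative. A minimal sketch (the function name being mine):

```python
import math

def vovk_sellke_bound(p):
    """Maximal Bayes factor in favour of H1 for a given p-value,
    maximised over decreasing beta priors on p under the alternative.
    Only informative for p < 1/e; beyond that the bound is 1."""
    return 1.0 / (-math.e * p * math.log(p)) if p < 1.0 / math.e else 1.0

for p in (0.05, 0.01, 0.005, 0.001):
    print(f"p = {p:<6}: Bayes factor at most {vovk_sellke_bound(p):.2f}")
# p = 0.05 caps the odds against the null at about 2.5 to 1
```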

Obviously, this reply calls for an examination of the entire BS blog series, but, being short of time at the moment, let me point out that the lower bounds on the Bayes factors showing much more support for H⁰ than a p-value at 0.05 only occur in special circumstances, even though I spend some time discussing those bounds in my book. Indeed, the [interesting] fact that the lower bounds are larger than the p-values does not hold in full generality. Moving to a two-dimensional normal with potentially zero mean is enough to see the order between lower bound and p-value reverse, as I found [quite] a while ago when trying to extend Berger and Sellke (1987, the same year I was visiting Purdue, where both authors held positions). I am not sure this feature has been much explored in the literature; I did not pursue it further once I realised the gap went the other way in larger dimensions… I must also point out I do not have the same repulsion for point nulls as Andrew! While considering whether a parameter, say a mean, is exactly zero [or three or whatever] sounds rather absurd when faced with the strata of uncertainty about models, data, procedures, &tc. [even in theoretical physics!], comparing several [and all wrong!] models with or without some parameters for later use still makes sense. And my reluctance to use Bayes factors does not stem from an opposition to comparing models or from the procedure itself, which is quite appealing within a Bayesian framework [thus appealing per se!], but rather from the unfortunate impact of the prior [and its tail behaviour] on the quantity, and from the delicate calibration of the thing. And from the lack of a reference solution [to avoid the O and the N words!]. As exposed in the demise papers. (The main version of which remains in publishing limbo, the onslaught from the referees proving just too much for me!)
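For a quick check of that reversal in the simplest setting (my own crude numerical sketch, not from the 1987 paper): with x ~ N_d(θ, I_d) and H⁰: θ = 0, the marginal likelihood under the alternative is at most the likelihood at its mode θ = x, so the Bayes factor B⁰¹ is bounded below by exp(−‖x‖²/2), to be set against the χ²_d p-value attached to ‖x‖²:

```python
import math
from scipy.stats import chi2

def bf01_lower_bound(t):
    """Lower bound on B01 over ALL priors under the alternative:
    the marginal under H1 is at most the likelihood at its mode
    theta = x, hence B01 >= exp(-t/2) with t = ||x||^2."""
    return math.exp(-t / 2.0)

for d in (1, 2, 3, 5, 10):
    t = chi2.ppf(0.95, df=d)    # ||x||^2 sitting at the 5% boundary
    print(f"d = {d:2d}: p-value = {chi2.sf(t, df=d):.3f}, "
          f"lower bound on B01 = {bf01_lower_bound(t):.3f}")
# d = 1: bound (0.147) well above the 0.05 p-value, as in Berger & Sellke;
# d = 2: exact tie, the chi2_2 survival function being exp(-t/2);
# d >= 3: the order reverses and keeps deteriorating with the dimension
```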

abandon all o(p) ye who enter here

Posted in Books, Statistics, University life on September 28, 2017 by xi'an

Today appeared on arXiv a joint paper by Blakeley McShane, David Gal, Andrew Gelman, Jennifer Tackett, and myself, towards the abandonment of significance tests, which is a response to the 72-author paper in Nature Human Behaviour that recently made the news (and comments on the 'Og). Some of these comments have been incorporated in the paper, along with others more on the psychology testing side: from the irrelevance of point null hypotheses, to the numerous incentives for multiple comparisons, to the lack of sufficiency of the p-value itself, to the limited applicability of the uniformly most powerful Bayesian test principle…

“…each [proposal] is a purely statistical measure that fails to take a more holistic view of the evidence that includes the consideration of the traditionally neglected factors, that is, prior and related evidence, plausibility of mechanism, study design and data quality, real world costs and benefits, novelty of finding, and other factors that vary by research domain.”

One may wonder about this list of grievances and its impact on statistical practice. The paper however suggests two alternatives, one being to investigate the potential impact of (neglected) factors rather than relying on thresholds. Another one, maybe less realistic, unless it is the very same, is to report the entirety of the data associated with the experiment. This makes the life of journal editors and grant evaluators harder, possibly much harder, but it indeed suggests a holistic and continuous approach to data analysis, rather than the masquerade of binary outputs. (Not surprisingly, posting this item of news on Andrew's blog a few hours ago generated a large amount of discussion.)

Statistical evidence for revised standards

Posted in Statistics, University life on December 30, 2013 by xi'an

In yet another permutation of the original title (!), Andrew Gelman posted the answer Val Johnson sent him after our (submitted) letter to PNAS. As Val did not send me a copy (although Andrew did!), I will not reproduce it here and rather refer the interested readers to Andrew's blog… In addition to Andrew's (sensible) points, here are a few idle (post-X'mas and pre-skiing) reflections:

  • “evidence against a false null hypothesis accrues exponentially fast” makes me wonder in which metric this exponential rate (in γ?) occurs;
  • that “most decision-theoretic analyses of the optimal threshold to use for declaring a significant finding would lead to evidence thresholds that are substantially greater than 5 (and probably also greater than 25)” is difficult to accept as an argument since there is no trace of a decision-theoretic argument in the whole paper;
  • Val rejects our minimaxity argument on the basis that “[UMPBTs] do not involve minimization of maximum loss”, but the prior that corresponds to those tests is minimising the integrated probability of not rejecting at threshold level γ, a loss function integrated against parameter and observation, a Bayes risk in other words (see the sketch after this list)… Point masses or spike priors are clear characteristics of minimax priors. Furthermore, the additional argument that “in most applications, however, a unique loss function/prior distribution combination does not exist” has been used by many to refute the Bayesian perspective and makes me wonder what arguments are left for using a (pseudo-)Bayesian approach;
  • the next paragraph is pure tautology: the fact that “no other test, based on either a subjectively or objectively specified alternative hypothesis, is as likely to produce a Bayes factor that exceeds the specified evidence threshold” is a paraphrase of the definition of UMPBTs, not an argument. And I do not see why we should solely “worry about false negatives”, since minimising those should lead to a point mass on the null (or, more seriously, should not lead to the minimax-like selection of the prior under the alternative).
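Here is the sketch promised above, in the one-sided normal case (my own illustration of the least-favourable flavour, with hypothetical function names, not Val's code): the Bayes factor against a point alternative μ exceeds γ exactly when x̄ clears a boundary whose minimiser over μ is the UMPBT spike σ√(2 log γ/n), independently of the true mean:

```python
import math
import numpy as np

# One-sided normal test of mu = 0 with known sigma and xbar ~ N(mu, sigma^2/n).
# For a point alternative mu > 0, the Bayes factor exceeds gamma iff
#   xbar > sigma^2 * log(gamma) / (n * mu) + mu / 2,
# and the UMPBT alternative is the mu minimising this rejection boundary.

def rejection_boundary(mu, gamma, sigma=1.0, n=1):
    return sigma**2 * math.log(gamma) / (n * mu) + mu / 2.0

gamma, sigma, n = 10.0, 1.0, 1
mus = np.linspace(0.1, 5.0, 2000)
boundaries = np.array([rejection_boundary(mu, gamma, sigma, n) for mu in mus])
print(f"numerical minimiser        : {mus[boundaries.argmin()]:.3f}")
print(f"sigma*sqrt(2 log(gamma)/n) : {sigma * math.sqrt(2 * math.log(gamma) / n):.3f}")
# Both print ~2.146: since the minimising spike does not depend on the true mean,
# it maximises the rejection probability for every parameter value at once.
```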

Revised evidence for statistical standards

Posted in Kids, Statistics, University life on December 19, 2013 by xi'an

We just submitted a letter to PNAS with Andrew Gelman last week, in reaction to Val Johnson's recent paper "Revised standards for statistical evidence", essentially summing up our earlier comments within 500 words. Actually, we wrote one draft each! In particular, Andrew came up with the (neat) rhetorical idea of alternative Ronald Fishers living in parallel universes, each of whom had set a different significance reference level and for whom alternative Val Johnsons would arise and propose a modification of the corresponding Fisher's level. For which I made the above graph, left out of the letter and its 500 words. It relates "the old z" and "the new z", meaning the boundaries of the rejection zones when, for each golden dot, the "old z" is the previous "new z" and "the new z" is Johnson's transform. We even figured out that Val's transform was bringing the significance down by a factor of 10 over a large range of values. As an aside, we also wondered why most of the supplementary material was spent on deriving UMPBTs for specific (formal) problems when the goal of the paper sounded much more global…

As I am aware we are not the only ones to have submitted a letter about Johnson's proposal, I am quite curious about the reception we will get from the editor! (Although I have to point out that all of my earlier submissions of letters to PNAS got accepted.)

on alternative perspectives and solutions on Bayesian tests

Posted in Statistics, Travel, University life on December 16, 2013 by xi'an

Here are the slides of my tutorial at O'Bayes 2013 today, a pot-pourri of various criticisms, recent and less recent (with, albeit less than usual, a certain proportion of recycled slides):

Valen in Le Monde

Posted in Books, Statistics, University life on November 21, 2013 by xi'an

Valen Johnson made the headlines in Le Monde last week. (More precisely, on its scientific blog Passeur de Sciences. Thanks, Julien, for the pointer!) With the alarming title of "A study questions a major tool of the scientific approach". The reason for this French fame is Valen's recent paper in PNAS, Revised standards for statistical evidence, where he puts forward his uniformly most powerful Bayesian tests (recently discussed on the 'Og) to argue against the standard 0.05 significance level and in favour of "the 0.005 or 0.001 level of significance."

“…many statisticians have noted that P values of 0.05 may correspond to Bayes factors that only favor the alternative hypothesis by odds of 3 or 4 to 1…” V. Johnson, PNAS

While I do plan to discuss the PNAS paper later (and possibly write a comment letter to PNAS with Andrew), I find it interesting how quickly it made the headlines, within days of its (early edition) publication: the argument suggesting to replace .05 with .001 to increase the proportion of reproducible studies is both simple and convincing for a scientific journalist. If only the issue with p-values and statistical testing could be that simple… For instance, the above quote from Valen is reproduced as "an [alternative] hypothesis that stands right below the significance level has in truth only 3 to 5 chances to 1 to be true", the "truth" popping out of nowhere. (If you read French, the 300+ comments on the blog are also worth their weight in jellybeans…)
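Incidentally, the 3-or-4-to-1 figure is easy to reproduce, assuming the one-sided normal UMPBT setting of the paper: inverting the rejection region z > √(2 log γ) attaches the evidence threshold γ = exp(z²/2) to a given significance boundary z. A quick sketch of the correspondence (my own check, not the paper's code):

```python
import math
from scipy.stats import norm

# Inverting the one-sided rejection region z > sqrt(2 log gamma) attaches the
# evidence threshold gamma = exp(z^2/2) to a given significance boundary z.
for alpha in (0.05, 0.01, 0.005, 0.001):
    z = norm.ppf(1.0 - alpha)              # one-sided boundary
    print(f"alpha = {alpha:<6}: z = {z:.3f}, gamma = {math.exp(z**2 / 2):.1f}")
# alpha = 0.05 gives gamma ~ 3.9, the 'odds of 3 or 4 to 1' of the quote;
# demanding gamma in the 25-50 range drags alpha down to about 0.005 and below
```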

uniformly most powerful Bayesian tests???

Posted in Books, Statistics, University life on September 30, 2013 by xi'an

“The difficulty in constructing a Bayesian hypothesis test arises from the requirement to specify an alternative hypothesis.”

Valen Johnson published (and arXived) a paper in the Annals of Statistics on uniformly most powerful Bayesian tests. This is in line with Valen's earlier writings on the topic and is good-quality mathematical statistics, but I cannot really buy the arguments contained in the paper as being compatible with (my view of) Bayesian tests. A "uniformly most powerful Bayesian test" (acronymed as UMPBT) is defined as

“UMPBTs provide a new form of default, nonsubjective Bayesian tests in which the alternative hypothesis is determined so as to maximize the probability that a Bayes factor exceeds a specified threshold”

which means selecting the prior under the alternative so that the frequentist probability of the Bayes factor exceeding the threshold is maximal for all values of the parameter. This does not sound very Bayesian to me, for several reasons: it averages over all possible values of the observation x and compares probabilities for all values of the parameter θ, rather than integrating against a prior or a posterior; it selects the prior under the alternative with the sole purpose of favouring the alternative, meaning that its further use once the null is rejected is not considered at all; and it caters to non-Bayesian theories, i.e., tries to sell Bayesian tools as supplements to p-values while arguing the method is objective because the solution satisfies a frequentist coverage property. (At best, this maximisation of the rejection probability reminds me of minimaxity, except there is no clear and generic notion of minimaxity in hypothesis testing.)
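To see what the "uniformly" buys, here is a small numerical check (my own sketch, in the standardised one-sided normal case): whatever the true value of the parameter, the point-mass alternative maximising the frequentist probability that the Bayes factor exceeds γ is the same, namely √(2 log γ):

```python
import numpy as np
from scipy.stats import norm

# With standardised mean u under the point alternative, BF > gamma iff the
# z-statistic clears log(gamma)/u + u/2; under a true standardised mean zeta
# this happens with probability 1 - Phi(log(gamma)/u + u/2 - zeta).
gamma = 10.0
us = np.linspace(0.1, 5.0, 2000)            # candidate point-mass alternatives
cutoffs = np.log(gamma) / us + us / 2.0

for zeta in (0.0, 0.5, 1.0, 2.0, 4.0):
    probs = norm.sf(cutoffs - zeta)         # P(BF > gamma) under each candidate
    print(f"true zeta = {zeta}: winning alternative u = {us[probs.argmax()]:.3f}")
# The winner is ~2.146 = sqrt(2 log 10) whatever zeta: the same spike maximises
# the exceedance probability for all parameter values, hence 'uniformly' most
# powerful, in a distinctly frequentist sense.
```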
