who’s afraid of the big B wolf?
Aris Spanos just published a paper entitled “Who should be afraid of the Jeffreys-Lindley paradox?” in the journal Philosophy of Science. This piece is a continuation of the debate about frequentist versus likelihoodist versus Bayesian (should it be Bayesianist?! or Laplacist?!) testing approaches, presented in Mayo and Spanos’ Error and Inference and discussed in several posts of the ‘Og. I started reading the paper in conjunction with a paper I am currently writing for a special volume in honour of Dennis Lindley, a paper that I will discuss later on the ‘Og…
“…the postdata severity evaluation (…) addresses the key problem with Fisherian p-values in the sense that the severity evaluation provides the “magnitude” of the warranted discrepancy from the null by taking into account the generic capacity of the test (that includes n) in question as it relates to the observed data”(p.88)
First, the antagonistic style of the paper reminds me of Spanos’ previous works in that it relies on repeated value judgements (such as “Bayesian charge”, “blatant misinterpretation”, “Bayesian allegations that have undermined the credibility of frequentist statistics”, “both approaches are far from immune to fallacious interpretations”, “only crude rules of thumbs”, &tc.) and rhetorical sleights of hand. (See, e.g., “In contrast, the severity account ensures learning from data by employing trustworthy evidence (…), the reliability of evidence being calibrated in terms of the relevant error probabilities” [my stress].) Connectedly, Spanos often resorts to an unusual [at least for statisticians] vocabulary that amounts to newspeak. Here are some illustrations: “summoning the generic capacity of the test”, “substantively significant”, “custom tailoring the generic capacity of the test”, “the fallacy of acceptance”, “the relevance of the generic capacity of the particular test”; yes, the term “generic capacity” occurs there with a truly high frequency.
“…there is nothing fallacious or paradoxical about a small p-value or a rejection of the null, for a given significance level a; when n is large enough, since a highly sensitive test is likely to pick up on tiny (in a substantive sense) discrepancies from H0.” (p.75)
I also note that evidence is never defined throughout the paper. Neither is the specific purpose of conducting a test (against, say, constructing a confidence interval) discussed. It is as though there were an obvious truth that only one approach could reach… However, this is not so surprising given the anti-decision-theory feelings of the author. Ironically, the numerical example used in the paper (borrowed from Stone, 1997, also a father of the marginalisation paradoxes) is the very same as Bayes’s [billiard example] (with a larger n admittedly) and as Laplace’s [example on births] (with a similar value of n).
“…the problem does not lie with the p-value or the accept/reject rules as such, but with how such results are transformed into evidence for or against H0 or a particular alternative.” (p.76)
Second, the paper argues that the Jeffreys-Lindley paradox counts against the Bayesian (and likelihood) resolutions of the problem for failing to account for the large sample size. I do not disagree with this perspective, as I consider the main lesson learned from this great paper to be that vague priors are dreadful when conducting hypothesis testing. (More on that in the incoming paper.) While I do not see much strength in arguing in favour of a procedure that would always conclude by picking the null, no matter what the value of the test statistic is, considering a fixed value of the t statistic has little meaning in an asymptotic framework: either the t statistic converges in distribution under the null or it diverges to infinity under the alternative. (More on that as well in the incoming paper.) If still considering the whole shebang of hypothesis testing, I would actually say the paradox exposes difficulties for all three threads. When going Fisher, how should one decrease the bound on the p-value as n increases, and on which principle should this bound be chosen? The paper mentions (p.78) that because “of the large sample size, it is often judicious to choose a small type I error, say α=.003”, but this simply points to the arbitrariness of this bound (or worse, suggests that it is dictated by the data, since the p-value is the nearby .0027). I also find the argument of consistency rather (big bad wolf) “huff and puff” in that case, since the Bayes factor and the likelihood ratio tests are also consistent testing procedures. Going Neyman (or going Pearson!), how can one use the balance or imbalance between Type I and Type II errors? In other words, I have trouble with the notion of power when it is a function that depends on the unknown parameter. In particular, the power remains desperately and constantly equal to the Type I error at the boundary of the parameter set. Without a prior distribution, giving a meaning to something like (eqn. (25), p.87)
ℙ(x; d(X) < d(x0); θ > θ1 is false)
seems hopeless. Going Bayes with a flat prior on the binomial probability leads to a Bayes factor of 8.115 (p.80). While not a huge quantity per se, the difficulty is in calibrating it, Jeffreys’s scale being all straw and no brick, in keeping with the title of the paper.
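For the record, the numbers behind the paradox are easy to check in a few lines. Here is a minimal sketch, assuming my reading of Stone’s example as used in the paper (n=527,135 trials, x0=106,298 successes, null value θ0=0.2; any misreading of those figures is mine): the two-sided p-value lands near .0027, while the Bayes factor under a flat prior comes out close to the 8.115 reported on p.80, in favour of the null.

```python
import math
from statistics import NormalDist

# Stone's (1997) binomial example as I read it from the paper:
# these figures (n, x0, theta0) are my assumption, to be checked there.
n, x0, theta0 = 527_135, 106_298, 0.2

# Frequentist side: standardised test statistic and two-sided p-value
z = (x0 - n * theta0) / math.sqrt(n * theta0 * (1 - theta0))
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

# Bayesian side: Bayes factor B01 under a flat Beta(1,1) prior on theta.
# The marginal likelihood under the flat prior is 1/(n+1), hence
# B01 = (n+1) * C(n, x0) * theta0^x0 * (1-theta0)^(n-x0), computed in logs.
log_binom = math.lgamma(n + 1) - math.lgamma(x0 + 1) - math.lgamma(n - x0 + 1)
log_B01 = (math.log(n + 1) + log_binom
           + x0 * math.log(theta0) + (n - x0) * math.log(1 - theta0))
B01 = math.exp(log_B01)

print(f"z = {z:.2f}, p-value = {p_value:.4f}, B01 = {B01:.3f}")
```

The paradox is then plain: the same data reject H0 at any conventional level while the Bayes factor supports it by a factor near 8.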
“…what is problematic is the move from the accept/reject results, and the p-value, to claiming that data x0 provide evidence for a particular hypothesis” (p.79)
Third, Aris Spanos uses the failures (or fallacies?) of all three main approaches to advocate for the “postdata severity evaluation” introduced in an earlier paper with Deborah Mayo. Section 6 starts with the strange argument that, since we have observed x0, the sign of x0-θ0 “indicates the relevant direction of departure from H0“, strange indeed because random outcomes may well occur on the “other” side of θ0. The notion of severe tests has been advocated by Mayo and Spanos over the past years, but I do not know whether or not it has had any impact on the practice of statistics: the solution seems to require even further calibration than the regular p-value and is thus bound to confuse practitioners. Indeed, the severity evaluation implies defining, for each departure from the null θ0+γ, the probability that data associated with this parameter value “accords less with θ>θ1 [should it be θ0?] than x0 does” (p.87). It is therefore a mix of p-value and of Type II error that is supposed to “provide the `magnitude’ of the warranted discrepancy from the null” (p.88), i.e. to decide how close to the null we can get and still discriminate the null from the alternative. As discussed in the paper, the value of this closest discrepancy γ depends on another tail probability, the “severity threshold”, which has to be chosen by the experimenter without being particularly intuitive. Further, once the resulting discrepancy γ is found, whether it is “substantially significant (..) pertains to the substantive subject matter” (p.88), implying some sort of loss function that is otherwise so blatantly ignored throughout the paper. (Note that, as a statistic, the Bayes factor or the likelihood ratio could be processed in exactly the same way.)
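To make the above concrete, here is a minimal sketch of how I understand the severity curve to be computed, under the normal approximation and for the binomial example (the formula is my reading of the Mayo–Spanos construction, and the figures n, x0, θ0 are again my reading of Stone’s example, not taken verbatim from the paper): after a rejection in the direction θ>θ0, the severity of the claim θ>θ1 is the probability, under θ1, of observing a test statistic no larger than the observed one.

```python
import math
from statistics import NormalDist

# Figures assumed from Stone's example as used in the paper (my reading)
n, x0, theta0 = 527_135, 106_298, 0.2

def severity(theta1: float) -> float:
    """P(d(X) <= d(x0); theta = theta1): severity of the claim theta > theta1
    after H0 was rejected in the direction theta > theta0 (normal approx.)."""
    mu, sd = n * theta1, math.sqrt(n * theta1 * (1 - theta1))
    return NormalDist().cdf((x0 - mu) / sd)

# Severity as a function of the discrepancy gamma = theta1 - theta0:
# it decreases from near 1 at the null towards 0 for larger discrepancies,
# and the "warranted" gamma is read off at a chosen severity threshold.
for gamma in (0.0, 0.001, 0.0015, 0.002, 0.003):
    theta1 = theta0 + gamma
    print(f"theta1 = {theta1:.4f}  SEV = {severity(theta1):.3f}")
```

The curve illustrates my point above: the answer still hinges on an arbitrary severity threshold, and nothing in the construction says which threshold the “substantive subject matter” dictates.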
“…the fact that the discrepancy between the p-value and the Bayes factor is smaller for a composite null is not particularly interesting because both measures are misleading in different ways.” (p.80)
Part of the discussion of the Bayesian approach argues (p.81) about other values of θ supported, and even better supported, by the data than the null value. This is a surprising argument as it relates to confidence intervals rather than to testing. The observed data may “favor certain values more strongly”, but those values are (a) driven by the data and (b) of no particular relevance for conducting a test. The tested value, 0.2 say, is chosen before the experiment because it has some special meaning for the problem at hand. The fact that the likelihood and the posterior are larger at other values does not “constitute conflicting evidence” against the null hypothesis. I am just bemused by such arguments.
“…what goes wrong is that the Bayesian factor [sic] and the likelihoodist procedures use Euclidean geometry to evaluate evidence for different hypotheses when in fact the statistical testing space is curved” (p.90)
A few mathematical and typographical quibbles: (a) the notation Bin(0,1;n) used in eqn (4) (p.78) for the standardised binomial is not valid, as the standardised binomial still depends on the parameter θ0 (except if one uses the normal approximation); (b) the fact that the p-value is an upper bound on the infimum of the posterior probabilities (p.80) clearly depends on which set of prior distributions is considered; (c) the intervals favoured by the Bayes factor more strongly than θ0 (p.81) are data-dependent, so can hardly be opposed to a testing value chosen before the observation; (d) the discussion on p.82 confuses confidence intervals and tests; (e) the invariance of the Bayes factor to n (p.84) is nowhere to be seen; (f) turning the two-sided test into a one-sided one after seeing the data (p.86) is incorrect [from a frequentist viewpoint]; (g) eqn (25) has no frequentist meaning; (h) the last line on p.87 mixes the standardised and non-standardised versions of the test statistic; (i) there are several typos in Table 1 (p.88); (j) the test statistic is but seldom a monotone function of the likelihood ratio statistic (p.90); (k) a Bayes factor does not “ignore the sampling distribution (…) by invoking the likelihood principle” (p.90), as one just has to look at its denominator; (l) as written in the above quote, a Bayes factor does not rely on Euclidean geometry (p.90), since it is invariant under reparameterisation.
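The last quibble is easy to check numerically. Here is a toy sketch (my own illustration, with made-up small numbers n=20, x=6, θ0=0.2, nothing taken from the paper): the marginal likelihood, hence the Bayes factor, is unchanged when moving from θ to the log-odds parameterisation φ = logit(θ), once the flat prior on θ is pushed forward with its Jacobian dθ/dφ = θ(1-θ).

```python
import math

# Toy numbers (my own, not from the paper), small enough for stable numerics
n, x, theta0 = 20, 6, 0.2

def pmf(theta: float) -> float:
    """Binomial likelihood of x successes out of n trials."""
    return math.comb(n, x) * theta**x * (1 - theta)**(n - x)

def trapezoid(f, a, b, m=40_000):
    """Plain trapezoidal rule on [a, b] with m panels."""
    h = (b - a) / m
    return h * (sum(f(a + i * h) for i in range(1, m)) + (f(a) + f(b)) / 2)

# Marginal likelihood in theta, flat prior on (0, 1); exact value is 1/(n+1)
m_theta = trapezoid(pmf, 0.0, 1.0)

# Same marginal in phi = logit(theta): the pushed-forward flat prior
# contributes the Jacobian theta * (1 - theta)
def integrand_phi(phi: float) -> float:
    theta = 1 / (1 + math.exp(-phi))
    return pmf(theta) * theta * (1 - theta)

m_phi = trapezoid(integrand_phi, -12.0, 12.0)  # tails beyond ±12 negligible

B01_theta = pmf(theta0) / m_theta
B01_phi = pmf(theta0) / m_phi
print(B01_theta, B01_phi)  # the two Bayes factors coincide
```

The two integrals agree (up to quadrature error) by a simple change of variables, which is exactly why the “curved testing space” charge quoted above misses the Bayes factor.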