who’s afraid of the big B wolf?

Aris Spanos just published a paper entitled “Who should be afraid of the Jeffreys-Lindley paradox?” in the journal Philosophy of Science. This piece is a continuation of the debate about frequentist versus likelihoodist versus Bayesian (should it be Bayesianist?! or Laplacist?!) testing approaches, laid out in Mayo and Spanos’ Error and Inference and discussed in several posts of the ‘Og. I started reading the paper in conjunction with a paper I am currently writing for a special volume in honour of Dennis Lindley, a paper that I will discuss later on the ‘Og…

“…the postdata severity evaluation (…) addresses the key problem with Fisherian p-values in the sense that the severity evaluation provides the “magnitude” of the warranted discrepancy from the null by taking into account the generic capacity of the test (that includes n) in question as it relates to the observed data” (p.88)

First, the antagonistic style of the paper reminds me of Spanos’ previous works in that it relies on repeated value judgements (such as “Bayesian charge”, “blatant misinterpretation”, “Bayesian allegations that have undermined the credibility of frequentist statistics”, “both approaches are far from immune to fallacious interpretations”, “only crude rules of thumbs”, &tc.) and rhetorical sleights of hand. (See, e.g., “In contrast, the severity account ensures learning from data by employing trustworthy evidence (…), the reliability of evidence being calibrated in terms of the relevant error probabilities” [my stress].) Relatedly, Spanos often resorts to an unusual [at least for statisticians] vocabulary that amounts to newspeak. Here are some illustrations: “summoning the generic capacity of the test”, “substantively significant”, “custom tailoring the generic capacity of the test”, “the fallacy of acceptance”, “the relevance of the generic capacity of the particular test”; yes, the term “generic capacity” occurs there with a truly high frequency.

“…there is nothing fallacious or paradoxical about a small p-value or a rejection of the null, for a given significance level α, when n is large enough, since a highly sensitive test is likely to pick up on tiny (in a substantive sense) discrepancies from H0.” (p.75)

I also note that evidence is never defined throughout the paper. Neither is the specific purpose of conducting a test (against, say, constructing a confidence interval) discussed. It is as though there were an obvious truth that only one approach could reach… However, this is not so surprising given the anti-decision-theory feelings of the author. Ironically, the numerical example used in the paper (borrowed from Stone, 1997, also the father of the marginalisation paradoxes) is the very same as Bayes’s [billiard example] (with a larger n admittedly) and as Laplace’s [example on births] (with a similar value of n).

“…the problem does not lie with the p-value or the accept/reject rules as such, but with how such results are transformed into evidence for or against H0 or a particular alternative.” (p.76)

Second, the paper argues that the Jeffreys-Lindley paradox counts against the Bayesian (and likelihood) resolutions of the problem, for failing to account for the large sample size. I do not disagree with this perspective, as I consider the main lesson to be learned from this great paper is that vague priors are dreadful when conducting hypothesis testing. (More on that in the incoming paper.) While I do not see much strength in arguing in favour of a procedure that would always conclude by picking the null, no matter what the value of the test statistic is, considering a fixed value of the t statistic has little meaning in an asymptotic framework. Either the t statistic converges in distribution under the null or it diverges to infinity under the alternative. (Again, more on that in the incoming paper.) If still considering the whole shebang of hypothesis testing, I would actually say the paradox expresses difficulties for all three threads: going Fisher, how should one decrease the bound on the p-value as n increases, and on which principle should this bound be chosen? The paper mentions (p.78) that because “of the large sample size, it is often judicious to choose a small type I error, say α=.003”, but this simply points out the arbitrariness of this bound (or worse, that it is dictated by the data, since the p-value is the nearby .0027). I also find the consistency argument rather (big bad wolf) “huff and puff” in that case, since the Bayes factor and the likelihood ratio tests are also consistent testing procedures. Going Neyman (or going Pearson!), how can one use the balance or imbalance between Type I and Type II errors? In other words, I have trouble with the notion of power when it is a function that depends on the unknown parameter. In particular, the power remains desperately and constantly equal to the Type I error at the boundary of the parameter set. Without a prior distribution, giving a meaning to something like (eqn. (25), p.87)

ℙ(x: d(X) < d(x₀); θ > θ₁ is false)

seems hopeless. Going Bayes with a flat prior on the binomial probability leads to a Bayes factor of 8.115 in favour of the null (p.80). While not a huge quantity per se, the difficulty lies in calibrating it, Jeffreys’s scale being all straw and no brick, to keep with the title of the paper.
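
A quick numerical rendering of the paradox on this example (a sketch only: the counts n = 527,135 and x = 106,298 are the ones I recall from Stone’s example, with null value θ₀ = 0.2, so treat them as an assumption):

```python
# Sketch of the Jeffreys-Lindley tension on the binomial example (counts assumed,
# as recalled above): a p-value around .0027 sits next to a flat-prior Bayes
# factor of about 8 in favour of the null.
from scipy.stats import binom, norm

n, x, theta0 = 527135, 106298, 0.2
se = (theta0 * (1 - theta0) / n) ** 0.5
z = (x / n - theta0) / se                    # standardised test statistic, ~3

pvalue = 2 * (1 - norm.cdf(abs(z)))          # two-sided p-value, ~.0027
# under a flat prior on theta, the marginal likelihood of x is 1/(n+1)
bf01 = binom.pmf(x, n, theta0) * (n + 1)     # Bayes factor for H0, ~8.1
print(round(z, 2), round(pvalue, 4), round(bf01, 2))
```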

“…what is problematic is the move from the accept/reject results, and the p-value, to claiming that data x0 provide evidence for a particular hypothesis” (p.79)

Third, Aris Spanos uses the failures (or fallacies?) of all three main approaches to advocate for the “postdata severity evaluation” introduced in an earlier paper with Deborah Mayo. Section 6 starts with the strange argument that, since we have observed x₀, the sign of x₀−θ₀ “indicates the relevant direction of departure from H0“, strange indeed because random outcomes may well occur on the “other” side of θ₀. The notion of severe tests has been advocated by Mayo and Spanos over the past years, but I do not know whether or not it has had any impact on the practice of statistics: the solution seems to require even further calibration than the regular p-value and is thus bound to confuse practitioners. Indeed, the severity evaluation implies defining, for each departure from the null θ₀+γ, the probability that data associated with this parameter value “accords less with θ>θ₁ [should it be θ₀?] than x₀ does” (p.87). It is therefore a mix of p-value and of type II error that is supposed to “provide the `magnitude’ of the warranted discrepancy from the null” (p.88), i.e. to decide how close to the null we can get and still discriminate the null from the alternative. As discussed in the paper, the value of this closest discrepancy γ depends on another tail probability, the “severity threshold”, which has to be chosen by the experimenter without being particularly intuitive. Further, once the resulting discrepancy γ is found, whether it is “substantially significant (…) pertains to the substantive subject matter” (p.88), implying some sort of loss function that is so blatantly ignored throughout the paper. (Note that, as a statistic, the Bayes factor or the likelihood ratio could be processed in exactly the same way.)
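
For concreteness, here is a sketch of this severity computation as I read eqns (24)–(25), using the same (assumed) binomial counts as above and a normal approximation with the variance evaluated at θ₁ = θ₀+γ; the paper may standardise slightly differently, so this is indicative only:

```python
# Severity curve, as I read it: SEV(theta > theta0 + gamma) is the probability,
# under theta1 = theta0 + gamma, that d(X) is no larger than the observed d(x0),
# i.e. (with a normal approximation) P(Xbar <= xbar; theta = theta1).
import numpy as np
from scipy.stats import norm

n, x, theta0 = 527135, 106298, 0.2          # counts assumed, as above
xbar = x / n

def severity(gamma):
    theta1 = theta0 + gamma
    se1 = np.sqrt(theta1 * (1 - theta1) / n)
    return norm.cdf((xbar - theta1) / se1)

for gamma in (0.0, 0.001, 0.00165, 0.002, 0.003):
    print(f"gamma = {gamma:.5f}   SEV = {severity(gamma):.3f}")
# the curve drops from ~.999 at gamma = 0 to ~.5 at the observed discrepancy,
# which is what makes the choice of a 'severity threshold' unavoidable
```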

“…the fact that the discrepancy between the p-value and the Bayes factor is smaller for a composite null is not particularly interesting because both measures are misleading in different ways.” (p.80)

Part of the discussion of the Bayesian approach argues (p.81) about other values of θ supported, and even better supported, by the data than the null value. This is a surprising argument as it relates to confidence intervals rather than to testing. The observed data may “favor certain values more strongly” but those values are (a) driven by the data and (b) of no particular relevance for conducting a test. The tested value, 0.2 say, is chosen before the experiment because it has some special meaning for the problem at hand. The fact that the likelihood and the posterior are larger at other values does not “constitute conflicting evidence” that the null hypothesis holds. I am just bemused by such arguments.

“…what goes wrong is that the Bayesian factor [sic] and the likelihoodist procedures use Euclidean geometry to evaluate evidence for different hypotheses when in fact the statistical testing space is curved” (p.90)

A few mathematical and typographical quibbles: (a) the notation Bin(0,1;n) used in eqn (4) (p.78) for the standardised binomial is not valid, as the standardised binomial still depends on the parameter θ₀ (except if you use the normal approximation); (b) the fact that the p-value is an upper bound on the infimum of the posterior probabilities (p.80) clearly depends on which set of prior distributions is considered; (c) the intervals favoured by the Bayes factor more strongly than θ₀ (p.81) are data-dependent, so they can hardly be opposed to a testing value chosen before the observation; (d) the discussion on p.82 confuses confidence intervals and tests; (e) the invariance of the Bayes factor to n (p.84) is nowhere to be seen; (f) turning the two-sided test into a one-sided one after seeing the data (p.86) is incorrect [from a frequentist viewpoint]; (g) eqn (25) has no frequentist meaning; (h) the last line on p.87 mixes the standardised and the non-standardised versions of the test statistic; (i) there are several typos in Table 1 (p.88); (j) the test statistic is but seldom a monotone function of the likelihood ratio statistic (p.90); (k) a Bayes factor does not “ignore the sampling distribution (…) by invoking the likelihood principle” (p.90), as one just has to look at its denominator; (l) as written in the above quote, a Bayes factor does not rely on Euclidean geometry (p.90) since it is invariant under reparameterisation.
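
On point (l), a toy numerical check (with made-up counts) that the Bayes factor is indeed unchanged by a reparameterisation of the binomial probability, once the prior carries the Jacobian:

```python
# Invariance of the Bayes factor under reparameterisation, checked numerically:
# the flat prior on theta becomes the density theta*(1-theta) on eta = logit(theta),
# and the two marginal likelihoods (hence the two Bayes factors) coincide.
import numpy as np
from scipy import integrate, stats

n, x, theta0 = 100, 26, 0.2                  # made-up toy counts

def lik(theta):
    return stats.binom.pmf(x, n, theta)

m_theta, _ = integrate.quad(lik, 0.0, 1.0)   # marginal under a flat prior on theta

def integrand_eta(eta):
    theta = 1.0 / (1.0 + np.exp(-eta))
    return lik(theta) * theta * (1.0 - theta)  # prior density on eta = Jacobian

m_eta, _ = integrate.quad(integrand_eta, -8.0, 4.0, points=[-1.0])

print(lik(theta0) / m_theta, lik(theta0) / m_eta)  # equal up to quadrature error
```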

11 Responses to “who’s afraid of the big B wolf?”

  1. Those comments on Spanos (2013) have been included in an arXiv document on the Jeffreys-Lindley paradox.

  2. Also, another way to make SEV differ from the Bayesian posterior would be to include a non-uniform prior for the parameter. For example, suppose we knew that the parameter was restricted to a definite interval [a,b]. That information would have absolutely no effect on SEV, since it only depends on the sampling distribution for the errors.

    The Bayesian posterior on the other hand can be quite different if you use a prior to encode the [a,b] restriction. A comparison of the Bayesian answer and the SEV answer for various endpoints [a,b] would be quite illuminating.
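
    A quick sketch of this point, assuming the normal-mean setting of the next comment with a standard error of 0.2 for the sample mean, an observed mean of 0.4, and arbitrary restriction intervals:

    ```python
    # SEV only uses the sampling distribution, so restricting mu to [a, b] leaves it
    # unchanged; a posterior built from a uniform prior on [a, b] does react.
    from scipy.stats import norm

    se, xbar, mu1 = 0.2, 0.4, 0.2        # assumed standard error, observed mean, cutpoint

    sev = norm.cdf((xbar - mu1) / se)    # P(Xbar <= xbar; mu = mu1): no prior involved

    def post_prob(a, b):
        """P(mu > mu1 | xbar) under a uniform prior on [a, b] (truncated-normal posterior)."""
        mass = norm.cdf((b - xbar) / se) - norm.cdf((a - xbar) / se)
        above = norm.cdf((b - xbar) / se) - norm.cdf((max(mu1, a) - xbar) / se)
        return above / mass

    print("SEV(mu > 0.2)                     :", round(sev, 3))                  # ~0.841
    print("posterior, flat prior on [-10, 10]:", round(post_prob(-10, 10), 3))   # ~0.841
    print("posterior, flat prior on [0, 0.3] :", round(post_prob(0, 0.3), 3))    # ~0.52
    ```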

  3. For simplicity, refer to the example in my blogpost:

    Mayo: (section 6) “StatSci and PhilSci: part 2”


    Using the one-sided test T+
    H₀: μ ≤ 0 vs H₁: μ > 0
    defined in that post, let the null be rejected whenever the sample mean is observed to be as great as, or greater than, .4 (significance level ~.03). So .4 is the fixed cut-off for rejecting the null in a standard N-P test. The power of the test against .5 is POW(.5) = .7, and the power against .6 is POW(.6) = .84, etc. Compare to values of SEV for claims about positive discrepancies from 0, with the same observed sample mean, .4. They are given in this post.
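
    A short check of these numbers, under the assumption (not stated here) that the standard error of the sample mean is 0.2, e.g. σ = 2 and n = 100, which reproduces the quoted values:

    ```python
    # Power of the T+ test rejecting H0: mu <= 0 when the sample mean exceeds 0.4,
    # with an assumed standard error of 0.2 for the sample mean.
    from scipy.stats import norm

    se, cutoff = 0.2, 0.4

    alpha = 1 - norm.cdf(cutoff / se)                  # significance level ~.023

    def power(mu):
        return 1 - norm.cdf((cutoff - mu) / se)

    print(round(alpha, 3), round(power(0.5), 2), round(power(0.6), 2))
    # -> 0.023 0.69 0.84, i.e. roughly the ~.03, .7 and .84 quoted above
    ```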

  4. Just a remark on your claim: “Without a prior distribution, giving a meaning to something like (eqn. (25), p.87)
    ℙ(x: d(X) < d(x₀); θ > θ₁ is false)
    seems hopeless.”

    Not the slightest bit hopeless, and no prior is needed. I and others have shown how to compute these SEV assessments, just as with ordinary power. Or are you claiming ordinary power requires priors? They don't. In the same way people derive power curves, we derive severity curves.

    Examples are linked to on the page:
    http://www.phil.vt.edu/dmayo/personal_website/bibliography%20complete.htm

    Mayo, D. (1983). "An Objective Theory of Statistical Testing." Synthese 57(2): 297-340.

    Mayo, D. (1996). Chapter 11 (of Error and the Growth of Experimental Knowledge): Why Pearson Rejected the Neyman-Pearson (Behavioristic) Philosophy and a Note on Objectivity in Statistics.

    Mayo, D. G. and Cox, D. R. (2006). "Frequentist Statistics as a Theory of Inductive Inference," Optimality: The Second Erich L. Lehmann Symposium (ed. J. Rojo), Lecture Notes-Monograph Series, Institute of Mathematical Statistics (IMS), Vol. 49: 77-97.

    Mayo, D. G. and Spanos, A. (2006). "Severe Testing as a Basic Concept in a Neyman-Pearson Philosophy of Induction," British Journal of Philosophy of Science, 57: 323-357.

    • Thanks: the notation is a bit confused up there, but what I mean is: how can one condition on {θ; θ > θ₁ is false} without a prior?

      • And I repeat my answer: one computes at a point and the others follow, as with power curves (for the examples discussed). (Even though it is not a conditional but as with all error probabilities “computed under the assumption that”).

      • Ok, I see: the notation is definitely confusing, the statement should not be inside the probability…

      • In Spanos (2013), I see that this probability is computed at the boundary θ₁ so how does it differ from a Type II error?

      • Entsophy Says:

        xi’an,

        Severity of a hypothesis such as:

        H: theta>theta_0

        is calculated using the boundary value theta_0 (since in the examples where I’ve seen this done, this value resulted in the largest severity). But what about the hypothesis:

        H’: theta_0+10^{-1000} > theta >theta_0

        I take it that the Severity of H’ would again be calculated at theta_0 for the same reason, resulting in H’ having the same Severity as H.

      • Uh, uh… The way the severity is explained in Spanos’ Philosophy of Science paper I am discussing here is that the severity is computed at the current value, θ₁, as it grows away from θ₀, which sounds like a Type II error to me.

      • Entsophy Says:

        xi’an,

        The Normal distribution example usually used in Mayo’s papers has the property that SEV(mu > mu_0) is numerically equal to the Bayesian P(mu > mu_0 | data). You can see this by a simple change of variables in the integral used to compute it. This congruence wouldn’t be such a big deal if it weren’t for the fact that this is almost the only example ever discussed.

        This identity doesn’t always hold though. For example, using the H’ defined above, SEV(H’) is definitely no longer equal to P(H’ | data). So maybe H’ would be a good example to illustrate the different behavior and performance of SEV and Bayesian posteriors.
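
        A sketch of this congruence, reusing the illustrative values x̄ = 0.4 and standard error 0.2 from the T+ example above, with a flat prior on μ and known σ:

        ```python
        # With a flat prior on mu and known sigma, the posterior is N(xbar, se^2), so the
        # posterior probability of mu > mu1 equals SEV(mu > mu1) = P(Xbar <= xbar; mu = mu1).
        from scipy.stats import norm

        xbar, se = 0.4, 0.2                         # assumed values, as in the T+ example
        for mu1 in (0.0, 0.2, 0.4, 0.6):
            sev = norm.cdf((xbar - mu1) / se)       # severity of 'mu > mu1'
            post = 1 - norm.cdf((mu1 - xbar) / se)  # flat-prior posterior P(mu > mu1 | data)
            print(mu1, round(sev, 4), round(post, 4))   # identical columns
        ```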
