Error and Inference [end]

(This is my sixth and last post on Error and Inference, being as previously a raw and naïve reaction born from a linear and sluggish reading of the book, rather than a deeper and more informed criticism with philosophical bearings. Read at your own risk.)

"It is refreshing to see Cox and Mayo give a hard-nosed statement of what scientific objectivity demands of an account of statistics, show how it relates to frequentist statistics, and contrast that with the notion of "objectivity" used by O-Bayesians." — A. Spanos, p.326, Error and Inference, 2010

In order to conclude my pedestrian traverse of Error and Inference, I read the discussion by Aris Spanos of the second part of the seventh chapter, by David Cox and Deborah Mayo, discussed in the previous post. (On the train to the half-marathon, to be precise, which may have added a sharper edge to the way I read it!) The first point in the discussion is that the above paper is "a harmonious blend of the Fisherian and N-P perspectives to weave a coherent frequentist inductive reasoning anchored firmly on error probabilities" (p.316). The discussion by Spanos is very much uncritical of the paper, so I will not engage in a criticism of the non-criticism, but rather expose some thoughts of mine that came from reading this apology. (Remarks about Bayesian inference are limited to some piques like the above, which only reiterate those found earlier [and later: "the various examples Bayesians employ to make their case involve some kind of 'rigging' of the statistical model", Aris Spanos, p.325; "The Bayesian epistemology literature is filled with shadows and illusions", Clark Glymour, p.335] in the book.) [I must add I do like the mention of O-Bayesians, as I coined the O'Bayes motto for the objective Bayes biennial meetings from 2003 onwards! It also reminds me of the O-rings and of the lack of proper statistical decision-making in the Challenger tragedy...]

The “general frequentist principle for inductive reasoning” (p.319) at the core of Cox and Mayo’s paper is obviously the central role of the p-value in “providing (strong) evidence against the null H0 (for a discrepancy from H0)”. Once again, I fail to see it as the epitome of a working principle in that

  1. it depends on the choice of a divergence d(z), which reduces the information brought by the data z;
  2. it articulates neither the level at which a p-value is deemed low nor the consequences of finding such a low p-value;
  3. it ignores the role of the alternative hypothesis.
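
To make the first point concrete, here is a small Monte Carlo sketch of my own (an illustration, not an example from the paper, with a hypothetical sample): two different choices of the divergence d(z), applied to the very same data, produce very different p-values under the same null.

```python
import numpy as np

rng = np.random.default_rng(0)

# An illustrative observed sample (hypothetical data, n = 20)
z_obs = np.full(20, 0.5)

# Two choices of divergence d(z) for H0: Z_i ~ N(0, 1) iid
d_mean = lambda x: np.abs(np.mean(x, axis=-1))  # based on the sample mean
d_max = lambda x: np.max(np.abs(x), axis=-1)    # based on the largest observation

# Monte Carlo p-values under H0
null = rng.standard_normal((100_000, 20))
p_mean = float(np.mean(d_mean(null) >= d_mean(z_obs)))
p_max = float(np.mean(d_max(null) >= d_max(z_obs)))

print(p_mean, p_max)  # same data, two very different p-values
```

The mean-based statistic flags the sample as highly discrepant from H0, while the max-based statistic sees nothing unusual, which is precisely why the choice of d(z) matters.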

Furthermore, Spanos' discussion deals with "the fallacy of rejection" (pp.319-320) in a rather artificial (if common) way, namely by setting a buffer of discrepancy γ around the null hypothesis. While the choice of a maximal degree of precision sounds natural to me (in the sense that a given sample size should not allow for the discrimination between two arbitrarily close values of the parameter), the fact that γ is ultimately set by the data (so that the p-value is high) is fairly puzzling. If I understand correctly, the change from a p-value to a discrepancy γ is a fine device to make the "distance" from the null better understood, but it has an extremely limited range of application. If I do not understand correctly, the discrepancy γ is fixed by the statistician, and then this sounds like an extreme form of prior selection.

There is at least one issue I do not understand in this part, namely the meaning of the severity evaluation probability

P(d(Z) > d(z_0);\,\mu> \mu_1)

as the conditioning on the event seems impossible in a frequentist setting. This leads me to an idle and unrelated questioning as to whether there is a solution to

\sup_d \mathbb{P}_{H_0}(d(Z) \ge d(z_0))

as this would be the ultimate discrepancy. Or whether this does not make any sense… because of the ambiguous role of z_0, which needs somehow to be integrated out. (Otherwise, d can be chosen so that the probability is 1.)
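
For what it is worth, in the standard one-sided normal-mean example the severity evaluation reduces to a simple closed form; the sketch below follows my own reading of the construction, with illustrative values of n and σ, evaluating at the boundary value μ = μ1 since the infimum over the composite is attained there.

```python
import math

def Phi(x):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def severity(xbar, mu1, mu0=0.0, sigma=1.0, n=100):
    """Severity for the claim mu > mu1 after rejecting H0: mu = mu0,
    with d(z) = sqrt(n) (zbar - mu0) / sigma: P(d(Z) <= d(z_0); mu = mu1).
    The infimum over {mu <= mu1} sits at the boundary mu = mu1."""
    se = sigma / math.sqrt(n)
    return Phi((xbar - mu1) / se)

# e.g. xbar = 0.2 with n = 100 and sigma = 1:
print(severity(0.2, 0.1))  # about 0.84: "mu > 0.1" passes with high severity
print(severity(0.2, 0.2))  # exactly 0.5 at mu1 = xbar
```

This makes clear that no conditioning on {μ > μ1} is actually performed: μ1 is simply plugged in as the least favourable point of the composite claim.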

“If one renounces the likelihood, the stopping rule, and the coherence principles, marginalizes the use of prior information as largely untrustworthy, and seek procedures with `good’ error probabilistic properties (whatever that means), what is left to render the inference Bayesian, apart from a belief (misguided in my view) that the only way to provide an evidential account of inference is to attach probabilities to hypotheses?”—A. Spanos, p.326, Error and Inference, 2010

The role of conditioning on ancillary statistics is emphasized both in the paper and the discussion. This conditioning clearly reduces variability; however, no reservation is expressed about the arbitrariness of such ancillary statistics, nor about the fact that conditioning any further would lead to conditioning upon the whole data, i.e. to a Bayesian solution. I also noted a curious lack of proper logical reasoning in the argument that, when

f(z|\theta) \propto f(z|s) f(s|\theta),

using the conditional distribution f(z|s) is enough: while "any departure from f(z|s) implies that the overall model is false" (p.322), the reverse does not hold, hence a poor choice of s may fail to detect a departure. (Besides the fact that fixed-dimension sufficient statistics do not exist outside exponential families.) Similarly, Spanos expands about the case of a minimal sufficient statistic that is independent of a maximal ancillary statistic, but such cases are quite rare and limited to exponential families [in the iid case]. Still in the conditioning category, he also supports Mayo's argument against the likelihood principle being a consequence of the sufficiency and weak conditionality principles, a point I discussed in a previous post. However, he does not provide further evidence against Birnbaum's result, arguing rather in favour of a conditional frequentist inference I have nothing to complain about. (I fail to perceive the appeal of the Welch uniform example in terms of the likelihood principle.)
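
The sufficiency point can be checked numerically on a hypothetical Poisson example of my own: the conditional distribution f(z|s) is the same whatever θ, so a check built on f(z|s) alone can only detect departures from the conditional structure, never a poor fit in the f(s|θ) direction.

```python
import numpy as np

rng = np.random.default_rng(1)

def conditional_dist(theta, s=4, trials=200_000):
    """Monte Carlo distribution of X1 given X1 + X2 = s,
    for X1, X2 iid Poisson(theta). By sufficiency of the sum,
    this is Binomial(s, 1/2) whatever the value of theta."""
    x1 = rng.poisson(theta, trials)
    x2 = rng.poisson(theta, trials)
    kept = x1[x1 + x2 == s]
    return np.bincount(kept, minlength=s + 1) / len(kept)

print(conditional_dist(1.0))  # both close to [1, 4, 6, 4, 1] / 16,
print(conditional_dist(3.0))  # independently of theta
```

Since the two estimated conditional distributions agree, a departure affecting only the marginal model of the sufficient statistic would be invisible to a check restricted to f(z|s).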

In an overall conclusion, let me repeat and restate that this series of posts about Error and Inference is far from pretending to bring a Bayesian reply to the philosophical arguments raised in the volume. Since the primary goal of the volume is "taking some crucial steps towards legitimating the philosophy of frequentist statistics" (p.328), I should not feel overly concerned. It is only when the debate veered towards a comparison with the Bayesian approach [too often of the "holier than thou" brand] that I felt allowed to put in my twopennies worth… I do hope I may crystallise this set of notes into a more constructed review of the book, if time allows, although I am pessimistic about the chances of getting it published, given our current difficulties with the critical review of Murray Aitkin's Statistical Inference. However, by coincidence, we got back last weekend an encouraging reply from Statistics and Risk Modelling, prompting us towards a revision and the prospect of a reply by Murray.

7 Responses to “Error and Inference [end]”


  2. I do not understand your question either. Do you mean my difficulty with the formula

    P(d(Z) > d(z_0);\,\mu> \mu_1)

    which is not defined from a frequentist perspective? (I should not have used conditioning there, as this is a conditioning only from a Bayesian perspective.) Or the remarks about conditioning upon ancillary statistics and the insufficiency of the distribution conditional on the ancillary statistic to signal departures from the null?

    • For the example given in the paper “Error Statistics”, this integral is exactly the same as the integral used to compute the Bayesian posterior P(mu>mu1: z0) using a uniform prior for mu. Just use a change of variables to transform the Bayesian integral into the one used to compute P(d(Z)>d(z0):mu>mu1).
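
The claimed coincidence is easy to check numerically in the one-sided normal-mean setting (illustrative values below, not taken from the paper): the severity for mu > mu1 and the flat-prior posterior probability of mu > mu1 both come out as Phi((xbar - mu1)/se).

```python
import math
import random

random.seed(2)
n, sigma, xbar, mu1 = 100, 1.0, 0.2, 0.1   # illustrative values
se = sigma / math.sqrt(n)

def Phi(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

sev = Phi((xbar - mu1) / se)               # severity for "mu > mu1"

# flat-prior posterior: mu | xbar ~ N(xbar, se^2)
draws = [random.gauss(xbar, se) for _ in range(200_000)]
post = sum(m > mu1 for m in draws) / len(draws)

print(sev, post)  # both close to Phi(1), about 0.841
```

The agreement, of course, says nothing about the two quantities obeying the same composition rules, which is the point made further down the thread.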

    • The P(d(Z)>d(z0): mu>mu1) is effectively defined as P(d(Z)>d(z0):mu1) because the value mu1 leads to a maximum “severity”. Of course this dodge only works because of the nice problem she chooses (i.e. exponential family distributions).

      This is actually a big problem for Mayo, but she’s so dogmatically sure of her philosophy that she won’t look at the technical details long enough to see why.

      The problem comes from Cox's Theorem (made famous by Jaynes). The hypothesis mu>mu1 is a compound hypothesis. So according to Cox's theorem you should handle the composite hypothesis using the product rule [A&B]=[A:B][B] or you'll run into absurdities (an "absurdity" is as defined by the statement of the theorem itself). Taking [A&B] = max{ [A],[B] } violates this rule and it's not hard to think of examples where this is clearly wrong. (Mayo simply denies that Cox's theorem applies, apparently unaware that it is a statement of mathematics and not philosophy.)

      So the problem for Mayo is that she can't widely apply the severity concept to non-trivial real problems. If she does, the absurdities of using [A&B] = max{ [A],[B] } will become apparent. Either she or someone else will then want to patch things up to remove the problems. But once you patch them up, it will bring the whole analysis ever closer to Bayesian statistics (via the magic of Cox's theorem).

      It’s already much closer than she thinks, because she dodged this problem in the example above by reducing her calculation to the equivalent of the Bayesian P(mu>mu1: z0). As long as you don’t stray too far, the resulting numbers (unsurprisingly) seem like they remove the flaws of classical p-value type statistics.

    • In reference to Cox’s Theorem note that from Mayo’s Error Statistics paper:

      SEV(H)+SEV(not H)=1

      Thus the sum rule is already satisfied. So Dr. Mayo has to do something like:

      SEV(H,J)=max{SEV(H),SEV(J)}

      Because if she ever identifies a “conditional severity” and writes

      SEV(H,J)=SEV(H:J)SEV(J)

      Then she is for all practical purposes assigning Bayesian-style probabilities to hypotheses, which she dogmatically insists cannot be done.

      I get the feeling that she'd have liked to avoid this altogether by never defining severity for a composite hypothesis, but that would make the concept useless. So she picked the very special example of IID normals, and then took the severity of the composite to be the maximum of each severity individually.

      In that case, the answer is the same as the Bayesian P(mu>mu1:z0), so the resulting numbers appear to correct all the problems with p-values. However that doesn’t mean that SEV is the same as a posterior probability. The difference arises because they follow different rules of composition.

      Consider H = "mu > mu1" and H' = "mu1 + 10^(-1000) > mu > mu1".
      According to her "maximum rule" in the example from which this is drawn in her "Error Statistics" paper:
      SEV(H) = SEV(H').

      I respectfully submit that H and H’ have NOT been tested with the same severity.

      • My apologies: I mean "min" not "max". In the paper "Error Statistics" Dr. Mayo uses a "min severity" rule. See for example:

        “How do we calculate
        P(X<.4|\mu\le.2\text{ is true})
        when μ ≤ .2 is a composite claim? We need only to calculate it for the point
        μ=.2 because μ values less than .2 would yield an even higher SEV value."
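
A quick numerical check of the quoted claim, with an illustrative standard error rather than the paper's exact setup: P(X̄ < .4; μ) decreases in μ, so its infimum over the composite {μ ≤ .2} is indeed attained at the boundary point μ = .2, which is all the "min rule" amounts to here.

```python
import math

def Phi(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

se = 0.1  # illustrative standard error, not necessarily the paper's

def sev_point(mu, cutoff=0.4):
    """P(Xbar < cutoff; mu) for Xbar ~ N(mu, se^2)."""
    return Phi((cutoff - mu) / se)

grid = [k / 100 for k in range(-100, 21)]   # mu from -1.00 to 0.20
vals = [sev_point(m) for m in grid]

# monotone decreasing in mu: the min over {mu <= .2} sits at the boundary
assert min(vals) == vals[-1] == sev_point(0.2)
print(sev_point(0.2))  # Phi(2), about 0.977
```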

  3. There is no conditioning on an event in calculating severity or any other error probability. I can’t figure out what in the world you can mean? Of course there’s so much else…..where to start?
