That the likelihood principle does not hold…

Coming to Section III of Chapter Seven of Error and Inference, written by Deborah Mayo, I discovered that she considers that the likelihood principle does not hold (at least as a logical consequence of the combination of the sufficiency and conditionality principles), thus that Allan Birnbaum was wrong… As well as the dozens of people working on the likelihood principle after him! Including Jim Berger and Robert Wolpert [whose book sells for $214 on amazon! I hope the authors get a hefty chunk of that rip-off!!! Esp. when it is available for free on Project Euclid…] I had not heard of (nor seen) this argument previously, even though it has apparently created a bit of a stir around the likelihood principle page on Wikipedia. The result does not seem to be published anywhere but in the book, and I doubt it would get past a review process in a statistics journal. [Judging from a serious conversation in Zürich this morning, I may however be wrong!]

The core of Birnbaum’s proof is relatively simple: given two experiments E¹ and E² about the same parameter θ with different sampling distributions f¹ and f², such that there exists a pair of outcomes (y¹,y²) from those experiments with proportional likelihoods, i.e., as a function of θ

$f^1(y^1|\theta) = c f^2(y^2|\theta),$

one considers the mixture experiment E⁰ where E¹ and E² are each chosen with probability ½. Then it is possible to build a sufficient statistic T that is equal to the data (j,x), except when j=2 and x=y², in which case T(j,x)=(1,y¹). This statistic is sufficient since the distribution of (j,x) given T(j,x) is either a Dirac mass or a distribution on {(1,y¹),(2,y²)} that depends only on c, and hence not on the parameter θ. According to the weak conditionality principle, statistical evidence, meaning the whole range of inferences possible on θ and denoted by Ev(E,z), should satisfy

$Ev(E^0, (j,x)) = Ev(E^j,x)$

Because the sufficiency principle states that

$Ev(E^0, (j,x)) = Ev(E^0,T(j,x))$

this leads to the likelihood principle

$Ev(E^1,y^1)=Ev(E^0, (j,y^j)) = Ev(E^2,y^2)$

(See, e.g., The Bayesian Choice, pp. 18-29.) Now, Mayo argues this is wrong because

“The inference from the outcome (Eʲ,yʲ) computed using the sampling distribution of [the mixed experiment] E⁰ is appropriately identified with an inference from outcome yʲ based on the sampling distribution of Eʲ, which is clearly false.” (p. 310)

This sounds to me like a direct rejection of the conditionality principle, so I do not understand the point. (A formal rendering in Section 5 using the logic formalism of A’s and Not-A’s reinforces my feeling that the conditionality principle is the one criticised and misunderstood.) If Mayo’s frequentist stance leads her to take the sampling distribution into account at all times, this is fine within her framework. But I do not see how this argument contributes to invalidate Birnbaum’s proof. The following and last sentence of the argument may bring some light on the reason why Mayo considers it does:

“The sampling distribution to arrive at Ev(E⁰,(j,yʲ)) would be the convex combination averaged over the two ways that yʲ could have occurred. This differs from the sampling distributions of both Ev(E¹,y¹) and Ev(E²,y²).” (p. 310)

Indeed, and rather obviously, the sampling distribution of the evidence Ev(E*,z*) will differ depending on the experiment. But this is not what is stated by the likelihood principle, which is that the inference itself should be the same for y¹ and y², not the distribution of this inference. This confusion between the inference and its assessment is reproduced in the “Explicit Counterexample” section, where p-values are computed and found to differ for various conditional versions of a mixed experiment. Again, not a reason for invalidating the likelihood principle. So, in the end, I remain fully unconvinced by this demonstration that Birnbaum was wrong. (Though, as a bystander, I agree that frequentist inference can be built conditional on ancillary statistics.)
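The distinction at stake can be made concrete with the classic binomial versus negative binomial pair (the standard example in discussions of the LP, e.g. in Berger and Wolpert; the code itself is only an illustrative sketch, not from any of the texts discussed). The two likelihoods are proportional in θ, so the LP asserts the inferences should be identical, while the frequentist p-values, computed from the two different sampling distributions, differ:

```python
from math import comb

# E1: X ~ Binomial(12, theta), observe y1 = 3 successes.
# E2: Negative binomial, count trials N until the 3rd success, observe y2 = 12.

def binom_pmf(x, n, th):
    return comb(n, x) * th**x * (1 - th)**(n - x)

def negbinom_pmf(n, r, th):
    # probability that the r-th success occurs on trial n
    return comb(n - 1, r - 1) * th**r * (1 - th)**(n - r)

# Proportional likelihoods: f1(y1|theta) = c * f2(y2|theta) for every theta,
# with c = C(12,3)/C(11,2) = 4, free of theta.
ratios = [binom_pmf(3, 12, th) / negbinom_pmf(12, 3, th)
          for th in (0.1, 0.3, 0.5, 0.9)]

# p-values for H0: theta = 1/2 against theta < 1/2 under each experiment:
p1 = sum(binom_pmf(x, 12, 0.5) for x in range(0, 4))       # P(X <= 3), ~0.0730
p2 = sum(negbinom_pmf(n, 3, 0.5) for n in range(12, 200))  # P(N >= 12), ~0.0327

print(ratios)              # all equal to 4: the LP antecedent holds
print(p1, p2)              # yet the two evidential reports differ
print((p1 + p2) / 2)       # naive 1/2-1/2 mixture average of the two p-values
```

The last line is only a sketch of the "convex combination" idea quoted above: the averaged value differs from both conditional p-values, which is exactly the LP violation that frequentist measures exhibit and that the rest of the discussion turns on.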

29 Responses to “That the likelihood principle does not hold…”

1. […] has started a series of informal seminars at the LSE on the philosophy of errors in statistics and the likelihood principle. and has also posted a long comment on my argument about only using wrong models. (The title is […]

2. I am finally getting time to check up on blog discussions that I was unable to even read, let alone comment on, in the past.
I need to correct a serious misimpression, unless it is just a very different use of language. Robert wrote:
xi’an Says:
October 16, 2011 at 5:38 pm

Just an addendum on “the distribution of the inference”: as read today in Error and Inference (page 310), Deborah Mayo uses the sentence “This differs from the sampling distributions of both InfrE’ (y‘*) and InfrE” (y”*). Which seems to indicate she also considers inference has a distribution.

The sampling distribution is the distribution of the (test) statistic. A frequentist has no inference without consideration of the sampling distribution of a (relevant) statistic. That is because without it there are no error probabilities. The sampling distributions I am referring to are the ones associated with the inferences that would be the outputs of these two observed outcomes from the two distinct experiments, E’ and E”, respectively.

4. […] reminds me of earlier results, with the related drawback that this optimality is incompatible with the sufficiency principle.) […]

5. Christian Hennig Says:

Could it help to define properly what kind of thing Ev is? The equation proving the LP in Xi’An’s initial posting seems to me to assume that Ev is a well-defined mathematical entity, but I’m not sure whether it is.
As far as I can see, Birnbaum doesn’t define it in his original 1962 paper, but rather states that its precise meaning has to be clarified (if I haven’t missed anything, later he gives examples but doesn’t define it generally and formally).
Mayo’s point seems to be that, due to the imprecise nature of Ev, it does not have precisely the same meaning in all the equations involved, in which case Birnbaum’s argument and the equations given here do not make a valid mathematical proof.
The problem is nicely illustrated by the discussion between Spanos and Xi’An, which to me indicates that they don’t agree about what precisely Ev is. This is not surprising because it hasn’t been properly defined.
So if you want agreement, what about defining it unambiguously? Deciding the issue in Birnbaum’s favour from my point of view would imply that all the steps of the proof could be spelled out replacing Ev by its precise definition as a mathematical object. Of course Ev can be defined in such a way that the equations hold (ignoring interpretational issues it could for example be a constant), but I wonder whether it can be done in such a way that the argument grants the interpretational meaning that defenders of the LP usually assume (which Mayo denies).

• Thanks Christian: this is a very good point and an issue that has always bothered me. Ev appears like a black box, hence can be defined in diverse ways and even perverted to have the LP hold no matter what or not hold no matter what. For instance, if one includes all estimators AND their frequentist distribution, no hope for the LP to hold. Of course, from a Bayesian perspective, representing the evidence Ev by the equivalent posterior distribution is unambiguous but outside the Bayesian realm it seems harder to agree on an unambiguous modus vivendi.

• Aris Spanos Says:

That is exactly why in my first comment I wrote:
“Hence, the SLP assertion that the mixed experiment provides the same evidence about the unknown parameter as the model that actually gave rise to the data, is clearly false when evidence about the unknown parameter is evaluated in terms of the relevant error probabilities.”

• @Christian Hennig: I got hung up on this point too. I tried two approaches to attack the issue. First I imagined writing R functions to serve as tentative Ev functions. Right away I got hung up on R’s weak typing. When I considered a strongly typed language, or equivalently, vigorous type-checking in the R function, or yet still equivalently, the domain of Ev, I noticed that the equation for the weak conditionality principle uses (j, x) as the second argument on one side and x on the other. My Haskell-fu is not up to figuring out whether the info about the type of the second argument can legitimately be packed into the first argument of Ev.

My second approach was to think about what a generic p-value function would require. In addition to a description of the experiment (including the sampling distribution under the null hypothesis) and the observed outcome, it needs a statistic mapping outcomes to a set with a total ordering corresponding to some notion of “more extreme” relative to the null. So it takes three arguments, not two. Since Ev as given in Birnbaum’s proof doesn’t consider evidence functions that have a third argument that could take different values depending on whether (j, x) is observed or T(j, x) is observed, it doesn’t seem to rule out such functions.
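The commenter's three-argument point can be sketched in Python (rather than R or Haskell; every name below is my own illustrative choice, not drawn from Birnbaum or any of the referenced texts): a generic p-value evaluator needs the null sampling distribution, the outcome, and the "more extreme" statistic, whereas the Ev of the proof is written with only the first two slots.

```python
from math import comb

def p_value(null_pmf, outcomes, statistic, observed):
    """Generic p-value: total null probability of the outcomes at least
    as 'extreme' (i.e. with statistic value >= that of the observed one)."""
    t_obs = statistic(observed)
    return sum(null_pmf(z) for z in outcomes if statistic(z) >= t_obs)

# Toy experiment: X ~ Binomial(10, 1/2) under the null.
null_pmf = lambda x: comb(10, x) * 0.5**10
outcomes = range(11)

# Two different third arguments give two different evidential reports
# for the same experiment and the same observed outcome x = 8:
p_right = p_value(null_pmf, outcomes, lambda x: x, 8)           # P(X >= 8)
p_two   = p_value(null_pmf, outcomes, lambda x: abs(x - 5), 8)  # P(|X-5| >= 3)
print(p_right, p_two)
```

The point of the sketch is only that the evidential output depends on the `statistic` slot, which the two-argument Ev(E, z) notation of the proof leaves implicit.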

@Spanos: In your second comment, it was you who mentioned mean squared error first, not Xian. You also don’t appear to be using the term in the same sense as statisticians use it. Given that you and Xian don’t seem to be construing the jargon the same way, perhaps some discussion of explicit definitions is in order before you begin leveling accusations of intellectually dishonest argumentation.

• Birnbaum is clear that Ev can be ANY inference or decision from data. A p-value is an example he gives. That is why his “proof” flounders. Fixed and optional stopping, just for one example, yield different p-values, and no one denies this. More generally, no one denies frequentist conflicts of the (strong) LP. The Birnbaum argument is based on starting from a LP violation. That is all one needs. The invalidity or unsoundness then follows along the lines I argue. I will soon be returning to “blogging the LP” on my errorstatistics blog.

6. The point is that the “proof” first requires you to imagine being in a Birnbaum experiment that “erases” which experiment the data came from, and then the second step stipulates, at the same time, that you are NOT to evaluate evidence this way. So the premises are contradictory. Not to mention that one can imagine lots of different Birnbaum mixtures that could have given rise to the data: shall we average over all of them? Why would we want to obliterate evaluating data from the experiment known to produce it in order to average over a variety of experiments not performed? That is what step 1 of the argument requires. But I allow this. Only then the second step contradicts it and says: condition on the experiment performed! Erase and do not erase! The reasoning is fallacious; by burying what is going on, it has been missed. It could go through ONLY if you ignore the sampling distribution in the Birnbaum mixture. But that is to ASSUME the strong LP!

7. Aris Spanos Says:

I just saw the above exchange and some of xi’an’s comments mystified me. In particular, the claim:
“But this is not what is stated by the likelihood principle, which is that the inference itself should be the same for y¹ and y². Not the distribution of this inference.”

To begin with, there is no such thing as “the distribution of the inference”, but I assume he meant to say the relevant sampling distribution underlying the inference. As is well known, there is no frequentist inference without sampling distributions; the two are inextricably bound up.
In particular, confidence intervals and frequentist tests cannot even be defined without error probabilities that stem from the relevant sampling distributions. Moreover, different error probabilities give rise to different inferences. In particular, different type I and II error probabilities, as well as different p-values are likely to give rise to totally different inferences!
Hence, the SLP assertion that the mixed experiment provides the same evidence about the unknown parameter in question as the model that actually gave rise to the data, is clearly false when evidence about the unknown parameter is evaluated in terms of the relevant error probabilities. That is, the warranted inferences based on y¹ and y² cannot be the same when the relevant error probabilities are different.

• Thanks for the comment. Unless we put different meanings to “inference”, I maintain there is a frequentist distribution of the inference: when your estimator of the mean is $(1-(p-2)/||x||^2)x$ say [to take the example of the James-Stein estimator], it is a random variable with a distribution. (Which is why we distinguish estimates from estimators in our courses and textbooks.) Now, this distribution is a projection of the sampling distribution and therefore depends on this distribution. That it does not agree with the SLP is [definitely] not a proof that the “SLP assertion (…) is clearly false”, just a signal of disagreement between the SLP and frequentist constructs: Nothing wrong with that, just two different approaches to statistical inference…

• Aris Spanos Says:

Invoking the James-Stein estimator and the Mean Square Error (MSE) criterion seems like a desperate attempt to cling to your indefensible position by confusing the issue! In my comment I mentioned inference-related sampling distributions and the relevant error probabilities, and neither element appears in your reply. Indeed, you ignored the traditional example used in these discussions by Berger and Wolpert (1988), Mayo (2010), etc., and instead used the James-Stein estimator, knowing that it is a bad example to illustrate the SLP anyway.
Leaving that aside, I hope you are not suggesting that the inference in frequentist estimation is that the unknown parameter is equal to the estimate. Are you? This is not an inference a frequentist would make, without qualifying the uncertainty pertaining to such an inference using a proper error probability. Moreover, the MSE, as used in the James-Stein estimator case, is a decision-theoretic/Bayesian criterion and does not evaluate anything interesting from the frequentist perspective; it’s neither a proper error probability nor an error a frequentist would care about. In estimation a frequentist does not care what happens to the MSE for all possible values of the unknown parameter [the decision-theoretic quantifier]. What is relevant from the frequentist perspective pertains to any errors relative to the true value of the unknown parameter, whatever that happens to be.

• Please keep the debate at a rational and scientific level. Otherwise, there is no point in debating. First and last warning.

To follow from my previous reply, I used the James-Stein estimator as an arbitrary example of an estimator. The MLE and the p-value would have been other examples. They all have (frequentist) distributions as transforms of the data. Now, the James-Stein estimator or the p-value have long-term error performances. This is what I call frequentist statistics. To state that the mean square error is a Bayesian criterion contradicts the definition of the MSE, namely the average of the error over the observation space. This is the average error over an infinite repetition of observing the data under the same parameter, exactly as the type I error is the average frequency of false rejections. If you reject decision theory and the use of a loss function to compare statistical procedures, as it sounds from the above comment, how do you pick your estimation and testing procedures from the infinity of such procedures?
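The "long-term error performance" invoked here can be checked directly by Monte Carlo. A minimal sketch (the dimension, simulation size, seed, and the choice θ = 0 are my own illustrative settings) estimates the frequentist MSE, i.e. the average squared error over repeated observations at a fixed parameter, for both the MLE x and the James-Stein estimator:

```python
import random

random.seed(1)
p, nsim = 5, 20000            # dimension >= 3, number of repeated samples
theta = [0.0] * p             # true mean held fixed across repetitions

mse_mle = mse_js = 0.0
for _ in range(nsim):
    x = [random.gauss(t, 1.0) for t in theta]          # x ~ N(theta, I_p)
    norm2 = sum(xi * xi for xi in x)
    shrink = 1.0 - (p - 2) / norm2                     # James-Stein factor
    js = [shrink * xi for xi in x]
    mse_mle += sum((xi - t) ** 2 for xi, t in zip(x, theta)) / nsim
    mse_js += sum((ji - t) ** 2 for ji, t in zip(js, theta)) / nsim

print(mse_mle, mse_js)   # frequentist risks: ~p for the MLE, smaller for JS
```

Both numbers are averages over the observation space at a fixed θ, which is precisely the frequentist reading of the MSE defended in the reply above; at θ = 0 the James-Stein risk is well below the MLE's constant risk p.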

• Aris Spanos Says:

Indeed, confusing the difference between an error probability of a test statistic evaluated under the null [a prespecified value of the unknown parameter] and the mean of a square loss function associated with an estimator and evaluated over all possible values of the unknown parameter, I think, calls for an end to the discussion. Thanks for the exchange!

• I do not think we had even started a discussion… Anyway, your final barb simply shows you are missing my argument that p-values should be evaluated for all possible values of the unknown parameter rather than under the null. In Hwang et al. (Annals of Statistics, 1992), we ran this frequentist evaluation under squared error loss and showed that p-values are inadmissible for two-sided hypotheses.

• Just an addendum on “the distribution of the inference”: as read today in Error and Inference (page 310), Deborah Mayo uses the sentence “This differs from the sampling distributions of both InfrE’ (y‘*) and InfrE” (y”*)”, which seems to indicate she also considers that inference has a distribution.

8. Please do not automatically accept what the “higher authorities” say; we all know this has been touted as a “breakthrough” for 50 years, so it is no surprise to find endorsements. Frequentist theory, we know, violates the (strong) LP. Right? You have observed an outcome x from experiment E’ that could be used in an LP violation (i.e., the antecedent of the strong LP is assumed: x has a likelihood proportional to that of y from another experiment, E”, not performed); but because of the difference in sampling distributions, the evidential appraisal of x differs from that of y, for a frequentist. For example, the p-value associated with x from E’ might be p’, and the p-value associated with outcome y from E” might be p”, with p’ unequal to p”. That is the given for an LP violation; no one disagrees on that.

Call y an “LP pair” for x (it is really an “LP violation” pair).

Then Birnbaum suggests you consider that x could have come from flipping a fair coin to decide whether to do the experiment you actually performed, E’, or another one, E”, that could have produced the “LP pair” y. (You don’t know what this could be, of course, until after you observe x, but let us grant all this.) Further, you are to agree that, had you conducted this imaginary mixture (I call it a Birnbaum experiment E-BB), you would report the same thing whether x occurred or y occurred. Outcome x from E-BB is evidentially equivalent to outcome y from E-BB. That is part of the given definition of E-BB, and I grant all that, weird as it is.
We’re playing the E-BB game.
How is the evidence from x to be reported? It would be to average p’ and p”. The evidential assessment of x is (p’ + p”)/2.
All of that is “step 1” of Birnbaum’s argument.

But if you are to evaluate outcome x which came from experiment E’ as if it actually came from the funny mixture defined in E-BB, then you cannot AT THE SAME TIME say that you should not evaluate the evidential import of x as if it came from E-BB. And yet, that is precisely what is demanded in “step 2” of the argument.

So the argument, reconstructed as deductively valid, is unsound (its premises contradict each other). Alternatively, one can formulate it so that its premises are true, but then it is no longer valid.
For the details, see E & I. I realize I’m writing this very quickly (that is because I promised myself I would not respond, and here I am responding).

By the way, Birnbaum rejected the strong LP. So maybe he was just showing that for those who accept the LP, the LP follows.

• Thank you very much for rewriting your arguments for the Og readers. I hope it generates more discussion.

At this stage, I fear we could be camping on our own positions for quite a while. So I will try to find time and energy to write a small note explaining more carefully why I [respectfully] disagree.

As to “automatically accept what the ‘higher authorities’ say”, pardon my French for being imprecise in the previous reply. I simply meant that I cross-checked with others that my understanding was correct and that I was not missing an issue in my post. Sorry for the use of “higher authorities”, whose lost irony turned it into a poor argument from authority…

9. I don’t know why I find it so blinding to read this blog—I mean it’s beautiful, but I think it is the black. Or maybe it’s that you go too fast over what I’ve argued in detail, and dismiss too quickly some of my points. Please note, as discussed in the Cox and Mayo article in E & I, that there’s a big difference between a mathematical identity, and a claim that a given result OUGHT to be construed as evidentially equal to such and such. I don’t reject WCP, as you suppose. I only reject saying BOTH that you ought to condition AND you ought not to, for purposes of interpreting a given result. The premises are contradictory. Anyway, there’s a little dialogue (an imaginary one between me and Birnbaum I guess) coming out in the RMM volume any day now that rehearses the flawed argument. If it is still unclear after that, I will say more.

• Because of the small characters on my screen, I first read “binding” instead of “blinding”! Not the same at all, of course, and a wee bit on the harsh side… Well, the hit-and-run nature of blog posts means that ‘going too fast’ is inherent to them.

In the case of the current post, I actually checked with higher authorities [on the LP] that I was not missing the point. (And I did think carefully about it.) From both of your comments, it seems to me the dissension is at the linguistic level rather than the mathematical or even logical level, in that “evidence” is such a vague word that the different actors in the LP=SP+WCP debate attach different meanings to it. The discussion I had in Zürich on Tuesday was illuminating in this respect.

10. If you think more carefully about the argument, you will see I am correct! It is the focus of a much more detailed presentation, but even the short one in E & I should suffice. An even shorter one is in the upcoming RMM volume as an appendix to my conversation with D.R. Cox. It should be up any day. Bayesians forget that sufficiency, for a frequentist, still requires computing the evidential appraisal relative to the sampling distribution! The series of equivalences fails. (It doesn’t matter which version of the argument you use, either.) And even without alluding to the details here (why spoil the fun for your readers?), the key point, that the “proof” is actually circular, should have been obvious from the fact that the biconditional is shown to hold (between the SLP and (WCP and S)).
