## Deborah Mayo’s talk in Montréal (JSM 2013)

Posted in Books, Statistics, Uncategorized on July 31, 2013 by xi'an

As posted on her blog, Deborah Mayo is giving a lecture at JSM 2013 in Montréal about why Birnbaum’s derivation of the Strong Likelihood Principle (SLP) is wrong. Or, more accurately, why “WCP entails SLP” fails. It would have been a great opportunity to hear Deborah present her case, and I am sorry to miss it. (Although not sorry to be in the beautiful Dolomites at that time.) Here are the slides:

Deborah’s argument is the same as previously: there is no reason for the inference in the mixed (or Birnbaumized) experiment to be equal to the inference in the conditional experiment. As previously, I do not get it: the weak conditionality principle (WCP) implies that inference from the mixture output, once we know which component was used (hence rejecting the “and we don’t know which” on slide 8), should depend on that component only. I also fail to understand why either WCP or the Birnbaum experiment refers to a mixture (slide 13), in that the index of the experiment is assumed to be known, contrary to mixtures. Thus (still referring to slide 13), the presentation of Birnbaum’s experiment is erroneous. It is indeed impossible to force the outcome to be y* on tails and x* on heads, but it is possible to choose the experiment index at random, 1 versus 2, and then, if y* is observed, to report (E1,x*) as a sufficient statistic. (Incidentally, there is a typo on slide 15: it should be “likewise for x*”.)

## Birnbaum’s proof missing one bar?!

Posted in Statistics on March 4, 2013 by xi'an

Michael Evans just posted a new paper on arXiv yesterday about Birnbaum’s proof of his likelihood principle theorem. There has recently been a lot of activity around this theorem (some of which was reported on the ‘Og!) and the flurry of proofs, disproofs, arguments, counterarguments, and counter-counterarguments, mostly by major figures in the field, is rather overwhelming! This paper is however highly readable, as it sets everything in terms of set theory and relations. While I am not completely convinced that the conclusion holds, the steps in the paper seem correct. The starting point is that the likelihood relation, L, the invariance relation, G, and the sufficiency relation, S, are all equivalence relations (on the set of inference bases/parametric families). The conditionality relation, C, however, fails to be transitive and hence to be an equivalence relation. Furthermore, the smallest equivalence relation containing the conditionality relation is the likelihood relation. Evans then proves that the conjunction of the sufficiency and conditionality relations is strictly included in the likelihood relation, which is the smallest equivalence relation containing their union. And the fact that the smallest equivalence relation containing the conditionality relation alone is already the likelihood relation means that sufficiency is irrelevant (in this sense, and in this sense only!).
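To see why transitivity is the crux, here is a toy sketch (my own, not Evans’ actual construction, with “inference bases” reduced to plain labels) of how the smallest equivalence relation containing a non-transitive relation repairs the missing links:

```python
from itertools import product

def closure(pairs, elements):
    """Smallest equivalence relation containing `pairs`:
    add symmetry and reflexivity, then iterate transitivity to a fixed point."""
    rel = set(pairs)
    rel |= {(b, a) for a, b in rel}        # symmetry
    rel |= {(a, a) for a in elements}      # reflexivity
    changed = True
    while changed:                         # transitive closure
        changed = False
        for (a, b), (c, d) in product(list(rel), repeat=2):
            if b == c and (a, d) not in rel:
                rel.add((a, d))
                changed = True
    return rel

# a toy conditionality-like relation C: symmetric and reflexive,
# with 1~2 and 2~3 holding but NOT 1~3, so C is not transitive
E = {1, 2, 3}
C = {(1, 2), (2, 1), (2, 3), (3, 2), (1, 1), (2, 2), (3, 3)}
assert (1, 3) not in C                     # C fails transitivity
Cbar = closure(C, E)
assert (1, 3) in Cbar                      # the closure adds the missing pair
```

The point of the toy example is only that the closure operation can be strictly larger than the relation it starts from, which is where the gap between C and its equivalence-relation hull (the likelihood relation, in Evans’ result) comes from.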

This is a highly interesting and well-written document. I just do not know what to make of it in relation to my understanding of the likelihood principle. That

$\overline{S \cup C} = L$

rather than

$S \cup C =L$

makes a difference from a mathematical point of view; however, I cannot relate it to the statistical interpretation. For instance, why would we have to insist upon equivalence? Why does invariance appear in some lemmas? Why is a maximal ancillary statistic relevant at this stage when it does not appear in the original proof of Birnbaum (1962)? Why is there no mention of the weak versus strong conditionality principles?

Posted in Statistics on January 28, 2013 by xi'an

Last Monday, my student Li Chenlu presented the foundational 1962 JASA paper by Allan Birnbaum, On the Foundations of Statistical Inference. The very paper that derives the Likelihood Principle from the combination of the Conditionality and Sufficiency principles and that had been discussed [maybe ad nauseam] on this ‘Og!!! Alas, thrice alas!, I was still stuck in the plane flying back from Atlanta as she was presenting her understanding of the paper, the flight having been delayed four hours thanks to (or rather woe to!) the weather conditions in Paris the day before (chain reaction…).

I am sorry I could not attend this lecture, and this for many reasons: first and foremost, I wanted to attend every talk by my students, both out of respect for them and to draw a comparison between their performances. My PhD student Sofia ran the seminar that day in my stead, for which I am quite grateful, but I do wish I had been there… Second, this a.s. was the most philosophical paper in the series, and I would have appreciated shedding the proper light on the reasons for and the consequences of this paper, as Li Chenlu stuck very closely to the paper itself. (She provided additional references in the conclusion but they did not seem to impact the slides.) Discussing for instance Berger’s and Wolpert’s (1988) new light on the topic, as well as Deborah Mayo‘s (2010) attacks, and even Chang‘s (2012) misunderstandings, would have clearly helped the students.

## That the likelihood principle does not hold…

Posted in Statistics, University life on October 6, 2011 by xi'an

Coming to Section III of Chapter Seven of Error and Inference, written by Deborah Mayo, I discovered that she considers that the likelihood principle does not hold (at least as a logical consequence of the combination of the sufficiency and conditionality principles), thus that Allan Birnbaum was wrong… As well as the dozens of people working on the likelihood principle after him! Including Jim Berger and Robert Wolpert [whose book sells for \$214 on amazon!, I hope the authors get a hefty chunk of that rip-off!!! Esp. when it is available for free on Project Euclid…] I had not heard of (nor seen) this argument previously, even though it has apparently created a bit of a stir around the likelihood principle page on Wikipedia. The result does not seem to be published anywhere but in the book, and I doubt it would get past a review process in a statistics journal. [Judging from a serious conversation in Zürich this morning, I may however be wrong!]

The core of Birnbaum’s proof is relatively simple: given two experiments E¹ and E² about the same parameter θ, with different sampling distributions f¹ and f², such that there exists a pair of outcomes (y¹,y²) from those experiments with proportional likelihoods, i.e. as a function of θ

$f^1(y^1|\theta) = c f^2(y^2|\theta),$

one considers the mixture experiment E⁰ where E¹ and E² are each chosen with probability ½. Then it is possible to build a sufficient statistic T that is equal to the data (j,x), except when j=2 and x=y², in which case T(j,x)=(1,y¹). This statistic is sufficient since the distribution of (j,x) given T(j,x) is either a Dirac mass or a distribution on {(1,y¹),(2,y²)} that only depends on c, and thus does not depend on the parameter θ. According to the weak conditionality principle, statistical evidence, meaning the whole range of inferences possible on θ and denoted by Ev(E,z), should satisfy

$Ev(E^0, (j,x)) = Ev(E^j,x)$

Because the sufficiency principle states that

$Ev(E^0, (j,x)) = Ev(E^0,T(j,x))$

this leads to the likelihood principle

$Ev(E^1,y^1)=Ev(E^0, (j,y^j)) = Ev(E^2,y^2)$

(See, e.g., The Bayesian Choice, pp. 18-29.) Now, Mayo argues this is wrong because
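The sufficiency of T can be checked on a concrete pair of experiments. The binomial/negative-binomial pair below, with 3 successes out of 12 Bernoulli trials, is my own choice of illustration, not taken from the proof itself:

```python
from math import comb

# E1: Binomial(12, θ), observed y1 = 3 successes.
# E2: Negative binomial, 3rd success occurring on trial 12 (y2).
def f1(theta): return comb(12, 3) * theta**3 * (1 - theta)**9
def f2(theta): return comb(11, 2) * theta**3 * (1 - theta)**9

# the likelihoods are proportional: f1 = c * f2 with c = C(12,3)/C(11,2) = 4
c = comb(12, 3) / comb(11, 2)

# In the mixture E0, T maps both (1,y1) and (2,y2) to (1,y1).
# Conditional probability that j = 1 given T = (1,y1):
def p_j1_given_T(theta):
    return (0.5 * f1(theta)) / (0.5 * f1(theta) + 0.5 * f2(theta))

# the θ-dependent factors cancel: the probability is c/(c+1) for every θ
for theta in (0.1, 0.37, 0.9):
    assert abs(p_j1_given_T(theta) - c / (c + 1)) < 1e-12
```

The cancellation of the θ³(1-θ)⁹ factors in the ratio is exactly what makes the conditional distribution of (j,x) given T free of θ, hence T sufficient.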

“The inference from the outcome (Eʲ,yʲ) computed using the sampling distribution of [the mixed experiment] E⁰ is appropriately identified with an inference from outcome yʲ based on the sampling distribution of Eʲ, which is clearly false.” (p.310)

This sounds to me like a direct rejection of the conditionality principle, so I do not understand the point. (A formal rendering in Section 5, using the logical formalism of A’s and Not-A’s, reinforces my feeling that the conditionality principle is the one criticised and misunderstood.) If Mayo’s frequentist stance leads her to take the sampling distribution into account at all times, this is fine within her framework. But I do not see how this argument contributes to invalidating Birnbaum’s proof. The final sentence of the argument may shed some light on why Mayo considers it does:

“The sampling distribution to arrive at Ev(E⁰,(j,yʲ)) would be the convex combination averaged over the two ways that yʲ could have occurred. This differs from the sampling distributions of both Ev(E¹,y¹) and Ev(E²,y²).” (p.310)

Indeed, and rather obviously, the sampling distribution of the evidence Ev(E*,z*) will differ depending on the experiment. But this is not what is stated by the likelihood principle, which is that the inference itself should be the same for E¹ and E², not that the distribution of this inference should be. This confusion between an inference and its assessment is reproduced in the “Explicit Counterexample” section, where p-values are computed and found to differ for various conditional versions of a mixed experiment. Again, this is not a reason for invalidating the likelihood principle. So, in the end, I remain fully unconvinced by this demonstration that Birnbaum was wrong. (Albeit in a bystander’s agreement with the fact that frequentist inference can be built conditional on ancillary statistics.)
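The flavour of such p-value discrepancies can be reproduced with the classic stopping-rule example (my own choice of numbers, not the chapter’s): the same record of 9 heads and 3 tails yields different p-values for θ₀ = ½ under the two sampling distributions, even though the likelihoods are proportional:

```python
from math import comb

# Same data -- 9 heads, 3 tails -- under two stopping rules, testing θ0 = 1/2.
# Binomial: n = 12 tosses fixed, X = number of heads; p-value = P(X >= 9).
p_binom = sum(comb(12, k) for k in range(9, 13)) / 2**12

# Negative binomial: toss until the 3rd tail, X = number of heads observed;
# P(X = k) = C(k+2, 2) (1/2)^(k+3), so p-value = P(X >= 9) = 1 - P(X <= 8).
p_negbin = 1 - sum(comb(k + 2, 2) * 0.5**(k + 3) for k in range(9))

print(round(p_binom, 4), round(p_negbin, 4))   # → 0.073 0.0327
```

Both assessments condition on the same likelihood θ⁹(1-θ)³ up to a constant, yet one rejects at the 5% level and the other does not: this is a statement about the distribution of the assessment across repetitions, not about the inference on θ itself, which is the distinction made above.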

## Error and Inference [#5]

Posted in Books, Statistics, University life on September 28, 2011 by xi'an

(This is the fifth post on Error and Inference, again a raw and naïve reaction following a linear and slow reading of the book, rather than a deeper and more informed criticism.)

“Frequentist methods achieve an objective connection to hypotheses about the data-generating process by being constrained and calibrated by the method’s error probabilities in relation to these models.”—D. Cox and D. Mayo, p.277, Error and Inference, 2010

The second part of the seventh chapter of Error and Inference is David Cox’s and Deborah Mayo’s “Objectivity and conditionality in frequentist inference“. (Part of the section is available on Google Books.) The purpose is clear and the chapter quite readable from a statistician’s perspective. I however find it difficult to quantify objectivity by first conditioning on “a statistical model postulated to have generated data”, as this again assumes the existence of a “true” probability model where “probabilities (…) are equal or close to the actual relative frequencies”. As earlier stressed by Andrew:

“I don’t think it’s helpful to speak of “objective priors.” As a scientist, I try to be objective as much as possible, but I think the objectivity comes in the principle, not the prior itself. A prior distribution–any statistical model–reflects information, and the appropriate objective procedure will depend on what information you have.”

The paper opposes the likelihood, Bayesian, and frequentist methods, reproducing what Gigerenzer called the “superego, the ego, and the id” in his paper on statistical significance. Cox and Mayo stress from the start that the frequentist approach is (more) objective because it is based on the sampling distribution of the test. My primary problem with this thesis is that the “hypothetical long run” (p.282) does not hold in realistic settings. Even in the event of a reproduction of similar or identical tests, a sequential procedure exploiting everything that has been observed so far is more efficient than the mere replication of the same procedure solely based on the current observation.

“Virtually all (…) models are to some extent provisional, which is precisely what is expected in the building up of knowledge.”—D. Cox and D. Mayo, p.283, Error and Inference, 2010

The above quote is something I completely agree with, being another phrasing of George Box’s “all models are wrong”, but this transience of working models is, in my opinion, a good reason to account for the possibility of alternative working models from the very start of the statistical analysis, that is, to include those models in the analysis from the start. Which leads almost inevitably to a Bayesian formulation of the testing problem.

“Perhaps the confusion [over the role of sufficient statistics] stems in part because the various inference schools accept the broad, but not the detailed, implications of sufficiency.”—D. Cox and D. Mayo, p.286, Error and Inference, 2010

The discussion of the sufficiency principle is interesting, as always. The authors propose to solve the confusion between the sufficiency principle and the frequentist approach by assuming that inference “is relative to the particular experiment, the type of inference, and the overall statistical approach” (p.287). This creates a barrier between sampling distributions that avoids the binomial versus negative binomial paradox always stressed in the Bayesian literature. But the solution is somewhat tautological: by conditioning on the sampling distribution, it avoids the difficulties linked with several sampling distributions all producing the same likelihood. After my recent work on ABC model choice, I am however less excited about the sufficiency principle, as the existence of [non-trivial] sufficient statistics is quite a rare event, especially across models. The section (pp. 288-289) is also revealing about the above “objectivity” of the frequentist approach, in that the derivation of a test statistic taking large values away from the null, with a well-known distribution under the null, is not an automated process, esp. when nuisance parameters cannot be escaped (pp. 291-294). Achieving separation from nuisance parameters, i.e. finding statistics that can be conditioned upon to eliminate those nuisance parameters, does not seem feasible outside well-formalised models related to exponential families. Even in such formalised models, a (clear?) element of arbitrariness is involved in the construction of the separations, which puts the objectivity under clear threat. The chapter recognises this limitation in Section 9.2 (pp.293-294); however, it argues that separation is much more common in the asymptotic sense, and opposes the approach to Bayesian averaging over the nuisance parameters, which “may be vitiated by faulty priors” (p.294). I am not convinced by the argument, given that the (approximate) conditioning approach amounts to replacing the unknown nuisance parameter with an estimator, without accounting for the variability of this estimator. Averaging brings the right (in a consistency sense) penalty.
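On the Bayesian side, the binomial versus negative binomial paradox mentioned above dissolves because the two likelihoods only differ by a multiplicative constant, so the posteriors coincide for any common prior. A minimal numerical check (my own illustration, using 9 heads and 3 tails under a Beta prior):

```python
from math import comb

def posterior(lik_const, a=1, b=1):
    """Posterior density on a grid under a Beta(a,b) prior and a likelihood
    lik_const * θ^9 (1-θ)^3 -- the constant cancels in the normalisation."""
    grid = [i / 100 for i in range(1, 100)]
    unnorm = [lik_const * t**(a + 8) * (1 - t)**(b + 2) for t in grid]
    s = sum(unnorm)
    return [u / s for u in unnorm]

# binomial constant C(12,9) versus negative binomial constant C(11,9):
post_binom = posterior(comb(12, 9))
post_negbin = posterior(comb(11, 9))

# the two posteriors agree pointwise
assert all(abs(p - q) < 1e-12 for p, q in zip(post_binom, post_negbin))
```

This is the likelihood principle in action: conditioning on the data alone, the stopping rule is irrelevant, in sharp contrast with the p-value computations of the frequentist paradox.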

A compelling section is the one about the weak conditionality principle (pp. 294-298), as it objects to the usual statement that a frequentist approach breaks this principle. In a mixture experiment about the same parameter θ, inferences made conditional on the experiment “are appropriately drawn in terms of the sampling behavior in the experiment known to have been performed” (p. 296). This seems hardly objectionable, as stated. And I must confess the sin of stating the opposite, as The Bayesian Choice has this remark (Example 1.3.7, p.18) that the classical confidence interval averages over the experiments… Mea culpa! The term experiment validates the above conditioning, in that several experiments could be used to measure θ, each with a different p-value. I will not argue with this. I could however argue about “conditioning is warranted to achieve objective frequentist goals” (p. 298), in that the choice of the conditioning, among other things, weakens the objectivity of the analysis. In a sense, the above pirouette out of the conditionality principle paradox suffers from the same weakness, namely that, when two distributions characterise the same data (the mixture and the conditional distributions), a choice has to be made between “good” and “bad”. Nonetheless, an approach based on the mixture remains frequentist, albeit non-optimal… (The chapter later attacks the derivation of the likelihood principle; I will come back to it in a later post.)

“Many seem to regard reference Bayesian theory to be a resting point until satisfactory subjective or informative priors are available. It is hard to see how this gives strong support to the reference prior research program.”—D. Cox and D. Mayo, p.302, Error and Inference, 2010

A section also worth commenting on is (unsurprisingly!) the one addressing the limitations of the Bayesian alternatives (pp. 298–302). It however dismisses the personalistic approach to priors right away, by (predictably if hastily) considering that it fails the objectivity canons. This seems a wee bit quick to me, as the choice of a prior is (a) the choice of a reference probability measure against which to assess the information brought by the data, not clearly less objective than picking one frequentist estimator or another, and (b) a personal construction of the prior can also be defended on objective grounds, based on the past experience of the modeler. That it varies from one modeler to the next is not an indication of subjectivity per se, simply of different past experiences. Cox and Mayo then focus on reference priors, à la Bernardo-Berger, once again pointing out the lack of uniqueness of those priors as a major flaw. While the sub-chapter agrees on the understanding of those priors as convention or reference priors, aiming at maximising the input from the data, it gets stuck on the impropriety of such priors: “if priors are not probabilities, what then is the interpretation of a posterior?” (p.299). This seems like a strange comment to me: the interpretation of a posterior is that it is a probability distribution, and this is the only mathematical constraint one has to impose on a prior. (Which may be a problem in the derivation of reference priors.) As detailed in The Bayesian Choice among other books, there are many compelling reasons to invite improper priors into the game. (And one not to, namely the difficulty with point null hypotheses.) While I agree that the good frequentist properties of some reference priors (like matching priors, whose discussion on p.302 escapes me) are not compelling within a Bayesian framework, they seem a good enough answer to the more general criticism about the lack of objectivity: in that sense, frequency-validated reference priors are part of the huge package of frequentist procedures and cannot be dismissed on the basis of being Bayesian. That reference priors are possibly at odds with the likelihood principle does not matter very much: the shape of the sampling distribution is part of the prior information, not of the likelihood per se. The final argument (Section 12), that Bayesian model choice requires the preliminary derivation of “the possible departures that might arise” (p.302), has been made at several points in Error and Inference. Besides being in my opinion a valid working principle, i.e. selecting the most appropriate albeit false model, this definition of well-defined alternatives is mimicked by the assumption of “statistics whose distribution does not depend on the model assumption” (p. 302) found in the same last paragraph.
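As an aside on the impropriety issue, the standard illustration of why improper priors remain usable is that they may still produce perfectly proper posteriors. A minimal sketch (my own, the textbook case of a flat prior on a normal mean with known variance):

```python
from math import exp, pi, sqrt

# Flat (improper) prior dθ on a normal mean θ, known σ = 1:
# the posterior given x1,...,xn is the proper N(x̄, 1/n) density,
# even though the prior itself has infinite mass.
data = [1.2, 0.7, 2.1, 1.5]
n, xbar = len(data), sum(data) / len(data)

def post(theta):
    """Posterior density N(x̄, 1/n) obtained from the flat prior."""
    return sqrt(n / (2 * pi)) * exp(-0.5 * n * (theta - xbar)**2)

# numerical check that the posterior integrates to (essentially) one
step = 12 / 10000
grid = [xbar - 6 + step * i for i in range(10001)]
mass = sum(post(t) for t in grid) * step
assert abs(mass - 1.0) < 1e-6
```

The only mathematical requirement, as argued above, is that the posterior be a probability distribution; the prior’s lack of normalisation is no obstacle here, although it does become one for point null hypotheses.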

In conclusion, this (sub-)chapter by David Cox and Deborah Mayo is (as could be expected!) a deep and thorough treatment of the frequentist approach to the sufficiency and (weak) conditionality principles. It however fails to convince me that there exists a “unique and unambiguous” frequentist approach to all but the simplest problems. At least, from reading this chapter, I cannot extract a working principle that would lead me to this single unambiguous frequentist procedure.