Archive for Neyman-Pearson

reading classics (#9)

Posted in Books, Statistics, University life with tags , , , , , , , , on February 24, 2013 by xi'an

In today’s classics seminar, my student Bassoum Abou presented the 1981 paper written by Charles Stein for the Annals of Statistics, Estimating the mean of a normal distribution, recapitulating the advances he made on Stein estimators, minimaxity and his unbiased estimator of risk. Unfortunately; this student missed a lot about paper and did not introduce the necessary background…So I am unsure at how much the class got from this great paper… Here are his slides (watch out for typos!)

 Historically, this paper is important as this is one of the very few papers published by Charles Stein in a major statistics journal, the other publications being made in conference proceedings. It contains the derivation of the unbiased estimator of the loss, along with comparisons with posterior expected loss.

reading classics (#8)

Posted in Books, Statistics, University life with tags , , , , , , , , on February 1, 2013 by xi'an

In today’s classics seminar, my student Dong Wei presented the historical paper by Neyman and Pearson on efficient  tests: “On the problem of the most efficient tests of statistical hypotheses”, published in the Philosophical Transactions of the Royal Society, Series A. She had a very hard time with the paper… It is not an easy paper, to be sure, and it gets into convoluted and murky waters when it comes to the case of composite hypotheses testing. Once again, it would have been nice to broaden the view on testing by including some of the references given in Dong Wei’s slides:

Listening to this talk, while having neglected to read the original paper for many years (!), I was reflecting on the way tests, Type I & II, and critical regions were introduced, without leaving any space for a critical (!!) analysis of the pertinence of those concepts. This is an interesting paper also because it shows the limitations of such a notion of efficiency. Apart from the simplest cases, it is indeed close to impossible to achieve this efficiency because there is no most powerful procedure (without restricting the range of those procedures). I also noticed from the slides that Neyman and Pearson did not seem to use a Lagrange multiplier to achieve the optimal critical region. (Dong Wei also inverted the comparison of the sufficient and insufficient statistics for the test on the variance, as the one based on the sufficient statistic is more powerful.) In any case, I think I will not keep the paper in my list for next year, maybe replacing it with the Karlin-Rubin (1956) UMP paper…

Error and Inference [arXived]

Posted in Books, Statistics, University life with tags , , , , , , , on November 29, 2011 by xi'an

Following my never-ending series of posts on the book Error and Inference, (edited) by Deborah Mayo and Ari Spanos (and kindly sent to me by Deborah), I decided to edit those posts into a (slightly) more coherent document, now posted on arXiv. And to submit it as a book review to Siam Review, even though I had not high expectations it fits the purpose of the journal: the review was rejected between the submission to arXiv and the publication of this post!

Error and Inference [#5]

Posted in Books, Statistics, University life with tags , , , , , , , , , , , , on September 28, 2011 by xi'an

(This is the fifth post on Error and Inference, as previously being a raw and naïve reaction following a linear and slow reading of the book, rather than a deeper and more informed criticism.)

‘Frequentist methods achieve an objective connection to hypotheses about the data-generating process by being constrained and calibrated by the method’s error probabilities in relation to these models .”—D. Cox and D. Mayo, p.277, Error and Inference, 2010

The second part of the seventh chapter of Error and Inference, is David Cox’s and Deborah Mayo’s “Objectivity and conditionality in frequentist inference“. (Part of the section is available on Google books.) The purpose is clear and the chapter quite readable from a statistician’s perspective. I however find it difficult to quantify objectivity by first conditioning on “a statistical model postulated to have generated data”, as again this assumes the existence of a “true” probability model where “probabilities (…) are equal or close to  the actual relative frequencies”. As earlier stressed by Andrew:

“I don’t think it’s helpful to speak of “objective priors.” As a scientist, I try to be objective as much as possible, but I think the objectivity comes in the principle, not the prior itself. A prior distribution–any statistical model–reflects information, and the appropriate objective procedure will depend on what information you have.”

The paper opposes the likelihood, Bayesian, and frequentist methods, reproducing what Gigerenzer called the “superego, the ego, and the id” in his paper on statistical significance. Cox and Mayo stress from the start that the frequentist approach is (more) objective because it is based on the sampling distribution of the test. My primary problem with this thesis is that the “hypothetical long run” (p.282) does not hold in realistic settings. Even in the event of a reproduction of similar or identical tests, a sequential procedure exploiting everything that has been observed so far is more efficient than the mere replication of the same procedure solely based on the current observation.

Virtually all (…) models are to some extent provisional, which is precisely what is expected in the building up of knowledge.”—D. Cox and D. Mayo, p.283, Error and Inference, 2010

The above quote is something I completely agree with, being another phrasing of George Box’s “all models are wrong”, but this transience of working models is a good reason in my opinion to account for the possibility of alternative working models from the start of the statistical analysis. Hence for an inclusion of those models in the statistical analysis equally from the start. Which leads almost inevitably to a Bayesian formulation of the testing problem.

‘Perhaps the confusion [over the role of sufficient statistics] stems in part because the various inference schools accept the broad, but not the detailed, implications of sufficiency.”—D. Cox and D. Mayo, p.286, Error and Inference, 2010

The discussion over the sufficiency principle is interesting, as always. The authors propose to solve the confusion between the sufficiency principle and the frequentist approach by assuming that inference “is relative to the particular experiment, the type of inference, and the overall statistical approach” (p.287). This creates a barrier between sampling distributions that avoids the binomial versus negative binomial paradox always stressed in the Bayesian literature. But the solution is somehow tautological: by conditioning on the sampling distribution, it avoids the difficulties linked with several sampling distributions all producing the same likelihood. After my recent work on ABC model choice, I am however less excited about the sufficiency principle as the existence of [non-trivial] sufficient statistics is quite the rare event. Especially across models. The section (pp. 288-289) is also revealing about the above “objectivity” of the frequentist approach in that the derivation of a test taking large value away from the null with a well-known distribution under the null is not an automated process, esp. when nuisance parameters cannot be escaped from (pp. 291-294). Achieving separation from nuisance parameters, i.e. finding statistics that can be conditioned upon to eliminate those nuisance parameters, does not seem feasible outside well-formalised models related with exponential families. Even in such formalised models, a (clear?) element of arbitrariness is involved in the construction of the separations, which implies that the objectivity is under clear threat. The chapter recognises this limitation in Section 9.2 (pp.293-294), however it argues that separation is much more common in the asymptotic sense and opposes the approach to the Bayesian averaging over the nuisance parameters, which “may be vitiated by faulty priors” (p.294). I am not convinced by the argument, given that the (approximate) condition approach amount to replace the unknown nuisance parameter by an estimator, without accounting for the variability of this estimator. Averaging brings the right (in a consistency sense) penalty.

A compelling section is the one about the weak conditionality principle (pp. 294-298), as it objects to the usual statement that a frequency approach breaks this principle. In a mixture experiment about the same parameter θ, inferences made conditional on the experiment  “are appropriately drawn in terms of the sampling behavior in the experiment known to have been performed” (p. 296). This seems hardly objectionable, as stated. And I must confess the sin of stating the opposite as The Bayesian Choice has this remark (Example 1.3.7, p.18) that the classical confidence interval averages over the experiments… Mea culpa! The term experiment validates the above conditioning in that several experiments could be used to measure θ, each with a different p-value. I will not argue with this. I could however argue about “conditioning is warranted to achieve objective frequentist goals” (p. 298) in that the choice of the conditioning, among other things, weakens the objectivity of the analysis. In a sense the above pirouette out of the conditioning principle paradox suffers from the same weakness, namely that when two distributions characterise the same data (the mixture and the conditional distributions), there is a choice to be made between “good” and “bad”. Nonetheless, an approach based on the mixture remains frequentist if non-optimal… (The chapter later attacks the derivation of the likelihood principle, I will come back to it in a later post.)

‘Many seem to regard reference Bayesian theory to be a resting point until satisfactory subjective or informative priors are available. It is hard to see how this gives strong support to the reference prior research program.”—D. Cox and D. Mayo, p.302, Error and Inference, 2010

A section also worth commenting is (unsurprisingly!) the one addressing the limitations of the Bayesian alternatives (pp. 298–302). It however dismisses right away the personalistic approach to priors by (predictably if hastily) considering it fails the objectivity canons. This seems a wee quick to me, as the choice of a prior is (a) the choice of a reference probability measure against which to assess the information brought by the data, not clearly less objective than picking one frequentist estimator or another, and (b) a personal construction of the prior can also be defended on objective grounds, based on the past experience of the modeler. That it varies from one modeler to the next is not an indication of subjectivity per se, simply of different past experiences. Cox and Mayo then focus on reference priors, à la Bernardo-Berger, once again pointing out the lack of uniqueness of those priors as a major flaw. While the sub-chapter agrees on the understanding of those priors as convention or reference priors, aiming at maximising the input from the data, it gets stuck on the impropriety of such priors: “if priors are not probabilities, what then is the interpretation of a posterior?” (p.299). This seems like a strange comment to me:  the interpretation of a posterior is that it is a probability distribution and this is the only mathematical constraint one has to impose on a prior. (Which may be a problem in the derivation of reference priors.) As detailed in The Bayesian Choice among other books, there are many compelling reasons to invite improper priors into the game. (And one not to, namely the difficulty with point null hypotheses.) While I agree that the fact that some reference priors (like matching priors, whose discussion p. 302 escapes me) have good frequentist properties is not compelling within a Bayesian framework, it seems a good enough answer to the more general criticism about the lack of objectivity: in that sense, frequency-validated reference priors are part of the huge package of frequentist procedures and cannot be dismissed on the basis of being Bayesian. That reference priors are possibly at odd with the likelihood principle does not matter very much:  the shape of the sampling distribution is part of the prior information, not of the likelihood per se. The final argument (Section 12) that Bayesian model choice requires the preliminary derivation of “the possible departures that might arise” (p.302) has been made at several points in Error and Inference. Besides being in my opinion a valid working principle, i.e. selecting the most appropriate albeit false model, this definition of well-defined alternatives is mimicked by the assumption of “statistics whose distribution does not depend on the model assumption” (p. 302) found in the same last paragraph.

In conclusion this (sub-)chapter by David Cox and Deborah Mayo is (as could be expected!) a deep and thorough treatment of the frequentist approach to the sufficiency and (weak) conditionality principle. It however fails to convince me that there exists a “unique and unambiguous” frequentist approach to all but the most simple problems. At least, from reading this chapter, I cannot find a working principle that would lead me to this single unambiguous frequentist procedure.

Error and Inference [#4]

Posted in Books, Statistics with tags , , , , , , , , , , , , , , on September 21, 2011 by xi'an

(This is the fourth post on Error and Inference, again and again yet being a raw and naïve reaction following a linear and slow reading of the book, rather than a deeper and more informed criticism.)

‘The defining feature of an inductive inference is that the premises (evidence statements) can be true while the conclusion inferred may be false without a logical contradiction: the conclusion is “evidence transcending”.”—D. Mayo and D. Cox, p.249, Error and Inference, 2010

The seventh chapter of Error and Inference, entitled “New perspectives on (some old) problems of frequentist statistics“, is divided in four parts, written by David Cox, Deborah Mayo and Aris Spanos, in different orders and groups of authors. This is certainly the most statistical of all chapters, not a surprise when considering that David Cox is involved, and I thus have difficulties to explain why it took me so long to read through it…. Overall, this chapter is quite important by its contribution to the debate on the nature of statistical testing.

‘The advantage in the modern statistical framework is that the probabilities arise from defining a probability model to represent the phenomenon of interest. Had Popper made use of the statistical testing ideas being developed at around the same time, he might have been able to substantiate his account of falsification.”—D. Mayo and D. Cox, p.251, Error and Inference, 2010

The first part of the chapter is Mayo’s and Cox’ “Frequentist statistics as a theory of inductive inference“. It was first published in the 2006 Erich Lehmann symposium. And available on line as an arXiv paper. There is absolutely no attempt there to link of clash with the Bayesian approach, this paper is only looking at frequentist statistical theory as the basis for inductive inference. The debate therein about deducing that H is correct from a dataset successfully facing a statistical test is classical (in both senses) but I [unsurprisingly] remain unconvinced by the arguments. The null hypothesis remains the calibrating distribution throughout the chapter, with very little (or at least not enough) consideration of what happens when the null hypothesis does not hold.  Section 3.6 about confidence intervals being another facet of testing hypotheses is representative of this perspective. The p-value is defended as the central tool for conducting hypothesis assessment. (In this version of the paper, some p’s are written in roman characters and others in italics, which is a wee confusing until one realises that this is a mere typo!)  The fundamental imbalance problem, namely that, in contiguous hypotheses, a test cannot be expected both to most often reject the null when it is [very moderately] false and to most often accept the null when it is right is not discussed there. The argument about substantive nulls in Section 3.5 considers a stylised case of well-separated scientific theories, however the real world of models is more similar to a greyish  (and more Popperian?) continuum of possibles. In connection with this, I would have thought more likely that the book would address on philosophical grounds Box’s aphorism that “all models are wrong”. Indeed, one (philosophical?) difficulty with the p-values and the frequentist evidence principle (FEV) is that they rely on the strong belief that one given model can be exact or true (while criticising the subjectivity of the prior modelling in the Bayesian approach). Even in the typology of types of null hypotheses drawn by the authors in Section 3, the “possibility of model misspecification” is addressed in terms of the low power of an omnibus test, while agreeing that “an incomplete probability specification” is unavoidable (an argument found at several place in the book that the alternative cannot be completely specified).

‘Sometimes we can find evidence for H0, understood as an assertion that a particular discrepancy, flaw, or error is absent, and we can do this by means of tests that, with high probability, would have reported a discrepancy had one been present.”—D. Mayo and D. Cox, p.255, Error and Inference, 2010

The above quote relates to the Failure and Confirmation section where the authors try to push the argument in favour of frequentist tests one step further, namely that that “moderate p-values” may sometimes be used as confirmation of the null. (I may have misunderstood, the end of the section defending a purely frequentist, as in repeated experiments, interpretation. This reproduces an earlier argument about the nature of probability in Section 1.2, as characterising the “stability of relative frequencies of results of repeated trials”) In fact, this chapter and other recent readings made me think afresh about the nature of probability, a debate that put me off so much in Keynes (1921) and even in Jeffreys (1939). From a mathematical perspective, there is only one “kind” of probability, the one defined via a reference measure and a probability, whether it applies to observations or to parameters. From a philosophical perspective, there is a natural issue about the “truth” or “realism” of the probability quantities and of the probabilistic statements. The book and in particular the chapter consider that a truthful probability statement is the one agreeing with “a hypothetical long-run of repeated sampling, an error probability”, while the statistical inference school of Keynes (1921), Jeffreys (1939), and Carnap (1962) “involves quantifying a degree of support or confirmation in claims or hypotheses”, which makes this (Bayesian) sound as less realistic… Obviously, I have no ambition to solve this long-going debate, however I see no reason in the first approach to be more realistic by being grounded on stable relative frequencies à la von Mises. If nothing else, the notion that a test should be evaluated on its long run performances is very idealistic as the concept relies on an ever-repeating, an infinite sequence of identical trials. Relying on probability measures as self-coherent mathematical measures of uncertainty carries (for me) as much (or as less) reality as the above infinite experiment. Now, the paper is not completely entrenched in this interpretation, when it concludes that “what makes the kind of hypothetical reasoning relevant to the case at hand is not the long-run low error rates associated with using the tool (or test) in this manner; it is rather what those error rates reveal about the data generating source or phenomenon” (p.273).

‘If the data are so extensive that accordance with the null hypothesis implies the absence of an effect of practical importance, and a reasonably high p-value is achieved, then it may be taken as evidence of the absence of an effect of practical importance.”—D. Mayo and D. Cox, p.263, Error and Inference, 2010

The paper mentions several times conclusions to be drawn from a p-value near one, as in the above quote. This is an interpretation that does not sit well with my understanding of p-values being distributed as uniforms under the null: very high  p-values should be as suspicious as very low p-values. (This criticism is not new, of course.) Unless one does not strictly adhere to the null model, which brings back the above issue of the approximativeness of any model… I also found fascinating to read the criticism that “power appertains to a prespecified rejection region, not to the specific data under analysis” as I thought this equally applied to the p-values, turning “the specific data under analysis” into a departure event of a prespecified kind.

(Given the unreasonable length of the above, I fear I will continue my snailpaced reading in yet another post!)

Follow

Get every new post delivered to your Inbox.

Join 342 other followers