Archive for Error-Statistical philosophy

Error and Inference [#4]

Posted in Books, Statistics on September 21, 2011 by xi'an

(This is the fourth post on Error and Inference, yet again a raw and naïve reaction following a slow, linear reading of the book rather than a deeper and more informed criticism.)

“The defining feature of an inductive inference is that the premises (evidence statements) can be true while the conclusion inferred may be false without a logical contradiction: the conclusion is ‘evidence transcending’.”—D. Mayo and D. Cox, p.249, Error and Inference, 2010

The seventh chapter of Error and Inference, entitled “New perspectives on (some old) problems of frequentist statistics”, is divided into four parts, written by David Cox, Deborah Mayo and Aris Spanos, in different orders and groups of authors. This is certainly the most statistical of all chapters, no surprise given that David Cox is involved, and I thus find it difficult to explain why it took me so long to read through it… Overall, this chapter is quite important for its contribution to the debate on the nature of statistical testing.

“The advantage in the modern statistical framework is that the probabilities arise from defining a probability model to represent the phenomenon of interest. Had Popper made use of the statistical testing ideas being developed at around the same time, he might have been able to substantiate his account of falsification.”—D. Mayo and D. Cox, p.251, Error and Inference, 2010

The first part of the chapter is Mayo’s and Cox’s “Frequentist statistics as a theory of inductive inference”. It was first published in the 2006 Erich Lehmann symposium and is available online as an arXiv paper. There is absolutely no attempt there to link or clash with the Bayesian approach: this paper only looks at frequentist statistical theory as the basis for inductive inference. The debate therein about deducing that H is correct from a dataset successfully facing a statistical test is classical (in both senses) but I [unsurprisingly] remain unconvinced by the arguments. The null hypothesis remains the calibrating distribution throughout the chapter, with very little (or at least not enough) consideration of what happens when the null hypothesis does not hold. Section 3.6 about confidence intervals being another facet of testing hypotheses is representative of this perspective. The p-value is defended as the central tool for conducting hypothesis assessment. (In this version of the paper, some p’s are written in roman characters and others in italics, which is a wee bit confusing until one realises that this is a mere typo!) The fundamental imbalance problem, namely that, for contiguous hypotheses, a test cannot be expected both to most often reject the null when it is [very moderately] false and to most often accept the null when it is right, is not discussed there. The argument about substantive nulls in Section 3.5 considers a stylised case of well-separated scientific theories; however, the real world of models is more similar to a greyish (and more Popperian?) continuum of possibles. In connection with this, I would have expected the book to address on philosophical grounds Box’s aphorism that “all models are wrong”. Indeed, one (philosophical?) difficulty with the p-values and the frequentist evidence principle (FEV) is that they rely on the strong belief that one given model can be exact or true (while criticising the subjectivity of the prior modelling in the Bayesian approach). Even in the typology of types of null hypotheses drawn by the authors in Section 3, the “possibility of model misspecification” is addressed in terms of the low power of an omnibus test, while agreeing that “an incomplete probability specification” is unavoidable (an argument found at several places in the book, namely that the alternative cannot be completely specified).
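The imbalance problem mentioned above can be made concrete with a small sketch. It assumes a setting that is not taken from the chapter: a one-sided z-test of H0: mu = 0 against mu > 0 with known sigma, with the function name and the numerical values chosen purely for illustration. For alternatives close to the null, the probability of rejecting stays close to the level alpha, so the test cannot both mostly accept a true null and mostly reject a slightly false one.

```python
from statistics import NormalDist


def power_one_sided_z(mu, n, sigma=1.0, alpha=0.05):
    """Probability of rejecting H0: mu = 0 (against mu > 0) when the true mean is mu.

    Illustrative assumption: i.i.d. normal observations with known sigma.
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1.0 - alpha)
    # The test rejects when sqrt(n) * xbar / sigma > z_alpha; under a true
    # mean mu this statistic is N(sqrt(n) * mu / sigma, 1).
    return 1.0 - std_normal.cdf(z_alpha - (n ** 0.5) * mu / sigma)


if __name__ == "__main__":
    n = 100
    for mu in (0.0, 0.01, 0.05, 0.1, 0.3):
        print(f"mu = {mu:4.2f}  P(reject H0) = {power_one_sided_z(mu, n):.3f}")
```

In this toy setting the rejection probability at mu = 0.01 is barely above the 5% level, while it only becomes large once the alternative is well separated from the null.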

“Sometimes we can find evidence for H0, understood as an assertion that a particular discrepancy, flaw, or error is absent, and we can do this by means of tests that, with high probability, would have reported a discrepancy had one been present.”—D. Mayo and D. Cox, p.255, Error and Inference, 2010

The above quote relates to the Failure and Confirmation section where the authors try to push the argument in favour of frequentist tests one step further, namely that “moderate p-values” may sometimes be used as confirmation of the null. (I may have misunderstood, as the end of the section defends a purely frequentist, as in repeated experiments, interpretation. This reproduces an earlier argument about the nature of probability in Section 1.2, as characterising the “stability of relative frequencies of results of repeated trials”.) In fact, this chapter and other recent readings made me think afresh about the nature of probability, a debate that put me off so much in Keynes (1921) and even in Jeffreys (1939). From a mathematical perspective, there is only one “kind” of probability, the one defined via a reference measure and a probability, whether it applies to observations or to parameters. From a philosophical perspective, there is a natural issue about the “truth” or “realism” of the probability quantities and of the probabilistic statements. The book and in particular the chapter consider that a truthful probability statement is the one agreeing with “a hypothetical long-run of repeated sampling, an error probability”, while the statistical inference school of Keynes (1921), Jeffreys (1939), and Carnap (1962) “involves quantifying a degree of support or confirmation in claims or hypotheses”, which makes this (Bayesian) school sound less realistic… Obviously, I have no ambition to solve this long-running debate; however, I see no reason for the first approach to be more realistic by being grounded on stable relative frequencies à la von Mises. If nothing else, the notion that a test should be evaluated on its long-run performance is very idealistic, as the concept relies on an ever-repeating, infinite sequence of identical trials. Relying on probability measures as self-coherent mathematical measures of uncertainty carries (for me) as much (or as little) reality as the above infinite experiment. Now, the paper is not completely entrenched in this interpretation, as it concludes that “what makes the kind of hypothetical reasoning relevant to the case at hand is not the long-run low error rates associated with using the tool (or test) in this manner; it is rather what those error rates reveal about the data generating source or phenomenon” (p.273).

“If the data are so extensive that accordance with the null hypothesis implies the absence of an effect of practical importance, and a reasonably high p-value is achieved, then it may be taken as evidence of the absence of an effect of practical importance.”—D. Mayo and D. Cox, p.263, Error and Inference, 2010

The paper mentions several times conclusions to be drawn from a p-value near one, as in the above quote. This is an interpretation that does not sit well with my understanding of p-values being uniformly distributed under the null: very high p-values should be as suspicious as very low p-values. (This criticism is not new, of course.) Unless one does not strictly adhere to the null model, which brings back the above issue of the approximate nature of any model… I also found it fascinating to read the criticism that “power appertains to a prespecified rejection region, not to the specific data under analysis”, as I thought this equally applied to the p-values, turning “the specific data under analysis” into a departure event of a prespecified kind.
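The uniformity of p-values under the null is easy to check by simulation. The sketch below rests on assumptions of my own choosing rather than anything in the chapter: normal data with known unit variance, a one-sided z-test of a zero mean, and arbitrary sample sizes and seed. Under this exactly specified null, p-values above 0.95 turn up about as often as p-values below 0.05.

```python
import random
from statistics import NormalDist


def simulate_null_p_values(n_obs=20, n_rep=20000, seed=1):
    """One-sided z-test p-values for data generated under H0: mu = 0, sigma = 1."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    p_values = []
    for _ in range(n_rep):
        xbar = sum(rng.gauss(0.0, 1.0) for _ in range(n_obs)) / n_obs
        z = xbar * n_obs ** 0.5  # standardised sample mean, N(0, 1) under H0
        p_values.append(1.0 - std_normal.cdf(z))
    return p_values


def tail_fractions(p_values, cut=0.05):
    """Fractions of p-values below `cut` and above 1 - `cut`."""
    n = len(p_values)
    low = sum(p < cut for p in p_values) / n
    high = sum(p > 1.0 - cut for p in p_values) / n
    return low, high


if __name__ == "__main__":
    low, high = tail_fractions(simulate_null_p_values())
    print(f"fraction below 0.05: {low:.3f}, fraction above 0.95: {high:.3f}")
```

Both fractions should come out near 5%, which is why a p-value near one is no more typical of the exact null than a p-value near zero.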

(Given the unreasonable length of the above, I fear I will continue my snail-paced reading in yet another post!)

Error and Inference [#3]

Posted in Books, Statistics, University life on September 14, 2011 by xi'an

(This is the third post on Error and Inference, yet again being a raw and naïve reaction to a linear reading rather than a deeper and more informed criticism.)

“Statistical knowledge is independent of high-level theories.”—A. Spanos, p.242, Error and Inference, 2010

The sixth chapter of Error and Inference is written by Aris Spanos and deals with the issues of testing in econometrics. It provides on the one hand a fairly interesting entry into the history of economics and the resistance to data-backed theories, primarily because the buffers between data and theory are manifold (“huge gap between economic theories and the available observational data”, p.203). On the other hand, what I fail to understand in the chapter is the meaning of theory, as it seems very distinct from what I would call a (statistical) model. The sentence “statistical knowledge, stemming from a statistically adequate model allows data to ‘have a voice of its own’ (…) separate from the theory in question and its succeeds in securing the frequentist goal of objectivity in theory testing” (p.206) is puzzling in this respect. (Actually, I would have liked to see a clear meaning put to this “voice of its own”, as it otherwise sounds mostly like a catchy phrase…) Similarly, Spanos distinguishes between three types of models: primary/theoretical; experimental/structural (“the structural model contains a theory’s substantive subject matter information in light of the available data”, p.213); and data/statistical (“the statistical model is built exclusively using the information contained in the data”, p.213). I have trouble understanding how testing can distinguish between those types of models: as a naïve reader, I would have thought that only the statistical model could be tested by a statistical procedure, even though I would not call the above a proper definition of a statistical model (esp. since Spanos writes a few lines below that the statistical model “would embed (nest) the structural model in its context” (p.213)). The normal example followed on pages 213-217 does not help [me] to make sense of this distinction: it simply illustrates the impact of failing some of the defining assumptions (normality, time homogeneity [in mean and variance], independence). (As an aside, the discussion about the poor estimation of the correlation on pp.214-215 does not help, because it involves a second variable Y that is not defined for this example.) It would be nice of course if the “noise” in a statistical/econometric model could be studied in complete separation from the structure of this model; however, they seem too irremediably intermingled to allow this partition of roles. I thus do not see how the “statistically adequate model is independent from the substantive information” (p.217), i.e. by which rigorous process one can isolate the “chance” parts of the data to build and validate a statistical model per se. The simultaneous equation model (SEM, pp.230-231) is more illuminating of the distinction set by Spanos between structural and statistical models/parameters, even though the difference in this case boils down to a question of identifiability.

Error and Inference [#2]

Posted in Books, Statistics, University life on September 8, 2011 by xi'an

(This is the second post on Error and Inference, again being a raw and naïve reaction to a linear reading rather than a deeper and more informed criticism.)

“Allan Franklin once gave a seminar under the title `Ad Hoc is not a four letter word.'”—J. Worrall, p.130, Error and Inference, 2010

The fourth chapter of Error and Inference, written by John Worrall, covers the highly interesting issue of “using the data twice”. The point has been debated several times on Andrew’s blog and this is one of the main criticisms raised against Aitkin’s posterior/integrated likelihood. Worrall’s perspective is both related and unrelated to this purely statistical issue, when he considers that “you can’t use the same fact twice, once in the construction of a theory and then again in its support” (p.129). (He even signed a “UN Charter”, where UN stands for “use novelty”!) After reading both Worrall’s and Mayo’s viewpoints, the latter being that all that matters is severe testing as it encompasses the UN perspective (if I understood correctly), I am afraid I am none the wiser, but this led me to reflect on the statistical issue.

Error and Inference [#1]

Posted in Books, Statistics, University life on September 1, 2011 by xi'an

“The philosophy of science offer valuable tools for understanding and advancing solutions to the problems of evidence and inference in practice”—D. Mayo & A. Spanos, p.xiv, Error and Inference, 2010

Deborah Mayo kindly sent me her latest book, whose subtitle is “Recent exchanges on experimental reasoning, reliability, and the objectivity and rationality of Science” and whose contributors are P. Achinstein, A. Chalmers, D. Cox, C. Glymour, L. Laudan, A. Musgrave, and J. Worrall, plus both editors, Deborah Mayo and Aris Spanos. Deborah Mayo rightly inferred that this debate was bound to appeal to my worries about the nature of testing and model choice and to my layman interest in the philosophy of Science. Speaking of which [layman], the book reads really well, even though I am missing references. And even though it cannot be read under my cherry tree (esp. now that the weather has moved from été to étaumne… as I heard this morning on the national public radio). Deborah Mayo is clearly the driving force in putting this volume together, from setting up the ERROR 06 conference to commenting on the chapters of all contributors (but her own and Aris Spanos’). Her strongly frequentist perspective on the issues of testing and model choice is thus reflected in the overall tone of the volume, even though contributors bring some contradiction to the debate. (Disclaimer: I found the comics below on Zoltan Dienes’s webpage. I however have no information nor opinion [yet] about the contents of the corresponding book.)

“However, scientists wish to resist relativistic, fuzzy, or post-modern turns (…) Notably, the Popperian requirement that our theories are testable and falsifiable is widely regarded to contain important insights about responsible science and objectivity.”—D. Mayo & A. Spanos, p.2, Error and Inference, 2010

Given the philosophical, complex, and interesting nature of the work, I will split my comments into several linear posts (hence the #1), as I did for Evidence and Evolution. The following comments are thus about a linear (even pedestrian) and incomplete read through the first three chapters. Those comments do not pretend to any depth, but simply reflect the handwritten notes and counterarguments I scribbled as I was reading through… A complete book review was published in the Notre-Dame Philosophical Reviews. (Though, can you trust a review considering Sartre as a major philosopher?! At least, he appears as a counterpart to Bertrand Russell in the frontispiece of the review.) As illustrated by the above quote (whose first part I obviously endorse), the overall perspective in the book is Popperian, despite Popper’s criticism of statistical inference as a whole and of Bayesian statistics in particular (although Andrew would disagree). Another fundamental concept throughout the book is the “Error-Statistical philosophy”, of which Deborah Mayo is the proponent. One of the tenets of this philosophy is a reliance on statistical significance tests in the Fisher-Neyman-Pearson (or frequentist) tradition, along with a severity principle (“We want hypotheses that will allow for stringent testing so that if they pass we have evidence of a genuine experimental effect”, p.19) stated as (p.22)

A hypothesis H passes a severe test T with data x if

  1. x agrees with H, and
  2. with very high probability, test T would have produced a result that accords less well with H than does x, if H were false or incorrect.

(The p-value is advanced as a direct accomplishment of this goal, but I fail to see why it does or why a Bayes factor would not. Indeed, the criterion depends on the definition of probability when H is false or incorrect. This relates to Mayo’s criticism of the Bayesian approach, as explained below.)
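To see what the second condition amounts to numerically, here is a hedged sketch for a toy case that is not taken from these chapters: a normal sample with known sigma and the claim H: mu > mu1. Under one common reading in the error-statistical literature, the severity of that claim is the probability of observing a sample mean no larger than the one actually observed if mu were only equal to mu1. The function name and the numbers are illustrative assumptions.

```python
from statistics import NormalDist


def severity_mu_greater(xbar, mu1, n, sigma=1.0):
    """Severity of the claim 'mu > mu1' given an observed sample mean xbar.

    Assumed reading: P(Xbar <= xbar; mu = mu1), i.e. the probability of a
    result according less well with the claim than xbar does, evaluated at
    the boundary value mu1 where the claim just fails.
    """
    standard_error = sigma / n ** 0.5
    return NormalDist().cdf((xbar - mu1) / standard_error)


if __name__ == "__main__":
    xbar, n = 0.4, 25  # standard error is 0.2
    for mu1 in (0.0, 0.2, 0.4, 0.6):
        sev = severity_mu_greater(xbar, mu1, n)
        print(f"claim mu > {mu1:.1f}: severity = {sev:.3f}")
```

With these numbers the claim mu > 0 passes with high severity, while mu > 0.6 does not. Note that the calculation requires a fully specified distribution at the boundary value mu1, which is where the dependence on "the definition of probability when H is false" enters.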

“Formal error-statistical tests provide tools to ensure that errors will be correctly detected with high probabilities”—D. Mayo, p.33, Error and Inference, 2010

In Chapter 1, Deborah Mayo has a direct go at the Bayesian approach. Her main criticism of the Bayesian approach to testing (defined through the posterior probability of the hypothesis, rather than through the predictive) is about the catchall hypothesis, a somewhat desultory term replacing the alternative hypothesis. According to Deborah Mayo, this alternative should “include all possible rivals, including those not even thought of” (p.37). This sounds like a weak argument, although it was also used by Alan Templeton in his rebuttal of ABC, given that (a) it should also apply in the frequentist sense, in order to define the probability distribution “when H is false or incorrect” (see, e.g., “probability of so good an agreement (between H and x) calculated under the assumption that H is false”, p.40); (b) a well-defined alternative should be available as testing a hypothesis is very rarely the end of the story: if H is rejected, there should/will be a contingency plan; (c) rejecting or accepting a hypothesis H in terms of the sole null hypothesis H does not make sense from operational as well as from game-theoretic perspectives. The further argument that the posterior probability of H is a direct function of the prior probability of H does not stand against the Bayes factor. (The same applies to the criticism that the Bayesian approach does not accommodate newcomers, i.e., new alternatives.) Stating that “one cannot vouch for the reliability of [this Bayesian] procedure—that it would rarely affirm theory T were T false” (p.37) completely ignores the wealth of results about the consistency of the Bayes factor (since the “asymptotic long run”, p.20, matters in the Error-Statistical philosophy). The final argument that Bayesians rank “theories that fit the data equally well (i.e., have identical likelihoods)” (p.38) does not account for (or dismisses, p.50, referring to Jeffreys and Berger instead of Jefferys and Berger) the fact that Bayes factors are automated Occam’s razors in that the averaging of the likelihoods over spaces of different dimensions is a natural advocate of simpler models. Even though I plan to discuss this point in a second post, Deborah Mayo also seems to imply that Bayesians are using the data twice (this is how I interpret the insistence on “same”, p.50), which is a sin that [genuine] Bayesian analysis can hardly be found guilty of!
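Both the Occam's razor effect and the consistency of the Bayes factor can be seen in a textbook toy case, sketched below under assumptions of my own (none of this comes from the book): a normal sample with known sigma, a point null H0: mu = 0, and an alternative H1 with prior mu ~ N(0, tau^2). The Bayes factor then reduces to a ratio of two normal densities for the sample mean. The function name, tau = 1 and the sample sizes are hypothetical.

```python
from math import exp, sqrt


def bf01_normal_mean(xbar, n, sigma=1.0, tau=1.0):
    """Bayes factor of H0: mu = 0 against H1: mu ~ N(0, tau^2), given xbar.

    Marginally, xbar ~ N(0, sigma^2 / n) under H0 and
    xbar ~ N(0, tau^2 + sigma^2 / n) under H1 (illustrative conjugate setup).
    """
    v0 = sigma ** 2 / n
    v1 = tau ** 2 + v0
    # Ratio of the N(0, v0) density to the N(0, v1) density at xbar.
    return sqrt(v1 / v0) * exp(-0.5 * xbar ** 2 * (1.0 / v0 - 1.0 / v1))


if __name__ == "__main__":
    for n in (10, 100, 1000):
        # xbar = 0: both models reach the same maximised likelihood (at mu = 0).
        print(f"n = {n:5d}  BF01(xbar = 0.0) = {bf01_normal_mean(0.0, n):10.3f}"
              f"  BF01(xbar = 0.5) = {bf01_normal_mean(0.5, n):.3e}")
```

When the sample mean is exactly zero, the two models fit equally well at their best, yet the Bayes factor favours the simpler null by a factor sqrt(1 + n tau^2 / sigma^2). When the sample mean stays at 0.5, the Bayes factor in favour of the null collapses as n grows, in line with the consistency results mentioned above.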

As pointed out by Adam La Caze in Notre-Dame Philosophical Reviews:

An exchange on Bayesian philosophy of science or Bayesian statistics would have been a welcome addition and would have benefited the dual goals of the volume. Bayesian philosophy of science and Bayesian statistics are a constant foil to Mayo’s work, but neither approach is given much of a voice. An exchange on Bayesian philosophy of science is made all the more relevant by the strength of Mayo’s challenge to a Bayesian account of theory appraisal. A virtue of the error-statistical account is its ability to capture the kind of detailed arguments that scientists make about data and the methods they employ to arrive at reliable inferences. Mayo clearly thinks that Bayesians are unable to supplement their view with any sort of prospective account of such methods. This seems contrary to practice where scientists make similar methodological arguments whether they utilise frequentist or Bayesian approaches to statistical inference. Indeed, Bayesian approaches to study design and statistical inference play a significant (and increasing) role in many sciences, often alongside frequentist approaches (clinical drug development provides a prominent example). It would have been interesting to see what, if any, common ground could be reached on these approaches to the philosophy of science (even if very little common ground seems possible in terms of their competing approach to statistical inference).