This paper by Andrew Gelman and Christian Hennig calls for the abandonment of the terms objective and subjective in (not solely Bayesian) statistics. It argues that there is more than mere prior information and data to the construction of a statistical analysis. The paper is articulated as the authors’ proposal, followed by four application examples, then a survey of the philosophy of science perspectives on objectivity and subjectivity in statistics and other sciences, next a study of the subjective and objective aspects of the mainstream statistical streams, concluding with a discussion on the implementation of the proposed move.
Archive for Error-Statistical philosophy
“This is, in this revised version, an outstanding paper that covers the Jeffreys-Lindley paradox (JLP) in exceptional depth and that unravels the philosophical differences between different schools of inference with the help of the JLP. From the analysis of this paradox, the author convincingly elaborates the principles of Bayesian and severity-based inferences, and engages in a thorough review of the latter’s account of the JLP in Spanos (2013).” Anonymous
I have now received a second round of reviews of my paper, “On the Jeffreys-Lindleys paradox” (submitted to Philosophy of Science) and the reports are quite positive (or even extremely positive as in the above quote!). The requests for changes aim at clarifying points, improving the background coverage, and simplifying my heavy style (e.g., cutting Proustian sentences). These requests were easily addressed (hopefully to the satisfaction of the reviewers) and, thanks to the week in Warwick, I have already sent the paper back to the journal, with high hopes for acceptance. The new version has also been arXived. I must add that some parts of the reviews sounded much better than my original prose and I was almost tempted to include them in the final version. Take for instance
“As a result, the reader obtains not only a better insight into what is at stake in the JLP, going beyond the results of Spanos (2013) and Sprenger (2013), but also a much better understanding of the epistemic function and mechanics of statistical tests. This is a major achievement given the philosophical controversies that have haunted the topic for decades. Recent insights from Bayesian statistics are integrated into the article and make sure that it is mathematically up to date, but the technical and foundational aspects of the paper are well-balanced.” Anonymous
In connection with my series of posts on the book Error and Inference, and my recent collation of those into an arXiv document, Deborah Mayo has started a series of informal seminars at the LSE on the philosophy of errors in statistics and the likelihood principle, and has also posted a long comment on my argument about only using wrong models. (The title is inspired by the Rolling Stones’ “You can’t always get what you want”, very cool!) The discussion about the need or not to take into account all possible models (which is the meaning of the “catchall hypothesis” I had missed while reading the book) shows my point was not clear. I obviously do not claim in the review that all possible models should be accounted for at once; this was on the contrary my understanding of Mayo’s criticism of the Bayesian approach (I thought the following sentence was clear enough: “According to Mayo, this alternative hypothesis should “include all possible rivals, including those not even thought of” (p.37)”)! So I see the Bayesian approach as a way to put on the table a collection of reasonable (if all wrong) models and give to those models a posterior probability, with the purpose that improbable ones are eliminated. Therefore, I am in agreement with most of the comments in the post, esp. because this has little to do with Bayesian versus frequentist testing! Even rejecting the less likely models from a collection seems compatible with a Bayesian approach: model averaging is not always an appropriate solution, depending on the loss function!
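As a minimal sketch of what it means to put a collection of reasonable (if all wrong) models on the table and to weight them by posterior probability, here is a toy computation. The three fixed-probability coin models, the equal prior weights, and the data (9 successes in 12 trials) are illustrative assumptions of mine, not taken from the book or from the discussion.

```python
from math import comb

def posterior_model_probs(thetas, prior, successes, trials):
    """Posterior probabilities of a finite collection of binomial models."""
    # Likelihood of the observed count under each candidate success probability
    likes = [
        comb(trials, successes) * t**successes * (1 - t) ** (trials - successes)
        for t in thetas
    ]
    # Bayes' rule: prior weight times likelihood, normalised by the evidence
    joint = [p * l for p, l in zip(prior, likes)]
    evidence = sum(joint)
    return [j / evidence for j in joint]

# Hypothetical collection of models and data
thetas = [0.3, 0.5, 0.7]
prior = [1 / 3, 1 / 3, 1 / 3]
post = posterior_model_probs(thetas, prior, successes=9, trials=12)

for t, p in zip(thetas, post):
    print(f"theta = {t}: posterior probability {p:.3f}")
```

Whether the least probable model is then dropped or all three are averaged over depends on the loss function, which is the point made in the paragraph above.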
(This is my sixth and last post on Error and Inference, being as previously a raw and naïve reaction born from a linear and sluggish reading of the book, rather than a deeper and more informed criticism with philosophical bearings. Read at your own risk.)
“It is refreshing to see Cox and Mayo give a hard-nosed statement of what scientific objectivity demands of an account of statistics, show how it relates to frequentist statistics, and contrast that with the notion of “objectivity” used by O-Bayesians.”—A. Spanos, p.326, Error and Inference, 2010
In order to conclude my pedestrian traverse of Error and Inference, I read the discussion by Aris Spanos of the second part of the seventh chapter, by David Cox and Deborah Mayo, discussed in the previous post. (On the train to the half-marathon to be precise, which may have added a sharper edge to the way I read it!) The first point in the discussion is that the above paper is “a harmonious blend of the Fisherian and N-P perspectives to weave a coherent frequentist inductive reasoning anchored firmly on error probabilities” (p.316). The discussion by Spanos is very much uncritical of the paper, so I will not engage in a criticism of the non-criticism, but rather expose some thoughts of mine that came from reading this apology. (Remarks about Bayesian inference are limited to some piques like the above, which only reiterate those found earlier [and later: “the various examples Bayesians employ to make their case involve some kind of “rigging” of the statistical model”, Aris Spanos, p.325; “The Bayesian epistemology literature is filled with shadows and illusions”, Clark Glymour, p. 335] in the book.) [I must add I do like the mention of O-Bayesians, as I coined the O’Bayes motto for the objective Bayes bi-annual meetings from 2003 onwards! It also reminds me of the O-rings and of the lack of proper statistical decision-making in the Challenger tragedy…]
The “general frequentist principle for inductive reasoning” (p.319) at the core of Cox and Mayo’s paper is obviously the central role of the p-value in “providing (strong) evidence against the null H0 (for a discrepancy from H0)”. Once again, I fail to see it as the epitome of a working principle in that
- it depends on the choice of a divergence d(z), which reduces the information brought by the data z;
- it does not articulate the level for labeling nor the consequences of finding a low p-value;
- it ignores the role of the alternative hypothesis.
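To make the first criticism concrete, here is a small sketch showing that the same data z can produce quite different p-values depending on the statistic d(z) that is retained. The data set, the known unit variance, and the two statistics (standardised mean and number of positive observations) are assumptions chosen for illustration; they do not come from Cox and Mayo's paper.

```python
from math import comb, erfc, sqrt

def p_value_mean(data, sigma=1.0):
    """One-sided p-value for H0: mean = 0, with d(z) = sqrt(n) * mean / sigma."""
    n = len(data)
    d = sqrt(n) * (sum(data) / n) / sigma
    # Upper tail of the standard normal, P(N(0,1) >= d)
    return 0.5 * erfc(d / sqrt(2))

def p_value_sign(data):
    """One-sided p-value for H0: median = 0, with d(z) = number of positive values."""
    n = len(data)
    k = sum(1 for x in data if x > 0)
    # Upper tail of a Binomial(n, 1/2) distribution, P(K >= k)
    return sum(comb(n, j) for j in range(k, n + 1)) / 2**n

# Hypothetical sample, assumed to have unit variance
data = [0.8, -0.3, 1.2, 0.4, -0.1, 0.9, 0.2, -0.5, 1.5, 0.6]
p_mean = p_value_mean(data)
p_sign = p_value_sign(data)

print(f"p-value based on the mean:      {p_mean:.4f}")
print(f"p-value based on the sign test: {p_sign:.4f}")
```

The sign statistic discards the magnitudes of the observations, which is one way of seeing how d(z) reduces the information brought by z.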
Furthermore, Spanos’ discussion deals with “the fallacy of rejection” (pp.319-320) in a rather artificial (if common) way, namely by setting a buffer of discrepancy γ around the null hypothesis. While the choice of a maximal degree of precision sounds natural to me (in the sense that a given sample size should not allow for the discrimination between two arbitrarily close values of the parameter), the fact that γ is in fine set by the data (so that the p-value is high) is fairly puzzling. If I understand correctly, the change from a p-value to a discrepancy γ is a fine device to make the “distance” from the null better understood, but it has an extremely limited range of application. If I do not understand correctly, the discrepancy γ is fixed by the statistician and then this sounds like an extreme form of prior selection.
There is at least one issue I do not understand in this part, namely the meaning of the severity evaluation probability
as the conditioning on the event seems impossible in a frequentist setting. This leads me to an idle and unrelated questioning as to whether there is a solution to
as this would be the ultimate discrepancy. Or whether this does not make any sense… because of the ambiguous role of z0, which needs somehow to be integrated out. (Otherwise, d can be chosen so that the probability is 1.)
“If one renounces the likelihood, the stopping rule, and the coherence principles, marginalizes the use of prior information as largely untrustworthy, and seek procedures with `good’ error probabilistic properties (whatever that means), what is left to render the inference Bayesian, apart from a belief (misguided in my view) that the only way to provide an evidential account of inference is to attach probabilities to hypotheses?”—A. Spanos, p.326, Error and Inference, 2010
The role of conditioning ancillary statistics is emphasized both in the paper and the discussion. This conditioning clearly reduces variability; however, there is no reservation about the arbitrariness of such ancillary statistics, nor about the fact that conditioning any further would lead to conditioning upon the whole data, i.e. to a Bayesian solution. I also noted a curious lack of proper logical reasoning in the argument that, when
using the conditional ancillary distribution is enough: “any departure from f(z|s) implies that the overall model is false” (p.322), but the reverse does not hold. Hence, a poor choice of s may fail to detect a departure. (Besides the fact that fixed-dimension sufficient statistics do not exist outside exponential families.) Similarly, Spanos expands on the case of a minimal sufficient statistic that is independent of a maximal ancillary statistic, but such cases are quite rare and limited to exponential families [in the iid case]. Still in the conditioning category, he also supports Mayo’s argument against the likelihood principle being a consequence of the sufficiency and weak conditionality principles, a point I discussed in a previous post. However, he does not provide further evidence against Birnbaum’s result, arguing rather in favour of a conditional frequentist inference I have nothing to complain about. (I fail to perceive the appeal of the Welch uniform example in terms of the likelihood principle.)
In an overall conclusion, let me repeat and restate that this series of posts about Error and Inference is far from pretending to bring a Bayesian reply to the philosophical arguments raised in the volume. The primary goal being that of “taking some crucial steps towards legitimating the philosophy of frequentist statistics” (p.328), I should not feel overly concerned. It is only when the debate veered towards a comparison with the Bayesian approach [all too often of the “holier than thou” brand] that I felt allowed to put in my two pennies’ worth… I do hope I may crystallise this set of notes into a more constructed review of the book, if time allows, although I am pessimistic about the chances of getting it published given our current difficulties with the critical review of Murray Aitkin’s Statistical Inference. However, as a coincidence, we got back last weekend an encouraging reply from Statistics and Risk Modelling, prompting us towards a revision and the prospect of a reply by Murray.
“Frequentist methods achieve an objective connection to hypotheses about the data-generating process by being constrained and calibrated by the method’s error probabilities in relation to these models.”—D. Cox and D. Mayo, p.277, Error and Inference, 2010
The second part of the seventh chapter of Error and Inference is David Cox’s and Deborah Mayo’s “Objectivity and conditionality in frequentist inference”. (Part of the section is available on Google books.) The purpose is clear and the chapter quite readable from a statistician’s perspective. I however find it difficult to quantify objectivity by first conditioning on “a statistical model postulated to have generated data”, as again this assumes the existence of a “true” probability model where “probabilities (…) are equal or close to the actual relative frequencies”. As earlier stressed by Andrew:
“I don’t think it’s helpful to speak of “objective priors.” As a scientist, I try to be objective as much as possible, but I think the objectivity comes in the principle, not the prior itself. A prior distribution–any statistical model–reflects information, and the appropriate objective procedure will depend on what information you have.”
The paper opposes the likelihood, Bayesian, and frequentist methods, reproducing what Gigerenzer called the “superego, the ego, and the id” in his paper on statistical significance. Cox and Mayo stress from the start that the frequentist approach is (more) objective because it is based on the sampling distribution of the test. My primary problem with this thesis is that the “hypothetical long run” (p.282) does not hold in realistic settings. Even in the event of a reproduction of similar or identical tests, a sequential procedure exploiting everything that has been observed so far is more efficient than the mere replication of the same procedure solely based on the current observation.
“Virtually all (…) models are to some extent provisional, which is precisely what is expected in the building up of knowledge.”—D. Cox and D. Mayo, p.283, Error and Inference, 2010
The above quote is something I completely agree with, being another phrasing of George Box’s “all models are wrong”, but this transience of working models is a good reason in my opinion to account for the possibility of alternative working models from the start of the statistical analysis, hence for an inclusion of those models in the statistical analysis equally from the start, which leads almost inevitably to a Bayesian formulation of the testing problem.
“Perhaps the confusion [over the role of sufficient statistics] stems in part because the various inference schools accept the broad, but not the detailed, implications of sufficiency.”—D. Cox and D. Mayo, p.286, Error and Inference, 2010
The discussion over the sufficiency principle is interesting, as always. The authors propose to solve the confusion between the sufficiency principle and the frequentist approach by assuming that inference “is relative to the particular experiment, the type of inference, and the overall statistical approach” (p.287). This creates a barrier between sampling distributions that avoids the binomial versus negative binomial paradox always stressed in the Bayesian literature. But the solution is somewhat tautological: by conditioning on the sampling distribution, it avoids the difficulties linked with several sampling distributions all producing the same likelihood. After my recent work on ABC model choice, I am however less excited about the sufficiency principle as the existence of [non-trivial] sufficient statistics is quite the rare event, especially across models. The section (pp. 288-289) is also revealing about the above “objectivity” of the frequentist approach in that the derivation of a test taking large values away from the null with a well-known distribution under the null is not an automated process, esp. when nuisance parameters cannot be escaped from (pp. 291-294). Achieving separation from nuisance parameters, i.e. finding statistics that can be conditioned upon to eliminate those nuisance parameters, does not seem feasible outside well-formalised models related to exponential families. Even in such formalised models, a (clear?) element of arbitrariness is involved in the construction of the separations, which implies that the objectivity is under clear threat. The chapter recognises this limitation in Section 9.2 (pp.293-294); however, it argues that separation is much more common in the asymptotic sense and opposes the approach to the Bayesian averaging over the nuisance parameters, which “may be vitiated by faulty priors” (p.294).
I am not convinced by the argument, given that the (approximate) conditional approach amounts to replacing the unknown nuisance parameter by an estimator, without accounting for the variability of this estimator. Averaging brings the right (in a consistency sense) penalty.
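The difference between plugging in an estimator and averaging over the nuisance parameter can be sketched in the simplest setting I can think of, which is my own choice of illustration and not an example from the chapter: a normal sample with unknown mean μ and nuisance standard deviation σ. Plugging the sample standard deviation into a normal interval ignores its variability, while averaging over σ under the (assumed) reference prior proportional to 1/σ leads to a Student t interval with n−1 degrees of freedom. The sample size, seed, and number of replications below are arbitrary.

```python
import random

Z_975 = 1.960      # 97.5% quantile of the standard normal
T_975_DF4 = 2.776  # 97.5% quantile of Student's t with 4 degrees of freedom

def coverage(n_rep=20000, n=5, mu=0.0, sigma=1.0, seed=1):
    """Monte Carlo coverage of nominal 95% intervals for mu, with n fixed at 5."""
    rng = random.Random(seed)
    hit_plug = hit_avg = 0
    for _ in range(n_rep):
        x = [rng.gauss(mu, sigma) for _ in range(n)]
        m = sum(x) / n
        s2 = sum((xi - m) ** 2 for xi in x) / (n - 1)
        se = (s2 / n) ** 0.5
        # Plug-in: treat the estimated sigma as if it were the true value
        hit_plug += abs(m - mu) <= Z_975 * se
        # Averaging over sigma: Student t interval with n - 1 = 4 degrees of freedom
        hit_avg += abs(m - mu) <= T_975_DF4 * se
    return hit_plug / n_rep, hit_avg / n_rep

cov_plug, cov_avg = coverage()
print(f"plug-in interval coverage:   {cov_plug:.3f}")
print(f"averaged interval coverage:  {cov_avg:.3f}")
```

In this toy case the plug-in interval falls short of its nominal level, while the wider interval obtained by averaging recovers it; the t quantile is hard-coded for n = 5, so changing n requires changing that constant as well.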
A compelling section is the one about the weak conditionality principle (pp. 294-298), as it objects to the usual statement that a frequency approach breaks this principle. In a mixture experiment about the same parameter θ, inferences made conditional on the experiment “are appropriately drawn in terms of the sampling behavior in the experiment known to have been performed” (p. 296). This seems hardly objectionable, as stated. And I must confess the sin of stating the opposite as The Bayesian Choice has this remark (Example 1.3.7, p.18) that the classical confidence interval averages over the experiments… Mea culpa! The term experiment validates the above conditioning in that several experiments could be used to measure θ, each with a different p-value. I will not argue with this. I could however argue about “conditioning is warranted to achieve objective frequentist goals” (p. 298) in that the choice of the conditioning, among other things, weakens the objectivity of the analysis. In a sense the above pirouette out of the conditioning principle paradox suffers from the same weakness, namely that when two distributions characterise the same data (the mixture and the conditional distributions), there is a choice to be made between “good” and “bad”. Nonetheless, an approach based on the mixture remains frequentist if non-optimal… (The chapter later attacks the derivation of the likelihood principle, I will come back to it in a later post.)
“Many seem to regard reference Bayesian theory to be a resting point until satisfactory subjective or informative priors are available. It is hard to see how this gives strong support to the reference prior research program.”—D. Cox and D. Mayo, p.302, Error and Inference, 2010
A section also worth commenting on is (unsurprisingly!) the one addressing the limitations of the Bayesian alternatives (pp. 298–302). It however dismisses right away the personalistic approach to priors by (predictably if hastily) considering it fails the objectivity canons. This seems a wee quick to me, as (a) the choice of a prior is the choice of a reference probability measure against which to assess the information brought by the data, not clearly less objective than picking one frequentist estimator or another, and (b) a personal construction of the prior can also be defended on objective grounds, based on the past experience of the modeler. That it varies from one modeler to the next is not an indication of subjectivity per se, simply of different past experiences. Cox and Mayo then focus on reference priors, à la Bernardo-Berger, once again pointing out the lack of uniqueness of those priors as a major flaw. While the sub-chapter agrees on the understanding of those priors as convention or reference priors, aiming at maximising the input from the data, it gets stuck on the impropriety of such priors: “if priors are not probabilities, what then is the interpretation of a posterior?” (p.299). This seems like a strange comment to me: the interpretation of a posterior is that it is a probability distribution and this is the only mathematical constraint one has to impose on a prior. (Which may be a problem in the derivation of reference priors.) As detailed in The Bayesian Choice among other books, there are many compelling reasons to invite improper priors into the game. (And one not to, namely the difficulty with point null hypotheses.) While I agree that the fact that some reference priors (like matching priors, whose discussion p. 302 escapes me) have good frequentist properties is not compelling within a Bayesian framework, it seems a good enough answer to the more general criticism about the lack of objectivity: in that sense, frequency-validated reference priors are part of the huge package of frequentist procedures and cannot be dismissed on the basis of being Bayesian. That reference priors are possibly at odds with the likelihood principle does not matter very much: the shape of the sampling distribution is part of the prior information, not of the likelihood per se. The final argument (Section 12) that Bayesian model choice requires the preliminary derivation of “the possible departures that might arise” (p.302) has been made at several points in Error and Inference. Besides being in my opinion a valid working principle, i.e. selecting the most appropriate albeit false model, this definition of well-defined alternatives is mimicked by the assumption of “statistics whose distribution does not depend on the model assumption” (p. 302) found in the same last paragraph.
In conclusion, this (sub-)chapter by David Cox and Deborah Mayo is (as could be expected!) a deep and thorough treatment of the frequentist approach to the sufficiency and (weak) conditionality principles. It however fails to convince me that there exists a “unique and unambiguous” frequentist approach to all but the simplest problems. At least, from reading this chapter, I cannot find a working principle that would lead me to this single unambiguous frequentist procedure.
“The defining feature of an inductive inference is that the premises (evidence statements) can be true while the conclusion inferred may be false without a logical contradiction: the conclusion is “evidence transcending”.”—D. Mayo and D. Cox, p.249, Error and Inference, 2010
The seventh chapter of Error and Inference, entitled “New perspectives on (some old) problems of frequentist statistics”, is divided into four parts, written by David Cox, Deborah Mayo and Aris Spanos, in different orders and groups of authors. This is certainly the most statistical of all chapters, not a surprise when considering that David Cox is involved, and I thus have difficulty explaining why it took me so long to read through it… Overall, this chapter is quite important by its contribution to the debate on the nature of statistical testing.
“The advantage in the modern statistical framework is that the probabilities arise from defining a probability model to represent the phenomenon of interest. Had Popper made use of the statistical testing ideas being developed at around the same time, he might have been able to substantiate his account of falsification.”—D. Mayo and D. Cox, p.251, Error and Inference, 2010
The first part of the chapter is Mayo’s and Cox’s “Frequentist statistics as a theory of inductive inference”. It was first published in the 2006 Erich Lehmann symposium and is available online as an arXiv paper. There is absolutely no attempt there to link or clash with the Bayesian approach: this paper is only looking at frequentist statistical theory as the basis for inductive inference. The debate therein about deducing that H is correct from a dataset successfully facing a statistical test is classical (in both senses) but I [unsurprisingly] remain unconvinced by the arguments. The null hypothesis remains the calibrating distribution throughout the chapter, with very little (or at least not enough) consideration of what happens when the null hypothesis does not hold. Section 3.6 about confidence intervals being another facet of testing hypotheses is representative of this perspective. The p-value is defended as the central tool for conducting hypothesis assessment. (In this version of the paper, some p’s are written in roman characters and others in italics, which is a wee confusing until one realises that this is a mere typo!) The fundamental imbalance problem, namely that, in contiguous hypotheses, a test cannot be expected both to most often reject the null when it is [very moderately] false and to most often accept the null when it is right, is not discussed there. The argument about substantive nulls in Section 3.5 considers a stylised case of well-separated scientific theories; however, the real world of models is more similar to a greyish (and more Popperian?) continuum of possibles. In connection with this, I would have thought it more likely that the book would address on philosophical grounds Box’s aphorism that “all models are wrong”. Indeed, one (philosophical?) difficulty with the p-values and the frequentist evidence principle (FEV) is that they rely on the strong belief that one given model can be exact or true (while criticising the subjectivity of the prior modelling in the Bayesian approach). Even in the typology of types of null hypotheses drawn by the authors in Section 3, the “possibility of model misspecification” is addressed in terms of the low power of an omnibus test, while agreeing that “an incomplete probability specification” is unavoidable (an argument found at several places in the book that the alternative cannot be completely specified).
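The imbalance between contiguous hypotheses mentioned above can be sketched with the power function of a one-sided z-test. The setting (normal observations with known unit variance, a 5% level, a sample of size 100, and the particular discrepancies δ) is an assumption made for illustration and is not an example from the chapter.

```python
from math import erfc, sqrt

Z_95 = 1.6449  # 95% quantile of the standard normal

def power(delta, n, quantile=Z_95):
    """Power of the one-sided z-test of H0: mu = 0 when the true mean is delta (sigma = 1)."""
    # The test rejects when sqrt(n) * sample mean exceeds the quantile;
    # under mu = delta that statistic is N(delta * sqrt(n), 1)
    return 0.5 * erfc((quantile - delta * sqrt(n)) / sqrt(2))

n = 100
for delta in (0.0, 0.01, 0.05, 0.5):
    print(f"delta = {delta:4.2f}: probability of rejecting H0 = {power(delta, n):.4f}")
```

For discrepancies very close to the null, the probability of rejection stays close to the level of the test, so a test that most often accepts a true null cannot also most often reject a very moderately false one.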
“Sometimes we can find evidence for H0, understood as an assertion that a particular discrepancy, flaw, or error is absent, and we can do this by means of tests that, with high probability, would have reported a discrepancy had one been present.”—D. Mayo and D. Cox, p.255, Error and Inference, 2010
The above quote relates to the Failure and Confirmation section where the authors try to push the argument in favour of frequentist tests one step further, namely that “moderate p-values” may sometimes be used as confirmation of the null. (I may have misunderstood, the end of the section defending a purely frequentist, as in repeated experiments, interpretation. This reproduces an earlier argument about the nature of probability in Section 1.2, as characterising the “stability of relative frequencies of results of repeated trials”.) In fact, this chapter and other recent readings made me think afresh about the nature of probability, a debate that put me off so much in Keynes (1921) and even in Jeffreys (1939). From a mathematical perspective, there is only one “kind” of probability, the one defined via a reference measure and a probability, whether it applies to observations or to parameters. From a philosophical perspective, there is a natural issue about the “truth” or “realism” of the probability quantities and of the probabilistic statements. The book and in particular the chapter consider that a truthful probability statement is the one agreeing with “a hypothetical long-run of repeated sampling, an error probability”, while the statistical inference school of Keynes (1921), Jeffreys (1939), and Carnap (1962) “involves quantifying a degree of support or confirmation in claims or hypotheses”, which makes this (Bayesian) school sound less realistic… Obviously, I have no ambition to solve this long-running debate; however, I see no reason for the first approach to be more realistic by being grounded on stable relative frequencies à la von Mises. If nothing else, the notion that a test should be evaluated on its long-run performance is very idealistic as the concept relies on an ever-repeating, infinite sequence of identical trials.
Relying on probability measures as self-coherent mathematical measures of uncertainty carries (for me) as much (or as little) reality as the above infinite experiment. Now, the paper is not completely entrenched in this interpretation, when it concludes that “what makes the kind of hypothetical reasoning relevant to the case at hand is not the long-run low error rates associated with using the tool (or test) in this manner; it is rather what those error rates reveal about the data generating source or phenomenon” (p.273).
“If the data are so extensive that accordance with the null hypothesis implies the absence of an effect of practical importance, and a reasonably high p-value is achieved, then it may be taken as evidence of the absence of an effect of practical importance.”—D. Mayo and D. Cox, p.263, Error and Inference, 2010
The paper mentions several times conclusions to be drawn from a p-value near one, as in the above quote. This is an interpretation that does not sit well with my understanding of p-values being distributed as uniforms under the null: very high p-values should be as suspicious as very low p-values. (This criticism is not new, of course.) Unless one does not strictly adhere to the null model, which brings back the above issue of the approximate nature of any model… I also found it fascinating to read the criticism that “power appertains to a prespecified rejection region, not to the specific data under analysis” as I thought this equally applied to the p-values, turning “the specific data under analysis” into a departure event of a prespecified kind.
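The uniformity of p-values under the null can be checked by simulation. The following sketch assumes a one-sided z-test on normal samples with known unit variance; the sample size, seed, and number of replications are arbitrary choices of mine.

```python
import random
from math import erfc, sqrt

def null_p_values(n_rep=20000, n=20, seed=2):
    """Simulate one-sided z-test p-values when H0: mu = 0 is exactly true."""
    rng = random.Random(seed)
    pvals = []
    for _ in range(n_rep):
        # Standardised sample mean, distributed as N(0, 1) under the null
        z = sum(rng.gauss(0.0, 1.0) for _ in range(n)) / sqrt(n)
        # Upper-tail probability of the standard normal
        pvals.append(0.5 * erfc(z / sqrt(2)))
    return pvals

pvals = null_p_values()
low = sum(p < 0.05 for p in pvals) / len(pvals)
high = sum(p > 0.95 for p in pvals) / len(pvals)
print(f"fraction of p-values below 0.05: {low:.3f}")
print(f"fraction of p-values above 0.95: {high:.3f}")
```

Under the exact null model both tails are equally rare, which is why a p-value near one is, by itself, as surprising as a p-value near zero.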
(Given the unreasonable length of the above, I fear I will continue my snail-paced reading in yet another post!)
(This is the third post on Error and Inference, yet again being a raw and naïve reaction to a linear reading rather than a deeper and more informed criticism.)
“Statistical knowledge is independent of high-level theories.”—A. Spanos, p.242, Error and Inference, 2010
The sixth chapter of Error and Inference is written by Aris Spanos and deals with the issues of testing in econometrics. It provides on the one hand a fairly interesting entry in the history of economics and the resistance to data-backed theories, primarily because the buffers between data and theory are multifold (“huge gap between economic theories and the available observational data”, p.203). On the other hand, what I fail to understand in the chapter is the meaning of theory, as it seems very distinct from what I would call a (statistical) model. The sentence “statistical knowledge, stemming from a statistically adequate model allows data to `have a voice of its own’ (…) separate from the theory in question and it succeeds in securing the frequentist goal of objectivity in theory testing” (p.206) is puzzling in this respect. (Actually, I would have liked to see a clear meaning put to this “voice of its own”, as it otherwise sounds mostly like a catchy sentence…) Similarly, Spanos distinguishes between three types of models: primary/theoretical, experimental/structural: “the structural model contains a theory’s substantive subject matter information in light of the available data” (p.213), data/statistical: “the statistical model is built exclusively using the information contained in the data” (p.213). I have trouble understanding how testing can distinguish between those types of models: as a naïve reader, I would have thought that only the statistical model could be tested by a statistical procedure, even though I would not call the above a proper definition of a statistical model (esp. since Spanos writes a few lines below that the statistical model “would embed (nest) the structural model in its context” (p.213)). The normal example followed on pages 213-217 does not help [me] to make sense of this distinction: it simply illustrates the impact of failing some of the defining assumptions (normality, time homogeneity [in mean and variance], independence).
(As an aside, the discussion about the poor estimation of the correlation p.214-215 does not help, because it involves a second variable Y that is not defined for this example.) It would be nice of course if the “noise” in a statistical/econometric model could be studied in complete separation from the structure of this model; however, they seem to be irremediably intermingled, which prevents this partition of roles. I thus do not see how the “statistically adequate model is independent from the substantive information” (p.217), i.e. by which rigorous process one can isolate the “chance” parts of the data to build and validate a statistical model per se. The simultaneous equation model (SEM, pp.230-231) is more illuminating of the distinction set by Spanos between structural and statistical models/parameters, even though the difference in this case boils down to a question of identifiability.