My friends Randal Douc and Éric Moulines just published this new time series book with David Stoffer. (David also wrote Time Series Analysis and its Applications with Robert Shumway a year ago.) The books reflects well on the research of Randal and Éric over the past decade, namely convergence results on Markov chains for validating both inference in nonlinear time series and algorithms applied to those objects. The later includes MCMC, pMCMC, sequential Monte Carlo, particle filters, and the EM algorithm. While I am too close to the authors to write a balanced review for CHANCE (the book is under review by another researcher, before you ask!), I think this is an important book that reflects the state of the art in the rigorous study of those models. Obviously, the mathematical rigour advocated by the authors makes Nonlinear Time Series a rather advanced book (despite the authors’ reassuring statement that “nothing excessively deep is used”) more adequate for PhD students and researchers than starting graduates (and definitely not advised for self-study), but the availability of the R code (on the highly personal page of David Stoffer) comes to balance the mathematical bent of the book in the first and third parts. A great reference book!
Archive for statistical inference
Sometimes, if not that often, I forget about submitted papers to the point of thinking they are already accepted. This happened with the critical analysis of Murray Aitkin’s book Statistical Inference, already debated on the ‘Og, written with Andrew Gelman and Judith Rousseau, and resubmitted to Statistics and Risk Modeling in November…2011. As I had received a few months ago a response to our analysis from Murray, I was under the impression it was published or about to be published. Earlier this week I started looking for the reference in connection with the paper I was completing on the Jeffreys-Lindley paradox and could not find it. Checking emails on that topic I then discovered the latest one was from Novtember 2011 and the editor, when contacted, confirmed the paper was still under review! As it got accepted only a few hours later, my impression is that it had been misfiled and forgotten at some point, an impression reinforced by an earlier experience with the previous avatar of the journal, Statistics & Decisions. In the 1990’s George Casella and I had had a paper submitted to this journal for a while, which eventually got accepted. Then nothing happened for a year and more, until we contacted the editor who acknowledged the paper had been misfiled and forgotten! (This was before the electronic processing of papers, so it is quite plausible that the file corresponding to our accepted paper went under a drawer or into the wrong pile and that the editor was not keeping track of those accepted papers. After all, until Series B turned submission into an all-electronic experience, I was using a text file to keep track of daily submissions…) If you knew George, you can easily imagine his reaction when reading this reply… Anyway, all is well that ends well in that our review and Murray’s reply will appear in Statistics and Risk Modeling, hopefully in a reasonable delay.
- Andrew Gelman on Introducing Monte Carlo Methods with R
- Bill Strawderman on Statistical Inference
- Jean-Louis Foulley on Variance Components
- Larry Wasserman on Theory of Point Estimation
- Xiao-Li Meng on Monte Carlo Statistical Methods
Although all of those books have appeared between twenty and five years ago, the reviews are definitely worth reading! (Disclaimer: I am the editor of the Books Review section who contacted friends of George to write the reviews, as well as the co-author of two of those books!) They bring in my (therefore biased) opinion a worthy evaluation of the depths and impacts of those major books, and they also reveal why George was a great teacher, bringing much into the classroom and to his students… (Unless I am confused the whole series of reviews is available to all, and not only to CHANCE subscribers. Thanks, Sam!)
Paulo (a.k.a., Zen) posted a comment in StackExchange on Larry Wasserman‘s paradox about Bayesians and likelihoodists (or likelihood-wallahs, to quote Basu!) being unable to solve the problem of estimating the normalising constant c of the sample density, f, known up to a constant
(Example 11.10, page 188, of All of Statistics)
My own comment is that, with all due respect to Larry!, I do not see much appeal in this example, esp. as a potential criticism of Bayesians and likelihood-wallahs…. The constant c is known, being equal to
If c is the only “unknown” in the picture, given a sample x1,…,xn, then there is no statistical issue whatsoever about the “problem” and I do not agree with the postulate that there exist estimators of c. Nor priors on c (other than the Dirac mass on the above value). This is not in the least a statistical problem but rather a numerical issue.That the sample x1,…,xn can be (re)used through a (frequentist) density estimate to provide a numerical approximation of c
is a mere curiosity. Not a criticism of alternative statistical approaches: e.g., I could also use a Bayesian density estimate…
Furthermore, the estimate provided by the sample x1,…,xn is not of particular interest since its precision is imposed by the sample size n (and converging at non-parametric rates, which is not a particularly relevant issue!), while I could use importance sampling (or even numerical integration) if I was truly interested in c. I however find the discussion interesting for many reasons
- it somehow relates to the infamous harmonic mean estimator issue, often discussed on the’Og!;
- it brings more light on the paradoxical differences between statistics and Monte Carlo methods, in that statistics is usually constrained by the sample while Monte Carlo methods have more freedom in generating samples (up to some budget limits). It does not make sense to speak of estimators in Monte Carlo methods because there is no parameter in the picture, only “unknown” constants. Both fields rely on samples and probability theory, and share many features, but there is nothing like a “best unbiased estimator” in Monte Carlo integration, see the case of the “optimal importance function” leading to a zero variance;
- in connection with the previous point, the fascinating Bernoulli factory problem is not a statistical problem because it requires an infinite sequence of Bernoullis to operate;
- the discussion induced Chris Sims to contribute to StackExchange!
(This is my sixth and last post on Error and Inference, being as previously a raw and naïve reaction born from a linear and sluggish reading of the book, rather than a deeper and more informed criticism with philosophical bearings. Read at your own risk.)
“‘It is refreshing to see Cox and Mayo give a hard-nosed statement of what scientific objectivity demands of an account of statistics, show how it relates to frequentist statistics, and contrast that with the notion of “objectivity” used by O-Bayesians.”—A. Spanos, p.326, Error and Inference, 2010
In order to conclude my pedestrian traverse of Error and Inference, I read the discussion by Aris Spanos of the second part of the seventh chapter by David Cox’s and Deborah Mayo’s, discussed in the previous post. (In the train to the half-marathon to be precise, which may have added a sharper edge to the way I read it!) The first point in the discussion is that the above paper is “a harmonious blend of the Fisherian and N-P perspectives to weave a coherent frequentist inductive reasoning anchored firmly on error probabilities”(p.316). The discussion by Spanos is very much a-critical of the paper, so I will not engage into a criticism of the non-criticism, but rather expose some thoughts of mine that came from reading this apology. (Remarks about Bayesian inference are limited to some piques like the above, which only reiterates those found earlier [and later: "the various examples Bayesians employ to make their case involve some kind of "rigging" of the statistical model", Aris Spanos, p.325; "The Bayesian epistemology literature is filled with shadows and illusions", Clark Glymour, p. 335] in the book.) [I must add I do like the mention of O-Bayesians, as I coined the O'Bayes motto for the objective Bayes bi-annual meetings from 2003 onwards! It also reminds me of the O-rings and of the lack of proper statistical decision-making in the Challenger tragedy...]
The “general frequentist principle for inductive reasoning” (p.319) at the core of Cox and Mayo’s paper is obviously the central role of the p-value in “providing (strong) evidence against the null H0 (for a discrepancy from H0)”. Once again, I fail to see it as the epitome of a working principle in that
- it depends on the choice of a divergence d(z), which reduces the information brought by the data z;
- it does not articulate the level for labeling nor the consequences of finding a low p-value;
- it ignores the role of the alternative hypothesis.
Furthermore, Spanos’ discussion deals with “the fallacy of rejection” (pp.319-320) in a rather artificial (if common) way, namely by setting a buffer of discrepancy γ around the null hypothesis. While the choice of a maximal degree of precision sounds natural to me (in the sense that a given sample size should not allow for the discrimination between two arbitrary close values of the parameter), the fact that γ is in fine set by the data (so that the p-value is high) is fairly puzzling. If I understand correctly, the change from a p-value to a discrepancy γ is a fine device to make the “distance” from the null better understood, but it has an extremely limited range of application. If I do not understand correctly, the discrepancy γ is fixed by the statistician and then this sounds like an extreme form of prior selection.
There is at least one issue I do not understand in this part, namely the meaning of the severity evaluation probability
as the conditioning on the event seems impossible in a frequentist setting. This leads me to an idle and unrelated questioning as to whether there is a solution to
as this would be the ultimate discrepancy. Or whether this does not make any sense… because of the ambiguous role of z0, which needs somehow to be integrated out. (Otherwise, d can be chosen so that the probability is 1.)
“If one renounces the likelihood, the stopping rule, and the coherence principles, marginalizes the use of prior information as largely untrustworthy, and seek procedures with `good’ error probabilistic properties (whatever that means), what is left to render the inference Bayesian, apart from a belief (misguided in my view) that the only way to provide an evidential account of inference is to attach probabilities to hypotheses?”—A. Spanos, p.326, Error and Inference, 2010
The role of conditioning ancillary statistics is emphasized both in the paper and the discussion. This conditioning clearly reduces variability, however there is no reservation about the arbitrariness of such ancillary statistics. And the fact that conditioning any further would lead to conditioning upon the whole data, i.e. to a Bayesian solution. I also noted a curious lack of proper logical reasoning in the argument that, when
using the conditional ancillary distribution is enough, since, while “any departure from f(z|s) implies that the overall model is false” (p.322), but not the reverse. Hence, a poor choice of s may fail to detect a departure. (Besides the fact that fixed-dimension sufficient statistics do not exist outside exponential families.) Similarly, Spanos expands about the case of a minimal sufficient statistic that is independent from a maximal ancillary statistic, but such cases are quite rare and limited to exponential families [in the iid case]. Still in the conditioning category, he also supports Mayo’s argument against the likelihood principle being a consequence of the sufficiency and weak conditionality principles. A point I discussed in a previous post. However, he does not provide further evidence against Birnbaum’s result, arguing rather in favour of a conditional frequentist inference I have nothing to complain about. (I fail to perceive the appeal of the Welch uniform example in terms of the likelihood principle.)
In an overall conclusion, let me repeat and restate that this series of posts about Error and Inference is far from pretending at bringing a Bayesian reply to the philosophical arguments raised in the volume. The primary goal being of “taking some crucial steps towards legitimating the philosophy of frequentist statistics” (p.328), I should not feel overly concerned. It is only when the debate veered towards a comparison with the Bayesian approach [often too often of the "holier than thou" brand] that I felt allowed to put in my twopennies worth… I do hope I may crystallise this set of notes into a more constructed review of the book, if time allows, although I am pessimistic at the chances of getting it published given our current difficulties with the critical review of Murray Aitkin’s Statistical Inference. However, as a coincidence, we got back last weekend an encouraging reply from Statistics and Risk Modelling, prompting us towards a revision and the prospect of a reply by Murray.
“‘Frequentist methods achieve an objective connection to hypotheses about the data-generating process by being constrained and calibrated by the method’s error probabilities in relation to these models .”—D. Cox and D. Mayo, p.277, Error and Inference, 2010
The second part of the seventh chapter of Error and Inference, is David Cox’s and Deborah Mayo’s “Objectivity and conditionality in frequentist inference“. (Part of the section is available on Google books.) The purpose is clear and the chapter quite readable from a statistician’s perspective. I however find it difficult to quantify objectivity by first conditioning on “a statistical model postulated to have generated data”, as again this assumes the existence of a “true” probability model where “probabilities (…) are equal or close to the actual relative frequencies”. As earlier stressed by Andrew:
“I don’t think it’s helpful to speak of “objective priors.” As a scientist, I try to be objective as much as possible, but I think the objectivity comes in the principle, not the prior itself. A prior distribution–any statistical model–reflects information, and the appropriate objective procedure will depend on what information you have.”
The paper opposes the likelihood, Bayesian, and frequentist methods, reproducing what Gigerenzer called the “superego, the ego, and the id” in his paper on statistical significance. Cox and Mayo stress from the start that the frequentist approach is (more) objective because it is based on the sampling distribution of the test. My primary problem with this thesis is that the “hypothetical long run” (p.282) does not hold in realistic settings. Even in the event of a reproduction of similar or identical tests, a sequential procedure exploiting everything that has been observed so far is more efficient than the mere replication of the same procedure solely based on the current observation.
“Virtually all (…) models are to some extent provisional, which is precisely what is expected in the building up of knowledge.”—D. Cox and D. Mayo, p.283, Error and Inference, 2010
The above quote is something I completely agree with, being another phrasing of George Box’s “all models are wrong”, but this transience of working models is a good reason in my opinion to account for the possibility of alternative working models from the start of the statistical analysis. Hence for an inclusion of those models in the statistical analysis equally from the start. Which leads almost inevitably to a Bayesian formulation of the testing problem.
“‘Perhaps the confusion [over the role of sufficient statistics] stems in part because the various inference schools accept the broad, but not the detailed, implications of sufficiency.”—D. Cox and D. Mayo, p.286, Error and Inference, 2010
The discussion over the sufficiency principle is interesting, as always. The authors propose to solve the confusion between the sufficiency principle and the frequentist approach by assuming that inference “is relative to the particular experiment, the type of inference, and the overall statistical approach” (p.287). This creates a barrier between sampling distributions that avoids the binomial versus negative binomial paradox always stressed in the Bayesian literature. But the solution is somehow tautological: by conditioning on the sampling distribution, it avoids the difficulties linked with several sampling distributions all producing the same likelihood. After my recent work on ABC model choice, I am however less excited about the sufficiency principle as the existence of [non-trivial] sufficient statistics is quite the rare event. Especially across models. The section (pp. 288-289) is also revealing about the above “objectivity” of the frequentist approach in that the derivation of a test taking large value away from the null with a well-known distribution under the null is not an automated process, esp. when nuisance parameters cannot be escaped from (pp. 291-294). Achieving separation from nuisance parameters, i.e. finding statistics that can be conditioned upon to eliminate those nuisance parameters, does not seem feasible outside well-formalised models related with exponential families. Even in such formalised models, a (clear?) element of arbitrariness is involved in the construction of the separations, which implies that the objectivity is under clear threat. The chapter recognises this limitation in Section 9.2 (pp.293-294), however it argues that separation is much more common in the asymptotic sense and opposes the approach to the Bayesian averaging over the nuisance parameters, which “may be vitiated by faulty priors” (p.294). I am not convinced by the argument, given that the (approximate) condition approach amount to replace the unknown nuisance parameter by an estimator, without accounting for the variability of this estimator. Averaging brings the right (in a consistency sense) penalty.
A compelling section is the one about the weak conditionality principle (pp. 294-298), as it objects to the usual statement that a frequency approach breaks this principle. In a mixture experiment about the same parameter θ, inferences made conditional on the experiment “are appropriately drawn in terms of the sampling behavior in the experiment known to have been performed” (p. 296). This seems hardly objectionable, as stated. And I must confess the sin of stating the opposite as The Bayesian Choice has this remark (Example 1.3.7, p.18) that the classical confidence interval averages over the experiments… Mea culpa! The term experiment validates the above conditioning in that several experiments could be used to measure θ, each with a different p-value. I will not argue with this. I could however argue about “conditioning is warranted to achieve objective frequentist goals” (p. 298) in that the choice of the conditioning, among other things, weakens the objectivity of the analysis. In a sense the above pirouette out of the conditioning principle paradox suffers from the same weakness, namely that when two distributions characterise the same data (the mixture and the conditional distributions), there is a choice to be made between “good” and “bad”. Nonetheless, an approach based on the mixture remains frequentist if non-optimal… (The chapter later attacks the derivation of the likelihood principle, I will come back to it in a later post.)
“‘Many seem to regard reference Bayesian theory to be a resting point until satisfactory subjective or informative priors are available. It is hard to see how this gives strong support to the reference prior research program.”—D. Cox and D. Mayo, p.302, Error and Inference, 2010
A section also worth commenting is (unsurprisingly!) the one addressing the limitations of the Bayesian alternatives (pp. 298–302). It however dismisses right away the personalistic approach to priors by (predictably if hastily) considering it fails the objectivity canons. This seems a wee quick to me, as the choice of a prior is (a) the choice of a reference probability measure against which to assess the information brought by the data, not clearly less objective than picking one frequentist estimator or another, and (b) a personal construction of the prior can also be defended on objective grounds, based on the past experience of the modeler. That it varies from one modeler to the next is not an indication of subjectivity per se, simply of different past experiences. Cox and Mayo then focus on reference priors, à la Bernardo-Berger, once again pointing out the lack of uniqueness of those priors as a major flaw. While the sub-chapter agrees on the understanding of those priors as convention or reference priors, aiming at maximising the input from the data, it gets stuck on the impropriety of such priors: “if priors are not probabilities, what then is the interpretation of a posterior?” (p.299). This seems like a strange comment to me: the interpretation of a posterior is that it is a probability distribution and this is the only mathematical constraint one has to impose on a prior. (Which may be a problem in the derivation of reference priors.) As detailed in The Bayesian Choice among other books, there are many compelling reasons to invite improper priors into the game. (And one not to, namely the difficulty with point null hypotheses.) While I agree that the fact that some reference priors (like matching priors, whose discussion p. 302 escapes me) have good frequentist properties is not compelling within a Bayesian framework, it seems a good enough answer to the more general criticism about the lack of objectivity: in that sense, frequency-validated reference priors are part of the huge package of frequentist procedures and cannot be dismissed on the basis of being Bayesian. That reference priors are possibly at odd with the likelihood principle does not matter very much: the shape of the sampling distribution is part of the prior information, not of the likelihood per se. The final argument (Section 12) that Bayesian model choice requires the preliminary derivation of “the possible departures that might arise” (p.302) has been made at several points in Error and Inference. Besides being in my opinion a valid working principle, i.e. selecting the most appropriate albeit false model, this definition of well-defined alternatives is mimicked by the assumption of “statistics whose distribution does not depend on the model assumption” (p. 302) found in the same last paragraph.
In conclusion this (sub-)chapter by David Cox and Deborah Mayo is (as could be expected!) a deep and thorough treatment of the frequentist approach to the sufficiency and (weak) conditionality principle. It however fails to convince me that there exists a “unique and unambiguous” frequentist approach to all but the most simple problems. At least, from reading this chapter, I cannot find a working principle that would lead me to this single unambiguous frequentist procedure.
“‘The defining feature of an inductive inference is that the premises (evidence statements) can be true while the conclusion inferred may be false without a logical contradiction: the conclusion is “evidence transcending”.”—D. Mayo and D. Cox, p.249, Error and Inference, 2010
The seventh chapter of Error and Inference, entitled “New perspectives on (some old) problems of frequentist statistics“, is divided in four parts, written by David Cox, Deborah Mayo and Aris Spanos, in different orders and groups of authors. This is certainly the most statistical of all chapters, not a surprise when considering that David Cox is involved, and I thus have difficulties to explain why it took me so long to read through it…. Overall, this chapter is quite important by its contribution to the debate on the nature of statistical testing.
“‘The advantage in the modern statistical framework is that the probabilities arise from defining a probability model to represent the phenomenon of interest. Had Popper made use of the statistical testing ideas being developed at around the same time, he might have been able to substantiate his account of falsification.”—D. Mayo and D. Cox, p.251, Error and Inference, 2010
The first part of the chapter is Mayo’s and Cox’ “Frequentist statistics as a theory of inductive inference“. It was first published in the 2006 Erich Lehmann symposium. And available on line as an arXiv paper. There is absolutely no attempt there to link of clash with the Bayesian approach, this paper is only looking at frequentist statistical theory as the basis for inductive inference. The debate therein about deducing that H is correct from a dataset successfully facing a statistical test is classical (in both senses) but I [unsurprisingly] remain unconvinced by the arguments. The null hypothesis remains the calibrating distribution throughout the chapter, with very little (or at least not enough) consideration of what happens when the null hypothesis does not hold. Section 3.6 about confidence intervals being another facet of testing hypotheses is representative of this perspective. The p-value is defended as the central tool for conducting hypothesis assessment. (In this version of the paper, some p’s are written in roman characters and others in italics, which is a wee confusing until one realises that this is a mere typo!) The fundamental imbalance problem, namely that, in contiguous hypotheses, a test cannot be expected both to most often reject the null when it is [very moderately] false and to most often accept the null when it is right is not discussed there. The argument about substantive nulls in Section 3.5 considers a stylised case of well-separated scientific theories, however the real world of models is more similar to a greyish (and more Popperian?) continuum of possibles. In connection with this, I would have thought more likely that the book would address on philosophical grounds Box’s aphorism that “all models are wrong”. Indeed, one (philosophical?) difficulty with the p-values and the frequentist evidence principle (FEV) is that they rely on the strong belief that one given model can be exact or true (while criticising the subjectivity of the prior modelling in the Bayesian approach). Even in the typology of types of null hypotheses drawn by the authors in Section 3, the “possibility of model misspecification” is addressed in terms of the low power of an omnibus test, while agreeing that “an incomplete probability specification” is unavoidable (an argument found at several place in the book that the alternative cannot be completely specified).
“‘Sometimes we can find evidence for H0, understood as an assertion that a particular discrepancy, flaw, or error is absent, and we can do this by means of tests that, with high probability, would have reported a discrepancy had one been present.”—D. Mayo and D. Cox, p.255, Error and Inference, 2010
The above quote relates to the Failure and Confirmation section where the authors try to push the argument in favour of frequentist tests one step further, namely that that “moderate p-values” may sometimes be used as confirmation of the null. (I may have misunderstood, the end of the section defending a purely frequentist, as in repeated experiments, interpretation. This reproduces an earlier argument about the nature of probability in Section 1.2, as characterising the “stability of relative frequencies of results of repeated trials”) In fact, this chapter and other recent readings made me think afresh about the nature of probability, a debate that put me off so much in Keynes (1921) and even in Jeffreys (1939). From a mathematical perspective, there is only one “kind” of probability, the one defined via a reference measure and a probability, whether it applies to observations or to parameters. From a philosophical perspective, there is a natural issue about the “truth” or “realism” of the probability quantities and of the probabilistic statements. The book and in particular the chapter consider that a truthful probability statement is the one agreeing with “a hypothetical long-run of repeated sampling, an error probability”, while the statistical inference school of Keynes (1921), Jeffreys (1939), and Carnap (1962) “involves quantifying a degree of support or confirmation in claims or hypotheses”, which makes this (Bayesian) sound as less realistic… Obviously, I have no ambition to solve this long-going debate, however I see no reason in the first approach to be more realistic by being grounded on stable relative frequencies à la von Mises. If nothing else, the notion that a test should be evaluated on its long run performances is very idealistic as the concept relies on an ever-repeating, an infinite sequence of identical trials. Relying on probability measures as self-coherent mathematical measures of uncertainty carries (for me) as much (or as less) reality as the above infinite experiment. Now, the paper is not completely entrenched in this interpretation, when it concludes that “what makes the kind of hypothetical reasoning relevant to the case at hand is not the long-run low error rates associated with using the tool (or test) in this manner; it is rather what those error rates reveal about the data generating source or phenomenon” (p.273).
“‘If the data are so extensive that accordance with the null hypothesis implies the absence of an effect of practical importance, and a reasonably high p-value is achieved, then it may be taken as evidence of the absence of an effect of practical importance.”—D. Mayo and D. Cox, p.263, Error and Inference, 2010
The paper mentions several times conclusions to be drawn from a p-value near one, as in the above quote. This is an interpretation that does not sit well with my understanding of p-values being distributed as uniforms under the null: very high p-values should be as suspicious as very low p-values. (This criticism is not new, of course.) Unless one does not strictly adhere to the null model, which brings back the above issue of the approximativeness of any model… I also found fascinating to read the criticism that “power appertains to a prespecified rejection region, not to the specific data under analysis” as I thought this equally applied to the p-values, turning “the specific data under analysis” into a departure event of a prespecified kind.
(Given the unreasonable length of the above, I fear I will continue my snailpaced reading in yet another post!)