A few years ago, I was asked by Isabelle Drouet to contribute a chapter to a multi-disciplinary book on the Bayesian paradigm, book that is now soon to appear. In French. It has this rather ugly title of Bayesianism today. Not that I had hear of Bayesianism or bayésianime previously. There are chapters on the Bayesian notion(s) of probability, game theory, statistics, on applications, and on the (potentially) Bayesian structure of human intelligence. Most of it is thus outside statistics, but I will certainly read through it when I receive my copy.
Archive for foundations
Natesh pointed out to me this recent arXival with a somewhat grandiose abstract:
In this paper, we argue that the primary goal of the foundations of statistics is to provide data analysts with a set of guiding principles that are guaranteed to lead to valid statistical inference. This leads to two new questions: “what is valid statistical inference?” and “do existing methods achieve this?” Towards answering these questions, this paper makes three contributions. First, we express statistical inference as a process of converting observations into degrees of belief, and we give a clear mathematical definition of what it means for statistical inference to be valid. Second, we evaluate existing approaches Bayesian and frequentist approaches relative to this definition and conclude that, in general, these fail to provide valid statistical inference. This motivates a new way of thinking, and our third contribution is a demonstration that the inferential model framework meets the proposed criteria for valid and prior-free statistical inference, thereby solving perhaps the most important unsolved problem in statistics.
Since solving the “most important unsolved problem in statistics” sounds worth pursuing, I went and checked the paper‘s contents.
“To us, the primary goal of the foundations of statistics is to provide a set of guiding principles that, if followed, will guarantee validity of the resulting inference. Our motivation for writing this paper is to be clear about what is meant by valid inference and to provide the necessary principles to help data analysts achieve validity.”
Which can be interpreted in so many ways that it is somewhat meaningless…
“…if real subjective prior information is available, we recommend using it. However, there is an expanding collection of work (e.g., machine learning, etc) that takes the perspective that no real prior information is available. Even a large part of the literature claiming to be Bayesian has abandoned the interpretation of the prior as a serious part of the model, opting for “default” prior that “works.” Our choice to omit a prior from the model is not for the (misleading) purpose of being “objective”—subjectivity is necessary—but, rather, for the purpose of exploring what can be done in cases where a fully satisfactory prior is not available, to see what improvements can be made over the status quo.”
This is a pretty traditional criticism of the Bayesian approach, namely that if a “true” prior is provided (by whom?) then it is optimal to use it. But this amounts to turn the prior into another piece of the sampling distribution and is not in my opinion a Bayesian argument! Most of the criticisms in the paper are directed at objective Bayes approaches, with the surprising conclusion that, because there exist cases where no matching prior is available, “the objective Bayesian approach [cannot] be considered as a general framework for scientific inference.” (p.9)
Another section argues that a Bayesian modelling cannot describe a state of total ignorance. This is formally correct, which is why there is no such thing as a non-informative or the non-informative prior, as often discussed here, but is this truly relevant, in that the inference problem contains one way or another information about the parameter, for instance through a loss function or a pseudo-likelihood.
“This is a desirable property that most existing methods lack.”
The proposal central to the paper thesis is to replace posterior probabilities by belief functions b(.|X), called statistical inference, that are interpreted as measures of evidence about subsets A of the parameter space. If not necessarily as probabilities. This is not very novel, witness the works of Dempster, Shafer and subsequent researchers. And not very much used outside Bayesian and fiducial statistics because of the mostly impossible task of defining a function over all subsets of the parameter space. Because of the subjectivity of such “beliefs”, they will be “valid” only if they are well-calibrated in the sense of b(A|X) being sub-uniform, that is, more concentrated near zero than a uniform variate (i.e., small) under the alternative, i.e. when θ is not in A. At this stage, since this is a mix of a minimax and proper coverage condition, my interest started to quickly wane… Especially because the sub-uniformity condition is highly demanding, if leading to controls over the Type I error and the frequentist coverage. As often, I wonder at the meaning of a calibration property obtained over all realisations of the random variable and all values of the parameter. So for me stability is neither “desirable” nor “essential”. Overall, I have increasing difficulties in perceiving proper coverage as a relevant property. Which has no stronger or weaker meaning that the coverage derived from a Bayesian construction.
“…frequentism does not provide any guidance for selecting a particular rule or procedure.”
I agree with this assessment, which means that there is no such thing as frequentist inference, but rather a philosophy for assessing procedures. That the Gleser-Hwang paradox invalidates this philosophy sounds a bit excessive, however. Especially when the bounded nature of Bayesian credible intervals is also analysed as a failure. A more relevant criticism is the lack of directives for picking procedures.
“…we are the first to recognize that the belief function’s properties are necessary in order for the inferential output to satisfy the required validity property”
The construction of the “inferential model” proposed by the authors offers similarities withn fiducial inference, in that it builds upon the representation of the observable X as X=a(θ,U). With further constraints on the function a() to ensure the validity condition holds… An interesting point is that the functional connection X=a(θ,U) means that the nature of U changes once X is observed, albeit in a delicate manner outside a Bayesian framework. When illustrated on the Gleser-Hwang paradox, the resolution proceeds from an arbitrary choice of a one-dimensional summary, though. (As I am reading the paper, I realise it builds on other and earlier papers by the authors, papers that I cannot read for lack of time. I must have listned to a talk by one of the authors last year at JSM as this rings a bell. Somewhat.) In conclusion of a quick Sunday afternoon read, I am not convinced by the arguments in the paper and even less by the impression of a remaining arbitrariness in setting the resulting procedure.
Keli Liu and Xiao-Li Meng completed a paper on the very nature of inference, to appear in The Annual Review of Statistics and Its Application. This paper or chapter is addressing a fundamental (and foundational) question on drawing inference based a sample on a new observation. That is, in making prediction. To what extent should the characteristics of the sample used for that prediction resemble those of the future observation? In his 1921 book, A Treatise on Probability, Keynes thought this similarity (or individualisation) should be pushed to its extreme, which led him to somewhat conclude on the impossibility of statistics and never to return to the field again. Certainly missing the incoming possibility of comparing models and selecting variables. And not building so much on the “all models are wrong” tenet. On the contrary, classical statistics use the entire data available and the associated model to run the prediction, including Bayesian statistics, although it is less clear how to distinguish between data and control there. Liu & Meng debate about the possibility of creating controls from the data alone. Or “alone” as the model behind always plays a capital role.
“Bayes and Frequentism are two ends of the same spectrum—a spectrum defined in terms of relevance and robustness. The nominal contrast between them (…) is a red herring.”
The paper makes for an exhilarating if definitely challenging read. With a highly witty writing style. If only because the perspective is unusual, to say the least!, and requires constant mental contortions to frame the assertions into more traditional terms. For instance, I first thought that Bayesian procedures were in agreement with the ultimate conditioning approach, since it conditions on the observables and nothing else (except for the model!). Upon reflection, I am not so convinced that there is such a difference with the frequentist approach in the (specific) sense that they both take advantage of the entire dataset. Either from the predictive or from the plug-in distribution. It all boils down to how one defines “control”.
“Probability and randomness, so tightly yoked in our minds, are in fact distinct concepts (…) at the end of the day, probability is essentially a tool for bookkeeping, just like the abacus.”
Some sentences from the paper made me think of ABC, even though I am not trying to bring everything back to ABC!, as drawing controls is the nature of the ABC game. ABC draws samples or control from the prior predictive and only keeps those for which the relevant aspects (or the summary statistics) agree with those of the observed data. Which opens similar questions about the validity and precision of the resulting inference, as well as the loss of information due to the projection over the summary statistics. While ABC is not mentioned in the paper, it can be used as a benchmark to walk through it.
“In the words of Jack Kiefer, we need to distinguish those problems with `luck data’ from those with `unlucky data’.”
I liked very much recalling discussions we had with George Casella and Costas Goutis in Cornell about frequentist conditional inference, with the memory of Jack Kiefer still lingering around. However, I am not so excited about the processing of models here since, from what I understand in the paper (!), the probabilistic model behind the statistical analysis must be used to some extent in producing the control case and thus cannot be truly assessed with a critical eye. For instance, of which use is the mean square error when the model behind is unable to produce the observed data? In particular, the variability of this mean squared error is directly driven by this model. Similarly the notion of ancillaries is completely model-dependent. In the classification diagrams opposing robustness to relevance, all methods included therein are parametric. While non-parametric types of inference could provide a reference or a calibration ruler, at the very least.
Also, by continuously and maybe a wee bit heavily referring to the doctor-and-patient analogy, the paper is somewhat confusing as to which parts are analogy and which parts are methodology and to which type of statistical problem is covered by the discussion (sometimes it feels like all problems and sometimes like medical trials).
“The need to deliver individualized assessments of uncertainty are more pressing than ever.”
A final question leads us to an infinite regress: if the statistician needs to turn to individualized inference, at which level of individuality should the statistician be assessed? And who is going to provide the controls then? In any case, this challenging paper is definitely worth reading by (only mature?) statisticians to ponder about the nature of the game!
Because the title intrigued me (who would dream of claiming connection with Tony Blair’s “new” Labour move to centre-right?!) , I downloaded William Briggs‘ paper the Third Way of Probability & Statistics from arXiv and read it while secluded away, with no connection to the outside world, at Longmire, Mount Rainier National Park. Early morning at Paradise Inn. The subtitle of the document is “Beyond Testing and Estimation To Importance, Relevance, and Skill“. Actually, Longmire may have been the only place where I would read through the entire paper and its 14 pages, as the document somewhat sounds like a practical (?) joke. And almost made me wonder whether Mr Briggs was a pseudonym… And where the filter behind arXiv publishing principles was that day.
The notion behind Briggs’ third way is that parameters do not exist and that only conditional probability exists. Not exactly a novel perspective then. The first five pages go on repeating this principle in various ways, without ever embarking into the implementation of the idea, at best referring to a future book in search of a friendly publisher… The remainder of the paper proceeds to analyse a college GPA dataset without ever explaining how the predictive distribution was constructed. The only discussion is about devising a tool to compare predictors, which is chosen as the continuous rank probability score of Gneiting and Raftery (2007). Looking at those scores seems to encompass this third way advocated by the author, then, which sounds to me to be an awfully short lane into statistics. With no foray whatsoever into probability.
My first morning session was about inference for philogenies. While I was expecting some developments around Kingman’s coalescent models my coauthors needed and developped ABC for, I was surprised to see models that were producing closed form (or close enough to) likelihoods. Due to strong restrictions on the population sizes and migration possibilities, as explained later to me by Vladimir Minin. No need for ABC there since MCMC was working on the species trees, with Vladimir Minin making use of [the Savage Award winner] Vinayak Rao’s approach on trees that differ from the coalescent. And enough structure to even consider and demonstrate tree identifiability in Laura Kubatko’s case.
I then stopped by the astrostatistics session as the first talk by Gwendolin Eddie was about galaxy mass estimation, a problem I may actually be working on in the Fall, but it ended up being a completely different problem and I was further surprised that the issue of whether or not the data was missing at random was not considered by the authors.
Christening a session Unifying foundation(s) may be calling for trouble, at least from me! In this spirit, Xiao Li Meng gave a talk attempting at a sort of unification of the frequentist, Bayesian, and fiducial paradigms by introducing the notion of personalized inference, which is a notion I had vaguely thought of in the past. How much or how far do you condition upon? However, I have never thought of this justifying fiducial inference in any way and Xiao Li’s lively arguments during and after the session not enough to convince me of the opposite: Prior-free does not translate into (arbitrary) choice-free. In the earlier talk about confidence distributions by Regina Liu and Minge Xie, that I partly missed for Galactic reasons, I just entered into the room at the very time when ABC was briefly described as a confidence distribution because it was not producing a convergent approximation to the exact posterior, a logic that escapes me (unless those confidence distributions are described in such a loose way as to include about any method f inference). Dongchu Sun also gave us a crash course on reference priors, with a notion of random posteriors I had not heard of before… As well as constructive posteriors… (They seemed to mean constructible matching priors as far as I understood.)
The final talk in this session by Chuanhei Liu on a new approach (modestly!) called inferential model was incomprehensible, with the speaker repeatedly stating that the principles were too hard to explain in five minutes and needed an incoming book… I later took a brief look at an associated paper, which relates to fiducial inference and to Dempster’s belief functions. For me, it has the same Münchhausen feeling of creating a probability out of nothing, creating a distribution on the parameter by ignoring the fact that the fiducial equation x=a(θ,u) modifies the distribution of u once x is observed.
Pier Giovanni Bissiri, Chris Holmes and Stephen Walker have recently arXived the paper related to Sephen’s talk in London for Bayes 250. When I heard the talk (of which some slides are included below), my interest was aroused by the facts that (a) the approach they investigated could start from a statistics, rather than from a full model, with obvious implications for ABC, & (b) the starting point could be the dual to the prior x likelihood pair, namely the loss function. I thus read the paper with this in mind. (And rather quickly, which may mean I skipped important aspects. For instance, I did not get into Section 4 to any depth. Disclaimer: I wasn’t nor is a referee for this paper!)
The core idea is to stick to a Bayesian (hardcore?) line when missing the full model, i.e. the likelihood of the data, but wishing to infer about a well-defined parameter like the median of the observations. This parameter is model-free in that some degree of prior information is available in the form of a prior distribution. (This is thus the dual of frequentist inference: instead of a likelihood w/o a prior, they have a prior w/o a likelihood!) The approach in the paper is to define a “posterior” by using a functional type of loss function that balances fidelity to prior and fidelity to data. The prior part (of the loss) ends up with a Kullback-Leibler loss, while the data part (of the loss) is an expected loss wrt to l(THETASoEUR,x), ending up with the definition of a “posterior” that is
the loss thus playing the role of the log-likelihood.
I like very much the problematic developed in the paper, as I think it is connected with the real world and the complex modelling issues we face nowadays. I also like the insistence on coherence like the updating principle when switching former posterior for new prior (a point sorely missed in this book!) The distinction between M-closed M-open, and M-free scenarios is worth mentioning, if only as an entry to the Bayesian processing of pseudo-likelihood and proxy models. I am however not entirely convinced by the solution presented therein, in that it involves a rather large degree of arbitrariness. In other words, while I agree on using the loss function as a pivot for defining the pseudo-posterior, I am reluctant to put the same faith in the loss as in the log-likelihood (maybe a frequentist atavistic gene somewhere…) In particular, I think some of the choices are either hard or impossible to make and remain unprincipled (despite a call to the LP on page 7). I also consider the M-open case as remaining unsolved as finding a convergent assessment about the pseudo-true parameter brings little information about the real parameter and the lack of fit of the superimposed model. Given my great expectations, I ended up being disappointed by the M-free case: there is no optimal choice for the substitute to the loss function that sounds very much like a pseudo-likelihood (or log thereof). (I thought the talk was more conclusive about this, I presumably missed a slide there!) Another great expectation was to read about the proper scaling of the loss function (since L and wL are difficult to separate, except for monetary losses). The authors propose a “correct” scaling based on balancing both faithfulness for a single observation, but this is not a completely tight argument (dependence on parametrisation and prior, notion of a single observation, &tc.)
The illustration section contains two examples, one of which is a full-size or at least challenging genetic data analysis. The loss function is based on a logistic pseudo-likelihood and it provides results where the Bayes factor is in agreement with a likelihood ratio test using Cox’ proportional hazard model. The issue about keeping the baseline function as unkown reminded me of the Robbins-Wasserman paradox Jamie discussed in Varanasi. The second example offers a nice feature of putting uncertainties onto box-plots, although I cannot trust very much the 95% of the credibles sets. (And I do not understand why a unique loss would come to be associated with the median parameter, see p.25.)
Watch out: Tomorrow’s post contains a reply from the authors!
Last Monday, my student Li Chenlu presented the foundational 1962 JASA paper by Allan Birnbaum, On the Foundations of Statistical Inference. The very paper that derives the Likelihood Principle from the cumulated Conditional and Sufficiency principles and that had been discussed [maybe ad nauseam] on this ‘Og!!! Alas, thrice alas!, I was still stuck in the plane flying back from Atlanta as she was presenting her understanding of the paper, as the flight had been delayed four hours thanks to (or rather woe to!) the weather conditions in Paris the day before (chain reaction…):
I am sorry I could not attend this lecture and this for many reasons: first and foremost, I wanted to attend every talk from my students both out of respect for them and to draw a comparison between their performances. My PhD student Sofia ran the seminar that day in my stead, for which I am quite grateful, but I do do wish I had been there… Second, this a.s. has been the most philosophical paper in the series.and I would have appreciated giving the proper light on the reasons for and the consequences of this paper as Li Chenlu stuck very much on the paper itself. (She provided additional references in the conclusion but they did not seem to impact the slides.) Discussing for instance Berger’s and Wolpert’s (1988) new lights on the topic, as well as Deborah Mayo‘s (2010) attacks, and even Chang‘s (2012) misunderstandings, would have clearly helped the students.