inferential models: reasoning with uncertainty [book review]
“the field of statistics (…) is still surprisingly underdeveloped (…) the subject lacks a solid theory for reasoning with uncertainty [and] there has been very little progress on the foundations of statistical inference” (p.xvi)
A book that starts with such massive assertions is certainly hoping to attract some degree of attention from the field and likely to induce strong reactions to this dismissal of the not inconsiderable amount of research dedicated so far to statistical inference and in particular to its foundations. Or even attracting flak for not accounting (in this introduction) for the past work of major statisticians, like Fisher, Kiefer, Lindley, Cox, Berger, Efron, Fraser and many, many others… Judging from the references and the tone of this 254-page book, it seems like the two authors, Ryan Martin and Chuanhai Liu, truly aim at single-handedly resetting the foundations of statistics to their own tune, which sounds like a new kind of fiducial inference augmented with calibrated belief functions. Be warned that five chapters of this book are built on as many papers written by the authors in the past three years. Which makes me question, if I may, the relevance of publishing a book on a brand-new approach to statistics without further backup from a wider community.
“…it is possible to calibrate our belief probabilities for a common interpretation by intelligent minds.” (p.14)
Chapter 1 contains a description of the new perspective in Section 1.4.2, which I find useful to detail here. When given an observation x from a Normal N(θ,1) model, the authors rewrite X as θ+Z, with Z~N(0,1), as in fiducial inference, and then want to find a “meaningful prediction of Z independently of X”. This seems difficult to accept given that, once X=x is observed, Z=X-θ⁰, θ⁰ being the true value of θ, which belies the independence assumption. The next step is to replace Z~N(0,1) by a random set S(Z) containing Z and to define a belief function bel() on the parameter space Θ by
bel(A|X) = P(X-S(Z)⊆A)
which induces a pseudo-measure on Θ derived from the distribution of an independent Z, since X is already observed. When Z~N(0,1), this distribution does not depend on θ⁰ the true value of θ… The next step is to choose the belief function towards a proper frequentist coverage, in the approximate sense that the probability that bel(A|X) be more than 1-α is less than α when the [arbitrary] parameter θ is not in A. And conversely. This property (satisfied when bel(A|X) is uniform) is called validity or exact inference by the authors: in my opinion, restricted frequentist calibration would certainly sound more adequate.
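To make the validity property concrete, here is a minimal simulation sketch for the N(θ,1) example. It rests on assumptions that are not spelled out above: the symmetric nested random set S = [-|Z'|, |Z'|] with Z'~N(0,1), and the assertion A = {θ ≠ θ⁰}, for which bel(A|x) = P(|Z'| < |x-θ⁰|) = |2Φ(x-θ⁰)-1|. The function names are hypothetical.

```python
import math
import random


def norm_cdf(z):
    """Standard normal cdf, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def belief_not_theta0(x, theta0):
    """bel(A|x) for A = {theta != theta0}.

    Assumes the nested random set S = [-|Z'|, |Z'|], Z' ~ N(0,1):
    x - S misses theta0 exactly when |Z'| < |x - theta0|.
    """
    return abs(2.0 * norm_cdf(x - theta0) - 1.0)


def validity_rate(theta0, alpha, n_sim=100_000, seed=0):
    """Frequency with which the false assertion A gets belief >= 1 - alpha
    when the data are generated from the true value theta0."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        x = theta0 + rng.gauss(0.0, 1.0)  # X = theta0 + Z
        if belief_not_theta0(x, theta0) >= 1.0 - alpha:
            hits += 1
    return hits / n_sim


if __name__ == "__main__":
    for alpha in (0.01, 0.05, 0.10):
        print(alpha, validity_rate(theta0=1.5, alpha=alpha))
```

Under these assumptions the belief in the false assertion is exactly uniform when X~N(θ⁰,1), so the estimated frequencies should sit close to α whatever the value of θ⁰.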
“When there is no prior information available, [the philosophical justifications for Bayesian analysis] are less than fully convincing.” (p.30)
“Is it logical that an improper “ignorance” prior turns into a proper “non-ignorance” prior when combined with some incomplete information on the whereabouts of θ?” (p.44)
Chapter 2 goes over standard probabilistic inference methods, such as Bayesian statistics, to address their shortcomings. By probabilistic inference, the authors mean a procedure that associates to subsets of Θ a number in [0,1]. Which excludes most classical “frequentist” procedures, except maybe p-values. Despite the above constraint on the belief function being inherently frequentist. Surprisingly the (objective) Bayesian approach is only considered through matching priors while the posterior distribution is a direct means to fit probabilistic inference. Incidentally, a proposal is made towards a new definition of non-informative priors, namely to gather conditional Jeffreys priors into a joint prior. Which only works in special cases since those conditional priors have no reason to be compatible. The book also criticises the lack of calibration of posterior probabilities, those being “quick and dirty” in Don Fraser’s sense. Again based on the postulate that a meaningful belief function is to have frequentist guarantees. A few pages later, a one-page section dismisses the Bayesian approach as unable to “represent the knowledge of ignorance” (p.39) by a tautological argument that defines ignorance as a measure being zero everywhere except on the entire space Θ and concluding that
“probability cannot properly describe the scientifically relevant position of having no genuine prior information.” (p.40)
which certainly paves the way for another paradigm, all “existing methods [being] a sort of approximation to something else”, this something else being in all modesty the inferential model approach supported by the authors.
“In this sense [of having no meaningful interpretation], fiducial inference (…) has some difficulties similar to those outlined for the objective Bayes [sic].” (p.32)
The treatment of fiducial inference is very terse, which is understandable when considering fiducial inference does not even have a generic definition, but surprising given the proximity with the current proposal. To wit, the starting point is the pivotal representation X=a(θ,Z) where Z is a pivotal quantity with a fixed distribution. For instance, Z~N(0,1) as in the previous chapter. The previous chapter was suggesting to keep processing Z as a N(0,1) variate despite the observation of X, while this chapter “continues to regard Z as a quantity” with the same distribution. (I never understood the suspension of belief represented by the inversion of the probabilities in this relationship X=a(θ,Z) as it does not turn θ into a random variable.)
“the IM approach (…) may well be that elusive “best possible inference” that existing methods are [only] approximating.” (p.48)
Chapter 3 and its 5 pages set the background for the authors’ theory, by introducing their own notion of validity, calibration, and efficiency. Validity means here (as already discussed earlier in the book) that the belief function is stochastically larger than a uniform variate, while efficiency means that “probabilistic inference should be made as efficient as possible” (p.47). Which does not sound like a definition to me. The chapter mostly repeats notions found in other chapters, but contains the pearl of wisdom that there is “conditioning on the observed data in Bayes’ formula only because [this] produces more efficient inference in the long-run”, making Bayes a precursor to IM.
Chapter 4 is based on the 2013 JASA paper by the authors. Besides a lot of repetitions from earlier chapters, it exposes more clearly the fact that IM expands fiducial inference in that it replaces the set of θ’s such that X=a(θ,Z) for a simulated Z with a set of θ’s such that X=a(θ,Z) for a collection of Z’s within a simulated random set S, the goal being to reach the desired stochastic domination for the belief function. (Conversely, fiducial inference appears as a special case when the random set is a singleton, see p.63.) The chapter also highlights the fact that the validity of the approach entirely rests upon the property of the support of the random set, made of nested sets, meaning the statistical model itself is of no relevance (see the proof of Theorem 4.2). Other interesting threads are the connection between optimality in the authors’ sense and test unbiasedness, and the lack of policy for choosing the distribution of random sets, except when trying to recover a classical procedure like a Neyman-Pearson UMP test. As recognised by the authors:
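The contrast between the fiducial singleton and a nested random set can be sketched by Monte Carlo in the N(θ,1) example, for an interval assertion A=(lo,hi). The nested set [-|Z|,|Z|] is an assumed choice made for illustration only, and belief_interval is a hypothetical helper, not something taken from the book.

```python
import random


def belief_interval(x, lo, hi, nested, n_draws=20_000, seed=1):
    """Monte Carlo estimate of bel(A|x) = P(x - S ⊆ A) for A = (lo, hi).

    nested=False: S = {Z}, the fiducial (singleton) case.
    nested=True:  S = [-|Z|, |Z|], an assumed nested random set.
    """
    rng = random.Random(seed)
    count = 0
    for _ in range(n_draws):
        z = rng.gauss(0.0, 1.0)
        if nested:
            r = abs(z)
            # x - S is the interval [x - r, x + r]
            inside = (lo < x - r) and (x + r < hi)
        else:
            # x - S is the single point x - z
            inside = lo < x - z < hi
        count += int(inside)
    return count / n_draws


if __name__ == "__main__":
    x, lo, hi = 0.0, -1.0, 3.0
    print("fiducial belief:", belief_interval(x, lo, hi, nested=False))
    print("nested-set belief:", belief_interval(x, lo, hi, nested=True))
```

Since the nested set always contains Z, the event that x - S falls inside A implies that x - Z does as well, so the nested-set belief can never exceed the fiducial probability computed from the same draws. This lower belief is the price paid for the stochastic domination property.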
“Admittedly, the final IM depends on the user’s choice of association and predictive random set, but we do not believe this is particularly daunting [because] neither a frequentist sampling distribution nor a Bayesian prior distribution adequately describes the source of uncertainty about θ” (p.75)
who still manage to turn this drawback into a positive feature!
“Exercise 5.5. A reasonable validity condition for the posterior is that its cdf, evaluated at the true θ is uniformly distributed as a function of x.” (p.102)
Chapter 5 expands on a 2012 paper of Ermini Leaf and Liu published in the International Journal of Approximate Reasoning. It returns to the validity of the random sets at the core of the method described in the previous chapter. Among other things, the sufficient condition of Theorem 4.1 becomes a definition of an admissible random set (which should be an admissible distribution of random sets), with a confusing typo in the coverage property (5.8) since it uses the wrong font for the support. Unfortunately, as the chapter is based on an earlier paper, additional notions like elastic belief are introduced, which makes connecting with the rest of the book more delicate. It is also unclear to me how introducing a parameter in the distribution of the random set (on the latent variable Z) helps in constructing a generic methodology of inferential models: choosing a collection of nested sets is already a challenge, but building a collection of such collections sounds beyond the manageable. (One of the arguments for using elastic versions is about constrained parameter sets, where the authors claim default non-informative priors do not work, “see Exercise 5.5”…)
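The validity condition quoted from Exercise 5.5 can be checked by simulation. The sketch below is only an illustration under assumed settings, which need not be those of the exercise: an N(θ,1) observation, a flat prior on the whole real line versus a flat prior restricted to θ ≥ 0, and a true value θ⁰=0.5. All function names are hypothetical.

```python
import math
import random


def norm_cdf(z):
    """Standard normal cdf via erfc, which is stable in the lower tail."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def posterior_cdf_flat(theta0, x):
    """Posterior cdf at theta0 under a flat prior on the real line:
    the posterior is N(x, 1)."""
    return norm_cdf(theta0 - x)


def posterior_cdf_constrained(theta0, x):
    """Posterior cdf at theta0 under a flat prior on [0, inf):
    the posterior is N(x, 1) truncated to theta >= 0."""
    return (norm_cdf(theta0 - x) - norm_cdf(-x)) / norm_cdf(x)


def fraction_below(cdf_fn, theta0, level, n_sim=20_000, seed=0):
    """Frequency, over X ~ N(theta0, 1), of the posterior cdf at theta0
    falling below `level`; a uniform variate would give `level`."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sim):
        x = theta0 + rng.gauss(0.0, 1.0)
        if cdf_fn(theta0, x) <= level:
            hits += 1
    return hits / n_sim


if __name__ == "__main__":
    for level in (0.1, 0.5, 0.9):
        print(level,
              fraction_below(posterior_cdf_flat, 0.5, level),
              fraction_below(posterior_cdf_constrained, 0.5, level))
```

In the unconstrained case the frequencies should match the nominal levels, since Φ(θ⁰-X) is exactly uniform. With the truncated posterior and a true value close to the boundary, they drift noticeably away from the nominal levels, which is the kind of miscalibration the exercise's condition is meant to flag.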
“…the definition of sufficiency implies that we can define a conditional association via, say, the minimal distribution of the minimal sufficient statistic.” (p.113)
Chapter 6 is about conditional inferential models, in connection with a 2015 JRSS Series B paper by the authors. To start with, I found a quote from the second author in a set of slides where he muses that “it may take years or centuries to complete the theory of CIMs”, where CIM stands for conditional IM. The goal of conditioning is to reduce the dimension of the predicted latent or auxiliary variable Z. With a strange argument that some functions of Z are actually observed. Or free of θ (p.108), which I thought was a prerequisite of the current method. And an idea found in much older approaches to statistics, like Jack Kiefer’s. (Surprisingly, the Borel paradox pops in most unexpectedly in the middle of a proof. And of page 110.) In one simple example of Student’s t variables with an unknown location θ (p.114), the resulting procedure is based on the MLE of θ, T(x), conditional on the value of x-T(x) which happens to be ancillary in this case.
At this stage of my reading, a deep and persistent IM fatigue started to settle in, namely, each time I would get back to the book and try to proceed further… The impression of reading the same thing over and over. And a lack of vision on the implementation of such lofty principles, as the random sets proposed by the authors always seemed to pop out of nowhere, as, e.g., in Example 4.1. And hence the choice of set seems to carry some (high) degree of arbitrariness. (A defect shared by fiducial inference, I believe.) But more deeply, the feeling that this self-promoted theory does not lead to a novel vision of statistical inference, since the conclusion is often to recover a classical procedure, when it exists. (To wit, the call to sufficiency in the above quote.) This fatigue means that to get on with it I went over the final chapters of the book with less in-depth reading than for the first chapters, hence possibly missing noteworthy gems.
Chapter 7 is about marginal IM (and another JASA paper), focussing on the elimination of nuisance parameters, where, as can be expected, a Bayesian approach is unsuited since “difficulties arise from the requirement of a prior” (p.125). When there exists an association variable that eliminates the nuisance parameter via a pivot, the solution falls back to the standard IM approach. The Fieller-Creasy problem (estimating the ratio of two normal means) allows for this setting, happily enough, albeit leading to unbounded confidence sets. When this is not the case (and how often is it the case?), the auxiliary variable is allowed to depend on the nuisance parameter, which now must satisfy stochastic ordering constraints for all values of this nuisance parameter (again, how likely is this?). In the Behrens-Fisher problem, when concentrating on the difference of two normal means, there exists such a construction, which requires some effort and makes me wonder anew how one can expect to extend the approach beyond those classical examples.
“The main challenge in the IM construction is the [association] step (…) outside a relatively simple class of problems, it may be difficult to identify such an association and/or justify the choice of any particular association.” (p.201)
Chapter 8 reassesses the Gaussian linear model, mostly recovering standard p-values and confidence intervals. Plus an extension to the mixed model where, presumably, the distribution of the random effect is perfectly acceptable as “real”, contrary to prior distributions! The marginalisation for the so-called heritability coefficient σ²/(σ²+τ²), the proportion of the variance due to the random effect, does require intense work over five pages, though. And does not give hints about similar derivations for other quantities.
“The take-away message is that the IM approach provides an easily implementable and general method for constructing meaningful prior-free probabilistic summaries of the information in observed data for inference or prediction.” (p.162)
Chapter 9 is about prediction (and a paper in Technometrics). Due to the fiducial flavour of the approach, the predictive for a future observation is deduced from the pseudo-posterior on the parameter constructed by IM. Unsurprisingly, as this future observation is a function of the parameter and of an auxiliary variable. In the normal example, this leads to the standard t confidence interval. For a binomial model, the solution is indistinguishable from the Jeffreys and fiducial solutions. Chapter 10 extends the IM scope to collections of assertions, most likely forecasting another extension to model choice since one section (10.5) is about variable selection. Since the “admissible” random set in that case is the hypercube on a t auxiliary (Corollary 10.1), the outcome is somewhat anticlimactic. Chapter 11 (and two more papers, including one in JASA) moves to generalised inferential models, although inferential models have just been introduced, to “relax the requirement that the association fully characterise [sic] the sampling model” (p.202). To bypass the requirement, the authors introduce a loss function, ℓ(y,θ), but with the strange interpretation of evaluating the fit of data y by the model with parameter θ (p.203). And mentioning the use of such losses in the recent literature as a substitute for likelihood derivation, but missing the Bayesian part, as in Bissiri et al. (2016). The association at the core of IM is then built upon an exponentiated relative loss that makes believable sets appear like highest likelihood regions. The argument in favour of this exotic construction is that this leads to “frequentist procedures with exact error rate control, independent of the model or sample size” (p.205), which sounds contradictory with the association construction of the previous page depending on the cdf of the exponentiated relative loss.
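The dependence on the sampling model can be seen in a small sketch of a plausibility built from the cdf of an exponentiated relative loss. This is one possible reading of the construction, not the book's code: it assumes a squared-error loss, an N(θ,1) sampling model, and a Monte Carlo approximation of the cdf, and every name below is hypothetical.

```python
import math
import random


def loss(ys, theta):
    """Assumed loss: half the sum of squared errors."""
    return 0.5 * sum((y - theta) ** 2 for y in ys)


def relative_loss_stat(ys, theta):
    """Exponentiated relative loss exp(-(loss(theta) - min loss)), in (0, 1]."""
    theta_hat = sum(ys) / len(ys)  # minimiser of the squared-error loss
    gap = max(0.0, loss(ys, theta) - loss(ys, theta_hat))
    return math.exp(-gap)


def plausibility(ys, theta, n_sim=5_000, seed=0):
    """Monte Carlo estimate of the cdf of the statistic, evaluated at its
    observed value. The simulation step draws new samples from N(theta, 1):
    this is where the sampling model enters the construction."""
    rng = random.Random(seed)
    t_obs = relative_loss_stat(ys, theta)
    n = len(ys)
    count = 0
    for _ in range(n_sim):
        sim = [theta + rng.gauss(0.0, 1.0) for _ in range(n)]
        if relative_loss_stat(sim, theta) <= t_obs:
            count += 1
    return count / n_sim


if __name__ == "__main__":
    ys = [-1.0, 1.0, -0.5, 0.5]
    for theta in (0.0, 0.5, 0.98, 2.0):
        print(theta, plausibility(ys, theta))
```

In this toy setting the statistic reduces to exp(-n(ȳ-θ)²/2), so the set of θ's with plausibility above α is the usual z-interval around ȳ, in line with believable sets looking like highest likelihood regions.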
“We interpret Efron’s comment [on the unresolved issue of prior construction in the absence of prior information] as a challenge to develop a framework that provides meaningful probabilistic inference in the absence of prior information, therefore bridging the gap between frequentist and Bayesian thinking and solidifying the foundations of statistical inference.” (p.229)
Chapter 12 is a conclusion opening future research topics on IM, with a top-ten to-do list (and presumably some of it already found in incoming papers by the authors). From p≫n, to model comparison, to computational implementation, to optimality in non-regular marginal IMs, to non-parametric EM. The last item mentions MCM, which left me perplexed. Concluding with a thank-you to the persistent reader. And with the same grandiose tone with which the book had started.
As must be obvious from the above ranting, I am fairly flabbergasted by this book and by the positive reception this theory got from our major journals. Reading through the book did not show me how this IM approach would help in solving new problems or in bringing better understanding of existing problems, compared with more traditional approaches. I would thus be genuinely grateful to anyone perceiving the added value of this approach to comment on this review and enlighten me! Since most of the material is contained in the referenced papers, I would otherwise suggest that any potentially interested reader first check those papers. Or some of those.