## inferential models: reasoning with uncertainty [book review]

Posted in Books, Statistics, University life on October 6, 2016 by xi'an

“the field of statistics (…) is still surprisingly underdeveloped (…) the subject lacks a solid theory for reasoning with uncertainty [and] there has been very little progress on the foundations of statistical inference” (p.xvi)

A book that starts with such massive assertions is certainly hoping to attract some degree of attention from the field, and is likely to induce strong reactions to this dismissal of the not inconsiderable amount of research dedicated so far to statistical inference, and in particular to its foundations. Or even to attract flak for not accounting (in this introduction) for the past work of major statisticians, like Fisher, Kiefer, Lindley, Cox, Berger, Efron, Fraser and many many others…. Judging from the references and the tone of this 254-page book, it seems like the two authors, Ryan Martin and Chuanhai Liu, truly aim at single-handedly resetting the foundations of statistics to their own tune, which sounds like a new kind of fiducial inference augmented with calibrated belief functions. Be warned that five chapters of this book are built on as many papers written by the authors in the past three years. Which makes me question, if I may, the relevance of publishing a book on a brand-new approach to statistics without further backup from a wider community.

“…it is possible to calibrate our belief probabilities for a common interpretation by intelligent minds.” (p.14)

Chapter 1 contains a description of the new perspective in Section 1.4.2, which I find useful to detail here. When given an observation x from a Normal N(θ,1) model, the authors rewrite X as θ+Z, with Z~N(0,1), as in fiducial inference, and then want to find a “meaningful prediction of Z independently of X”. This seems difficult to accept given that, once X=x is observed, Z=X-θ⁰, θ⁰ being the true value of θ, which belies the independence assumption. The next step is to replace Z~N(0,1) by a random set S(Z) containing Z and to define a belief function bel() on the parameter space Θ by

bel(A|X) = P(X-S(Z)⊆A)

which induces a pseudo-measure on Θ derived from the distribution of an independent Z, since X is already observed. When Z~N(0,1), this distribution does not depend on θ⁰, the true value of θ… The next step is to choose the belief function towards proper frequentist coverage, in the approximate sense that the probability that bel(A|X) exceeds 1-α is less than α when the [arbitrary] parameter θ is not in A. And conversely. This property (satisfied when bel(A|X) is uniform) is called validity or exact inference by the authors: in my opinion, restricted frequentist calibration would certainly sound more adequate.
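For concreteness, the belief function in this Normal example can be approximated by simulation; the sketch below assumes the symmetric "default" random set S(Z̃) = [−|Z̃|, |Z̃|] and an interval assertion A = [a, b], both my own choices for illustration, the book considering more general random sets:

```python
import numpy as np

rng = np.random.default_rng(0)

def belief(x, a_lo, a_hi, n=100_000):
    """Monte Carlo sketch of bel(A|x) for A = [a_lo, a_hi], with the
    'default' symmetric random set S(Z) = [-|Z|, |Z|] (an illustrative
    choice, not the only one considered in the book)."""
    z = rng.standard_normal(n)
    lo, hi = x - np.abs(z), x + np.abs(z)     # the focal set x - S(z)
    return np.mean((lo >= a_lo) & (hi <= a_hi))

def plausibility(x, a_lo, a_hi, n=100_000):
    """Companion plausibility: the focal set merely intersects A."""
    z = rng.standard_normal(n)
    lo, hi = x - np.abs(z), x + np.abs(z)
    return np.mean((hi >= a_lo) & (lo <= a_hi))
```

With A = [−1, 1] and x = 0, bel(A|x) recovers P(|Z| ≤ 1) ≈ 0.68, while the plausibility of the same assertion is essentially 1, illustrating the belief/plausibility gap carried by the random set.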

“When there is no prior information available, [the philosophical justifications for Bayesian analysis] are less than fully convincing.” (p.30)

“Is it logical that an improper “ignorance” prior turns into a proper “non-ignorance” prior when combined with some incomplete information on the whereabouts of θ?” (p.44)

## Bayes’ Theorem in the 21st Century, really?!

Posted in Books, Statistics on June 20, 2013 by xi'an

“In place of past experience, frequentism considers future behavior: an optimal estimator is one that performs best in hypothetical repetitions of the current experiment. The resulting gain in scientific objectivity has carried the day…”

Julien Cornebise sent me this Science column by Brad Efron about Bayes' theorem. I am a tad surprised that it got published in the journal, given that it does not really contain any new item of information. However, as I am unfamiliar with Science, it may be that the journal also publishes opinions or warnings from major scientists, a label that fits this column. (It is quite a proper coincidence that the post appears during Bayes 250.)

Efron’s piece centres upon the use of objective Bayes approaches in Bayesian statistics, for which Laplace was “the prime violator”. He argues through examples that noninformative “Bayesian calculations cannot be uncritically accepted, and should be checked by other methods”, which usually means “frequentistically”. First, having to write “frequentistically” once is already more than I can stand! Second, using the Bayesian framework to build frequentist procedures is like buying top technical outdoor gear to climb the stairs at the Sacré-Coeur on Butte Montmartre! The naïve reader is then left clueless as to why one should use a Bayesian approach in the first place. And perfectly confused about the meaning of objectivity. Esp. given the above quote! I find it rather surprising that this old saw of a claim of frequentism to objectivity resurfaces there. There is an infinite range of frequentist procedures and, while some are more optimal than others, none is “the” optimal one (except for the most baked-out examples, like, say, the estimation of the mean of a normal observation).

“A Bayesian FDA (there isn’t one) would be more forgiving. The Bayesian posterior probability of drug A’s superiority depends only on its final evaluation, not whether there might have been earlier decisions.”

The second criticism of Bayesianism therein is the counter-intuitive irrelevance of stopping rules. Once again, the presentation is fairly biased, because a Bayesian approach opposes scenarii rather than evaluates the likelihood of a tail event under the null and only the null. And also because, as shown by Jim Berger and co-authors, the Bayesian approach is generally much more favorable to the null than the p-value.

“Bayes’ Theorem is an algorithm for combining prior experience with current evidence. Followers of Nate Silver’s FiveThirtyEight column got to see it in spectacular form during the presidential campaign: the algorithm updated prior poll results with new data on a daily basis, nailing the actual vote in all 50 states.”

It is only fair that Nate Silver’s book and column are mentioned in Efron’s column. Because it is a highly valuable and definitely convincing illustration of Bayesian principles. What I object to is the criticism “that most cutting-edge science doesn’t enjoy FiveThirtyEight-level background information”. In my understanding, the poll model of FiveThirtyEight built up in a sequential manner a weight system over the different polling companies, hence learning from the data, in a Bayesian manner, about their reliability (rather than forgetting the past). This is actually what caused Larry Wasserman to consider that Silver’s approach was actually more frequentist than Bayesian…

“Empirical Bayes is an exciting new statistical idea, well-suited to modern scientific technology, saying that experiments involving large numbers of parallel situations carry within them their own prior distribution.”

My last point of contention is about the (unsurprising) defence of the empirical Bayes approach in the Science column. Once again, the presentation is biased towards frequentism: in the FDR gene example, the empirical Bayes procedure is motivated by being the frequentist solution. The logical contradiction in “estimat[ing] the relevant prior from the data itself” is not discussed, and the conclusion that Brad Efron uses “empirical Bayes methods in the parallel case [in the absence of prior information]”, seemingly without being cautious and “uncritically”, does not strike me as the proper last argument in the matter! Nor does it give a 21st Century vision of what nouveau Bayesianism should be, faced with the challenges of Big Data and the like…

Posted in Statistics, University life on November 29, 2012 by xi'an

Another read today, and not from JRSS B for once, namely Efron‘s (an)other look at the Jackknife, i.e. the 1979 bootstrap classic published in the Annals of Statistics. My Master students in the Reading Classics Seminar course thus listened today to Marco Brandi’s presentation, whose (Beamer) slides are here.

In my opinion this was an easier paper to discuss, more because of its visible impact than because of the paper itself, where the comparison with the jackknife procedure does not sound so relevant nowadays. The paper is again mostly algorithmic, requiring some background on how it impacted the field. Even though Marco also went through Don Rubin’s Bayesian bootstrap and Michael Jordan’s bag of little bootstraps, he struggled to get away from the technicalities towards the intuition and the relevance of the method. The Bayesian bootstrap extension was quite interesting in that we discussed at length the connections with Dirichlet priors and the lack of parameters, which sounded quite antagonistic to Bayesian principles. However, at the end of the day, I feel that this foundational paper was not explored in proportion to its depth and that it would be worth another visit.

## Who’s #1?

Posted in Books, Kids, Statistics, University life on May 2, 2012 by xi'an

First, apologies for this teaser of a title! This post is not about who is #1 in whatever category you can think of, from statisticians to climbs [the Eiger Nordwand, to be sure!], to runners (Gebrselassie?), to books… (My daughter simply said “c’est moi!” when she saw the cover of this book on my desk.) So this is in fact a book review of… a book with this catchy title, which I received a month or so ago!

“We decided to forgo purely statistical methodology, which is probably a disappointment to the hardcore statisticians.” A.N. Langville & C.D. Meyer, Who’s #1? The Science of Rating and Ranking (page 225)

This book may be one of the most boring ones I have had to review so far! The reason for this disgruntled introduction to “Who’s #1? The Science of Rating and Ranking” by Langville and Meyer is that it has very little if anything to do with statistics and modelling. (And also that it is mostly about American football, a sport I am not even remotely interested in.) The purpose of the book is to present ways of building rating and ranking within a population, based on pairwise numerical connections between some members of this population. The methods abound, at least eight are covered by the book, but they all suffer from the same drawback: they are connected to no grand truth, to no parameter from an underlying probabilistic model, to no loss function that would measure the impact of a “wrong” rating. (The closest it comes to this is when discussing spread betting in Chapter 9.) It is thus a collection of transformation rules, from matrices to ratings. I find this all the more disappointing in that there exists a branch of statistics called ranking and selection that specializes in this kind of problem, and that statistics in sports is a quite active branch of our profession, witness the numerous books by Jim Albert. (Not to mention Efron’s analysis of baseball data in the 70’s.)

“First suppose that in some absolutely perfect universe there is a perfect rating vector.” A.N. Langville & C.D. Meyer, Who’s #1? The Science of Rating and Ranking (page 117)

The style of the book is disconcerting at first, and then some, as it sounds written partly from Internet excerpts (at least for most of the pictures) and partly from local student dissertations… The mathematical level varies widely, in that the authors take pains to define what a matrix is (page 33), only to jump to the Perron-Frobenius theorem a few pages later (page 36). The book also mentions Laplace’s succession rule (only justified as a shrinkage towards the center, i.e. away from 0 and 1), the Sinkhorn-Knopp theorem, the traveling salesman problem, Arrow and Condorcet, relaxation and evolutionary optimization, and even Kendall’s and Spearman’s rank tests (Chapter 16), even though no statistical model is involved. (Nothing as terrible as the completely inappropriate use of Spearman’s rho coefficient in one of Belfiglio’s studies…)

“Since it is hard to say which ranking is better, our point here is simply that different methods can produce vastly different rankings.” A.N. Langville & C.D. Meyer, Who’s #1? The Science of Rating and Ranking (page 78)

I also find irritating the association of “science” with “rating”, because the techniques presented in this book are simply tricks to turn pairwise comparisons into a general ordering of a population, nothing to do with uncovering ruling principles explaining the differences between the individuals. Since there is no validation for one ordering against another, we can see no rationality in proposing any of those, except to set a convention. The fascination of the authors with the Markov chain approach to the ranking problem is difficult to fathom, as the underlying structure is not dynamical (there is no evolving ranking along games in this book) and the Markov transition matrix is just constructed to derive a stationary distribution, inducing a particular “Markov” ranking.
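To illustrate that construction, here is a minimal sketch of one such “Markov” rating, in the variant where each team “votes” for the teams that beat it (my reading of the method; the book covers several variants of the transition matrix):

```python
import numpy as np

def markov_rating(wins):
    """wins[i, j] = number of times item i beat item j.  Build a chain
    where each item votes for the items that beat it (uniformly if it
    is undefeated) and return the stationary distribution as ratings."""
    n = wins.shape[0]
    losses = wins.T.astype(float)             # losses[i, j]: i lost to j
    row = losses.sum(axis=1, keepdims=True)
    P = np.where(row > 0,
                 losses / np.where(row == 0, 1.0, row),
                 np.ones((n, n)) / n)         # undefeated: vote uniformly
    # stationary distribution = leading left eigenvector of P
    vals, vecs = np.linalg.eig(P.T)
    pi = np.real(vecs[:, np.argmax(np.real(vals))])
    return np.abs(pi) / np.abs(pi).sum()
```

With two teams and team 0 winning both encounters, the stationary distribution is (2/3, 1/3): team 0 comes out on top, but the numbers themselves carry no model-based interpretation, which is precisely the point above.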

“The Elo rating system is the epitome of simple elegance.” A.N. Langville & C.D. Meyer, Who’s #1? The Science of Rating and Ranking (page 64)

An interesting input of the book is its description of the Elo ranking system used in chess, of which I did not know anything apart from its existence. Once again, there is a high degree of arbitrariness in the construction of the ranking, whose sole goal is to provide a convention upon which most people agree. A convention, mind, not a representation of truth! (This chapter contains a section on the Social Network movie, where a character writes a logistic transform on a window, missing the exponent. This should remind Andrew of someone he often refers to in his blog!)
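For the record, the Elo update itself fits in a few lines; the sketch below uses the standard logistic expected score with the usual 400-point scale and K = 32, a common but, again, purely conventional choice:

```python
def elo_update(r_a, r_b, score_a, k=32):
    """One Elo update.  score_a is 1.0 for a win, 0.5 for a draw, 0.0
    for a loss by player A.  k = 32 is a conventional, arbitrary gain."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta          # zero-sum by construction
```

Two equally rated players exchange exactly K/2 points when one wins, and the total rating mass is conserved: the arbitrariness lies entirely in the logistic scale and in K.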

“Perhaps the largest lesson is not to put an undue amount of faith in anyone’s rating.” A.N. Langville & C.D. Meyer, Who’s #1? The Science of Rating and Ranking (page 125)

In conclusion, I see little point in suggesting reading this book, unless one is interested in matrix optimization problems and/or illustrations in American football… Or unless one wishes to write a statistics book on the topic!

## empirical Bayes (CHANCE)

Posted in Books, Statistics, University life on April 23, 2012 by xi'an

As I decided to add a vignette on empirical Bayes methods to my review of Brad Efron’s Large-scale Inference in the next issue of CHANCE [25(3)], here it is.

Empirical Bayes methods can crudely be seen as the poor man’s Bayesian analysis. They start from a Bayesian modelling, for instance the parameterised prior

$x\sim f(x|\theta)\,,\quad \theta\sim\pi(\theta|\alpha)$

and then, instead of setting α to a specific value or of assigning an hyperprior to this hyperparameter α, as in a regular or a hierarchical Bayes approach, the empirical Bayes paradigm consists in estimating α from the data. Hence the “empirical” label. The reference model used for the estimation is the integrated likelihood (or conditional marginal)

$m(x|\alpha) = \int f(x|\theta) \pi(\theta|\alpha)\,\text{d}\theta$

which defines a distribution density indexed by α and thus allows for the use of any statistical estimation method (moments, maximum likelihood or even Bayesian!). A classical example is provided by the normal exchangeable sample: if

$x_i\sim \mathcal{N}(\theta_i,\sigma^2)\qquad \theta_i\sim \mathcal{N}(\mu,\tau^2)\quad i=1,\ldots,p$

then, marginally,

$x_i \sim \mathcal{N}(\mu,\tau^2+\sigma^2)$

and μ can be estimated by the empirical average of the observations. The next step in an empirical Bayes analysis is to act as if α had not been estimated from the data and to conduct a regular Bayesian processing of the data with this estimated prior distribution. In the above normal example, this means estimating the θi‘s by

$\dfrac{\sigma^2 \bar{x} + \tau^2 x_i}{\sigma^2+\tau^2}$

with the characteristic shrinkage (to the average) property of the resulting estimator (Efron and Morris, 1973).
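In code, the whole empirical Bayes step for this normal example fits in a few lines, using a simple moment estimator for τ² (one of several possible choices, maximum likelihood being another):

```python
import numpy as np

def eb_shrinkage(x, sigma2):
    """Empirical Bayes estimates of the theta_i in the exchangeable
    normal model: estimate mu by the sample mean and tau^2 by a moment
    estimator from the marginal variance tau^2 + sigma^2, then shrink
    each x_i toward the estimated mean."""
    x = np.asarray(x, dtype=float)
    mu_hat = x.mean()
    tau2_hat = max(x.var(ddof=1) - sigma2, 0.0)   # truncated at zero
    w = sigma2 / (sigma2 + tau2_hat)              # shrinkage weight
    return w * mu_hat + (1 - w) * x
```

The returned values are exactly (σ²x̄ + τ̂²xᵢ)/(σ²+τ̂²): when the data are very dispersed relative to σ², τ̂² is large and the shrinkage is mild; when they are concentrated, the estimates collapse onto the mean.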

“…empirical Bayes isn’t Bayes.” B. Efron (p.90)

While using Bayesian tools, this technique is outside of the Bayesian paradigm for several reasons: (a) the prior depends on the data, hence it lacks foundational justifications; (b) the prior varies with the data, hence it lacks theoretical validations like Wald’s complete class theorem; (c) the prior uses the data once, hence the posterior uses the data twice (see the vignette about this sin in the previous issue); (d) the prior relies on an estimator, whose variability is not accounted for in the subsequent analysis (Morris, 1983). The original motivation for the approach (Robbins, 1955) was more non-parametric, however it gained popularity in the 70’s and 80’s, both in conjunction with the Stein effect and as a practical means of bypassing complex Bayesian computations. As illustrated by Efron’s book, it recently met with renewed interest in connection with multiple testing.

## Large-scale Inference

Posted in Books, R, Statistics, University life on February 24, 2012 by xi'an

Large-scale Inference by Brad Efron is the first IMS Monograph in this new series, coordinated by David Cox and published by Cambridge University Press. Since I read this book immediately after Cox and Donnelly’s Principles of Applied Statistics, I was thinking of drawing a parallel between the two books. However, while neither of them can be classified as a textbook [even though Efron’s has exercises], they differ very much in their intended audience and their purpose. As I wrote in the review of Principles of Applied Statistics, that book has an encompassing scope, with the goal of covering all the methodological steps required by a statistical study. In Large-scale Inference, Efron focuses on empirical Bayes methodology for large-scale inference, by which he mostly means multiple testing (rather than, say, data mining). As a result, the book is centred on mathematical statistics and is more technical. (Which does not mean it is less of an exciting read!) The book was recently reviewed by Jordi Prats for Significance. Akin to the previous reviewer, and unsurprisingly, I found the book nicely written, with a wealth of R (colour!) graphs (the R programs and datasets are available on Brad Efron’s home page).

I have perhaps abused the “mono” in monograph by featuring methods from my own work of the past decade.” (p.xi)

Sadly, I cannot remember if I read my first Efron paper via his 1977 introduction to the Stein phenomenon with Carl Morris in Pour la Science (the French translation of Scientific American) or through his 1983 Pour la Science paper with Persi Diaconis on computer intensive methods. (I would bet on the latter, though.) In any case, I certainly read a lot of Efron’s papers on the Stein phenomenon during my thesis, and it was thus with great pleasure that I saw he introduced empirical Bayes notions through the Stein phenomenon (Chapter 1). It actually took me a while, but I eventually (by page 90) realised that empirical Bayes was a proper subtitle to Large-Scale Inference, in that the large samples give some weight to the validation of empirical Bayes analyses. In the sense of reducing the importance of a genuine Bayesian modelling (even though I do not see why this genuine Bayesian modelling could not be implemented in the cases covered in the book).

Large N isn’t infinity and empirical Bayes isn’t Bayes.” (p.90)

The core of Large-scale Inference is multiple testing and the empirical Bayes justification/construction of Fdr’s (false discovery rates). Efron wrote more than a dozen papers on this topic, covered in the book and building on the groundbreaking and highly cited Series B 1995 paper by Benjamini and Hochberg. (In retrospect, it should have been a Read Paper, and it was indeed made a “retrospective read paper” by the Research Section of the RSS.) Fdr’s are essentially posterior probabilities and therefore open to empirical Bayes approximations when priors are not selected. Before reaching the concept of Fdr’s in Chapter 4, Efron goes over earlier procedures for removing multiple testing biases. As shown by a section title (“Is FDR Control “Hypothesis Testing”?”, p.58), one major point in the book is that an Fdr is more of an estimation procedure than a significance-testing object. (This is not a surprise from a Bayesian perspective, since the posterior probability is an estimate as well.)
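The Benjamini-Hochberg step-up procedure at the heart of this literature can be sketched in a few lines (a minimal version of the 1995 procedure itself, not of Efron’s empirical-null refinements):

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.10):
    """Benjamini-Hochberg step-up procedure: reject the k smallest
    p-values, where k is the largest index with p_(k) <= alpha * k / N.
    Returns a boolean rejection vector in the original order."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)
    thresh = alpha * np.arange(1, n + 1) / n
    below = p[order] <= thresh
    reject = np.zeros(n, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])     # largest index passing
        reject[order[: k + 1]] = True
    return reject
```

Note how the data-dependent threshold αk/N already reads like an estimated quantity, which is consistent with the point above that FDR control is closer to estimation than to significance testing.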

Scientific applications of single-test theory most often suppose, or hope for rejection of the null hypothesis (…) Large-scale studies are usually carried out with the expectation that most of the N cases will accept the null hypothesis.” (p.89)

On the innovations proposed by Efron and described in Large-scale Inference, I particularly enjoyed the notions of local Fdrs in Chapter 5 (essentially plug-in posterior probabilities that a given observation stems from the null component of the mixture) and of the (Bayesian) improvement brought by empirical null estimation in Chapter 6 (“not something one estimates in classical hypothesis testing”, p.97) and the explanation for the inaccuracy of the bootstrap (which “stems from a simpler cause”, p.139), but found less crystal-clear the empirical evaluation of the accuracy of Fdr estimates (Chapter 7, “independence is only a dream”, p.113), maybe in relation with my early-career inability to explain Morris’s (1983) correction for empirical Bayes confidence intervals (pp. 12-13). I also discovered the notion of enrichment in Chapter 9, with permutation tests resembling some low-key bootstrap, and multiclass models in Chapter 10, which appear as if they could benefit from a hierarchical Bayes perspective. The last chapter happily concludes with one of my preferred stories, namely the missing species problem (on which I hope to work this very Spring).

## Bayesian inference and the parametric bootstrap

Posted in R, Statistics, University life on December 16, 2011 by xi'an

This paper by Brad Efron came to my knowledge when I was looking for references on Bayesian bootstrap to answer a Cross Validated question. After reading it more thoroughly, “Bayesian inference and the parametric bootstrap” puzzles me, which most certainly means I have missed the main point. Indeed, the paper relies on parametric bootstrap—a frequentist approximation technique mostly based on simulation from a plug-in distribution and a robust inferential method estimating distributions from empirical cdfs—to assess (frequentist) coverage properties of Bayesian posteriors. The manuscript mixes a parametric bootstrap simulation output for posterior inference—even though bootstrap produces simulations of estimators while the posterior distribution operates on the parameter space, those estimator simulations can nonetheless be recycled as parameter simulations by a genuine importance sampling argument—and the coverage properties of Jeffreys posteriors vs. the BCa [which stands for bias-corrected and accelerated, see Efron 1987] confidence density—which truly take place in different spaces. Efron however connects both spaces by taking advantage of the importance sampling connection and defines a corrected BCa prior to make the confidence intervals match. In my opinion this does not define a prior in the Bayesian sense, since the correction seems to depend on the data. And I see no strong incentive to match the frequentist coverage, because this would furthermore define a new prior for each component of the parameter. This study of the frequentist properties of Bayesian credible intervals reminded me of the recent discussion paper by Don Fraser on the topic, which follows the same argument that Bayesian credible regions are not necessarily good frequentist confidence intervals.
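To fix ideas, here is a minimal sketch of the recycling argument in the toy case of a normal mean with known unit variance, where the bootstrap distribution of the MLE is available in closed form and the importance weights reduce to the prior density (the normal prior values below are my own illustrative choices, not Efron’s; his paper works within exponential families):

```python
import numpy as np

rng = np.random.default_rng(1)

def bootstrap_posterior_mean(x, prior_mean=0.0, prior_var=100.0, b=50_000):
    """Recycle parametric-bootstrap replications theta* ~ N(xbar, 1/n)
    (the plug-in distribution of the MLE for x_i ~ N(theta, 1)) into a
    posterior sample for a N(prior_mean, prior_var) prior, via
    self-normalised importance sampling."""
    x = np.asarray(x, dtype=float)
    n, xbar = x.size, x.mean()
    theta = rng.normal(xbar, np.sqrt(1 / n), size=b)   # bootstrap draws
    # log target (posterior up to a constant) minus log proposal:
    # here the likelihood terms cancel, so the weight is the prior
    log_w = -0.5 * (theta - prior_mean) ** 2 / prior_var
    w = np.exp(log_w - log_w.max())
    w /= w.sum()                                       # self-normalise
    return np.sum(w * theta)
```

In this conjugate toy case the exact posterior mean is (n·x̄ + m₀/v₀)/(n + 1/v₀), so the reweighted bootstrap estimate can be checked directly; the infinite-variance worry raised below concerns settings where the posterior tails are heavier than the bootstrap proposal’s, which this well-behaved example does not exhibit.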

The conclusion of the paper is made of several points, some of which may not be strongly supported by the previous analysis:

1. “The parametric bootstrap distribution is a favorable starting point for importance sampling computation of Bayes posterior distributions.” [I am not so certain about this point, given that the bootstrap is based on a plug-in estimate, hence fails to account for the variability of this estimate, and may thus induce infinite variance behaviour, as in the harmonic mean estimator of Newton and Raftery (1994). Because the tails of the importance density are those of the likelihood, the heavier tails of the posterior induced by the convolution with the prior distribution are likely to lead to this fatal misbehaviour of the importance sampling estimator.]
2. “This computation is implemented by reweighting the bootstrap replications rather than by drawing observations directly from the posterior distribution as with MCMC.” [Computing the importance ratio requires the availability both of the likelihood function and of the likelihood estimator, which means a setting where Bayesian computations are not particularly hindered and do not necessarily call for advanced MCMC schemes.]
3. “The necessary weights are easily computed in exponential families for any prior, but are particularly simple starting from Jeffreys invariant prior, in which case they depend only on the deviance difference.” [Always from a computational perspective, the ease of computing the importance weights is mirrored by the ease in handling the posterior distributions.]
4. “The deviance difference depends asymptotically on the skewness of the family, having a cubic normal form.” [No relevant comment.]
5. “In our examples, Jeffreys prior yielded posterior distributions not much different than the unweighted bootstrap distribution. This may be unsatisfactory for single parameters of interest in multi-parameter families.” [The frequentist confidence properties of Jeffreys priors have already been examined in the past and found to be lacking in multidimensional settings. This is an assessment finding Jeffreys priors lacking from a frequentist perspective. However, the use of Jeffreys priors is not justified on this particular ground.]
6. “Better uninformative priors, such as the Welch and Peers family or reference priors, are closely related to the frequentist BCa reweighting formula.” [The paper only finds proximities in two examples, but it does not assess this relation in a wider generality. Again, this is not particularly relevant from a Bayesian viewpoint.]
7. “Because of the i.i.d. nature of bootstrap resampling, simple formulas exist for the accuracy of posterior computations as a function of the number B of bootstrap replications. Even with excessive choices of B, computation time was measured in seconds for our examples.” [This is not very surprising. It however assesses Bayesian procedures from a frequentist viewpoint, so this may be lost on both Bayesian and frequentist users…]
8. “An efficient second-level bootstrap algorithm (“bootstrap-after-bootstrap”) provides estimates for the frequentist accuracy of Bayesian inferences.” [This is completely correct and why bootstrap is such an appealing technique for frequentist inference. I spent the past two weeks teaching non-parametric bootstrap to my R class and the students are now fluent with the concept, even though they are unsure about the meaning of estimation and testing!]
9. “This can be important in assessing inferences based on formulaic priors, such as those of Jeffreys, rather than on genuine prior experience.” [Again, this is neither very surprising nor particularly appealing to Bayesian users.]

In conclusion, I found the paper quite thought-provoking and stimulating, definitely opening new vistas in a very elegant way. I however remain unconvinced by the simulation aspects from a purely Monte Carlo perspective.