beware, nefarious Bayesians threaten to take over frequentism using loss functions as Trojan horses!

“It is not a coincidence that textbooks written by Bayesian statisticians extol the virtue of the decision-theoretic perspective and then proceed to present the Bayesian approach as its natural extension.” (p.19)

“According to some Bayesians (see Robert, 2007), the risk function does represent a legitimate frequentist error because it is derived by taking expectations with respect to [the sampling density]. This argument is misleading for several reasons.” (p.18)

During my R exam, I read the recent arXiv posting by Aris Spanos on why “the decision theoretic perspective misrepresents the frequentist viewpoint”. The paper is entitled “Why the Decision Theoretic Perspective Misrepresents Frequentist Inference: ‘Nuts and Bolts’ vs. Learning from Data” and I found it at the very least puzzling… The main theme is the one caricatured in the title of this post, namely that the decision-theoretic analysis of frequentist procedures is a trick devised by Bayesians to justify their own procedures. The fundamental argument behind this perspective is that decision theory operates in a “for all θ” frame of reference while frequentist inference (in Spanos’ universe) is only concerned with one θ, the true value of the parameter. (Incidentally, the “nuts and bolts” refers to the only case when a decision-theoretic approach is relevant from a frequentist viewpoint, namely in factory quality control sampling.)

“The notions of a risk function and admissibility are inappropriate for frequentist inference because they do not represent legitimate error probabilities.” (p.3)

“An important dimension of frequentist inference that has not been adequately appreciated in the statistics literature concerns its objectives and underlying reasoning.” (p.10)

“The factual nature of frequentist reasoning in estimation also brings out the impertinence of the notion of admissibility stemming from its reliance on the quantifier ‘for all’.” (p.13)

One strange feature of the paper is that Aris Spanos seems to appropriate for himself the notion of frequentism, rejecting the choices made by (what I would call frequentist) pioneers like Wald, Neyman, “Lehmann and LeCam [sic]”, Stein. Apart from Fisher (and the paper is strongly grounded in neo-Fisherian revivalism), the only frequentists seemingly finding grace in the eyes of the author are George Box, David Cox, and George Tiao. (The references are mostly to textbooks, incidentally.) Modern authors who clearly qualify as frequentists, like Bickel, Donoho, Johnstone, or, to mention the French school, Birgé, Massart, Picard, Tsybakov, none of whom can be suspected of Bayesian inclinations!, do not appear to satisfy those narrow tenets of frequentism either. Furthermore, the concept of frequentist inference is never clearly defined within the paper. As in the above quote, the notion of “legitimate error probabilities” pops up repeatedly (15 times) within the whole manifesto without being explicitly defined. (The closest to a definition is found on page 17, where the significance level and the p-value are found to be legitimate.) Aris Spanos even rejects what I would call the von Mises basis of frequentism, namely that a statistical procedure can be evaluated on its long-term performance: “contrary to Bayesian claims, those error probabilities have nothing to do with the temporal or the physical dimension of the long-run metaphor associated with repeated samples” (p.17)…

“The primary objective of frequentist inference is to learn from data x0 about the `true’ generating mechanism, described in terms of a particular (true) value of θ” (p.3)

“The quantifier ‘for all θ’ is inappropriate for evaluating frequentist inference procedures because their primary objective is to learn from data about the true value [of] θ! What matters for a good frequentist procedure is not its behavior for all possible values θ, but how well it does in shedding light on the true value [of] θ.” (p.19)

“What is less surprisingly [sic] is that Bayesian textbook writers, like Robert (2007) (…) invariably adopt the definition with the quantifier ‘for all θ’.” (p.12)

This opposition of universal and existential quantifiers in decision-theoretic and frequentist perspectives is equally puzzling: when I consider a risk function as a function of θ, the parameter taking any value over the parameter space, I assume that the true value of the parameter can be anything. (Even though Spanos considers that this very reasoning adds “insult to injury” on page 14.) There is no arguing about the fact that both Bayesian and frequentist goals are to characterise with the utmost precision the true value of the parameter. I thus find the criticism incomprehensible. The ‘for all θ’ found in decision-theoretic texts, including my book, simply means that the procedure is evaluated for all possible values of the (true) parameter… Similarly, to state that the unbiasedness definition using the ‘for all’ quantifier “makes no sense in frequentist estimation” (p.11) sounds like cheap rhetoric: the unbiasedness equation

\mathbb{E}_\theta[\hat\theta(X)]=\theta\text{ for all }\theta

means that when the (true) parameter is θ (as expressed by the index of the expectation, missing from the original paper!), the equation holds.
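To make this reading of the quantifier explicit, the frequentist risk can be written out as below. This is only a sketch in standard decision-theoretic notation: the loss L, the estimator δ, the sampling density f(x|θ), the parameter space Θ, and the label θ0 for the true value are notational assumptions, not symbols fixed by the quoted passages.

```latex
R(\theta,\delta)
  = \mathbb{E}_\theta\big[L(\theta,\delta(X))\big]
  = \int L(\theta,\delta(x))\, f(x\mid\theta)\,\mathrm{d}x ,
  \qquad \theta\in\Theta ,
\qquad\text{so that}\qquad
R(\theta_0,\delta) = \mathbb{E}_{\theta_0}\big[L(\theta_0,\delta(X))\big].
```

Since θ0 is unknown, looking at R(θ,δ) over the whole of Θ is simply a way of covering the error at θ0, whatever θ0 turns out to be.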

“When Bayesians claim that all the relevant information for any inference concerning θ is given by π(θ|x0) they only admit to half the truth. The other half is that for selecting a Bayesian ‘optimal’ estimator of θ one needs to invoke additional information like a loss (or utility) function.” (p.4)

There is nothing objectionable in the above quote. It actually reminds me of a mantra Herman Rubin used (and maybe uses) to repeat, namely that the prior modelling and the loss function are intermingled and cannot be separated. There is information in the loss as well as in the prior, namely what the inference is about. A minor point about this section is that the inclusion of the MAP as a loss-related estimator is only correct for discrete parameter spaces: otherwise, there is no loss function leading to the MAP, as discussed in this earlier post. However, to argue later that Bayesian decision-theory takes “admissibility as the criterion for choosing among estimators” (p.6) is clearly off-track: admissibility is a minimal requirement in that inadmissible estimators should not be considered at all. But it does not provide a total ordering among estimators, and thus cannot serve for selecting an estimator. (Note also that the positive part James-Stein estimator is not admissible, contrary to the assertion on page 7.)
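As a reminder of why admissibility only yields a partial ordering, here is the usual definition together with a toy example. The notation R(θ,δ) for the risk of an estimator δ and the normal example with squared error loss are assumptions chosen for illustration; they are not taken from Spanos' paper.

```latex
\delta \text{ is inadmissible if there exists } \delta' \text{ such that }
R(\theta,\delta') \le R(\theta,\delta) \text{ for all } \theta
\text{ and } R(\theta,\delta') < R(\theta,\delta) \text{ for some } \theta .

% Toy example: X \sim \mathcal{N}(\theta,1) under squared error loss
R(\theta,\delta_1) = 1 \quad\text{for } \delta_1(x) = x ,
\qquad
R(\theta,\delta_2) = \theta^2 \quad\text{for } \delta_2(x) = 0 .
```

The two risk functions cross at |θ| = 1, so neither estimator dominates the other, and admissibility alone cannot choose between them.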

“The key concept underlying the James-Stein result, that of admissibility with respect to a particular loss function, seems inappropriate for frequentist inference in general and optimal estimation in particular.” (p.14)

I find (and I am obviously biased in this respect, having converged to Bayesianism by this route) that the arguments against the relevance of the James-Stein effect are not thoroughly honest, since they start with the usual ‘apples-and-bananas’ objection that one should not mix the “individual” estimators in a combined loss function as “there is no statistical reason to do so” (p.7). Then comes the argument that “the James-Stein is practically useless because [the James-Stein estimators] are inconsistent estimators of [θ] since there is essentially one observation for each unknown parameter.” (p.13) This does not make sense, as there is no ‘big n’ in the picture: if one accumulates observations about each unknown parameter, the Stein effect remains, and all estimators involved are equally consistent. Further objections (p.14), like the one quoted above, are mostly tautological: “the overall mean square error is not the relevant statistical error”, “the evaluation of the overall mean square error (..) depend on extraneous information”… The final ‘blow’ is delivered in the setting of a standard linear regression: when using a standard square error loss, i.e. one with the identity matrix in the quadratic form, the penalty is “treated as a unitless numerical measure of how costly are the various consequences of potential decisions associated with” (p.15) one estimator. This is a caricature of the use of James-Stein estimators in this setting, as a penalised loss using e.g. XᵀX as its quadratic form overcomes the objection in a straightforward manner… The debate conveniently forgets about the persistence of the Stein effect in a wide range of situations, i.e. for varying loss functions and including the estimation of confidence regions. (Let me stress again that there is no such thing as “James-Stein risk optimality” (p.20), nor as an admissible James-Stein estimator (p.7).)
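The persistence of the Stein effect when observations accumulate can be checked by a quick simulation. The following is a minimal sketch, assuming a normal model with identity covariance, n i.i.d. observations per coordinate summarised by their mean, and the plain quadratic loss; the function name `stein_risks` and all its arguments are hypothetical choices made for this illustration.

```python
import random


def stein_risks(theta, n=1, reps=4000, seed=0):
    """Monte Carlo estimates of the quadratic risk E||d(X) - theta||^2 for
    (MLE, James-Stein, positive-part James-Stein), when each coordinate of
    theta is observed n times with N(theta_i, 1) noise (assumed model)."""
    rng = random.Random(seed)
    p = len(theta)
    sd = 1.0 / n ** 0.5  # standard deviation of each coordinate of the sample mean
    totals = [0.0, 0.0, 0.0]
    for _ in range(reps):
        xbar = [rng.gauss(t, sd) for t in theta]
        norm2 = sum(x * x for x in xbar)
        # James-Stein shrinkage factor for a sample mean with variance 1/n
        shrink = 1.0 - (p - 2) / (n * norm2)
        # factor 1 is the MLE, shrink is James-Stein, max(shrink, 0) its positive part
        for k, c in enumerate((1.0, shrink, max(shrink, 0.0))):
            totals[k] += sum((c * x - t) ** 2 for x, t in zip(xbar, theta))
    return tuple(s / reps for s in totals)


# Compare the three estimated risks for a few (arbitrary) parameter values and sample sizes.
for n in (1, 5):
    for value in (0.0, 2.0):
        theta = [value] * 10
        mle, js, js_plus = stein_risks(theta, n=n)
        print(f"n={n} theta_i={value}: MLE={mle:.3f} JS={js:.3f} JS+={js_plus:.3f}")
```

In such runs the James-Stein risk should stay below the MLE risk for every n, while both go to zero as n grows, which is the sense in which the estimators are equally consistent.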

“There exists an inherent statistical distance function, often relating to the score function, and thus on information contained in the data.” (p.10)

A last surprising feature of the paper is this reference to “inherent distance functions”. I do not understand why those loss (or error) functions are more justified than others and the paper only refers to the introductory textbooks of Casella and Berger and of Shao for support. (In particular, the supremum loss function associated with the uniform distribution (p.10) sounds particularly suspicious as it eliminates the value of the estimator from the picture.) To me, if there were to be an “inherent distance function”, it would be the intrinsic loss based on the Kullback-Leibler divergence as it naturally appears in asymptotic statistics and is parameterisation invariant. (Of course, it does not apply to the uniform case.) In any case, I do not see how a loss function could be “stemming from information contained in the data”. The minimum requirement is that it stems from the model generating the data, in which case intrinsic loss functions sound fine. A better solution is however to understand how the inference is going to be used to construct an appropriate loss: there is nothing reprehensible in exploiting this sort of prior information.
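The parameterisation invariance of the Kullback-Leibler intrinsic loss is easy to check on a toy case. The sketch below assumes an exponential model, parameterised either by its rate λ or by its mean μ = 1/λ; the model and the function names are choices made for this illustration only.

```python
import math


def kl_rate(lam_true, lam_est):
    """KL divergence from Exp(rate=lam_true) to Exp(rate=lam_est)."""
    return math.log(lam_true / lam_est) + lam_est / lam_true - 1.0


def kl_mean(mu_true, mu_est):
    """Same divergence, written in the mean parameterisation mu = 1 / rate."""
    return math.log(mu_est / mu_true) + mu_true / mu_est - 1.0


def sq_err(true_value, estimate):
    """Squared error loss, which depends on the chosen parameterisation."""
    return (true_value - estimate) ** 2


lam_true, lam_est = 2.0, 1.0                   # arbitrary illustrative values
mu_true, mu_est = 1.0 / lam_true, 1.0 / lam_est

# The intrinsic loss takes the same value in both parameterisations...
print(kl_rate(lam_true, lam_est), kl_mean(mu_true, mu_est))
# ...while the squared error loss changes with the parameterisation.
print(sq_err(lam_true, lam_est), sq_err(mu_true, mu_est))
```

The intrinsic loss compares the two sampling distributions rather than the two parameter values, which is why relabelling the parameter leaves it unchanged.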

“Decisions are driven exclusively by the risk function and not by any aspiration to learn from data about the true θ.” (p.9)

“It is important to note that the overwhelming majority of these confusions have been introduced into frequentist inference by Bayesians by deploying rigged examples.” (p.17)

In conclusion of this lengthy read of Spanos’ pamphlet, I remain, surprise, surprise!, unmoved by his arguments. Besides exhibiting his anti-Bayesian agenda, the author does not make a convincing argument as to why decision theory is not compatible with frequentist inference, if only because his definition of what frequentist inference is remains elusive throughout the paper. The issue of evaluating statistical procedures in a non-Bayesian way is not addressed therein. More importantly, I think, the notion of predictive loss is not once mentioned, while it is a quintessential quantity in the evaluation of procedures, be they frequentist or Bayesian (see, e.g., Gelman et al., 2013). When building a statistical model for scientific purposes, a logical (I would write Popperian if I were not involved in a philosophical debate!) step is to test for its predictive abilities, in which case there is a natural loss function implied by this perspective, namely how far from the truth the prediction is. Globally, the vision of frequentist statistics transmitted by the author appears as a philosophical construct at odds with statistical practice and unable to justify the selection of a given statistical procedure.

6 Responses to “beware, nefarious Bayesians threaten to take over frequentism using loss functions as Trojan horses!”

  1. I just found this post. I was about to send you an extended critique of Spanos’s working paper that I wrote up in the summer. But looking my critique over now, I find its style elliptical to the point of incomprehensibility — a common failing of my writing…

  2. Thanks for posting all this – particularly as it’s such a painfully bad manuscript.

    It’s hard to credit an author who (pg 13) argues admissibility is worthless without noting that it eliminates e.g. the MLE+white noise as an estimator, and then (pg 17) complains that people use “rigged examples”.

    • Thanks. I indeed wonder about the background of the author, with references mostly to introductory frequentist textbooks like Casella and Berger…
