the philosophical importance of Stein’s paradox
I recently came across this paper, written by three philosophers of science, attempting to set the Stein paradox in a philosophical light. Given my past involvement, I was obviously interested in which new perspective could be proposed, close to sixty years after Stein (1956). A paper we should actually celebrate next year! However, when reading the document, I did not find a significantly innovative approach to the phenomenon…
The paper does not start in the best possible light since it seems to justify the use of the sample mean through maximum likelihood estimation, which is only the case for a limited number of probability distributions (including the Normal distribution, which may be an implicit assumption). For instance, when the data is Student's t, the MLE is not the sample mean, no matter how shocking that might sound! (And while this is a minor issue, results about the Stein effect taking place in non-normal settings appear much earlier than 1998, and earlier than in my dissertation. See, e.g., Berger and Bock (1975), or Brandwein and Strawderman (1978).)
While the linear regression explanation for the Stein effect is already presented in Steve Stigler's Neyman Lecture, I still have difficulties with the argument, in that, for instance, we do not know the value of the parameter, which makes the regression and the inverse regression of parameter means over Gaussian observations mere concepts and nothing practical. (Except for the interesting result that two observations make both regressions coincide.) And it does not seem at all intuitive (to me) that imposing a constraint should improve the efficiency of a maximisation program…
Another difficulty I have with the discussion of the case against the MLE is not that there exist admissible estimators that dominate the MLE (when k≥5, as demonstrated by Bill Strawderman in 1971), but on the contrary that (a) there is an infinity of them and (b) they do not come out as closed-form expressions. Even for James–Stein or Efron–Morris shrinkage estimators, there exists a continuum of them, with no classical reason for picking one against the other.
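The continuum is easy to exhibit: for X ~ N_k(θ, I_k) and squared-error loss, every estimator (1 − c/||x||²)x with 0 < c < 2(k−2) dominates the MLE, the James–Stein choice c = k−2 being only one member of the family. A minimal simulation sketch (mine, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(1)
k, reps = 10, 20_000
theta = rng.normal(size=k)                     # arbitrary true mean vector

x = theta + rng.standard_normal((reps, k))     # one N_k(theta, I) draw per replicate
norm2 = (x ** 2).sum(axis=1, keepdims=True)

risk_mle = ((x - theta) ** 2).sum(axis=1).mean()
# A whole continuum of dominating shrinkage factors: any 0 < c < 2(k - 2)
for c in ((k - 2) / 2, k - 2, 1.5 * (k - 2)):
    shrunk = (1 - c / norm2) * x
    risk = ((shrunk - theta) ** 2).sum(axis=1).mean()
    print(f"c = {c:>4}: risk ≈ {risk:.3f}  (MLE risk ≈ {risk_mle:.3f})")
```

All three values of c beat the MLE in estimated risk, which is precisely the embarrassment: the classical theory offers no ground for preferring one of them.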
Not that it really matters, but I also find rechristening the Stein phenomenon as holistic pragmatism somewhat inappropriate. Or just ungrounded. It seems to me the phenomenon simply relates to collective decision paradoxes, with multidimensional or multi-criteria utility functions having no way of converging to a collective optimum. As illustrated in [Lakanal alumni] Allais’ paradox.
“We think the most plausible Bayesian response to Stein’s results is to either reject them outright or to adopt an instrumentalist view of personal probabilities.”
The part connecting Stein with Bayes again starts on the wrong foot, since it is untrue that any shrinkage estimator can be expressed as a Bayes posterior mean. This is not even true for the original James–Stein estimator, which is not a Bayes estimator and hence cannot be a posterior mean. I neither understand nor relate to the notion of “Bayesians of the first kind”, especially when it merges with an empirical Bayes argument. More globally, the whole discourse about Bayesians “taking account of Stein’s result” does not stand on very sound ground, because Bayesians automatically integrate the shrinkage phenomenon when minimising a posterior loss, rather than trying to accommodate it as a special goal. Laughing (in the paper) at the prior assumption that all means should be “close” to zero or “close together” does not account for the choice of the location (zero) or scale (one) when measuring quantities of interest. Nor for the fact that Stein’s effect holds even when the means are far from zero or far from being similar, albeit as a minuscule effect, that is, when the prior disagrees with the data, because Stein’s phenomenon is a frequentist occurrence. What I find amusing is instead to mention a “prior whose probability mass is centred about the sample mean”. (I am unsure the authors are aware that the shrinkage effect is irrelevant for all practical purposes unless the true means are close to the shrinkage centre.) And to state that improper priors “integrate to a number larger than 1” and that “it’s not possible to be more than 100% confident in anything”… And to confuse the Likelihood Principle with the prohibition of data-dependent priors. And to consider that the MLE and any shrinkage estimator have the same expected utility under a flat prior (since, if they had, there would be no Bayes estimator!). The only part with which I can agree is, again, that Stein’s phenomenon is a frequentist notion. 
But one that induces us to use Bayes estimators as the only coherent way to make use of the loss function. The paper is actually silent about the duality existing between losses and priors, a duality that would put Stein’s effect into a totally different light, as expressed e.g. in Herman Rubin’s paper. Because shrinkage, both in existence and in magnitude, is deeply connected with the choice of the loss function, arguing against an almost universal Bayesian perspective on shrinkage while adhering to a single loss function is rather paradoxical. Similarly, very little of substance can be found about empirical Bayes estimation and its philosophical foundations.
While it is generally agreed that shrinkage estimators trade some bias for a decrease in variance, the connection with AIC is at best tenuous, because AIC and other model choice tools are not estimation devices per se, and because they force infinite shrinkage, namely having some components of the estimator exactly equal to zero, which is an impossibility for Bayes estimates (under a continuous prior). A much more natural (and already made) connection would be to relate shrinkage and LASSO estimators, since the difference can be rephrased as the opposition between Gaussian and Laplace priors.
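As a small illustration of that last opposition (my own sketch, not the paper's): in the normal-means model with unit noise, the MAP estimate under a Gaussian prior shrinks every component proportionally (ridge) and never reaches zero, while the MAP under a Laplace prior is a soft-thresholding rule (LASSO) that sets small components exactly to zero.

```python
import numpy as np

x = np.array([-3.0, -0.4, 0.1, 0.7, 2.5])   # observed normal means
lam = 0.5                                   # prior precision / penalty scale

# MAP under a N(0, 1/lam) prior: proportional shrinkage, no exact zeros
ridge = x / (1 + lam)

# MAP under a Laplace prior: soft thresholding, exact zeros for |x| <= lam
lasso = np.sign(x) * np.maximum(np.abs(x) - lam, 0)

print("ridge:", ridge)
print("lasso:", lasso)   # note the exact zeros for the small components
```

The exact zeros of the Laplace MAP come from the non-differentiability of the prior at the origin, whereas no posterior mean under a continuous prior can hit zero exactly.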
I also object to the concept of “linguistic invariance”, which simply means (for me) absolute invariance, namely that the estimate of the transform must be the transform of the estimate for any and all transforms. This holds for the MLE, but also, contrary to the authors’ assertion, for Bayes estimation under my intrinsic loss functions.
“But when and how problems should be lumped together or split apart remains an important open problem in statistics.”
The authors correctly point out the accuracy of AIC (over BIC) for making predictions, but shrinkage does not necessarily suffer in this respect, as Stein’s phenomenon also holds for prediction, when predicting enough values at the same time… I also object to the envisioned possibility of a shrinkage estimator that would improve every component of the MLE (in a uniform sense), as it contradicts the admissibility of the single-component MLE! And the above quote shows that the decision-theoretic part of inference is not properly incorporated.
Overall, I thus clearly wonder at the purpose of the paper, given the detailed coverage of many aspects of the Stein phenomenon provided by Stephen Stigler and others over the years. Obviously, a new perspective is always welcome, but this paper somewhat lacks appeal, while missing essential features and making the Stein phenomenon look like a poor relative of Bayesian inference. In my opinion, Stein’s phenomenon remains an epiphenomenon, one that signals the end of the search for a gold standard in frequentist estimation rather than the opening of a new era of estimation. It pushed me almost irresistibly towards Bayesianism, a move I do not regret to this day! In fine, I also have trouble seeing Stein’s phenomenon as durably impacting the field, more than fifty years later, and hence think it remains of little importance for epistemology and the philosophy of science, except maybe for marking the end of an era when the search for “the” ideal estimator was still on the agenda.