## the philosophical importance of Stein’s paradox

**I** recently came across this paper, written by three philosophers of science, attempting to set the Stein paradox in a philosophical light. Given my past involvement, I was obviously interested in which new perspective could be proposed, close to sixty years after Stein (1956). A paper we should actually celebrate next year! However, when reading the document, I did not find a significantly innovative approach to the phenomenon…

The paper does not start in the best possible light since it seems to justify the use of the sample mean through maximum likelihood estimation, which is only the case for a limited number of probability distributions (including the Normal distribution, which may be an implicit assumption there). For instance, when the data is Student’s t, the MLE is not the sample mean, no matter how shocking that might sound! (And while this is a minor issue, results about the Stein effect taking place in non-normal settings appeared much earlier than 1998, and earlier than in my dissertation. See, e.g., Berger and Bock (1975), or Brandwein and Strawderman (1978).)
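A quick numerical sketch of this point (the data, the fixed degrees of freedom, and the function names are all mine, chosen for illustration): with a Student’s t likelihood, one outlier drags the sample mean away while the location MLE barely moves.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Toy sample with one outlier; t degrees of freedom fixed (and assumed known) at 3.
x = np.array([0.0, 0.1, -0.2, 10.0])
df = 3.0

def neg_loglik(mu):
    # Negative Student-t log-likelihood in the location mu (scale fixed at 1),
    # dropping constants that do not depend on mu
    return np.sum((df + 1) / 2 * np.log(1 + (x - mu) ** 2 / df))

mle = minimize_scalar(neg_loglik, bounds=(-5, 5), method="bounded").x
print(mle, x.mean())  # the t-MLE stays near 0 while the mean is pulled to 2.475
```

The redescending influence of the t likelihood makes the location MLE robust to the outlier, which is exactly why it cannot coincide with the sample mean.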

While the linear regression explanation for the Stein effect was already presented in Steve Stigler’s Neyman Lecture, I still have difficulties with the argument, in that, for instance, we do not know the value of the parameter, which makes the regression and the inverse regression of parameter means over Gaussian observations mere concepts rather than anything practical. (Except for the interesting result that two observations make both regressions coincide.) And it does not seem at all intuitive (to me) that imposing a constraint should improve the efficiency of a maximisation program…

Another difficulty I have with the discussion of the case against the MLE is not that there exist admissible estimators that dominate the MLE (when k≥5, as demonstrated by Bill Strawderman in 1975), but on the contrary that (a) there is an infinity of them and (b) they do not come out as closed-form expressions. Even for the James-Stein or Efron-Morris shrinkage estimators, there exists a continuum of them, with no classical reason for picking one against the other.
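The domination itself is easy to reproduce by simulation; here is a minimal sketch (the dimension, seed, and choice of a true mean vector are arbitrary) comparing the quadratic risk of the MLE with that of the original James-Stein estimator:

```python
import numpy as np

rng = np.random.default_rng(0)
k, n_sim = 10, 20_000
theta = rng.normal(size=k)                    # arbitrary true mean vector
x = rng.normal(theta, 1.0, size=(n_sim, k))   # one N(theta, I) draw per replicate

# Original James-Stein estimator: shrink each observation toward the origin
norms2 = np.sum(x ** 2, axis=1, keepdims=True)
js = (1 - (k - 2) / norms2) * x

# Monte Carlo estimates of the quadratic risk E||estimator - theta||^2
risk_mle = np.mean(np.sum((x - theta) ** 2, axis=1))
risk_js = np.mean(np.sum((js - theta) ** 2, axis=1))
print(risk_mle, risk_js)  # risk_mle is close to k = 10; risk_js is strictly smaller
```

The MLE risk concentrates near k, while the James-Stein risk is uniformly lower in total, even though no single component is uniformly improved.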

Not that it really matters, but I also find rechristening the Stein phenomenon as *holistic pragmatism* somewhat inappropriate. Or just ungrounded. It seems to me the phenomenon simply relates to collective decision paradoxes, with multidimensional or multi-criteria utility functions having no way of converging to a collective optimum. As illustrated in [Lakanal alumni] Allais’ paradox.

“We think the most plausible Bayesian response to Stein’s results is to either reject them outright or to adopt an instrumentalist view of personal probabilities.”

The part connecting Stein with Bayes again starts on the wrong foot, since it is untrue that *any* shrinkage estimator can be expressed as a Bayes posterior mean. This is not even true for the *original* James-Stein estimator: it is not a Bayes estimator and cannot be a Bayes posterior mean. I neither understand nor relate to the notion of “Bayesians of the first kind”, especially when it merges with an empirical Bayes argument. More globally, the whole discourse about Bayesians “taking account of Stein’s result” does not stand on very sound ground, because Bayesians automatically integrate the shrinkage phenomenon when minimising a posterior loss, rather than trying to accommodate it as a special goal. Laughing (in the paper) at the prior assumption that all means should be “close” to zero or “close together” does not account for the choice of the location (zero) or scale (one) when measuring quantities of interest. Nor for the fact that Stein’s effect holds even when the means are far from zero or far from being similar, that is, when the prior disagrees with the data, albeit then as a minuscule effect, because Stein’s phenomenon is a frequentist occurrence. What I find amusing is instead the mention of a “prior whose probability mass is centred about the sample mean”. (I am unsure the authors are aware that the shrinkage effect is irrelevant for all practical purposes unless the true means are close to the shrinkage centre.) *And* the statement that improper priors “integrate to a number larger than 1” and that “it’s not possible to be more than 100% confident in anything”… *And* the confusion of the Likelihood Principle with the prohibition of data-dependent priors. *And* the claim that the MLE and any shrinkage estimator have the same expected utility under a flat prior (since, if they had, there would be no Bayes estimator!). The only part with which I can agree is, again, that Stein’s phenomenon is a *frequentist* notion, but one that induces us to use Bayes estimators as the only coherent way to make use of the loss function.

The paper is actually silent about the duality existing between losses and priors, a duality that would put Stein’s effect in a totally different light, as expressed e.g. in Herman Rubin’s paper. Because shrinkage, both in its existence and in its magnitude, is deeply connected with the choice of the loss function, arguing against an almost universal Bayesian perspective on shrinkage while adhering to a single loss function is rather paradoxical. Similarly, very little of substance can be found about empirical Bayes estimation and its philosophical foundations.
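To illustrate how shrinkage is automatically integrated on the Bayesian side, here is a minimal conjugate-Normal sketch (the numerical values are mine): for X ~ N(θ, 1) and θ ~ N(0, τ²), the posterior mean is τ²/(τ² + 1) · x, i.e., a shrinkage of the observation towards the prior location, with the shrinkage factor driven entirely by the prior scale.

```python
# Posterior mean under X ~ N(theta, 1) and theta ~ N(0, tau2):
# shrinkage towards the prior mean 0 falls out of the conjugate update
def posterior_mean(x, tau2):
    return tau2 / (tau2 + 1.0) * x

x = 3.0
for tau2 in (0.5, 1.0, 10.0):
    # a vaguer prior (larger tau2) shrinks less
    print(tau2, posterior_mean(x, tau2))
```

No special accommodation of Stein’s result is needed: the shrinkage is a by-product of minimising posterior quadratic loss under the prior.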

While it is generally agreed that shrinkage estimators trade some bias for a decrease in variance, the connection with AIC is at best tenuous, because AIC and other model choice tools are not estimation devices *per se*, and because they force infinite shrinkage, namely setting some components of the estimator exactly equal to zero, which is an impossibility for Bayes estimates. A much more natural (and already made) connection would be to relate shrinkage and LASSO estimators, since the difference can be rephrased as the opposition between Gaussian and Laplace priors.
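That opposition can be sketched in one dimension (function names and test values are mine): under X ~ N(θ, 1), the MAP estimate under a Gaussian prior shrinks proportionally and never reaches zero, while the Laplace-prior MAP is the soft-thresholding rule underlying the LASSO, which does set small components exactly to zero.

```python
import numpy as np

# MAP estimates for X ~ N(theta, 1), componentwise, under two priors:
def ridge(x, lam):
    # Gaussian prior: proportional shrinkage, never exactly 0
    return x / (1.0 + lam)

def lasso(x, lam):
    # Laplace prior: soft thresholding, small components set exactly to 0
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

x = np.array([-3.0, -0.5, 0.2, 2.0])
print(ridge(x, 1.0))   # every component shrunk, none zero
print(lasso(x, 1.0))   # the two small components are exactly 0
```

The zeroing behaviour of the Laplace MAP is what makes LASSO a selection device, and also what distinguishes it from a (proper) Bayes posterior mean.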

I also object to the concept of “linguistic invariance”, which simply means (to me) absolute invariance, namely that the estimate of a transform must be the transform of the estimate for any and all transforms. Which holds for the MLE, but also, contrary to the authors’ assertion, for Bayes estimation under my intrinsic loss functions.

“But when and how problems should be lumped together or split apart remains an important open problem in statistics.”

The authors correctly point out the accuracy of AIC (over BIC) for making predictions, but shrinkage does not necessarily suffer from this feature as Stein’s phenomenon also holds for prediction, if predicting enough values at the same time… I also object to the envisioned possibility of a shrinkage estimator that would improve every component of the MLE (in a uniform sense) as it contradicts the admissibility of the single component MLE! And the above quote shows the decision theoretic part of inference is not properly incorporated.

Overall, I thus clearly wonder at the purpose of the paper, given the detailed coverage of many aspects of the Stein phenomenon provided by Stephen Stigler and others over the years. Obviously, a new perspective is always welcome, but this paper somewhat lacks appeal, while missing essential features and making the Stein phenomenon look like a poor relative of Bayesian inference. In my opinion, Stein’s phenomenon remains an epiphenomenon, one that signals the end of the search for a gold standard in frequentist estimation rather than the opening of a new era of estimation. It pushed me almost irresistibly into Bayesianism, a move I do not regret to this day! *In fine*, I also have trouble seeing Stein’s phenomenon as durably impacting the field, more than fifty years later, and hence think it remains of little importance for epistemology and the philosophy of science. Except maybe for marking the end of an era, when the search for “the” ideal estimator was still on the agenda.

January 12, 2016 at 2:37 am

Thank you for reading our paper and discussing it on your blog! Our purpose with the paper was to give an introduction to Stein’s phenomenon for a philosophical audience; it was not meant to — and probably will not — offer a new and interesting perspective for a statistician who is already very familiar with Stein’s phenomenon and its extensive literature.

I have a few more specific comments:

1. We don’t rechristen Stein’s phenomenon as “holistic pragmatism.” Rather, holistic pragmatism is the attitude to frequentist estimation that we think is underwritten by Stein’s phenomenon. Since MLE is sometimes admissible and sometimes not, depending on the number of parameters estimated, the researcher has to take into account his or her goals (whether total accuracy or individual-parameter accuracy is more important) when picking an estimator. To a statistician, this might sound obvious, but to philosophers it’s a pretty radical idea.

2. “The part connecting Stein with Bayes again starts on the wrong foot, since it is untrue that any shrinkage estimator can be expressed as a Bayes posterior mean. This is not even true for the original James-Stein estimator, i.e., it is not a Bayes estimator and cannot be a Bayes posterior mean.”

That seems to depend on what you mean by a “Bayes estimator.” It is possible to have an empirical Bayes prior (constructed from the sample) whose posterior mean is identical to the original James-Stein estimator. But if you don’t count empirical Bayes priors as Bayesian, then you are right.

3. “And to state that improper priors “integrate to a number larger than 1” and that “it’s not possible to be more than 100% confident in anything”… And to confuse the Likelihood Principle with the prohibition of data dependent priors. And to consider that the MLE and any shrinkage estimator have the same expected utility under a flat prior (since, if they had, there would be no Bayes estimator!).”

I’m not sure I completely understand your criticisms here. First, as for the relation between the LP and data-dependent priors — it does seem to me that the LP precludes the use of data-dependent priors. If you use data from an experiment to construct your prior, then — contrary to the LP — it will not be true that all the information provided by the experiment regarding which parameter is true is contained in the likelihood function, since some of the information provided by the experiment will also be in your prior.

Second, as to our claim that the ML estimator has the same expected utility (under the flat prior) as a shrinkage estimator that dominates it: we incorporated this claim into our paper because it was an objection made by a statistician who read and commented on our paper. Are you saying the claim is false? If so, we would certainly like to know, so that we can revise the paper to make it more accurate.

4. I was aware of Rubin’s idea that priors and utility functions (supposedly) are non-separable, but I didn’t (and don’t) quite see the relevance of that idea to Stein estimation.

5. “Similarly, very little of substance can be found about empirical Bayes estimation and its philosophical foundations.”

What we say about empirical Bayes priors is that they cannot be interpreted as degrees of belief; they are just tools. It will be surprising to many philosophers that priors are sometimes used in such an instrumentalist fashion in statistics.

6. The reason why we made a comparison between Stein estimation and AIC was two-fold: (a) for sociological reasons, philosophers are much more familiar with model selection than they are with, say, the LASSO or other regularized regression methods. (b) To us, it’s precisely because model selection and estimation are such different enterprises that it’s interesting that they have such a deep connection: despite being very different, AIC and shrinkage both rely on a bias-variance trade-off.

7. “I also object to the envisioned possibility of a shrinkage estimator that would improve every component of the MLE (in a uniform sense) as it contradicts the admissibility of the single component MLE!”

I don’t think our suggestion here contradicts the admissibility of single component MLE. The idea is just that if we have data D and D’ about parameters phi and phi’, then the estimates of both phi and phi’ can sometimes be improved if the estimation problems are lumped together and a shrinkage estimator is used. This doesn’t contradict the admissibility of MLE, because MLE is still admissible on each of the data sets for each of the parameters.

Again, thanks for reading the paper and for the feedback—we really do want to make sure our paper is accurate, so your feedback is much appreciated. Lastly, I apologize for the length of this comment.

Olav Vassend

January 12, 2016 at 8:33 am

Thank you Olav and no worries for the “long” reply, it will make a perfect blog entry!

December 1, 2015 at 3:34 am

Hi, X. Yes, in a very real sense the Stein paradox is no paradox at all, because the sample mean is not necessarily a good estimate of the population mean at all! As can be seen, for example, in those many recent Psychological-Science-style papers. This is a point perhaps worth making more fully, not just in a blog comment…

November 30, 2015 at 10:59 pm

> does not seem at all intuitive (to me) that imposing a constraint should improve the efficiency of a maximisation program…

If you have not already read Meng, X.-L. (2005), “From Unit Root to Stein’s Estimator to Fisher’s k Statistics: If You Have a Moment, I Can Tell You More”, Statistical Science, 20, 141–162 (in particular the second application), you might wish to.

Keith O’Rourke

December 8, 2015 at 11:46 am

Thank you, Keith, I just went back to the 2003 Read Paper of Kong et al., in preparation for my NIPS talk on Friday in Montréal (one year after our meeting there). In this setting, constraining the space of measures makes it smaller and as a result reduces the variance of the resulting estimator, assuming there is no induced increase in the computational cost. So I essentially agree that this perspective on constraints leads to more efficiency!