## the philosophical importance of Stein’s paradox [a reply from the authors]

*[In the wake of my comment on this paper written by three philosophers of science, I received this reply from Olav Vassend.]*

Thank you for reading our paper and discussing it on your blog! Our purpose with the paper was to give an introduction to Stein’s phenomenon for a philosophical audience; it was not meant to — and probably will not — offer a new and interesting perspective for a statistician who is already very familiar with Stein’s phenomenon and its extensive literature.

I have a few more specific comments:

1. We don’t rechristen Stein’s phenomenon as “holistic pragmatism.” Rather, holistic pragmatism is the attitude to frequentist estimation that we think is underwritten by Stein’s phenomenon. Since the MLE is sometimes admissible and sometimes not, depending on the number of parameters being estimated, the researcher has to take into account his or her goals (whether total accuracy or individual-parameter accuracy is more important) when picking an estimator. To a statistician, this might sound obvious, but to philosophers it’s a pretty radical idea.
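Stein’s phenomenon itself is easy to check numerically. The following is a minimal illustrative simulation (not from the paper): for X ~ N(θ, I_p) with p ≥ 3, the James-Stein estimator has smaller *total* squared-error risk than the MLE, even though the MLE remains admissible for each coordinate taken separately.

```python
import numpy as np

# Minimal illustrative simulation of Stein's phenomenon:
# total squared-error risk of the MLE vs. the James-Stein estimator
# for X ~ N(theta, I_p) with p >= 3.
rng = np.random.default_rng(0)
p, trials = 10, 20000
theta = np.zeros(p)            # true mean; total-risk dominance holds for any theta
X = rng.normal(theta, 1.0, size=(trials, p))

mle = X                                              # the MLE of theta is X itself
norms2 = np.sum(X**2, axis=1, keepdims=True)
js = (1 - (p - 2) / norms2) * X                      # James-Stein shrinkage toward 0

mle_mse = np.mean(np.sum((mle - theta)**2, axis=1))  # ~ p = 10 in expectation
js_mse = np.mean(np.sum((js - theta)**2, axis=1))    # smaller total risk for p >= 3
print(mle_mse, js_mse)
```

The gap is largest when the true mean is near the shrinkage target, but the dominance in total risk holds uniformly over θ.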

2. *“The part connecting Stein with Bayes again starts on the wrong foot, since it is untrue that any shrinkage estimator can be expressed as a Bayes posterior mean. This is not even true for the original James-Stein estimator, i.e., it is not a Bayes estimator and cannot be a Bayes posterior mean.”*

That seems to depend on what you mean by a “Bayes estimator.” It is possible to have an empirical Bayes prior (constructed from the sample) whose posterior mean is identical to the original James-Stein estimator. But if you don’t count empirical Bayes priors as Bayesian, then you are right.
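The empirical Bayes construction behind this point is a standard textbook derivation (sketched here, not taken from the paper):

```latex
% Model: X \mid \theta \sim \mathcal{N}_p(\theta, I_p),
% prior: \theta \sim \mathcal{N}_p(0, \tau^2 I_p). The posterior mean is
\mathbb{E}[\theta \mid X] \;=\; \Bigl(1 - \frac{1}{1+\tau^2}\Bigr) X .
% Marginally X \sim \mathcal{N}_p(0, (1+\tau^2) I_p), so
% \|X\|^2/(1+\tau^2) \sim \chi^2_p, and since \mathbb{E}[1/\chi^2_p] = 1/(p-2),
\mathbb{E}\!\left[\frac{p-2}{\|X\|^2}\right] \;=\; \frac{1}{1+\tau^2}
\qquad (p \ge 3).
% Substituting this unbiased estimate of the shrinkage factor gives
\hat{\theta}_{\mathrm{JS}} \;=\; \Bigl(1 - \frac{p-2}{\|X\|^2}\Bigr) X .
```

Whether plugging a data-based estimate of the shrinkage factor into the posterior mean still counts as “Bayesian” is exactly the point of disagreement.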

3. *“And to state that improper priors “integrate to a number larger than 1” and that “it’s not possible to be more than 100% confident in anything”… And to confuse the Likelihood Principle with the prohibition of data dependent priors. And to consider that the MLE and any shrinkage estimator have the same expected utility under a flat prior (since, if they had, there would be no Bayes estimator!).”*

I’m not sure I completely understand your criticisms here. First, as for the relation between the LP and data-dependent priors — it does seem to me that the LP precludes the use of data-dependent priors. If you use data from an experiment to construct your prior, then — contrary to the LP — it will not be true that all the information provided by the experiment regarding which parameter value is true is contained in the likelihood function, since some of the information provided by the experiment will also be in your prior.

Second, as to our claim that the ML estimator has the same expected utility (under the flat prior) as a shrinkage estimator that dominates it: we incorporated this claim into our paper because it was an objection made by a statistician who read and commented on our paper. Are you saying the claim is false? If so, we would certainly like to know, so that we can revise the paper to make it more accurate.

4. I was aware of Rubin’s idea that priors and utility functions (supposedly) are non-separable, but I didn’t (and don’t) quite see the relevance of that idea to Stein estimation.

5. *“Similarly, very little of substance can be found about empirical Bayes estimation and its philosophical foundations.”*

What we say about empirical Bayes priors is that they cannot be interpreted as degrees of belief; they are just tools. It will be surprising to many philosophers that priors are sometimes used in such an instrumentalist fashion in statistics.

6. The reason why we made a comparison between Stein estimation and AIC was two-fold: (a) for sociological reasons, philosophers are much more familiar with model selection than they are with, say, the LASSO or other regularized regression methods; (b) to us, it’s precisely because model selection and estimation are such different enterprises that it’s interesting that they have such a deep connection: despite being very different, AIC and shrinkage both rely on a bias–variance trade-off.

7. *“I also object to the envisioned possibility of a shrinkage estimator that would improve every component of the MLE (in a uniform sense) as it contradicts the admissibility of the single component MLE!”*

I don’t think our suggestion here contradicts the admissibility of single component MLE. The idea is just that if we have data D and D’ about parameters φ and φ’, then the estimates of both φ and φ’ can sometimes be improved if the estimation problems are lumped together and a shrinkage estimator is used. This doesn’t contradict the admissibility of MLE, because MLE is still admissible on each of the data sets for each of the parameters.
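The “sometimes” matters here: componentwise improvement is not uniform over the parameter space, but it does occur, for instance when the true values happen to lie near the shrinkage target. A small illustrative simulation (hypothetical true values, not from the paper) of lumping the problems together:

```python
import numpy as np

# Illustrative check (not from the paper): combining several estimation
# problems into one shrinkage estimator can lower the MSE of *each*
# component when the true values lie near the shrinkage target (here 0).
# Componentwise gains are NOT uniform in theta; this shows one such case.
rng = np.random.default_rng(1)
p, trials = 10, 20000
theta = np.full(p, 0.5)        # hypothetical true values, all close to zero
X = rng.normal(theta, 1.0, size=(trials, p))

js = (1 - (p - 2) / np.sum(X**2, axis=1, keepdims=True)) * X

mle_comp = np.mean((X - theta)**2, axis=0)   # per-component MSE of the MLE (~1 each)
js_comp = np.mean((js - theta)**2, axis=0)   # smaller for every component here
print(mle_comp.mean(), js_comp.mean())
```

For true values far from the shrinkage target, individual components can do worse under shrinkage, which is consistent with the admissibility of the single-component MLE.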

Again, thanks for reading the paper and for the feedback—we really do want to make sure our paper is accurate, so your feedback is much appreciated. Lastly, I apologize for the length of this comment.

Olav Vassend

January 16, 2016 at 5:37 am

If there is no uncertainty about the model, Bayes implies one thing. If there is uncertainty about the model, you can easily get data-dependent priors using nothing more than the sum/product rules of probability theory (all that’s needed to prove Bayes’ theorem). See here:

http://www.bayesianphilosophy.com/the-data-can-change-the-prior/

What’s happening here is that Bayes is being equated with one particular application and context of the sum/product rules (Bayes’ theorem with no doubt about the model) for historical reasons. In reality, Bayes is anything derivable from the sum/product rules for any given context.

When you start exploring more general consequences of the sum/product rules in other contexts, something funny happens. Many things which don’t seem “Bayesian” according to that limited historical understanding fall right out of the equations.

January 15, 2016 at 6:23 pm

A few more words then:

1. Using the maximum likelihood estimator (MLE) is not a frequentist move per se, since the MLE is a special case of the MAP estimator (the one associated with a flat prior). Whether it has good or poor frequentist properties depends on the setting and on the properties one is interested in.

2. For me empirical Bayes is not Bayes.

3. The (i) likelihood principle and the (ii) prohibition of data-dependent priors seem to be orthogonal principles. Once I observe my data, I can set my prior according to these data irrespective of my choice of likelihood, hence of (i), and still be violating (ii). For the second part, I remain confused by the statement: the posterior expected loss is not the same for the MLE and for an arbitrary shrinkage estimator. Unless the Bayes risks are infinite for both, they also differ.

4. One can create shrinkage by using a different prior or by using a different loss. (I have mostly forgotten what I meant there!)

Nothing of substance to add about the following points!

January 15, 2016 at 2:41 pm

Thank you for making the comment into a post!