Archive for James-Stein estimator

efficiency and the Fréchet-Darmois-Cramèr-Rao bound

Posted in Books, Kids, Statistics with tags , , , , , , , , , , , on February 4, 2019 by xi'an

 

Following some entries on X validated, and after grading a mathematical statistics exam involving Cramèr-Rao, or Fréchet-Darmois-Cramèr-Rao to include both French contributors pictured above, I wonder as usual at the relevance of a concept of efficiency outside [and even inside] the restricted case of unbiased estimators. The general (frequentist) version is that the variance of an estimator δ of [any transform of] θ with bias b(θ) is

I(θ)⁻¹ (1+b'(θ))²

while a Bayesian version is the van Trees inequality on the integrated squared error loss

(E(I(θ))+I(π))⁻¹

where I(θ) and I(π) are the Fisher information and the prior entropy, respectively. But this opens a whole can of worms, in my opinion since

  • establishing that a given estimator is efficient requires computing both the bias and the variance of that estimator, not an easy task when considering a Bayes estimator or even the James-Stein estimator. I actually do not know if any of the estimators dominating the standard Normal mean estimator has been shown to be efficient (although there exist results for closed form expressions of the James-Stein estimator quadratic risk, including one of mine the Canadian Journal of Statistics published verbatim in 1988). Or is there a result that a Bayes estimator associated with the quadratic loss is by default efficient in either the first or second sense?
  • while the initial Fréchet-Darmois-Cramèr-Rao bound is restricted to unbiased estimators (i.e., b(θ)≡0) and unable to produce efficient estimators in all settings but for the natural parameter in the setting of exponential families, moving to the general case means there exists one efficiency notion for every bias function b(θ), which makes the notion quite weak, while not necessarily producing efficient estimators anyway, the major impediment to taking this notion seriously;
  • moving from the variance to the squared error loss is not more “natural” than using any [other] convex combination of variance and squared bias, creating a whole new class of optimalities (a grocery of cans of worms!);
  • I never got into the van Trees inequality so cannot say much, except that the comparison between various priors is delicate since the integrated risks are against different parameter measures.

Larry Brown (1940-2018)

Posted in Books, pictures, Statistics, University life with tags , , , , , , on February 21, 2018 by xi'an

Just learned a few minutes ago that my friend Larry Brown has passed away today, after fiercely fighting cancer till the end. My thoughts of shared loss and deep support first go to my friend Linda, his wife, and to their children. And to all their colleagues and friends at Wharton. I have know Larry for all of my career, from working on his papers during my PhD to being a temporary tenant in his Cornell University office in White Hall while he was mostly away in sabbatical during the academic year 1988-1989, and then periodically meeting with him in Cornell and then Wharton along the years. He and Linday were always unbelievably welcoming and I fondly remember many times at their place or in superb restaurants in Phillie and elsewhere.  And of course remembering just as fondly the many chats we had along these years about decision theory, admissibility, James-Stein estimation, and all aspects of mathematical statistics he loved and managed at an ethereal level of abstraction. His book on exponential families remains to this day one of the central books in my library, to which I kept referring on a regular basis… For certain, I will miss the friend and the scholar along the coming years, but keep returning to this book and have shared memories coming back to me as I will browse through its yellowed pages and typewriter style. Farewell, Larry, and thanks for everything!

Charles M. Stein [1920-2016]

Posted in Books, pictures, Statistics, University life with tags , , , , , , , , , on November 26, 2016 by xi'an

I have just heard that Charles Stein, Professor at Stanford University, passed away last night. Although the following image is definitely over-used, I truly feel this is the departure of a giant of statistics.  He has been deeply influential on the fields of probability and mathematical statistics, primarily in decision theory and approximation techniques. On the first field, he led to considerable changes in the perception of optimality by exhibiting the Stein phenomenon, where the aggregation of several admissible estimators of unrelated quantities may (and will) become inadmissible for the joint estimation of those quantities! Although the result can be explained by mathematical and statistical reasoning, it was still dubbed a paradox due to its counter-intuitive nature. More foundationally, it led to expose the ill-posed nature of frequentist optimality criteria and certainly contributed to the Bayesian renewal of the 1980’s, before the MCMC revolution. (It definitely contributed to my own move, as I started working on the Stein phenomenon during my thesis, before realising the fundamentally Bayesian nature of the domination results.)

“…the Bayesian point of view is often accompanied by an insistence that people ought to agree to a certain doctrine even without really knowing what this doctrine is.” (Statistical Science, 1986)

The second major contribution of Charles Stein was the introduction of a new technique for normal approximation that is now called the Stein method. It relies on a differential operator and produces estimates of approximation error in Central Limit theorems, even in dependent settings. While I am much less familiar with this aspect of Charles Stein’s work, I believe the impact it has had on the field is much more profound and durable than the Stein effect in Normal mean estimation.

(During the Vietnam War, he was quite active in the anti-war movement and the above picture from 2003 shows that his opinions had not shifted over time!) A giant truly has gone.

the philosophical importance of Stein’s paradox [a reply from the authors]

Posted in Books, pictures, Statistics, University life with tags , , , , , , , , , , on January 15, 2016 by xi'an

[In the wake of my comment on this paper written by three philosophers of Science, I received this reply from Olav Vassend.]

Thank you for reading our paper and discussing it on your blog! Our purpose with the paper was to give an introduction to Stein’s phenomenon for a philosophical audience; it was not meant to — and probably will not — offer a new and interesting perspective for a statistician who is already very familiar with Stein’s phenomenon and its extensive literature.

I have a few more specific comments:

1. We don’t rechristen Stein’s phenomenon as “holistic pragmatism.” Rather, holistic pragmatism is the attitude to frequentist estimation that we think is underwritten by Stein’s phenomenon. Since MLE is sometimes admissible and sometimes not, depending on the number of parameters estimated, the researcher has to take into account his or her goals (whether total accuracy or individual-parameter accuracy is more important) when picking an estimator. To a statistician, this might sound obvious, but to philosophers it’s a pretty radical idea.

2. “The part connecting Stein with Bayes again starts on the wrong foot, since it is untrue that any shrinkage estimator can be expressed as a Bayes posterior mean. This is not even true for the original James-Stein estimator, i.e., it is not a Bayes estimator and cannot be a Bayes posterior mean.”

That seems to depend on what you mean by a “Bayes estimator.” It is possible to have an empirical Bayes prior (constructed from the sample) whose posterior mean is identical to the original James-Stein estimator. But if you don’t count empirical Bayes priors as Bayesian, then you are right.

3. “And to state that improper priors “integrate to a number larger than 1” and that “it’s not possible to be more than 100% confident in anything”… And to confuse the Likelihood Principle with the prohibition of data dependent priors. And to consider that the MLE and any shrinkage estimator have the same expected utility under a flat prior (since, if they had, there would be no Bayes estimator!).”

I’m not sure I completely understand your criticisms here. First, as for the relation between the LP and data-dependent priors — it does seem to me that the LP precludes the use of data-dependent priors.  If you use data from an experiment to construct your prior, then — contrary to the LP — it will not be true that all the information provided by the experiment regarding which parameter is true is contained in the likelihood function, since some of the information provided by the experiment will also be in your prior.

Second, as to our claim that the ML estimator has the same expected utility (under the flat prior) as a shrinkage prior that it is dominated by—we incorporated this claim into our paper because it was an objection made by a statistician who read and commented on our paper. Are you saying the claim is false? If so, we would certainly like to know so that we can revise the paper to make it more accurate.

4. I was aware of Rubin’s idea that priors and utility functions (supposedly) are non-separable, but I didn’t (and don’t) quite see the relevance of that idea to Stein estimation.

5. “Similarly, very little of substance can be found about empirical Bayes estimation and its philosophical foundations.”

What we say about empirical Bayes priors is that they cannot be interpreted as degrees of belief; they are just tools. It will be surprising to many philosophers that priors are sometimes used in such an instrumentalist fashion in statistics.

6. The reason why we made a comparison between Stein estimation and AIC was two-fold: (a) for sociological reasons, philosophers are much more familiar with model selection than they are with, say, the LASSO or other regularized regression methods. (b) To us, it’s precisely because model selection and estimation are such different enterprises that it’s interesting that they have such a deep connection: despite being very different, AIC and shrinkage both rely on a bias-variance trade-off.

7. “I also object to the envisioned possibility of a shrinkage estimator that would improve every component of the MLE (in a uniform sense) as it contradicts the admissibility of the single component MLE!”

I don’t think our suggestion here contradicts the admissibility of single component MLE. The idea is just that if we have data D and D’ about parameters φ and φ’, then the estimates of both φ and φ’ can sometimes be improved if the estimation problems are lumped together and a shrinkage estimator is used. This doesn’t contradict the admissibility of MLE, because MLE is still admissible on each of the data sets for each of the parameters.

Again, thanks for reading the paper and for the feedback—we really do want to make sure our paper is accurate, so your feedback is much appreciated. Lastly, I apologize for the length of this comment.

Olav Vassend

the philosophical importance of Stein’s paradox

Posted in Books, pictures, Statistics, University life with tags , , , , , , , , , , on November 30, 2015 by xi'an

I recently came across this paper written by three philosophers of Science, attempting to set the Stein paradox in a philosophical light. Given my past involvement, I was obviously interested about which new perspective could be proposed, close to sixty years after Stein (1956). Paper that we should actually celebrate next year! However, when reading the document, I did not find a significantly innovative approach to the phenomenon…

The paper does not start in the best possible light since it seems to justify the use of a sample mean through maximum likelihood estimation, which only is the case for a limited number of probability distributions (including the Normal distribution, which may be an implicit assumption). For instance, when the data is Student’s t, the MLE is not the sample mean, no matter how shocking that might sounds! (And while this is a minor issue, results about the Stein effect taking place in non-normal settings appear much earlier than 1998. And earlier than in my dissertation. See, e.g., Berger and Bock (1975). Or in Brandwein and Strawderman (1978).)

While the linear regression explanation for the Stein effect is already exposed in Steve Stigler’s Neyman Lecture, I still have difficulties with the argument in that for instance we do not know the value of the parameter, which makes the regression and the inverse regression of parameter means over Gaussian observations mere concepts and nothing practical. (Except for the interesting result that two observations make both regressions coincide.) And it does not seem at all intuitive (to me) that imposing a constraint should improve the efficiency of a maximisation program… Continue reading

Bayesian propaganda?

Posted in Books, Kids, pictures, Statistics, University life with tags , , , , , , , , , on April 20, 2015 by xi'an

“The question is about frequentist approach. Bayesian is admissable [sic] only by wrong definition as it starts with the assumption that the prior is the correct pre-information. James-Stein beats OLS without assumptions. If there is an admissable [sic] frequentist estimator then it will correspond to a true objective prior.”

I had a wee bit of a (minor, very minor!) communication problem on X validated, about a question on the existence of admissible estimators of the linear regression coefficient in multiple dimensions, under squared error loss. When I first replied that all Bayes estimators with finite risk were de facto admissible, I got the above reply, which clearly misses the point, and as I had edited the OP question to include more tags, the edited version was reverted with a comment about Bayesian propaganda! This is rather funny, if not hilarious, as (a) Bayes estimators are indeed admissible in the classical or frequentist sense—I actually fail to see a definition of admissibility in the Bayesian sense—and (b) the complete class theorems of Wald, Stein, and others (like Jack Kiefer, Larry Brown, and Jim Berger) come from the frequentist quest for best estimator(s). To make my point clearer, I also reproduced in my answer the Stein’s necessary and sufficient condition for admissibility from my book but it did not help, as the theorem was “too complex for [the OP] to understand”, which shows in fine the point of reading textbooks!

top model choice week (#2)

Posted in Statistics, University life with tags , , , , , , , , , , , , on June 18, 2013 by xi'an

La Défense and Maison-Lafitte from my office, Université Paris-Dauphine, Nov. 05, 2011Following Ed George (Wharton) and Feng Liang (University of Illinois at Urbana-Champaign) talks today in Dauphine, Natalia Bochkina (University of Edinburgh) will  give a talk on Thursday, June 20, at 2pm in Room 18 at ENSAE (Malakoff) [not Dauphine!]. Here is her abstract:

2 am: Simultaneous local and global adaptivity of Bayesian wavelet estimators in nonparametric regression by Natalia Bochkina

We consider wavelet estimators in the context of nonparametric regression, with the aim of finding estimators that simultaneously achieve the local and global adaptive minimax rate of convergence. It is known that one estimator – James-Stein block thresholding estimator of T.Cai (2008) – achieves simultaneously both optimal rates of convergence but over a limited set of Besov spaces; in particular, over the sets of spatially inhomogeneous functions (with 1≤ p<2) the upper bound on the global rate of this estimator is slower than the optimal minimax rate.

Another possible candidate to achieve both rates of convergence simultaneously is the Empirical Bayes estimator of Johnstone and Silverman (2005) which is an adaptive estimator that achieves the global minimax rate over a wide rage of Besov spaces and Besov balls. The maximum marginal likelihood approach is used to estimate the hyperparameters, and it can be interpreted as a Bayesian estimator with a uniform prior. We show that it also achieves the adaptive local minimax rate over all Besov spaces, and hence it does indeed achieve both local and global rates of convergence simultaneously over Besov spaces. We also give an example of how it works in practice.