Archive for MLE

about paradoxes

Posted in Books, Kids, Statistics, University life on December 5, 2017 by xi'an

An email I received earlier today about statistical paradoxes:

I am a PhD student in biostatistics, and an avid reader of your work. I recently came across this blog post, where you review a text on statistical paradoxes, and I was struck by this section:

“For instance, the author considers the MLE being biased to be a paradox (p.117), while omitting the much more substantial “paradox” of the non-existence of unbiased estimators of most parameters—which simply means unbiasedness is irrelevant. Or the other even more puzzling “paradox” that the secondary MLE derived from the likelihood associated with the distribution of a primary MLE may differ from the primary. (My favourite!)”

I found this section provocative, but I am unclear on the nature of these “paradoxes”. I reviewed my stat inference notes and came across the classic example that there is no unbiased estimator for 1/p w.r.t. a binomial distribution, but I believe you are getting at a much more general result. If it’s not too much trouble, I would sincerely appreciate it if you could point me in the direction of a reference or provide a bit more detail for these two “paradoxes”.

The text is Chang’s Paradoxes in Scientific Inference, which I indeed reviewed negatively. To answer about the bias “paradox”, it is indeed a neglected fact that, while the average of any transform of a sample obviously is an unbiased estimator of its mean (!), the converse does not hold, namely, an arbitrary transform of the model parameter θ does not necessarily enjoy an unbiased estimator. In Lehmann and Casella, Chapter 2, Section 4, this issue is (just slightly) discussed. But essentially, transforms that lead to unbiased estimators are mostly the polynomial transforms of the mean parameters… (This also somewhat connects to a recent X validated question as to why MLEs are not always unbiased. Although the simplest explanation is that the transform of the MLE is the MLE of the transform, while unbiasedness is not preserved by nonlinear transforms!) In exponential families, I would deem the range of transforms with unbiased estimators closely related to the collection of functions that allow for inverse Laplace transforms, although I cannot quote a specific result on this hunch.
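
To make the binomial example from the email explicit, here is the standard argument, sketched with δ denoting an arbitrary estimator (a notation introduced here for illustration). If X ~ B(n,p), then

\mathbb{E}_p[\delta(X)] = \sum_{k=0}^{n} \delta(k) \binom{n}{k} p^k (1-p)^{n-k}

is a polynomial of degree at most n in p, hence bounded on (0,1), while 1/p is unbounded when p goes to zero. No choice of δ can therefore achieve E[δ(X)] = 1/p for all p, and the transforms of p that do enjoy an unbiased estimator in this model are exactly the polynomials of degree at most n.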

The other “paradox” is that, if h(X) is the MLE of the model parameter θ for the observable X, the distribution of h(X) has a density different from the density of X and, hence, its maximisation in the parameter θ may lead to a different value. An example (my favourite!) is the MLE of ||a||² based on x ~ N(a,I), which is ||x||², a poor estimate, and which (strongly) differs from the MLE of ||a||² based on ||x||², which is close to (1-p/||x||²)²||x||² and (nearly) admissible [as discussed in the Bayesian Choice].
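
A minimal numerical sketch of this second “paradox”, under the assumption that ||x||² is a noncentral chi-square variate with p degrees of freedom and noncentrality ||a||². The function names, the grid search and the truncation of the Poisson mixture at 200 terms are illustrative choices, not taken from any reference.

```python
import math

def chi2_logpdf(y, df):
    """Log-density of a central chi-square with df degrees of freedom."""
    half = df / 2.0
    return (half - 1.0) * math.log(y) - y / 2.0 - half * math.log(2.0) - math.lgamma(half)

def ncx2_logpdf(y, p, lam, terms=200):
    """Log-density of a noncentral chi-square(p, lam), written as a
    Poisson(lam / 2) mixture of central chi-squares with p + 2k degrees
    of freedom (series truncated at `terms`, an assumption)."""
    if lam == 0.0:
        return chi2_logpdf(y, p)
    logs = [
        -lam / 2.0 + k * math.log(lam / 2.0) - math.lgamma(k + 1.0)
        + chi2_logpdf(y, p + 2 * k)
        for k in range(terms)
    ]
    top = max(logs)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(v - top) for v in logs))

def secondary_mle(y, p, step=0.1):
    """Grid-search maximiser in lam = ||a||^2 of the likelihood of y = ||x||^2."""
    grid = [i * step for i in range(int(2 * y / step) + 1)]
    return max(grid, key=lambda lam: ncx2_logpdf(y, p, lam))

p = 10
for y in (5.0, 25.0, 100.0):
    # the primary MLE of ||a||^2 is y itself; compare with the secondary one
    print(y, secondary_mle(y, p))
```

The secondary MLE returned by this sketch shrinks the primary value ||x||², all the way to zero when ||x||² falls below p.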

empirical Bayes, reference priors, entropy & EM

Posted in Mountains, Statistics, Travel, University life on January 9, 2017 by xi'an

Klebanov and co-authors from Berlin arXived this paper a few weeks ago and it took me a quiet evening in Darjeeling to read it. It starts with the premises that led Robbins to introduce empirical Bayes in 1956 (although the paper does not appear in the references), where repeated experiments with different parameters are run. Except that it turns non-parametric in estimating the prior. And to avoid resorting to the non-parametric MLE, which is the empirical distribution, it adds a smoothness penalty function to the picture. (Warning: I am not a big fan of non-parametric MLE!) The idea seems to have been Good’s, who acknowledged that using the entropy as penalty is lacking in terms of reparameterisation invariance. Hence the authors suggest instead to use as penalty function on the prior a joint relative entropy on both the parameter and the prior, which amounts to the average of the Kullback-Leibler divergence between the sampling distribution and the predictive based on the prior. Which is then independent of the parameterisation. And of the dominating measure. This is the only tangible connection with reference priors found in the paper.

The authors then introduce a non-parametric EM algorithm, where the unknown prior becomes the “parameter” and the M step means optimising an entropy in terms of this prior. With an infinite amount of data, the true prior (meaning the overall distribution of the genuine parameters in this repeated experiment framework) is a fixed point of the algorithm. However, it seems that the only way it can be implemented is via discretisation of the parameter space, which opens a whole Pandora’s box of issues, from discretisation size to dimensionality problems. And to motivating the approach by regularisation arguments, since the final product remains an atomic distribution.
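
To fix ideas on what such a discretised EM looks like, here is a minimal sketch of the plain (unpenalised) EM update for prior weights on a fixed grid. It is not the authors' algorithm, since it omits their entropy penalty, and the normal sampling model, the grid, the simulated data and all function names are assumptions made for illustration.

```python
import math
import random

def normal_pdf(x, theta):
    """Density of N(theta, 1) at x."""
    return math.exp(-0.5 * (x - theta) ** 2) / math.sqrt(2.0 * math.pi)

def em_step(weights, lik):
    """One EM update of the prior weights, with lik[i][j] = f(x_i | theta_j)."""
    n = len(lik)
    new = [0.0] * len(weights)
    for row in lik:
        marg = sum(w * f for w, f in zip(weights, row))
        for j, (w, f) in enumerate(zip(weights, row)):
            # posterior probability of grid point j given x_i, averaged over i
            new[j] += w * f / marg / n
    return new

def log_marginal(weights, lik):
    """Log-likelihood of the sample under the discrete prior."""
    return sum(math.log(sum(w * f for w, f in zip(weights, row))) for row in lik)

random.seed(1)
# repeated experiments: each observation has its own parameter (hypothetical setup)
thetas = [random.choice([-2.0, 2.0]) for _ in range(200)]
xs = [random.gauss(t, 1.0) for t in thetas]

grid = [-5.0 + 0.5 * j for j in range(21)]  # discretised parameter space
lik = [[normal_pdf(x, t) for t in grid] for x in xs]

weights = [1.0 / len(grid)] * len(grid)
history = [log_marginal(weights, lik)]
for _ in range(100):
    weights = em_step(weights, lik)
    history.append(log_marginal(weights, lik))
```

The output is by construction an atomic distribution on the grid, and the cost grows with the grid size, which is where the dimensionality issue mentioned above bites.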

While the alternative of estimating the marginal density of the data by kernels and then aiming at the closest entropy prior is discussed, I find it surprising that the paper does not consider the rather natural option of setting a prior on the prior, e.g. via Dirichlet processes.

twilight zone [of statistics]

Posted in Books, pictures, R, Statistics, University life on February 26, 2016 by xi'an

[figure: mixture with unknown means]

“I have decided that mixtures, like tequila, are inherently evil and should be avoided at all costs.” L. Wasserman

Larry Wasserman once remarked that finite mixtures were like the twilight zone of statistics, thanks to the numerous idiosyncrasies associated with such models. And George Casella had similar strong reservations about mixture estimation. Avi Feller and co-authors [including Natesh Pillai] have just arXived a paper on this topic, exhibiting shocking (!) properties of the MLE! Their core example is a mixture of two normal distributions with known common variance and known weight different from 0.5, which ensures identifiability. This is a favourite example of mine that we used for instance in our book Introducing Monte Carlo methods with R. If only because we can plot the likelihood and posterior surfaces. (Warning: I wrote those notes on an earlier version of the paper, so mileage may vary in terms of accuracy!)

The “shocking” discovery in the paper is that the MLE is wrong as often as not in selecting the sign of the difference Δ between both means, with an additional accumulation point at zero. The global mode may thus be in the wrong place for small enough sample sizes. And even for larger sizes: when the difference between the means is small the likelihood is likely to be unimodal with a mode quite close to zero. (An interesting remark is that the likelihood derivative is always zero at Δ=0 when considering the special case of both means equal to -Δ and to πΔ/(1-π), respectively, which implies that the overall mean of the mixture is equal to zero. A potential connection with our reparameterisation paper, maybe?)
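
The remark on the zero derivative, and the sign issue itself, can be explored numerically. The sketch below assumes unit variances, writes the known weight π as `w`, and uses a simulation design (Δ=0.3, n=50, 100 replications, a grid search for the MLE) that is purely illustrative and not the one from the paper.

```python
import math
import random

def loglik(delta, xs, w):
    """Log-likelihood of w N(-delta, 1) + (1 - w) N(w delta / (1 - w), 1),
    up to an additive constant; the overall mean is zero for every delta."""
    mu1, mu2 = -delta, w * delta / (1.0 - w)
    total = 0.0
    for x in xs:
        total += math.log(
            w * math.exp(-0.5 * (x - mu1) ** 2)
            + (1.0 - w) * math.exp(-0.5 * (x - mu2) ** 2)
        )
    return total

def grid_mle(xs, w, lo=-3.0, hi=3.0, steps=121):
    """Maximiser of the log-likelihood over a regular grid of delta values."""
    grid = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    return max(grid, key=lambda d: loglik(d, xs, w))

def simulate(delta, w, n, rng):
    """Draw n observations from the zero-mean two-component mixture."""
    mu1, mu2 = -delta, w * delta / (1.0 - w)
    return [rng.gauss(mu1 if rng.random() < w else mu2, 1.0) for _ in range(n)]

rng = random.Random(2016)
w, true_delta, n = 0.3, 0.3, 50

# central finite difference of the log-likelihood at delta = 0
xs = simulate(true_delta, w, n, rng)
h = 1e-5
score_at_zero = (loglik(h, xs, w) - loglik(-h, xs, w)) / (2.0 * h)

# frequency of wrong-sign and exactly-zero estimates over replications
estimates = [grid_mle(simulate(true_delta, w, n, rng), w) for _ in range(100)]
wrong_sign = sum(e < 0 for e in estimates) / len(estimates)
at_zero = sum(e == 0 for e in estimates) / len(estimates)
print(score_at_zero, wrong_sign, at_zero)
```

Varying `true_delta` and `n` in this sketch gives a feel for how the wrong-sign frequency evolves as the components separate or the sample grows.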

The alternative proposed by Avi and his co-authors is to proceed through moments, i.e., to revert to Pearson (1892). There are however difficulties with this approach, first and foremost the non-uniqueness of the moment equations used to estimate Δ. For instance, the second cumulant equation chosen by the authors is not always defined, as opposed to the third cumulant equation (why not use this third cumulant, then?). Which does not always produce the right sign… But, in a strange twist, the authors turn those deficiencies into signals for both pathologies (wrong sign and “pile-up” at zero).

“…the grid bootstrap yields an exact p-value for any valid test statistic.”

The most important issue in this framework being the estimation of the parameters, the authors opt for an approach based on tests, which is definitely surprising given the well-known deficiencies of standard tests in mixtures. The test chosen here is a Wald test with a statistic equal to the χ² version of the first cumulant differences. I am surprised that the χ² approximation works in such an unfriendly setting. And I do not understand how the grid is used, unless a certain degree of approximation is accepted, which takes us back to the “dark ages” of imposing a minimal distance Δ to achieve consistency, as in Ghosh and Sen (1985).

“..our concern about sign error is trivial in the Bayesian setting: the global mode is simply a poor summary of a multi-modal posterior. More broadly, the weak identification issues we highlight in this paper are not necessarily relevant to a strict Bayesian.”

A priori, I do not think pathologies of the MLE always transfer to Bayes estimators, unless one uses the MAP as a [poor] estimator. But using the MAP is not necessary since posterior means are meaningful in this identified setting, where label switching should not occur. However, running the same experiments with a Gaussian prior on both means and using the posterior mean as my estimator, I did obtain the same pathology of Bayes estimates [also produced in the supplementary material] not concentrating on the true value of the difference, but putting weight on the opposite value and at zero. Using a less standard prior inspired by David Rossell’s talk on non-local priors two weeks ago, which avoids a neighbourhood of zero, I did not get a much different picture as illustrated below:

[two figures: muminusmux0]

Overall, I remain somewhat uncertain as to what to conclude from this pathological behaviour. When both means are close enough, the sign of the difference is often estimated wrongly. But that could simply mean that the means are not significantly different, for that sample size…

post-grading weekend

Posted in Kids, pictures, Statistics, University life on January 19, 2015 by xi'an

Now my grading is over, I can reflect on the unexpected difficulties in the mathematical statistics exam. I knew that the first question in the multiple choice exercise, borrowed from Cross Validated, was going to be quasi-impossible and indeed only one student out of 118 managed to find the right solution. More surprisingly, most students did not manage to solve the (absence of) MLE when observing that n unobserved exponential Exp(λ) variates were larger than a fixed bound δ. I was also amazed that they did poorly on a N(0,σ²) setup, failing to see that

\mathbb{E}[\mathbb{I}(X_1\le -1)] = \Phi(-1/\sigma)

and to determine an unbiased estimator that can be improved by Rao-Blackwellisation. No student reached the conditioning part. And a rather frequent mistake, more understandable given the limited exposure they had to Bayesian statistics: many confused the parameter λ with the observation x in the prior, writing

\pi(\lambda|x) \propto \lambda \exp\{-\lambda x\} \times x^{a-1} \exp\{-bx\}

instead of

\pi(\lambda|x) \propto \lambda \exp\{-\lambda x\} \times \lambda^{a-1} \exp\{-b\lambda\}

hence could not derive a proper posterior.
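
For completeness, and assuming a single Exp(λ) observation x with a Ga(a,b) prior on λ (which is what the two displays above suggest), collecting the powers of λ in the correct product gives

\pi(\lambda|x) \propto \lambda^{a} \exp\{-(b+x)\lambda\}

that is, a Ga(a+1,b+x) posterior, with posterior mean (a+1)/(b+x). In the mistaken version, the factor in x alone is a constant in λ, so the hyperparameters a and b drop out of the posterior altogether.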

paradoxes in scientific inference: a reply from the author

Posted in Books, Statistics, University life on December 26, 2012 by xi'an

(I received the following set of comments from Mark Chang after publishing a review of his book on the ‘Og. Here they are, verbatim, except for a few editing and spelling changes. It’s a huge post as Chang reproduces all of my comments as well.)

Professor Christian Robert reviewed my book: “Paradoxes in Scientific Inference”. I found that the majority of his criticisms had no foundation and were based on his truncated way of reading. I gave point-by-point responses below. For clarity, I kept his original comments.

Robert’s Comments: This CRC Press book was sent to me for review in CHANCE: Paradoxes in Scientific Inference is written by Mark Chang, vice-president of AMAG Pharmaceuticals. The topic of scientific paradoxes is one of my primary interests and I have learned a lot by looking at Lindley-Jeffreys and Savage-Dickey paradoxes. However, I did not find a renewed sense of excitement when reading the book. The very first (and maybe the best!) paradox with Paradoxes in Scientific Inference is that it is a book from the future! Indeed, its copyright year is 2013 (!), although I got it a few months ago. (Not mentioning here the cover mimicking Escher’s “paradoxical” pictures with dices. A sculpture due to Shigeo Fukuda and apparently not quoted in the book. As I do not want to get into another dice cover polemic, I will abstain from further comments!)

Thank you, Robert for reading and commenting on part of my book. I had the same question on the copyright year being 2013 when it was actually published in previous year. I believe the same thing had happened to my other books too. The incorrect year causes confusion for future citations. The cover was designed by the publisher. They gave me few options and I picked the one with dices. I was told that the publisher has the copyright for the art work. I am not aware of the original artist.

estimating the measure and hence the constant

Posted in pictures, Running, Statistics, University life on December 6, 2012 by xi'an

[photo: Dawn in Providence, Nov. 30, 2012]

As mentioned on my post about the final day of the ICERM workshop, Xiao-Li Meng addresses this issue of “estimating the constant” in his talk. It is even his central theme. Here are his (2011) slides as he sent them to me (with permission to post them!):

He therefore points out in slide #5 that the likelihood cannot be expressed in terms of the normalising constant, because this is not a free parameter. Right! His explanation for the approximation of the unknown constant is then to replace the known but intractable dominating measure (intractable in the sense that one cannot compute the integral) with a discrete (or non-parametric) measure supported by the sample. Because the measure is defined up to a constant, this leads to sample weights being proportional to the inverse density. Of course, this representation of the problem is open to criticism: why focus only on measures supported by the sample? The fact that it is the MLE is used as an argument in Xiao-Li’s talk, but this can alternatively be seen as a drawback: I remember reviewing Dankmar Böhning’s Computer-Assisted Analysis of Mixtures and being horrified when discovering this feature! I am currently more agnostic since this appears as an alternative version of empirical likelihood. There are still questions about the measure estimation principle: for instance, when handling several samples from several distributions, why should they all contribute to a single estimate of μ rather than to a product of measures? (Maybe because their models are all dominated by the same measure μ.) Now, getting back to my earlier remark, and as a possible answer to Larry’s question, there could well be a Bayesian version of the above, avoiding the rough empirical likelihood via Gaussian or Dirichlet process prior modelling.
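
As an illustration of the several-samples case, here is a minimal sketch of the fixed-point equations one gets when a single discrete measure supported by the pooled sample is estimated by maximum likelihood. The two Gaussian targets, the sample sizes and the number of iterations are assumptions of mine for the sake of the example and are not taken from the slides. Since the measure is only defined up to a constant, only the ratio of the two constants is estimated.

```python
import math
import random

def q1(x):
    """Unnormalised N(0, 1) density; its constant is sqrt(2 pi)."""
    return math.exp(-0.5 * x * x)

def q2(x):
    """Unnormalised N(1, 2^2) density; its constant is 2 sqrt(2 pi)."""
    return math.exp(-0.125 * (x - 1.0) ** 2)

random.seed(0)
n1 = n2 = 2000
pooled = [random.gauss(0.0, 1.0) for _ in range(n1)]
pooled += [random.gauss(1.0, 2.0) for _ in range(n2)]
v1 = [q1(x) for x in pooled]
v2 = [q2(x) for x in pooled]

c1, c2 = 1.0, 1.0
for _ in range(100):
    # mass given to each pooled point: inverse of the pooled density there
    mass = [1.0 / (n1 * a / c1 + n2 * b / c2) for a, b in zip(v1, v2)]
    # each constant is the integral of q_j against the estimated measure
    new_c1 = sum(a * m for a, m in zip(v1, mass))
    new_c2 = sum(b * m for b, m in zip(v2, mass))
    # fix the arbitrary scale by setting c1 = 1
    c1, c2 = 1.0, new_c2 / new_c1

ratio = c2  # estimate of c2 / c1, whose true value is 2 in this toy setup
print(ratio)
```

With a single sample, the mass at each point reduces to the inverse of the unnormalised density, up to a constant, which is the weighting described above.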

bounded normal mean

Posted in R, Statistics, University life on November 25, 2011 by xi'an

A few days ago, one of my students, Jacopo Primavera (from La Sapienza, Roma) presented his “reading the classic” paper, namely the terrific bounded normal mean paper by my friends George Casella and Bill Strawderman (1981, Annals of Statistics). Even though I knew this paper quite well, having read (and studied) it myself many times, starting in 1987 in Purdue with Mary Ellen Bock, it was a pleasure to spend another hour on it, as I came up with new perspectives and new questions. Above are my scribbled notes on the back of the [Epson] beamer documentation. One such interesting question is whether or not it is possible to devise a computer code that would [approximately] produce the support of the least favourable prior for a given bound m (in a reasonable time). Another open question is to find the limiting bounds for which a 2 point, a 3 point, &tc., support prior is the least favourable prior. This was established in Casella and Strawderman for bounds less than 1.08 and for bounds between 1.4 and 1.6, but I am not aware of other results in that direction… Here are the slides used by Jacopo:
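
Returning to the computer code question raised above, a first and very partial step is to check numerically whether the two-point prior on {-m, m} can be least favourable for a given m. The sketch below relies on the facts that the Bayes estimator under this prior is m tanh(mx) for x ~ N(θ,1), and that the prior is least favourable when the risk of this estimator over [-m, m] is maximal at the endpoints. The quadrature settings, the grid over θ and the function names are illustrative assumptions, and nothing here reproduces the argument of the paper.

```python
import math

def risk_two_point_bayes(theta, m, half_width=8.0, steps=1600):
    """Quadratic risk at theta of delta(x) = m tanh(m x) when x ~ N(theta, 1),
    computed by the trapezoidal rule on [theta - 8, theta + 8]."""
    h = 2.0 * half_width / steps
    total = 0.0
    for i in range(steps + 1):
        x = theta - half_width + i * h
        weight = 0.5 if i in (0, steps) else 1.0
        dens = math.exp(-0.5 * (x - theta) ** 2) / math.sqrt(2.0 * math.pi)
        total += weight * (m * math.tanh(m * x) - theta) ** 2 * dens
    return total * h

def two_point_prior_passes(m, grid_size=50, tol=1e-9):
    """True when the risk on a grid of [0, m) never exceeds the risk at m
    (the risk function is symmetric in theta, so [0, m] is enough)."""
    edge = risk_two_point_bayes(m, m)
    inner = max(risk_two_point_bayes(m * i / grid_size, m) for i in range(grid_size))
    return inner <= edge + tol

for m in (1.0, 1.5):
    print(m, two_point_prior_passes(m))
```

Extending this check to three or more support points would require optimising over both locations and weights, which is where the question of a reasonable computing time comes in.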