## dynamic mixtures and frequentist ABC

Posted in Statistics with tags , , , , , , , , , , , , , , , on November 30, 2022 by xi'an

This early morning in NYC, I spotted this new arXival by Marco Bee (whom I know from the time he was writing his PhD with my late friend Bernhard Flury) and found he has been working for a while on ABC related problems. The mixture model he considers therein is a form of mixture of experts, where the weights of the mixture components are not constant but functions on (0,1) of the entry as well. This model was introduced by Frigessi, Haug and Rue in 2002 and is often used as a benchmark for ABC methods, since it is missing its normalising constant as in e.g.

$f(x) \propto p(x) f_1(x) + (1-p(x)) f_2(x)$

even with all entries being standard pdfs and cdfs. Rather than using a (costly) numerical approximation of the “constant” (as a function of all unknown parameters involved), Marco follows the approximate maximum likelihood approach of my Warwick colleagues, Javier Rubio [now at UCL] and Adam Johansen. It is based on the [SAME] remark that under a uniform prior and using an approximation to the actual likelihood the MAP estimator is also the MLE for that approximation. The approximation is ABC-esque in that a pseudo-sample is generated from the true model (attached to a simulation of the parameter) and the pair is accepted if the pseudo-sample stands close enough to the observed sample. The paper proposes to use the Cramér-von Mises distance, which only involves ranks. Given this “posterior” sample, an approximation of the posterior density is constructed and then numerically optimised. From a frequentist view point, a direct estimate of the mode would be preferable. From my Bayesian perspective, this sounds like a step backwards, given that once a posterior sample is available, reconnecting with an approximate MLE does not sound highly compelling.

## a neat EM resolution

Posted in Books, Kids, Statistics, University life with tags , , , , , , on February 3, 2021 by xi'an

Read (and answered) this question on X validation about finding the maximum likelihood estimator of a 2×2 Gaussian covariance matrix when some observations are partly missing.  The neat thing is that, in this case, the maximisation step is identical to the maximum likelihood estimation of the 2×2 Gaussian covariance matrix by redefining the empirical covariance matrix into Z and maximising

$-n\log|\Sigma|-\text{trace}(Z\Sigma^{-1})$

in Σ. Nothing involved but fun to explain, nonetheless. (In my final exam this year, no student even approached the EM questions!)

## double if not exponential

Posted in Books, Kids, Statistics, University life with tags , , , , , , on December 10, 2020 by xi'an

In one of my last quizzes for the year, as the course is about to finish, I asked whether mean or median was the MLE for a double exponential sample of odd size, without checking for the derivation of the result, as I was under the impression it was a straightforward result. Despite being outside exponential families. As my students found it impossible to solve within the allocated 5 minutes, I had a look, could not find an immediate argument (!), and used instead this nice American Statistician note by Robert Norton based on the derivative being the number of observations smaller than θ minus the number of observations larger than θ.  This leads to the result as well as the useful counter-example of a range of MLE solutions when the number of observations is even.

Posted in Books, Kids, Statistics, University life with tags , , , , , , , , , , on December 5, 2017 by xi'an

I am a PhD student in biostatistics, and an avid reader of your work. I recently came across this blog post, where you review a text on statistical paradoxes, and I was struck by this section:

“For instance, the author considers the MLE being biased to be a paradox (p.117), while omitting the much more substantial “paradox” of the non-existence of unbiased estimators of most parameters—which simply means unbiasedness is irrelevant. Or the other even more puzzling “paradox” that the secondary MLE derived from the likelihood associated with the distribution of a primary MLE may differ from the primary. (My favourite!)”

I found this section provocative, but I am unclear on the nature of these “paradoxes”. I reviewed my stat inference notes and came across the classic example that there is no unbiased estimator for 1/p w.r.t. a binomial distribution, but I believe you are getting at a much more general result. If it’s not too much trouble, I would sincerely appreciate it if you could point me in the direction of a reference or provide a bit more detail for these two “paradoxes”.

The text is Chang’s Paradoxes in Scientific Inference, which I indeed reviewed negatively. To answer about the bias “paradox”, it is indeed a neglected fact that, while the average of any transform of a sample obviously is an unbiased estimator of its mean (!), the converse does not hold, namely, an arbitrary transform of the model parameter θ is not necessarily enjoying an unbiased estimator. In Lehmann and Casella, Chapter 2, Section 4, this issue is (just slightly) discussed. But essentially, transforms that lead to unbiased estimators are mostly the polynomial transforms of the mean parameters… (This also somewhat connects to a recent X validated question as to why MLEs are not always unbiased. Although the simplest explanation is that the transform of the MLE is the MLE of the transform!) In exponential families, I would deem the range of transforms with unbiased estimators closely related to the collection of functions that allow for inverse Laplace transforms, although I cannot quote a specific result on this hunch.

The other “paradox” is that, if h(X) is the MLE of the model parameter θ for the observable X, the distribution of h(X) has a density different from the density of X and, hence, its maximisation in the parameter θ may differ. An example (my favourite!) is the MLE of ||a||² based on x N(a,I) which is ||x||², a poor estimate, and which (strongly) differs from the MLE of ||a||² based on ||x||², which is close to (1-p/||x||²)²||x||² and (nearly) admissible [as discussed in the Bayesian Choice].

## empirical Bayes, reference priors, entropy & EM

Posted in Mountains, Statistics, Travel, University life with tags , , , , , , , , , , , on January 9, 2017 by xi'an

Klebanov and co-authors from Berlin arXived this paper a few weeks ago and it took me a quiet evening in Darjeeling to read it. It starts with the premises that led Robbins to introduce empirical Bayes in 1956 (although the paper does not appear in the references), where repeated experiments with different parameters are run. Except that it turns non-parametric in estimating the prior. And to avoid resorting to the non-parametric MLE, which is the empirical distribution, it adds a smoothness penalty function to the picture. (Warning: I am not a big fan of non-parametric MLE!) The idea seems to have been Good’s, who acknowledged using the entropy as penalty is missing in terms of reparameterisation invariance. Hence the authors suggest instead to use as penalty function on the prior a joint relative entropy on both the parameter and the prior, which amounts to the average of the Kullback-Leibler divergence between the sampling distribution and the predictive based on the prior. Which is then independent of the parameterisation. And of the dominating measure. This is the only tangible connection with reference priors found in the paper.

The authors then introduce a non-parametric EM algorithm, where the unknown prior becomes the “parameter” and the M step means optimising an entropy in terms of this prior. With an infinite amount of data, the true prior (meaning the overall distribution of the genuine parameters in this repeated experiment framework) is a fixed point of the algorithm. However, it seems that the only way it can be implemented is via discretisation of the parameter space, which opens a whole Pandora box of issues, from discretisation size to dimensionality problems. And to motivating the approach by regularisation arguments, since the final product remains an atomic distribution.

While the alternative of estimating the marginal density of the data by kernels and then aiming at the closest entropy prior is discussed, I find it surprising that the paper does not consider the rather natural of setting a prior on the prior, e.g. via Dirichlet processes.