## artificial EM

Posted in Books, Kids, R, Statistics, University life on October 28, 2020 by xi'an

When addressing an X validated question on the use of the EM algorithm for estimating a Normal mean, my first comment was that it was inappropriate, since there is no missing data structure to anchor it to. However, I then reflected upon the infinite number of ways to demarginalise the normal density into a joint density

$$\int f(x,z;\mu)\,\text{d}z = \varphi(x-\mu)$$

ranging from the slice-sampler representation, where $$f(x,z;μ)$$ is an indicator function, to a joint Normal distribution with an arbitrary correlation. While the joint Normal representation produces a sequence converging to the MLE, the slice representation utterly fails, as the indicator functions make any starting value of $$μ$$ a fixed point for EM.
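For illustration, here is a minimal sketch of the joint Normal version (my own toy implementation, assuming unit variances and correlation ρ): the E-step replaces each artificial missing datum z with its conditional expectation, the M-step averages the two sample means, and the resulting sequence contracts to the MLE, i.e., the sample mean.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=50)  # observed sample, true mu = 2

def em_normal_mean(x, rho, mu0=-5.0, n_iter=100):
    """EM for mu when (x, z) is bivariate Normal with means (mu, mu),
    unit variances and correlation rho; z is the artificial missing datum."""
    mu = mu0
    for _ in range(n_iter):
        # E-step: conditional expectation of each z_i given x_i and current mu
        z_hat = mu + rho * (x - mu)
        # M-step: the complete-data MLE of mu is the average of both means
        mu = 0.5 * (x.mean() + z_hat.mean())
    return mu

print(em_normal_mean(x, rho=0.5), x.mean())  # the two values agree
```

By contrast, with the slice (indicator) representation, the E-step leaves μ unchanged, so whatever the starting value, it is a fixed point of the algorithm.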

Incidentally, when quoting from Wikipedia on the purpose of the EM algorithm, the following passage

Finding a maximum likelihood solution typically requires taking the derivatives of the likelihood function with respect to all the unknown values, the parameters and the latent variables, and simultaneously solving the resulting equations.

struck me as confusing and possibly wrong, since it seems to suggest seeking a maximum jointly in the parameters and the latent variables, which does not produce the same value as maximising the observed likelihood.
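This can be checked numerically on a toy case (a two-component Normal mixture with one unknown mean, my own choice of example, not taken from the Wikipedia entry): maximising the complete likelihood jointly in the latent allocations z amounts to a "classification" likelihood, which is everywhere dominated by the observed likelihood and generally peaks at a different parameter value.

```python
import numpy as np

rng = np.random.default_rng(1)
# sample from an equal-weight mixture of N(0,1) and N(mu,1), true mu = 2
mu_true = 2.0
comp = rng.integers(0, 2, size=200)
x = rng.normal(loc=comp * mu_true, scale=1.0)

def phi(u):
    """standard Normal density"""
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

grid = np.linspace(0.0, 4.0, 401)
# observed log-likelihood: sum of the logs of the mixture density
obs = np.array([np.log(0.5 * phi(x) + 0.5 * phi(x - m)).sum() for m in grid])
# joint (profile) log-likelihood: each latent z_i set to its best value,
# i.e. the larger of the two component terms
joint = np.array([np.log(np.maximum(0.5 * phi(x), 0.5 * phi(x - m))).sum()
                  for m in grid])

mu_obs, mu_joint = grid[obs.argmax()], grid[joint.argmax()]
```

Since a sum of positive terms exceeds its largest term, the observed log-likelihood strictly dominates the joint one at every parameter value, and the two maximisers typically differ.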

## visualising bias and unbiasedness

Posted in Books, Kids, pictures, R, Statistics, University life on April 29, 2019 by xi'an

A question on X validated led me to wonder at the point made by Christopher Bishop in his Pattern Recognition and Machine Learning book about the MLE of the Normal variance being biased. This is illustrated by the above graph, which opposes the true (green) distribution of the data (made of two points) to the estimated (red) distribution. While it is true that the MLE under-estimates the variance on average, the pictures are cartoonish caricatures in that the discrepancy persists across all three replicas. When looking at 10⁵ replicas, rather than three, and at samples of size 10, rather than 2, the distinction between using the MLE (left) and the unbiased estimator of σ² (right) all but vanishes.

When looking more specifically at the case n=2, the humongous variability of the density estimate completely dwarfs the bias issue:

Even when averaging over all 10⁵ replications, the difference is hard to spot (and both estimations are more dispersed than the truth!):
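For the record, a bare-bones replication of the simulation (my own seed, no plots), contrasting the divisor-n MLE with the divisor-(n−1) unbiased estimator over 10⁵ replicas of samples of size 10:

```python
import numpy as np

rng = np.random.default_rng(2)
n, n_rep, sigma2 = 10, 10**5, 1.0
samples = rng.normal(0.0, np.sqrt(sigma2), size=(n_rep, n))

mle = samples.var(axis=1, ddof=0)       # divides by n: biased downwards
unbiased = samples.var(axis=1, ddof=1)  # divides by n - 1: unbiased

print(mle.mean(), unbiased.mean())  # about (n-1)/n * sigma2 and sigma2
```

The replication-to-replication standard deviation of either estimator is several times larger than the bias of the MLE, which is the point of the pictures above.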

## posterior distribution missing the MLE

Posted in Books, Kids, pictures, Statistics on April 25, 2019 by xi'an

An X validated question asked why the MLE is not necessarily (well) covered by a posterior distribution, even for a flat prior… Which in retrospect highlights the fact that the MLE (and the MAP) are invasive species in a Bayesian ecosystem: since they do not account for the dominating measure, they do not fare well under reparameterisation. (As a very much to the side comment, I also managed to write an almost identical and simultaneous answer to the first answer to the question.)
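A minimal numerical check of this lack of invariance, on a binomial toy example of my own choosing (flat prior on the probability p, versus the prior it induces on the logit scale):

```python
import numpy as np

s, n = 3, 10                       # binomial data: s successes out of n
p = np.linspace(1e-6, 1 - 1e-6, 200001)

log_lik = s * np.log(p) + (n - s) * np.log1p(-p)

# flat prior on p: the posterior mode coincides with the MLE s/n
map_p = p[log_lik.argmax()]

# reparameterise to the logit phi = log(p/(1-p)); the flat prior on p becomes
# non-flat on phi through the Jacobian dp/dphi = p(1-p)
log_post_phi = log_lik + np.log(p) + np.log1p(-p)
map_phi_in_p = p[log_post_phi.argmax()]   # mode in phi, mapped back to p
```

The MLE s/n is unchanged by the reparameterisation, while the posterior mode shifts to (s+1)/(n+2) once the Jacobian is accounted for, which is exactly the dominating-measure issue.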

## almost uniform but far from straightforward

Posted in Books, Kids, Statistics on October 24, 2018 by xi'an

A question on X validated about a [not exactly trivial] maximum likelihood for a triangular function led me to a fascinating case, as exposed by Olver in 1972 in The American Statistician. When considering an asymmetric triangular distribution on (0,þ), þ being fixed, the MLE for the location of the tip of the triangle is necessarily one of the observations [which was not the case in the original question on X validated], and it cannot be an order statistic of rank j that does not lie in the j-th interval of the uniform partition of (0,þ). Furthermore, there are opportunities for observing several global modes… In the X validated case of the symmetric triangular distribution over (0,θ), with ½θ as the tip of the triangle, I could not figure out an alternative to the pedestrian solution of looking separately at each of the (n+1) intervals where θ can stand and returning the associated maximum on that interval. Definitely a good (counter-)example about (in)sufficiency for class or exam!
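Here is a sketch of that pedestrian solution (my own implementation): the likelihood is smooth between the breakpoints θ = 2x₍ᵢ₎, where an observation crosses the tip, and since the likelihood decreases beyond θ = 2x₍ₙ₎, the search can be restricted to [x₍ₙ₎, 2x₍ₙ₎] and conducted interval by interval.

```python
import numpy as np

def loglik(theta, x):
    """log-likelihood of the symmetric triangular distribution on (0, theta)
    with tip at theta/2."""
    if theta <= x.max():
        return -np.inf
    dens = np.where(x <= theta / 2,
                    4 * x / theta**2, 4 * (theta - x) / theta**2)
    return np.log(dens).sum()

def mle_triangular(x, pts_per_interval=2000):
    """Pedestrian MLE: maximise over each smooth interval between the
    breakpoints theta = 2 x_(i), restricted to [x_(n), 2 x_(n)]."""
    breaks = np.unique(np.concatenate((
        [x.max()], 2 * x[(2 * x) > x.max()], [2 * x.max()])))
    best_theta, best_val = None, -np.inf
    for lo, hi in zip(breaks[:-1], breaks[1:]):
        grid = np.linspace(lo + 1e-9, hi, pts_per_interval)
        vals = np.array([loglik(t, x) for t in grid])
        if vals.max() > best_val:
            best_val, best_theta = vals.max(), grid[vals.argmax()]
    return best_theta, best_val

rng = np.random.default_rng(3)
# a sum of two Uniform(0,1) draws is symmetric triangular on (0,2)
x = rng.uniform(size=100) + rng.uniform(size=100)
theta_hat, ll_hat = mle_triangular(x)
```

Within each interval the set of observations below and above the tip is fixed, so a smooth one-dimensional maximisation (here a plain grid) suffices, with the breakpoints themselves included as candidates.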

## Implicit maximum likelihood estimates

Posted in Statistics on October 9, 2018 by xi'an

An ‘Og’s reader pointed me to this paper by Li and Malik, which made it to arXiv after not making it to NIPS. While the NIPS reviews were not particularly informative and strongly discordant, the authors point out in the comments that they are available for the sake of promoting discussion. (As made clear in earlier posts, I am quite supportive of this attitude! Disclaimer: I was not involved in an evaluation of this paper, neither for NIPS nor for another conference or journal!!) Although the paper does not seem to mention ABC in the setting of implicit likelihoods and generative models, there is a reference to the early (1984) paper by Peter Diggle and Richard Gratton that is often seen as the ancestor of ABC methods. The authors point out numerous issues with solutions proposed for parameter estimation in such implicit models. For instance, for GANs, they signal that “minimizing the Jensen-Shannon divergence or the Wasserstein distance between the empirical data distribution and the model distribution does not necessarily minimize the same between the true data distribution and the model distribution.” (Not mentioning the particular difficulty with Bayesian GANs.)

Their own solution is the implicit maximum likelihood estimator, which picks the value of the parameter θ bringing a simulated sample the closest to the observed sample. Closest in the sense of the Euclidean distance between the two samples. Or between the minimum of several simulated samples and the observed sample. (The modelling seems to imply the availability of n>1 observed samples.) They advocate using a stochastic gradient descent approach for finding the optimal parameter θ, which presupposes that the dependence between θ and the simulated samples is somewhat differentiable. (And this does not account for using a min, which would make differentiation close to impossible.)
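As a toy illustration of the implicit MLE principle (a Normal location model of my own choosing, nothing like the deep generative models of the paper), using common random numbers so that the simulated sample is a deterministic and differentiable function of θ:

```python
import numpy as np

rng = np.random.default_rng(4)
theta_true = 1.5
obs = np.sort(rng.normal(theta_true, 1.0, size=200))   # observed sample

# common random numbers: the simulator is x = theta + eps, with eps fixed
# across candidate values of theta, making the objective smooth in theta
eps = np.sort(rng.normal(0.0, 1.0, size=200))

def distance(theta):
    """Euclidean distance between simulated and observed (sorted) samples."""
    return np.linalg.norm((theta + eps) - obs)

grid = np.arange(0.0, 3.0, 0.01)
theta_hat = grid[np.argmin([distance(t) for t in grid])]
```

With the noise held fixed, the squared distance is quadratic in θ and the grid minimiser sits next to the mean of the differences between the two sorted samples, close to the true location.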
The paper then meanders into a lengthy discussion as to whether maximising the likelihood makes sense, with a rather naïve view on why using the empirical distribution in a Kullback-Leibler divergence does not make sense! What does not make sense, in my opinion, is considering the finite-sample approximation to the Kullback-Leibler divergence with the true distribution.