Archive for EM algorithm

artificial EM

Posted in Books, Kids, R, Statistics, University life on October 28, 2020 by xi'an

When addressing an X validated question on the use of the EM algorithm for estimating a Normal mean, my first comment was that it was inappropriate since there is no missing data structure to anchor it to. However I then reflected upon the infinite number of ways to demarginalise the normal density into a joint density

∫ f(x,z;μ) dz = φ(x−μ)

from the slice-sampler representation, where f(x,z;μ) is an indicator function, to a joint Normal distribution with an arbitrary correlation. While the joint Normal representation produces a sequence converging to the MLE, the slice representation utterly fails, as the indicator functions make any starting value of μ a fixed point for EM.
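To make this concrete, here is a quick R sketch (with an arbitrary correlation ρ=0.5 and made-up data, both of my own choosing) of the EM recursion induced by the joint Normal demarginalisation: the E-step replaces each z by its conditional expectation μ+ρ(x−μ) and the M-step averages the observed and imputed sample means.

```r
# a toy sketch, not from the post: EM for a Normal mean when phi(x-mu) is
# demarginalised into a bivariate Normal f(x,z;mu) with means (mu,mu), unit
# variances, and correlation rho, whose x-marginal is indeed phi(x-mu)
set.seed(1)
x   <- rnorm(100, mean = 2)      # made-up observed sample, true mean 2
rho <- 0.5                       # arbitrary correlation of the joint
mu  <- 10                        # deliberately poor starting value
for (t in 1:50) {
  z  <- mu + rho * (x - mu)      # E-step: E[z_i | x_i, mu]
  mu <- (mean(x) + mean(z)) / 2  # M-step: maximiser of the complete-data likelihood
}
c(EM = mu, MLE = mean(x))        # the EM sequence converges to the sample mean
```

Replacing the joint Normal with the slice representation leaves μ stuck at its starting value, in line with the fixed-point remark above.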

Incidentally, when quoting from Wikipedia on the purpose of the EM algorithm, the following passage

Finding a maximum likelihood solution typically requires taking the derivatives of the likelihood function with respect to all the unknown values, the parameters and the latent variables, and simultaneously solving the resulting equations.

struck me as confusing and possibly wrong, since it seems to suggest seeking a maximum in both the parameters and the latent variables. Which does not produce the same value as maximising the observed likelihood.
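As a purely illustrative example of the distinction (hypothetical, not taken from the Wikipedia entry), compare EM on a two-component Normal mixture with the “classification EM” variant that maximises over the latent allocations instead of averaging over them: the two recursions settle at different means.

```r
# toy comparison of my own: EM averages over the latent allocations, while
# jointly maximising over them ("classification EM") yields a different limit
set.seed(3)
x <- c(rnorm(50, -1), rnorm(50, 2))     # equal-weight mixture of N(-1,1) and N(2,1)
fit <- function(x, hard = FALSE, iter = 200) {
  mu <- c(-2, 3)                        # starting means
  for (t in 1:iter) {
    w <- dnorm(x, mu[1]) / (dnorm(x, mu[1]) + dnorm(x, mu[2]))  # P(z_i = 1 | x_i)
    if (hard) w <- as.numeric(w > 0.5)  # maximise over z_i rather than average
    mu <- c(sum(w * x) / sum(w), sum((1 - w) * x) / sum(1 - w))
  }
  mu
}
rbind(EM = fit(x), joint.maximisation = fit(x, hard = TRUE))  # different limits
```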

mining gold [ABC in PNAS]

Posted in Books, Statistics on March 13, 2020 by xi'an

Johann Brehmer and co-authors have just published a paper in PNAS entitled “Mining gold from implicit models to improve likelihood-free inference”. (Besides the pun about mining gold, the paper also involves techniques named RASCAL and SCANDAL, standing respectively for Ratio And SCore Approximate Likelihood ratio and SCore-Augmented Neural Density Approximates Likelihood!) This setup is not ABC per se in that their simulator is used both to generate training data and to construct a tractable surrogate model, exploiting Geyer’s (1994) classification trick of expressing the likelihood ratio as the optimal classification ratio when facing two equal-size samples, one from each density.
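As a reminder of how the trick operates, here is a bare-bones R illustration on Normal densities of my own choosing, where a logistic classifier trained on two equal-size samples recovers the likelihood ratio through its odds d(x)/(1−d(x)).

```r
# a minimal sketch (toy densities of my own) of Geyer's (1994) trick: with
# equal-size samples from p0 and p1, the optimal classifier satisfies
# d(x) = p1(x)/(p0(x)+p1(x)), so its odds estimate the likelihood ratio p1/p0
set.seed(2)
n   <- 1e4
x0  <- rnorm(n, 0)                            # sample from p0 = N(0,1)
x1  <- rnorm(n, 1)                            # sample from p1 = N(1,1)
dat <- data.frame(x = c(x0, x1), y = rep(0:1, each = n))
fit <- glm(y ~ x, family = binomial, data = dat)   # logistic classifier
xx  <- seq(-2, 3, by = 1)
d   <- predict(fit, data.frame(x = xx), type = "response")
cbind(classifier = d / (1 - d),               # estimated likelihood ratio
      exact      = dnorm(xx, 1) / dnorm(xx, 0))
```

In this location-shift example the log likelihood ratio is linear in x, so a linear logistic regression happens to be well-specified; a more flexible classifier (such as the neural networks used in the paper) is needed in general.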

“For all these inference strategies, the augmented data is particularly powerful for enhancing the power of simulation-based inference for small changes in the parameter θ.”

Brehmer et al. argue that “the most important novel contribution that differentiates our work from the existing methods is the observation that additional information can be extracted from the simulator, and the development of loss functions that allow us to use this “augmented” data to more efficiently learn surrogates for the likelihood function.” Rather than starting from a statistical model, they also seem to use a scientific simulator made of multiple layers of latent variables z, where

x = F⁰(u⁰, z¹, θ), z¹ = G¹(u¹, z²), z² = G²(u², z³), …

although they also call the marginal of x, p(x|θ), an (intractable) likelihood.

“The integral of the log is not the log of the integral!”

The central notion behind the improvement is a form of Rao-Blackwellisation, exploiting the simulated z’s. Joint score functions and joint likelihood ratios are then available. Ignoring biases, the authors demonstrate that the closest approximations to the joint likelihood ratio and the joint score function that only depend on x are the actual likelihood ratio and the actual score function, respectively. Which sounds like an older EM result, except that the roles of estimate and target quantity are somehow inverted: one is approximating the marginal with the joint, while the marginal is the “best” approximation of the joint. But in the implementation of the method, an estimate of the (observed and intractable) likelihood ratio is indeed produced by minimising an empirical loss based on two simulated samples. Learning this estimate ê(x) then allows one to use it for the actual data. It however requires fitting a new ê(x) for each pair of parameters. Providing as well an estimator of the likelihood p(x|θ). (Hence the SCANDAL!!!) A second type of approximation of the likelihood starts from the approximate value of the likelihood p(x|θ⁰) at a fixed value θ⁰ and expands it locally as an exponential family shift, with the score t(x|θ⁰) as sufficient statistic.
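The Rao-Blackwellisation property is easy to check on a toy one-layer model of my own (not the paper’s simulator): regressing the tractable joint ratio, which only involves the latent z, onto x recovers the marginal ratio, since its conditional expectation given x under θ⁰ is exactly p(x|θ¹)/p(x|θ⁰).

```r
# toy check, my own model: with z ~ N(theta,1) and x | z ~ N(z,1), the joint
# ratio p(x,z|th1)/p(x,z|th0) reduces to dnorm(z,th1)/dnorm(z,th0), and its
# regression on x approximates the marginal ratio p(x|th1)/p(x|th0)
set.seed(4)
th0 <- 0; th1 <- 1; n <- 1e4
z  <- rnorm(n, th0)                   # latent layer, simulated under theta_0
x  <- rnorm(n, z)                     # observed layer
rj <- dnorm(z, th1) / dnorm(z, th0)   # joint likelihood ratio (x|z factor cancels)
fit <- loess(rj ~ x)                  # best x-only approximation, in L2
xx  <- seq(-1, 2, by = 1)
cbind(regression = predict(fit, data.frame(x = xx)),
      exact      = dnorm(xx, th1, sqrt(2)) / dnorm(xx, th0, sqrt(2)))
```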

I find the paper definitely interesting even though it requires the representation of the (true) likelihood as a marginalisation over multiple layers of latent variables z. And does not provide an evaluation of the error involved in the process when the model is misspecified. As a minor supplementary appeal of the paper, the use of an asymmetric Galton quincunx to illustrate an intractable array of latent variables will certainly induce me to exploit it in projects and courses!

[Disclaimer: I was not involved in the PNAS editorial process at any point!]

alternatives to EM

Posted in Books, Statistics on January 30, 2019 by xi'an

In an arXived preprint submitted to Computational Statistics & Data Analysis, Chan, Han, and Lim study alternatives to EM for latent class models, that is, mixtures of products of Multinomials. (First occurrence of an indicator function being called the “Iverson bracket function”!) The introduction is fairly extensive given this most studied model. The criticisms of EM levelled by the authors are that (a) it does not produce an evaluation of the estimation error, which does not sound correct; and (b) the convergence is slow, which is also rather misleading as my [low dimensional] experience with mixtures is that it moves very quickly, and apparently linearly, to the vicinity of one of the modes. The argument in favour of alternative non-linear optimisation approaches is that they can achieve quadratic convergence. One solution is a projected quasi-Newton method, based on a quadratic approximation to the target. With some additional intricacies that make the claim of being “way easier than EM algorithm” somewhat specious. The second approach proposed in the paper is sequential quadratic programming, which incorporates the Lagrange multiplier in the target. While the different simulations in the paper show that EM may indeed call for a much larger number of iterations, the likelihoods obtained are all comparable.
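For readers unfamiliar with the model, here is what the standard EM iteration looks like in this setting, on a small two-class example with binary items of my own making (so a mixture of products of Bernoullis rather than general Multinomials).

```r
# compact sketch (toy setting of my own, not the paper's code) of EM for a
# two-class latent class model: each observation picks a class z, then answers
# J binary items independently with class-specific probabilities
set.seed(5)
n <- 500; J <- 4
truth <- rbind(rep(0.8, J), rep(0.2, J))      # item probabilities per class
cls <- sample(1:2, n, replace = TRUE)
y <- matrix(rbinom(n * J, 1, truth[cls, ]), n, J)
pi1 <- 0.5                                    # weight of class 1
p <- matrix(runif(2 * J, 0.3, 0.7), 2, J)     # initial item probabilities
for (t in 1:200) {
  # E-step: responsibility of class 1 given the current parameters
  l1 <- pi1       * apply(dbinom(y, 1, matrix(p[1, ], n, J, byrow = TRUE)), 1, prod)
  l2 <- (1 - pi1) * apply(dbinom(y, 1, matrix(p[2, ], n, J, byrow = TRUE)), 1, prod)
  w <- l1 / (l1 + l2)
  # M-step: weighted frequencies
  pi1    <- mean(w)
  p[1, ] <- colSums(w * y) / sum(w)
  p[2, ] <- colSums((1 - w) * y) / sum(1 - w)
}
round(p, 2)   # compare with truth, up to label switching
```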

clustering dynamical networks

Posted in pictures, Statistics, University life on June 5, 2018 by xi'an


Yesterday I attended a presentation by Catherine Matias on dynamic graph structures, as she was giving a plenary talk at the 50th French statistical meeting, conveniently located a few blocks away from my office at ENSAE-CREST, in the nicely futuristic buildings of the EDF campus, which are supposed to represent cogs according to the architect, but which remind me more of the gas holders once so common in the UK! (The E of EDF stands for electricity, but the original public company handled both gas and electricity.) This was primarily a survey of the field, which is much more diverse and multifaceted than I realised, even though I had seen some recent developments by Antonietta Mira and her co-authors, and had refereed a thesis on temporal networks at Ca’Foscari by Matteo Iacopini, whose defence I will attend in early July. The difficulty in the approaches covered by Catherine lies with the amount and complexity of the latent variables induced by the models superimposed on the data. In her paper with Christophe Ambroise, she followed a variational EM approach. From my spectator perspective, I wondered at using ABC instead, which is presumably costly when the data size grows in space or in time. And at using tensor structures as in Matteo’s thesis. This reminded me as well of Luke Bornn’s modelling of basketball games, following each player in real time throughout the game. (Which does not prevent the existence of latent variables.) But more vaguely and speculatively I also wondered at the meaning of the chosen models, which try to represent “everything” in the observed process, which seems doomed from the start given the heterogeneity of the data. While reaching my Keynesian pessimistic low-point, which happens rather quickly!, one could hope for projection techniques, towards reducing the dimension of the data of interest and of the parameters required by the model.

interdependent Gibbs samplers

Posted in Books, Statistics, University life on April 27, 2018 by xi'an

Mark Kozdoba and Shie Mannor just arXived a paper on an approach to accelerate a Gibbs sampler, with title “interdependent Gibbs samplers”. In fact, it presents rather strong similarities with our SAME algorithm. More of the same, as Adam Johansen (Warwick) entitled one of his papers! The paper indeed suggests multiplying replicas of latent variables (e.g., a hidden path for an HMM) in an artificial model. And as in our 2002 paper with Arnaud Doucet and Simon Godsill, the focus here is on maximum likelihood estimation (of the genuine parameters, not of the latent variables). Along with the argument that the resulting pseudo-posterior is akin to a posterior with a powered likelihood. And a link with the EM algorithm. And an HMM application.

“The generative model consist of simply sampling the parameters, and then sampling m independent copies of the paths”

If anything this proposal is less appealing than SAME because it aims directly at the powered likelihood, rather than utilising an annealed sequence of powers that allows for a primary exploration of the whole parameter space before entering the trapping vicinity of a mode. Which makes me fail to see the authors’ argument that this improves Gibbs sampling, as a more acute mode has, on the contrary, the dangerous feature of preventing visits to other modes. Hence the relevance of resorting to some form of annealing.
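On a toy mixture of my own (rather than the HMM of the paper), the replication device is easy to reproduce: carrying m independent copies of the latent allocations within the Gibbs sampler amounts to raising the likelihood of the means to the power m, hence concentrating the draws around a mode as m grows.

```r
# rough sketch (my own toy mixture, flat prior on the means) of the replication
# device: m independent copies of the latent allocations turn the Gibbs target
# for the means into (likelihood)^m, which concentrates at a mode as m grows
set.seed(6)
x <- c(rnorm(50, -1), rnorm(50, 2))            # equal-weight mixture, sd = 1
same_gibbs <- function(x, m, iter = 2000) {
  mu <- c(-2, 3); n <- length(x)
  draws <- matrix(NA, iter, 2)
  for (t in 1:iter) {
    S <- c(0, 0); N <- c(0, 0)
    for (r in 1:m) {                           # one allocation vector per replica
      p1 <- dnorm(x, mu[1]) / (dnorm(x, mu[1]) + dnorm(x, mu[2]))
      z  <- 1 + (runif(n) > p1)
      S  <- S + c(sum(x[z == 1]), sum(x[z == 2]))
      N  <- N + tabulate(z, 2)
    }
    mu <- rnorm(2, S / pmax(N, 1), 1 / sqrt(pmax(N, 1)))  # conjugate step, flat prior
    draws[t, ] <- mu
  }
  draws[-(1:500), ]
}
apply(same_gibbs(x, m = 1),  2, sd)            # spread of the mean draws, one copy
apply(same_gibbs(x, m = 20), 2, sd)            # much tighter around the same mode
```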

As already mused upon in earlier posts, I find it most amazing that this technique has been re-discovered so many times, both in statistics and in adjacent fields. The idea of powering the likelihood with independent copies of the latent variables is obviously natural (since a version pops up every other year, always under a different name), but earlier versions should eventually saturate the market!