Archive for latent variable models

mining gold [ABC in PNAS]

Posted in Books, Statistics on March 13, 2020 by xi'an

Johann Brehmer and co-authors have just published a paper in PNAS entitled “Mining gold from implicit models to improve likelihood-free inference”. (Besides the pun about mining gold, the paper also involves techniques named RASCAL and SCANDAL, standing respectively for Ratio And SCore Approximate Likelihood ratio and SCore-Augmented Neural Density Approximates Likelihood!) This setup is not ABC per se, in that their simulator is used both to generate training data and to construct a tractable surrogate model, exploiting Geyer’s (1994) classification trick of expressing the likelihood ratio as the optimal classification ratio when facing two equal-size samples, one from each density.

“For all these inference strategies, the augmented data is particularly powerful for enhancing the power of simulation-based inference for small changes in the parameter θ.”

Brehmer et al. argue that “the most important novel contribution that differentiates our work from the existing methods is the observation that additional information can be extracted from the simulator, and the development of loss functions that allow us to use this “augmented” data to more efficiently learn surrogates for the likelihood function.” Rather than starting from a statistical model, they also seem to use a scientific simulator made of multiple layers of latent variables z, where

x=F⁰(u⁰,z¹,θ), z¹=G¹(u¹,z²), z²=G²(u²,z³), …

although they also call the marginal of x, p(x|θ), an (intractable) likelihood.
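
To fix ideas, here is a bare-bones sketch of such a layered simulator, a Gaussian toy of my own making rather than anything from the paper: the joint density of (x, z¹, z²) factorises layer by layer and is therefore tractable, as is the joint score, even though in a realistic simulator the marginal p(x|θ) would not be. (In this toy the marginal happens to be available in closed form, which is convenient for checking approximations below.)

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def simulate(theta, rng):
    """Layered toy simulator: z2 -> z1 -> x, with theta entering the last layer."""
    z2 = rng.normal(0.0, 1.0)          # z2 = G2(u2)
    z1 = rng.normal(z2, 1.0)           # z1 = G1(u1, z2)
    x = rng.normal(theta * z1, 1.0)    # x  = F0(u0, z1, theta)
    return x, z1, z2

def joint_logpdf(x, z1, z2, theta):
    """Tractable joint log-density, one term per layer."""
    return (stats.norm.logpdf(z2, 0.0, 1.0)
            + stats.norm.logpdf(z1, z2, 1.0)
            + stats.norm.logpdf(x, theta * z1, 1.0))

def joint_score(x, z1, z2, theta):
    """Joint score d/dtheta log p(x, z1, z2 | theta), in closed form here."""
    return (x - theta * z1) * z1
```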

“The integral of the log is not the log of the integral!”

The central notion behind the improvement is a form of Rao-Blackwellisation, exploiting the simulated z's. Joint score functions and joint likelihood ratios are then available. Ignoring biases, the authors demonstrate that the closest approximations to the joint likelihood ratio and the joint score function that only depend on x are the actual likelihood ratio and the actual score function, respectively. Which sounds like an older EM result, except that the roles of estimate and target quantity are somehow inverted: one is approximating the marginal with the joint, while the marginal is the “best” approximation of the joint. But in the implementation of the method, an estimate of the (observed and intractable) likelihood ratio is indeed produced by minimising an empirical loss based on two simulated samples. Learning this estimate ê(x) then allows one to use it for the actual data. It however requires fitting a new ê(x) for each pair of parameters. Providing as well an estimator of the likelihood p(x|θ). (Hence the SCANDAL!!!) A second type of approximation of the likelihood starts from the approximate value of the likelihood p(x|θ⁰) at a fixed value θ⁰ and expands it locally as an exponential family shift, with the score t(x|θ⁰) as sufficient statistic.
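
As a crude numerical check of this projection argument (again a sketch of mine on the Gaussian toy above, not the RASCAL or SCANDAL machinery, which relies on neural networks), the conditional mean of the joint likelihood ratio given x, approximated below by binning x, should recover the marginal likelihood ratio:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
theta0, theta1 = 0.8, 1.2
N = 200_000

# layered toy simulator (same structure as above), vectorised, run at theta0
z2 = rng.normal(0.0, 1.0, N)
z1 = rng.normal(z2, 1.0)
x = rng.normal(theta0 * z1, 1.0)

def joint_logpdf(theta):
    return (stats.norm.logpdf(z2, 0.0, 1.0)
            + stats.norm.logpdf(z1, z2, 1.0)
            + stats.norm.logpdf(x, theta * z1, 1.0))

# joint likelihood ratio, one value per simulated (x, z) pair
r_joint = np.exp(joint_logpdf(theta1) - joint_logpdf(theta0))

# best x-only approximation = E[r_joint | x], here crudely estimated by binning;
# it should match the exact marginal ratio, since x | theta ~ N(0, 2 theta^2 + 1)
bins = np.linspace(-4.0, 4.0, 17)
which = np.digitize(x, bins)
for b in range(1, len(bins)):
    inside = which == b
    if inside.sum() < 5000:
        continue
    x_mid = x[inside].mean()
    r_true = np.exp(stats.norm.logpdf(x_mid, 0, np.sqrt(2 * theta1**2 + 1))
                    - stats.norm.logpdf(x_mid, 0, np.sqrt(2 * theta0**2 + 1)))
    print(f"x ~ {x_mid:+.2f}   E[joint ratio | x] = {r_joint[inside].mean():.3f}"
          f"   marginal ratio = {r_true:.3f}")
```

On this toy the binned conditional means should agree with the exact marginal ratio up to Monte Carlo and binning error, which is precisely the projection property exploited by the loss functions of the paper.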

I find the paper definitely interesting even though it requires the representation of the (true) likelihood as a marginalisation over multiple layers of latent variables z. And it does not provide an evaluation of the error involved in the process when the model is misspecified. As a minor supplementary appeal of the paper, the use of an asymmetric Galton quincunx to illustrate an intractable array of latent variables will certainly induce me to exploit it in projects and courses!

[Disclaimer: I was not involved in the PNAS editorial process at any point!]

unbiased product of expectations

Posted in Books, Statistics, University life on August 5, 2019 by xi'an

While I was not involved in any way, or even aware of this research, Anthony Lee, Simone Tiberi, and Giacomo Zanella have an incoming paper in Biometrika, which was partly written while all three authors were at the University of Warwick. The purpose is to design an efficient manner to approximate the product of n unidimensional expectations (or integrals) all computed against the same reference density. Which is not a real constraint. A neat remark that motivates the method in the paper is that an improved estimator can be connected with the permanent of the n x N matrix A made of the values of the n functions computed at N different simulations from the reference density. And involves N!/(N-n)! terms rather than N to the power n. Since the permanent is NP-hard to compute, a manageable alternative uses random draws from constrained permutations that are reasonably easy to simulate. Especially since, given that the estimator recycles most of the particles, it requires a much smaller value of N. Essentially N=O(n) in this scenario, instead of O(n²) with the basic Monte Carlo solution, towards a similar variance.
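
To make the recycling idea concrete, here is a bare-bones sketch of mine (not the authors' estimator, which is considerably more refined) contrasting the basic Monte Carlo product of independent averages with an unbiased shared-pool estimator that averages the product over random injections, i.e., constrained permutations assigning distinct draws to distinct factors:

```python
import numpy as np

rng = np.random.default_rng(2)

# target: prod_i E[f_i(X)] for X ~ N(0,1), with f_i(x) = exp(a_i x)
a = np.linspace(0.1, 0.5, 5)
n = len(a)
truth = np.prod(np.exp(a**2 / 2))                # E[exp(aX)] = exp(a^2/2)

# (1) basic Monte Carlo: an independent pool of N draws for each of the n factors
N_basic = 1000
pools = rng.normal(size=(n, N_basic))
basic = np.prod(np.exp(a[:, None] * pools).mean(axis=1))

# (2) a single shared pool of N draws, recycled across factors via random
#     injections: each product term uses distinct draws for distinct factors,
#     so it is unbiased (reusing the same draw across factors would not be)
N_shared, K = 200, 500
x = rng.normal(size=N_shared)
F = np.exp(np.outer(a, x))                       # n x N matrix of f_i(x_j) values
terms = np.empty(K)
for k in range(K):
    sigma = rng.choice(N_shared, size=n, replace=False)   # random injection
    terms[k] = np.prod(F[np.arange(n), sigma])
shared = terms.mean()

print(f"truth  {truth:.4f}")
print(f"basic  {basic:.4f}  ({n * N_basic} draws)")
print(f"shared {shared:.4f}  ({N_shared} draws, recycled)")
```

Averaging over all N!/(N-n)! injections would return the permanent-based estimator mentioned above; the paper obviously goes much further than this uniform-injection shortcut.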

This framework offers many applications in latent variable models, including pseudo-marginal MCMC, of course, but also ABC, since the ABC posterior based on getting each simulated observation close enough to the corresponding actual observation fits this pattern (although the dependence on the chosen ordering of the data is an issue that can make the example somewhat artificial).

interdependent Gibbs samplers

Posted in Books, Statistics, University life on April 27, 2018 by xi'an

Mark Kozdoba and Shie Mannor just arXived a paper on an approach to accelerate a Gibbs sampler, entitled “interdependent Gibbs samplers”. In fact, it presents rather strong similarities with our SAME algorithm. More of the same, as Adam Johansen (Warwick) entitled one of his papers! The paper indeed suggests multiplying replicas of latent variables (e.g., a hidden path for an HMM) in an artificial model. And as in our 2002 paper, with Arnaud Doucet and Simon Godsill, the focus here is on maximum likelihood estimation (of the genuine parameters, not of the latent variables). Along with the argument that the resulting pseudo-posterior is akin to a posterior with a powered likelihood. And a link with the EM algorithm. And an HMM application.
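
For concreteness, here is a minimal sketch of the replication device on a toy two-component Gaussian mixture rather than an HMM (my own illustration, not the authors' code): the Gibbs sampler updates m independent copies of the latent allocations, then updates the parameters conditional on all copies at once, which marginally amounts to raising the likelihood to the power m.

```python
import numpy as np

rng = np.random.default_rng(3)

# toy data from a two-component Gaussian mixture (equal weights, unit variances)
y = np.concatenate([rng.normal(-2.0, 1.0, 150), rng.normal(2.0, 1.0, 150)])
n, m, T = len(y), 10, 2000             # m = number of replicated latent copies

mu = np.array([-1.0, 1.0])             # unknown component means, N(0, 10^2) priors
prior_prec = 1.0 / 10.0**2
z = rng.integers(0, 2, size=(m, n))    # m independent copies of the allocations

for t in range(T):
    # 1. update each latent copy independently, given the current means
    logp = -0.5 * (y[None, :] - mu[:, None])**2        # 2 x n log-likelihood terms
    p1 = 1.0 / (1.0 + np.exp(logp[0] - logp[1]))       # P(z_i = 1 | y_i, mu)
    for c in range(m):
        z[c] = rng.random(n) < p1
    # 2. update the means conditional on *all* m copies: allocation counts are
    #    summed over the copies, i.e. the likelihood is effectively powered by m
    for k in range(2):
        cnt = (z == k).sum()
        s = (y[None, :] * (z == k)).sum()
        prec = prior_prec + cnt
        mu[k] = rng.normal(s / prec, 1.0 / np.sqrt(prec))

print("final means (concentrating near the MLE as m grows):", np.round(np.sort(mu), 3))
```

Replacing the fixed m with a sequence increasing along the iterations would recover the annealed SAME version discussed below.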

“The generative model consist of simply sampling the parameters ,  and then sampling m independent copies of the paths”

If anything, this proposal is less appealing than SAME because it aims directly at the powered likelihood, rather than utilising an annealed sequence of powers that allows for a primary exploration of the whole parameter space before entering the trapping vicinity of a mode. Which makes me fail to catch the argument from the authors that this improves Gibbs sampling, as a more acute mode has, on the opposite, the dangerous feature of preventing visits to other modes. Hence the relevance of resorting to some form of annealing.

As already mused upon in earlier posts, I find it most amazing that this technique has been re-discovered so many times, both in statistics and in adjacent fields. The idea of powering the likelihood with independent copies of the latent variables is obviously natural (since a version pops up every other year, always under a different name), but earlier versions should eventually saturate the market!

running ABC when the likelihood is available

Posted in Statistics on September 19, 2017 by xi'an

Today I refereed a paper where the authors used ABC to bypass convergence (and implementation) difficulties with their MCMC algorithm. And I am still pondering whether or not this strategy makes sense. If only because ABC needs to handle the same complexity and the same number of parameters as an MCMC algorithm. While shooting “in the dark” by using the prior or a coarse substitute for the posterior. And I wonder at the relevance of simulating new data when the [true] likelihood value [at the observed data] can be computed. This would sound to me like the relevant and unique “statistic” worth considering…

rare events for ABC

Posted in Books, Mountains, pictures, Statistics, Travel, University life on November 24, 2016 by xi'an

Dennis Prangle, Richard G. Everitt and Theodore Kypraios just arXived a new paper on ABC, aiming at handling high dimensional data with latent variables, thanks to a cascading (or nested) approximation of the probability of a near coincidence between the observed data and the ABC simulated data. The approach amalgamates a rare event simulation method based on SMC, pseudo-marginal Metropolis-Hastings and of course ABC. The rare event is the near coincidence of the observed summary and of a simulated summary. This is so rare that regular ABC is forced to accept not so near coincidences. Especially as the dimension increases. I mentioned nested above on purpose because I find that the rare event simulation method of Cérou et al. (2012) has a nested sampling flavour, in that each move of the particle system (in the sample space) is done according to a constrained MCMC move, the constraint being derived from the distance between observed and simulated samples. Finding an efficient move of that kind may prove difficult or impossible. The authors opt for a slice sampler, proposed by Murray and Graham (2016); however, they assume that the distribution of the latent variables is uniform over a unit hypercube, an assumption I do not fully understand. For the pseudo-marginal aspect, note that while the approach produces a better and faster evaluation of the likelihood, it remains an ABC likelihood and not the original likelihood. Because the estimate of the ABC likelihood is monotonic in the number of terms, a proposal can be terminated earlier without inducing a bias in the method.
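
To give a flavour of the rare-event component (a rough sketch of mine on a toy Gaussian model, with a plain constrained random-walk move on the unit hypercube standing in for the Murray and Graham slice sampler, and ignoring the unbiasedness and early-termination refinements of the paper), the ABC likelihood P(distance < ε | θ) can be estimated by multilevel splitting over a sequence of decreasing thresholds, such nonnegative estimates being then plugged into a pseudo-marginal Metropolis-Hastings move on θ:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# toy model: y_i = theta + eps_i, i = 1..d, summary = sample mean, tolerance eps
d, eps = 20, 0.01
y_obs = rng.normal(1.0, 1.0, d)          # data generated under theta = 1
s_obs = y_obs.mean()

def dist(u, theta):
    """ABC distance, with the latent noise written as uniforms on [0,1]^d."""
    y = theta + stats.norm.ppf(u)
    return abs(y.mean() - s_obs)

def abc_lik_hat(theta, M=200, rho=0.5, n_mcmc=5, step=0.1, max_levels=50):
    """Multilevel-splitting estimate of the ABC likelihood P(dist < eps | theta):
    keep the rho-fraction of closest particles, resample them, and rejuvenate
    each with a constrained random-walk Metropolis step wrapped inside [0,1]^d."""
    u = rng.random((M, d))
    p_hat = 1.0
    for _ in range(max_levels):
        ds = np.array([dist(ui, theta) for ui in u])
        level = max(np.quantile(ds, rho), eps)
        alive = ds <= level
        p_hat *= alive.mean()
        if p_hat == 0.0 or level <= eps:
            break
        u = u[rng.choice(np.flatnonzero(alive), size=M)]     # resample survivors
        for _ in range(n_mcmc):
            prop = (u + step * rng.normal(size=u.shape)) % 1.0
            ok = np.array([dist(pi, theta) <= level for pi in prop])
            u[ok] = prop[ok]
    return p_hat

# the estimate is orders of magnitude larger near the data-generating value
print(abc_lik_hat(1.0), abc_lik_hat(0.5))
```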

[Photo: Lake Louise, Banff National Park, March 21, 2012]

This is certainly an innovative approach of clear interest and I hope we will discuss it at length at our BIRS ABC 15w5025 workshop next February. At this stage of light reading, I am slightly overwhelmed by the combination of so many computational techniques towards a single algorithm. The authors argue there is very little calibration involved, but so many steps depend on as many configuration choices.