Archive for Rao-Blackwellisation

Rao-Blackwellisation, a review in the making

Posted in Statistics with tags , , , , , , , , , , on March 17, 2020 by xi'an

Recently, I have been contacted by a mainstream statistics journal to write a review of Rao-Blackwellisation techniques in computational statistics, in connection with an issue celebrating C.R. Rao’s 100th birthday. As many many techniques can be interpreted as weak forms of Rao-Blackwellisation, as e.g. all auxiliary variable approaches, I am clearly facing an abundance of riches and would thus welcome suggestions from Og’s readers on the major advances in Monte Carlo methods that can be connected with the Rao-Blackwell-Kolmogorov theorem. (On the personal and anecdotal side, I only met C.R. Rao once, in 1988, when he came for a seminar at Purdue University where I was spending the year.)

mining gold [ABC in PNAS]

Posted in Books, Statistics with tags , , , , , , , , , , , on March 13, 2020 by xi'an

Johann Brehmer and co-authors have just published a paper in PNAS entitled “Mining gold from implicit models to improve likelihood-free inference”. (Besides the pun about mining gold, the paper also involves techniques named RASCAL and SCANDAL, respectively! For Ratio And SCore Approximate Likelihood ratio and SCore-Augmented Neural Density Approximates Likelihood.) This setup is not ABC per se in that their simulator is used both to generate training data and construct a tractable surrogate model. Exploiting Geyer’s (1994) classification trick of expressing the likelihood ratio as the optimal classification ratio when facing two equal-size samples from one density and the other.

“For all these inference strategies, the augmented data is particularly powerful for enhancing the power of simulation-based inference for small changes in the parameter θ.”

Brehmer et al. argue that “the most important novel contribution that differentiates our work from the existing methods is the observation that additional information can be extracted from the simulator, and the development of loss functions that allow us to use this “augmented” data to more efficiently learn surrogates for the likelihood function.” Rather than starting from a statistical model, they also seem to use a scientific simulator made of multiple layers of latent variables z, where

x=F⁰(u⁰,z¹,θ), z¹=G¹(u¹,z²), z²=G¹(u²,z³), …

although they also call the marginal of x, p(x|θ), an (intractable) likelihood.

“The integral of the log is not the log of the integral!”

The central notion behind the improvement is a form of Rao-Blackwellisation, exploiting the simulated z‘s. Joint score functions and joint likelihood ratios are then available. Ignoring biases, the authors demonstrate that the closest approximation to the joint likelihood ratio and the joint score function that only depends on x is the actual likelihood ratio and the actual score function, respectively. Which sounds like an older EM result, except that the roles of estimate and target quantity are somehow inverted: one is approximating the marginal with the joint, while the marginal is the “best” approximation of the joint. But in the implementation of the method, an estimate of the (observed and intractable) likelihood ratio is indeed produced towards minimising an empirical loss based on two simulated samples. Learning this estimate ê(x) then allows one to use it for the actual data. It however requires fitting a new ê(x) for each pair of parameters. Providing as well an estimator of the likelihood p(x|θ). (Hence the SCANDAL!!!) A second type of approximation of the likelihood starts from the approximate value of the likelihood p(x|θ⁰) at a fixed value θ⁰ and expands it locally as an exponential family shift, with the score t(x|θ⁰) as sufficient statistic.

I find the paper definitely interesting even though it requires the representation of the (true) likelihood as a marginalisation over multiple layers of latent variables z. And does not provide an evaluation of the error involved in the process when the model is misspecified. As a minor supplementary appeal of the paper, the use of an asymmetric Galton quincunx to illustrate an intractable array of latent variables will certainly induce me to exploit it in projects and courses!

[Disclaimer: I was not involved in the PNAS editorial process at any point!]

revisiting the balance heuristic

Posted in Statistics with tags , , , , , , , on October 24, 2019 by xi'an

Last August, Felipe Medina-Aguayo (a former student at Warwick) and Richard Everitt (who has now joined Warwick) arXived a paper on multiple importance sampling (for normalising constants) that goes “exploring some improvements and variations of the balance heuristic via a novel extended-space representation of the estimator, leading to straightforward annealing schemes for variance reduction purposes”, with the interesting side remark that Rao-Blackwellisation may prove sub-optimal when there are many terms in the proposal family, in the sense that not every term in the mixture gets sampled. As already noticed by Victor Elvira and co-authors, getting rid of the components that are not used being an improvement without inducing a bias. The paper also notices that the loss due to using sample sizes rather than expected sample sizes is of second order, compared with the variance of the compared estimators. It further relates to a completion or auxiliary perspective that reminds me of the approaches we adopted in the population Monte Carlo papers and in the vanilla Rao-Blackwellisation paper. But it somewhat diverges from this literature when entering a simulated annealing perspective, in that the importance distributions it considers are freely chosen as powers of a generic target. It is quite surprising that, despite the normalising weights being unknown, a simulated annealing approach produces an unbiased estimator of the initial normalising constant. While another surprise therein is that the extended target associated to their balance heuristic does not admit the right density as marginal but preserves the same normalising constant… (This paper will be presented at BayesComp 2020.)

revised empirical HMC

Posted in Statistics, University life with tags , , , , , , , , on March 12, 2019 by xi'an

Following the informed and helpful comments from Matt Graham and Bob Carpenter on our eHMC paper [arXival] last month, we produced a revised and re-arXived version of the paper based on new experiments ran by Changye Wu and Julien Stoehr. Here are some quick replies to these comments, reproduced for convenience. (Warning: this is a loooong post, much longer than usual.) Continue reading

short course on MCMC at CiRM [slides]

Posted in Statistics with tags , , , , , , , , , , , , , , , on October 23, 2018 by xi'an

Here are the [recycled] slides for the introductory lecture I gave this morning at CIRM, with the side information that it appears Slideshare has gone to another of these stages when slides cannot be played on this blog [when using Firefox]…

IMS workshop [day 3]

Posted in pictures, R, Statistics, Travel, University life with tags , , , , , , , , , , , , , , , , , , , on August 30, 2018 by xi'an

I made the “capital” mistake of walking across the entire NUS campus this morning, which is quite green and pretty, but which almost enjoys an additional dimension brought by such an intense humidity that one feels having to get around this humidity!, a feature I have managed to completely erase from my memory of my previous visit there. Anyway, nothing of any relevance. oNE talk in the morning was by Markus Eisenbach on tools used by physicists to speed up Monte Carlo methods, like the Wang-Landau flat histogram, towards computing the partition function, or the distribution of the energy levels, definitely addressing issues close to my interest, but somewhat beyond my reach for using a different language and stress, as often in physics. (I mean, as often in physics talks I attend.) An idea that came out clear to me was to bypass a (flat) histogram target and aim directly at a constant slope cdf for the energy levels. (But got scared away by the Fourier transforms!)

Lawrence Murray then discussed some features of the Birch probabilistic programming language he is currently developing, especially a fairly fascinating concept of delayed sampling, which connects with locally-optimal proposals and Rao Blackwellisation. Which I plan to get back to later [and hopefully sooner than later!].

In the afternoon, Maria de Iorio gave a talk about the construction of nonparametric priors that create dependence between a sequence of functions, a notion I had not thought of before, with an array of possibilities when using the stick breaking construction of Dirichlet processes.

And Christophe Andrieu gave a very smooth and helpful entry to partly deterministic Markov processes (PDMP) in preparation for talks he is giving next week for the continuation of the workshop at IMS. Starting with the guided random walk of Gustafson (1998), which extended a bit later into the non-reversible paper of Diaconis, Holmes, and Neal (2000). Although I had a vague idea of the contents of these papers, the role of the velocity ν became much clearer. And premonitory of the advances made by the more recent PDMP proposals. There is obviously a continuation with the equally pedagogical talk Christophe gave at MCqMC in Rennes two months [and half the globe] ago,  but the focus being somewhat different, it really felt like a new talk [my short term memory may also play some role in this feeling!, as I now remember the discussion of Hilderbrand (2002) for non-reversible processes]. An introduction to the topic I would recommend to anyone interested in this new branch of Monte Carlo simulation! To be followed by the most recently arXived hypocoercivity paper by Christophe and co-authors.

Metropolis-Hastings importance sampling

Posted in Books, Statistics, University life with tags , , , , , , , , , on June 6, 2018 by xi'an

[Warning: As I first got the paper from the authors and sent them my comments, this paper read contains their reply as well.]

In a sort of crazy coincidence, Daniel Rudolf and Björn Sprungk arXived a paper on a Metropolis-Hastings importance sampling estimator that offers similarities with  the one by Ingmar Schuster and Ilja Klebanov posted on arXiv the same day. The major difference in the construction of the importance sampler is that Rudolf and Sprungk use the conditional distribution of the proposal in the denominator of their importance weight, while Schuster and Klebanov go for the marginal (or a Rao-Blackwell representation of the marginal), mostly in an independent Metropolis-Hastings setting (for convergence) and for a discretised Langevin version in the applications. The former use a very functional L² approach to convergence (which reminded me of the early Schervish and Carlin, 1990, paper on the convergence of MCMC algorithms), not all of it necessary in my opinion. As for instance the extension of convergence properties to the augmented chain, namely (current, proposed), is rather straightforward since the proposed chain is a random transform of the current chain. An interesting remark at the end of the proof of the CLT is that the asymptotic variance of the importance sampling estimator is the same as with iid realisations from the target. This is a point we also noticed when constructing population Monte Carlo techniques (more than ten years ago), namely that dependence on the past in sequential Monte Carlo does not impact the validation and the moments of the resulting estimators, simply because “everything cancels” in importance ratios. The mean square error bound on the Monte Carlo error (Theorem 20) is not very surprising as the term ρ(y)²/P(x,y) appears naturally in the variance of importance samplers.

The first illustration where the importance sampler does worse than the initial MCMC estimator for a wide range of acceptance probabilities (Figures 2 and 3, which is which?) and I do not understand the opposite conclusion from the authors.

[Here is an answer from Daniel and Björn about this point:]

Indeed the formulation in our paper is unfortunate. The point we want to stress is that we observed in the numerical experiments certain ranges of step-sizes for which MH importance sampling shows a better performance than the classical MH algorithm with optimal scaling. Meaning that the MH importance sampling with optimal step-size can outperform MH sampling, without using additional computational resources. Surprisingly, the optimal step-size for the MH importance sampling estimator seems to remain constant for an increasing dimension in contrast to the well-known optimal scaling of the MH algorithm (given by a constant optimal acceptance rate).

The second uses the Pima Indian diabetes benchmark, amusingly (?) referring to Chopin and Ridgway (2017) who warn against the recourse to this dataset and to this model! The loss in mean square error due to the importance sampling may again be massive (Figure 5) and setting for an optimisation of the scaling factor in Metropolis-Hastings algorithms sounds unrealistic.

[And another answer from Daniel and Björn about this point:]

Indeed, Chopin and Ridgway suggest more complex problems with a larger number of covariates as benchmarks. However, the well-studied PIMA data set is a sufficient example in order to illustrate the possible benefits but also the limitations of the MH importance sampling approach. The latter are clearly (a) the required knowledge about the optimal step-size—otherwise the performance can indeed be dramatically worse than for the MH algorithm—and (b) the restriction to a small or at most moderate number of covariates. As you are indicating, optimizing the scaling factor is a challenging task. However, the hope is to derive some simple rule of thumb for the MH importance sampler similar to the well-known acceptance rate tuning for the standard MCMC estimator.