## control variates [seminar]

Posted in pictures, Statistics, Travel, University life with tags , , , , , , , , , , , , , , on November 5, 2021 by xi'an

Today, Petros Dellaportas (whom I have know since the early days of MCMC, when we met in CIRM) gave a seminar at the Warwick algorithm seminar on control variates for MCMC, reminding me of his 2012 JRSS paper. Based on the Poisson equation and using a second control variate to stabilise the Monte Carlo approximation do the first control variate. The difference with usual control variates is finding a first approximate G(x)-q(y|x)G(Y) to F-πF. And the first Poisson equation is using α(x,y)q(y|x) rather than π. Then the second expands log α(x,y)q(y|x) to achieve a manageable term.

Abstract: We provide a general methodology to construct control variates for any discrete time random walk Metropolis and Metropolis-adjusted Langevin algorithm Markov chains that can achieve, in a post-processing manner and with a negligible additional computational cost, impressive variance reduction when compared to the standard MCMC ergodic averages. Our proposed estimators are based on an approximate solution of the Poisson equation for a multivariate Gaussian target densities of any dimension.

I wonder if there were a neural network version that would first build G from scratch and later optimise it towards solving the Poisson equation. As in this recent arXival I haven’t read (yet).

## black box MCMC

Posted in Books, Statistics with tags , , , , , , , , on July 17, 2021 by xi'an

“…back-box methods, despite using no information of the proposal distribution, can actually give better estimation accuracy than the typical importance sampling [methods]…”

Earlier this week I was pointed out to Liu & Lee’s black box importance sampling, published in AISTATS 2017. (which I did not attend). Already found in Briol et al. (2015) and Oates, Girolami, and Chopin (2017), the method starts from Charles Stein‘s “unbiased estimator of the loss” (that was a fundamental tool in my own PhD thesis!), a variation on integration by part:

$\mathbb E_p[\nabla\log p(X) f(X)+\nabla f(X)]=0$

for differentiable functions f and p cancelling at the boundaries. It also holds for the kernelised extension

$\mathbb E_p[k_p(X,x')]=0$

for all x’, where the integrand is a 1-d function of an arbitrary kernel k(x,x’) and of the score function ∇log p. This null expectation happens to be a minimum since

$\mathbb E_{X,X'\sim q}[k_p(X,X')]\ge 0$

and hence importance weights can be obtained by minimising

$\sum_{ij} w_i w_j k_p(x_i,x_j)$

in w (from the unit simplex), for a sample of iid realisations from a possibly unknown distribution with density q. Liu & Lee show that this approximation converges faster than the standard Monte Carlo speed √n, when using Hilbertian properties of the kernel through control variates. Actually, the same thing happens when using a (leave-one-out) non-parametric kernel estimate of q rather than q. At least in theory.

“…simulating n parallel MCMC chains for m steps, where the length m of the chains can be smaller than what is typically used in MCMC, because it just needs to be large enough to bring the distribution `roughly’ close to the target distribution”

A practical application of the concept is suggested in the above quote. As a corrected weight for interrupted MCMC. Or when using an unadjusted Langevin algorithm. Provided the minimisation of the objective quadratic form is fast enough, the method can thus be used as a benchmark for regular MCMC implementation.

## scalable Langevin exact algorithm [Read Paper]

Posted in Books, pictures, Statistics, University life with tags , , , , , , , , , , , , , , , , on June 23, 2020 by xi'an

Murray Pollock, Paul Fearnhead, Adam M. Johansen and Gareth O. Roberts (CoI: all with whom I have strong professional and personal connections!) have a Read Paper discussion happening tomorrow [under relaxed lockdown conditions in the UK, except for the absurd quatorzine on all travelers|, but still in a virtual format] that we discussed together [from our respective homes] at Paris Dauphine. And which I already discussed on this blog when it first came out.

Here are quotes I spotted during this virtual Dauphine discussion but we did not come up with enough material to build a significant discussion, although wondering at the potential for solving the O(n) bottleneck, handling doubly intractable cases like the Ising model. And noticing the nice features of the log target being estimable by unbiased estimators. And of using control variates, for once well-justified in a non-trivial environment.

“However, in practice this simple idea is unlikely to work. We can see this most clearly with the rejection sampler, as the probability of survival will decrease exponentially with t—and thus the rejection probability will often be prohibitively large.”

“This can be viewed as a rejection sampler to simulate from μ(x,t), the distribution of the Brownian motion at time  t conditional on its surviving to time t. Any realization that has been killed is ‘rejected’ and a realization that is not killed is a draw from μ(x,t). It is easy to construct an importance sampling version of this rejection sampler.”

## more and more control variates

Posted in Statistics with tags , , , , , , , on October 5, 2018 by xi'an

A few months ago, François Portier and Johan Segers arXived a paper on a question that has always puzzled me, namely how to add control variates to a Monte Carlo estimator and when to stop if needed! The paper is called Monte Carlo integration with a growing number of control variates. It is related to the earlier Oates, Girolami and Chopin (2017) which I remember discussing with Chris when he was in Warwick. The puzzling issue of control variates is [for me] that, while the optimal weight always decreases the variance of the resulting estimate, in practical terms, implementing the method may increase the actual variance. Glynn and Szechtman at MCqMC 2000 identify six different ways of creating the estimate, depending on how the covariance matrix, denoted P(hh’), is estimated. With only one version integrating constant functions and control variates exactly. Which actually happens to be also a solution to a empirical likelihood maximisation under the empirical constraints imposed by the control variates. Another interesting feature is that, when the number m of control variates grows with the number n of simulations the asymptotic variance goes to zero, meaning that the control variate estimator converges at a faster speed.

Creating an infinite sequence of control variates sounds unachievable in a realistic situation. Legendre polynomials are used in the paper, but is there a generic and cheap way to getting these. And … control variate selection, anyone?!

## control functionals for Monte Carlo integration

Posted in Books, Statistics, University life with tags , , , , , , , , , , , , , on June 28, 2016 by xi'an

A paper on control variates by Chris Oates, Mark Girolami (Warwick) and Nicolas Chopin (CREST) appeared in a recent issue of Series B. I had read and discussed the paper with them previously and the following is a set of comments I wrote at some stage, to be taken with enough gains of salt since Chris, Mark and Nicolas answered them either orally or in the paper. Note also that I already discussed an earlier version, with comments that are not necessarily coherent with the following ones! [Thanks to the busy softshop this week, I resorted to publish some older drafts, so mileage can vary in the coming days.]

First, it took me quite a while to get over the paper, mostly because I have never worked with reproducible kernel Hilbert spaces (RKHS) before. I looked at some proofs in the appendix and at the whole paper but could not spot anything amiss. It is obviously a major step to uncover a manageable method with a rate that is lower than √n. When I set my PhD student Anne Philippe on the approach via Riemann sums, we were quickly hindered by the dimension issue and could not find a way out. In the first versions of the nested sampling approach, John Skilling had also thought he could get higher convergence rates before realising the Monte Carlo error had not disappeared and hence was keeping the rate at the same √n speed.

The core proof in the paper leading to the 7/12 convergence rate relies on a mathematical result of Sun and Wu (2009) that a certain rate of regularisation of the function of interest leads to an average variance of order 1/6. I have no reason to mistrust the result (and anyway did not check the original paper), but I am still puzzled by the fact that it almost immediately leads to the control variate estimator having a smaller order variance (or at least variability). On average or in probability. (I am also uncertain on the possibility to interpret the boxplot figures as establishing super-√n speed.)

Another thing I cannot truly grasp is how the control functional estimator of (7) can be both a mere linear recombination of individual unbiased estimators of the target expectation and an improvement in the variance rate. I acknowledge that the coefficients of the matrices are functions of the sample simulated from the target density but still…

Another source of inner puzzlement is the choice of the kernel in the paper, which seems too simple to be able to cover all problems despite being used in every illustration there. I see the kernel as centred at zero, which means a central location must be know, decreasing to zero away from this centre, so possibly missing aspects of the integrand that are too far away, and isotonic in the reference norm, which also seems to preclude some settings where the integrand is not that compatible with the geometry.

I am equally nonplussed by the existence of a deterministic bound on the error, although it is not completely deterministic, depending on the values of the reproducible kernel at the points of the sample. Does it imply anything restrictive on the function to be integrated?

A side remark about the use of intractable in the paper is that, given the development of a whole new branch of computational statistics handling likelihoods that cannot be computed at all, intractable should possibly be reserved for such higher complexity models.