Archive for control variates

control functionals for Monte Carlo integration

Posted in Books, Statistics, University life on June 28, 2016 by xi'an

A paper on control variates by Chris Oates, Mark Girolami (Warwick) and Nicolas Chopin (CREST) appeared in a recent issue of Series B. I had read and discussed the paper with them previously and the following is a set of comments I wrote at some stage, to be taken with enough grains of salt since Chris, Mark and Nicolas answered them either orally or in the paper. Note also that I already discussed an earlier version, with comments that are not necessarily consistent with the following ones! [Thanks to the busy workshop this week, I resorted to publishing some older drafts, so mileage may vary in the coming days.]

First, it took me quite a while to get through the paper, mostly because I had never worked with reproducing kernel Hilbert spaces (RKHS) before. I looked at some proofs in the appendix and at the whole paper but could not spot anything amiss. It is obviously a major step to uncover a manageable method with a convergence rate faster than the standard √n Monte Carlo rate. When I set my PhD student Anne Philippe on the approach via Riemann sums, we were quickly hindered by the dimension issue and could not find a way out. In the first versions of the nested sampling approach, John Skilling had also thought he could get higher convergence rates, before realising that the Monte Carlo error had not disappeared and hence kept the rate at the same √n speed.

The core proof in the paper leading to the 7/12 convergence rate relies on a mathematical result of Sun and Wu (2009), namely that a certain rate of regularisation of the function of interest leads to an extra variance factor of order n^{-1/6}. I have no reason to mistrust the result (and anyway did not check the original paper), but I am still puzzled by the fact that it almost immediately leads to the control variate estimator having a smaller-order variance (or at least variability), on average or in probability. (I am also uncertain about the possibility of interpreting the boxplot figures as establishing a super-√n speed.)
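For the record, and writing \hat\mu_n for the control functional estimator, here is the arithmetic behind the 7/12 rate under my reading that the n^{-1/6} term multiplies the usual Monte Carlo n^{-1} variance:

\text{var}(\hat\mu_n) = O\big(n^{-1}\times n^{-1/6}\big) = O\big(n^{-7/6}\big) \quad\Longrightarrow\quad \sqrt{\text{var}(\hat\mu_n)} = O\big(n^{-7/12}\big)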

Another thing I cannot truly grasp is how the control functional estimator of (7) can be both a mere linear recombination of individual unbiased estimators of the target expectation and an improvement in the variance rate. I acknowledge that the coefficients of the matrices are functions of the sample simulated from the target density but still…

Another source of inner puzzlement is the choice of the kernel in the paper, which seems too simple to cover all problems despite being used in every illustration there. I see the kernel as centred at zero, which means a central location must be known; as decreasing to zero away from this centre, so possibly missing aspects of the integrand that are too far away; and as isotropic in the reference norm, which also seems to preclude settings where the integrand is not that compatible with this geometry.

I am equally nonplussed by the existence of a deterministic bound on the error, although it is not completely deterministic since it depends on the values of the reproducing kernel at the points of the sample. Does it imply anything restrictive on the function to be integrated?

A side remark about the use of “intractable” in the paper: given the development of a whole new branch of computational statistics handling likelihoods that cannot be computed at all, “intractable” should possibly be reserved for such higher-complexity models.

optimal mixture weights in multiple importance sampling

Posted in Statistics, University life on December 12, 2014 by xi'an

Multiple importance sampling is back!!! I am always interested in this improvement upon regular importance sampling, even or especially after publishing a recent paper about our AMIS (adaptive multiple importance sampling) algorithm, so I was quite eager to see what was in Hera He and Art Owen’s newly arXived paper. The paper is definitely exciting and started me on a new round of importance sampling improvements and experiments…

Some of the most interesting developments in the paper are that (i) when using a collection of importance functions qi with the same target p, every ratio qi/p is a control variate function with expectation 1 [assuming each of the qi's has a support contained in the support of p]; (ii) the weights of a mixture of the qi's can be chosen in an optimal way towards minimising the variance for a certain integrand; (iii) multiple importance sampling quite naturally incorporates stratified sampling, i.e. the qi's may have disjoint supports; (iv) control variates contribute little, especially when compared with the optimisation over the weights [which does not surprise me that much, given that the control variates have little correlation with the integrands]; (v) Veach's (1997) seminal PhD thesis remains a driving force behind those results [and in getting Eric Veach a technical Academy Award in 2014!].
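To make points (i), (ii) and (iv) concrete, here is a minimal R sketch of my own (a toy setup, not He and Owen's estimator nor their weight optimisation): the target p is a standard Normal, the proposals q1 and q2 and the mixture weights alpha are arbitrary choices, and the centred ratios qi/qα serve as control variates with zero expectation under the mixture, in the spirit of Owen and Zhou's regression estimator.

## toy mixture importance sampling with the q_i/q_alpha ratios as control variates
## (my own setup, not the estimator of He and Owen)
set.seed(1)
n     <- 1e4
alpha <- c(0.3, 0.7)                    # mixture weights (could be optimised)
f     <- function(x) exp(-abs(x))       # integrand of interest
p     <- function(x) dnorm(x)           # target density
q1    <- function(x) dnorm(x, -1, 1.5)  # first proposal
q2    <- function(x) dnorm(x,  2, 2.0)  # second proposal
qmix  <- function(x) alpha[1] * q1(x) + alpha[2] * q2(x)

## stratified-by-component sampling from the mixture (proportional allocation)
n1 <- round(alpha[1] * n); n2 <- n - n1
x  <- c(rnorm(n1, -1, 1.5), rnorm(n2, 2, 2.0))

w  <- p(x) / qmix(x)                    # importance weights against the mixture
y  <- w * f(x)                          # terms averaging to E_p[f(X)]

## control variates: each q_i/qmix has expectation 1 under the mixture
h1 <- q1(x) / qmix(x) - 1
h2 <- q2(x) / qmix(x) - 1

plain   <- mean(y)                            # plain mixture importance sampling estimate
with_cv <- unname(coef(lm(y ~ h1 + h2))[1])   # regression intercept = control variate estimate
c(plain = plain, with_cv = with_cv)

Optimising alpha for a given integrand, as in point (ii), would be the natural next step on top of this sketch.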

One extension that I would find of the utmost interest deals with unscaled densities, both for p and the qi's. In that case the weights do not even sum up to a known value and I wonder how much more difficult it is to analyse this realistic case. And unscaled densities led me to imagine using geometric mixtures instead. Or even harmonic mixtures! (Maybe not.)

Another one is more in tune with our adaptive multiple mixture paper. The paper works with regret, but one could also work with remorse! Pun aside, this means that one could adapt the weights along iterations and even possibly design new importance functions from the past outcomes, i.e., be adaptive once again. He and Owen suggest mixing their approach with our adaptive sequential Monte Carlo model.

Vanilla Rao-Blackwellisation [re]revised

Posted in R, Statistics on June 1, 2010 by xi'an

Although the revision is quite minor, it took us two months to complete from the time I received the news in the Atlanta airport lounge… The vanilla Rao-Blackwellisation paper with Randal Douc has thus been resubmitted to the Annals of Statistics. And rearXived. The only significant change is the inclusion of two tables detailing computing time, like the one below

\begin{array}{c|ccccc}
\tau & \text{median} & \text{mean} & q_{.8} & q_{.9} & \text{time}\\
\hline
0.25 & 0.0 & 8.85 & 4.9 & 13 & 4.2\\
0.50 & 0.0 & 6.76 & 4 & 11 & 2.25\\
1.00 & 0.25 & 6.15 & 4 & 10 & 2.5\\
2.00 & 0.20 & 5.90 & 3.5 & 8.5 & 4.5
\end{array}

which provides different evaluations of the additional computing effort due to the use of Rao–Blackwellisation: median and mean numbers of additional iterations, $80\%$ and $90\%$ quantiles for the additional iterations, and ratio of the average R computing times obtained over $10^5$ simulations. (Turning the above table into a formula acceptable by WordPress took me forever, as any additional white space between the terms of the matrix is misinterpreted!) Now, the mean and time columns do not look very supportive of the Rao-Blackwellisation technique, but this is due to the presence of a few outlying runs that required many iterations before hitting an acceptance probability of one. Excessive computing time can be curbed by using a pre-set number of iterations, as described in the paper…
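As a toy illustration of why a few outlying runs inflate these averages (my own sketch, not the scheme of the paper): if the additional effort of a run behaves like a geometric waiting time until acceptance, whose expectation 1/p is unbiasedly estimated by the count itself, then runs with small acceptance probability dominate the mean while leaving the median untouched, and a pre-set cap on the number of iterations curbs the mean at the price of a small bias.

## toy model (not the paper's scheme): additional iterations as a geometric
## waiting time until acceptance, i.e. an unbiased estimate of 1/p
set.seed(42)
nruns <- 1e5
p     <- runif(nruns, 0.01, 1)      # acceptance probabilities across runs
iters <- rgeom(nruns, p) + 1        # trials until first acceptance, mean 1/p

quantile(iters, c(0.5, 0.8, 0.9))   # median and upper quantiles stay moderate
mean(iters)                         # the mean is inflated by a few small-p runs

capped <- pmin(iters, 50)           # pre-set maximum number of iterations
c(mean_uncapped = mean(iters), mean_capped = mean(capped))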

The Bernoulli factory

Posted in R, Statistics on April 23, 2010 by xi'an

A few months ago, Latuszyński, Kosmidis, Papaspiliopoulos and Roberts arXived a paper I should have noticed earlier, as its topic is very much related to our paper with Randal Douc on the vanilla Rao-Blackwellisation scheme. It is motivated by the Bernoulli factory problem, which aims at (unbiasedly) estimating f(p) from an iid sequence of Bernoulli B(p) variables. (The paper only considers functions f valued in [0,1]. In our case, the function is f(p)=1/p.) It appears that this problem has been studied for quite a while, in particular by Yuval Peres. Being on a train to Marseille (thanks to Eyjafjallajökull!), I do not have access to those earlier papers of Peres’, but Latuszyński et al. mention that there are conditions on f such that it is sufficient to generate a Bernoulli event with probability

f_0(p) = \min (2p, 1-2\epsilon)

where \epsilon>0 is arbitrary. In particular, constructing an unbiased estimator of

\min (2p, 1)

does not seem to be achievable (Nacu and Peres, 2005). (The way it is rephrased in Latuszyński et al. does not seem correct, though, as they state that f(p)=2p cannot be estimated in an unbiased manner, missing the constraint that the estimator must belong to [0,1], I think.)
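To fix ideas on what a Bernoulli factory must deliver (a toy illustration of my own, not taken from the paper): the output has to be a 0/1 variable whose probability of equalling one is f(p), not merely a real-valued unbiased estimate, which is exactly where the [0,1] constraint mentioned above bites. For f(p)=p² the factory is immediate, flipping two coins and reporting whether both are ones.

## toy Bernoulli factory for f(p) = p^2: the algorithm only sees the coins,
## never p itself, and must return a single 0/1 value with mean f(p)
factory_psquare <- function(coin) coin() * coin()

p    <- 0.3
coin <- function() rbinom(1, 1, p)
mean(replicate(1e5, factory_psquare(coin)))   # should be close to p^2 = 0.09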

The paper by Latuszyński et al. develops an original scheme to achieve simulation from B(f(p)) through the simulation of two bounding sequences that are respectively super- and submartingales and that both converge to f(p). (But their simulation scheme does not have to wait for the two sequences to coalesce.) This idea presumably (?) stemmed from the experience of the authors, in particular Gareth Roberts, in perfect sampling, since the most advanced perfect samplers made intensive use of the sandwiching idea of Kendall and Møller (2000, Advances in Applied Probability). The whole thing is also related to the famous Series B paper of Beskos et al. (2006). The method of Latuszyński et al. then builds the upper and lower processes via a truncated series decomposition of f(p), whose availability induces constraints on f.

The first application illustrated in Latuszyński et al. is the unbiased estimation of a transform f(p) that has a known series expansion

f(p) = \sum_{k=1}^\infty (-1)^{k+1} a_k p^k

with

1 \ge a_1 \ge a_2 \ge \cdots

In that case, we could use the scheme of our paper with Randal, estimating p^k by

\hat p^k = X_1\ldots X_k.

The probability of using at least n simulations is then p^n, while the scheme of Latuszyński et al. leads to a probability of a_n p^n. (Note however that the direct approach above allows one to handle any series decomposition, alternating or not, with no constraint on the a_i's.)
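Here is a rough R sketch of this direct scheme as I read it (the implementation details, such as reusing a single coin stream across all terms and stopping at the first zero coin, are my own choices): since X_1⋯X_k vanishes as soon as one coin equals zero, the estimator reduces to summing the terms (−1)^{k+1} a_k over the initial run of ones. The output is unbiased for f(p) but, contrary to a genuine Bernoulli factory output, it is not constrained to lie in [0,1].

## rough sketch of the direct scheme (my reading): estimate every p^k with the
## same coin stream via X_1...X_k and stop at the first zero, after which all
## remaining terms vanish; the result is unbiased for sum_k (-1)^(k+1) a_k p^k
estimate_f <- function(coin, a) {
  total <- 0
  k <- 0
  while (coin() == 1) {                    # keep going while all coins so far equal 1
    k     <- k + 1
    total <- total + (-1)^(k + 1) * a(k)   # contribution a_k * X_1...X_k
  }
  total
}

## example: f(p) = 1 - exp(-p) = sum_{k>=1} (-1)^(k+1) p^k / k!
a    <- function(k) 1 / factorial(k)
p    <- 0.4
coin <- function() rbinom(1, 1, p)
mean(replicate(1e5, estimate_f(coin, a)))  # should be close to 1 - exp(-0.4), about 0.33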

What I find exciting about this Bernoulli factory problem is that the well-known issue of the absence of unbiased estimators for most transforms of a parameter p (Lehmann and Casella, 1998) vanishes when an unlimited number of iid simulations with mean p is available. Here are the slides of the talk given by Omiros last week at the Big'MC seminar.

Vanilla Rao-Blackwellisation for revision

Posted in R, Statistics on March 18, 2010 by xi'an

The vanilla Rao-Blackwellisation paper with Randal Douc that had been resubmitted to the Annals of Statistics is now back for a revision, with quite encouraging comments:

The paper has been reviewed by two referees both of whom comment on the clear exposition and the novelty of the results. Both referees point to the empirical results as being suggestive of a more incremental improvement in practice rather than a major advance. However the approach the authors adopt is novel and I believe may motivate further developments in this area.

I cannot but agree with those comments! Since we are reducing the variance of the weights, the overall effect may be difficult to spot in practical applications. In the current version of the paper, we manage a 20% reduction in the variance of those weights, but obviously this does not translate into the same reduction in the variance of the overall estimator! Our vanilla Rao-Blackwellisation does not speed up the Markov chain.

