## on control variates

Posted in Books, Kids, Statistics, University life with tags , , , , , , , , , , , , on May 27, 2023 by xi'an

A few months ago, I had to write a thesis evaluation of Rémi Leluc’s PhD, which contained several novel Monte Carlo proposals on control variates and importance techniques. For instance, Leluc et al. (Statistics and Computing, 2021) revisits the concept of control variables by adding a perspective of control variable selection using LASSO. This prior selection is relevant since control variables are not necessarily informative about the objective function being integrated and my experience is that the more variables the less reliable the improvement. The remarkable feature of the results is in obtaining explicit and non-asymptotic bounds.

The author obtains a concentration inequality on the error resulting from the use of control variables, under strict assumptions on the variables. The associated numerical experiment illustrates the difficulties of practically implementing these principles due to the number of parameters to calibrate. I found the example of a capture-recapture experiment on ducks (European Dipper) particularly interesting, not only because we had used it in our book but also because it highlights the dependence of estimates on the dominant measure.

Based on a NeurIPS 2022 poster presentation Chapter 3 is devoted to the use of control variables in sequential Monte Carlo, where a sequence of importance functions is constructed based on previous iterations to improve the approximation of the target distribution. Under relatively strong assumptions of importance functions dominating the target distribution (which could generally be achieved by using an increasing fraction of the data in a partial posterior distribution), of sub-Gaussian tails of an intractable distribution’s residual, a concentration inequality is established for the adaptive control variable estimator.

This chapter uses a different family of control variables, based on a Stein operator introduced in Mira et al. (2016). In the case where the target is a mixture in IRd, one of our benchmarks in Cappé et al. (2008), remarkable gains are obtained for relatively high dimensions. While the computational demands of these improvements are not mentioned, the comparison with an MCMC approach (NUTS) based on the same number of particles demonstrates a clear improvement in Bayesian estimation.

Chapter 4 corresponds to a very recent arXival and presents a very original approach to control variate correction by reproducing the interest rate law through an approximation using the closest neighbor (leave-one-out) method. It requires neither control function nor necessarily additional simulation, except for the evaluation of the integral, which is rather remarkable, forming a kind of parallel with the bootstrap. (Any other approximation of the distribution would also be acceptable if available at the same computational cost.) The thesis aims to establish the convergence of the method when integration is performed by a Voronoi tessellation, which leads to an optimal rate of order n-1-2/d for quadratic error (under conditions of integrand regularity). In the alternative where the integral must be evaluated by Monte Carlo, this optimality disappears, unless a massive amount of simulations are used. Numerical illustrations cover SDEs and a Bayesian hierarchical modeling already used in Oates et al. (2017), with massive gain in both cases.

## taking advantage of the constant

Posted in Books, Kids, pictures, R, Statistics, University life with tags , , , , , , , , on May 19, 2022 by xi'an

A question from X validated had enough appeal for me to procrastinate about it for ½ an hour: what difference does it make [for simulation purposes] that a target density is properly normalised? In the continuous case, I do not see much to exploit about this knowledge, apart from the value potentially leading to a control variate (in a Gelfand and Dey 1996 spirit) and possibly to a stopping rule (by checking that the portion of the space visited so far has mass close to one, but this is more delicate than it sounds).

In a (possibly infinite) countable setting, it seems to me one gain (?) is that approximating expectations by Monte Carlo no longer requires iid simulations in the sense that once visited,  atoms need not be visited again. Self-avoiding random walks and their generalisations thus appear as a natural substitute for MC(MC) methods in this setting, provided finding unexplored atoms proves manageable. For instance, a stopping rule is always available, namely that the cumulated weight of the visited fraction of the space is close enough to one. The above picture shows a toy example on a 500 x 500 grid with 0.1% of the mass remaining at the almost invisible white dots. (In my experiment, neighbours for the random exploration were chosen at random over the grid, as I assumed no global information was available about the repartition over the grid either of mass function or of the function whose expectation was seeked.)

## Jana de Wiljes’ colloquium at Warwick

Posted in Statistics with tags , , , , , , on February 25, 2020 by xi'an

## scalable Langevin exact algorithm

Posted in Books, Statistics, Travel, University life with tags , , , , , , , , , , , on October 18, 2016 by xi'an

“By employing a modification to existing naïve subsampling techniques we can obtain an algorithm which is still exact but has sub-linear iterative cost as a function of data size.”

A few weeks ago Murray Pollock, Paul Fearnhead, Adam Johansen and Gareth Roberts (all from Warwick except for Paul) arXived a paper The Scalable Langevin Exact Algorithm: Bayesian Inference for Big Data. (This was also the topic of Murray’s talk last year at JSM in Seattle.) One major advance found in the paper is the derivation of an “exact” algorithm that is sub-linear in the data size. As discussed in the introduction, the current approaches to large data problems either suffer from being approximate (like divide-and-conquer methods) or do not achieve significant reduction in the computing time, being of order O(n). The authors mention Teh and Welling (2011) sand their tochastic gradient approximation to the Langevin diffusion, when the gradient is based on a subsample. Without the Metropolis correction that would ensure an exact target but at a cost of order O(n). (Which makes the technique rather difficult to recommend.)

A novel [for me] notion at the core of this paper is the concept of quasi-stationary distribution, which is the limiting distribution of a Markov chain X[t] conditional on a Markov stopping time [being larger than t]. The approach is based on diffusions with appropriate stationary distributions like the Langevin diffusion. (Actually, as in most papers I have read and remember, the current paper only considers the Langevin diffusion.) In order to avoid the issues with unadjusted and Metropolis-adjusted Langevin schemes, a killed Brownian motion is created, which means a Brownian motion conditional of being alive till time T when the instantaneous killing rate is a given function of the chain, Φ(X[t]), related with the stationary measure of the Langevin diffusion ν. Under appropriate conditions, the density of this killed Brownian motion converges [in T] to √ν. Which immediately hints at creating a new Langevin diffusion targeting ν² instead of ν. And killing it with the proper rate, which can be done by thinning a Poisson process. Simulating the original path can be done by path-space rejection sampling, following the technique set by Gareth Roberts and co-authors more than ten years ago. Based on finite dimensional realisations of the path on [0,T]. And including the killing part can be done by importance sampling and checking that the simulated killing time is larger than the current (exponentially simulated) time.

One practical difficulty in the implementation of this neat principle is the derivation of the normalising constant, which evaluation degrades with the time horizon T. The solution adopted in the paper is through a sequential Monte Carlo method, using another discretisation of the time interval [0,T] (relying on the original one would get too costly?). As for subsampling, since the survival probability for the Brownian motion is based on an unbiased estimator, subsampling does not hurt if conducted in a random manner. Although this increases the variance on principle, the use of a control variate computed just once helps in reducing the complexity to O(1).

This is a tough paper and I have not gone through the effort of trying to implement it, but this is an original and innovative construct I would like to monitor in further details on a toy example, maybe next week while in Warwick. Or at least to discuss it with the authors.

## exact, unbiased, what else?!

Posted in Books, Statistics, University life with tags , , , , , , , , on April 13, 2016 by xi'an

Last week, Matias Quiroz, Mattias Villani, and Robert Kohn arXived a paper on exact subsampling MCMC, a paper that contributes to the current literature on approximating MCMC samplers for large datasets, in connection with an earlier paper of Quiroz et al. discussed here last week.

The “exact” in the title is to be understood in the Russian roulette sense. By using Rhee and Glynn debiaising device, the authors achieve an unbiased estimator of the likelihood as in Bardenet et al. (2015). The central tool for the derivation of an unbiased and positive estimator is to find a control variate for each component of the log likelihood that is good enough for the difference between the component and the control to be lower bounded. By the constant a in the screen capture above. When the individual terms d in the product are iid unbiased estimates of the log likelihood difference. And q is the sum of the control variates. Or maybe more accurately of the cheap substitutes to the exact log likelihood components. Thus still of complexity O(n), which makes the application to tall data more difficult to contemplate.

The \$64 question is obviously how to produce cheap and efficient control variates that kill the curse of the tall data. (It still irks to resort to this term of control variate, really!) Section 3.2 in the paper suggests clustering the data and building an approximation for each cluster, which seems to imply manipulating the whole dataset at this early stage. At a cost of O(Knd). Furthermore, because finding a correct lower bound a is close to impossible in practice, the authors use a “soft lower bound”, meaning that it is only an approximation and thus that (3.4) above can get negative from time to time, which cancels the validation of the method as a pseudo-marginal approach. The resolution of this difficulty is to resort to the same proxy as in the Russian roulette paper, replacing the unbiased estimator with its absolute value, an answer I already discussed for the Russian roulette paper. An additional step is proposed by Quiroz et al., namely correlating the random numbers between numerator and denominator in their final importance sampling estimator, via a Gaussian copula as in Deligiannidis et al.

This paper made me wonder (idly wonder, mind!) anew how to get rid of the vexing unbiasedness requirement. From a statistical and especially from a Bayesian perspective, unbiasedness is a second order property that cannot be achieved for most transforms of the parameter θ. And that does not keep under reparameterisation. It is thus vexing and perplexing that unbiased is so central to the validation of our Monte Carlo technique and that any divergence from this canon leaves us wandering blindly with no guarantee of ever reaching the target of the simulation experiment…