Archive for Charles Stein

on control variates

Posted in Books, Kids, Statistics, University life on May 27, 2023 by xi'an

A few months ago, I had to write a thesis evaluation of Rémi Leluc's PhD, which contained several novel Monte Carlo proposals on control variates and importance techniques. For instance, Leluc et al. (Statistics and Computing, 2021) revisit the concept of control variates from the perspective of control variate selection via the LASSO. This prior selection is relevant since control variates are not necessarily informative about the integrand, and my experience is that the more variates, the less reliable the improvement. The remarkable feature of the results is the derivation of explicit and non-asymptotic bounds.
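As a loose illustration of the idea (my own toy version, not Leluc et al.'s actual algorithm), here is a minimal numpy sketch in which a LASSO fit, implemented by plain coordinate descent, selects among a pool of zero-mean candidate control variates, only some of which are informative about the integrand; all functions and tuning values below are invented for the example:

```python
import numpy as np

def lasso(H, y, alpha, iters=200):
    """Coordinate-descent LASSO for (1/2n)||y - H b||^2 + alpha ||b||_1."""
    n, p = H.shape
    Hc, yc = H - H.mean(0), y - y.mean()
    beta = np.zeros(p)
    col_sq = (Hc**2).sum(0)
    for _ in range(iters):
        for j in range(p):
            r = yc - Hc @ beta + Hc[:, j] * beta[j]   # partial residual
            rho = Hc[:, j] @ r
            beta[j] = np.sign(rho) * max(abs(rho) - n * alpha, 0) / col_sq[j]
    return beta

rng = np.random.default_rng(0)
n = 5000
x = rng.standard_normal(n)
f = x**2 + x                     # integrand; E[f] = 1 under N(0,1)

# candidate control variates, all with known zero mean under N(0,1);
# only the first two are actually informative about f
H = np.column_stack([
    x,                           # He1
    x**2 - 1,                    # He2
    x**3 - 3 * x,                # He3
    np.sin(x),                   # zero mean by symmetry
    np.sin(2 * x),
    np.cos(x) - np.exp(-0.5),    # E[cos X] = exp(-1/2)
])

beta = lasso(H, f, alpha=0.01)
cv_estimate = np.mean(f - H @ beta)   # control-variate corrected estimate
print(cv_estimate, np.mean(f))        # corrected versus naive estimate of 1
```

The LASSO penalty shrinks the coefficients of the uninformative candidates towards (or to) zero, so the corrected estimate does not pay the variance price of fitting irrelevant controls.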

The author obtains a concentration inequality on the error resulting from the use of control variates, under strict assumptions on these variates. The associated numerical experiment illustrates the difficulty of implementing these principles in practice, due to the number of parameters to calibrate. I found the capture-recapture example on the European Dipper particularly interesting, not only because we had used it in our book but also because it highlights the dependence of the estimates on the dominating measure.

Based on a NeurIPS 2022 poster presentation, Chapter 3 is devoted to the use of control variates in sequential Monte Carlo, where a sequence of importance functions is constructed from previous iterations to improve the approximation of the target distribution. Under relatively strong assumptions, namely that the importance functions dominate the target distribution (which can generally be achieved by using an increasing fraction of the data in a partial posterior distribution) and that the residual of an intractable distribution has sub-Gaussian tails, a concentration inequality is established for the adaptive control variate estimator.

This chapter uses a different family of control variates, based on a Stein operator introduced in Mira et al. (2016). In the case where the target is a mixture in ℝ^d, one of our benchmarks in Cappé et al. (2008), remarkable gains are obtained in relatively high dimensions. While the computational cost of these improvements is not mentioned, the comparison with an MCMC approach (NUTS) based on the same number of particles demonstrates a clear improvement in Bayesian estimation.
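In the simplest univariate instance, the mechanism behind such Stein (or Langevin) operator control variates can be checked in a few lines; this toy numpy illustration (mine, not the thesis') uses the standard Normal target, where φ(x) = x yields the control variate 1 − x² and hence an essentially zero-variance estimator of E[X²]:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(4000)    # draws from the target p = N(0,1)

score = -x                       # d/dx log p(x)
f = x**2                         # integrand, with E_p[f] = 1

# Stein/Langevin operator applied to phi(x) = x:
# (A phi)(x) = phi'(x) + phi(x) * score(x) = 1 - x^2, with E_p[A phi] = 0
u = 1.0 + x * score

# optimal scalar coefficient by least squares on the centred variables
uc, fc = u - u.mean(), f - f.mean()
c = (fc @ uc) / (uc @ uc)
est = np.mean(f - c * u)
print(est, np.mean(f))           # corrected (essentially exact) versus naive
```

Here c ≈ −1, so f − c·u ≈ x² + (1 − x²) = 1 pointwise, an artificially perfect case that nonetheless shows why score-based control variates can deliver the large gains mentioned above.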

Chapter 4 corresponds to a very recent arXival and presents a very original approach to control variate correction, by reproducing the distribution of interest through a (leave-one-out) nearest-neighbour approximation. It requires neither a control function nor, necessarily, additional simulations, except for the evaluation of the integral, which is rather remarkable and forms a kind of parallel with the bootstrap. (Any other approximation of the distribution would be just as acceptable, if available at the same computational cost.) The thesis establishes the convergence of the method when the integration is performed by a Voronoi tessellation, which leads to an optimal rate of order n^{-1-2/d} for the quadratic error (under regularity conditions on the integrand). In the alternative where the integral must be evaluated by Monte Carlo, this optimality disappears, unless a massive number of simulations is used. Numerical illustrations cover SDEs and a Bayesian hierarchical model already used in Oates et al. (2017), with massive gains in both cases.
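My loose reading of the construction, in a deliberately crude one-dimensional numpy sketch (the above rates obviously require the Voronoi version, not this Monte Carlo shortcut): the leave-one-out nearest-neighbour predictor of the integrand plays the role of the control, and its own integral is approximated with fresh, cheap draws requiring no further evaluations of f:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000
x = rng.standard_normal(n)                 # draws from q = N(0,1)
f = lambda t: np.sin(t) + t**2             # integrand; E_q[f] = 1
fx = f(x)

# leave-one-out 1-NN prediction of f at each sample point (1-d, via sorting)
order = np.argsort(x)
xs, fs = x[order], fx[order]
gap = np.abs(xs[1:] - xs[:-1])
left, right = np.r_[np.inf, gap], np.r_[gap, np.inf]
nn = np.where(left <= right, np.r_[0, np.arange(n - 1)], np.r_[np.arange(1, n), n - 1])
f_loo = fs[nn]

# integral of the NN approximation, estimated with fresh draws from q
# (cheap: no new evaluations of f, only lookups of stored values)
y = rng.standard_normal(20000)
idx = np.searchsorted(xs, y).clip(1, n - 1)
nearest = np.where(np.abs(y - xs[idx - 1]) <= np.abs(y - xs[idx]), idx - 1, idx)
I_hat = fs[nearest].mean()

est = np.mean(fx - f_loo) + I_hat          # control-variate corrected estimate
print(est, np.mean(fx))
```

Since f − f̂ is of the order of the nearest-neighbour spacing, the residual term has much smaller variance than the naive average, which is where the gains come from.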

black box MCMC

Posted in Books, Statistics on July 17, 2021 by xi'an

“…black-box methods, despite using no information of the proposal distribution, can actually give better estimation accuracy than the typical importance sampling [methods]…”

Earlier this week I was pointed to Liu & Lee’s black box importance sampling, published at AISTATS 2017 (a conference I did not attend). Already found in Briol et al. (2015) and Oates, Girolami, and Chopin (2017), the method starts from Charles Stein‘s “unbiased estimator of the loss” (which was a fundamental tool in my own PhD thesis!), a variation on integration by parts:

\mathbb E_p[\nabla\log p(X) f(X)+\nabla f(X)]=0

for differentiable functions f and p vanishing at the boundaries. It also holds for the kernelised extension

\mathbb E_p[k_p(X,x')]=0

for all x’, where the integrand k_p is built from an arbitrary kernel k(x,x’) and the score function ∇log p. This null expectation happens to be a minimum since

\mathbb E_{X,X'\sim q}[k_p(X,X')]\ge 0

and hence importance weights can be obtained by minimising

\sum_{ij} w_i w_j k_p(x_i,x_j)

in w (over the unit simplex), for a sample of iid realisations from a possibly unknown distribution with density q. Liu & Lee show that this approximation converges faster than the standard Monte Carlo rate of 1/√n, thanks to the Hilbertian properties of the kernel acting through control variates. Actually, the same holds when using a (leave-one-out) non-parametric kernel estimate of q rather than q itself. At least in theory.
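For concreteness, here is a bare-bones numpy sketch of the above quadratic programme (my own toy version, with an arbitrary Gaussian bandwidth and a crude exponentiated-gradient solver, not whatever Liu & Lee actually implement):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
x = rng.normal(1.0, 1.5, n)      # iid draws from a "black box" q, not from p

s = -x                           # score of the target p = N(0,1)
h = 1.0                          # Gaussian kernel bandwidth (arbitrary choice)
d = x[:, None] - x[None, :]
k = np.exp(-d**2 / (2 * h**2))

# Steinised kernel k_p(x, x') in one dimension:
# s(x)s(x')k + s(x) dk/dx' + s(x') dk/dx + d^2k/dxdx'
Kp = (s[:, None] * s[None, :] * k
      + s[:, None] * (d / h**2) * k
      + s[None, :] * (-d / h**2) * k
      + (1 / h**2 - d**2 / h**4) * k)

# minimise w' Kp w over the unit simplex by exponentiated gradient descent
w = np.full(n, 1.0 / n)
for _ in range(2000):
    g = 2 * Kp @ w
    w = w * np.exp(-0.2 * (g - g.min()))   # shift for numerical stability
    w /= w.sum()

print(x @ w, x.mean())   # weighted mean pulled towards E_p[X] = 0, naive near 1
```

The weighted sample then approximates expectations under p even though the draws came from q and q itself never entered the computation, only the score of p did.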

“…simulating n parallel MCMC chains for m steps, where the length m of the chains can be smaller than what is typically used in MCMC, because it just needs to be large enough to bring the distribution `roughly’ close to the target distribution”

A practical application of the concept is suggested in the above quote, as a corrected weighting for interrupted MCMC, or when using an unadjusted Langevin algorithm. Provided the minimisation of the quadratic objective is fast enough, the method can thus be used as a benchmark for regular MCMC implementations.

ISBA 2021.1

Posted in Kids, Mountains, pictures, Running, Statistics, Travel, University life, Wines on June 29, 2021 by xi'an

An infinite (mixture) session was truly the first one I could attend on Day 1, as a heap of unexpected last minute issues kept me busy or on edge for the beginning of the day (if not preventing me from a dawn dip in Calanque de Morgiou). Using the CIRM video system for Zoom talks required more preparation than I had thought and we made it barely in time for the first session, while I had to store Zoom links for all speakers present in Luminy.  Plus allocate sessions to the rooms provided by CIRM, twice since there was a mishap with the other workshop present at CIRM. And reassuring speakers, made anxious by the absence of a clear schedule. Chairing the second ABC session was also a tense moment, from checking every speaker could connect and share slides, to ensuring they kept on schedule (and they did on both, ta!), to checking for questions at the end. Spotting a possible connection between Takuo Matsubara’s Stein approximation in the ABC setup and a related paper by Liu and Lee I had read just a few days ago. Alas, it was too early to relax as an inverter in the CIRM room burned and led to a local power failure. Fortunately this was restored prior to the mixture session! (As several boars were spotted on the campus yesternight, I hope no tragic encounter happens before the end of the meeting!!!) So the mixture session proposed new visions on inferring K, the number of components, some of which reminded me of… my first talk at CIRM where I was trying to get rid of empty components at each MCMC step, albeit in a much more rudimentary way obviously. And later had the wonderful surprise of hearing Xiao-Li’s lecture start by an excerpt from Car Talk, the hilarious Sunday morning radio talk-show about the art of used car maintenance on National Public Radio (NPR) that George Casella could not miss (and where a letter he wrote them about a mistaken probability computation was mentioned!).
The final session of the day was an invited ABC session I chaired (after being exfiltrated from the CIRM dinner table!) with Kate Lee, Ryan Giordano, and Julien Stoehr as speakers. Besides Julien’s talk on our Gibbs-ABC paper, the other two talks shared a concern with the frequentist properties of the ABC posterior, either to be used as a control tool or as a faster assessment of the variability of the (Monte Carlo) ABC output.

estimation of a normal mean matrix

Posted in Statistics on May 13, 2021 by xi'an

A few days ago, I noticed that the paper Estimation under matrix quadratic loss and matrix superharmonicity by Takeru Matsuda and my friend Bill Strawderman had appeared in Biometrika. (Disclaimer: I was not involved in handling the submission!) This is a “classical” shrinkage estimation problem in that estimators of a normal mean matrix are compared under a (matrix) quadratic loss, using Charles Stein’s technique of unbiased estimation of the risk. The authors show that the Efron–Morris estimator is minimax. They also introduce superharmonicity for matrix-variate functions, towards showing that generalized Bayes estimators associated with matrix superharmonic priors are minimax, including a generalization of Stein’s prior. Superharmonicity relates to (much) earlier results by Ed George (1986), Mary-Ellen Bock (1988), and Dominique Fourdrinier, Bill Strawderman, and Marty Wells (1998). (All of whom I worked with in the 1980’s and 1990’s, in Rouen, Purdue, and Cornell!) This paper also made me realise that Dominique, Bill, and Marty had published a Springer book on shrinkage estimation a few years ago, and that I had missed it..!

training energy based models

Posted in Books, Statistics on April 7, 2021 by xi'an

This recent arXival by Song and Kingma covers different computational approaches to semi-parametric estimation, but also exposes imho the chasm existing between statistical and machine learning perspectives on the problem.

“Energy-based models are much less restrictive in functional form: instead of specifying a normalized probability, they only specify the unnormalized negative log-probability (…) Since the energy function does not need to integrate to one, it can be parameterized with any nonlinear regression function.”

The above, from the introduction, at first appears a strange argument, since the mass-one constraint is the least of the problems when addressing non-parametric density estimation, compared with convergence, the speed of convergence, the computational cost, and the overall integrability of the estimator. It seems however that the restriction, or lack thereof, is to be understood as the ability to use much more elaborate forms of densities, which are then black boxes whose components have little relevance… When using such mega-over-parameterised representations of densities, such as neural networks and normalising flows, a statistical assessment leads to highly challenging questions. But convergence (in the sample size) does not appear to be a concern for the paper. (Except for a citation of Hyvärinen on p.5.)

Using MLE in this context appears questionable, though, since the base parameter θ may well fail to be identifiable. Computing the MLE is a comparatively minor issue, in this regard, a resolution based on simulated gradients being well-charted territory since the earlier era of stochastic optimisation, as in Robbins & Monro (1951), Duflo (1996), or Benveniste & al. (1990). (The gradient of the log-normalising constant being estimated by the opposite of the gradient of the energy at a point simulated from the model.)
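In equations, for p_θ(x) ∝ exp{−E_θ(x)}, the gradient of the negative log-likelihood is E_data[∂E/∂θ] − E_{p_θ}[∂E/∂θ], with the second expectation estimated by simulating from the model. A minimal numpy sketch on a deliberately trivial energy family (where exact model simulation replaces the MCMC run that would be needed in general):

```python
import numpy as np

rng = np.random.default_rng(4)
data = rng.normal(2.0, 1.0, 1000)          # observations

# energy family E_theta(x) = (x - theta)^2 / 2, so p_theta = N(theta, 1)
# and the MLE of theta is the sample mean
theta = 0.0
lr = 0.1
for _ in range(500):
    xb = rng.choice(data, 64)              # data minibatch
    xm = rng.normal(theta, 1.0, 64)        # model samples (exact here; in
                                           # general a short MCMC run)
    # stochastic gradient of the NLL:
    # E_data[dE/dtheta] - E_model[dE/dtheta], with dE/dtheta = -(x - theta)
    grad = -(xb - theta).mean() + (xm - theta).mean()
    theta -= lr * grad

print(theta, data.mean())                  # both close to 2
```

The "model samples" line is exactly where the computational pain sits in realistic energy-based models, which is what the contrastive divergence shortcut below addresses.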

“Running MCMC till convergence to obtain a sample x∼p(x) can be computationally expensive.”

Contrastive divergence à la Hinton (2002) is presented as a solution to this convergence problem, by stopping early, which seems reasonable given that the random gradient is mostly noise. With a possible correction for the bias à la Jacob & al. (the published version of which is missing from the references).

An alternative to MLE is the 2005 Hyvärinen score, notorious for bypassing the normalising constant. But it is blamed in the paper for being costly in the dimension d of the variate x, due to the second derivative matrix. This cost can be avoided by using Stein’s unbiased estimator of the risk (yay!) when using randomized data. And it is surprisingly linked with contrastive divergence as well, if a Taylor expansion is a good enough approximation! An interesting byproduct of the discussion on score matching is that it turns into an unintended form of ABC!
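As a reminder of how the Hyvärinen score bypasses the normalising constant, here is a toy numpy illustration on the Normal family, where the score-matching objective and its minimisers come in closed form (obviously a far cry from the neural settings of the paper):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(3.0, 2.0, 5000)

# Hyvarinen score matching for the family N(mu, v):
# model score s(y) = -(y - mu)/v and s'(y) = -1/v, so the objective
#   J(mu, v) = E[ s(x)^2 / 2 + s'(x) ] = E[(x - mu)^2]/(2 v^2) - 1/v
# involves no normalising constant at all
def J(mu, v):
    return np.mean((x - mu)**2) / (2 * v**2) - 1 / v

# setting the gradients to zero gives closed-form estimators here
mu_hat = x.mean()                    # from dJ/dmu = 0
v_hat = np.mean((x - mu_hat)**2)     # from dJ/dv = 0

print(mu_hat, v_hat)                 # approximately 3 and 4
# sanity check: the objective is indeed lower at the fitted values
print(J(mu_hat, v_hat) < J(mu_hat + 0.5, v_hat),
      J(mu_hat, v_hat) < J(mu_hat, 2 * v_hat))
```

In higher dimensions the s'(x) term becomes the trace of a Hessian, which is precisely the d-dependent cost the paper complains about.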

“Many methods have been proposed to automatically tune the noise distribution, such as Adversarial Contrastive Estimation (Bose et al., 2018), Conditional NCE (Ceylan and Gutmann, 2018) and Flow Contrastive Estimation (Gao et al., 2020).”

A third approach is the noise contrastive estimation method of Gutmann & Hyvärinen (2010) that connects with both others. And is a precursor of GAN methods, mentioned at the end of the paper via a (sort of) variational inequality.
