Archive for Charles Stein

black box MCMC

Posted in Books, Statistics with tags , , , , , , , , on July 17, 2021 by xi'an

“…back-box methods, despite using no information of the proposal distribution, can actually give better estimation accuracy than the typical importance sampling [methods]…”

Earlier this week I was pointed out to Liu & Lee’s black box importance sampling, published in AISTATS 2017. (which I did not attend). Already found in Briol et al. (2015) and Oates, Girolami, and Chopin (2017), the method starts from Charles Stein‘s “unbiased estimator of the loss” (that was a fundamental tool in my own PhD thesis!), a variation on integration by part:

\mathbb E_p[\nabla\log p(X) f(X)+\nabla f(X)]=0

for differentiable functions f and p cancelling at the boundaries. It also holds for the kernelised extension

\mathbb E_p[k_p(X,x')]=0

for all x’, where the integrand is a 1-d function of an arbitrary kernel k(x,x’) and of the score function ∇log p. This null expectation happens to be a minimum since

\mathbb E_{X,X'\sim q}[k_p(X,X')]\ge 0

and hence importance weights can be obtained by minimising

\sum_{ij} w_i w_j k_p(x_i,x_j)

in w (from the unit simplex), for a sample of iid realisations from a possibly unknown distribution with density q. Liu & Lee show that this approximation converges faster than the standard Monte Carlo speed √n, when using Hilbertian properties of the kernel through control variates. Actually, the same thing happens when using a (leave-one-out) non-parametric kernel estimate of q rather than q. At least in theory.

“…simulating n parallel MCMC chains for m steps, where the length m of the chains can be smaller than what is typically used in MCMC, because it just needs to be large enough to bring the distribution `roughly’ close to the target distribution”

A practical application of the concept is suggested in the above quote. As a corrected weight for interrupted MCMC. Or when using an unadjusted Langevin algorithm. Provided the minimisation of the objective quadratic form is fast enough, the method can thus be used as a benchmark for regular MCMC implementation.

ISBA 2021.1

Posted in Kids, Mountains, pictures, Running, Statistics, Travel, University life, Wines with tags , , , , , , , , , , , , , , , , , , on June 29, 2021 by xi'an

An infinite (mixture) session was truly the first one I could attend on Day 1, as a heap of unexpected last minute issues kept me busy or on hedge for the beginning of the day (if not preventing me from a dawn dip in Calanque de Morgiou). Using the CIRM video system for zoom talked required more preparation than I had thought and we made it barely in time for the first session, while I had to store zoom links for all speakers present in Luminy.  Plus allocate sessions to the rooms provided by CIRM, twice since there was a mishap with the other workshop present at CIRM. And reassuring speakers, made anxious by the absence of a clear schedule. Chairing the second ABC session was also a tense moment, from checking every speaker could connect and share slides, to ensuring they kept on schedule (and they did on both!, ta’), to checking for questions at the end. Spotting a possible connection between Takuo Mastubara’s Stein’s approximation for in the ABC setup and a related paper by Liu and Lee I had read just a few days ago. Alas, it was too early to relax as an inverter in the CIRM room burned and led to a local power failure. Fortunately this was restored prior to the mixture session! (As several boars were spotted on the campus yesternight, I hope no tragic encounter happens before the end of the meeting!!!) So the mixture session proposed new visions on infering K, the number of components, some of which reminded me of… my first talk at CIRM where I was trying to get rid of empty components at each MCMC step, albeit in a much more rudimentary way obviously. And later had the wonderful surprise of hearing Xiao-Li’s lecture start by an excerpt from Car Talk, the hilarious Sunday morning radio talk-show about the art of used car maintenance on National Public Radio (NPR) that George Casella could not miss (and where a letter he wrote them about a mistaken probability computation was mentioned!). The final session of the day was an invited ABC session I chaired (after being exfiltrated from the CIRM dinner table!) with Kate Lee, Ryan Giordano, and Julien Stoehr as speakers. Besides Julien’s talk on our Gibbs-ABC paper, both other talks shared a concern with the frequentist properties of the ABC posterior, either to be used as a control tool or as a faster assessment of the variability of the (Monte Carlo) ABC output.

estimation of a normal mean matrix

Posted in Statistics with tags , , , , , , , , , on May 13, 2021 by xi'an

A few days ago, I noticed the paper Estimation under matrix quadratic loss and matrix superharmonicity by Takeru Matsuda and my friend Bill Strawderman had appeared in Biometrika. (Disclaimer: I was not involved in handling the submission!) This is a “classical” shrinkage estimation problem in that covariance matrix estimators are compared under under a quadratic loss, using Charles Stein’s technique of unbiased estimation of the risk is derived. The authors show that the Efron–Morris estimator is minimax. They also introduce superharmonicity for matrix-variate functions towards showing that generalized Bayes estimator with respect to a matrix superharmonic priors are minimax., including a generalization of Stein’s prior. Superharmonicity that relates to (much) earlier results by Ed George (1986), Mary-Ellen Bock (1988),  Dominique Fourdrinier, Bill Strawderman, and Marty Wells (1998). (All of whom I worked with in the 1980’s and 1990’s! in Rouen, Purdue, and Cornell). This paper also made me realise Dominique, Bill, and Marty had published a Springer book on Shrinkage estimators a few years ago and that I had missed it..!

training energy based models

Posted in Books, Statistics with tags , , , , , , , on April 7, 2021 by xi'an

This recent arXival by Song and Kingma covers different computational approaches to semi-parametric estimation, but also exposes imho the chasm existing between statistical and machine learning perspectives on the problem.

“Energy-based models are much less restrictive in functional form: instead of specifying a normalized probability, they only specify the unnormalized negative log-probability (…) Since the energy function does not need to integrate to one, it can be parameterized with any nonlinear regression function.”

The above in the introduction appears first as a strange argument, since the mass one constraint is the least of the problems when addressing non-parametric density estimation. Problems like the convergence, the speed of convergence, the computational cost and the overall integrability of the estimator. It seems however that the restriction or lack thereof is to be understood as the ability to use much more elaborate forms of densities, which are then black-boxes whose components have little relevance… When using such mega-over-parameterised representations of densities, such as neural networks and normalising flows, a statistical assessment leads to highly challenging questions. But convergence (in the sample size) does not appear to be a concern for the paper. (Except for a citation of Hyvärinen on p.5.)

Using MLE in this context appears to be questionable, though, since the base parameter θ is not unlikely to remain identifiable. Computing the MLE is therefore a minor issue, in this regard, a resolution based on simulated gradients being well-chartered from the earlier era of stochastic optimisation as in Robbins & Monro (1954), Duflo (1996) or Benveniste & al. (1990). (The log-gradient of the normalising constant being estimated by the opposite of the gradient of the energy at a random point.)

“Running MCMC till convergence to obtain a sample x∼p(x) can be computationally expensive.”

Contrastive divergence à la Hinton (2002) is presented as a solution to the convergence problem by stopping early, which seems reasonable given the random gradient is mostly noise. With a possible correction for bias à la Jacob & al. (missing the published version).

An alternative to MLE is the 2005 Hyvärinen score, notorious for bypassing the normalising constant. But blamed in the paper for being costly in the dimension d of the variate x, due to the second derivative matrix. Which can be avoided by using Stein’s unbiased estimator of the risk (yay!) if using randomized data. And surprisingly linked with contrastive divergence as well, if a Taylor expansion is good enough an approximation! An interesting byproduct of the discussion on score matching is to turn it into an unintended form of ABC!

“Many methods have been proposed to automatically tune the noise distribution, such as Adversarial Contrastive Estimation (Bose et al., 2018), Conditional NCE (Ceylan and Gutmann, 2018) and Flow Contrastive Estimation (Gao et al., 2020).”

A third approach is the noise contrastive estimation method of Gutmann & Hyvärinen (2010) that connects with both others. And is a precursor of GAN methods, mentioned at the end of the paper via a (sort of) variational inequality.

JB³ [Junior Bayes beyond the borders]

Posted in Books, Statistics, University life with tags , , , , , , , , , , , , , , , , , , on June 22, 2020 by xi'an

Bocconi and j-ISBA are launcing a webinar series for and by junior Bayesian researchers. The first talk is on 25 June, 25 at 3pm UTC/GMT (5pm CET) with Francois-Xavier Briol, one of the laureates of the 2020 Savage Thesis Prize (and a former graduate of OxWaSP, the Oxford-Warwick doctoral training program), on Stein’s method for Bayesian computation, with as a discussant Nicolas Chopin.

As pointed out on their webpage,

Due to the importance of the above endeavor, JB³ will continue after the health emergency as an annual series. It will include various refinements aimed at increasing the involvement of the whole junior Bayesian community and facilitating a broader participation to the online seminars all over the world via various online solutions.

Thanks to all my friends at Bocconi for running this experiment!

%d bloggers like this: