## black box MCMC

Posted in Books, Statistics on July 17, 2021 by xi'an

“…black-box methods, despite using no information of the proposal distribution, can actually give better estimation accuracy than the typical importance sampling [methods]…”

Earlier this week I was pointed to Liu & Lee’s black box importance sampling, published in AISTATS 2017 (which I did not attend). Already found in Briol et al. (2015) and Oates, Girolami, and Chopin (2017), the method starts from Charles Stein’s “unbiased estimator of the loss” (that was a fundamental tool in my own PhD thesis!), a variation on integration by parts:

$\mathbb E_p[\nabla\log p(X) f(X)+\nabla f(X)]=0$

for differentiable functions f and p cancelling at the boundaries. It also holds for the kernelised extension

$\mathbb E_p[k_p(X,x')]=0$

for all x’, where the integrand is a 1-d function of an arbitrary kernel k(x,x’) and of the score function ∇log p. This null expectation happens to be a minimum since

$\mathbb E_{X,X'\sim q}[k_p(X,X')]\ge 0$

and hence importance weights can be obtained by minimising

$\sum_{ij} w_i w_j k_p(x_i,x_j)$

in w (from the unit simplex), for a sample of iid realisations from a possibly unknown distribution with density q. Liu & Lee show that this approximation converges faster than the standard Monte Carlo rate of 1/√n, when using Hilbertian properties of the kernel through control variates. Actually, the same thing happens when using a (leave-one-out) non-parametric kernel estimate of q rather than q. At least in theory.
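To make the minimisation concrete, here is a minimal sketch under assumptions that are mine and not Liu & Lee's: a one-dimensional target, a Gaussian base kernel with a fixed bandwidth h, and SciPy's SLSQP as an off-the-shelf solver over the simplex. The function names `stein_kernel_matrix` and `black_box_weights` are hypothetical. The kernelised integrand is taken as k_p(x,x’) = s(x)s(x’)k + s(x)∂ₓ’k + s(x’)∂ₓk + ∂ₓ∂ₓ’k, with s = ∇log p.

```python
import numpy as np
from scipy.optimize import minimize


def stein_kernel_matrix(x, score, h=1.0):
    """Matrix of k_p(x_i, x_j) for a 1-d Gaussian kernel of bandwidth h (assumed)."""
    d = x[:, None] - x[None, :]            # pairwise differences x_i - x_j
    k = np.exp(-d**2 / (2 * h**2))         # base kernel k(x_i, x_j)
    s = score(x)                           # score of the target at each x_i
    dk_dx = -d / h**2 * k                  # derivative in the first argument
    dk_dy = d / h**2 * k                   # derivative in the second argument
    d2k = (1 / h**2 - d**2 / h**4) * k     # mixed second derivative
    return (s[:, None] * s[None, :] * k + s[:, None] * dk_dy
            + s[None, :] * dk_dx + d2k)


def black_box_weights(x, score, h=1.0):
    """Minimise w' K_p w over the unit simplex (SLSQP is an arbitrary choice)."""
    n = len(x)
    K = stein_kernel_matrix(x, score, h)
    res = minimize(
        lambda w: w @ K @ w,
        np.full(n, 1.0 / n),               # start from uniform weights
        jac=lambda w: 2 * K @ w,
        bounds=[(0.0, None)] * n,
        constraints=[{"type": "eq",
                      "fun": lambda w: w.sum() - 1.0,
                      "jac": lambda w: np.ones_like(w)}],
        method="SLSQP",
        options={"maxiter": 500, "ftol": 1e-12},
    )
    w = np.clip(res.x, 0.0, None)          # guard against tiny negative values
    return w / w.sum()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(1.0, 1.5, size=100)     # draws from q, whose density is never used
    w = black_box_weights(x, score=lambda t: -t)   # target p = N(0, 1)
    print("plain average   :", x.mean())
    print("weighted average:", w @ x)
```

Only the score of the target enters the weights, which is the "black box" aspect: the density q of the sample is never evaluated. A generic quadratic-programming solver scales poorly with n, so this is an illustration and not a recommendation.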

“…simulating n parallel MCMC chains for m steps, where the length m of the chains can be smaller than what is typically used in MCMC, because it just needs to be large enough to bring the distribution `roughly’ close to the target distribution”

A practical application of the concept is suggested in the above quote. As a corrected weight for interrupted MCMC. Or when using an unadjusted Langevin algorithm. Provided the minimisation of the objective quadratic form is fast enough, the method can thus be used as a benchmark for regular MCMC implementation.

## ISBA 2021.1

Posted in Kids, Mountains, pictures, Running, Statistics, Travel, University life, Wines on June 29, 2021 by xi'an

## estimation of a normal mean matrix

Posted in Statistics on May 13, 2021 by xi'an

A few days ago, I noticed the paper Estimation under matrix quadratic loss and matrix superharmonicity by Takeru Matsuda and my friend Bill Strawderman had appeared in Biometrika. (Disclaimer: I was not involved in handling the submission!) This is a “classical” shrinkage estimation problem in that estimators of a normal mean matrix are compared under a quadratic loss, using Charles Stein’s technique of unbiased estimation of the risk. The authors show that the Efron–Morris estimator is minimax. They also introduce superharmonicity for matrix-variate functions towards showing that generalized Bayes estimators with respect to matrix superharmonic priors are minimax, including a generalization of Stein’s prior. Superharmonicity that relates to (much) earlier results by Ed George (1986), Mary-Ellen Bock (1988), Dominique Fourdrinier, Bill Strawderman, and Marty Wells (1998). (All of whom I worked with in the 1980’s and 1990’s, in Rouen, Purdue, and Cornell!) This paper also made me realise Dominique, Bill, and Marty had published a Springer book on Shrinkage estimators a few years ago and that I had missed it..!

## training energy based models

Posted in Books, Statistics on April 7, 2021 by xi'an

This recent arXival by Song and Kingma covers different computational approaches to semi-parametric estimation, but also exposes imho the chasm existing between statistical and machine learning perspectives on the problem.

“Energy-based models are much less restrictive in functional form: instead of specifying a normalized probability, they only specify the unnormalized negative log-probability (…) Since the energy function does not need to integrate to one, it can be parameterized with any nonlinear regression function.”

The above in the introduction appears first as a strange argument, since the mass one constraint is the least of the problems when addressing non-parametric density estimation. Problems like the convergence, the speed of convergence, the computational cost and the overall integrability of the estimator. It seems however that the restriction or lack thereof is to be understood as the ability to use much more elaborate forms of densities, which are then black-boxes whose components have little relevance… When using such mega-over-parameterised representations of densities, such as neural networks and normalising flows, a statistical assessment leads to highly challenging questions. But convergence (in the sample size) does not appear to be a concern for the paper. (Except for a citation of Hyvärinen on p.5.)

Using MLE in this context appears to be questionable, though, since the base parameter θ is unlikely to remain identifiable. Computing the MLE is therefore a minor issue, in this regard, a resolution based on simulated gradients being well-charted from the earlier era of stochastic optimisation as in Robbins & Monro (1951), Duflo (1996) or Benveniste & al. (1990). (The log-gradient of the normalising constant being estimated by the opposite of the gradient of the energy at a random point.)
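The parenthetical remark can be spelled out as follows, with notation introduced here for illustration (energy $E_\theta$, normalising constant $Z_\theta=\int\exp\{-E_\theta(x)\}\,\text{d}x$, density $p_\theta=\exp\{-E_\theta\}/Z_\theta$) and assuming derivative and integral can be exchanged:

$\nabla_\theta\log Z_\theta=\dfrac{1}{Z_\theta}\int\nabla_\theta\exp\{-E_\theta(x)\}\,\text{d}x=-\mathbb E_{p_\theta}[\nabla_\theta E_\theta(X)]$

so that the gradient of the log-likelihood at an observation x is

$\nabla_\theta\log p_\theta(x)=-\nabla_\theta E_\theta(x)+\mathbb E_{p_\theta}[\nabla_\theta E_\theta(X)]$

and a single simulation X from $p_\theta$ returns an unbiased, if noisy, estimate of the second term, which is all a stochastic approximation scheme requires.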

“Running MCMC till convergence to obtain a sample x∼p(x) can be computationally expensive.”

Contrastive divergence à la Hinton (2002) is presented as a solution to the convergence problem by stopping early, which seems reasonable given the random gradient is mostly noise. With a possible correction for bias à la Jacob & al. (missing the published version).

An alternative to MLE is the 2005 Hyvärinen score, notorious for bypassing the normalising constant. But blamed in the paper for being costly in the dimension d of the variate x, due to the second derivative matrix. Which can be avoided by using Stein’s unbiased estimator of the risk (yay!) if using randomized data. And surprisingly linked with contrastive divergence as well, if a Taylor expansion is good enough an approximation! An interesting byproduct of the discussion on score matching is to turn it into an unintended form of ABC!
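As a toy illustration of how the Hyvärinen score bypasses the normalising constant, here is a sketch for a one-parameter energy of my own choosing (not an example from the paper), E(x) = θx²/2 in dimension one. The model score is then −θx, its derivative is −θ, and the empirical objective, the average of half the squared score plus the derivative of the score, has a closed-form minimiser. The function names are hypothetical.

```python
import numpy as np


def score_matching_objective(theta, x):
    """Empirical Hyvarinen score for the assumed energy E(x) = theta * x**2 / 2.

    The model score is -theta * x and its derivative in x is -theta, so the
    objective mean(0.5 * score**2 + dscore/dx) never involves the normalising
    constant of the model.
    """
    return np.mean(0.5 * (theta * x) ** 2 - theta)


def score_matching_estimate(x):
    """Closed-form minimiser: 0.5 * theta**2 * m2 - theta is minimal at 1 / m2."""
    return 1.0 / np.mean(x ** 2)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(0.0, 2.0, size=50_000)   # data with variance 4, i.e. theta = 1/4
    theta_hat = score_matching_estimate(x)
    print("score matching estimate:", theta_hat)
    print("objective at estimate  :", score_matching_objective(theta_hat, x))
```

In dimension d the derivative of the score becomes the trace of a d×d second derivative matrix, which is the cost the paper complains about and which disappears in this scalar toy case.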

“Many methods have been proposed to automatically tune the noise distribution, such as Adversarial Contrastive Estimation (Bose et al., 2018), Conditional NCE (Ceylan and Gutmann, 2018) and Flow Contrastive Estimation (Gao et al., 2020).”

A third approach is the noise contrastive estimation method of Gutmann & Hyvärinen (2010) that connects with both others. And is a precursor of GAN methods, mentioned at the end of the paper via a (sort of) variational inequality.

## JB³ [Junior Bayes beyond the borders]

Posted in Books, Statistics, University life on June 22, 2020 by xi'an

Bocconi and j-ISBA are launching a webinar series for and by junior Bayesian researchers. The first talk is on 25 June at 3pm UTC/GMT (5pm CEST) with Francois-Xavier Briol, one of the laureates of the 2020 Savage Thesis Prize (and a former graduate of OxWaSP, the Oxford-Warwick doctoral training program), on Stein’s method for Bayesian computation, with Nicolas Chopin as a discussant.

As pointed out on their webpage,

Due to the importance of the above endeavor, JB³ will continue after the health emergency as an annual series. It will include various refinements aimed at increasing the involvement of the whole junior Bayesian community and facilitating a broader participation to the online seminars all over the world via various online solutions.

Thanks to all my friends at Bocconi for running this experiment!