## turn-key and scalable synchronous distributed MCMC algorithm

Posted in Statistics, University life on April 29, 2022 by xi'an

Last week, I attended a Lagrange seminar where Vincent Plassier presented an ICML²¹ paper he had co-authored with Maxime Vono, Alain Durmus, and Eric Moulines. The paper aims at distributed MCMC algorithms that operate on several machines, with a target distribution that breaks as a product

$\int\prod_{i=1}^b \pi_i(\theta,z_i)\,\text d\mathbf{z}=\prod_{i=1}^b e^{U_i(A_i\theta)}$

where θ is common to all terms. And each term in the product can (only) be computed locally. This setup is obviously the same as for the embarrassingly parallel approaches of Neiswanger et al. (2014) and Scott et al. (2016). And it follows an earlier proposal of Vono et al. (2020), which appears as a full Gibbs algorithm on the augmented parameters (θ,z), assuming each term is a conditional density in the latent z’s. Which requires constant communications between the b workers and the central “master” node when θ is concerned. The ICML²¹ paper overcomes this difficulty by defining an approximate target with a Normal component in z. Meaning that the (approximate) conditional distribution of θ given the latent z is Normal, i.e. considering the augmented joint

$\prod_{i=1}^b\exp\left\{u_i(z_i)-\rho_i||z_i-A_i\theta||^2\right\}$

but despite the Gaussian aspect, this is not always practical:

“When d [is large], this Gibbs sampling scheme unfortunately leads to prohibitive computational costs and hence prevents its practical use for general Bayesian inference problems.”
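To make the Normal conditional explicit, here is a short derivation sketch from the augmented joint above. It assumes (my assumption, not a statement from the paper) that the matrix Q defined below is invertible. Completing the square in θ gives

$\theta\mid\mathbf{z}\sim\mathcal{N}\left(Q^{-1}\sum_{i=1}^b\rho_iA_i^{\top}z_i,\ \tfrac{1}{2}Q^{-1}\right),\qquad Q=\sum_{i=1}^b\rho_iA_i^{\top}A_i$

since $-\sum_i\rho_i||z_i-A_i\theta||^2=-\theta^{\top}Q\theta+2\theta^{\top}\sum_i\rho_iA_i^{\top}z_i+\text{const}$. Simulating from this Normal involves handling the d×d matrix Q, which is presumably where the cost in the quote comes from when d is large.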

The authors then move to simulating from several Langevin steps, more specifically running one move of the Euler-Maruyama discretisation scheme of the overdamped Langevin stochastic differential equation. Communication with the central node is then reduced. The paper proposes a proof of convergence in this unusual (since overdamped) setup. As well as bounds on the bias due to the inclusion of the latent variables. They also manage to find the required scaling of the various parameters involved (Normal variance, discretisation scale, Langevin runs) to achieve convergence, which I find rather remarkable. The table at the top illustrates the comparison with earlier methods, whenever available.
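As a reminder of what one Euler-Maruyama move of the overdamped Langevin equation looks like, here is a minimal single-machine sketch. It is not the distributed algorithm of the paper: the standard Normal target, the step size 0.1, and the names `ula_chain` and `grad_log_std_normal` are all illustrative assumptions.

```python
import math
import random


def ula_chain(grad_log_target, theta0, step, n_steps, rng):
    """Unadjusted Langevin: one Euler-Maruyama move per iteration.

    theta' = theta + step * grad log pi(theta) + sqrt(2 * step) * N(0, 1)
    """
    theta = theta0
    chain = []
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)
        theta = theta + step * grad_log_target(theta) + math.sqrt(2.0 * step) * noise
        chain.append(theta)
    return chain


def grad_log_std_normal(theta):
    # Toy target (assumption): standard Normal, so grad log pi(theta) = -theta.
    return -theta


rng = random.Random(0)
step = 0.1
chain = ula_chain(grad_log_std_normal, theta0=0.0, step=step, n_steps=50_000, rng=rng)

mean = sum(chain) / len(chain)
var = sum((t - mean) ** 2 for t in chain) / len(chain)
# For this toy target the discretised chain is an AR(1) process whose
# stationary variance is 1 / (1 - step / 2), not 1: the discretisation bias.
biased_var = 1.0 / (1.0 - step / 2.0)
print(f"mean={mean:.3f} var={var:.3f} stationary variance of the scheme={biased_var:.3f}")
```

Without a Metropolis correction, the chain targets a slightly inflated variance, which is the kind of bias that the scaling of the discretisation step has to control.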

## control variates [seminar]

Posted in pictures, Statistics, Travel, University life on November 5, 2021 by xi'an

Today, Petros Dellaportas (whom I have known since the early days of MCMC, when we met in CIRM) gave a seminar at the Warwick algorithm seminar on control variates for MCMC, reminding me of his 2012 JRSS paper. Based on the Poisson equation and using a second control variate to stabilise the Monte Carlo approximation of the first control variate. The difference with usual control variates is finding a first approximate G(x)-q(y|x)G(Y) to F-πF. And the first Poisson equation is using α(x,y)q(y|x) rather than π. Then the second expands log α(x,y)q(y|x) to achieve a manageable term.

Abstract: We provide a general methodology to construct control variates for any discrete time random walk Metropolis and Metropolis-adjusted Langevin algorithm Markov chains that can achieve, in a post-processing manner and with a negligible additional computational cost, impressive variance reduction when compared to the standard MCMC ergodic averages. Our proposed estimators are based on an approximate solution of the Poisson equation for a multivariate Gaussian target densities of any dimension.
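To illustrate the Poisson equation idea in the simplest possible case, here is a toy sketch on a Gaussian AR(1) chain rather than on a Metropolis chain. This is my own illustration, not the construction of the seminar: the chain, the linear G, and the names `ar1_chain` and `cv_average` are assumptions. For the kernel x' = c + a·x + σξ and F(x) = x, the function G(x) = x/(1-a) solves G - PG = F - πF exactly, so subtracting the control variate G - PG removes all the Monte Carlo variability.

```python
import random


def ar1_chain(c, a, sigma, n, rng, x0=0.0):
    """Simulate x' = c + a * x + sigma * N(0, 1); stationary mean is c / (1 - a)."""
    x = x0
    chain = []
    for _ in range(n):
        x = c + a * x + sigma * rng.gauss(0.0, 1.0)
        chain.append(x)
    return chain


def plain_average(chain):
    """Standard ergodic average of F(x) = x."""
    return sum(chain) / len(chain)


def cv_average(chain, c, a, theta):
    """Ergodic average of F(x) - [G(x) - PG(x)] with G(x) = theta * x.

    For this kernel, PG(x) = E[G(X') | x] = theta * (c + a * x), and
    G - PG has zero expectation under the stationary distribution.
    """
    total = 0.0
    for x in chain:
        control = theta * x - theta * (c + a * x)
        total += x - control
    return total / len(chain)


rng = random.Random(1)
c, a, sigma = 0.5, 0.9, 1.0
chain = ar1_chain(c, a, sigma, n=5_000, rng=rng)

plain = plain_average(chain)
corrected = cv_average(chain, c, a, theta=1.0 / (1.0 - a))  # exact Poisson solution
print(f"plain={plain:.4f} corrected={corrected:.4f} truth={c / (1.0 - a):.4f}")
```

In realistic settings G is only an approximate solution, and the coefficient (here `theta`) has to be estimated, so the variance is reduced rather than eliminated.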

I wonder if there were a neural network version that would first build G from scratch and later optimise it towards solving the Poisson equation. As in this recent arXival I haven’t read (yet).

## your GAN is secretly an energy-based model

Posted in Books, Statistics, University life on January 5, 2021 by xi'an

As I was reading this NeurIPS 2020 paper by Che et al., and trying to make sense of it, I came across a citation to our paper Casella, Robert and Wells (2004) on a generalized accept-reject sampling scheme where the proposal changes at each simulation, which sounds surprising, if appreciated! But after checking, this paper also appears as the first reference on the Wikipedia page for rejection sampling, which makes me wonder if many actually read it. (On the side, we mostly wrote this paper on a drive from Baltimore to Ithaca, after JSM 1999.)

“We provide more evidence that it is beneficial to sample from the energy-based model defined both by the generator and the discriminator instead of from the generator only.”

The paper seems to propose a post-processing of the generator output by a GAN, generating from the mixture of both generator and discriminator, via a (unscented) Langevin algorithm. The core idea is that, if p(.) is the true data generating process, g(.) the estimated generator and d(.) the discriminator, then

p(x) ≈ p⁰(x)∝g(x) exp(d(x))

(The approximation would be exact were the discriminator optimal.) The authors work with the latent z’s in the GAN, meaning that generating pseudo-data x from g means taking a deterministic transform of z, x=G(z). When considering the above p⁰, a generation from p⁰ can be seen as accept-reject with acceptance probability proportional to exp[d{G(z)}]. (On the side, Lemma 1 is the standard validation for accept-reject sampling schemes.)
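The accept-reject reading of p⁰ can be sketched on a toy example. Everything below is an illustrative assumption rather than the paper's setup (the paper relies on a Langevin algorithm, as noted above): the generator is the identity, the latent z is standard Normal, and the "discriminator" d(x) = -x²/2 is chosen because it is bounded above by 0, so exp[d{G(z)}] is a valid acceptance probability. With these choices p⁰(x) ∝ exp(-x²) is a Normal with variance 1/2.

```python
import math
import random


def generator(z):
    # Toy generator G (assumption): identity map, so g is the standard Normal.
    return z


def discriminator_logit(x):
    # Toy discriminator d (assumption), bounded above by D_MAX = 0.
    return -0.5 * x * x


D_MAX = 0.0


def sample_p0(n, rng):
    """Accept-reject from p0(x) ∝ g(x) exp(d(x)) using proposals x = G(z)."""
    accepted = []
    proposals = 0
    while len(accepted) < n:
        z = rng.gauss(0.0, 1.0)
        x = generator(z)
        proposals += 1
        # Acceptance probability exp(d(G(z)) - D_MAX), always in (0, 1].
        if rng.random() < math.exp(discriminator_logit(x) - D_MAX):
            accepted.append(x)
    return accepted, proposals


rng = random.Random(2)
draws, proposals = sample_p0(20_000, rng)
mean = sum(draws) / len(draws)
var = sum((x - mean) ** 2 for x in draws) / len(draws)
rate = len(draws) / proposals
print(f"variance={var:.3f} (p0 has 0.5), acceptance rate={rate:.3f}")
```

A bound on d is required for this to work, which is one reason a Langevin move on the latent z is more practical for real discriminators.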

Reading this paper made me realise how much the field had evolved since my previous GAN related read. With directions like Metropolis-Hastings GANs and Wasserstein GANs. (And I noticed a “broader impact” section past the conclusion section about possible misuses with societal consequences, which is a new requirement for NeurIPS publications.)

## AABI9 tidbits [& misbits]

Posted in Books, Mountains, pictures, Statistics, Travel, University life on December 10, 2019 by xi'an

Today’s Advances in Approximate Bayesian Inference symposium, organised by Thang Bui, Adji Bousso Dieng, Dawen Liang, Francisco Ruiz, and Cheng Zhang, took place in front of Vancouver Harbour (and the tantalising ski slope at the back) and saw more than 400 participants, drifting away from the earlier versions which had a stronger dose of ABC and far fewer participants. There were students’ talks in a fair proportion, as well (and a massive number of posters). As below, I took some notes during some of the talks with no pretense at exhaustivity, objectivity or accuracy. (This is a blog post, remember?!) Overall I found the day exciting (to the point I did not suffer at all from the usual naps consecutive to very short nights!) and engaging, with a lot of notions and methods I had never heard about. (Which shows how much I know nothing!)

The fourth talk was by Sergey Levine, Reinforcement Learning, Optimal Control, and Probabilistic Inference, back to Kullback-Leibler as the objective function, with linkage to optimal control (with distributions as actions?), plus again variational inference, producing an approximation in sequential settings. This sounded like a type of return of the MaxEnt prior, but the talk pace was so intense that I could not follow where the innovations stood.

The fifth talk was by Iuliia Molchanova, on Structured Semi-Implicit Variational Inference, from BAyesgroup.ru (I did not know of a Bayesian group in Russia!, as I was under the impression that Bayesian statistics were under-represented there, but apparently the situation is quite different in machine learning.) The talk brought an interesting concept of semi-implicit variational inference, exploiting some form of latent variables as far as I can understand, using mixtures of Gaussians.

The sixth talk was by Rianne van den Berg, Normalizing Flows for Discrete Data, and amounted to covering three papers also discussed in NeurIPS 2019 proper, which I found somewhat of a suboptimal approach to an invited talk, as it turned into a teaser for following talks or posters. But the teasers it contained were quite interesting as they covered normalising flows as integer valued controlled changes of variables using neural networks, about which I had just become aware during the poster session, in connection with papers of Papamakarios et al., which I need to soon read.

The seventh talk was by Matthew Hoffman: Langevin Dynamics as Nonparametric Variational Inference, and sounded most interesting, both from title and later reports, as it was bridging Langevin with VI, but I alas missed it for being “stuck” in a tea-house ceremony that lasted much longer than expected. (More later on that side issue!)

After the second poster session (with a highly original proposal by Radford Neal towards creating non-reversibility at the level of the uniform generator rather than later on), I thus only attended Emily Fox’s Stochastic Gradient MCMC for Sequential Data Sources, which superbly reviewed (in connection with a sequence of papers, including a recent one by Aicher et al.) error rate and convergence properties of stochastic gradient estimator methods there. Another paper I need to soon read!

The one before last speaker, Roman Novak, presented a Python library about infinite neural networks, with which I had no direct connection (and I always have difficulties with talks about libraries, even without a four hour sleep night) and the symposium concluded with a mild round-table. Mild because, despite Frank Wood’s best efforts (and healthy skepticism about round tables!) to initiate controversies, we could not see much to bite from each other’s viewpoint.

## noise contrastive estimation

Posted in Statistics on July 15, 2019 by xi'an

As I was attending Lionel Riou-Durand’s PhD thesis defence in ENSAE-CREST last week, I had a look at his papers (!). The 2018 noise contrastive paper is written with Nicolas Chopin (both authors share the CREST affiliation with me). Which compares Charlie Geyer’s 1994 bypassing the intractable normalising constant problem by virtue of an artificial logit model with additional simulated data from another distribution ψ.

“Geyer (1994) established the asymptotic properties of the MC-MLE estimates under general conditions; in particular that the x’s are realisations of an ergodic process. This is remarkable, given that most of the theory on M-estimation (i.e.estimation obtained by maximising functions) is restricted to iid data.”

Michael Gutmann and Aapo Hyvärinen also use additional simulated data in another likelihood of a logistic classifier, called noise contrastive estimation. Both methods replace the unknown ratio of normalising constants with an unbiased estimate based on the additional simulated data. The major and impressive result in this paper [now published in the Electronic Journal of Statistics] is that the noise contrastive estimation approach always enjoys a smaller variance than Geyer’s solution, at an equivalent computational cost when the actual data observations are iid. And the artificial data simulations ergodic. The difference between both estimators is however negligible against the Monte Carlo error (Theorem 2).
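For readers unfamiliar with the logistic-classifier trick, here is a minimal sketch of noise contrastive estimation on a toy problem. It only estimates the log normalising constant, not model parameters, and all specifics are my assumptions: the unnormalised model exp(-x²/2), the Normal noise ψ with standard deviation 2, and the names `nce_log_constant` and `NOISE_SD`. The estimate c maximises the logistic log-likelihood that separates data from noise, found here by bisection on its (decreasing) derivative.

```python
import math
import random


def sigmoid(t):
    # Numerically stable logistic function.
    if t >= 0.0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def log_unnormalised(x):
    # Toy unnormalised model (assumption): exp(-x^2 / 2), true constant sqrt(2 pi).
    return -0.5 * x * x


NOISE_SD = 2.0


def log_noise_density(y):
    # Noise distribution psi (assumption): Normal with standard deviation NOISE_SD.
    return -0.5 * (y / NOISE_SD) ** 2 - math.log(NOISE_SD * math.sqrt(2.0 * math.pi))


def nce_log_constant(data, noise, lo=-10.0, hi=10.0, iters=60):
    """Return c maximising the NCE logistic likelihood, with c estimating -log Z."""
    offset = math.log(len(noise) / len(data))
    t_data = [log_unnormalised(x) - log_noise_density(x) - offset for x in data]
    t_noise = [log_unnormalised(y) - log_noise_density(y) - offset for y in noise]

    def score(c):
        # Derivative in c of sum log h(x_i) + sum log(1 - h(y_j)), h = sigmoid(t + c).
        from_data = sum(1.0 - sigmoid(t + c) for t in t_data)
        from_noise = sum(sigmoid(t + c) for t in t_noise)
        return from_data - from_noise

    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if score(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


rng = random.Random(3)
data = [rng.gauss(0.0, 1.0) for _ in range(5_000)]
noise = [rng.gauss(0.0, NOISE_SD) for _ in range(5_000)]

c_hat = nce_log_constant(data, noise)
c_true = -0.5 * math.log(2.0 * math.pi)
print(f"estimated -log Z = {c_hat:.3f}, exact value = {c_true:.3f}")
```

In the actual method the same logistic likelihood is maximised jointly over the model parameter and the constant, and the quality of the estimate depends on how well ψ overlaps with the data distribution.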

This may be a rather naïve question, but I wonder at the choice of the alternative distribution ψ. With a vague notion that it could be optimised in a GANs perspective. A side result of interest in the paper is to provide a minimal (re)parameterisation of the truncated multivariate Gaussian distribution, if only as an exercise for future exams. Truncated multivariate Gaussian for which the normalising constant is of course unknown.