Archive for Langevin MCMC algorithm

your GAN is secretly an energy-based model

Posted in Books, Statistics, University life on January 5, 2021 by xi'an

As I was reading this NeurIPS 2020 paper by Che et al., and trying to make sense of it, I came across a citation to our paper Casella, Robert and Wells (2004) on a generalized accept-reject sampling scheme where the proposal changes at each simulation, which sounds surprising, if appreciated! But, after checking, this paper also appears as the first reference on the Wikipedia page for rejection sampling, which makes me wonder if many actually read it. (On the side, we mostly wrote this paper on a drive from Baltimore to Ithaca, after JSM 1999.)

“We provide more evidence that it is beneficial to sample from the energy-based model defined both by the generator and the discriminator instead of from the generator only.”

The paper seems to propose a post-processing of the generator output by a GAN, generating from the mixture of both generator and discriminator, via a (unscented) Langevin algorithm. The core idea is that, if p(.) is the true data generating process, g(.) the estimated generator and d(.) the discriminator, then

p(x) ≈ p⁰(x)∝g(x) exp(d(x))

(The approximation would be exact were the discriminator optimal.) The authors work with the latent z’s, in the GAN sense, meaning that generating pseudo-data x from g means taking a deterministic transform of z, x=G(z). When considering the above p⁰, a generation from p⁰ can be seen as accept-reject with acceptance probability proportional to exp[d{G(z)}]. (On the side, Lemma 1 is the standard validation for accept-reject sampling schemes.)
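As a toy illustration of this accept-reject reading (a minimal sketch, not the authors' algorithm, which relies on Langevin moves in the latent space), the generator `G`, the discriminator logit `d` and its upper bound `d_max` below are all hypothetical choices, picked so that the resulting p⁰ is a known Gaussian:

```python
import numpy as np

rng = np.random.default_rng(0)

def G(z):
    # Hypothetical generator: a deterministic transform of the latent z.
    return 2.0 * z + 1.0

def d(x):
    # Hypothetical discriminator logit, assumed bounded above by d_max.
    return -(x - 1.0) ** 2 / 8.0

d_max = 0.0  # assumed upper bound on d, needed for a valid acceptance probability

z = rng.standard_normal(20_000)      # latent draws, z ~ N(0, 1)
x = G(z)                             # generator output, x ~ g
accept = rng.uniform(size=z.size) < np.exp(d(x) - d_max)
x_acc = x[accept]                    # accepted draws, distributed from p0 ∝ g(x) exp(d(x))

print(accept.mean(), x_acc.mean(), x_acc.var())
```

With these choices d{G(z)} = -z²/2, so the accepted z’s are N(0, 1/2) and the accepted x’s should have mean close to 1 and variance close to 2, with about 71% of the proposals accepted. An unbounded d would not allow for such a scheme, which presumably is one reason for resorting to Langevin moves instead.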

Reading this paper made me realise how much the field had evolved since my previous GAN-related read, with directions like Metropolis-Hastings GANs and Wasserstein GANs. (And I noticed a “broader impact” section past the conclusion section about possible misuses with societal consequences, which is a new requirement for NeurIPS publications.)

AABI9 tidbits [& misbits]

Posted in Books, Mountains, pictures, Statistics, Travel, University life on December 10, 2019 by xi'an

Today’s Advances in Approximate Bayesian Inference symposium, organised by Thang Bui, Adji Bousso Dieng, Dawen Liang, Francisco Ruiz, and Cheng Zhang, took place in front of Vancouver Harbour (and the tantalising ski slope at the back) and saw more than 400 participants, drifting away from the earlier versions which had a stronger dose of ABC and far fewer participants. There were students’ talks in a fair proportion, as well (and a massive number of posters). As seen below, I took some notes during some of the talks with no pretense at exhaustivity, objectivity or accuracy. (This is a blog post, remember?!) Overall I found the day exciting (to the point I did not suffer at all from the usual naps consecutive to very short nights!) and engaging, with a lot of notions and methods I had never heard about. (Which shows how much I know nothing!)

The first talk was by Michalis Titsias, Gradient-based Adaptive Markov Chain Monte Carlo (jointly with Petros Dellaportas), involving as its objective function the product of the variance of the move and of the acceptance probability, with a proposed adaptive version merging gradients, variational Bayes, neurons, and two levels of calibration parameters. The method advocates using this construction in a burnin phase rather than continuously, hence does not require advanced Markov tools for convergence assessment. (I found myself less excited by adaptation than earlier, maybe because it seems like switching one convergence problem for another, with additional design choices to be made.)

The second talk was by Jakub Swiatkowski, The k-tied Normal Distribution: A Compact Parameterization of Gaussian Mean Field Posteriors in Bayesian Neural Networks, involving mean field approximation in variational inference (loads of VI at this symposium!), meaning de facto searching for a MAP estimator, and reminding me of older factor analysis and other analyse de données projection methods, except it also involved neural networks (what else at NeurIPS?!).

The third talk was by Michael Gutmann, Robust Optimisation Monte Carlo (OMC), for implicit data generating models (Diggle & Gratton, 1984), an ABC talk at last!, using a formalisation through the functional representation of the generative process and involving derivatives of the summary statistic against the parameter, with the (Bayesian) random nature of the parameter sample only induced by the (frequentist) randomness in the generative transform, since a new parameter “realisation” is obtained there as the one providing minimal distance between data and pseudo-data, with no uncertainty or impact of the prior. The Jacobian of this summary transform (and once again a neural network is used to construct the summary) appears in the importance weight, leading to OMC being unstable, beyond failing to reproduce the variability expressed by the regular posterior or even the ABC posterior. It took me a while to wonder `where is Wally?!’ (the prior) as it only appears in the importance weight.

The fourth talk was by Sergey Levine, Reinforcement Learning, Optimal Control, and Probabilistic Inference, back to Kullback-Leibler as the objective function, with linkage to optimal control (with distributions as actions?), plus again variational inference, producing an approximation in sequential settings. This sounded like a type of return of the MaxEnt prior, but the talk pace was so intense that I could not follow where the innovations stood.

The fifth talk was by Iuliia Molchanova, on Structured Semi-Implicit Variational Inference, from BAyesgroup.ru (I did not know of a Bayesian group in Russia!, as I was under the impression that Bayesian statistics were under-represented there, but apparently the situation is quite different in machine learning.) The talk brought an interesting concept of semi-implicit variational inference, exploiting some form of latent variables as far as I can understand, using mixtures of Gaussians.

The sixth talk was by Rianne van den Berg, Normalizing Flows for Discrete Data, and amounted to covering three papers also discussed in NeurIPS 2019 proper, which I found somewhat of a suboptimal approach to an invited talk, as it turned into a teaser for following talks or posters. But the teasers it contained were quite interesting, as they covered normalising flows as integer-valued controlled changes of variables using neural networks, about which I had just become aware during the poster session, in connection with papers of Papamakarios et al., which I need to soon read.

The seventh talk was by Matthew Hoffman: Langevin Dynamics as Nonparametric Variational Inference, and sounded most interesting, both from title and later reports, as it was bridging Langevin with VI, but I alas missed it for being “stuck” in a tea-house ceremony that lasted much longer than expected. (More later on that side issue!)

After the second poster session (with a highly original proposal by Radford Neal towards creating non-reversibility at the level of the uniform generator rather than later on), I thus only attended Emily Fox’s Stochastic Gradient MCMC for Sequential Data Sources, which superbly reviewed (in connection with a sequence of papers, including a recent one by Aicher et al.) error rate and convergence properties of stochastic gradient estimator methods there. Another paper I need to soon read!

The one before last speaker, Roman Novak, exposed a Python library about infinite neural networks, with which I had no direct connection (and I always have difficulties with talks about libraries, even without a four-hour-sleep night), and the symposium concluded with a mild round-table. Mild because, despite Frank Wood’s best efforts (and healthy skepticism about round tables!) to initiate controversies, we could not see much to bite from each other’s viewpoint.

noise contrastive estimation

Posted in Statistics on July 15, 2019 by xi'an

As I was attending Lionel Riou-Durand’s PhD thesis defence in ENSAE-CREST last week, I had a look at his papers (!). The 2018 noise contrastive paper is written with Nicolas Chopin (both authors share the CREST affiliation with me). It compares Charlie Geyer’s 1994 approach to bypassing the intractable normalising constant problem, by virtue of an artificial logit model with additional simulated data from another distribution ψ, with the noise contrastive alternative.

“Geyer (1994) established the asymptotic properties of the MC-MLE estimates under general conditions; in particular that the x’s are realisations of an ergodic process. This is remarkable, given that most of the theory on M-estimation (i.e. estimation obtained by maximising functions) is restricted to iid data.”

Michael Gutmann and Aapo Hyvärinen also use additional simulated data in another likelihood, that of a logistic classifier, an approach called noise contrastive estimation. Both methods replace the unknown ratio of normalising constants with an unbiased estimate based on the additional simulated data. The major and impressive result in this paper [now published in the Electronic Journal of Statistics] is that the noise contrastive estimation approach always enjoys a smaller variance than Geyer’s solution, at an equivalent computational cost, when the actual data observations are iid and the artificial data simulations ergodic. The difference between both estimators is however negligible against the Monte Carlo error (Theorem 2).
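To make the logistic classifier construction concrete, here is a minimal sketch of noise contrastive estimation on a toy problem of my own choosing (not one from the paper): the unnormalised model is exp{-(x-μ)²/2+c}, with c standing for the unknown log normalising constant, ψ is taken to be a N(0,2²) distribution, and the logistic likelihood is maximised by a plain gradient ascent with an arbitrary step size.

```python
import numpy as np

rng = np.random.default_rng(1)
n = m = 5_000
x = rng.normal(1.0, 1.0, size=n)   # actual iid data, true mu = 1
y = rng.normal(0.0, 2.0, size=m)   # artificial data from psi = N(0, 2^2) (assumed choice)

def log_psi(u):
    # Normalised log-density of the noise distribution psi
    return -u ** 2 / 8.0 - 0.5 * np.log(2.0 * np.pi * 4.0)

def logit(u, mu, c):
    # Log-odds that u is an actual observation rather than a simulated one
    return (-(u - mu) ** 2 / 2.0 + c) - log_psi(u) + np.log(n / m)

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

mu, c = 0.0, 0.0   # c plays the role of minus the log normalising constant
step = 0.5         # arbitrary step size for the gradient ascent
for _ in range(3_000):
    rx = 1.0 - sigmoid(logit(x, mu, c))   # residuals on actual data (label 1)
    ry = -sigmoid(logit(y, mu, c))        # residuals on simulated data (label 0)
    grad_mu = (np.sum(rx * (x - mu)) + np.sum(ry * (y - mu))) / (n + m)
    grad_c = (np.sum(rx) + np.sum(ry)) / (n + m)
    mu += step * grad_mu
    c += step * grad_c

print(mu, c, -0.5 * np.log(2.0 * np.pi))
```

The estimate of μ should end up close to 1 and the estimate of c close to -log√(2π)≈-0.919, the normalising constant being recovered as a by-product of the classification.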

This may be a rather naïve question, but I wonder at the choice of the alternative distribution ψ. With a vague notion that it could be optimised in a GANs perspective. A side result of interest in the paper is to provide a minimal (re)parameterisation of the truncated multivariate Gaussian distribution, if only as an exercise for future exams. Truncated multivariate Gaussian for which the normalising constant is of course unknown.

Siem Reap conference

Posted in Kids, pictures, Travel, University life on March 8, 2019 by xi'an

As I returned from the conference in Siem Reap, on a flight avoiding India and Pakistan and their [brittle and bristling!] boundary on the way back, instead flying far far north, near Arkhangelsk (but with nothing to show for it, as the flight back was fully in the dark), I reflected how enjoyable this conference had been, within a highly friendly atmosphere, meeting again with many old friends (some met prior to the creation of CREST) and new ones, a pleasure not hindered by the fabulous location near Angkor of course. (The above picture is the “last hour” group picture, missing a major part of the participants, already gone!)

Among the many talks, Stéphane Shao gave a great presentation on a paper [to appear in JASA] jointly written with Pierre Jacob, Jie Ding, and Vahid Tarokh on the Hyvärinen score and its use for Bayesian model choice, with a highly intuitive representation of this divergence function (which I first met in Padua when Phil Dawid gave a talk on this approach to Bayesian model comparison). It relies on a divergence function defined as the squared error difference between the gradients of the true log-score and of the model log-score functions, providing an alternative to the Bayes factor that can be shown to be consistent, even for some non-iid data, with some gains in the experiments represented by the above graph.
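As a small numerical illustration of the score itself (a sketch under my own assumptions, not the model choice procedure of the paper), take the convention H(y,q) = 2 Δ log q(y) + ‖∇ log q(y)‖², whose expectation under the true density differs from the above squared difference of gradients only by a term free of q. For a univariate Gaussian model it is available in closed form; the function name and the three candidate models below are hypothetical.

```python
import numpy as np

def hyvarinen_score_gaussian(y, mu, sigma2):
    # H(y, q) = 2 * (log q)''(y) + ((log q)'(y))^2 for q = N(mu, sigma2)
    grad = -(y - mu) / sigma2
    laplacian = -1.0 / sigma2
    return 2.0 * laplacian + grad ** 2

rng = np.random.default_rng(2)
y = rng.standard_normal(10_000)   # data generated from N(0, 1)

score_true = hyvarinen_score_gaussian(y, 0.0, 1.0).mean()      # correct model
score_wide = hyvarinen_score_gaussian(y, 0.0, 4.0).mean()      # overdispersed model
score_shifted = hyvarinen_score_gaussian(y, 1.0, 1.0).mean()   # misplaced model

print(score_true, score_wide, score_shifted)
```

The averaged score should be smallest for the correct model (around -1, against roughly -0.44 and 0 for the other two). Since only derivatives of log q enter the score, it can be computed without the normalising constant of q.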

Arnak Dalalyan (CREST) presented a paper written with Lionel Riou-Durand on the convergence of non-Metropolised Langevin Monte Carlo methods, with a new discretization which leads to a substantial improvement of the upper bound on the sampling error rate measured in Wasserstein distance, moving from p/ε to √p/√ε in the required number of steps when p is the dimension and ε the target precision, for smooth and strongly log-concave targets.
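For readers unfamiliar with the non-Metropolised version, here is a minimal sketch of the plain Euler discretisation of the Langevin diffusion (hence not the new discretization of the paper), on a standard Gaussian target for which the bias induced by the step size h is known in closed form. The dimension, number of chains, number of steps and step size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(3)
p, n_chains, n_steps, h = 2, 5_000, 200, 0.1   # arbitrary illustration settings

def grad_log_target(x):
    # Standard Gaussian target in dimension p: smooth and strongly log-concave
    return -x

x = np.zeros((n_chains, p))   # all chains start at the mode
for _ in range(n_steps):
    # Unadjusted Langevin step: gradient drift plus Gaussian noise, no accept-reject correction
    x = x + h * grad_log_target(x) + np.sqrt(2.0 * h) * rng.standard_normal(x.shape)

ula_var = x.var()                         # empirical variance across chains and coordinates
stationary_var = 1.0 / (1.0 - h / 2.0)    # exact stationary variance of this recursion

print(ula_var, stationary_var)
```

Without a Metropolis correction, the chain targets a N(0, 1/(1-h/2)) distribution per coordinate rather than the N(0,1) target, which is the kind of discretisation error the bounds in Wasserstein distance have to control.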

This post gives me the opportunity to advertise for the NGO Sala Baï hospitality school, which the whole conference visited for lunch and which trains youths from underprivileged backgrounds towards jobs in hospitality, supported by donations, companies (like Krama Krama), or visits to the Sala Baï restaurant and/or hotel while in Siem Reap.

 

scalable Metropolis-Hastings

Posted in Books, Statistics, Travel on February 12, 2019 by xi'an

Among the flurry of arXived papers of last week (414!), including a fair chunk of papers submitted to ICML 2019, I spotted one entry by Cornish et al. on scalable Metropolis-Hastings, which Arnaud Doucet had mentioned to me yesterday when in Oxford. The paper builds on the delayed acceptance paper we wrote with Marco Banterlé, Clara Grazian and Anthony Lee, itself relying on a factorisation decomposition of the likelihood, combined with control variate accelerating techniques. The factorisation of both the target and the proposal allows for a (less efficient) Metropolis-Hastings acceptance ratio that is the product

\prod_{i=1}^m \alpha_i(\theta,\theta')

of individual Metropolis-Hastings acceptance ratios, but which allows for quicker rejection if one of the probabilities in the product is small, because the corresponding Bernoulli draw is zero with high probability. One advance made in Michel et al. (2017) [which I doubly missed] is that subsampling is achievable by thinning (as in PDMPs, where these authors have been quite active) through an algorithm of Shanthikumar (1985) [described in Devroye’s bible]. Provided each Metropolis-Hastings probability can be lower bounded:

\alpha_i(\theta,\theta') \ge \exp\{-\psi_i \phi(\theta,\theta')\}

by a term where the transition φ does not depend on the index i in the product. The computing cost of the thinning process thus depends on the efficiency of the subsampling, namely whether or not the (Poisson) number of terms is much smaller than m, the number of terms in the product. A neat trick in the current paper that extends the Fukui-Todo procedure is to switch to the original Metropolis-Hastings when the overall lower bound is too small, recovering the geometric ergodicity of this original if it holds (Theorem 2.1). Another neat remark is that when using the naïve factorisation as the product of the n individual likelihoods, the resulting algorithm is sort of doomed as n grows, even with an optimal scaling of the proposals. To achieve scalability, the authors introduce a Taylor (i.e., Gaussian) approximation to each local target in the product and start the acceptance decomposition by using the resulting overall Gaussian approximation. Meaning that the remaining product is now made of ratios of targets over their local Taylor approximations, hence most likely close to one. And potentially lower-bounded by the remainder term in the Taylor expansion. Leading to the conclusion that, when everything goes well, meaning that the Taylor expansions can be conducted and the bounds derived for the appropriate expansion, the order of the Poisson scale is O(1/√n)..! The proposal for the Metropolis-Hastings move is actually tuned to the Gaussian approximation, appearing as a variant of the Langevin move or more exactly a discretization of a Hamiltonian move. Obviously, I cannot judge the complexity of implementing this new scheme from just reading the paper, but this development on the split target is definitely an exciting prospect for handling huge datasets and their friends!
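The thinning step can be checked numerically on its own. The following is a minimal sketch under my own assumptions (it is not the implementation of Cornish et al.): the function `thinned_accept`, the constant bounds ψ_i and the randomly generated values of log α_i are hypothetical, and the log α_i are precomputed here only to compare with the exact product, while in practice they would be evaluated solely for the indices that get drawn.

```python
import numpy as np

def thinned_accept(log_alpha, psi, phi, rng):
    # Return True with probability prod_i alpha_i, assuming -log alpha_i <= psi_i * phi.
    # Only a Poisson(phi * sum(psi)) number of factors is inspected.
    n_candidates = rng.poisson(phi * psi.sum())
    if n_candidates == 0:
        return True
    idx = rng.choice(psi.size, size=n_candidates, p=psi / psi.sum())
    for i in idx:
        # Keep the candidate event with probability (true rate) / (bounding rate)
        if rng.uniform() < -log_alpha[i] / (psi[i] * phi):
            return False   # one surviving event means rejection
    return True

rng = np.random.default_rng(4)
m, phi = 1_000, 0.5
psi = np.full(m, 0.002)                     # hypothetical bounds, psi_i * phi = 0.001
log_alpha = -rng.uniform(0.0, psi * phi)    # toy values satisfying the lower bound

exact = np.exp(log_alpha.sum())             # prod_i alpha_i
freq = np.mean([thinned_accept(log_alpha, psi, phi, rng) for _ in range(20_000)])

print(exact, freq)
```

The empirical acceptance frequency should agree with the exact product, while each call inspects on average φ Σψ_i = 1 factor out of m = 1000, which is where the gain comes from when the bounds are tight.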