Archive for Bayesian GANs

ABC by classification

Posted in pictures, Statistics, Travel, University life on December 21, 2021 by xi'an

As a(nother) coincidence, we had a reading group discussion at Paris Dauphine yesterday, a few days after Veronika Rockova presented the paper in person in Oaxaca. The idea in ABC by classification, which she co-authored with Yuexi Wang and Tetsuya Kaji, is to use the empirical Kullback-Leibler divergence as a substitute for the intractable likelihood at the parameter value θ, in the generalised Bayes setting of Bissiri et al. Since this quantity is not available, it is estimated as well, by a classification method that somehow relates to Geyer’s 1994 inverse logistic proposal, using the (ABC) pseudo-data generated from the model associated with θ. The convergence of the algorithm obviously depends on the choice of the discriminator used in practice. The paper also makes a connection with GANs as a potential alternative for the generalised Bayes representation. It mostly focuses on the frequentist validation of the ABC posterior, in the sense of exhibiting a posterior concentration rate in n, the sample size, while requiring performances of the discriminators that may prove hard to check in practice. Expanding our 2018 result to this setting, with the tolerance decreasing more slowly than the Kullback-Leibler estimation error.
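A minimal sketch of the classification step, under assumptions that are not taken from the paper: a logistic regression with hand-picked quadratic features stands in for the discriminator, it is fitted by a few Newton steps, and the function names `fit_logistic` and `kl_by_classification` are hypothetical. The fitted log-odds are averaged over the observed sample to approximate the Kullback-Leibler divergence between the data density and the model at θ.

```python
import numpy as np

def fit_logistic(features, labels, n_iter=25, ridge=1e-6):
    """Fit a logistic regression by Newton's method and return its coefficients."""
    beta = np.zeros(features.shape[1])
    for _ in range(n_iter):
        linear = np.clip(features @ beta, -30.0, 30.0)  # guard against overflow
        prob = 1.0 / (1.0 + np.exp(-linear))
        grad = features.T @ (labels - prob) - ridge * beta
        weights = prob * (1.0 - prob)
        hess = (features * weights[:, None]).T @ features + ridge * np.eye(len(beta))
        beta = beta + np.linalg.solve(hess, grad)
    return beta

def kl_by_classification(observed, simulated):
    """Estimate KL(data || model at theta) from a discriminator's log-odds.

    `observed` is the data sample (label 1) and `simulated` is the pseudo-data
    generated at the current theta (label 0). The quadratic design is an
    assumption made for this one-dimensional toy setting.
    """
    def design(x):
        x = np.asarray(x, dtype=float)
        return np.column_stack([np.ones_like(x), x, x ** 2])

    features = np.vstack([design(observed), design(simulated)])
    labels = np.concatenate([np.ones(len(observed)), np.zeros(len(simulated))])
    beta = fit_logistic(features, labels)
    # The log-odds estimate log p_data(x) / p_theta(x), up to the log ratio
    # of the two sample sizes, which is removed here.
    log_odds = design(observed) @ beta - np.log(len(observed) / len(simulated))
    return float(np.mean(log_odds))
```

For a Normal toy model with unit variance, the returned value should be near zero when the pseudo-data are simulated at the true mean and near θ²/2 when the mean is shifted by θ. Whether one discriminator can be reused across values of θ is left open by this sketch, which refits from scratch at every call.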

Besides the shared appreciation that working with the Kullback-Leibler divergence was a nice and under-appreciated direction, one point that came out of our discussion is that using the (estimated) Kullback-Leibler divergence as a form of distance (attached with a tolerance) is less prone to variability (or more robust) than directly using the estimate (without tolerance) as a substitute for the intractable likelihood, if we interpreted the discrepancy in Figure 3 properly. Another item was about the discriminator function itself: while a machine learning methodology such as neural networks could be used, albeit with unclear theoretical guarantees, it was unclear to us whether or not a new discriminator needed to be constructed for each value of the parameter θ. Even when the simulations are run by a deterministic transform.

21w5107 [½day 3]

Posted in pictures, Statistics, Travel, University life on December 2, 2021 by xi'an

Day [or half-day] three started without firecrackers and with David Rossell (formerly Warwick) presenting an empirical Bayes approach to generalised linear model choice with a high degree of confounding, using approximate Laplace approximations. With considerable improvements in the experimental RMSE. Making me feel sorry there was no apparent fully (and objective?) Bayesian alternative! (Two more papers on my reading list that I should have read way earlier!) Then Veronika Rockova discussed her work on approximate Metropolis-Hastings by classification. (With only a slight overlap with her One World ABC seminar.) Making me once more think of Geyer’s n°564 technical report, namely the estimation of a marginal likelihood by a logistic discrimination representation. Her ABC resolution replaces the tolerance step by an exponential of minus the estimated Kullback-Leibler divergence between the data density and the density associated with the current value of the parameter. (I wonder if there is a residual multiplicative constant there… Presumably not. Great idea!) The classification step needs to be run at every iteration, which could be sped up by subsampling.

On the always fascinating theme of loss-based posteriors, à la Bissiri et al., Jack Jewson (formerly Warwick) presented his work on generalised Bayesian and improper models (from Birmingham!). Using data to decide between model and loss, which sounds highly unorthodox! The first difficulty is that losses are unscaled. Or even not integrable after an exponential transform. Hence the notion of improper models. As in the case of the robust Tukey’s loss, which is bounded by an arbitrary κ. Immediately I wondered if the fact that the pseudo-likelihood does not integrate is important beyond the (obvious) absence of a normalising constant. And the fact that this is not a generative model. And the answer came a few slides later with the use of the Hyvärinen score. Rather than the likelihood score. Which can itself be turned into an H-posterior, very cool indeed! Although I wonder at the feasibility of finding an [objective] prior on κ.
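For reference, a sketch of why the Hyvärinen score sidesteps the missing normalising constant, written in one common convention (the notation H, f_θ and the H-posterior display below are assumptions and may differ from the slides in scaling). Only derivatives of log f enter the score, so multiplying f by a constant, or having no finite constant at all, leaves it unchanged.

```latex
% Hyvarinen score of a (possibly improper) density f at an observation y in R^d
H(y; f) = 2\,\Delta_y \log f(y) + \bigl\| \nabla_y \log f(y) \bigr\|^2

% Invariance to scaling: for any constant c > 0,
H(y; c f) = H(y; f)
\quad\text{since}\quad
\nabla_y \log\{c f(y)\} = \nabla_y \log f(y)

% A loss-based (H-)posterior built from the score, for data y_1, ..., y_n
\pi_H(\theta \mid y_{1:n}) \propto \pi(\theta)\,
\exp\Bigl\{ -\sum_{i=1}^{n} H(y_i; f_\theta) \Bigr\}
```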

Rajesh Ranganath completed the morning session with a talk on [the difficulty of] connecting Bayesian models and complex prediction models. Using instead a game theoretic approach with Brier scores under censoring. While there was a connection with Veronika’s use of a discriminator as a likelihood approximation, I had trouble catching the overall message…

GANs as density estimators

Posted in Books, Statistics on October 15, 2021 by xi'an

I recently read an arXival entitled Conditional Sampling With Monotone GAN by Kovachki et al., who construct a mapping T that transforms or pushes forward a reference measure þ(dθ) like a multivariate Normal distribution to a target conditional distribution ð(dθ|x). Which makes the proposal a type of normalising flow, except it does not require a Jacobian derivation… The mapping T is monotone and block triangular in order to be invertible. It is learned from data by minimising a functional divergence between Tþ(dθ) and ð(dθ|x), for instance least-squares GAN or Wasserstein GAN penalties, and representing T as a neural network. Where monotonicity is imposed by a Lagrangian. The authors “note that global minimizers of [their GAN criterion] can also be used for conditional density estimation” but I fail to understand the distinction in that once T is constructed, the estimated conditional density is automatically available. However my main source of puzzlement is at the worth of this construction, since it does not provide an exact generative process for the conditional distribution, while requiring many generations from the joint distribution. Rather than a comparison with MCMC, which is not applicable in intractable generative models, a comparison with less expensive ABC solutions would have been appropriate, I think. And the paper is missing any quantification on the quality or asymptotics of the density estimate provided by this involved approximation, like most of the recent literature on normalising flows and friends. (A point acknowledged by the authors in the supplementary material section.)

“In this regard, the MGANs approach introduced in the article belongs to the category of sampling techniques such as MCMC, whose goal is to generate independent samples from the law of y|x, as opposed to assuming some structural form of the probability measure directly.”

I am unsure I understand the above remark as MCMC methods are intrinsically linked with the exact probability distribution, exploiting either some conditional representations as in Gibbs or at the very least the ability to compute the joint density…


NCE, VAEs, GANs & even ABC…

Posted in Statistics on May 14, 2021 by xi'an

As I was preparing my (new) lectures for a PhD short course “at” Warwick (meaning on Teams!), I read a few surveys and other papers on all these acronyms. They included the massive Gutmann and Hyvärinen 2012 NCE JMLR paper, Goodfellow’s NIPS 2016 tutorial on GANs, and Kingma and Welling’s 2019 introduction to VAEs. Which I found a wee bit on the light side, maybe missing the fundamentals of the notion… As well as the pretty helpful 2019 survey on normalising flows by Papamakarios et al., although missing on the (statistical) density estimation side. And also a nice (2017) survey of GANs by Shakir Mohamed and Balaji Lakshminarayanan with a somewhat statistical spirit, even though convergence issues are again not covered. But misspecification is there. And the many connections between ABC and GANs, if definitely missing on the uncertainty aspects. While Deep Learning by Goodfellow, Bengio and Courville addresses both the normalising constant (or partition function) and GANs, it was somehow not deep enough (!) to use for the course, offering only a few pages on NCE, VAEs and GANs. (And also missing on the statistical references addressing the issue, incl. [or excl.] Geyer, 1994.) Overall, the infinite variations offered on GANs leave me uncertain about their statistical relevance, as it is unclear how good the regularisation therein is for handling overfitting and consistent estimation. (And if I spot another decomposition of the Kullback-Leibler divergence, I may start crying…)

mean field Langevin system & neural networks

Posted in Statistics on February 4, 2020 by xi'an

A colleague of mine in Paris Dauphine, Zhenjie Ren, recently gave a talk on recent papers of his connecting neural nets and Langevin. Estimating the parameters of the NNs by mean-field Langevin dynamics. Following from an earlier paper on the topic by Mei, Montanari & Nguyen in 2018. Here are some notes I took during the seminar, not necessarily coherent as I was a bit under the weather that day. And had no previous exposure to most notions.

Fitting a one-layer network is turned into a minimisation programme over a measure space (when using loads of data). A reformulation that makes the problem convex. Adding a regularisation by the entropy and introducing derivatives of a functional against the measure. With a necessary and sufficient condition for the solution to be unique when the functional is convex. This reformulation leads to a Fokker-Planck equation, itself related to a Langevin diffusion. Except there is a measure in the Langevin equation, whose stationary version is the solution of the original regularised minimisation programme.
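A hedged reconstruction of the objects in that paragraph, in simplified notation that is assumed here rather than taken from the talk: F is the (convex) fitting functional over measures m, σ is the regularisation level, and δF/δm is the derivative of F against the measure. The papers may carry extra terms, such as a confining potential, that are omitted in this sketch.

```latex
% Entropy-regularised minimisation over measures
\min_{m} \; F(m) + \frac{\sigma^2}{2} \int m(x) \log m(x)\, \mathrm{d}x

% First-order condition characterising the minimiser m^*
\frac{\delta F}{\delta m}(m^*, x) + \frac{\sigma^2}{2} \log m^*(x) = \text{constant}

% Mean-field Langevin diffusion: the law m_t of X_t enters the drift
\mathrm{d}X_t = -\nabla_x \frac{\delta F}{\delta m}(m_t, X_t)\, \mathrm{d}t
               + \sigma\, \mathrm{d}W_t, \qquad m_t = \mathrm{Law}(X_t)

% Associated Fokker-Planck equation, whose stationary solution
% satisfies the first-order condition above
\partial_t m_t = \nabla \cdot \Bigl( m_t\, \nabla_x \frac{\delta F}{\delta m}(m_t, \cdot) \Bigr)
               + \frac{\sigma^2}{2} \Delta m_t
```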

A second paper contains an extension to deep NN, re-expressed as a problem in a random environment. Or with a marginal constraint (one marginal distribution being constrained). With a partial derivative wrt the marginal measure. Turning into a Langevin diffusion with an extra random element. Using optimal control produces a new Hamiltonian. Eventually producing the mean-field Langevin system as backward propagation. Coefficients being computed by chain rule, equivalent to an Euler scheme for Langevin dynamics.
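To make the Euler scheme concrete, here is a toy sketch that is not the mean-field system of the papers: a cloud of particles is moved by the discretised diffusion dX = -drift(X) dt + σ dW. The function name `euler_langevin` and the quadratic potential are assumptions chosen for illustration; the drift receives the whole cloud, so a dependence on the empirical measure could be plugged in.

```python
import numpy as np

def euler_langevin(drift, particles, step, sigma, n_steps, rng):
    """Euler scheme for dX = -drift(X) dt + sigma dW on a cloud of particles.

    `drift` is called on the whole cloud, so it may depend on the empirical
    measure of the particles (the toy example below does not use this).
    """
    x = np.array(particles, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(x.shape)
        x = x - step * drift(x) + sigma * np.sqrt(step) * noise
    return x

rng = np.random.default_rng(1)
start = rng.uniform(-5.0, 5.0, size=5000)
# Toy drift: gradient of U(x) = x^2 / 2, with no interaction between particles.
# With sigma = sqrt(2) the continuous-time diffusion has N(0, 1) as stationary law.
final = euler_langevin(lambda x: x, start, step=0.05, sigma=np.sqrt(2.0),
                       n_steps=400, rng=rng)
```

The discretisation leaves a small bias: for this linear drift the particle variance settles slightly above one, closer to 1/(1 - step/2), which shrinks as the step decreases.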

This approach has consequences for GANs, with the discriminator as a one-layer NN and the generator minimised over two measures. The discriminator is the invariant measure of the mean-field Langevin dynamics. Mentioning Metropolis-Hastings GANs, which seem to require one full run of an MCMC algorithm at each iteration of the mean-field Langevin.
