Archive for Bayesian GANs

robust inference using posterior bootstrap

Posted in Books, Statistics, University life with tags , , , , , , , , , , , , , , , on February 18, 2022 by xi'an

The famous 1994 Read Paper by Michael Newton and Adrian Raftery was entitled Approximate Bayesian inference, where the boostrap aspect is in randomly (exponentially) weighting each observation in the iid sample through a power of the corresponding density, a proposal that happened at about the same time as Tony O’Hagan suggested the related fractional Bayes factor. (The paper may also be equally famous for suggesting the harmonic mean estimator of the evidence!, although it only appeared as an appendix to the paper.) What is unclear to me is the nature of the distribution g(θ) associated with the weighted bootstrap sample, conditional on the original sample, since the outcome is the result of a random Exponential sample and of an optimisation step. With no impact of the prior (which could have been used as a penalisation factor), corrected by Michael and Adrian via an importance step involving the estimation of g(·).

At the Algorithm Seminar today in Warwick, Emilie Pompe presented recent research, including some written jointly with Pierre Jacob, [which I have not yet read] that does exactly that inclusion of the log prior as penalisation factor, along with an extra weight different from one, as motivated by the possibility of a misspecification. Including a new approach to cut models. An alternative mentioned during the talk that reminds me of GANs is to generate a pseudo-sample from the prior predictive and add it to the original sample. (Some attendees commented on the dependence of the later version on the chosen parameterisation, which is an issue that had X’ed my mind as well.)

ABC by classification

Posted in pictures, Statistics, Travel, University life with tags , , , , , , , , , , on December 21, 2021 by xi'an

As a(nother) coincidence, yesterday, we had a reading group discussion at Paris Dauphine a few days after Veronika Rockova presented the paper in person in Oaxaca. The idea in ABC by classification that she co-authored with Yuexi Wang and Tetsuya Kaj is to use the empirical Kullback-Leibler divergence as a substitute to the intractable likelihood at the parameter value θ. In the generalised Bayes setting of Bissiri et al. Since this quantity is not available it is estimated as well. By a classification method that somehow relates to Geyer’s 1994 inverse logistic proposal, using the (ABC) pseudo-data generated from the model associated with θ. The convergence of the algorithm obviously depends on the choice of the discriminator used in practice. The paper also makes a connection with GANs as a potential alternative for the generalised Bayes representation. It mostly focus on the frequentist validation of the ABC posterior, in the sense of exhibiting a posterior concentration rate in n, the sample size, while requiring performances of the discriminators that may prove hard to check in practice. Expanding our 2018 result to this setting, with the tolerance decreasing more slowly than the Kullback-Leibler estimation error.

Besides the shared appreciation that working with the Kullback-Leibler divergence was a nice and under-appreciated direction, one point that came out of our discussion is that using the (estimated) Kullback-Leibler divergence as a form of distance (attached with a tolerance) is less prone to variability (or more robust) than using directly (and without tolerance) the estimate as a substitute to the intractable likelihood, if we interpreted the discrepancy in Figure 3 properly. Another item was about the discriminator function itself: while a machine learning methodology such as neural networks could be used, albeit with unclear theoretical guarantees, it was unclear to us whether or not a new discriminator needed be constructed for each value of the parameter θ. Even when the simulations are run by a deterministic transform.

21w5107 [½day 3]

Posted in pictures, Statistics, Travel, University life with tags , , , , , , , , , , , , , , , on December 2, 2021 by xi'an

Day [or half-day] three started without firecrackers and with David Rossell (formerly Warwick) presenting an empirical Bayes approach to generalised linear model choice with a high degree of confounding, using approximate Laplace approximations. With considerable improvements in the experimental RMSE. Making feeling sorry there was no apparent fully (and objective?) Bayesian alternative! (Two more papers on my reading list that I should have read way earlier!) Then Veronika Rockova discussed her work on approximate Metropolis-Hastings by classification. (With only a slight overlap with her One World ABC seminar.) Making me once more think of Geyer’s n⁰564 technical report, namely the estimation of a marginal likelihood by a logistic discrimination representation. Her ABC resolution replaces the tolerance step by an exponential of minus the estimated Kullback-Leibler divergence between the data density and the density associated with the current value of the parameter. (I wonder if there is a residual multiplicative constant there… Presumably not. Great idea!) The classification step need be run at every iteration, which could be sped up by subsampling.

On the always fascinating theme of loss based posteriors, à la Bissiri et al., Jack Jewson (formerly Warwick) exposed his work generalised Bayesian and improper models (from Birmingham!). Using data to decide between model and loss, which sounds highly unorthodox! First difficulty is that losses are unscaled. Or even not integrable after an exponential transform. Hence the notion of improper models. As in the case of robust Tukey’s loss, which is bounded by an arbitrary κ. Immediately I wonder if the fact that the pseudo-likelihood does not integrate is important beyond the (obvious) absence of a normalising constant. And the fact that this is not a generative model. And the answer came a few slides later with the use of the Hyvärinen score. Rather than the likelihood score. Which can itself be turned into a H-posterior, very cool indeed! Although I wonder at the feasibility of finding an [objective] prior on κ.

Rajesh Ranganath completed the morning session with a talk on [the difficulty of] connecting Bayesian models and complex prediction models. Using instead a game theoretic approach with Brier scores under censoring. While there was a connection with Veronika’s use of a discriminator as a likelihood approximation, I had trouble catching the overall message…

GANs as density estimators

Posted in Books, Statistics with tags , , , , , , , on October 15, 2021 by xi'an

I recently read an arXival entitled Conditional Sampling With Monotone GAN by Kovakchi et al., who construct  a mapping T that transforms or pushes forward a reference measure þ() like a multivariate Normal distribution to a target conditional distribution ð(dθ|x).  Which makes the proposal a type of normalising flow, except it does not require a Jacobian derivation… The mapping T is monotonous and block triangular in order to be invertible. It is learned from data by minimising a functional divergence between Tþ(dθ) and ð(dθ|x), for instance GAN least square or GAN Wasserstein penalties and representing T as a neural network.  Where monotonicity is imposed by a Lagrangian. The authors “note that global minimizers of [their GAN criterion] can also be used for conditional density estimation” but I fail to understand the distinction in that once T is constructed, the estimated conditional density is automatically available. However my main source of puzzlement is at the worth of this construction, since it does not provide an exact generative process for the conditional distribution, while requiring many generations from the joint distribution. Rather than a comparison with MCMC, which is not applicable in untractable generative models, a comparison with less expensive ABC solutions would have been appropriate, I think. And the paper is missing any quantification on the quality or asymptotics of the density estimate provided by this involved approximation, as most of the recent literature on normalising flows and friends. (A point acknowledged by the authors in the supplementary material section.)

“In this regard, the MGANs approach introduced in the article belongs to the category of sampling techniques such as MCMC, whose goal is to generate independent samples from the law of y|x, as opposed to assuming some structural form of the probability measure directly.”

I am unsure I understand the above remark as MCMC methods are intrinsically linked with the exact probability distribution, exploiting either some conditional representations as in Gibbs or at the very least the ability to compute the joint density…

 

NCE, VAEs, GANs & even ABC…

Posted in Statistics with tags , , , , , , , , , , , , , on May 14, 2021 by xi'an

As I was preparing my (new) lectures for a PhD short course “at” Warwick (meaning on Teams!), I read a few surveys and other papers on all these acronyms. It included the massive Guttmann and Hyvärinen 2012 NCE JMLR paperGoodfellow’s NIPS 2016 tutorial on GANs, and  Kingma and Welling 2019 introduction to VAEs. Which I found a wee bit on the light side, maybe missing the fundamentals of the notion… As well as the pretty helpful 2019 survey on normalising flows by Papamakarios et al., although missing on the (statistical) density estimation side.  And also a nice (2017) survey of GANs by Shakir Mohamed and Balaji Lakshminarayanan with a somewhat statistical spirit, even though convergence issues are not again not covered. But misspecification is there. And the many connections between ABC and GANs, if definitely missing on the uncertainty aspects. While Deep Learning by Goodfellow, Bengio and Courville adresses both the normalising constant (or partition function) and GANs, it was somehow not deep enough (!) to use for the course, offering only a few pages on NCE, VAEs and GANs. (And also missing on the statistical references addressing the issue, incl. [or excl.]  Geyer, 1994.) Overall, the infinite variations offered on GANs leave me uncertain about their statistical relevance, as it is unclear how good the regularisation therein is for handling overfitting and consistent estimation. (And if I spot another decomposition of the Kullback-Leibler divergence, I may start crying…)

%d bloggers like this: