Archive for Bayesian GANs

GANs as density estimators

Posted in Books, Statistics on October 15, 2021 by xi'an

I recently read an arXival entitled Conditional Sampling With Monotone GANs by Kovachki et al., who construct a mapping T that transforms or pushes forward a reference measure ρ(dθ), like a multivariate Normal distribution, to a target conditional distribution π(dθ|x). Which makes the proposal a type of normalising flow, except it does not require a Jacobian derivation… The mapping T is monotone and block triangular in order to be invertible. It is learned from data by minimising a functional divergence between the pushforward T♯ρ(dθ) and π(dθ|x), for instance least squares or Wasserstein GAN penalties, with T represented as a neural network. Where monotonicity is imposed via a Lagrangian penalisation. The authors “note that global minimizers of [their GAN criterion] can also be used for conditional density estimation”, but I fail to understand the distinction in that, once T is constructed, the estimated conditional density is automatically available. However my main source of puzzlement is the worth of this construction, since it does not provide an exact generative process for the conditional distribution, while requiring many generations from the joint distribution. Rather than a comparison with MCMC, which is not applicable for intractable generative models, a comparison with less expensive ABC solutions would have been appropriate, I think. And the paper is missing any quantification of the quality or asymptotics of the density estimate provided by this involved approximation, as is the case for most of the recent literature on normalising flows and friends. (A point acknowledged by the authors in the supplementary material section.)
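To fix ideas, here is a crude sketch (mine, not the authors') of what such a conditional generator could look like, with a least squares GAN objective on (x,θ) pairs and a naive finite-difference penalty standing in for the monotonicity constraint, on a toy joint distribution:

```python
import torch
import torch.nn as nn

# Hypothetical sketch (not the authors' code): learn T(x, z) so that, for x drawn
# from the joint, T(x, z) with z ~ N(0, I) is distributed like theta | x.
# Least-squares GAN objective on (x, theta) pairs, plus a crude finite-difference
# penalty nudging T towards monotonicity in z.

d_x, d_theta = 2, 1
T = nn.Sequential(nn.Linear(d_x + d_theta, 64), nn.ReLU(), nn.Linear(64, d_theta))
D = nn.Sequential(nn.Linear(d_x + d_theta, 64), nn.ReLU(), nn.Linear(64, 1))
opt_T = torch.optim.Adam(T.parameters(), lr=1e-3)
opt_D = torch.optim.Adam(D.parameters(), lr=1e-3)

def sample_joint(n):
    # stand-in joint: theta ~ N(0, 1), x | theta ~ N((theta, theta), 0.3^2 I)
    theta = torch.randn(n, d_theta)
    x = theta.repeat(1, d_x) + 0.3 * torch.randn(n, d_x)
    return x, theta

for step in range(2000):
    x, theta = sample_joint(256)
    z = torch.randn(256, d_theta)
    fake = T(torch.cat([x, z], 1))
    # discriminator: push real pairs towards 1, generated pairs towards 0
    loss_D = ((D(torch.cat([x, theta], 1)) - 1) ** 2).mean() \
             + (D(torch.cat([x, fake.detach()], 1)) ** 2).mean()
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()
    # generator: fool the discriminator, while keeping T increasing in z
    fake = T(torch.cat([x, z], 1))
    shifted = T(torch.cat([x, z + 0.1], 1))
    mono_penalty = torch.relu(fake - shifted).mean()
    loss_T = ((D(torch.cat([x, fake], 1)) - 1) ** 2).mean() + 10. * mono_penalty
    opt_T.zero_grad(); loss_T.backward(); opt_T.step()

# approximate draws from theta | x0 are then simply T(x0, z) for fresh z ~ N(0, I)
x0 = torch.tensor([[1.0, 1.0]])
samples = T(torch.cat([x0.repeat(500, 1), torch.randn(500, d_theta)], 1))
```

Once trained, producing (approximate) draws from θ|x only requires feeding fresh z's into T, which is where the appeal of the construction lies, exactness concerns aside.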

“In this regard, the MGANs approach introduced in the article belongs to the category of sampling techniques such as MCMC, whose goal is to generate independent samples from the law of y|x, as opposed to assuming some structural form of the probability measure directly.”

I am unsure I understand the above remark as MCMC methods are intrinsically linked with the exact probability distribution, exploiting either some conditional representations as in Gibbs or at the very least the ability to compute the joint density…


NCE, VAEs, GANs & even ABC…

Posted in Statistics on May 14, 2021 by xi'an

As I was preparing my (new) lectures for a PhD short course “at” Warwick (meaning on Teams!), I read a few surveys and other papers on all these acronyms. It included the massive Gutmann and Hyvärinen 2012 NCE JMLR paper, Goodfellow’s NIPS 2016 tutorial on GANs, and Kingma and Welling’s 2019 introduction to VAEs. Which I found a wee bit on the light side, maybe missing the fundamentals of the notion… As well as the pretty helpful 2019 survey on normalising flows by Papamakarios et al., although missing on the (statistical) density estimation side. And also a nice (2017) survey of GANs by Shakir Mohamed and Balaji Lakshminarayanan with a somewhat statistical spirit, even though convergence issues are again not covered. But misspecification is there. And the many connections between ABC and GANs, if definitely missing on the uncertainty aspects. While Deep Learning by Goodfellow, Bengio and Courville addresses both the normalising constant (or partition function) and GANs, it was somehow not deep enough (!) to use for the course, offering only a few pages on NCE, VAEs and GANs. (And also missing the statistical references addressing the issue, incl. [or excl.] Geyer, 1994.) Overall, the infinite variations offered on GANs leave me uncertain about their statistical relevance, as it is unclear how good the regularisation therein is for handling overfitting and consistent estimation. (And if I spot another decomposition of the Kullback-Leibler divergence, I may start crying…)
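For the course, the NCE principle is perhaps the easiest of these acronyms to demonstrate in a few lines: turn the estimation of an unnormalised density into a logistic regression of data against noise simulated from a known reference, treating the log normalising constant as one more parameter. A toy version (with my own numbers, purely for illustration) would look like

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Toy NCE: fit the unnormalised model f(u; mu, c) = exp(-(u - mu)^2/2 + c), with
# the log normalising constant c treated as a free parameter, by "classifying"
# data against noise from a known N(0, 2^2) reference (Gutmann & Hyvarinen, 2012).
# Data and tuning choices are mine.

rng = np.random.default_rng(0)
x = rng.normal(1.5, 1.0, size=1000)           # observed sample
y = rng.normal(0.0, 2.0, size=1000)           # noise sample, nu = 1

def neg_nce(params):
    mu, c = params
    G = lambda u: (-(u - mu) ** 2 / 2 + c) - norm.logpdf(u, 0.0, 2.0)  # log f - log noise
    # NCE maximises E_data[log sigmoid(G)] + E_noise[log sigmoid(-G)]; return its negative
    return np.mean(np.logaddexp(0.0, -G(x))) + np.mean(np.logaddexp(0.0, G(y)))

est = minimize(neg_nce, x0=[0.0, 0.0])
print(est.x)   # mu close to 1.5, c close to -log(sqrt(2*pi)) ~ -0.92
```

which recovers both the location parameter and the missing normalising constant.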

mean field Langevin system & neural networks

Posted in Statistics on February 4, 2020 by xi'an

A colleague of mine in Paris Dauphine, Zhenjie Ren, recently gave a talk on recent papers of his connecting neural nets and Langevin. Estimating the parameters of the NNs by mean-field Langevin dynamics. Following from an earlier paper on the topic by Mei, Montanari & Nguyen in 2018. Here are some notes I took during the seminar, not necessarily coherent as I was a bit under the weather that day. And had no previous exposure to most notions.

Fitting a one-layer network is turned into a minimisation programme over a measure space (when using loads of data). A reformulation that makes the problem convex. Adding a regularisation by the entropy and introducing derivatives of a functional with respect to the measure. With a necessary and sufficient condition for the solution to be unique when the functional is convex. This reformulation leads to a Fokker-Planck equation, itself related with a Langevin diffusion. Except there is a measure in the Langevin equation, whose stationary version is the solution of the original regularised minimisation programme.
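If I understood correctly, the resulting particle algorithm is nothing but noisy gradient descent on the network parameters, with the noise level set by the entropic regularisation. A toy rendering (my own notation and tuning) for a one-layer tanh network:

```python
import numpy as np

# Crude particle version (my own toy set-up) of mean-field Langevin dynamics:
# the one-layer network f(x) = \int a tanh(w x) m(da, dw) is represented by M
# particles (a_m, w_m); each particle moves along the gradient of the functional
# derivative of the squared loss with respect to the measure, plus Gaussian noise
# whose scale tau comes from the entropic regularisation. Particles interact
# only through the shared mean-field prediction.

rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=200)
Y = np.sin(2 * X)                                   # toy regression target
M, step, tau, n_iter = 100, 0.2, 1e-3, 5000
a, w = rng.normal(size=M), rng.normal(size=M)

for _ in range(n_iter):
    phi = np.tanh(np.outer(X, w))                   # (n, M) features
    pred = phi @ a / M                              # prediction under the empirical measure
    resid = (pred - Y) / len(X)
    grad_a = phi.T @ resid                          # drift for the a-components
    grad_w = ((1 - phi ** 2) * X[:, None]).T @ resid * a   # drift for the w-components
    a += -step * grad_a + np.sqrt(2 * tau * step) * rng.normal(size=M)
    w += -step * grad_w + np.sqrt(2 * tau * step) * rng.normal(size=M)

print(np.mean((np.tanh(np.outer(X, w)) @ a / M - Y) ** 2))   # final mean squared error
```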

A second paper contains an extension to deep NNs, re-expressed as a problem in a random environment. Or with a marginal constraint (one marginal distribution being constrained). With a partial derivative wrt the marginal measure. Turning into a Langevin diffusion with an extra random element. Using optimal control produces a new Hamiltonian. Eventually producing the mean-field Langevin system as backward propagation. Coefficients being computed by the chain rule, equivalent to an Euler scheme for Langevin dynamics.

This approach has consequences for GANs, with the discriminator as a one-layer NN and the generator minimised over two measures. The discriminator is the invariant measure of the mean-field Langevin dynamics. Mentioning Metropolis-Hastings GANs, which seem to require one full run of an MCMC algorithm at each iteration of the mean-field Langevin.
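For the record, the Metropolis-Hastings GAN recipe (as I understand it from Turner et al., 2019) uses the generator as an independent proposal and the discriminator as a density-ratio estimate in the acceptance probability, as in the following caricature where both are replaced by closed-form stand-ins:

```python
import numpy as np

# Caricature of the Metropolis-Hastings GAN step, with the trained generator and
# discriminator replaced by closed-form stand-ins: the generator proposes
# independently, and a perfectly calibrated discriminator
# D(u) = p_data(u) / (p_data(u) + p_gen(u)) supplies the density ratio in the
# acceptance probability. All distributions below are mine, for illustration only.

rng = np.random.default_rng(2)
p_data = lambda u: np.exp(-(u - 0.5) ** 2 / 2) / np.sqrt(2 * np.pi)              # "true" N(0.5, 1)
p_gen = lambda u: np.exp(-u ** 2 / (2 * 1.2 ** 2)) / (1.2 * np.sqrt(2 * np.pi))  # generator N(0, 1.2^2)
D = lambda u: p_data(u) / (p_data(u) + p_gen(u))                                 # idealised discriminator

x, chain = rng.normal(0.0, 1.2), []
for prop in rng.normal(0.0, 1.2, size=10_000):     # independent proposals from the generator
    ratio = (1 / D(x) - 1) / (1 / D(prop) - 1)     # MH-GAN acceptance ratio
    if rng.uniform() < min(1.0, ratio):
        x = prop
    chain.append(x)
print(np.mean(chain), np.std(chain))               # drifts towards the data mean 0.5 and sd 1
```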

Bayesian inference with no likelihood

Posted in Books, Statistics, University life on January 28, 2020 by xi'an

This week I made a quick trip to Warwick for the defence (or viva) of the PhD thesis of Jack Jewson, containing novel perspectives on constructing Bayesian inference without likelihood or without complete trust in said likelihood. The thesis aimed at constructing minimum divergence posteriors in an M-open perspective and built a rather coherent framework from principles to implementation. There is a clear link with the earlier work of Bissiri et al. (2016), with further consistency constraints where the outcome must recover the true posterior in the M-closed scenario (if not always the case with the procedures proposed in the thesis).

Although I am partial to the use of empirical likelihoods in this setting, I appreciated the position of the thesis and the discussion of the various divergences towards the posterior derivation (already discussed on this blog), with interesting perspectives on the calibration of the pseudo-posterior à la Bissiri et al. (2016). Among other things, the thesis pointed out a departure from the likelihood principle and some of its most established consequences, like Bayesian additivity. In that regard, there were connections with generative adversarial networks (GANs) and their Bayesian versions that could have been explored. And an impression that the type of Bayesian robustness explored in the thesis has more to do with outliers than with misspecification. Epsilon-contamination models are quite specific as it happens, in terms of tails and other things.
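As a reminder (and as a sketch of my own making rather than anything from the thesis), the Bissiri et al. (2016) update replaces the likelihood with a loss ℓ(θ,x), producing a pseudo-posterior proportional to π(θ) exp{−w Σᵢ ℓ(θ,xᵢ)}, where the weight w needs calibrating; on a grid, with a few outliers thrown in, the contrast between a squared-error and an absolute-error loss illustrates the outliers-versus-misspecification point:

```python
import numpy as np

# Grid illustration (my own toy data and losses, not the thesis') of the Bissiri
# et al. (2016) update: pi_w(theta | x) propto pi(theta) exp(-w sum_i l(theta, x_i)).
# A squared-error loss reproduces a Gaussian likelihood and gets dragged by the
# outliers, while an absolute-error loss yields a more robust pseudo-posterior.

rng = np.random.default_rng(3)
x = rng.normal(1.0, 1.0, size=50)
x[:3] = 8.0                                     # plant a few outliers
theta = np.linspace(-2, 4, 601)                 # grid for the location parameter
log_prior = -theta ** 2 / (2 * 10 ** 2)         # vague N(0, 10^2) prior

def pseudo_posterior(loss, w=1.0):
    log_post = log_prior - w * loss
    post = np.exp(log_post - log_post.max())
    return post / np.trapz(post, theta)

sq_loss = np.sum((x[None, :] - theta[:, None]) ** 2, axis=1) / 2   # = Gaussian negative log-lik
abs_loss = np.sum(np.abs(x[None, :] - theta[:, None]), axis=1)

for name, loss in [("squared", sq_loss), ("absolute", abs_loss)]:
    post = pseudo_posterior(loss)
    print(name, np.trapz(theta * post, theta))  # pseudo-posterior mean under each loss
```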

The next chapter is somewhat “less” Bayesian in my view as it considers a generalised form of variational inference. I agree that the view of the posterior as the solution to an optimisation problem is tempting, but changing the objective function makes the notion less precise. Which makes reading it somewhat delicate, as it seems to dilute the meaning of both prior and posterior to the point of becoming irrelevant.
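To make the optimisation view concrete, a minimal version of such a generalised variational objective (my own toy choices of loss, prior and variational family, not the thesis') is to pick q = N(m, s²) minimising E_q[Σᵢ ℓ(θ, xᵢ)] + KL(q, prior):

```python
import numpy as np
from scipy.optimize import minimize

# Minimal rendering of the "posterior as optimisation" view (my own choices, not
# the thesis'): pick q = N(m, s^2) minimising E_q[sum_i l(theta, x_i)] + KL(q || N(0, 10^2)),
# with an absolute-error loss and a Gauss-Hermite approximation of the expectation.

rng = np.random.default_rng(5)
x = rng.normal(1.0, 1.0, size=50)
nodes, weights = np.polynomial.hermite_e.hermegauss(20)
weights = weights / weights.sum()                  # quadrature weights for N(0, 1)

def gvi_objective(params):
    m, log_s = params
    s = np.exp(log_s)
    theta = m + s * nodes                          # quadrature points under q = N(m, s^2)
    exp_loss = np.sum(weights * np.abs(x[None, :] - theta[:, None]).sum(axis=1))
    kl = np.log(10.0 / s) + (s ** 2 + m ** 2) / (2 * 10.0 ** 2) - 0.5
    return exp_loss + kl

res = minimize(gvi_objective, x0=[0.0, 0.0])
print(res.x[0], np.exp(res.x[1]))                  # "posterior" mean and spread of q
```

where, once the loss is no longer a log-likelihood, neither prior nor “posterior” retains its usual probabilistic interpretation.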

The last chapter on change-point models is quite alluring in that it capitalises on the previous developments to analyse a fairly realistic if traditional problem, applied to traffic in London, prior and posterior to the congestion tax. However, there is always an issue with robustness and outliers, in that the notion is somewhat vague or informal. Things start clarifying at the end, but I find it surprising that conjugates come out as robust optimal solutions, since the usual folk theorem from the 80's is that they are not robust.

an independent sampler that maximizes the acceptance rate of the MH algorithm

Posted in Books, Kids, Statistics, University life on September 3, 2019 by xi'an

An ICLR 2019 paper by Neklyudov, Egorov and Vetrov on an optimal choice of the proposal in an independent Metropolis algorithm, which I discovered via an X validated question. Namely whether or not the expected Metropolis-Hastings acceptance ratio is always one (which it is not when the support of the proposal is restricted). The paper mentions the domination of the Accept-Reject algorithm by the associated independent Metropolis-Hastings algorithm, which has actually been stated in our Monte Carlo Statistical Methods (1999, Lemma 6.3.2) and may prove even older. The authors also note that the expected acceptance probability is equal to one minus the total variation distance between the joint defined as target x Metropolis-Hastings proposal distribution and its time-reversed version. Which seems to suffer from the same difficulty as the one mentioned in the X validated question, namely that it only holds when the support of the Metropolis-Hastings proposal contains the support of the target (or else when the joint defined as target x Metropolis-Hastings proposal distribution is somewhat symmetric). Replacing total variation with Kullback-Leibler then leads to a manageable optimisation target if the proposal is a parameterised independent distribution, with a GAN version when the proposal is not explicitly available. I find it rather strange that one still seeks independent proposals for running Metropolis-Hastings algorithms, as the result will depend on the family of proposals considered and as performances will deteriorate with dimension (the authors mention a 10% acceptance rate, which sounds quite low). [As an aside, ICLR 2020 will take place in Addis Ababa next April.]
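To illustrate the quantity being optimised, here is a basic independent Metropolis-Hastings run (my own target and proposals, nothing to do with the paper's adversarial construction), where the empirical acceptance rate increases as the proposal gets closer to the target and would reach one for a perfect proposal:

```python
import numpy as np

# Basic independent Metropolis-Hastings run (my own target and proposals): the
# empirical acceptance rate is the quantity an optimal independent proposal
# would maximise, reaching one when q = pi.

rng = np.random.default_rng(4)
log_pi = lambda u: -np.abs(u) ** 3 / 3              # unnormalised target density

def independent_mh(mu, sigma, n=50_000):
    log_q = lambda u: -(u - mu) ** 2 / (2 * sigma ** 2)   # unnormalised N(mu, sigma^2) proposal
    x, accepts = 0.0, 0
    for y in rng.normal(mu, sigma, size=n):
        # acceptance probability min{1, pi(y) q(x) / [pi(x) q(y)]}
        if np.log(rng.uniform()) < log_pi(y) + log_q(x) - log_pi(x) - log_q(y):
            x, accepts = y, accepts + 1
    return accepts / n

for mu, sigma in [(0.0, 1.0), (0.0, 2.0), (2.0, 1.0)]:
    print(mu, sigma, independent_mh(mu, sigma))     # acceptance rate degrades as q moves away from pi
```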
