**A**s I was preparing my (new) lectures for a PhD short course “at” Warwick (meaning on Teams!), I read a few surveys and other papers on all these acronyms. It included the massive Guttmann and Hyvärinen 2012 NCE JMLR paper, Goodfellow’s NIPS 2016 tutorial on GANs, and Kingma and Welling 2019 introduction to VAEs. Which I found a wee bit on the light side, maybe missing the fundamentals of the notion… As well as the pretty helpful 2019 survey on normalising flows by Papamakarios et al., although missing on the (statistical) density estimation side. And also a nice (2017) survey of GANs by Shakir Mohamed and Balaji Lakshminarayanan with a somewhat statistical spirit, even though convergence issues are not again not covered. But misspecification is there. And the many connections between ABC and GANs, if definitely missing on the uncertainty aspects. While Deep Learning by Goodfellow, Bengio and Courville adresses both the normalising constant (or partition function) and GANs, it was somehow not deep enough (!) to use for the course, offering only a few pages on NCE, VAEs and GANs. (And also missing on the statistical references addressing the issue, incl. [or excl.] Geyer, 1994.) Overall, the infinite variations offered on GANs leave me uncertain about their statistical relevance, as it is unclear how good the regularisation therein is for handling overfitting and consistent estimation. (And if I spot another decomposition of the Kullback-Leibler divergence, I may start crying…)

## Archive for normalising flow

## NCE, VAEs, GANs & even ABC…

Posted in Statistics with tags ABC, Bayesian GANs, CDT, deep learning, energy based model, generative adversarial networks, noise contrasting estimation, normalising constant, normalising flow, partition function, PhD course, Teams, University of Warwick, variational autoencoders on May 14, 2021 by xi'an## nested sampling: any prior anytime?!

Posted in Books, pictures, Statistics, Travel with tags arXiv, Bayesian Methods in Cosmology, CMB, IAP, improper priors, inverse cdf, ΛCDM model, MNRAS, nested sampling, normalising flow, Planck Collaboration, prior modelling, simulation on March 26, 2021 by xi'an**A** recent arXival by Justin Alsing and Will Handley on “*nested sampling with any prior you like*” caught my attention. If only because I was under the impression that some priors would not agree with nested sampling. Especially those putting positive weight on some fixed levels of the likelihood function, as well as improper priors.

“…nested sampling has largely only been practical for a somewhat restrictive class of priors, which have a readily available representation as a transform from the unit hyper-cube.”

Reading from the paper, it seems that the whole point is to demonstrate that “any proper prior may be transformed onto the unit hypercube via a bijective transformation.” Which seems rather straightforward if the transform is not otherwise constrained: use a logit transform in every direction. The paper gets instead into the rather fashionable direction of normalising flows as density representations. (Which suddenly reminded me of the PhD dissertation of Rob Cornish at Oxford, which I examined last year. Even though nested was not used there in the same understanding.) The purpose appearing later (in the paper) or *in fine* to express a random variable simulated from the prior as the (generative) transform of a Uniform variate, f(U). Resuscitating the simulation from an arbitrary distribution from first principles.

“One particularly common scenario where this arises is when one wants to use the (sampled) posterior from one experiment as the prior for another”

But I remained uncertain at the requirement for this representation in implementing nested sampling as I do not see how it helps in bypassing the hurdles of simulating from the prior constrained by increasing levels of the likelihood function. It would be helpful to construct normalising flows adapted to the truncated priors but I did not see anything related to this version in the paper.

The cosmological application therein deals with the incorporation of recent measurements in the study of the ΛCDM cosmological model, that is, more recent that the CMB Planck dataset we played with 15 years ago. (Time flies, even if an expanding Universe!) Namely, the Baryon Oscillation Spectroscopic Survey and the S*H*_{0}ES collaboration.

## flow contrastive estimation

Posted in Books, Statistics with tags Bayesian inference, flight, GAN, generative adversarial networks, Hyperion, noise-contrastive estimation, normalising constant, normalising flow, Université de Montpellier on March 15, 2021 by xi'an**O**n the flight back from Montpellier, last week, I read a 2019 paper by Gao et al. revisiting the MLE estimation of a parametric family parameter when the normalising constant Z=Z(θ) is unknown. Via noise-contrastive estimation à la Guttman & Hyvärinnen (or à la Charlie Geyer). Treating the normalising constant Z as an extra parameter (as in Kong et al.) and the classification probability as an objective function and calling it a likelihood, which it is not in my opinion as (i) the allocation to the groups is not random and (ii) the original density of the actual observations does not appear in the so-called likelihood.

*“When q appears on the right of KL-divergence* [against *p*],* it is forced to cover most of the modes of p, When q appears on the left of KL-divergence, it tends to chase the major modes of p while ignoring the minor modes.”*

The flow in the title indicates that the contrastive distribution *q* is estimated by a flow-based estimator, namely the transform of a basic noise distribution via easily invertible and differentiable transforms, for instance with lower triangular Jacobians. This flow is also estimated directly from the data but the authors complain this estimation is not good enough for noise contrastive estimation and suggest instead resorting to a GAN version where the classification log-probability is maximised in the model parameters and minimsed in the flow parameters. Except that I feel it misses the true likelihood part. In other words, why on Hyperion would estimating all θ, Z=Z(θ), and α at once improve the estimation of Z?

The other aspect that puzzles me is that (12) uses integrated classification probabilities (with the unknown Z as extra parameter), rather than conditioning on the data, Bayes-like. (The difference between (12) and GAN is that here the discriminator function is constrained.) Esp. when the first expectation is replaced with its empirical version.

## distortion estimates for approximate Bayesian inference

Posted in pictures, Statistics, University life with tags ABC, approximate Bayesian inference, distortion map, normalising flow, optimal transport, simulation-based inference, Toronto, uai2020, variational Bayes methods, variational inference on July 7, 2020 by xi'an**A** few days ago, Hanwen Xing, Geoff Nichols and Jeong Eun Lee arXived a paper with the following title, to be presented at uai2020. Towards assessing the fit of the approximation for the actual posterior, given the available data. This covers of course ABC methods (which seems to be the primary focus of the paper) but also variational inference and synthetic likelihood versions. For a parameter of interest, the difference between exact and approximate marginal posterior distributions is see as a distortion map, *D = F o G⁻¹*, interpreted as in optimal transport and estimated by normalising flows. Even when the approximate distribution *G* is poorly estimated since *D* remains the cdf of *G(X)* when *X* is distributed from *F*. The marginal posterior approximate cdf *G* can be estimated by ABC or another approximate technique. The distortion function D is itself restricted to be a Beta cdf, with parameters estimated by a neural network (although based on which input is unclear to me, unless the weights in (5) are the neural weights). The assessment is based on the estimated distortion at the dataset, as a significant difference from the identity signal a poor fit for the approximation. Overall, the procedure seems implementable rather easily and while depending on calibrating choices (other than the number of layers in the neural network) a realistic version of the simulation-based diagnostic of Talts et al. (2018).