**A**s I was preparing my (new) lectures for a PhD short course “at” Warwick (meaning on Teams!), I read a few surveys and other papers on all these acronyms. It included the massive Guttmann and Hyvärinen 2012 NCE JMLR paper, Goodfellow’s NIPS 2016 tutorial on GANs, and Kingma and Welling 2019 introduction to VAEs. Which I found a wee bit on the light side, maybe missing the fundamentals of the notion… As well as the pretty helpful 2019 survey on normalising flows by Papamakarios et al., although missing on the (statistical) density estimation side. And also a nice (2017) survey of GANs by Shakir Mohamed and Balaji Lakshminarayanan with a somewhat statistical spirit, even though convergence issues are not again not covered. But misspecification is there. And the many connections between ABC and GANs, if definitely missing on the uncertainty aspects. While Deep Learning by Goodfellow, Bengio and Courville adresses both the normalising constant (or partition function) and GANs, it was somehow not deep enough (!) to use for the course, offering only a few pages on NCE, VAEs and GANs. (And also missing on the statistical references addressing the issue, incl. [or excl.] Geyer, 1994.) Overall, the infinite variations offered on GANs leave me uncertain about their statistical relevance, as it is unclear how good the regularisation therein is for handling overfitting and consistent estimation. (And if I spot another decomposition of the Kullback-Leibler divergence, I may start crying…)

## Archive for noise contrasting estimation

## NCE, VAEs, GANs & even ABC…

Posted in Statistics with tags ABC, Bayesian GANs, CDT, deep learning, energy based model, generative adversarial networks, noise contrasting estimation, normalising constant, normalising flow, partition function, PhD course, Teams, University of Warwick, variational autoencoders on May 14, 2021 by xi'an## training energy based models

Posted in Books, Statistics with tags Charles Stein, energy based model, GAN, generative adversarial networks, Hyvärinen score, maximum likelihood estimation, noise contrasting estimation, unbiased estimation on April 7, 2021 by xi'an**T**his recent arXival by Song and Kingma covers different computational approaches to semi-parametric estimation, but also exposes imho the chasm existing between statistical and machine learning perspectives on the problem.

“Energy-based models are much less restrictive in functional form: instead of specifying a normalized probability, they only specify the unnormalized negative log-probability (…) Since the energy function does not need to integrate to one, it can be parameterized with any nonlinear regression function.”

The above in the introduction appears first as a strange argument, since the mass one constraint is the least of the problems when addressing non-parametric density estimation. Problems like the convergence, the speed of convergence, the computational cost and the overall integrability of the estimator. It seems however that the restriction or lack thereof is to be understood as the ability to use much more elaborate forms of densities, which are then black-boxes whose components have little relevance… When using such mega-over-parameterised representations of densities, such as neural networks and normalising flows, a statistical assessment leads to highly challenging questions. But convergence (in the sample size) does not appear to be a concern for the paper. (Except for a citation of Hyvärinen on p.5.)

Using MLE in this context appears to be questionable, though, since the base parameter θ is not unlikely to remain identifiable. Computing the MLE is therefore a minor issue, in this regard, a resolution based on simulated gradients being well-chartered from the earlier era of stochastic optimisation as in Robbins & Monro (1954), Duflo (1996) or Benveniste & al. (1990). (The log-gradient of the normalising constant being estimated by the opposite of the gradient of the energy at a random point.)

“Running MCMC till convergence to obtain a sample x∼p(x) can be computationally expensive.”

Contrastive divergence à la Hinton (2002) is presented as a solution to the convergence problem by stopping early, which seems reasonable given the random gradient is mostly noise. With a possible correction for bias *à la* Jacob & al. (missing the published version).

An alternative to MLE is the 2005 Hyvärinen score, notorious for bypassing the normalising constant. But blamed in the paper for being costly in the dimension d of the variate x, due to the second derivative matrix. Which can be avoided by using Stein’s unbiased estimator of the risk (yay!) if using randomized data. And surprisingly linked with contrastive divergence as well, if a Taylor expansion is good enough an approximation! An interesting byproduct of the discussion on score matching is to turn it into an unintended form of ABC!

“Many methods have been proposed to automatically tune the noise distribution, such as Adversarial Contrastive Estimation (Bose et al., 2018), Conditional NCE (Ceylan and Gutmann, 2018) and Flow Contrastive Estimation (Gao et al., 2020).”

A third approach is the noise contrastive estimation method of Gutmann & Hyvärinen (2010) that connects with both others. And is a precursor of GAN methods, mentioned at the end of the paper via a (sort of) variational inequality.

## Bayes factors revisited

Posted in Books, Mountains, pictures, Statistics, Travel, University life with tags Bayes factor, Bayesian foundations, bridge sampling, Dickey-Savage ratio, nested sampling, noise contrasting estimation, Ockham's razor, posterior probability, SMC, testing of hypotheses on March 22, 2021 by xi'an

“Bayes factor analyses are highly sensitive to and crucially depend on prior assumptions about model parameters (…) Note that the dependency of Bayes factors on the prior goes beyond the dependency of the posterior on the prior. Importantly, for most interesting problems and models, Bayes factors cannot be computed analytically.”

Daniel J. Schad, Bruno Nicenboim, Paul-Christian Bürkner, Michael Betancourt, Shravan Vasishth have just arXived a massive document on the Bayes factor, worrying about the computation of this common tool, but also at the variability of decisions based on Bayes factors, e.g., stressing correctly that

“…we should not confuse inferences with decisions. Bayes factors provide inference on hypotheses. However, to obtain discrete decisions (…) from continuous inferences in a principled way requires utility functions. Common decision heuristics (e.g., using Bayes factor larger than 10 as a discovery threshold) do not provide a principled way to perform decisions, but are merely heuristic conventions.”

The text is long and at times meandering (at least in the sections I read), while trying a wee bit too hard to bring up the advantages of using Bayes factors versus frequentist or likelihood solutions. (The likelihood ratio being presented as a “frequentist” solution, which I think is an incorrect characterisation.) For instance, the starting point of preferring a model with a higher marginal likelihood is presented as an evidence (oops!) rather than argumented. Since this quantity depends on both the prior and the likelihood, it being high or low is impacted by both. One could then argue that using its numerical value as an absolute criterion amounts to selecting the prior a posteriori as much as checking the fit to the data! The paper also resorts to the Occam’s razor argument, which I wish we could omit, as it is a vague criterion, wide open to misappropriation. It is also qualitative, rather than quantitative, hence does not help in calibrating the Bayes factor.

Concerning the actual computation of the Bayes factor, an issue that has always been a concern and a research topic for me, the authors consider only two “very common methods”, the Savage–Dickey density ratio method and bridge sampling. We discussed the shortcomings of the Savage–Dickey density ratio method with Jean-Michel Marin about ten years ago. And while bridge sampling is an efficient approach when comparing models of the same dimension, I have reservations about this efficiency in other settings. Alternative approaches like importance nested sampling, noise contrasting estimation or SMC samplers are often performing quite efficiently as normalising constant approximations. (Not to mention our version of harmonic mean estimator with HPD support.)

Simulation-based inference is based on the notion that simulated data can be produced from the predictive distributions. Reminding me of ABC model choice to some extent. But I am uncertain this approach can be used to calibrate the decision procedure to select the most appropriate model. We thought about using this approach in our testing by mixture paper and it is favouring the more complex of the two models. This seems also to occur for the example behind Figure 5 in the paper.

Two other points: first, the paper does not consider the important issue with improper priors, which are not rigorously compatible with Bayes factors, as I discussed often in the past. And second, Bayes factors are not truly Bayesian decision procedures, since they remove the prior weights on the models, thus the mention of utility functions therein seems inappropriate unless a genuine utility function can be produced.

## conditional noise contrastive estimation

Posted in Books, pictures, University life with tags Charlie Geyer, conference, ICML 2018, intractable constant, logistic regression, machine learning, noise contrasting estimation, Stockholm, Sweden on August 13, 2019 by xi'an**A**t ICML last year, Ciwan Ceylan and Michael Gutmann presented a new version of noise constrative estimation to deal with intractable constants. While noise contrastive estimation relies upon a second independent sample to contrast with the observed sample, this approach uses instead a perturbed or noisy version of the original sample, for instance a Normal generation centred at the original datapoint. And eliminates the annoying constant by breaking the (original and noisy) samples into two groups. The probability to belong to one group or the other then does not depend on the constant, which is a very effective trick. And can be optimised with respect to the parameters of the model of interest. Recovering the score matching function of Hyvärinen (2005). While this is in line with earlier papers by Gutmann and Hyvärinen, this line of reasoning (starting with Charlie Geyer’s logistic regression) never ceases to amaze me!