## flow contrastive estimation

**O**n the flight back from Montpellier, last week, I read a 2019 paper by Gao et al. revisiting the MLE estimation of the parameter of a parametric family when the normalising constant Z=Z(θ) is unknown, via noise-contrastive estimation à la Gutmann & Hyvärinen (or à la Charlie Geyer). The approach treats the normalising constant Z as an extra parameter (as in Kong et al.) and takes the classification probability as objective function, calling it a likelihood, which it is not in my opinion as (i) the allocation to the groups is not random and (ii) the original density of the actual observations does not appear in this so-called likelihood.
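To fix ideas, here is a minimal sketch of this form of NCE on a toy unnormalised model exp(−θx²) (my own toy setup, not from the paper): the classification log-probability of data versus noise is maximised jointly in θ and in c = log Z, treated as a free parameter.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# data from N(0, 0.5), i.e. density proportional to exp(-x^2): true theta = 1
x = rng.normal(0.0, np.sqrt(0.5), size=10_000)
# contrastive noise from a standard normal
y = rng.normal(0.0, 1.0, size=10_000)

def log_q(z):  # log density of the noise distribution N(0,1)
    return -0.5 * z**2 - 0.5 * np.log(2 * np.pi)

def neg_nce(params):
    theta, c = params          # c plays the role of log Z, a free parameter
    def log_p(z):              # log of the "normalised" model
        return -theta * z**2 - c
    # classification log-probabilities, equal sample sizes (nu = 1)
    ll = (np.mean(-np.log1p(np.exp(log_q(x) - log_p(x))))     # data classed as data
          + np.mean(-np.log1p(np.exp(log_p(y) - log_q(y)))))  # noise classed as noise
    return -ll

fit = minimize(neg_nce, x0=[0.5, 0.0], method="Nelder-Mead")
theta_hat, logZ_hat = fit.x
# compare with theta = 1 and log Z = log sqrt(pi) ≈ 0.572
```

Even in this toy case the estimate of log Z comes out close to log √π, which is the point of keeping Z as a parameter rather than computing it.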

*“When q appears on the right of the KL-divergence* [against *p*]*, it is forced to cover most of the modes of p. When q appears on the left of the KL-divergence, it tends to chase the major modes of p while ignoring the minor modes.”*

The flow in the title indicates that the contrastive distribution *q* is estimated by a flow-based estimator, namely the transform of a basic noise distribution via easily invertible and differentiable transforms, for instance with lower triangular Jacobians. This flow is also estimated directly from the data, but the authors complain that this estimation is not good enough for noise-contrastive estimation and suggest instead resorting to a GAN version where the classification log-probability is maximised in the model parameters and minimised in the flow parameters. Except that I feel it misses the true likelihood part. In other words, why on Hyperion would estimating all of θ, Z=Z(θ), and α at once improve the estimation of Z?
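For the record, my reading of the resulting min-max criterion, with equal sample sizes and Z kept as a free parameter (a sketch in my own notation, not a quote from the paper):

```latex
\max_{\theta,\,Z}\ \min_{\alpha}\;
\mathbb{E}_{p_{\text{data}}}\!\left[\log\frac{p_\theta(x)}{p_\theta(x)+q_\alpha(x)}\right]
+\mathbb{E}_{q_\alpha}\!\left[\log\frac{q_\alpha(x)}{p_\theta(x)+q_\alpha(x)}\right],
\qquad p_\theta(x)=\exp\{f_\theta(x)\}/Z
```

where the model parameters (θ, Z) play the maximising role and the flow parameters α the minimising one, with the discriminator fully determined by p and q rather than free as in a GAN.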

The other aspect that puzzles me is that (12) uses integrated classification probabilities (with the unknown Z as an extra parameter), rather than conditioning on the data, Bayes-like. (The difference between (12) and a GAN is that here the discriminator function is constrained.) Especially when the first expectation is replaced with its empirical version.

March 16, 2021 at 11:11 am

There is a very nice and underappreciated 2018 EJS paper by Lionel Riou-Durand and Nicolas Chopin on the asymptotic properties of NCE estimates. The asymptotic variance of the estimate is directly related to the ratio q_α/(p_θ+q_α). I interpret the proposed flow method as trying to make this ratio constant, and we expect that it will indeed typically decrease the variance.
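This variance point can be checked numerically; a quick toy sketch (my own setup, not from either paper): fitting the same NCE classification objective repeatedly under a well-matched and a badly mismatched noise distribution, the spread of the resulting estimates of θ across replications is visibly larger in the mismatched case.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(3)

def nce_theta(x, noise_sd):
    """NCE estimate of theta in exp(-theta x^2), with log Z as a free parameter."""
    y = rng.normal(0.0, noise_sd, size=len(x))
    lq = lambda z: norm.logpdf(z, 0.0, noise_sd)
    def neg(params):
        theta, c = params
        lp = lambda z: -theta * z**2 - c
        return -(np.mean(-np.log1p(np.exp(lq(x) - lp(x))))
                 + np.mean(-np.log1p(np.exp(lp(y) - lq(y)))))
    return minimize(neg, x0=[0.5, 0.0], method="Nelder-Mead").x[0]

good, bad = [], []
for _ in range(50):
    x = rng.normal(0.0, np.sqrt(0.5), size=2_000)   # true theta = 1
    good.append(nce_theta(x, 1.0))   # noise close to the data distribution
    bad.append(nce_theta(x, 5.0))    # noise spread far wider than the data
print(np.var(good), np.var(bad))    # mismatch inflates the variance
```

The ratio q_α/(p_θ+q_α) is close to constant in the first case and far from it in the second, which is exactly the mechanism the flow estimation of q is meant to exploit.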

March 16, 2021 at 2:22 pm

Quite a nice coincidence: I am sitting next to Nicolas’ office for a first visit in over 14 months!

March 15, 2021 at 11:07 am

Thanks for the notes on this — plenty to think about here. In the above you mention some older work by Geyer; is there a reference you can give for that? (Probably it’s famous work — sorry for my ignorance!)

March 15, 2021 at 2:30 pm

Thanks David, this is THE famous 1991-1994 unpublished technical report by Charlie Geyer, indeed!!! Andrew Gelman includes it in his list of the greatest works of statistics never published. It is available here.

March 18, 2021 at 10:11 am

Great — thanks, xian. Useful not only to me to be reminded of that nice work by Geyer, but also to future students who might want to find it (and hopefully to read it!)