Archive for normalising flow

distortion estimates for approximate Bayesian inference

Posted in pictures, Statistics, University life with tags , , , , , , , , , on July 7, 2020 by xi'an

A few days ago, Hanwen Xing, Geoff Nichols and Jeong Eun Lee arXived a paper with the following title, to be presented at uai2020. Towards assessing the fit of the approximation for the actual posterior, given the available data. This covers of course ABC methods (which seems to be the primary focus of the paper) but also variational inference and synthetic likelihood versions. For a parameter of interest, the difference between exact and approximate marginal posterior distributions is see as a distortion map, D = F o G⁻¹, interpreted as in optimal transport and estimated by normalising flows. Even when the approximate distribution G is poorly estimated since D remains the cdf of G(X) when X is distributed from F. The marginal posterior approximate cdf G can be estimated by ABC or another approximate technique. The distortion function D is itself restricted to be a Beta cdf, with parameters estimated by a neural network (although based on which input is unclear to me, unless the weights in (5) are the neural weights). The assessment is based on the estimated distortion at the dataset, as a significant difference from the identity signal a poor fit for the approximation. Overall, the procedure seems implementable rather easily and while depending on calibrating choices (other than the number of layers in the neural network) a realistic version of the simulation-based diagnostic of Talts et al. (2018).

scalable Metropolis-Hastings, nested Monte Carlo, and normalising flows

Posted in Books, pictures, Statistics, University life with tags , , , , , , , , , , , , , , , , , , , , , , , , , on June 16, 2020 by xi'an

Over a sunny if quarantined Sunday, I started reading the PhD dissertation of Rob Cornish, Oxford University, as I am the external member of his viva committee. Ending up in a highly pleasant afternoon discussing this thesis over a (remote) viva yesterday. (If bemoaning a lost opportunity to visit Oxford!) The introduction to the viva was most helpful and set the results within the different time and geographical zones of the Ph.D since Rob had to switch from one group of advisors in Engineering to another group in Statistics. Plus an encompassing prospective discussion, expressing pessimism at exact MCMC for complex models and looking forward further advances in probabilistic programming.

Made of three papers, the thesis includes this ICML 2019 [remember the era when there were conferences?!] paper on scalable Metropolis-Hastings, by Rob Cornish, Paul Vanetti, Alexandre Bouchard-Côté, Georges Deligiannidis, and Arnaud Doucet, which I commented last year. Which achieves a remarkable and paradoxical O(1/√n) cost per iteration, provided (global) lower bounds are found on the (local) Metropolis-Hastings acceptance probabilities since they allow for Poisson thinning à la Devroye (1986) and  second order Taylor expansions constructed for all components of the target, with the third order derivatives providing bounds. However, the variability of the acceptance probability gets higher, which induces a longer but still manageable if the concentration of the posterior is in tune with the Bernstein von Mises asymptotics. I had not paid enough attention in my first read at the strong theoretical justification for the method, relying on the convergence of MAP estimates in well- and (some) mis-specified settings. Now, I would have liked to see the paper dealing with a more complex problem that logistic regression.

The second paper in the thesis is an ICML 2018 proceeding by Tom Rainforth, Robert Cornish, Hongseok Yang, Andrew Warrington, and Frank Wood, which considers Monte Carlo problems involving several nested expectations in a non-linear manner, meaning that (a) several levels of Monte Carlo approximations are required, with associated asymptotics, and (b) the resulting overall estimator is biased. This includes common doubly intractable posteriors, obviously, as well as (Bayesian) design and control problems. [And it has nothing to do with nested sampling.] The resolution chosen by the authors is strictly plug-in, in that they replace each level in the nesting with a Monte Carlo substitute and do not attempt to reduce the bias. Which means a wide range of solutions (other than the plug-in one) could have been investigated, including bootstrap maybe. For instance, Bayesian design is presented as an application of the approach, but since it relies on the log-evidence, there exist several versions for estimating (unbiasedly) this log-evidence. Similarly, the Forsythe-von Neumann technique applies to arbitrary transforms of a primary integral. The central discussion dwells on the optimal choice of the volume of simulations at each level, optimal in terms of asymptotic MSE. Or rather asymptotic bound on the MSE. The interesting result being that the outer expectation requires the square of the number of simulations for the other expectations. Which all need converge to infinity. A trick in finding an estimator for a polynomial transform reminded me of the SAME algorithm in that it duplicated the simulations as many times as the highest power of the polynomial. (The ‘Og briefly reported on this paper… four years ago.)

The third and last part of the thesis is a proposal [to appear in ICML 20] on relaxing bijectivity constraints in normalising flows with continuously index flows. (Or CIF. As Rob made a joke about this cleaning brand, let me add (?) to that joke by mentioning that looking at CIF and bijections is less dangerous in a Trump cum COVID era at CIF and injections!) With Anthony Caterini, George Deligiannidis and Arnaud Doucet as co-authors. I am much less familiar with this area and hence a wee bit puzzled at the purpose of removing what I understand to be an appealing side of normalising flows, namely to produce a manageable representation of density functions as a combination of bijective and differentiable functions of a baseline random vector, like a standard Normal vector. The argument made in the paper is that imposing this representation of the density imposes a constraint on the topology of its support since said support is homeomorphic to the support of the baseline random vector. While the supporting theoretical argument is a mathematical theorem that shows the Lipschitz bound on the transform should be infinity in the case the supports are topologically different, these arguments may be overly theoretical when faced with the practical implications of the replacement strategy. I somewhat miss its overall strength given that the whole point seems to be in approximating a density function, based on a finite sample.

ABC webinar, first!

Posted in Books, pictures, Statistics, University life with tags , , , , , , , , , , on April 13, 2020 by xi'an


The première of the ABC World Seminar last Thursday was most successful! It took place at the scheduled time, with no technical interruption and allowed 130⁺ participants from most of the World [sorry, West Coast friends!] to listen to the first speaker, Dennis Prangle,  presenting normalising flows and distilled importance sampling. And to answer questions. As I had already commented on the earlier version of his paper, I will not reproduce them here. In short, I remain uncertain, albeit not skeptical, about the notions of normalising flows and variational encoders for estimating densities, when perceived as a non-parametric estimator due to the large number of parameters it involves and wonder at the availability of convergence rates. Incidentally, I had forgotten at the remarkable link between KL distance & importance sampling variability. Adding to the to-read list Müller et al. (2018) on neural importance sampling.


distilling importance

Posted in Books, Statistics, University life with tags , , , , , , , , , , on November 13, 2019 by xi'an

As I was about to leave Warwick at the end of last week, I noticed a new arXival by Dennis Prangle, distilling importance sampling. In connection with [our version of] population Monte Carlo, “each step of [Dennis’] distilled importance sampling method aims to reduce the Kullback Leibler (KL) divergence from the distilled density to the current tempered posterior.”  (The introduction of the paper points out various connections with ABC, conditional density estimation, adaptive importance sampling, X entropy, &tc.)

“An advantage of [distilled importance sampling] over [likelihood-free] methods is that it performs inference on the full data, without losing information by using summary statistics.”

A notion used therein I had not heard before is the one of normalising flows, apparently more common in machine learning and in particular with GANs. (The slide below is from Shakir Mohamed and Danilo Rezende.) The  notion is to represent an arbitrary variable as the bijective transform of a standard variate like a N(0,1) variable or a U(0,1) variable (calling the inverse cdf transform). The only link I can think of is perfect sampling where the representation of all simulations as a function of a white noise vector helps with coupling.

I read a blog entry by Eric Jang on the topic (who produced this slide among other things) but did not emerge much the wiser. As the text instantaneously moves from the Jacobian formula to TensorFlow code… In Dennis’ paper, it appears that the concept is appealing for quickly producing samples and providing a rich family of approximations, especially when neural networks are included as transforms. They are used to substitute for a tempered version of the posterior target, validated as importance functions and aiming at being the closest to this target in Kullback-Leibler divergence. With the importance function interpretation, unbiased estimators of the gradient [in the parameter of the normalising flow] can be derived, with potential variance reduction. What became clearer to me from reading the illustration section is that the prior x predictive joint can also be modeled this way towards producing reference tables for ABC (or GANs) much faster than with the exact model. (I came across several proposals of that kind in the past months.) However, I deem mileage should vary depending on the size and dimension of the data. I also wonder at the connection between the (final) distribution simulated by distilled importance [the least tempered target?] and the ABC equivalent.