distilling importance
As I was about to leave Warwick at the end of last week, I noticed a new arXival by Dennis Prangle, distilling importance sampling. In connection with [our version of] population Monte Carlo, “each step of [Dennis’] distilled importance sampling method aims to reduce the Kullback Leibler (KL) divergence from the distilled density to the current tempered posterior.” (The introduction of the paper points out various connections with ABC, conditional density estimation, adaptive importance sampling, X entropy, &tc.)
“An advantage of [distilled importance sampling] over [likelihood-free] methods is that it performs inference on the full data, without losing information by using summary statistics.”
A notion used therein that I had not heard of before is that of normalising flows, apparently more common in machine learning and in particular with GANs. (The slide below is from Shakir Mohamed and Danilo Rezende.) The idea is to represent an arbitrary variable as a bijective transform of a standard variate like a N(0,1) or a U(0,1) variable (recalling the inverse cdf transform). The only link I can think of is with perfect sampling, where representing all simulations as a function of a common white noise vector helps with coupling.
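For the record, here is a minimal sketch of the change-of-variables mechanics behind a normalising flow, with a single affine bijection of N(0,1) noise standing in for the neural-network transforms of the paper (the names and values below are illustrative, not taken from Dennis' code):

```python
import numpy as np

# A hypothetical one-layer "flow": an affine bijection applied to N(0,1) noise.
# Real normalising flows stack many such bijections, typically parameterised by
# neural networks, but the change-of-variables bookkeeping is the same.

def flow_forward(z, mu=1.0, log_sigma=0.5):
    # map standard normal noise z to x = mu + exp(log_sigma) * z
    return mu + np.exp(log_sigma) * z

def flow_log_density(x, mu=1.0, log_sigma=0.5):
    # density of x under the flow via the change-of-variables formula:
    # log q(x) = log phi(z) - log |dx/dz|, with z the inverse image of x
    z = (x - mu) / np.exp(log_sigma)
    return -0.5 * z ** 2 - 0.5 * np.log(2.0 * np.pi) - log_sigma

# sampling is cheap: draw noise and push it through the bijection
z = np.random.default_rng(0).standard_normal(5)
x = flow_forward(z)
print(x, flow_log_density(x))
```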
I read a blog entry by Eric Jang on the topic (who produced this slide among other things) but did not emerge much the wiser, as the text moves almost instantaneously from the Jacobian change-of-variables formula to TensorFlow code… In Dennis' paper, the concept appeals for quickly producing samples and for providing a rich family of approximations, especially when neural networks are included as transforms. The flows are used as substitutes for a tempered version of the posterior target, validated as importance functions and aiming at being the closest to this target in Kullback-Leibler divergence. With the importance-function interpretation, unbiased estimators of the gradient [in the parameters of the normalising flow] can be derived, with potential for variance reduction.

What became clearer to me from reading the illustration section is that the prior x predictive joint can also be modelled this way, towards producing reference tables for ABC (or GANs) much faster than with the exact model. (I came across several proposals of that kind in the past months.) However, I expect mileage to vary with the size and dimension of the data. I also wonder at the connection between the (final) distribution simulated by distilled importance sampling [the least tempered target?] and the ABC equivalent.
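To fix ideas, here is a toy version of that importance-weighted KL step [not the paper's algorithm, which tempers the target and uses an actual flow]: a plain Gaussian stands in for the distilled density, a fixed unnormalised density for the tempered posterior, and self-normalised weights feed the gradient in the (mean, log-scale) parameters:

```python
import numpy as np

rng = np.random.default_rng(1)

def log_target(x):
    # unnormalised log density of the (tempered) target; a toy N(3, 0.5^2) here
    return -0.5 * ((x - 3.0) / 0.5) ** 2

def log_q(x, mu, log_sigma):
    # log density of the approximating family, a plain Gaussian standing in for a flow
    return (-0.5 * ((x - mu) / np.exp(log_sigma)) ** 2
            - log_sigma - 0.5 * np.log(2.0 * np.pi))

mu, log_sigma, lr = 0.0, 0.0, 0.05
for _ in range(2000):
    x = mu + np.exp(log_sigma) * rng.standard_normal(200)  # sample from q
    logw = log_target(x) - log_q(x, mu, log_sigma)         # importance weights
    w = np.exp(logw - logw.max())
    w /= w.sum()                                           # self-normalise
    # gradient of KL(target || q) in (mu, log_sigma) is -E_target[grad log q],
    # estimated here by (self-normalised) importance sampling from q
    z = (x - mu) / np.exp(log_sigma)
    grad_mu = -np.sum(w * z / np.exp(log_sigma))
    grad_ls = -np.sum(w * (z ** 2 - 1.0))
    mu -= lr * grad_mu
    log_sigma -= lr * grad_ls

print(mu, np.exp(log_sigma))  # should end up near (3, 0.5)
```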
December 13, 2019 at 12:30 pm
also normalising flows: https://arxiv.org/pdf/1912.06073.pdf
November 13, 2019 at 8:23 pm
Hi Xian, thanks for the post on this paper!
Normalising flows are a fun area, producing machine learning generative models with a rigorous probabilistic foundation. Another good blog post on them is:
http://akosiorek.github.io/ml/2018/04/03/norm_flows.html
The target distribution in the paper’s queueing example is the ABC posterior for a Gaussian ABC kernel. So the final approximate posterior (in Figure 3) is an ABC approximation, but for a very small value of the bandwidth parameter. The other example (sinusoidal) uses a different tempering scheme, but the details are less important in the end as the algorithm ends up targeting the exact posterior.
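For reference, the Gaussian-kernel ABC posterior in question takes the standard form

$\pi_\epsilon(\theta \mid y) \propto \pi(\theta) \int p(x \mid \theta)\, \exp\{-\|x-y\|^2/(2\epsilon^2)\}\, dx$,

with bandwidth $\epsilon$, which recovers the exact posterior as $\epsilon \to 0$.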
As you note, the method is limited to fairly low dimensional target distributions (and also targets that factorise in a nice way – more on this in the next version!). Hopefully this can be improved to some extent in future by plugging in more advanced normalising flows. There’s also a lot of scope to improve other aspects of the algorithm. As you mention, variance reduction of the gradient estimator would be great, and many other ideas could be adapted from SMC, adaptive importance sampling, the cross-entropy method, etc.