## distilling importance

Posted in Books, Statistics, University life on November 13, 2019 by xi'an

As I was about to leave Warwick at the end of last week, I noticed a new arXival by Dennis Prangle, distilling importance sampling. In connection with [our version of] population Monte Carlo, “each step of [Dennis’] distilled importance sampling method aims to reduce the Kullback Leibler (KL) divergence from the distilled density to the current tempered posterior.”  (The introduction of the paper points out various connections with ABC, conditional density estimation, adaptive importance sampling, X entropy, &tc.)

“An advantage of [distilled importance sampling] over [likelihood-free] methods is that it performs inference on the full data, without losing information by using summary statistics.”

A notion used therein that I had not heard of before is that of normalising flows, apparently more common in machine learning and in particular with GANs. (The slide below is from Shakir Mohamed and Danilo Rezende.) The notion is to represent an arbitrary variable as the bijective transform of a standard variate like a N(0,1) variable or a U(0,1) variable (recalling the inverse cdf transform). The only link I can think of is perfect sampling, where the representation of all simulations as a function of a white noise vector helps with coupling.
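As a minimal sketch of the change-of-variables idea (not code from the paper or from the slide), take a single bijection x = exp(μ + σz) applied to a N(0,1) variate z. The function names and the choice of transform are assumptions made for illustration; a genuine normalising flow would stack many such maps with trainable parameters.

```python
import math
import random


def forward(z, mu, sigma):
    """Push a standard normal variate z through the bijection x = exp(mu + sigma*z)."""
    return math.exp(mu + sigma * z)


def inverse(x, mu, sigma):
    """Recover the base variate z from x > 0."""
    return (math.log(x) - mu) / sigma


def log_density(x, mu, sigma):
    """Log-density of x by change of variables: log p_z(z) + log |dz/dx|."""
    z = inverse(x, mu, sigma)
    log_base = -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)
    log_abs_dz_dx = -math.log(sigma * x)  # dz/dx = 1 / (sigma * x), sigma > 0
    return log_base + log_abs_dz_dx


def sample(n, mu, sigma, rng):
    """Simulating from the flow only requires simulating the base variate."""
    return [forward(rng.gauss(0.0, 1.0), mu, sigma) for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(1)
    xs = sample(5, 0.0, 0.5, rng)
    for x in xs:
        print(x, math.exp(log_density(x, 0.0, 0.5)))
```

Both simulation and density evaluation are cheap here, which is the property that makes such transforms usable as importance functions.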

I read a blog entry by Eric Jang on the topic (who produced this slide among other things) but did not emerge much the wiser, as the text instantaneously moves from the Jacobian formula to TensorFlow code… In Dennis’ paper, it appears that the concept is appealing for quickly producing samples and providing a rich family of approximations, especially when neural networks are included as transforms. They are used to substitute for a tempered version of the posterior target, validated as importance functions and aiming at being the closest to this target in Kullback-Leibler divergence. With the importance function interpretation, unbiased estimators of the gradient [in the parameter of the normalising flow] can be derived, with potential variance reduction. What became clearer to me from reading the illustration section is that the prior × predictive joint can also be modelled this way, towards producing reference tables for ABC (or GANs) much faster than with the exact model. (I came across several proposals of that kind in the past months.) However, I expect mileage to vary depending on the size and dimension of the data. I also wonder at the connection between the (final) distribution simulated by distilled importance [the least tempered target?] and the ABC equivalent.
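To make the gradient statement concrete, here is a sketch under the assumption that the divergence being reduced is KL(p‖q_φ), with p the normalised tempered target, q_φ the normalising flow density, and λ any importance density dominating p. These symbols are introduced here for illustration and are not taken from the paper.

$\nabla_\phi \mathrm{KL}(p\,\|\,q_\phi) = -\mathbb{E}_{p}\left[\nabla_\phi \log q_\phi(\theta)\right] = -\mathbb{E}_{\lambda}\left[\dfrac{p(\theta)}{\lambda(\theta)}\,\nabla_\phi \log q_\phi(\theta)\right]$

Averaging the bracketed term over draws θ_i∼λ is thus unbiased for the gradient. When p is only known up to a constant, the same average with unnormalised weights estimates the gradient up to a positive multiplicative constant, which leaves the descent direction unchanged.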

## spacings on a torus

Posted in Books, Kids, R, Statistics, University life on March 22, 2018 by xi'an

While in Brussels last week I noticed an interesting question on X validated that I considered on the train back home and then more over the weekend. This is a question about spacings, namely how long on average does it take to cover an interval of length L when drawing unit intervals at random (with a torus handling of the endpoints)? Which immediately reminded me of Wilfrid Kendall’s (Warwick) famous gif animation of coupling from the past via leaves covering a square region, from the top (forward) and from the bottom (backward)…

The problem is rather easily expressed in terms of uniform spacings, more specifically on the maximum spacing being less than 1 (or 1/L depending on the parameterisation). Except for the additional constraint at the boundary, which is not independent of the other spacings. Replacing this extra event with an independent spacing, there exists a direct formula for the expected stopping time, which can be checked rather easily by simulation. But the exact case appears to add a few more steps to the draws, 3/2 apparently. The following graph displays the regression of the Monte Carlo number of steps over 10⁴ replicas against the exact values:
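A minimal Monte Carlo sketch of the covering experiment, assuming a circle of circumference L and unit arcs whose starting points are uniform on the circle. The function names are hypothetical and this is not the code behind the graph; it only estimates the stopping time by simulation, without the closed-form value.

```python
import random


def is_covered(starts, L, length=1.0):
    """True when arcs of the given length starting at `starts` cover the circle.

    Coverage holds exactly when no circular gap between consecutive
    starting points exceeds the arc length.
    """
    s = sorted(starts)
    gaps = [b - a for a, b in zip(s, s[1:])]
    gaps.append(s[0] + L - s[-1])  # wrap-around gap (torus handling)
    return max(gaps) <= length


def cover_time(L, rng):
    """Number of uniform unit arcs drawn until the circle is covered."""
    starts = []
    while True:
        starts.append(rng.uniform(0.0, L))
        if is_covered(starts, L):
            return len(starts)


def mean_cover_time(L, n_rep, rng):
    """Monte Carlo average of the covering time over n_rep replicas."""
    return sum(cover_time(L, rng) for _ in range(n_rep)) / n_rep


if __name__ == "__main__":
    rng = random.Random(2018)
    for L in (2.0, 5.0, 10.0):
        print(L, mean_cover_time(L, 1000, rng))
```

Re-sorting the starting points at every draw is wasteful but keeps the sketch short; an ordered insertion would do for larger L.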

## convergences of MCMC and unbiasedness

Posted in pictures, Statistics, University life on January 16, 2018 by xi'an

During his talk on unbiased MCMC in Dauphine today, Pierre Jacob provided a nice illustration of the convergence modes of MCMC algorithms, with the stationary target achieved after 100 Metropolis iterations, while the mean of the target takes many more iterations to be approximated by the empirical average. Plus a nice connection between coupling time and convergence. Convergence to the target.

During Pierre’s talk, some simple questions came to mind, from developing an “impatient user version”, as in perfect sampling, in order to stop chains that run “forever”, to optimising parallelisation in order to avoid problems of asynchronicity. While the complexity of coupling increases with dimension and the coupling probability goes down, the average coupling time varies, but an unexpected figure is that the expected cost per iteration is 2 simulations, irrespective of the chosen kernels. Pierre also made a connection with optimal transport coupling and stressed that the maximal coupling was for the proposal and not for the target.

## unbiased MCMC

Posted in Books, pictures, Statistics, Travel, University life on August 25, 2017 by xi'an

Two weeks ago, Pierre Jacob, John O’Leary, and Yves F. Atchadé arXived a paper on unbiased MCMC with coupling. Associating MCMC with unbiasedness is rather challenging since MCMC algorithms rarely produce simulations from the exact target, unless specific tools like renewal can be produced in an efficient manner. (I supported the use of such renewal techniques as early as 1995, but later experiments led me to think renewal control was too rare an occurrence to consider it as a generic convergence assessment method.)

This new paper makes me think I had given up too easily! Here the central idea is coupling of two (MCMC) chains, associated with the debiasing formula used by Glynn and Rhee (2014) and already discussed here. Having the coupled chains meet at some time with probability one implies that the debiasing formula does not need a (random) stopping time. The coupling time is sufficient. Furthermore, several estimators can be derived from the same coupled Markov chain simulations, obtained by starting the averaging at a later time than the first iteration. The average of these (unbiased) averages results in a weighted estimate that puts more weight on the later differences. Although coupling is also at the basis of perfect simulation methods, the analogy between this debiasing technique and perfect sampling is hard to fathom, since the coupling of two chains is not a perfect sampling instant. (Something obvious only in retrospect for me is that the variance of the resulting unbiased estimator is at best the variance of the original MCMC estimator.)
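A sketch of the formulas involved, in notation assumed here rather than taken from the post: h is the function of interest, (X_t) and (Y_t) are the two chains, coupled with a lag of one so that X_t = Y_{t-1} for all t ≥ τ, and τ is the meeting time. The debiased estimator started at time k then reads

$H_k(X,Y) = h(X_k) + \sum_{t=k+1}^{\tau-1} \left\{h(X_t) - h(Y_{t-1})\right\}$

and averaging $H_k,\ldots,H_m$ gives

$H_{k:m}(X,Y) = \dfrac{1}{m-k+1}\sum_{t=k}^{m} h(X_t) + \sum_{t=k+1}^{\tau-1} \min\left(1,\dfrac{t-k}{m-k+1}\right)\left\{h(X_t) - h(Y_{t-1})\right\}$

The first term is the usual MCMC average with burn-in k, and the second is a bias correction. The difference at time t appears in each of H_k,…,H_{min(m,t-1)}, hence the weight min(1,(t-k)/(m-k+1)), which grows with t.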

When discussing the implementation of coupling in Metropolis and Gibbs settings, the authors give a simple optimal coupling algorithm I was not aware of. Which is a form of accept-reject also found in perfect sampling I believe. (Renewal based on small sets makes an appearance on page 11.) I did not fully understand the way two random walk Metropolis steps are coupled, in that the normal proposals seem at odds with the boundedness constraints. But coupling is clearly working in this setting, while renewal does not. In toy examples like the (Efron and Morris!) baseball data and the (Gelfand and Smith!) pump failure data, the parameters k and m of the algorithm can be optimised against the variance of the averaged averages. And this approach proves highly useful in the case of the cut distribution, a problem which I became aware of during MCMskiii and on which we are currently working with Pierre and others.
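For illustration, here is a standard accept-reject sampler from a maximal coupling of two normal distributions with a common standard deviation. That it matches the authors' algorithm line by line is an assumption, and the function names are hypothetical.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of the N(mean, sd^2) distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def maximal_coupling(mean_p, mean_q, sd, rng):
    """Return (x, y) with x ~ N(mean_p, sd^2), y ~ N(mean_q, sd^2),
    and the probability of x == y as large as possible."""
    x = rng.gauss(mean_p, sd)
    w = rng.random() * normal_pdf(x, mean_p, sd)
    if w <= normal_pdf(x, mean_q, sd):
        return x, x  # the point falls under both densities: the draws meet
    while True:
        # otherwise draw y from the part of q not covered by p
        y = rng.gauss(mean_q, sd)
        w = rng.random() * normal_pdf(y, mean_q, sd)
        if w > normal_pdf(y, mean_p, sd):
            return x, y


if __name__ == "__main__":
    rng = random.Random(42)
    draws = [maximal_coupling(0.0, 1.0, 1.0, rng) for _ in range(10000)]
    print(sum(x == y for x, y in draws) / len(draws))
```

The meeting probability is the overlap ∫min(p,q) between the two densities. The loop is entered with probability one minus the overlap and each pass succeeds with that same probability, so the sampler uses two draws on average whatever the two distributions.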

Posted in Books, Statistics, University life on October 27, 2016 by xi'an

In the March 2016 issue of JASA that currently sits on my desk, there is a paper by Liang, Jin, Song and Liu on the adaptive exchange algorithm, which aims at handling posteriors for sampling distributions with intractable normalising constants. The concept behind the algorithm is the exchange principle initiated by Jesper Møller and co-authors in 2006, where an auxiliary pseudo-observation is simulated for the missing constants to vanish in a Metropolis-Hastings ratio. (The name exchange was introduced in a subsequent paper by Iain Murray, Zoubin Ghahramani and David MacKay, also in 2006.)

The crux of the method is to run an iteration as [where y denotes the observation]

1. Propose a new value θ’ of the parameter from a proposal q(θ’|θ);
2. Generate a pseudo-observation z~ƒ(z|θ’);
3. Accept with probability

$\dfrac{\pi(\theta')f(y|\theta')}{\pi(\theta)f(y|\theta)}\dfrac{q(\theta|\theta')f(z|\theta)}{q(\theta'|\theta)f(z|\theta')}$

which has the appeal of cancelling all normalising constants. And the drawback of requiring an exact simulation from the very distribution with the missing constant, ƒ(.|θ). Which means that in practice a finite number of MCMC steps will be used and will bias the outcome. The algorithm is unusual in that it replaces the exact proposal q(θ’|θ) with an unbiased random version q(θ’|θ)ƒ(z|θ’), z being just an augmentation of the proposal. (The current JASA paper by Liang et al. seems to confuse augment and argument, see p.378.)
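The three steps above can be sketched on a toy case where exact simulation is available, so that the output can be checked against a known posterior. Everything specific here is an assumption for illustration, not the examples of the paper: i.i.d. exponential observations with unnormalised density exp(-θy), the constant 1/θ being treated as unknown, an Exp(1) prior on θ, and a symmetric normal random walk proposal so that q cancels.

```python
import math
import random


def exchange_sampler(y, n_iter, step, rng):
    """Exchange algorithm for the rate theta of exponential data.

    Only the unnormalised density g(y|theta) = exp(-theta * y) is used
    in the acceptance ratio, plus exact pseudo-data z ~ f(.|theta').
    """
    n, sum_y = len(y), sum(y)
    theta = 1.0
    chain = []
    for _ in range(n_iter):
        prop = theta + rng.gauss(0.0, step)  # step 1: propose theta'
        if prop > 0.0:  # proposals outside the prior support are rejected
            # step 2: pseudo-observations of the same size as the data
            sum_z = sum(rng.expovariate(prop) for _ in range(n))
            # step 3: log of pi(t')g(y|t')g(z|t) / {pi(t)g(y|t)g(z|t')}
            log_ratio = (prop - theta) * (sum_z - sum_y - 1.0)
            if math.log(1.0 - rng.random()) < log_ratio:
                theta = prop
        chain.append(theta)
    return chain


if __name__ == "__main__":
    data = [0.5, 1.2, 0.3, 0.8, 2.0]
    chain = exchange_sampler(data, 50000, 0.8, random.Random(1))
    kept = chain[5000:]
    print(sum(kept) / len(kept))  # exact posterior mean is (n+1)/(1+sum(y))
```

In this conjugate setting the posterior is a Gamma(n+1, 1+Σy) distribution, which gives a reference value for the chain average. In realistic models the exact draw of z is precisely the step that is unavailable.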

To avoid the difficulty in simulating from ƒ(.|θ), the authors draw pseudo-observations from sampling distributions with a finite number m of parameter values under the [unrealistic] assumption (A⁰) that this collection of values provides an almost complete cover of the posterior support. One of the tricks lies in an auxiliary [time-heterogeneous] chain of pseudo-observations generated by single Metropolis steps from one of these m fixed targets. These pseudo-observations are then used in the main (or target) chain to define the above exchange probability. The auxiliary chain is Markov but time-heterogeneous since the probabilities of accepting a move are evolving with time according to a simulated annealing schedule. Which produces a convergent estimate of the m normalising constants. The main chain is not Markov in that it depends on the whole history of the auxiliary chain [see Step 5, p.380]. Even jointly the collection of both chains is not Markov. The paper prefers to consider the process as an adaptive Markov chain. I did not check the rather intricate details, so cannot judge the validity of the overall algorithm; I simply note that one condition (A², p.383) is incredibly strong in that it assumes the Markov transition kernel to be Doeblin uniformly on any compact set of the calibration parameters. However, the major difficulty with this approach seems to be in its delicate calibration. From providing a reference set of m parameter values scanning the posterior support to picking transition kernels on both the parameter and the sample spaces, to properly cooling the annealing schedule [always a fun part!], there seems to be [from my armchair expert’s perspective, of course!] a wide range of opportunities for missing the target or running into zero acceptance problems.

Both examples analysed in the paper, the auto-logistic and the auto-normal models, are actually of limited complexity in that they depend on a few parameters, 2 and 4 resp., and enjoy sufficient statistics, of dimensions 2 and 4 as well. Hence simulating (pseudo-)realisations of those sufficient statistics should be less challenging than the original approach replicating an entire vector of thousands of dimensions.