## distilling importance

Posted in Books, Statistics, University life with tags , , , , , , , , , , on November 13, 2019 by xi'an

As I was about to leave Warwick at the end of last week, I noticed a new arXival by Dennis Prangle, distilling importance sampling. In connection with [our version of] population Monte Carlo, “each step of [Dennis’] distilled importance sampling method aims to reduce the Kullback Leibler (KL) divergence from the distilled density to the current tempered posterior.”  (The introduction of the paper points out various connections with ABC, conditional density estimation, adaptive importance sampling, X entropy, &tc.)

“An advantage of [distilled importance sampling] over [likelihood-free] methods is that it performs inference on the full data, without losing information by using summary statistics.”

A notion used therein I had not heard before is the one of normalising flows, apparently more common in machine learning and in particular with GANs. (The slide below is from Shakir Mohamed and Danilo Rezende.) The  notion is to represent an arbitrary variable as the bijective transform of a standard variate like a N(0,1) variable or a U(0,1) variable (calling the inverse cdf transform). The only link I can think of is perfect sampling where the representation of all simulations as a function of a white noise vector helps with coupling.

I read a blog entry by Eric Jang on the topic (who produced this slide among other things) but did not emerge much the wiser. As the text instantaneously moves from the Jacobian formula to TensorFlow code… In Dennis’ paper, it appears that the concept is appealing for quickly producing samples and providing a rich family of approximations, especially when neural networks are included as transforms. They are used to substitute for a tempered version of the posterior target, validated as importance functions and aiming at being the closest to this target in Kullback-Leibler divergence. With the importance function interpretation, unbiased estimators of the gradient [in the parameter of the normalising flow] can be derived, with potential variance reduction. What became clearer to me from reading the illustration section is that the prior x predictive joint can also be modeled this way towards producing reference tables for ABC (or GANs) much faster than with the exact model. (I came across several proposals of that kind in the past months.) However, I deem mileage should vary depending on the size and dimension of the data. I also wonder at the connection between the (final) distribution simulated by distilled importance [the least tempered target?] and the ABC equivalent.

## spacings on a torus

Posted in Books, Kids, R, Statistics, University life with tags , , , , , , , , on March 22, 2018 by xi'an

While in Brussels last week I noticed an interesting question on X validated that I considered in the train back home and then more over the weekend. This is a question about spacings, namely how long on average does it take to cover an interval of length L when drawing unit intervals at random (with a torus handling of the endpoints)? Which immediately reminded me of Wilfrid Kendall (Warwick) famous gif animation of coupling from the past via leaves covering a square region, from the top (forward) and from the bottom (backward)…

The problem is rather easily expressed in terms of uniform spacings, more specifically on the maximum spacing being less than 1 (or 1/L depending on the parameterisation). Except for the additional constraint at the boundary, which is not independent of the other spacings. Replacing this extra event with an independent spacing, there exists a direct formula for the expected stopping time, which can be checked rather easily by simulation. But the exact case appears to be add a few more steps to the draws, 3/2 apparently. The following graph displays the regression of the Monte Carlo number of steps over 10⁴ replicas against the exact values:

## convergences of MCMC and unbiasedness

Posted in pictures, Statistics, University life with tags , , , , , , , , , on January 16, 2018 by xi'an

During his talk on unbiased MCMC in Dauphine today, Pierre Jacob provided a nice illustration of the convergence modes of MCMC algorithms. With the stationary target achieved after 100 Metropolis iterations, while the mean of the target taking much more iterations to be approximated by the empirical average. Plus a nice connection between coupling time and convergence. Convergence to the target.During Pierre’s talk, some simple questions came to mind, from developing an “impatient user version”, as in perfect sampling, in order  to stop chains that run “forever”,  to optimising parallelisation in order to avoid problems of asynchronicity. While the complexity of coupling increases with dimension and the coupling probability goes down, the average coupling time varies but an unexpected figure is that the expected cost per iteration is of 2 simulations, irrespective of the chosen kernels. Pierre also made a connection with optimal transport coupling and stressed that the maximal coupling was for the proposal and not for the target.

## unbiased MCMC

Posted in Books, pictures, Statistics, Travel, University life with tags , , , , , , , on August 25, 2017 by xi'an

Two weeks ago, Pierre Jacob, John O’Leary, and Yves F. Atchadé arXived a paper on unbiased MCMC with coupling. Associating MCMC with unbiasedness is rather challenging since MCMC are rarely producing simulations from the exact target, unless specific tools like renewal can be produced in an efficient manner. (I supported the use of such renewal techniques as early as 1995, but later experiments led me to think renewal control was too rare an occurrence to consider it as a generic convergence assessment method.)

This new paper makes me think I had given up too easily! Here the central idea is coupling of two (MCMC) chains, associated with the debiasing formula used by Glynn and Rhee (2014) and already discussed here. Having the coupled chains meet at some time with probability one implies that the debiasing formula does not need a (random) stopping time. The coupling time is sufficient. Furthermore, several estimators can be derived from the same coupled Markov chain simulations, obtained by starting the averaging at a later time than the first iteration. The average of these (unbiased) averages results into a weighted estimate that weights more the later differences. Although coupling is also at the basis of perfect simulation methods, the analogy between this debiasing technique and perfect sampling is hard to fathom, since the coupling of two chains is not a perfect sampling instant. (Something obvious only in retrospect for me is that the variance of the resulting unbiased estimator is at best the variance of the original MCMC estimator.)

When discussing the implementation of coupling in Metropolis and Gibbs settings, the authors give a simple optimal coupling algorithm I was not aware of. Which is a form of accept-reject also found in perfect sampling I believe. (Renewal based on small sets makes an appearance on page 11.) I did not fully understood the way two random walk Metropolis steps are coupled, in that the normal proposals seem at odds with the boundedness constraints. But coupling is clearly working in this setting, while renewal does not. In toy examples like the (Efron and Morris!) baseball data and the (Gelfand and Smith!) pump failure data, the parameters k and m of the algorithm can be optimised against the variance of the averaged averages. And this approach comes highly useful in the case of the cut distribution,  a problem which I became aware of during MCMskiii and on which we are currently working with Pierre and others.

Posted in Books, Statistics, University life with tags , , , , , , , , , , on October 27, 2016 by xi'an

In the March 2016 issue of JASA that currently sits on my desk, there is a paper by Liang, Jim, Song and Liu on the adaptive exchange algorithm, which aims at handling posteriors for sampling distributions with intractable normalising constants. The concept behind the algorithm is the exchange principle initiated by Jesper Møller and co-authors in 2006, where an auxiliary pseudo-observation is simulated for the missing constants to vanish in a Metropolis-Hastings ratio. (The name exchangeable was introduced in a subsequent paper by Iain Murray, Zoubin Ghahramani and David MacKay, also in 2006.)

The crux of the method is to run an iteration as [where y denotes the observation]

1. Proposing a new value θ’ of the parameter from a proposal q(θ’|θ);
2. Generate a pseudo-observation z~ƒ(z|θ’);
3. Accept with probability

$\dfrac{\pi(\theta')f(y|\theta')}{\pi(\theta)f(y|\theta)}\dfrac{q(\theta|\theta')f(z|\theta)}{q(\theta'|\theta)f(z|\theta')}$

which has the appeal to cancel all normalising constants. And the repeal of requiring an exact simulation from the very distribution with the missing constant, ƒ(.|θ). Which means that in practice a finite number of MCMC steps will be used and will bias the outcome. The algorithm is unusual in that it replaces the exact proposal q(θ’|θ) with an unbiased random version q(θ’|θ)ƒ(z|θ’), z being just an augmentation of the proposal. (The current JASA paper by Liang et al. seems to confuse augment and argument, see p.378.)

To avoid the difficulty in simulating from ƒ(.|θ), the authors draw pseudo-observations from sampling distributions with a finite number m of parameter values under the [unrealistic] assumption (A⁰) that this collection of values provides an almost complete cover of the posterior support. One of the tricks stands with an auxiliary [time-heterogeneous] chain of pseudo-observations generated by single Metropolis steps from one of these m fixed targets. These pseudo-observations are then used in the main (or target) chain to define the above exchange probability. The auxiliary chain is Markov but time-heterogeneous since the probabilities of accepting a move are evolving with time according to a simulated annealing schedule. Which produces a convergent estimate of the m normalising constants. The main chain is not Markov in that it depends on the whole history of the auxiliary chain [see Step 5, p.380]. Even jointly the collection of both chains is not Markov. The paper prefers to consider the process as an adaptive Markov chain. I did not check the rather intricate in details, so cannot judge of the validity of the overall algorithm; I simply note that one condition (A², p.383) is incredibly strong in that it assumes the Markov transition kernel to be Doeblin uniformly on any compact set of the calibration parameters. However, the major difficulty with this approach seems to be in its delicate calibration. From providing a reference set of m parameter values scanning the posterior support to picking transition kernels on both the parameter and the sample spaces, to properly cooling the annealing schedule [always a fun part!], there seems to be [from my armchair expert’s perspective, of course!] a wide range of opportunities for missing the target or running into zero acceptance problems. Both examples analysed in the paper, the auto-logistic and the auto-normal models, are actually of limited complexity in that they depend on a few parameters, 2 and 4 resp., and enjoy sufficient statistics, of dimensions 2 and 4 as well. Hence simulating (pseudo-)realisations of those sufficient statistics should be less challenging than the original approach replicating an entire vector of thousands of dimensions.

## retrospective Monte Carlo

Posted in pictures, Running, Statistics, Travel, University life with tags , , , , , , , , , , on July 12, 2016 by xi'an

The past week I spent in Warwick ended up with a workshop on retrospective Monte Carlo, which covered exact sampling, debiasing, Bernoulli factory problems and multi-level Monte Carlo, a definitely exciting package! (Not to mention opportunities to go climbing with some participants.) In particular, several talks focussed on the debiasing technique of Rhee and Glynn (2012) [inspired from von Neumann and Ulam, and already discussed in several posts here]. Including results in functional spaces, as demonstrated by a multifaceted talk by Sergios Agapiou who merged debiasing, deburning, and perfect sampling.

From a general perspective on unbiasing, while there exist sufficient conditions to ensure finite variance and aim at an optimal version, I feel a broader perspective should be adopted towards comparing those estimators with biased versions that take less time to compute. In a diffusion context, Chang-han Rhee presented a detailed argument as to why his debiasing solution achieves a O(√n) convergence rate in opposition the regular discretised diffusion, but multi-level Monte Carlo also achieves this convergence speed. We had a nice discussion about this point at the break, with my slow understanding that continuous time processes had much much stronger reasons for sticking to unbiasedness. At the poster session, I had the nice surprise of reading a poster on the penalty method I discussed the same morning! Used for subsampling when scaling MCMC.

On the second day, Gareth Roberts talked about the Zig-Zag algorithm (which reminded me of the cigarette paper brand). This method has connections with slice sampling but it is a continuous time method which, in dimension one, means running a constant velocity particle that starts at a uniform value between 0 and the maximum density value and proceeds horizontally until it hits the boundary, at which time it moves to another uniform. Roughly. More specifically, this approach uses piecewise deterministic Markov processes, with a radically new approach to simulating complex targets based on continuous time simulation. With computing times that [counter-intuitively] do not increase with the sample size.

Mark Huber gave another exciting talk around the Bernoulli factory problem, connecting with perfect simulation and demonstrating this is not solely a formal Monte Carlo problem! Some earlier posts here have discussed papers on that problem, but I was unaware of the results bounding [from below] the expected number of steps to simulate B(f(p)) from a (p,1-p) coin. If not of the open questions surrounding B(2p). The talk was also great in that it centred on recursion and included a fundamental theorem of perfect sampling! Not that surprising given Mark’s recent book on the topic, but exhilarating nonetheless!!!

The final talk of the second day was given by Peter Glynn, with connections with Chang-han Rhee’s talk the previous day, but with a different twist. In particular, Peter showed out to achieve perfect or exact estimation rather than perfect or exact simulation by a fabulous trick: perfect sampling is better understood through the construction of random functions φ¹, φ², … such that X²=φ¹(X¹), X³=φ²(X²), … Hence,

$X^t=\varphi^{t-1}\circ\varphi^{t-2}\circ\ldots\circ\varphi^{1}(X^1)$

which helps in constructing coupling strategies. However, since the φ’s are usually iid, the above is generally distributed like

$Y^t=\varphi^{1}\circ\varphi^{2}\circ\ldots\circ\varphi^{t-1}(X^1)$

which seems pretty similar but offers a much better concentration as t grows. Cutting the function composition is then feasible towards producing unbiased estimators and more efficient. (I realise this is not a particularly clear explanation of the idea, detailed in an arXival I somewhat missed. When seen this way, Y would seem much more expensive to compute [than X].)

## perfect sampling, just perfect!

Posted in Books, Statistics, University life with tags , , , , , , , , on January 19, 2016 by xi'an

Great news! Mark Huber (whom I’ve know for many years, so this review may be not completely objective!) has just written a book on perfect simulation! I remember (and still share) the excitement of the MCMC community when the first perfect simulation papers of Propp and Wilson (1995) came up on the (now deceased) MCMC preprint server, as it seemed then the ideal (perfect!) answer to critics of the MCMC methodology, plugging MCMC algorithms into a generic algorithm that eliminating burnin, warmup, and convergence issues… It seemed both magical, with the simplest argument: “start at T=-∞ to reach stationarity at T=0”, and esoteric (“why forward fails while backward works?!”), requiring simple random walk examples (and a java app by Jeff Rosenthal) to understand the difference (between backward and forward), as well as Wilfrid Kendall’s kids’ coloured wood cubes and his layer of leaves falling on the ground and seen from below… These were exciting years, with MCMC still in its infancy, and no goal seemed too far away! Now that years have gone, and that the excitement has clearly died away, perfect sampling can be considered in a more sedate manner, with pros and cons well-understood. This is why Mark Huber’s book is coming at a perfect time if any! It covers the evolution of the perfect sampling techniques, from the early coupling from the past to the monotonous versions, to the coalescence principles, with applications to spatial processes, to the variations on nested sampling and their use in doubly intractable distributions, with forays into the (fabulous) Bernoulli factory problem (a surprise for me, as Bernoulli factories are connected with unbiasedness, not stationarity! Even though my only fieldwork [with Randal Douc] in such factories was addressing a way to turn MCMC into importance sampling. The key is in the notion of approximate densities, introduced in Section 2.6.). The book is quite thorough with the probabilistic foundations of the different principles, with even “a [tiny weeny] little bit of measure theory.

Any imperfection?! Rather, only a (short too short!) reflection on the limitations of perfect sampling, namely that it cannot cover the simulation of posterior distributions in the Bayesian processing of most statistical models. Which makes the quote

“Distributions where the label of a node only depends on immediate neighbors, and where there is a chance of being able to ignore the neighbors are the most easily handled by perfect simulation protocols (…) Statistical models in particular tend to fall into this category, as they often do not wish to restrict the outcome too severely, instead giving the data a chance to show where the model is incomplete or incorrect.” (p.223)

just surprising, given the very small percentage of statistical models which can be handled by perfect sampling. And the downsizing of perfect sampling related papers in the early 2000’s. Which also makes the final and short section on the future of perfect sampling somewhat restricted in its scope.

So, great indeed!, a close to perfect entry to a decade of work on perfect sampling. If you have not heard of the concept before, consider yourself lucky to be offered such a gentle guidance into it. If you have dabbled with perfect sampling before, reading the book will be like meeting old friends and hearing about their latest deeds. More formally, Mark Huber’s book should bring you a new perspective on the topic. (As for me, I had never thought of connecting perfect sampling with accept reject algorithms.)