Archive for maximal coupling

maximal couplings of the Metropolis-Hastings algorithm

Posted in Statistics, University life with tags , , , , , , , , , on November 17, 2020 by xi'an

As a sequel to their JRSS B paper, John O’Leary, Guanyang Wang, and [my friend, co-author and former student!] Pierre E. Jacob have recently posted a follow-up paper on maximal coupling for Metropolis-Hastings algorithms, where maximal is to be understood in terms of the largest possible probability for the coupled chains to be equal, according to the bound set by the coupling inequality. It made me realise that there is a heap of very recent works in this area.

A question that came up when reading the paper with our PhD students is whether or not the coupled chains stay identical after meeting once. When facing two different targets this seems inevitable and indeed Lemma 2 seems to show that no. A strong lemma that does not [need to] state what happens outside the diagonal Δ.

One of the essential tricks is to optimise several kinds of maximal coupling, incl. one for the Bernoullesque choice of moving, as given on p.3.

Algorithm 1 came as a novelty to me as it first seemed (to me!) the two chains may never meet, but this was before I read the small prints of the transition (proposal) kernel being maximally coupled with itself. While Algorithm 2 may be the earliest example of Metropolis-Hastings coupling I have seen, namely in 1999 in Crete, in connection with a talk by Laird Breyer and Gareth Roberts at a workshop of our ESSS network. As explained by the authors, this solution is not always a maximal coupling for the reason that

min(q¹.q²) min(α¹,α²) ≤ min(q¹α¹,q²α²)

(with q for the transition kernel and α for the acceptance probability). Lemma 1 is interesting in that it describes the probability to un-meet (!) as the surface between one of the move densities and the minimum of the two.

The first solution is to couple by plain Accept-Reject with the first chain being the proposed value and if rejected [i.e. not in C] to generate from the remainder or residual of the second target, in a form of completion of acceptance-rejection (accept when above rather than below, i.e. in A or A’). This can be shown to be a maximal coupling. Another coupling using reflection residuals works better but requires some spherical structure in the kernel. A further coupling on the acceptance of the Metropolis-Hastings move seems to bring an extra degree of improvement.

In the introduction, the alternatives about the acceptance probability α(·,·), e.g. Metropolis-Hastings versus Barker, are mentioned but would it make a difference to the preferred maximal coupling when using one or the other?

A further comment is that, in larger dimensions, I mean larger than one!, a Gibbsic form of coupling could be considered. In which case it would certainly decrease the coupling probability but may still speed up the overall convergence by coupling more often. See “maximality is sometimes less important than other properties of a coupling, such as the contraction behavior when a meeting does not occur.” (p.8)

As a final pun, I noted that Vaserstein is not a typo, as Leonid Vaseršteĭn is a Russian-American mathematician, currently at Penn State.

unbiased MCMC with couplings [4pm, 26 Feb., Paris]

Posted in Books, pictures, Statistics, University life with tags , , , , , , , , , , , on February 24, 2020 by xi'an

On Wednesday, 26 February, Pierre Jacob (Havard U, currently visiting Paris-Dauphine) is giving a seminar on unbiased MCMC methods with couplings at AgroParisTech, bvd Claude Bernard, Paris 5ième, Room 32, at 4pm in the All about that Bayes seminar.

MCMC methods yield estimators that converge to integrals of interest in the limit of the number of iterations. This iterative asymptotic justification is not ideal; first, it stands at odds with current trends in computing hardware, with increasingly parallel architectures; secondly, the choice of “burn-in” or “warm-up” is arduous. This talk will describe recently proposed estimators that are unbiased for the expectations of interest while having a finite computing cost and a finite variance. They can thus be generated independently in parallel and averaged over. The method also provides practical upper bounds on the distance (e.g. total variation) between the marginal distribution of the chain at a finite step and its invariant distribution. The key idea is to generate “faithful” couplings of Markov chains, whereby pairs of chains coalesce after a random number of iterations. This talk will provide an overview of this line of research.

unbiased Hamiltonian Monte Carlo with couplings

Posted in Books, Kids, Statistics, University life with tags , , , , , , on October 25, 2019 by xi'an

In the June issue of Biometrika, which had been sitting for a few weeks on my desk under my teapot!, Jeremy Heng and Pierre Jacob published a paper on unbiased estimators for Hamiltonian Monte Carlo using couplings. (Disclaimer: I was not involved with the review or editing of this paper.) Which extends to HMC environments the earlier paper of Pierre Jacob, John O’Leary and Yves Atchadé, to be discussed soon at the Royal Statistical Society. The fundamentals are the same, namely that an unbiased estimator can be produced from a converging sequence of estimators and that it can be de facto computed if two Markov chains with the same marginal can be coupled. The issue with Hamiltonians is to figure out how to couple their dynamics. In the Gaussian case, it is relatively easy to see that two chains with the same initial momentum meet periodically. In general, there is contraction within a compact set (Lemma 1). The coupling extends to a time discretisation of the Hamiltonian flow by a leap-frog integrator, still using the same momentum. Which roughly amounts in using the same random numbers in both chains. When defining a relaxed meeting (!) where both chains are within δ of one another, the authors rely on a drift condition (8) that reminds me of the early days of MCMC convergence and seem to imply the existence of a small set “where the target distribution [density] is strongly log-concave”. And which makes me wonder if this small set could be used instead to create renewal events that would in turn ensure both stationarity and unbiasedness without the recourse to a second coupled chain. When compared on a Gaussian example with couplings on Metropolis-Hastings and MALA (Fig. 1), the coupled HMC sees hardly any impact of the dimension of the target (in the average coupling time), with a much lower value. However, I wonder at the relevance of the meeting time as an assessment of efficiency. In the sense that the coupling time is not a convergence time but reflects as well on the initial conditions. I acknowledge that this allows for an averaging over  parallel implementations but I remain puzzled by the statement that this leads to “estimators that are consistent in the limit of the number of replicates, rather than in the usual limit of the number of Markov chain iterations”, since a particularly poor initial distribution could on principle lead to a mode of the target being never explored or on the coupling time being ever so rarely too large for the computing abilities at hand.

convergences of MCMC and unbiasedness

Posted in pictures, Statistics, University life with tags , , , , , , , , , on January 16, 2018 by xi'an

During his talk on unbiased MCMC in Dauphine today, Pierre Jacob provided a nice illustration of the convergence modes of MCMC algorithms. With the stationary target achieved after 100 Metropolis iterations, while the mean of the target taking much more iterations to be approximated by the empirical average. Plus a nice connection between coupling time and convergence. Convergence to the target.During Pierre’s talk, some simple questions came to mind, from developing an “impatient user version”, as in perfect sampling, in order  to stop chains that run “forever”,  to optimising parallelisation in order to avoid problems of asynchronicity. While the complexity of coupling increases with dimension and the coupling probability goes down, the average coupling time varies but an unexpected figure is that the expected cost per iteration is of 2 simulations, irrespective of the chosen kernels. Pierre also made a connection with optimal transport coupling and stressed that the maximal coupling was for the proposal and not for the target.

correlation for maximal coupling

Posted in Books, Kids, pictures, R, Statistics, University life with tags , , , , , , on January 3, 2018 by xi'an

An interesting (if vaguely formulated) question on X validated: given two Gaussian variates that are maximally coupled, what is the correlation between these variates? The answer depends on the parameters of both Gaussian, with a correlation of one when both Gaussians are identical. Answering the question by simulation (as I could not figure out the analytical formula on Boxing Day…) led me back to Pierre Jacob’s entry on the topic on Statisfaction, where simulating the maximal coupling stems from the decompositions

p(x)=p(x)∧q(x)+{p(x)-p(x)∧q(x)}  and  q(x)=p(x)∧q(x)+{q(x)-p(x)∧q(x)}

and incidentally to the R function image.plot (from the R library fields) for including the side legend.