## maximal couplings of the Metropolis-Hastings algorithm

Posted in Statistics, University life with tags , , , , , , , , , on November 17, 2020 by xi'an

As a sequel to their JRSS B paper, John O’Leary, Guanyang Wang, and [my friend, co-author and former student!] Pierre E. Jacob have recently posted a follow-up paper on maximal coupling for Metropolis-Hastings algorithms, where maximal is to be understood in terms of the largest possible probability for the coupled chains to be equal, according to the bound set by the coupling inequality. It made me realise that there is a heap of very recent works in this area.

A question that came up when reading the paper with our PhD students is whether or not the coupled chains stay identical after meeting once. When facing two different targets this seems inevitable and indeed Lemma 2 seems to show that no. A strong lemma that does not [need to] state what happens outside the diagonal Δ.

One of the essential tricks is to optimise several kinds of maximal coupling, incl. one for the Bernoullesque choice of moving, as given on p.3.

Algorithm 1 came as a novelty to me as it first seemed (to me!) the two chains may never meet, but this was before I read the small prints of the transition (proposal) kernel being maximally coupled with itself. While Algorithm 2 may be the earliest example of Metropolis-Hastings coupling I have seen, namely in 1999 in Crete, in connection with a talk by Laird Breyer and Gareth Roberts at a workshop of our ESSS network. As explained by the authors, this solution is not always a maximal coupling for the reason that

min(q¹.q²) min(α¹,α²) ≤ min(q¹α¹,q²α²)

(with q for the transition kernel and α for the acceptance probability). Lemma 1 is interesting in that it describes the probability to un-meet (!) as the surface between one of the move densities and the minimum of the two.

The first solution is to couple by plain Accept-Reject with the first chain being the proposed value and if rejected [i.e. not in C] to generate from the remainder or residual of the second target, in a form of completion of acceptance-rejection (accept when above rather than below, i.e. in A or A’). This can be shown to be a maximal coupling. Another coupling using reflection residuals works better but requires some spherical structure in the kernel. A further coupling on the acceptance of the Metropolis-Hastings move seems to bring an extra degree of improvement.

In the introduction, the alternatives about the acceptance probability α(·,·), e.g. Metropolis-Hastings versus Barker, are mentioned but would it make a difference to the preferred maximal coupling when using one or the other?

A further comment is that, in larger dimensions, I mean larger than one!, a Gibbsic form of coupling could be considered. In which case it would certainly decrease the coupling probability but may still speed up the overall convergence by coupling more often. See “maximality is sometimes less important than other properties of a coupling, such as the contraction behavior when a meeting does not occur.” (p.8)

As a final pun, I noted that Vaserstein is not a typo, as Leonid Vaseršteĭn is a Russian-American mathematician, currently at Penn State.

## Hastings 50 years later

Posted in Books, pictures, Statistics, University life with tags , , , , , , , , , , , , on January 9, 2020 by xi'an

What is the exact impact of the Metropolis-Hastings algorithm on the field of Bayesian statistics? and what are the new tools of the trade? What I personally find the most relevant and attractive element in a review on the topic is the current role of this algorithm, rather than its past (his)story, since many such reviews have already appeared and will likely continue to appear. What matters most imho is how much the Metropolis-Hastings algorithm signifies for the community at large, especially beyond academia. Is the availability or unavailability of software like BUGS or Stan a help or an hindrance? Was Hastings’ paper the start of the era of approximate inference or the end of exact inference? Are the algorithm intrinsic features like Markovianity a fundamental cause for an eventual extinction because of the ensuing time constraint and the lack of practical guarantees of convergence and the illusion of a fully automated version? Or are emerging solutions like unbiased MCMC and asynchronous algorithms a beacon of hope?

In their Biometrika paper, Dunson and Johndrow (2019) recently wrote a celebration of Hastings’ 1970 paper in Biometrika, where they cover adaptive Metropolis (Haario et al., 1999; Roberts and Rosenthal, 2005), the importance of gradient based versions toward universal algorithms (Roberts and Tweedie, 1995; Neal, 2003), discussing the advantages of HMC over Langevin versions. They also recall the significant step represented by Peter Green’s (1995) reversible jump algorithm for multimodal and multidimensional targets, as well as tempering (Miasojedow et al., 2013; Woodard et al., 2009). They further cover intractable likelihood cases within MCMC (rather than ABC), with the use of auxiliary variables (Friel and Pettitt, 2008; Møller et al., 2006) and pseudo-marginal MCMC (Andrieu and Roberts, 2009; Andrieu and Vihola, 2016). They naturally insist upon the need to handle huge datasets, high-dimension parameter spaces, and other scalability issues, with links to unadjusted Langevin schemes (Bardenet et al., 2014; Durmus and Moulines, 2017; Welling and Teh, 2011). Similarly, Dunson and Johndrow (2019) discuss recent developments towards parallel MCMC and non-reversible schemes such as PDMP as highly promising, with a concluding section on the challenges of automatising and robustifying much further the said procedures, if only to reach a wider range of applications. The paper is well-written and contains a wealth of directions and reflections, including those in my above introduction. Here are some mostly disconnected directions I would have liked to see covered or more covered

1. convergence assessment today, e.g. the comparison of various approximation schemes
2. Rao-Blackwellisation and other post-processing improvements
3. other approximate inference tools than the pseudo-marginal MCMC
4. importance of the parameterisation of the problem for convergence
5. dimension issues and connection with quasi-Monte Carlo
6. constrained spaces of measure zero, as for instance matrix distributions imposing zeros outside a diagonal band
7. given the rise of the machine(-learners), are exploratory and intrinsically slow algorithms like MCMC doomed or can both fields feed one another? The section on optimisation could be expanded in that direction
8. the wasteful nature of the random walk feature of MCMC algorithms, as opposed to non-reversible kernels like HMC and other PDMPs, missing from the gradient based methods section (and can we once again learn from physicists?)
9. finer convergence issues and hence inference difficulties with complex MCMC algorithms like Gibbs samplers with incompatible conditionals
10. use of the Hastings ratio in other algorithms like ABC or EP (in link with the section on generalised Bayes)
11. adapting Metropolis-Hastings methods for emerging computing tools like GPUs and quantum computers

or possibly less covered, namely data augmentation put forward when it is a special case of auxiliary variables as in slice sampling and in earlier physics literature. For instance, both probit and logistic regressions do not truly require data augmentation and are more toy examples than really challenging applications. The approach of Carlin & Chib (1995) is another illustration, which has met with recent interest, despite requiring heavy calibration (just like RJMCMC). As well as a a somewhat awkward opposition between Gibbs and Hastings, in that I am not convinced that Gibbs does not remain ultimately necessary to handle high dimension problems, in the sense that the alternative solutions like Langevin, HMC, or PDMP, or…, are relying on Euclidean assumptions for the entire vector, while a direct product of Euclidean structures may prove more adequate.

## common derivation for Metropolis–Hastings and other MCMC algorithms

Posted in Books, pictures, Statistics, Travel, University life with tags , , , , , , , , , , , , on July 25, 2016 by xi'an

Khoa Tran and Robert Kohn from UNSW just arXived a paper on a comprehensive derivation of a large range of MCMC algorithms, beyond Metropolis-Hastings. The idea is to decompose the MCMC move into

1. a random completion of the current value θ into V;
2. a deterministic move T from (θ,V) to (ξ,W), where only ξ matters.

If this sounds like a new version of Peter Green’s completion at the core of his 1995 RJMCMC algorithm, it is because it is indeed essentially the same notion. The resort to this completion allows for a standard form of the Metropolis-Hastings algorithm, which leads to the correct stationary distribution if T is self-inverse. This representation covers Metropolis-Hastings algorithms, Gibbs sampling, Metropolis-within-Gibbs and auxiliary variables methods, slice sampling, recursive proposals, directional sampling, Langevin and Hamiltonian Monte Carlo, NUTS sampling, pseudo-marginal Metropolis-Hastings algorithms, and pseudo-marginal Hamiltonian  Monte Carlo, as discussed by the authors. Given this representation of the Markov chain through a random transform, I wonder if Peter Glynn’s trick mentioned in the previous post on retrospective Monte Carlo applies in this generic setting (as it could considerably improve convergence…)

## ergodicity of approximate MCMC chains with applications to large datasets

Posted in pictures, Statistics, Travel, University life with tags , , , , , , , , , , on August 31, 2015 by xi'an

Another arXived paper I read on my way to Warwick! And yet another paper written by my friend Natesh Pillai (and his co-author Aaron Smith, from Ottawa). The goal of the paper is to study the ergodicity and the degree of approximation of the true posterior distribution of approximate MCMC algorithms that recently flourished as an answer to “Big Data” issues… [Comments below are about the second version of this paper.] One of the most curious results in the paper is the fact that the approximation may prove better than the original kernel, in terms of computing costs! If asymptotically in the computing cost. There also are acknowledged connections with the approximative MCMC kernel of Pierre Alquier, Neal Friel, Richard Everitt and A Boland, briefly mentioned in an earlier post.

The paper starts with a fairly theoretical part, to follow with an application to austerity sampling [and, in the earlier version of the paper, to the Hoeffding bounds of Bardenet et al., both discussed earlier on the ‘Og, to exponential random graphs (the paper being rather terse on the description of the subsampling mechanism), to stochastic gradient Langevin dynamics (by Max Welling and Yee-Whye Teh), and to ABC-MCMC]. The assumptions are about the transition kernels of a reference Markov kernel and of one associated with the approximation, imposing some bounds on the Wasserstein distance between those kernels, K and K’. Results being generic, there is no constraint as to how K is chosen or on how K’ is derived from K. Except in Lemma 3.6 and in the application section, where the same proposal kernel L is used for both Metropolis-Hastings algorithms K and K’. While I understand this makes for an easier coupling of the kernels, this also sounds like a restriction to me in that modifying the target begs for a similar modification in the proposal, if only because the tails they are a-changin’

In the case of subsampling the likelihood to gain computation time (as discussed by Korattikara et al. and by Bardenet et al.), the austerity algorithm as described in Algorithm 2 is surprising as the average of the sampled data log-densities and the log-transform of the remainder of the Metropolis-Hastings probability, which seem unrelated, are compared until they are close enough.  I also find hard to derive from the different approximation theorems bounding exceedance probabilities a rule to decide on the subsampling rate as a function of the overall sample size and of the computing cost. (As a side if general remark, I remain somewhat reserved about the subsampling idea, given that it requires the entire dataset to be available at every iteration. This makes parallel implementations rather difficult to contemplate.)

## an extension of nested sampling

Posted in Books, Statistics, University life with tags , , , , , , , on December 16, 2014 by xi'an

I was reading [in the Paris métro] Hastings-Metropolis algorithm on Markov chains for small-probability estimation, arXived a few weeks ago by François Bachoc, Lionel Lenôtre, and Achref Bachouch, when I came upon their first algorithm that reminded me much of nested sampling: the following was proposed by Guyader et al. in 2011,

To approximate a tail probability P(H(X)>h),

• start from an iid sample of size N from the reference distribution;
• at each iteration m, select the point x with the smallest H(x)=ξ and replace it with a new point y simulated under the constraint H(y)≥ξ;
• stop when all points in the sample are such that H(X)>h;
• take

$\left(1-\dfrac{1}{N}\right)^{m-1}$

as the unbiased estimator of P(H(X)>h).

Hence, except for the stopping rule, this is the same implementation as nested sampling. Furthermore, Guyader et al. (2011) also take advantage of the bested sampling fact that, if direct simulation under the constraint H(y)≥ξ is infeasible, simulating via one single step of a Metropolis-Hastings algorithm is as valid as direct simulation. (I could not access the paper, but the reference list of Guyader et al. (2011) includes both original papers by John Skilling, so the connection must be made in the paper.) What I find most interesting in this algorithm is that it even achieves unbiasedness (even in the MCMC case!).