## parallel tempering on optimised paths

Posted in Statistics with tags , , , , , , , , , , , , , , , on May 20, 2021 by xi'an

Saifuddin Syed, Vittorio Romaniello, Trevor Campbell, and Alexandre Bouchard-Côté, whom I met and discussed with on my “last” trip to UBC, on December 2019, just arXived a paper on parallel tempering (PT), making the choice of tempering path an optimisation problem. They address the touchy issue of designing a sequence of tempered targets when the starting distribution π⁰, eg the prior, and the final distribution π¹, eg the posterior, are hugely different, eg almost singular.

“…theoretical analysis of reversible variants of PT has shown that adding too many intermediate chains can actually deteriorate performance (…) [while] on non reversible regime adding more chains is guaranteed to improve performances.”

The above applies to geometric combinations of π⁰ and π¹. Which “suffers from an arbitrarily suboptimal global communication barrier“, according to the authors (although the counterexample is not completely convincing since π⁰ and π¹ share the same variance). They propose a more non-linear form of tempering with constraints on the dependence of the powers on the temperature t∈(0,1).  Defining the global communication barrier as an average over temperatures of the rejection rate, the path characteristics (e.g., the coefficients of a spline function) can then be optimised in terms of this objective. And the temperature schedule is derived from the fact that the non-asymptotic round trip rate is maximized when the rejection rates are all equal. (As a side item, the technique exposed in the earlier tempering paper by Syed et al. was recently exploited for a night high resolution imaging of a black hole from the M87 galaxy.)

## ABC, anytime!

Posted in Books, pictures, Statistics, Travel, University life with tags , , , on January 18, 2021 by xi'an

Last June, Alix Marie d’Avigneau, Sumeet Singh, and Lawrence Murray arXived a paper on anytime ABC I intended to review right away but that sat till now on my virtual desk (and pile of to-cover-arXivals!). The notion of anytime MCMC was already covered in earlier ‘Og entries, but this anytime ABC version bypasses the problem of asynchronicity, namely, “randomly varying local move completion times when parallel tempering is implemented on a multi-processor computing resource”. The different temperatures are replaced by different tolerances in ABC. Since switches between tolerances are natural if a proposal for a given tolerance ε happens to be eligible for a lower tolerance ε’. And accounting for the different durations required to simulate a proposal under different tolerances to avoid the induced bias in the stationary distributions. Or the wait for other processors to complete their task. A drawback with the approach stands in calibrating the tolerance levels in advance (or via preliminary runs that may prove costly).

## general perspective on the Metropolis–Hastings kernel

Posted in Books, Statistics with tags , , , , , , , , , , , , , on January 14, 2021 by xi'an

[My Bristol friends and co-authors] Christophe Andrieu, and Anthony Lee, along with Sam Livingstone arXived a massive paper on 01 January on the Metropolis-Hastings kernel.

“Our aim is to develop a framework making establishing correctness of complex Markov chain Monte Carlo kernels a purely mechanical or algebraic exercise, while making communication of ideas simpler and unambiguous by allowing a stronger focus on essential features (…) This framework can also be used to validate kernels that do not satisfy detailed balance, i.e. which are not reversible, but a modified version thereof.”

A central notion in this highly general framework is, extending Tierney (1998), to see an MCMC kernel as a triplet involving a probability measure μ (on an extended space), an involution transform φ generalising the proposal step (i.e. þ²=id), and an associated acceptance probability ð. Then μ-reversibility occurs for

$\eth(\xi)\mu(\text{d}\xi)= \eth(\phi(\xi))\mu^{\phi}(\text{d}\xi)$

with the rhs involving the push-forward measure induced by μ and φ. And furthermore there is always a choice of an acceptance probability ð ensuring for this equality to happen. Interestingly, the new framework allows for mostly seamless handling of more complex versions of MCMC such as reversible jump and parallel tempering. But also non-reversible kernels, incl. for instance delayed rejection. And HMC, incl. NUTS. And pseudo-marginal, multiple-try, PDMPs, &c., &c. it is remarkable to see such a general theory emerging a this (late?) stage of the evolution of the field (and I will need more time and attention to understand its consequences).

## QuanTA

Posted in Books, pictures, Running, Statistics, University life with tags , , , , , , , on September 17, 2018 by xi'an

My Warwick colleagues Nick Tawn [who also is my most frequent accomplice to running, climbing and currying in Warwick!] and Gareth Robert have just arXived a paper on QuanTA, a new parallel tempering algorithm that Nick designed during his thesis at Warwick, which he defended last semester. Parallel tempering targets in parallel several powered (or power-tempered) versions of the target distribution. With proposed switches between adjacent targets. An improved version transforms the local values before operating the switches. Ideally, the transform should be the composition of the cdf and inverse cdf, but this is impossible. Linearising the transform is feasible, but does not agree with multimodality, which calls for local transforms. Which themselves call for the identification of the different modes. In QuanTA, they are identified by N parallel runs of the standard, or rather N/2 to avoid dependence issues, and K-means estimates. The paper covers the construction of an optimal scaling of temperatures, in that the difference between the temperatures is scaled [with order 1/√d] so that the acceptance rate for swaps is 0.234. Which in turns induces a practical if costly calibration of the temperatures, especially when the size of the jump is depending on the current temperature. However, this cost issue is addressed in the paper, resorting to the acceptance rate as a proxy for effective sample size and the acceptance rate over run time to run the comparison with regular parallel tempering, leading to strong improvements in the mixture examples examined in the paper. The use of machine learning techniques like K-means or more involved solutions is a promising thread in this exciting area of tempering, where intuition about high temperatures can be actually misleading. Because using the wrong scale means missing the area of interest, which is not the mode!

## love-hate Metropolis algorithm

Posted in Books, pictures, R, Statistics, Travel with tags , , , , , , , , , on January 28, 2016 by xi'an

Hyungsuk Tak, Xiao-Li Meng and David van Dyk just arXived a paper on a multiple choice proposal in Metropolis-Hastings algorithms towards dealing with multimodal targets. Called “A repulsive-attractive Metropolis algorithm for multimodality” [although I wonder why XXL did not jump at the opportunity to use the “love-hate” denomination!]. The proposal distribution includes a [forced] downward Metropolis-Hastings move that uses the inverse of the target density π as its own target, namely 1/{π(x)+ε}. Followed by a [forced] Metropolis-Hastings upward move which target is {π(x)+ε}. The +ε is just there to avoid handling ratios of zeroes (although I wonder why using the convention 0/0=1 would not work). And chosen as 10⁻³²³ by default in connection with R smallest positive number. Whether or not the “downward” move is truly downwards and the “upward” move is truly upwards obviously depends on the generating distribution: I find it rather surprising that the authors consider the same random walk density in both cases as I would have imagined relying on a more dispersed distribution for the downward move in order to reach more easily other modes. For instance, the downward move could have been based on an anti-Langevin proposal, relying on the gradient to proceed further down…

This special choice of a single proposal however simplifies the acceptance ratio (and keeps the overall proposal symmetric). The final acceptance ratio still requires a ratio of intractable normalising constants that the authors bypass by Møller et al. (2006) auxiliary variable trick. While the authors mention the alternative pseudo-marginal approach of Andrieu and Roberts (2009), they do not try to implement it, although this would be straightforward here: since the normalising constants are the probabilities of accepting a downward and an upward move, respectively. Those can easily be evaluated at a cost similar to the use of the auxiliary variables. That is,

– generate a few moves from the current value and record the proportion p of accepted downward moves;
– generate a few moves from the final proposed value and record the proportion q of accepted downward moves;

and replace the ratio of intractable normalising constants with p/q. It is not even clear that one needs those extra moves since the algorithm requires an acceptance in the downward and upward moves, hence generate Geometric variates associated with those probabilities p and q, variates that can be used for estimating them. From a theoretical perspective, I also wonder if forcing the downward and upward moves truly leads to an improved convergence speed. Considering the case when the random walk is poorly calibrated for either the downward or upward move, the number of failed attempts before an acceptance may get beyond the reasonable.

As XXL and David pointed out to me, the unusual aspect of the approach is that here the proposal density is intractable, rather than the target density itself. This makes using Andrieu and Roberts (2009) seemingly less straightforward. However, as I was reminded this afternoon at the statistics and probability seminar in Bristol, the argument for the pseudo-marginal based on an unbiased estimator is that w Q(w|x) has a marginal in x equal to π(x) when the expectation of w is π(x). In thecurrent problem, the proposal in x can extended into a proposal in (x,w), w P(w|x) whose marginal is the proposal on x.

If we complement the target π(x) with the conditional P(w|x), the acceptance probability would then involve

{π(x’) P(w’|x’) / π(x) P(w|x)} / {w’ P(w’|x’) / w P(w|x)} = {π(x’) / π(x)} {w/w’}

so it seems the pseudo-marginal (or auxiliary variable) argument also extends to the proposal. Here is a short experiment that shows no discrepancy between target and histogram:

nozero=1e-300
#love-hate move
move<-function(x){
bacwa=1;prop1=prop2=rnorm(1,x,2)
while (runif(1)>{pi(x)+nozero}/{pi(prop1)+nozero}){
prop1=rnorm(1,x,2);bacwa=bacwa+1}
while (runif(1)>{pi(prop2)+nozero}/{pi(prop1)+nozero})
prop2=rnorm(1,prop1,2)
y=x
if (runif(1)<pi(prop2)*bacwa/pi(x)/fowa){
y=prop2;assign("fowa",bacwa)}
return(y)}
#arbitrary bimodal target
pi<-function(x){.25*dnorm(x)+.75*dnorm(x,mean=5)}
#running the chain
T=1e5
x=5*rnorm(1);luv8=rep(x,T)
fowa=1;prop1=rnorm(1,x,2) #initial estimate
while (runif(1)>{pi(x)+nozero}/{pi(prop1)+nozero}){
fowa=fowa+1;prop1=rnorm(1,x,2)}
for (t in 2:T)
luv8[t]=move(luv8[t-1])