## almost reversed 2-lag Markov chain

Posted in Kids, R, Statistics with tags , , , , on July 7, 2021 by xi'an

Another simple riddle from the Riddler: take a binary sequence and associate to this sequence a score vector made of the numbers of consecutive ones from each position. If the sequence is ten step long and there are 3 ones located at random, what is the expected total score? (The original story is much more complex and involves as often strange sports!)

Adding two zeroes at time 11 and 12, this is quite simple to code, e.g.

```f=0*(1:10) #frequencies
for(v in 1:1e6){
r=0*f#reward
s=sample(1:10,3)
for(t in s)r[t]=1+((t+1)%in%s)*(1+((t+2)%in%s))
f[sum(r)]=f[sum(r)]+1}
f=f/1e6
```

and the outcome recovers the feature that the only possible scores are 1+1+1=3 (all ones separated), 1+1+2=4 (two ones contiguous),  and 1+2+3=6 (all ones contiguous). With respective frequencies 56/120, 56/120, and 8/120. With 120 being the number of possible locations of the 3 ones.

## non-reversible guided Metropolis–Hastings

Posted in Mountains, pictures, Statistics, Travel with tags , , , , , , , , , , , , on June 4, 2020 by xi'an Kengo Kamatani and Xiaolin Song, whom I visited in Osaka last summer in what seems like another reality!, just arXived another paper on a non-reversible Metropolis version. That exploits a group action and the associated Haar measure.

Following a proposal of Gustafson (1998), a ∆-guided Metropolis–Hastings kernel is based on a statistic ∆ that is totally ordered and determine the acceptance of a proposed value y~Q(x,.) by adding a direction (-,+) to the state space and moving from x if ∆x≤∆y in the positive direction and if ∆y≤∆x in the negative direction [with the standard Metropolis–Hastings acceptance probability]. The sign of the direction switches in case of a rejection. And the statistic ∆ is such that the proposal kernel Q(x,.) is unbiased, i.e., agnostic to the sign, i.e., it gives the same probability to ∆x≤∆y and ∆y≤∆x. This modification reduces the asymptotic variance compared with the original Metropolis–Hastings kernel.

To construct a random walk proposal that is unbiased, the authors assume that the ∆ transform takes values in a topological group, G, with Q further being invariant under the group actions. This can be constructed from a standard proposal by averaging the transforms of Q under all elements of the group over the associated right Haar measure. (Which I thought implied that the group is compact, except I forgot to account for the data update into a posterior..!) The worked-out example is based on a multivariate autoregressive kernel with ∆x being a rescaled non-central chi-squared variate. In dimension 24. The results show a clear improvement in effective sample size per second evaluation over off-the-shelf random walk and Hamiltonian Monte Carlo versions.

Seeing the Haar measure appearing in the setting of Markov chain Monte Carlo is fun!, as my last brush with it was not algorithmic. I would think the proposal only applies to settings where the components of the simulated vector are somewhat homogeneous in that the determinationthe determination of both the group action and a guiding statistic seem harder in cases where these components take different meaning (or live in a weird topology). I also lazily wonder if selecting the guiding statistic as a gradient of the log-target would have any interest.

## non-reversibility in discrete spaces

Posted in Books, Statistics, University life with tags , , , , , , , , , on January 3, 2020 by xi'an Following a recent JASA paper by Giacomo Zanella (which I have not yet read but is discussed on this blog), Sam Power and Jacob Goldman have recently arXived a paper on Accelerated sampling on discrete spaces with non-reversible Markov processes, where they use continuous-time, non-reversible algorithms à la PDMP, even though differential equations do not exist on discrete spaces. More specifically, they devise discrete versions of the coordinate sampler and of the Zig-Zag sampler, using Markov jump processes instead of differential equations, with detailed balance on the jump rate rather than the Markov kernel. A use of jump processes originating at least from Peskun (1973) and connected with MCMC algorithms in Matthew Stephens‘ 1999 PhD thesis. A neat thing about discrete settings is that the jump process can be implemented with no discretisation! However, as we noticed when working on birth-and-death processes with Olivier Cappé and Tobias Rydèn, there is a potential for disastrous implementation if an infinite sequence of instantaneous moves (out of zero probability states) is proposed.

The authors make the further assumption(s) that the discrete space is endowed with a graphical structure with a group G acting upon this graph, with an involution keeping the target (or a completion of the original target) invariant. In this framework, reversibility amounts to repeatedly using (group) generators þ with a low order (as in Bayesian variable selection, binary spin systems, where þ.þ=id, and other permutation problems), since they bring the chain back to its starting point. Their first sampler is called a Tabu sampler for avoiding such behaviour, forcing the next step to use other generators þ in the generator set Þ thanks to a binary auxiliary variable that partitions Þ into forward vs backward moves. For high order generators, the discrete coordinate and Zig-Zag samplers are instead repeatedly using the same generator (although it is unclear to me why this is beneficial, given that neither graph nor generator is not necessarily linked with the target). With the coordinate sampler being again much cheaper since it only looks at one direction in the generator group.

The paper contains a range of comparisons with (only) Zanella’s sampler, some presenting heavy gains in terms of ESS. Including one on hundreds of sensors in a football stadium. As I am not particularly familiar with these examples, except for the Bayesian variable selection one, I found it rather hard to determine whether or not the compared samplers were indeed exploring the entirety of the (highly complex and highly dimensional) target. The collection of examples is however quite rich and support the use of such non-reversible schemes. It may also be that the discrete nature of the target could facilitate the theoretical study of their convergence properties.

## stability of noisy Metropolis-Hastings

Posted in Statistics with tags , , , , , , on September 28, 2016 by xi'an Felipe Medina-Aguayo, Antony Lee and Gareth Roberts (all at Warwick University) have recently published—even though the paper was accepted a year ago—a paper in Statistics and Computing about a variant to the pseudo-marginal Metropolis-Hastings algorithm. The modification is to simulate an estimate of the likelihood or posterior at the current value of the Markov chain at every iteration, rather than reproducing the current estimate. The reason for this refreshment of the weight estimate is to prevent stickiness in the chain, when a random weight leads to a very high value of the posterior. Unfortunately, this change leads to a Markov chain with the wrong stationary distribution. When this stationary exists! The paper actually produces examples of transient noisy chains, even in simple cases such as a geometric target distribution. And even when taking the average of a large number of weights. But the paper also contains sufficient conditions, like negative weight moments or uniform ergodicity of the proposal, for the noisy chain to be geometrically ergodic. Even though the applicability of those conditions to complex targets is not always obvious.

## adaptive and delayed MCMC for expensive likelihoods [reply from the authors]

Posted in Books, Statistics with tags , , , , , , , , on October 29, 2015 by xi'an

[Chris Sherlock, Andrew Golightly and Daniel Henderson have written a reply about my earlier comments on their arXived paper which works better as a post than as a comment:]

Thank you for the constructive criticism of our paper. Our approach uses a simple weighted average of nearest neighbours and we agree that GPs offer a useful alternative. Both methods have pros and cons, however we first note a similarity: Kriging using a GP also leads to a weighted average of values.

The two most useful pros of the GP are that, (i) by estimating the parameters of the GP one may represent the scales of variability more accurately than a simple nearest neighbour approach with weighting according to Euclidean distance, and (ii) one obtains a distribution for the uncertainty in the Kriging estimate of the log-likelihood.

Both the papers in the blog entry (as well as other recent papers which use GPs), in one way or another take advantage of the second point. However, as acknowledged in Richard Wilkinson’s paper, estimating the parameters of a GP is computationally very costly, and this estimation must be repeated as the training data set grows. Probably for this reason and because of the difficulty in identifying p(p+1)/2 kernel range parameters, Wilkinson’s paper uses a diagonal covariance structure for the kernel. We can find no description of the structure of the covariance function that is used for each statistic in the Meeds & Welling paper but this issue is difficult to avoid.

Our initial training run is used to transform the parameters so that they are approximately orthogonal with unit variance and Euclidean distance is a sensible metric. This has two consequences: (i) the KD-tree is easier to set up and use, and (ii) the nearest neighbours in a KD-tree that is approximately balanced can be found in O(log N) operations, where N is the number of training points. Both (i) and (ii) only require Euclidean distance to be a reasonable measure, not perfect, so there is no need for the training run to have “properly converged”, just for it to represent the gross relationships in the posterior and for the transformation to be 1-1. We note a parallel between our approximate standardisation using training data, and the need to estimate a symmetric matrix of distance parameters from training data to obtain a fully representative GP kernel.

The GP approach might lead to a more accurate estimate of the posterior than a nearest neighbour approach (for a fixed number of training points), but this is necessary for the algorithms in the papers mentioned above since they sample from an approximation to the posterior. As noted in the blog post the delayed-acceptance step (which also could be added to GP-based algorithms) ensures that our algorithm samples from the true posterior so accuracy is helpful for efficiency rather than essential for validity.

We have made the kd-tree C code available and put some effort into making the interface straightforward to use. Our starting point is an existing simple MCMC algorithm; as it is already evaluating the posterior (or an unbiased approximation) then why not store this and take advantage of it within the existing algorithm? We feel that our proposal offers a relatively cheap and straightforward route for this.