## null recurrent = zero utility?

Posted in Books, R, Statistics with tags , , , , , , , on April 28, 2022 by xi'an

The stability result that the ratio

$\dfrac{\sum^T_{t=1} f(\theta^{(t)})}{\sum^T_{t=1} g(\theta^{(t)})}\qquad(1)$

converges holds for a Harris π-null-recurrent Markov chain for all functions f,g in L¹(π) [Meyn & Tweedie, 1993, Theorem 17.3.2] is rather fascinating. However, it is unclear it can be useful in simulation environments, as for the integral priors we have been studying over the years with Juan Antonio Cano and Diego Salmeron Martinez. Above, the result of an experiment where I simulated a Markov chain as a Normal random walk in dimension one, hence a Harris π-null-recurrent Markov chain for the Lebesgue measure λ, and monitored the stabilisation of the ratio (1) when using two densities for f and g,  to its expected value (1, shown by a red horizontal line). There is quite a variability in the outcome (repeated 100 times),  but the most intriguing is the quick stabilisation of most cumulated averages to values different from 1. Even longer runs display this feature

which I would blame on the excursions of the random walk far away from the central regions for both f and g, that is on long sequences where zeroes keep being added to numerator and denominators in (1). As far as integral approximation is concerned, this is not very helpful!

## almost reversed 2-lag Markov chain

Posted in Kids, R, Statistics with tags , , , , on July 7, 2021 by xi'an

Another simple riddle from the Riddler: take a binary sequence and associate to this sequence a score vector made of the numbers of consecutive ones from each position. If the sequence is ten step long and there are 3 ones located at random, what is the expected total score? (The original story is much more complex and involves as often strange sports!)

Adding two zeroes at time 11 and 12, this is quite simple to code, e.g.

f=0*(1:10) #frequencies
for(v in 1:1e6){
r=0*f#reward
s=sample(1:10,3)
for(t in s)r[t]=1+((t+1)%in%s)*(1+((t+2)%in%s))
f[sum(r)]=f[sum(r)]+1}
f=f/1e6


and the outcome recovers the feature that the only possible scores are 1+1+1=3 (all ones separated), 1+1+2=4 (two ones contiguous),  and 1+2+3=6 (all ones contiguous). With respective frequencies 56/120, 56/120, and 8/120. With 120 being the number of possible locations of the 3 ones.

## non-reversible guided Metropolis–Hastings

Posted in Mountains, pictures, Statistics, Travel with tags , , , , , , , , , , , , on June 4, 2020 by xi'an

Kengo Kamatani and Xiaolin Song, whom I visited in Osaka last summer in what seems like another reality!, just arXived another paper on a non-reversible Metropolis version. That exploits a group action and the associated Haar measure.

Following a proposal of Gustafson (1998), a ∆-guided Metropolis–Hastings kernel is based on a statistic ∆ that is totally ordered and determine the acceptance of a proposed value y~Q(x,.) by adding a direction (-,+) to the state space and moving from x if ∆x≤∆y in the positive direction and if ∆y≤∆x in the negative direction [with the standard Metropolis–Hastings acceptance probability]. The sign of the direction switches in case of a rejection. And the statistic ∆ is such that the proposal kernel Q(x,.) is unbiased, i.e., agnostic to the sign, i.e., it gives the same probability to ∆x≤∆y and ∆y≤∆x. This modification reduces the asymptotic variance compared with the original Metropolis–Hastings kernel.

To construct a random walk proposal that is unbiased, the authors assume that the ∆ transform takes values in a topological group, G, with Q further being invariant under the group actions. This can be constructed from a standard proposal by averaging the transforms of Q under all elements of the group over the associated right Haar measure. (Which I thought implied that the group is compact, except I forgot to account for the data update into a posterior..!) The worked-out example is based on a multivariate autoregressive kernel with ∆x being a rescaled non-central chi-squared variate. In dimension 24. The results show a clear improvement in effective sample size per second evaluation over off-the-shelf random walk and Hamiltonian Monte Carlo versions.

Seeing the Haar measure appearing in the setting of Markov chain Monte Carlo is fun!, as my last brush with it was not algorithmic. I would think the proposal only applies to settings where the components of the simulated vector are somewhat homogeneous in that the determinationthe determination of both the group action and a guiding statistic seem harder in cases where these components take different meaning (or live in a weird topology). I also lazily wonder if selecting the guiding statistic as a gradient of the log-target would have any interest.

## non-reversibility in discrete spaces

Posted in Books, Statistics, University life with tags , , , , , , , , , on January 3, 2020 by xi'an

Following a recent JASA paper by Giacomo Zanella (which I have not yet read but is discussed on this blog), Sam Power and Jacob Goldman have recently arXived a paper on Accelerated sampling on discrete spaces with non-reversible Markov processes, where they use continuous-time, non-reversible algorithms à la PDMP, even though differential equations do not exist on discrete spaces. More specifically, they devise discrete versions of the coordinate sampler and of the Zig-Zag sampler, using Markov jump processes instead of differential equations, with detailed balance on the jump rate rather than the Markov kernel. A use of jump processes originating at least from Peskun (1973) and connected with MCMC algorithms in Matthew Stephens‘ 1999 PhD thesis. A neat thing about discrete settings is that the jump process can be implemented with no discretisation! However, as we noticed when working on birth-and-death processes with Olivier Cappé and Tobias Rydèn, there is a potential for disastrous implementation if an infinite sequence of instantaneous moves (out of zero probability states) is proposed.

The authors make the further assumption(s) that the discrete space is endowed with a graphical structure with a group G acting upon this graph, with an involution keeping the target (or a completion of the original target) invariant. In this framework, reversibility amounts to repeatedly using (group) generators þ with a low order (as in Bayesian variable selection, binary spin systems, where þ.þ=id, and other permutation problems), since they bring the chain back to its starting point. Their first sampler is called a Tabu sampler for avoiding such behaviour, forcing the next step to use other generators þ in the generator set Þ thanks to a binary auxiliary variable that partitions Þ into forward vs backward moves. For high order generators, the discrete coordinate and Zig-Zag samplers are instead repeatedly using the same generator (although it is unclear to me why this is beneficial, given that neither graph nor generator is not necessarily linked with the target). With the coordinate sampler being again much cheaper since it only looks at one direction in the generator group.

The paper contains a range of comparisons with (only) Zanella’s sampler, some presenting heavy gains in terms of ESS. Including one on hundreds of sensors in a football stadium. As I am not particularly familiar with these examples, except for the Bayesian variable selection one, I found it rather hard to determine whether or not the compared samplers were indeed exploring the entirety of the (highly complex and highly dimensional) target. The collection of examples is however quite rich and support the use of such non-reversible schemes. It may also be that the discrete nature of the target could facilitate the theoretical study of their convergence properties.

## stability of noisy Metropolis-Hastings

Posted in Statistics with tags , , , , , , on September 28, 2016 by xi'an

Felipe Medina-Aguayo, Antony Lee and Gareth Roberts (all at Warwick University) have recently published—even though the paper was accepted a year ago—a paper in Statistics and Computing about a variant to the pseudo-marginal Metropolis-Hastings algorithm. The modification is to simulate an estimate of the likelihood or posterior at the current value of the Markov chain at every iteration, rather than reproducing the current estimate. The reason for this refreshment of the weight estimate is to prevent stickiness in the chain, when a random weight leads to a very high value of the posterior. Unfortunately, this change leads to a Markov chain with the wrong stationary distribution. When this stationary exists! The paper actually produces examples of transient noisy chains, even in simple cases such as a geometric target distribution. And even when taking the average of a large number of weights. But the paper also contains sufficient conditions, like negative weight moments or uniform ergodicity of the proposal, for the noisy chain to be geometrically ergodic. Even though the applicability of those conditions to complex targets is not always obvious.