## important Markov chains

Posted in Books, Statistics, University life with tags , , , , , , , , , , , on July 21, 2022 by xi'an

With Charly Andral (PhD, Paris Dauphine), Randal Douc, and Hugo Marival (PhD, Telecom SudParis), we just arXived a paper on importance Markov chains that merges importance sampling and MCMC. An idea already mentioned in Hastings (1970) and even earlier in Fodsick (1963), and later exploited in Liu et al.  (2003) for instance. And somewhat dual of the vanilla Rao-Backwellisation paper Randal and I wrote a (long!) while ago. Given a target π with a dominating measure π⁰≥Mπ, using a Markov kernel to simulate from this dominating measure and subsampling by the importance weight ρ does produce a new Markov chain with the desired target measure as invariant distribution. However, the domination assumption is rather unrealistic and a generic approach can be implemented without it, by defining an extended Markov chain, with the addition of the number N of replicas as the supplementary term… And a transition kernel R(n|x) on N with expectation ρ, which is a minimal(ist) assumption for the validation of the algorithm.. While this initially defines a semi-Markov chain, an extended Markov representation is also feasible, by decreasing N one by one until reaching zero, and this is most helpful in deriving convergence properties for the resulting chain, including a CLT.  While the choice of the kernel R is free, the optimal choice is associated with residual sampling, where only the fractional part of ρ is estimated by a Bernoulli simulation.

## null recurrent = zero utility?

Posted in Books, R, Statistics with tags , , , , , , , on April 28, 2022 by xi'an

The stability result that the ratio

$\dfrac{\sum^T_{t=1} f(\theta^{(t)})}{\sum^T_{t=1} g(\theta^{(t)})}\qquad(1)$

converges holds for a Harris π-null-recurrent Markov chain for all functions f,g in L¹(π) [Meyn & Tweedie, 1993, Theorem 17.3.2] is rather fascinating. However, it is unclear it can be useful in simulation environments, as for the integral priors we have been studying over the years with Juan Antonio Cano and Diego Salmeron Martinez. Above, the result of an experiment where I simulated a Markov chain as a Normal random walk in dimension one, hence a Harris π-null-recurrent Markov chain for the Lebesgue measure λ, and monitored the stabilisation of the ratio (1) when using two densities for f and g,  to its expected value (1, shown by a red horizontal line). There is quite a variability in the outcome (repeated 100 times),  but the most intriguing is the quick stabilisation of most cumulated averages to values different from 1. Even longer runs display this feature

which I would blame on the excursions of the random walk far away from the central regions for both f and g, that is on long sequences where zeroes keep being added to numerator and denominators in (1). As far as integral approximation is concerned, this is not very helpful!

## informed proposals for local MCMC in discrete spaces

Posted in Books, Kids, Statistics, University life with tags , , , , , , , , , , on April 17, 2020 by xi'an

Last year Giacomo Zanella published a paper entitled informed proposals for local MCMC in discrete spaces in JASA. Which I had missed somehow and only discovered through another paper, and which we recently discussed at Paris-Dauphine with graduate students, marooned by COVID-19 . Probability targets in discrete spaces are intrinsically hard[er] to simulate in my opinion if only because there is no natural distance, hence no natural neighbourhood. A random walk proposal like the reference kernel in the paper is not directly calibrated. Without demarginalisation there is neither a clear version of calculus for implementing MALA or HMC. What indeed is HMC on a discrete space? If this requires “embedding the binary space in a continuous space”, it does not sound very enticing if the construct is context dependent.

“This would allow for more moves to be accepted and longer moves to be performed, thus improving the algorithm’s efficiency.”

A interesting aspect of the paper is that for near atomic transition kernels K, informally for small σ’s, the proposal switch to Q finds target x normalising constant as new stationary and close to the actual target. Which incidentally reminded me of our vanilla Rao-Blackwellisation with Randal Douc. This however begets the worry that it may prove unwieldy in continuous cases, as except for Gaussian kernels, the  proposal switch to Q may prove intractable and requires further MCMC steps, in a form of infinite regress. Plus a musing that, were the original kernel K to be replaced with the new Q, another informed proposal transform could be applied to Q. Further infinite regress…

“[The optimality of the Metropolis-Hastings choice of acceptance probability] does not translate to the context of balancing functions.”

The paper indeed exhibits a setting that is rehabilitating Barker’ (1965) version of the acceptance probability, but I never  was very much convinced there was a significant difference in using one or the other. During our virtual (?) discussion, we also wondered at the adaptive abilities of the approach, e.g., selecting among a finite family of g’s (according to which criterion) or parameterising g towards an optimal choice of its parameter. And at the capacity for Rao-Blackwellisation since the proposal have to consider the entire set of neighbours prior to moving to a likely one.

## optimal choice among MCMC kernels

Posted in Statistics with tags , , , , , , , , , , on March 14, 2019 by xi'an

Last week in Siem Reap, Florian Maire [who I discovered originates from a Norman town less than 10km from my hometown!] presented an arXived joint work with Pierre Vandekerkhove at the Data Science & Finance conference in Cambodia that considers the following problem: Given a large collection of MCMC kernels, how to pick the best one and how to define what best means. Going by mixtures is a default exploration of the collection, as shown in (Tierney) 1994 for instance since this improves on both kernels (esp. when each kernel is not irreducible on its own!). This paper considers a move to local weights in the mixture, weights that are not estimated from earlier simulations, contrary to what I first understood.

As made clearer in the paper the focus is on filamentary distributions that are concentrated nearby lower-dimension sets or manifolds Since then the components of the kernel collections can be restricted to directions of these manifolds… Including an interesting case of a 2-D highly peaked target where converging means mostly simulating in x¹ and covering the target means mostly simulating in x². Exhibiting a schizophrenic tension between the two goals. Weight locally dependent means correction by Metropolis step, with cost O(n). What of Rao-Blackwellisation of these mixture weights, from weight x transition to full mixture, as in our PMC paper? Unclear to me as well [during the talk] is the use in the mixture of basic Metropolis kernels, which are not absolutely continuous, because of the Dirac mass component. But this is clarified by Section 5 in the paper. A surprising result from the paper (Corollary 1) is that the use of local weights ω(i,x) that depend on the current value of the chain does jeopardize the stationary measure π(.) of the mixture chain. Which may be due to the fact that all components of the mixture are already π-invariant. Or that the index of the kernel constitutes an auxiliary (if ancillary)  variate. (Algorithm 1 in the paper reminds me of delayed acceptance. Making me wonder if computing time should be accounted for.) A final question I briefly discussed with Florian is the extension to weights that are automatically constructed from the simulations and the target.

## A chance non-typo

Posted in Books, Statistics, University life with tags , , on March 18, 2011 by xi'an

A few days ago, my colleague Medhi Dafal in the admin branch of the university came across Monte Carlo Statistical Methods on another colleague’s desk and took it with him. Yesterday morning, when I came to his office to burden him with another admin nightmare, he opened the book [at random] and pointed out a formula pretending he had trouble with it! This was the residual distribution in the Markov chapter

$\tilde K(x,\cdot)=\dfrac{K(x,\cdot) - \epsilon\mathbb{I}_C(x)\nu(\cdot)}{1-\epsilon\mathbb{I}_C(x)}$

When I first saw it, I though this was indeed a typo because of the indicator in the denominator. Now that I look at it more carefully, I realise this is correct, because C is a small set, hence the residual kernel is the standard kernel outside C. Still, for a few minutes, I had this weird impression he had found a new typo completely at random! Which would have been a significant item of information about the frequency of typos in this book… Phew!