## general perspective on the Metropolis–Hastings kernel

Posted in Books, Statistics on January 14, 2021 by xi'an

[My Bristol friends and co-authors] Christophe Andrieu and Anthony Lee, along with Sam Livingstone, arXived a massive paper on 01 January on the Metropolis–Hastings kernel.

“Our aim is to develop a framework making establishing correctness of complex Markov chain Monte Carlo kernels a purely mechanical or algebraic exercise, while making communication of ideas simpler and unambiguous by allowing a stronger focus on essential features (…) This framework can also be used to validate kernels that do not satisfy detailed balance, i.e. which are not reversible, but a modified version thereof.”

A central notion in this highly general framework is, extending Tierney (1998), to see an MCMC kernel as a triplet involving a probability measure μ (on an extended space), an involution transform φ generalising the proposal step (i.e. φ²=id), and an associated acceptance probability ð. Then μ-reversibility occurs for

$\eth(\xi)\mu(\text{d}\xi)= \eth(\phi(\xi))\mu^{\phi}(\text{d}\xi)$

with the rhs involving the push-forward measure induced by μ and φ. Furthermore, there is always a choice of an acceptance probability ð ensuring that this equality holds. Interestingly, the new framework allows for mostly seamless handling of more complex versions of MCMC such as reversible jump and parallel tempering. But also non-reversible kernels, incl. for instance delayed rejection. And HMC, incl. NUTS. And pseudo-marginal, multiple-try, PDMPs, &c., &c. It is remarkable to see such a general theory emerging at this (late?) stage of the evolution of the field (and I will need more time and attention to understand its consequences).
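
To make the triplet (μ, φ, ð) more concrete, here is a minimal sketch, not taken from the paper, of the plain random-walk Metropolis algorithm written in this form. It assumes an extended state ξ=(x,v), with μ the product of the target on x and a centred Gaussian on v, the involution φ(x,v)=(x+v,−v), and the acceptance probability min(1, μ(φ(ξ))/μ(ξ)), which is one valid choice when φ has unit Jacobian. All function names are hypothetical.

```python
import math
import random


def shift_and_flip(xi):
    """Involution phi(x, v) = (x + v, -v); applying it twice returns (x, v)."""
    x, v = xi
    return (x + v, -v)


def involutive_mh_step(xi, log_mu, phi, rng):
    """Move to phi(xi) with probability min(1, mu(phi(xi)) / mu(xi)), else stay.

    Assumption: phi is an involution with unit Jacobian, so the density of the
    push-forward of mu by phi, evaluated at xi, is simply mu(phi(xi)).
    """
    proposal = phi(xi)
    log_ratio = log_mu(proposal) - log_mu(xi)
    if rng.random() < math.exp(min(0.0, log_ratio)):
        return proposal
    return xi


def random_walk_chain(log_pi, x0, scale, n_steps, seed=0):
    """Random-walk Metropolis expressed through the triplet (mu, phi, acceptance)."""
    rng = random.Random(seed)

    def log_mu(xi):
        # Extended target: pi(x) times a N(0, scale^2) density on v (up to a constant).
        x, v = xi
        return log_pi(x) - 0.5 * (v / scale) ** 2

    x, samples = x0, []
    for _ in range(n_steps):
        # Refresh the auxiliary variable from its conditional under mu.
        v = rng.gauss(0.0, scale)
        x, _v = involutive_mh_step((x, v), log_mu, shift_and_flip, rng)
        samples.append(x)
    return samples
```

Since the Gaussian term is symmetric in v, the ratio reduces to π(x+v)/π(x), i.e., the usual Metropolis ratio. Other kernels would correspond to other choices of μ and φ, which this sketch does not attempt to cover.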

## MCMC, variational inference, invertible flows… bridging the gap?

Posted in Books, Mountains, Running, Statistics, Travel, University life on October 2, 2020 by xi'an

Two weeks ago, my friend [see here when climbing Pic du Midi d’Ossau in 2005!] and coauthor Éric Moulines gave a very interesting on-line talk entitled MCMC, Variational Inference, Invertible Flows… Bridging the gap?, which was merging MCMC, variational autoencoders, and variational inference. I paid close attention as I plan to teach an advanced course on acronyms next semester in Warwick. (By acronyms, I mean ABC+GAN+VAE!)

The notion in this work is that variational autoencoders are based on over-simple mean-field variational distributions that usually produce a poor approximation of the target distribution. Éric and his coauthors propose to introduce a Metropolis step in the VAE. This leads to a more general notion of Markov transitions and a global balance condition. Hamiltonian Monte Carlo can be used as well and it improves the latent distribution approximation, namely the encoder, which is surprising to me. The steps of the Markov kernel produce a manageable transform of the initial mean-field approximation, a random version of the original VAE. Manageable provided not too many MCMC steps are implemented. (Now, the flow of slides was much too fast for me to get a proper understanding of the implementation of the method, of the degree of its calibration, and of the computing cost. I need to read the associated papers.)

Once the talk was over, I went back to changing tires and tubes, as two bikes of mine had flat tires, the latest being a spectacular explosion (!) that seemingly went through the tire (although I believe the opposite happened, namely the tire got slashed and induced the tube to blow out very quickly). Blame the numerous bits of broken glass over bike paths.

## transport Monte Carlo

Posted in Books, pictures, Statistics, Travel on August 31, 2020 by xi'an

Read this recent arXival by Leo Duan (from UF in Gainesville) on transport approaches to approximate Bayesian computation, in connection with normalising flows. The author points out a “lack of flexibility in a large class of normalizing flows” to bring forward his own proposal.

“…we assume the reference (a multivariate uniform distribution) can be written as a mixture of many one-to-one transforms from the posterior”

The transportation problem is turned into defining a joint distribution on (β,θ) such that θ is marginally distributed from the posterior and β is one of an infinite collection of transforms of θ. Which sounds quite different from normalizing flows, to be sure. Reversing the order, if one manages to simulate β from its marginal, the resulting θ is one of the transforms. Chosen to be a location-scale modification of β, s⊗β+m. The weights when going from θ to β are logistic transforms with Dirichlet distributed scales. All with parameters to be optimised by minimising the Kullback-Leibler divergence between the reference measure on β and its inverse mixture approximation, and resorting to gradient descent. (This may sound a wee bit overwhelming as an approximation strategy and I actually had to make a large cup of strong matcha to get over it, but this may be due to the heat wave occurring at the same time!) Drawing θ from this approximation is straightforward by construction and an MCMC correction can even be added, resulting in an independent Metropolis-Hastings version since the acceptance ratio remains computable. Although this may defeat the whole purpose of the exercise by stalling the chain if the approximation is poor (hence suggesting that this last step be used instead as a control).
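
As an illustration of that last correction step, here is a generic independent Metropolis–Hastings sketch, not the author's implementation. It assumes that the fitted approximation q can be both sampled from and evaluated (up to a constant), and it returns the acceptance rate, which may serve as the control mentioned above. The function and argument names are hypothetical.

```python
import math
import random


def independent_mh(log_target, sample_q, log_q, n_steps, seed=0):
    """Independent Metropolis-Hastings with a fixed proposal q.

    log_target and log_q may both be unnormalised, since only the ratio
    of importance weights w = target / q enters the acceptance probability.
    Returns the chain and the observed acceptance rate.
    """
    rng = random.Random(seed)
    x = sample_q(rng)
    log_w_x = log_target(x) - log_q(x)
    samples, accepted = [], 0
    for _ in range(n_steps):
        y = sample_q(rng)  # the proposal ignores the current state
        log_w_y = log_target(y) - log_q(y)
        # Accept with probability min(1, w(y) / w(x)).
        if rng.random() < math.exp(min(0.0, log_w_y - log_w_x)):
            x, log_w_x = y, log_w_y
            accepted += 1
        samples.append(x)
    return samples, accepted / n_steps
```

For instance, with a standard Normal target and a deliberately crude N(0,1.5²) approximation, one would call `independent_mh(lambda x: -0.5 * x * x, lambda rng: rng.gauss(0.0, 1.5), lambda x: -0.5 * (x / 1.5) ** 2, 20000)`. A low acceptance rate signals a poor approximation, which is also when the chain tends to stall.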

The paper also contains a theoretical section that studies the approximation error, going to zero as the number of terms in the mixture, K, goes to infinity. Including a Monte Carlo error in log(n)/n (and incidentally quoting a result from my former HoD at Paris 6, Paul Deheuvels). Numerical experiments show domination or equivalence with some other solutions, e.g. being much faster than HMC, the remaining \$1000 question being of course the on-line evaluation of the quality of the approximation.

## non-reversible guided Metropolis–Hastings

Posted in Mountains, pictures, Statistics, Travel on June 4, 2020 by xi'an

Kengo Kamatani and Xiaolin Song, whom I visited in Osaka last summer in what seems like another reality!, just arXived another paper on a non-reversible Metropolis version. That exploits a group action and the associated Haar measure.

Following a proposal of Gustafson (1998), a ∆-guided Metropolis–Hastings kernel is based on a statistic ∆ that is totally ordered and determines the acceptance of a proposed value y~Q(x,.) by adding a direction (-,+) to the state space and moving from x if ∆x≤∆y in the positive direction and if ∆y≤∆x in the negative direction [with the standard Metropolis–Hastings acceptance probability]. The sign of the direction switches in case of a rejection. And the statistic ∆ is such that the proposal kernel Q(x,.) is unbiased, i.e., agnostic to the sign, i.e., it gives the same probability to ∆x≤∆y and ∆y≤∆x. This modification reduces the asymptotic variance compared with the original Metropolis–Hastings kernel.
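
The mechanism is easiest to see in the scalar case of Gustafson's guided walk, sketched below under strong simplifying assumptions that are mine rather than the paper's: a one-dimensional state, ∆x=x, and a symmetric Gaussian random walk proposal, which is then automatically unbiased. Restricting the proposal to the current direction amounts to taking the absolute value of the increment. Names are hypothetical.

```python
import math
import random


def guided_walk(log_pi, x0, scale, n_steps, seed=0):
    """Guided random walk on the extended state (x, direction), direction in {-1, +1}.

    Assumptions: scalar x, guiding statistic Delta(x) = x, and a symmetric
    N(0, scale^2) increment, so that both directions are equally likely
    under the underlying proposal.
    """
    rng = random.Random(seed)
    x, direction = x0, 1
    samples, directions = [], []
    for _ in range(n_steps):
        # Propose only in the current direction.
        y = x + direction * abs(rng.gauss(0.0, scale))
        # Standard Metropolis acceptance probability for a symmetric proposal.
        if rng.random() < math.exp(min(0.0, log_pi(y) - log_pi(x))):
            x = y  # accepted: keep going the same way
        else:
            direction = -direction  # rejected: stay put and flip the direction
        samples.append(x)
        directions.append(direction)
    return samples, directions
```

The chain keeps moving in the same direction until a rejection occurs, which is what breaks reversibility on the extended space while leaving the target invariant for the x component.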

To construct a random walk proposal that is unbiased, the authors assume that the ∆ transform takes values in a topological group, G, with Q further being invariant under the group actions. This can be constructed from a standard proposal by averaging the transforms of Q under all elements of the group over the associated right Haar measure. (Which I thought implied that the group is compact, except I forgot to account for the data update into a posterior..!) The worked-out example is based on a multivariate autoregressive kernel with ∆x being a rescaled non-central chi-squared variate. In dimension 24. The results show a clear improvement in effective sample size per second over off-the-shelf random walk and Hamiltonian Monte Carlo versions.

Seeing the Haar measure appearing in the setting of Markov chain Monte Carlo is fun!, as my last brush with it was not algorithmic. I would think the proposal only applies to settings where the components of the simulated vector are somewhat homogeneous in that the determination of both the group action and a guiding statistic seems harder in cases where these components take different meanings (or live in a weird topology). I also lazily wonder if selecting the guiding statistic as a gradient of the log-target would have any interest.