## control variates [seminar]

Posted in pictures, Statistics, Travel, University life with tags , , , , , , , , , , , , , , on November 5, 2021 by xi'an

Today, Petros Dellaportas (whom I have know since the early days of MCMC, when we met in CIRM) gave a seminar at the Warwick algorithm seminar on control variates for MCMC, reminding me of his 2012 JRSS paper. Based on the Poisson equation and using a second control variate to stabilise the Monte Carlo approximation do the first control variate. The difference with usual control variates is finding a first approximate G(x)-q(y|x)G(Y) to F-πF. And the first Poisson equation is using α(x,y)q(y|x) rather than π. Then the second expands log α(x,y)q(y|x) to achieve a manageable term.

Abstract: We provide a general methodology to construct control variates for any discrete time random walk Metropolis and Metropolis-adjusted Langevin algorithm Markov chains that can achieve, in a post-processing manner and with a negligible additional computational cost, impressive variance reduction when compared to the standard MCMC ergodic averages. Our proposed estimators are based on an approximate solution of the Poisson equation for a multivariate Gaussian target densities of any dimension.

I wonder if there were a neural network version that would first build G from scratch and later optimise it towards solving the Poisson equation. As in this recent arXival I haven’t read (yet).

## multilevel linear models, Gibbs samplers, and multigrid decompositions

Posted in Books, Statistics, University life with tags , , , , , , , , , , , , , on October 22, 2021 by xi'an

A paper by Giacommo Zanella (formerly Warwick) and Gareth Roberts (Warwick) is about to appear in Bayesian Analysis and (still) open for discussion. It examines in great details the convergence properties of several Gibbs versions of the same hierarchical posterior for an ANOVA type linear model. Although this may sound like an old-timer opinion, I find it good to have Gibbs sampling back on track! And to have further attention to diagnose convergence! Also, even after all these years (!), it is always a surprise  for me to (re-)realise that different versions of Gibbs samplings may hugely differ in convergence properties.

At first, intuitively, I thought the options (1,0) (c) and (0,1) (d) should be similarly performing. But one is “more” hierarchical than the other. While the results exhibiting a theoretical ordering of these choices are impressive, I would suggest pursuing an random exploration of the various parameterisations in order to handle cases where an analytical ordering proves impossible. It would most likely produce a superior performance, as hinted at by Figure 4. (This alternative happens to be briefly mentioned in the Conclusion section.) The notion of choosing the optimal parameterisation at each step is indeed somewhat unrealistic in that the optimality zones exhibited in Figure 4 are unknown in a more general model than the Gaussian ANOVA model. Especially with a high number of parameters, parameterisations, and recombinations in the model (Section 7).

An idle question is about the extension to a more general hierarchical model where recentring is not feasible because of the non-linear nature of the parameters. Even though Gaussianity may not be such a restriction in that other exponential (if artificial) families keeping the ANOVA structure should work as well.

Theorem 1 is quite impressive and wide ranging. It also reminded (old) me of the interleaving properties and data augmentation versions of the early-day Gibbs. More to the point and to the current era, it offers more possibilities for coupling, parallelism, and increasing convergence. And for fighting dimension curses.

“in this context, imposing identifiability always improves the convergence properties of the Gibbs Sampler”

Another idle thought of mine is to wonder whether or not there is a limited number of reparameterisations. I think that by creating unidentifiable decompositions of (some) parameters, eg, μ=μ¹+μ²+.., one can unrestrictedly multiply the number of parameterisations. Instead of imposing hard identifiability constraints as in Section 4.2, my intuition was that this de-identification would increase the mixing behaviour but this somewhat clashes with the above (rigorous) statement from the authors. So I am proven wrong there!

Unless I missed something, I also wonder at different possible implementations of HMC depending on different parameterisations and whether or not the impact of parameterisation has been studied for HMC. (Which may be linked with Remark 2?)

## Gibbs sampling with incompatible conditionals

Posted in Books, Kids, R, Statistics with tags , , , , , , on July 23, 2019 by xi'an

An interesting question (with no clear motivation) on X validated wondering why a Gibbs sampler produces NAs… Interesting because multi-layered:

1. The attached R code indeed produces NAs because it calls the Negative Binomial Neg(x¹,p) random generator with a zero success parameter, x¹=0, which automatically returns NAs. This can be escaped by returning a one (1) instead.
2. The Gibbs sampler is based on a Bin(x²,p) conditional for X¹ and a Neg(x¹,p) conditional for X². When using the most standard version of the Negative Binomial random variate as the number of failures, hence supported on 0,1,2…. these two conditionals are incompatible, i.e., there cannot be a joint distribution behind that returns these as conditionals, which makes the limiting behaviour of the Markov chain harder to study. It however seems to converge to a distribution close to zero, which is not contradictory with the incompatibility property: the stationary joint distribution simply does not enjoy the conditionals used by the Gibbs sampler as its conditionals.
3. When using the less standard version of the Negative Binomial random variate understood as a number of attempts for the conditional on X², the two conditionals are compatible and correspond to a joint measure proportional to $x_1^{-1} {x_1 \choose x_2} p^{x_2} (1-p)^{x_1-x_2}$, however this pmf does not sum up to a finite quantity (as in the original Gibbs for Kids example!), hence the resulting Markov chain is at best null recurrent, which seems to be the case for p different from ½. This is unclear to me for p=½.

## automatic adaptation of MCMC algorithms

Posted in pictures, Statistics with tags , , , , , , , on March 4, 2019 by xi'an

“A typical adaptive MCMC sampler will approximately optimize performance given the kind of sampler chosen in the first place, but it will not optimize among the variety of samplers that could have been chosen.”

Last February (2018), Dao Nguyen and five co-authors arXived a paper that I missed. On a new version of adaptive MCMC that aims at selecting a wider range of proposal kernels. Still requiring a by-hand selection of this collection of kernels… Among the points addressed, beyond the theoretical guarantees that the adaptive scheme does not jeopardize convergence to the proper target, are a meta-exploration of the set of combinations of samplers and integration of the computational speed in the assessment of each sampler. Including the very difficulty of assessing mixing. One could deem the index of the proposal as an extra (cyber-)parameter to its generic parameter (like the scale in the random walk), but the discreteness of this index makes the extension more delicate than expected. And justifies the distinction between internal and external parameters. The notion of a worst-mixing dimension is quite appealing and connects with the long-term hope that one could spend the maximum fraction of the sampler runtime over the directions that are poorly mixing, while still keeping the target as should be. The adaptive scheme is illustrated on several realistic models with rather convincing gains in efficiency and time.

The convergence tools are inspired from Roberts and Rosenthal (2007), with an assumption of uniform ergodicity over all kernels considered therein which is both strong and delicate to assess in practical settings. Efficiency is rather unfortunately defined in terms of effective sample size, which is a measure of correlation or lack thereof, but which does not relate to the speed of escape from the basin of attraction of the starting point. I also wonder at the pertinence of the estimation of the effective sample size when the chain is based on different successive kernels, since the lack of correlation may be due to another kernel. Another calibration issue is the internal clock that relates to the average number of iterations required to tune properly a specific kernel, which again may be difficult to assess in a realistic situation. A last query is whether or not this scheme could be compared with an asynchronous (and valid) MCMC approach that exploits parallel capacities of the computer.

## Gibbs for incompatible kids

Posted in Books, Statistics, University life with tags , , , , , , , , , , on September 27, 2018 by xi'an

In continuation of my earlier post on Bayesian GANs, which resort to strongly incompatible conditionals, I read a 2015 paper of Chen and Ip that I had missed. (Published in the Journal of Statistical Computation and Simulation which I first confused with JCGS and which I do not know at all. Actually, when looking at its editorial board,  I recognised only one name.) But the study therein is quite disappointing and not helping as it considers Markov chains on finite state spaces, meaning that the transition distributions are matrices, meaning also that convergence is ensured if these matrices have no null probability term. And while the paper is motivated by realistic situations where incompatible conditionals can reasonably appear, the paper only produces illustrations on two and three states Markov chains. Not that helpful, in the end… The game is still afoot!