## Hastings 50 years later

Posted in Books, pictures, Statistics, University life on January 9, 2020 by xi'an

What is the exact impact of the Metropolis-Hastings algorithm on the field of Bayesian statistics? And what are the new tools of the trade? What I personally find the most relevant and attractive element in a review on the topic is the current role of this algorithm, rather than its past (his)story, since many such reviews have already appeared and will likely continue to appear. What matters most imho is how much the Metropolis-Hastings algorithm signifies for the community at large, especially beyond academia. Is the availability or unavailability of software like BUGS or Stan a help or a hindrance? Was Hastings’ paper the start of the era of approximate inference or the end of exact inference? Are the algorithm's intrinsic features like Markovianity a fundamental cause for an eventual extinction, because of the ensuing time constraint, the lack of practical guarantees of convergence, and the illusion of a fully automated version? Or are emerging solutions like unbiased MCMC and asynchronous algorithms a beacon of hope?

Dunson and Johndrow (2019) recently wrote a celebration of Hastings’ 1970 Biometrika paper, published in the same journal, where they cover adaptive Metropolis (Haario et al., 1999; Roberts and Rosenthal, 2005) and the importance of gradient based versions toward universal algorithms (Roberts and Tweedie, 1995; Neal, 2003), discussing the advantages of HMC over Langevin versions. They also recall the significant step represented by Peter Green’s (1995) reversible jump algorithm for multimodal and multidimensional targets, as well as tempering (Miasojedow et al., 2013; Woodard et al., 2009). They further cover intractable likelihood cases within MCMC (rather than ABC), with the use of auxiliary variables (Friel and Pettitt, 2008; Møller et al., 2006) and pseudo-marginal MCMC (Andrieu and Roberts, 2009; Andrieu and Vihola, 2016). They naturally insist upon the need to handle huge datasets, high-dimension parameter spaces, and other scalability issues, with links to unadjusted Langevin schemes (Bardenet et al., 2014; Durmus and Moulines, 2017; Welling and Teh, 2011). Dunson and Johndrow (2019) also discuss recent developments towards parallel MCMC and non-reversible schemes such as PDMP as highly promising, with a concluding section on the challenges of automatising and robustifying the said procedures much further, if only to reach a wider range of applications. The paper is well-written and contains a wealth of directions and reflections, including those in my above introduction. Here are some mostly disconnected directions I would have liked to see covered or more covered:

1. convergence assessment today, e.g. the comparison of various approximation schemes
2. Rao-Blackwellisation and other post-processing improvements
3. other approximate inference tools than the pseudo-marginal MCMC
4. importance of the parameterisation of the problem for convergence
5. dimension issues and connection with quasi-Monte Carlo
6. constrained spaces of measure zero, as for instance matrix distributions imposing zeros outside a diagonal band
7. given the rise of the machine(-learners), are exploratory and intrinsically slow algorithms like MCMC doomed or can both fields feed one another? The section on optimisation could be expanded in that direction
8. the wasteful nature of the random walk feature of MCMC algorithms, as opposed to non-reversible kernels like HMC and other PDMPs, missing from the gradient based methods section (and can we once again learn from physicists?)
9. finer convergence issues and hence inference difficulties with complex MCMC algorithms like Gibbs samplers with incompatible conditionals
10. use of the Hastings ratio in other algorithms like ABC or EP (in link with the section on generalised Bayes)
11. adapting Metropolis-Hastings methods for emerging computing tools like GPUs and quantum computers

or possibly less covered, namely data augmentation, which is put forward when it is a special case of auxiliary variables, as in slice sampling and in earlier physics literature. For instance, both probit and logistic regressions do not truly require data augmentation and are more toy examples than really challenging applications. The approach of Carlin & Chib (1995) is another illustration, which has met with recent interest, despite requiring heavy calibration (just like RJMCMC). The same goes for a somewhat awkward opposition between Gibbs and Hastings, in that I am not convinced that Gibbs does not remain ultimately necessary to handle high dimension problems, in the sense that the alternative solutions like Langevin, HMC, or PDMP, or…, are relying on Euclidean assumptions for the entire vector, while a direct product of Euclidean structures may prove more adequate.

## Gibbs sampling with incompatible conditionals

Posted in Books, Kids, R, Statistics on July 23, 2019 by xi'an

An interesting question (with no clear motivation) on X validated wondering why a Gibbs sampler produces NAs… Interesting because multi-layered:

1. The attached R code indeed produces NAs because it calls the Negative Binomial Neg(x¹,p) random generator with a zero success parameter, x¹=0, which automatically returns NAs. This can be escaped by returning a one (1) instead.
2. The Gibbs sampler is based on a Bin(x²,p) conditional for X¹ and a Neg(x¹,p) conditional for X². When using the most standard version of the Negative Binomial random variate as the number of failures, hence supported on 0, 1, 2, …, these two conditionals are incompatible, i.e., there cannot be a joint distribution behind that returns these as conditionals, which makes the limiting behaviour of the Markov chain harder to study. It however seems to converge to a distribution close to zero, which is not contradictory with the incompatibility property: the stationary joint distribution simply does not enjoy the conditionals used by the Gibbs sampler as its conditionals.
3. When using the less standard version of the Negative Binomial random variate understood as a number of attempts for the conditional on X², the two conditionals are compatible and correspond to a joint measure proportional to $x_1^{-1} {x_1 \choose x_2} p^{x_2} (1-p)^{x_1-x_2}$. However, this pmf does not sum up to a finite quantity (as in the original Gibbs for Kids example!), since summing over $x_2$ leaves a series in $x_1$ that behaves like the harmonic series $\sum 1/x_1$. Hence the resulting Markov chain is at best null recurrent, which seems to be the case for p different from ½. This is unclear to me for p=½.
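To make the three points above concrete, here is a minimal Python sketch of such a two-step sampler (not the R code from the original question, which is not reproduced here). The function names, the starting value, and the reading of the patch in point 1 as "replace a zero size parameter by one" are all assumptions of this sketch. The `attempts` flag switches between the failures version of point 2 and the attempts version of point 3.

```python
import random


def binomial(n, p, rng):
    """Number of successes in n Bernoulli(p) trials (0 when n == 0)."""
    return sum(rng.random() < p for _ in range(n))


def neg_binomial_failures(r, p, rng):
    """Number of failures before the r-th success, for r >= 1."""
    failures = successes = 0
    while successes < r:
        if rng.random() < p:
            successes += 1
        else:
            failures += 1
    return failures


def gibbs_chain(p, n_iter, x2_init=5, attempts=False, seed=0):
    """Alternate X1 | X2 ~ Bin(x2, p) and X2 | X1 ~ Neg(x1, p).

    attempts=False: Neg counts failures (the incompatible pair of point 2).
    attempts=True:  Neg counts attempts, i.e. failures plus successes (point 3).
    """
    rng = random.Random(seed)
    x2 = x2_init  # hypothetical starting value
    chain = []
    for _ in range(n_iter):
        x1 = binomial(x2, p, rng)
        # Assumed patch for point 1: a zero size parameter is replaced by one.
        r = max(x1, 1)
        x2 = neg_binomial_failures(r, p, rng)
        if attempts:
            x2 += r
        chain.append((x1, x2))
    return chain
```

Running `gibbs_chain(0.3, 10_000)` and tabulating the visited pairs is one way to look at the concentration near zero mentioned in point 2, while `attempts=True` gives a chain whose excursions can be compared across values of p. Neither run proves anything about recurrence, of course.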

## A precursor of ABC-Gibbs

Posted in Books, R, Statistics on June 7, 2019 by xi'an

> “All ABC algorithms, including ABC-PaSS introduced here, require that statistics are sufficient for estimating the parameters of a given model. As mentioned above, parameter-wise sufficient statistics as required by ABC-PaSS are trivial to find for distributions of the exponential family. Since many population genetics models do not follow such distributions, sufficient statistics are known for the most simple models only. For more realistic models involving multiple populations or population size changes, only approximately-sufficient statistics can be found.”

While Gibbs sampling is not mentioned in the paper, this is indeed a form of ABC-Gibbs, with the advantage of not facing convergence issues thanks to the sufficiency. The drawback is that this setting is restricted to exponential families and hence difficult to extrapolate to non-exponential distributions, as using almost-sufficient (or not) summary statistics leads to incompatible conditionals and thus jeopardises the convergence of the sampler. When thinking a wee bit more about the case treated by Kousathanas et al., I am actually uncertain about the validation of the sampler. When the tolerance is equal to zero, this is not an issue as it reproduces the regular Gibbs sampler. Otherwise, each conditional ABC step amounts to introducing an auxiliary variable represented by the simulated summary statistic. Since the distribution of this summary statistic depends, in general, on more than the parameter for which it is sufficient, it should also appear in the conditional distribution of the other parameters. At least from this Gibbs perspective, the sampler thus relies on incompatible conditionals, which makes the conditions proposed in our own paper all the more relevant.

## ABC with Gibbs steps

Posted in Statistics on June 3, 2019 by xi'an

With Grégoire Clarté, Robin Ryder and Julien Stoehr, all from Paris-Dauphine, we have just arXived a paper on the specifics of ABC-Gibbs, which is a version of ABC where the generic ABC accept-reject step is replaced by a sequence of n conditional ABC accept-reject steps, each aiming at an ABC version of a conditional distribution extracted from the joint and intractable target. Hence an ABC version of the standard Gibbs sampler. What makes it so special is that each conditional can (and should) be conditioning on a different statistic in order to decrease the dimension of this statistic, ideally down to the dimension of the corresponding component of the parameter. This successfully bypasses the curse of dimensionality but immediately meets with two difficulties. The first one is that the resulting sequence of conditionals is not coherent, since it is not a Gibbs sampler on the ABC target. The conditionals are thus incompatible and the convergence of the associated Markov chain therefore becomes an issue. We produce sufficient conditions for the Gibbs sampler to converge to a stationary distribution using incompatible conditionals. The second problem is that, provided it exists, the limiting and also intractable distribution does not enjoy a Bayesian interpretation, hence may fail to be justified from an inferential viewpoint. We however succeed in producing a version of ABC-Gibbs in a hierarchical model where the limiting distribution can be made explicit and, even better, can be weighted towards recovering the original target. (At least with limiting zero tolerance.)
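As an illustration of the "one low-dimensional statistic per conditional" idea, here is a minimal Python sketch of ABC-Gibbs on a toy hierarchical Normal model. This is a sketch under stated assumptions, not the implementation from the paper: the model (α uniform on (-4, 4), μ_j ~ N(α, 1), x_ij ~ N(μ_j, 1)), the choice of group means as statistics, the fixed tolerance `eps`, and all names are hypothetical choices made for the example.

```python
import random
from statistics import fmean


def abc_gibbs(data, eps, n_iter, seed=0):
    """ABC-Gibbs sketch for an assumed toy model:
    alpha ~ U(-4, 4), mu_j ~ N(alpha, 1), x_ij ~ N(mu_j, 1).

    Each conditional step is a plain ABC accept-reject step that
    conditions on its own one-dimensional statistic.
    """
    rng = random.Random(seed)
    obs_means = [fmean(group) for group in data]
    mu = list(obs_means)  # hypothetical initialisation at the group means
    chain = []
    for _ in range(n_iter):
        # ABC step for alpha | mu: statistic = average of the mu_j's.
        target = fmean(mu)
        while True:
            cand = rng.uniform(-4.0, 4.0)
            pseudo_mu = [rng.gauss(cand, 1.0) for _ in mu]
            if abs(fmean(pseudo_mu) - target) <= eps:
                alpha = cand
                break
        # ABC step for each mu_j | alpha, x_j: statistic = mean of group j.
        for j, group in enumerate(data):
            while True:
                cand = rng.gauss(alpha, 1.0)
                pseudo_x = [rng.gauss(cand, 1.0) for _ in group]
                if abs(fmean(pseudo_x) - obs_means[j]) <= eps:
                    mu[j] = cand
                    break
        chain.append((alpha, list(mu)))
    return chain
```

For instance, `abc_gibbs([[0.8, 1.2, 1.0], [1.9, 2.1, 2.0], [-0.1, 0.1, 0.0]], eps=0.5, n_iter=100)` returns a list of `(alpha, mu)` pairs. Note that the statistic used in the α step (the average of the μ's) differs from the statistics used in the μ steps (the group means), which is exactly what makes the conditionals incompatible for a positive tolerance.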