**J**ust read the news that my friend [and co-author] Arnaud Doucet (Oxford) is the winner of the 2020 Guy Silver Medal award from the Royal Statistical Society. I was also please to learn about David Spiegelhalter‘s Guy Gold medal (I first met David at the fourth Valencia Bayesian meeting in 1991, where he had a poster on the very early stages of BUGS) and Byron Morgan‘s Barnett Award for his indeed remarkable work on statistical ecology and in particular Bayesian capture recapture models. Congrats to all six recipients!

## Archive for BUGS

## Hastings 50 years later

Posted in Books, pictures, Statistics, University life with tags 1066, asynchronous algorithms, automation, Battle of Hastings, Bayesian statistics, BUGS, history of statistics, incompatible conditionals, Metropolis-Hastings algorithms, Normans, pseudo-marginal MCMC, STAN, Wilfred Keith Hastings on January 9, 2020 by xi'an**W**hat is the exact impact of the Metropolis-Hastings algorithm on the field of Bayesian statistics? and what are the new tools of the trade? What I personally find the most relevant and attractive element in a review on the topic is the current role of this algorithm, rather than its past (his)story, since many such reviews have already appeared and will likely continue to appear. What matters most imho is how much the Metropolis-Hastings algorithm signifies for the community at large, especially beyond academia. Is the availability or unavailability of software like BUGS or Stan a help or an hindrance? Was Hastings’ paper the start of the era of approximate inference or the end of exact inference? Are the algorithm intrinsic features like Markovianity a fundamental cause for an eventual extinction because of the ensuing time constraint and the lack of practical guarantees of convergence and the illusion of a fully automated version? Or are emerging solutions like unbiased MCMC and asynchronous algorithms a beacon of hope?

In their Biometrika paper, Dunson and Johndrow (2019) recently wrote a celebration of Hastings’ 1970 paper in Biometrika, where they cover adaptive Metropolis (Haario et al., 1999; Roberts and Rosenthal, 2005), the importance of gradient based versions toward universal algorithms (Roberts and Tweedie, 1995; Neal, 2003), discussing the advantages of HMC over Langevin versions. They also recall the significant step represented by Peter Green’s (1995) reversible jump algorithm for multimodal and multidimensional targets, as well as tempering (Miasojedow et al., 2013; Woodard et al., 2009). They further cover intractable likelihood cases within MCMC (rather than ABC), with the use of auxiliary variables (Friel and Pettitt, 2008; Møller et al., 2006) and pseudo-marginal MCMC (Andrieu and Roberts, 2009; Andrieu and Vihola, 2016). They naturally insist upon the need to handle huge datasets, high-dimension parameter spaces, and other scalability issues, with links to unadjusted Langevin schemes (Bardenet et al., 2014; Durmus and Moulines, 2017; Welling and Teh, 2011). Similarly, Dunson and Johndrow (2019) discuss recent developments towards parallel MCMC and non-reversible schemes such as PDMP as highly promising, with a concluding section on the challenges of automatising and robustifying much further the said procedures, if only to reach a wider range of applications. The paper is well-written and contains a wealth of directions and reflections, including those in my above introduction. Here are some mostly disconnected directions I would have liked to see covered or more covered

- convergence assessment today, e.g. the comparison of various approximation schemes
- Rao-Blackwellisation and other post-processing improvements
- other approximate inference tools than the pseudo-marginal MCMC
- importance of the parameterisation of the problem for convergence
- dimension issues and connection with quasi-Monte Carlo
- constrained spaces of measure zero, as for instance matrix distributions imposing zeros outside a diagonal band
- given the rise of the machine(-learners), are exploratory and intrinsically slow algorithms like MCMC doomed or can both fields feed one another? The section on optimisation could be expanded in that direction
- the wasteful nature of the random walk feature of MCMC algorithms, as opposed to non-reversible kernels like HMC and other PDMPs, missing from the gradient based methods section (and can we once again learn from physicists?)
- finer convergence issues and hence inference difficulties with complex MCMC algorithms like Gibbs samplers with incompatible conditionals
- use of the Hastings ratio in other algorithms like ABC or EP (in link with the section on generalised Bayes)
- adapting Metropolis-Hastings methods for emerging computing tools like GPUs and quantum computers

or possibly less covered, namely data augmentation put forward when it is a special case of auxiliary variables as in slice sampling and in earlier physics literature. For instance, both probit and logistic regressions do not truly require data augmentation and are more toy examples than really challenging applications. The approach of Carlin & Chib (1995) is another illustration, which has met with recent interest, despite requiring heavy calibration (just like RJMCMC). As well as a a somewhat awkward opposition between Gibbs and Hastings, in that I am not convinced that Gibbs does not remain ultimately necessary to handle high dimension problems, in the sense that the alternative solutions like Langevin, HMC, or PDMP, or…, are relying on Euclidean assumptions for the entire vector, while a direct product of Euclidean structures may prove more adequate.

## improperties on an astronomical scale

Posted in Books, pictures, Statistics with tags astronomy, astrostatistics, Bayesian inference, BUGS, improper posteriors, impropriety, noninformative priors, vague priors on December 15, 2017 by xi'an**A**s pointed out by Peter Coles on his blog, In the Dark, Hyungsuk Tak, Sujit Ghosh, and Justin Ellis just arXived a review of the unsafe use of improper priors in astronomy papers, 24 out of 75 having failed to establish that the corresponding posteriors are well-defined. And they exhibit such an instance (of impropriety) in a MNRAS paper by Pihajoki (2017), which is a complexification of Gelfand et al. (1990), also used by Jim Hobert in his thesis. (Even though the formal argument used to show the impropriety of the posterior in Pihajoki’s paper does not sound right since it considers divergence at a single value of a parameter β.) Besides repeating this warning about an issue that was rather quickly identified in the infancy of MCMC, if not in the very first publications on the Gibbs sampler, the paper seems to argue against using improper priors due to this potential danger, stating that instead proper priors that include all likely values and beyond are to be preferred. Which reminds me of the BUGS feature of using a N(0,10⁹) prior instead of the flat prior, missing the fact that “very large” variances do impact the resulting inference (if only for the issue of model comparison, remember Lindley-Jeffreys!). And are informative in that sense. However, it is obviously a good idea to advise checking for propriety (!) and using such alternatives may come as a safety button, providing a comparison benchmark to spot possible divergences in the resulting inference.

## adaptive rejection sampling with fixed number of nodes

Posted in Books, Statistics, University life with tags accept-reject algorithm, ARS, BUGS, Gibbs sampler, Monte Carlo Statistical Methods, Wally Gilks on October 8, 2015 by xi'an**T**his paper was arXived today by Martino and Louzada. It starts from the adaptive rejection sampling algorithm of Gilks and Wild (1993), which provided an almost automatic random generator for univariate log-concave target probability densities. By creating an envelope of the true target based on convexity. This algorithm was at the core of BUGS in its early days. And makes a good example of accept reject algorithm that I often use in class. It is however not so popular nowadays because of the unidimensional constraint. And because it is not particularly adequate for simulating a *single* realisation from a given target. As is the case when used in a Gibbs sampler. The authors only consider here the issue of the growing cost of running the adaptive rejection sampling algorithm, which is due to the update of the envelope at each rejection. They however fail to quantify this increase, which involves (a) finding the interval where the current rejection lies, among n existing intervals, which is of order O(n), and (b) creating both modified envelopes on both new intervals, which is of order O(1). The proposal of the authors, called cheap adaptive rejection sampling, settles for a fixed number τ of support points (and intervals), solely swapping a rejected point with the closest support point when this improves the acceptance rate. The test for swapping involves first computing the alternative target (on a pair of intervals) and the corresponding normalising constant, while swapping the rejected point with the closest support point involves finding the closest point, which is of order O(log τ). I am rather surprised that the authors do not even mention the execution time orders, resorting instead to a simulation where the gain brought by their proposal is not overwhelming. There is also an additional cost for figuring out the fixed number τ of support points, another issue not mentioned in the paper.

## reis naar Amsterdam

Posted in Books, Kids, pictures, Running, Statistics, Travel, University life, Wines with tags Amsterdam, Bayesian statistics, BUGS, canals, Holland, Ising model, JASP, Marc Kac, minimal description length principle, normalising constant, psychology, R, STAN, UvA on April 16, 2015 by xi'an**O**n Monday, I went to Amsterdam to give a seminar at the University of Amsterdam, in the department of psychology. And to visit Eric-Jan Wagenmakers and his group there. And I had a fantastic time! I talked about our mixture proposal for Bayesian testing and model choice without getting hostile or adverse reactions from the audience, quite the opposite as we later discussed this new notion for several hours in the café across the street. I also had the opportunity to meet with Peter Grünwald [who authored a book on the minimum description length principle] pointed out a minor inconsistency of the common parameter approach, namely that the Jeffreys prior on the first model did not have to coincide with the Jeffreys prior on the second model. (The Jeffreys prior for the mixture being unavailable.) He also wondered about a more conservative property of the approach, compared with the Bayes factor, in the sense that the non-null parameter could get closer to the null-parameter while still being identifiable.

Among the many persons I met in the department, Maarten Marsman talked to me about his thesis research, Plausible values in statistical inference, which involved handling the Ising model [a non-sparse Ising model with O(p²) parameters] by an auxiliary representation due to Marc Kac and getting rid of the normalising (partition) constant by the way. (Warning, some approximations involved!) And who showed me a simple probit example of the Gibbs sampler getting stuck as the sample size n grows. Simply because the uniform conditional distribution on the parameter concentrates faster (in 1/n) than the posterior (in 1/√n). This does not come as a complete surprise as data augmentation operates in an n-dimensional space. Hence it requires more time to get around. As a side remark [still worth printing!], Maarten dedicated his thesis as *“To my favourite random variables , Siem en Fem, and to my normalizing constant, Esther”*, from which I hope you can spot the influence of at least two of my book dedications! As I left Amsterdam on Tuesday, I had time for a enjoyable dinner with E-J’s group, an equally enjoyable early morning run [with perfect skies for sunrise pictures!], and more discussions in the department. Including a presentation of the new (delicious?!) Bayesian software developed there, JASP, which aims at non-specialists [i.e., researchers unable to code in R, BUGS, or, God forbid!, STAN] And about the consequences of mixture testing in some psychological experiments. Once again, a fantastic time discussing Bayesian statistics and their applications, with a group of dedicated and enthusiastic Bayesians!