**M**y colleague from the Université d’Orléans, Didier Chauveau, has just published on CRAN a new R package called EntropyMCMC, which contains convergence assessment tools for MCMC algorithms, based on non-parametric estimates of the Kullback-Leibler divergence between current distribution and target. (A while ago, quite a while ago!, we actually collaborated with a few others on the Springer-Verlag Lecture Note #135 Discretization and MCMC convergence assessments.) This follows from a series of papers by Didier Chauveau and Pierre Vandekerkhove that started with a nearest neighbour entropy estimate. The evaluation of this entropy is based on N iid (parallel) chains, which involves a parallel implementation. While the missing normalising constant is overwhelmingly unknown, the authors this is not a major issue “since we are mostly interested in the stabilization” of the entropy distance. Or in the comparison of two MCMC algorithms. *[Disclaimer: I have not experimented with the package so far, hence cannot vouch for its performances over large dimensions or problematic targets, but would as usual welcome comments and feedback on readers’ experiences.]*

## Archive for MCMC

## EntropyMCMC [R package]

Posted in Statistics with tags convergence assessment, CRAN, discretization, entropy, EntropyMCMC, Lecture Notes in Statistics, MCMC, MCMC convergence, Monte Carlo Statistical Methods, R package, Springer-Verlag, Université d'Orléans, untractable normalizing constant on March 26, 2019 by xi'an## a new rule for adaptive importance sampling

Posted in Books, Statistics with tags adaptive importance sampling, AMIS, empirical likelihood, Helsinki, MCMC, Monte Carlo integration, Monte Carlo Statistical Methods, multiple importance methods, pseudo-random generators, University of Warwick on March 5, 2019 by xi'an**A**rt Owen and Yi Zhou have arXived a short paper on the combination of importance sampling estimators. Which connects somehow with the talk about multiple estimators I gave at ESM last year in Helsinki. And our earlier AMIS combination. The paper however makes two important assumptions to reach optimal weighting, which is inversely proportional to the variance:

- the estimators are uncorrelated if dependent;
- the variance of the k-th estimator is of order a (negative) power of k.

The later is puzzling when considering a series of estimators, in that k appears to act as a sample size (as in AMIS), the power is usually unknown but also there is no reason for the power to be the same for all estimators. The authors propose to use ½ as the default, both because this is the standard Monte Carlo rate and because the loss in variance is then minimal, being 12% larger.

As an aside, Art Owen also wrote an invited discussion “the unreasonable effectiveness of Monte Carlo” of ” Probabilistic Integration: A Role in Statistical Computation?” by François-Xavier Briol, Chris Oates, Mark Girolami (Warwick), Michael Osborne and Deni Sejdinovic, to appear in Statistical Science, discussion that contains a wealth of smart and enlightening remarks. Like the analogy between pseudo-random number generators [which work unreasonably well!] vs true random numbers and Bayesian numerical integration versus non-random functions. Or the role of advanced bootstrapping when assessing the variability of Monte Carlo estimates (citing a paper of his from 1992). Also pointing out at an intriguing MCMC paper by Michael Lavine and Jim Hodges to appear in The American Statistician.

## automatic adaptation of MCMC algorithms

Posted in pictures, Statistics with tags adaptive MCMC methods, arXiv, asynchronous algorithms, calibration, convergence of Gibbs samplers, Gibbs sampling, MCMC, parallelisation on March 4, 2019 by xi'an

“A typical adaptive MCMC sampler will approximately optimize performance given the kind of sampler chosen in the first place, but it will not optimize among the variety of samplers that could have been chosen.”

**L**ast February (2018), Dao Nguyen and five co-authors arXived a paper that I missed. On a new version of adaptive MCMC that aims at selecting a wider range of proposal kernels. Still requiring a by-hand selection of this collection of kernels… Among the points addressed, beyond the theoretical guarantees that the adaptive scheme does not jeopardize convergence to the proper target, are a meta-exploration of the set of combinations of samplers and integration of the computational speed in the assessment of each sampler. Including the very difficulty of assessing mixing. One could deem the index of the proposal as an extra (cyber-)parameter to its generic parameter (like the scale in the random walk), but the discreteness of this index makes the extension more delicate than expected. And justifies the distinction between internal and external parameters. The notion of a worst-mixing dimension is quite appealing and connects with the long-term hope that one could spend the maximum fraction of the sampler runtime over the directions that are poorly mixing, while still keeping the target as should be. The adaptive scheme is illustrated on several realistic models with rather convincing gains in efficiency and time.

The convergence tools are inspired from Roberts and Rosenthal (2007), with an assumption of uniform ergodicity over all kernels considered therein which is both strong and delicate to assess in practical settings. Efficiency is rather unfortunately defined in terms of effective sample size, which is a measure of correlation or lack thereof, but which does not relate to the speed of escape from the basin of attraction of the starting point. I also wonder at the pertinence of the estimation of the effective sample size when the chain is based on different successive kernels, since the lack of correlation may be due to another kernel. Another calibration issue is the internal clock that relates to the average number of iterations required to tune properly a specific kernel, which again may be difficult to assess in a realistic situation. A last query is whether or not this scheme could be compared with an asynchronous (and valid) MCMC approach that exploits parallel capacities of the computer.

## call for sessions and labs at Bay2sC0mp²⁰

Posted in pictures, R, Statistics, Travel, University life with tags BayesComp, call for contributions, Florida, Gainesville, ISBA, last call, machine learning, MCMC, MCMSki, Monte Carlo methods, poster session, R, software, STAN, statistical language, University of Florida, untractable normalizing constant, USA on February 22, 2019 by xi'an**A** call to all potential participants to the incoming BayesComp 2020 conference at the University of Florida in Gainesville, Florida, 7-10 January 2020, to submit proposals [to me] for contributed sessions on everything computational *or* training labs [to David Rossell] on a specific language or software. The deadline is **April 1** and the sessions will be selected by the scientific committee, other proposals being offered the possibility to present the associated research during a poster session [which always is a lively component of the conference]. (Conversely, we reserve the possibility of a “last call” session made from particularly exciting posters on new topics.) Plenary speakers for this conference are

- David Blei (Columbia University)
- Paul Fearnhead (University of Lancaster)
- Emily Fox (University of Washington)
- Max Welling (University of Amsterdam)

and the first invited sessions are already posted on the webpage of the conference. We dearly hope to attract a wide area of research interests into a as diverse as possible program, so please accept this invitation!!!

## Computational Bayesian Statistics [book review]

Posted in Books, Statistics with tags ABC, Bayes factor, Bayesian model selection, Bayesian p-values, Bayesian paradigm, Bayesian textbook, BayesX, book review, Cambridge University Press, coda, computational Bayesian methods, cup, Gibbs sampling, information criterion, INLA, JAGS, Jeffreys prior, Kalman filter, Laplace approximation, Likelihood Principle, MCMC, Metropolis-Hastings algorithm, model assessment, Monte Carlo Statistical Methods, OpenBUGS, R, sequential Monte Carlo, STAN, subjective versus objective Bayes on February 1, 2019 by xi'an**T**his Cambridge University Press book by M. Antónia Amaral Turkman, Carlos Daniel Paulino, and Peter Müller is an enlarged translation of a set of lecture notes in Portuguese. *(Warning: I have known Peter Müller from his PhD years in Purdue University and cannot pretend to perfect objectivity. For one thing, Peter once brought me frozen-solid beer: revenge can also be served cold!)* Which reminds me of my 1994 French edition of Méthodes de Monte Carlo par chaînes de Markov, considerably upgraded into Monte Carlo Statistical Methods (1998) thanks to the input of George Casella. *(Re-warning: As an author of books on the same topic(s), I can even less pretend to objectivity.)*

“The “great idea” behind the development of computational Bayesian statistics is the recognition that Bayesian inference can be implemented by way of simulation from the posterior distribution.”

The book is written from a strong, almost militant, subjective Bayesian perspective (as, e.g., when *half-Bayesians* are mentioned!). Subjective (and militant) as in Dennis Lindley‘s writings, eminently quoted therein. As well as in Tony O’Hagan‘s. Arguing that the sole notion of a Bayesian estimator is the entire posterior distribution. Unless one brings in a loss function. The book also discusses the Bayes factor in a critical manner, which is fine from my perspective. (Although the ban on improper priors makes its appearance in a very indirect way at the end of the last exercise of the first chapter.)

Somewhat at odds with the subjectivist stance of the previous chapter, the chapter on prior construction only considers non-informative and conjugate priors. Which, while understandable in an introductory book, is a wee bit disappointing. (When mentioning Jeffreys’ prior in multidimensional settings, the authors allude to using univariate Jeffreys’ rules for the marginal prior distributions, which is not a well-defined concept or else Bernardo’s and Berger’s reference priors would not have been considered.) The chapter also mentions the likelihood principle at the end of the last exercise, without a mention of the debate about its derivation by Birnbaum. Or Deborah Mayo’s recent reassessment of the strong likelihood principle. The following chapter is a sequence of illustrations in classical exponential family models, classical in that it is found in many Bayesian textbooks. (Except for the Poison model found in Exercise 3.3!)

Nothing to complain (!) about the introduction of Monte Carlo methods in the next chapter, especially about the notion of inference by Monte Carlo methods. And the illustration by Bayesian design. The chapter also introduces Rao-Blackwellisation [prior to introducing Gibbs sampling!]. And the simplest form of bridge sampling. (Resuscitating the weighted bootstrap of Gelfand and Smith (1990) may not be particularly urgent for an introduction to the topic.) There is furthermore a section on sequential Monte Carlo, including the Kalman filter and particle filters, in the spirit of Pitt and Shephard (1999). This chapter is thus rather ambitious in the amount of material covered with a mere 25 pages. Consensus Monte Carlo is even mentioned in the exercise section.

“This and other aspects that could be criticized should not prevent one from using this [Bayes factor] method in some contexts, with due caution.”

Chapter 5 turns back to inference with model assessment. Using Bayesian p-values for model assessment. (With an harmonic mean spotted in Example 5.1!, with no warning about the risks, except later in 5.3.2.) And model comparison. Presenting the whole collection of xIC information criteria. from AIC to WAIC, including a criticism of DIC. The chapter feels somewhat inconclusive but methinks this is the right feeling on the current state of the methodology for running inference about the model itself.

“Hint: There is a very easy answer.”

Chapter 6 is also a mostly standard introduction to Metropolis-Hastings algorithms and the Gibbs sampler. (The argument given later of a Metropolis-Hastings algorithm with acceptance probability one does not work.) The Gibbs section also mentions demarginalization as a [latent or auxiliary variable] way to simulate from complex distributions [as we do], but without defining the notion. It also references the precursor paper of Tanner and Wong (1987). The chapter further covers slice sampling and Hamiltonian Monte Carlo, the later with sufficient details to lead to reproducible implementations. Followed by another standard section on convergence assessment, returning to the 1990’s feud of single versus multiple chain(s). The exercise section gets much larger than in earlier chapters with several pages dedicated to most problems. Including one on ABC, maybe not very helpful in this context!

“…dimension padding (…) is essentially all that is to be said about the reversible jump. The rest are details.”

The next chapter is (somewhat logically) the follow-up for trans-dimensional problems and marginal likelihood approximations. Including Chib’s (1995) method [with no warning about potential biases], the spike & slab approach of George and McCulloch (1993) that I remember reading in a café at the University of Wyoming!, the somewhat antiquated MC³ of Madigan and York (1995). And then the much more recent array of Bayesian lasso techniques. The trans-dimensional issues are covered by the pseudo-priors of Carlin and Chib (1995) and the reversible jump MCMC approach of Green (1995), the later being much more widely employed in the literature, albeit difficult to tune [and even to comprehensively describe, as shown by the algorithmic representation in the book] and only recommended for a large number of models under comparison. Once again the exercise section is most detailed, with recent entries like the EM-like variable selection algorithm of Ročková and George (2014).

The book also includes a chapter on analytical approximations, which is also the case in ours [with George Casella] despite my reluctance to bring them next to exact (simulation) methods. The central object is the INLA methodology of Rue et al. (2009) [absent from our book for obvious calendar reasons, although Laplace and saddlepoint approximations are found there as well]. With a reasonable amount of details, although stopping short of implementable reproducibility. Variational Bayes also makes an appearance, mostly following the very recent Blei et al. (2017).

The gem and originality of the book are primarily to be found in the final and ninth chapter where four software are described, all with interfaces to R: OpenBUGS, JAGS, BayesX, and Stan, plus R-INLA which is processed in the second half of the chapter (because this is *not* a simulation method). As in the remainder of the book, the illustrations are related to medical applications. Worth mentioning is the reminder that BUGS came in parallel with Gelfand and Smith (1990) Gibbs sampler rather than as a consequence. Even though the formalisation of the Markov chain Monte Carlo principle by the later helped in boosting the power of this software. (I also appreciated the mention made of Sylvia Richardson’s role in this story.) Since every software is illustrated in depth with relevant code and output, and even with the shortest possible description of its principle and *modus vivendi*, the chapter is 60 pages long [and missing a comparative conclusion]. Given my total ignorance of the very existence of the BayesX software, I am wondering at the relevance of its inclusion in this description rather than, say, other general R packages developed by authors of books such as Peter Rossi. The chapter also includes a description of CODA, with an R version developed by Martin Plummer [now a Warwick colleague].

In conclusion, this is a high-quality and all-inclusive introduction to Bayesian statistics and its computational aspects. By comparison, I find it much more ambitious and informative than Albert’s. If somehow less pedagogical than the thicker book of Richard McElreath. (The repeated references to Paulino et al. (2018) in the text do not strike me as particularly useful given that this other book is written in Portuguese. Unless an English translation is in preparation.)

*Disclaimer: this book was sent to me by CUP for endorsement and here is what I wrote in reply for a back-cover entry:*

An introduction to computational Bayesian statistics cooked to perfection, with the right mix of ingredients, from the spirited defense of the Bayesian approach, to the description of the tools of the Bayesian trade, to a definitely broad and very much up-to-date presentation of Monte Carlo and Laplace approximation methods, to an helpful description of the most common software. And spiced up with critical perspectives on some common practices and an healthy focus on model assessment and model selection. Highly recommended on the menu of Bayesian textbooks!

And this review is likely to appear in CHANCE, in my book reviews column.

## revisiting the Gelman-Rubin diagnostic

Posted in Books, pictures, Statistics, Travel, University life with tags ABCruise, asymptotic variance, convergence diagnostics, effective sample size, Gelman-Rubin statistic, Gulf of Bothnia, independence, MCMC, MCMC convergence, Monte Carlo Statistical Methods, stopping rule, subsampling, sunset, Titanic on January 23, 2019 by xi'an**J**ust before Xmas, Dootika Vats (Warwick) and Christina Knudson arXived a paper on a re-evaluation of the ultra-popular 1992 Gelman and Rubin MCMC convergence diagnostic. Which compares within-variance and between-variance on parallel chains started from hopefully dispersed initial values. Or equivalently an under-estimating and an over-estimating estimate of the MCMC average. In this paper, the authors take advantage of the variance estimators developed by Galin Jones, James Flegal, Dootika Vats and co-authors, which are batch mean estimators consistently estimating the asymptotic variance. They also discuss the choice of a cut-off on the ratio R of variance estimates, i.e., how close to one need it be? By relating R to the effective sample size (for which we also have reservations), which gives another way of calibrating the cut-off. The main conclusion of the study is that the recommended 1.1 bound is too large for a reasonable proximity to the true value of the Bayes estimator *(Disclaimer: The above ABCruise header is unrelated with the paper, apart from its use of the Titanic dataset!)
*

In fact, I have other difficulties than setting the cut-off point with the original scheme as a way to assess MCMC convergence or lack thereof, among which

- its dependence on the parameterisation of the chain and on the estimation of a specific target function
- its dependence on the starting distribution which makes the time to convergence not absolutely meaningful
- the confusion between getting to stationarity and exploring the whole target
- its missing the option to resort to subsampling schemes to attain pseudo-independence or scale time to convergence (albeit see 3. above)
- a potential bias brought by the stopping rule.