the BUGS Book [guest post]
(My colleague Jean-Louis Foulley, now at I3M, Montpellier, kindly agreed to write a review of the BUGS book for CHANCE. Here is the review, en avant-première! Watch out, it is fairly long and exhaustive! References will be available in the published version. The additions of book covers with BUGS in the title and of the corresponding Amazon links are mine!)
If ever a book has been eagerly awaited in the world of statistics, it is surely this one. Many people have been expecting it for more than 20 years, ever since the WinBUGS software came into use. Therefore, the tens of thousands of users of WinBUGS are indebted to the leading team of the BUGS project (D Lunn, C Jackson, N Best, A Thomas and D Spiegelhalter) for having eventually succeeded in finalizing the writing of this book and for making sure that the long-held expectations are not dashed.
As well explained in the Preface, the BUGS project initiated at Cambridge was a very ambitious one and at the forefront of the MCMC movement that revolutionized the development of Bayesian statistics in the early 90’s after the pioneering publication of Gelfand and Smith on Gibbs sampling.
This book comes out after several textbooks have already been published in the area of computational Bayesian statistics using BUGS and/or R (Gelman and Hill, 2007; Marin and Robert, 2007; Ntzoufras, 2009; Congdon, 2003, 2005, 2006, 2010; Kéry, 2010; Kéry and Schaub, 2011 and others). It is neither a theoretical book on the foundations of Bayesian statistics (e.g. Bernardo and Smith, 1994; Robert, 2001) nor an academic textbook on Bayesian inference (Gelman et al, 2004; Carlin and Louis, 2008). Instead, it reflects very well the aims and spirit of the BUGS project and is meant to be a manual “for anyone who would like to apply Bayesian methods to real-world problems”.
In spite of its appearance, the book is not elementary. On the contrary, it addresses most of the critical issues faced by statisticians who want to apply Bayesian statistics in a clever and autonomous manner. Although very dense, its typical fluid British style of exposition based on real examples and simple arguments helps the reader to digest without too much pain such ingredients as regression and hierarchical models, model checking and comparison and all kinds of more sophisticated modelling approaches (spatial, mixture, time series, non linear with differential equations, non parametric, etc…).
The book consists of twelve chapters and three appendices specifically devoted to BUGS (A: syntax; B: functions and C: distributions) which are very helpful for practitioners. The book is illustrated with numerous examples. The exercises are well presented and explained, and the corresponding code is made available on a web site.
Chapter 1 (Introduction: Probability and Parameters) reminds us of the basics of probability theory. It emphasizes the originality of Bayesian statistics by considering probability (density) distribution to express uncertainty about our knowledge of parameters. Beginning with an example on mortality rates for high risk operations, the authors point out the advantages of the Bayesian approach to answer practical questions as compared to the classical approach based on estimates, confidence intervals and hypothesis testing. In that respect, the old dilemma of classical statistics “fixed vs. random” does not arise the same way here as most parameters are basically fixed but also unknown and consequently treated as random variables. I like the last two sections of this chapter devoted to the question of how to calculate quantities pertaining to probability distributions from different options (exact algebraic, exact numeric, physical experimentation and Monte Carlo simulation) which are usually overlooked by many of us.
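To make the contrast between the "exact algebraic" and "Monte Carlo simulation" routes concrete, here is a minimal Python sketch. The chosen quantity (the probability of at least 8 heads in 10 tosses of a fair coin), the function names and the number of draws are illustrative assumptions, not material quoted from the book.

```python
import math
import random


def exact_tail_prob(n: int, k: int, p: float = 0.5) -> float:
    """Exact P(X >= k) for X ~ Binomial(n, p), by summing the probability mass function."""
    return sum(math.comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


def monte_carlo_tail_prob(n: int, k: int, p: float = 0.5,
                          draws: int = 100_000, seed: int = 1) -> float:
    """Monte Carlo estimate of P(X >= k): simulate n tosses many times and count."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(draws):
        heads = sum(rng.random() < p for _ in range(n))
        if heads >= k:
            hits += 1
    return hits / draws


exact = exact_tail_prob(10, 8)          # algebraic route: 56/1024
approx = monte_carlo_tail_prob(10, 8)   # simulation route, subject to Monte Carlo error
print(f"exact = {exact:.5f}, Monte Carlo = {approx:.5f}")
```

The simulated value only approximates the exact one, with an error that shrinks roughly as the inverse square root of the number of draws.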
Chapter 2 shows us how to carry out Monte Carlo simulations using BUGS for general purposes. I am not sure that many people know that BUGS can be used as a pure simulator of stochastic phenomena as well as for posterior inference from data. The chapter starts with an introduction to the BUGS language underlining the main characteristics of this graphical approach (directed acyclic graphs, DAG, with conditional independence) in contrast to a traditional sequential language. Here lies an essential aspect of BUGS explaining its generality and power but also its importance for the user, as a grasp of the DAG structure is essential to understand the syntax of the language (the order of statements in the code does not matter except for loop constraints) and to help users write their own programs efficiently.
Chapter 3 sets up the framework for Bayesian inference by introducing Bayes’ theorem first in terms of events and secondly in terms of densities and then illustrates it with simple but well chosen examples on binomial and normal data with conjugate priors. The chapter ends with a comparison of Bayesian and classical approaches. A clear distinction is made between the Fisherian approach based on the concept of likelihood and P-values pertaining solely to the null hypothesis and the Neyman-Pearson philosophy derived from decision theory with error rates in the test of the null hypothesis vs. an alternative. Oddly enough for the layman, the authors show undeniably that the Bayes approach has much more in common with the approach of Fisher than with the approach of Neyman-Pearson although both statistical schools are usually lumped together as frequentist.
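For readers who want to see the conjugate binomial case in action, the following Python sketch applies the standard Beta-binomial update: a Beta(a, b) prior combined with y successes in n trials gives a Beta(a + y, b + n - y) posterior. The prior parameters and the data (7 successes in 20 trials) are hypothetical values chosen for illustration, not an example from the book.

```python
def beta_binomial_update(a: float, b: float, successes: int, trials: int) -> tuple[float, float]:
    """Posterior Beta parameters after observing binomial data under a Beta(a, b) prior."""
    return a + successes, b + (trials - successes)


def beta_mean(a: float, b: float) -> float:
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


# Hypothetical inputs: uniform Beta(1, 1) prior, 7 successes out of 20 trials.
prior = (1.0, 1.0)
successes, trials = 7, 20
posterior = beta_binomial_update(*prior, successes, trials)
post_mean = beta_mean(*posterior)

# The posterior mean is a weighted average of the prior mean and the sample
# proportion, with the prior weighted by its "effective sample size" a + b.
w_prior = sum(prior) / (sum(prior) + trials)
weighted = w_prior * beta_mean(*prior) + (1 - w_prior) * successes / trials
print(posterior, post_mean, weighted)
```

The weighted-average form is what makes the "prior as implicit data" reading possible: a + b plays the role of a prior sample size.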
Chapter 4 teaches us how the BUGS simulation machine is built and how it works. The engine is designed according to one of the most popular MCMC methods, namely Gibbs sampling relying on specification of all the conditional distributions of the stochastic nodes. This choice is well explained and justified on account of its generality and applicability to a wide range of situations and its special relevance to directed acyclic graphs (DAG). It is well known that in a DAG, the full conditional probability distribution of any unobserved node depends only on information contained in the three sets of neighbouring nodes (its parents, its children and the other parents of its children) called the Markov blanket. All this is processed automatically by the BUGS software via an expert system that is able to determine whether the conditional distribution belongs to an available closed form or not. In the latter case, a surrogate sampler (e.g. Metropolis random walk) for that distribution is proposed. The last four sections discuss very important practical issues for the user such as specification of initial values, detection and testing of convergence, assessment of accuracy of posterior means and selection of length and number of chains. In short, an excellent chapter for any bugser!
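As a rough illustration of what Gibbs sampling from full conditionals means, here is a hand-coded Python sketch for a normal sample with unknown mean and precision. It is not how BUGS is implemented. The model, the priors (mu ~ N(m0, 1/p0), tau ~ Gamma(a0, rate b0)), the simulated data and all names are assumptions made for this example.

```python
import random
import statistics


def gibbs_normal(y, n_iter=5000, burn_in=1000,
                 m0=0.0, p0=1e-4, a0=0.1, b0=0.1, seed=42):
    """Gibbs sampler for y_i ~ N(mu, 1/tau) with semi-conjugate priors.

    Each iteration draws mu and tau in turn from their full conditionals.
    """
    rng = random.Random(seed)
    n, sum_y = len(y), sum(y)
    mu, tau = statistics.fmean(y), 1.0  # initial values
    draws = []
    for it in range(n_iter):
        # Full conditional of mu given tau: normal, precision p0 + n * tau.
        prec = p0 + n * tau
        mean = (p0 * m0 + tau * sum_y) / prec
        mu = rng.gauss(mean, prec ** -0.5)
        # Full conditional of tau given mu: Gamma(a0 + n/2, rate b0 + SS/2).
        ss = sum((yi - mu) ** 2 for yi in y)
        tau = rng.gammavariate(a0 + n / 2, 1.0 / (b0 + ss / 2))  # second arg is a scale
        if it >= burn_in:
            draws.append((mu, tau))
    return draws


# Simulated data (hypothetical): 50 observations from N(10, sd = 2).
data_rng = random.Random(0)
y = [data_rng.gauss(10.0, 2.0) for _ in range(50)]

draws = gibbs_normal(y)
mu_hat = statistics.fmean(m for m, _ in draws)
sigma_hat = statistics.fmean(t ** -0.5 for _, t in draws)
print(f"posterior mean of mu ~ {mu_hat:.2f}, of sigma ~ {sigma_hat:.2f}")
```

In this toy model both full conditionals have closed forms. That is the situation the BUGS expert system looks for before falling back on a surrogate sampler.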
In Chapter 5, the authors tackle the crucial issue of prior specification without hiding the inherent difficulties and implications. They distinguish the case of “non-informative” from the case of “informative” priors. In the former option, they focus on uniform and Jeffreys’ priors by emphasizing the difficulty presented by improper priors (dflat(.) in WinBUGS) and proper priors with arbitrarily large supports e.g. dnorm(0,0.0001) and dunif(-100,+100). They successively review location, proportion, count and scale parameters. The tutorial and the example (5.2.1) on proportions are especially interesting but at the same time may be quite challenging for the beginner. In order to explain better the relationships between priors set up on different scales, it would have been helpful to show why the uniform prior of the proportion corresponds on the logit scale to the logistic distribution because of the inverse CDF transformation. In the same way, I would have liked the authors to show us on Figure 5.1 the equivalence between Jeffreys’ prior on θ i.e. a Beta(0.5,0.5) with its uniformly distributed arcsin √θ transform, as an illustration of a more general property of Jeffreys’ priors namely a primitive of √J(θ) leading to a uniform metric. The paragraph on scale parameters was reduced to the essential minimum with a further discussion delayed to Chapter 10 on hierarchical models. However, caution is mandatory with respect to the use of the degenerate Gamma(ε,ε) as a surrogate of Jeffreys’ prior on the standard deviation since ε should not be set automatically, say to a value of 0.001, but needs some calibration from the residual sum of squares ∑ (yᵢ-μ)².
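The two change-of-scale statements above follow from the usual change-of-variable rule. The derivation below is a sketch added for the reader, not a reproduction of the book's argument. The symbols φ (logit of θ) and ψ (transformed parameter) are notation introduced here, and J(θ) denotes the Fisher information as in the text.

```latex
% (i) Uniform prior on a proportion, seen on the logit scale
\[
\theta \sim \mathrm{U}(0,1), \quad \phi = \log\frac{\theta}{1-\theta}
\;\Longrightarrow\;
\Pr(\phi \le x) = \Pr\!\left(\theta \le \frac{e^{x}}{1+e^{x}}\right) = \frac{e^{x}}{1+e^{x}},
\qquad
p(\phi) = \frac{e^{\phi}}{\left(1+e^{\phi}\right)^{2}},
\]
% which is the standard logistic distribution.

% (ii) Jeffreys' prior Beta(1/2, 1/2), seen on the arcsine square-root scale
\[
p(\theta) = \frac{1}{\pi\sqrt{\theta(1-\theta)}}, \quad
\psi = \arcsin\sqrt{\theta}, \quad
\frac{d\psi}{d\theta} = \frac{1}{2\sqrt{\theta(1-\theta)}}
\;\Longrightarrow\;
p(\psi) = p(\theta)\left|\frac{d\theta}{d\psi}\right| = \frac{2}{\pi},
\qquad 0 < \psi < \frac{\pi}{2}.
\]

% (iii) General property: a primitive of sqrt(J) yields a uniform metric
\[
p(\theta) \propto \sqrt{J(\theta)}, \quad
\psi(\theta) = \int^{\theta}\!\sqrt{J(t)}\,dt
\;\Longrightarrow\;
p(\psi) \propto \sqrt{J(\theta)}\left|\frac{d\theta}{d\psi}\right| = 1 .
\]
```

For binomial data, J(θ) = n/[θ(1-θ)], so the primitive in (iii) is proportional to arcsin √θ, which recovers (ii).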
The section on the representation of informative priors is well documented and well presented. Regarding elicitation of subjective priors, the interpretation of such priors for conjugate forms in terms of “implicit data” via a prior estimate and an effective prior sample size turns out to be very helpful to convey such information. It greatly facilitates discussion between statisticians and experts. The section on analysis of sensitivity to prior specification concludes the chapter with two very convincing examples from a clinical trial and from the study of the number of urban trams. In the latter example, the comparison of a discrete uniform prior on this number N vs. Jeffreys’ prior proportional to the reciprocal of N clearly highlights how dangerous some choices of priors are and how robust results are vis-à-vis some assumptions (here the value of the upper bound of N). Incidentally, if the ML estimator of N (100) makes no sense, it can be noted that the moment based estimation (2N̂-1) i.e. 199 is quite close to the posterior median estimators (197 and 200) based on Jeffreys’ prior.
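The sensitivity of the tram problem to the prior can be reproduced with a few lines of Python. In this sketch, a single observed tram number y has likelihood 1/N for N ≥ y, and the prior on N is proportional to N^(-k) up to an upper bound (k = 0 for the discrete uniform, k = 1 for the reciprocal prior). The upper bounds 5000 and 15000 are hypothetical choices made here and are not necessarily those used in the book, so the medians need not match the figures quoted above exactly.

```python
def posterior_median(y_obs: int, upper: int, prior_power: int) -> int:
    """Posterior median of N given one observed tram number y_obs.

    Likelihood: 1/N for N >= y_obs. Prior: proportional to N**(-prior_power)
    on {1, ..., upper}. The posterior is therefore proportional to
    N**(-(prior_power + 1)) on {y_obs, ..., upper}.
    """
    support = range(y_obs, upper + 1)
    weights = [n ** (-(prior_power + 1)) for n in support]
    half = sum(weights) / 2
    cumulative = 0.0
    for n, w in zip(support, weights):
        cumulative += w
        if cumulative >= half:
            return n
    return upper


y_obs = 100
for upper in (5000, 15000):  # hypothetical upper bounds on N
    uniform_med = posterior_median(y_obs, upper, prior_power=0)
    reciprocal_med = posterior_median(y_obs, upper, prior_power=1)
    print(f"upper = {upper}: uniform prior -> {uniform_med}, 1/N prior -> {reciprocal_med}")
```

Under the uniform prior the median moves by several hundred when the upper bound changes. Under the 1/N prior it stays within a few units of 2y - 1.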
Chapter 6 provides an overview of the treatment of regression including linear, nonlinear, and generalized linear models. The approach is standard (restricted to fixed-type models), but the authors show very well via convincing examples how Bayesian modeling can easily handle such issues as non normal errors, outliers, parameter constraints, functions of parameters and predictions. They also briefly introduce regression for multivariate responses via an example (6.4.1) on longitudinal data involving a residual variance-covariance matrix among responses at several measurement times. In this scenario, as honestly admitted by the authors, BUGS offers very little choice but the (inverse) Wishart distribution, the limitations of which must not be overlooked. The last section on further reading is especially welcome as it refers to textbooks dealing with theoretical and practical points not covered in this chapter, e.g. variable selection.
After dealing with simple examples with binomial and count data, Chapter 7 goes further into the analysis of categorical data. It starts with 2×2 contingency tables under different situations with fixed margins (none, one and both) and illustrates them with the famous, but still controversial experiment of Fisher of the tea-tasting lady (see a recent account of the story by Stephen Senn, 2012). I must confess that I still do not feel at ease with this example due to the multiplicity of possible prior specifications and the difficult issue of conditioning vs. ancillarity which do not fall into Bayesian and frequentist inferences the same way. Multinomial models (nominal and ordinal) without or with covariate adjustment are then tackled and the equivalence between the logistic multinomial and Poisson models is mentioned. A genetic example is used to illustrate how such concepts as composite link functions arising in GLMs (Thompson and Baker, 1981) can be solved easily with BUGS. The domain of categorical data is so huge that it would easily deserve a full volume and the list of references for further reading (e.g. Congdon, 2005) is appreciated.
Next, the reader faces a very long chapter (#8, 45 pages) on crucial but also difficult and sometimes controversial issues pertaining to model checking and comparison. The distinction between these two matters, although often ignored or overlooked by many of us, is clearly made here. Techniques based on residuals and predictive checks of the model at hand are successively considered, in the latter case, with the concern of checking the model on data not used to fit it. A key aspect illustrated by some examples (e.g. 8.4.6 on assurance claims) turns out to be the choice of the discrepancy function for an appropriate focus on potential deficiencies of the model as well. As expected, in model comparison, the main emphasis is on the Deviance Information Criterion (DIC) which is the basic criterion proposed by the BUGS software to carry out such comparisons. The authors have to be commended for disclosing both positive and negative aspects of this criterion and alternatives. Non-invariance to reparameterisation, inadequacy of inclusion of categorical parameters (e.g. in mixtures) and problems with marginal vs. conditional model specification are very serious limitations that users can no longer ignore. Several alternatives to DIC are presented with respect to either the measure of penalty due to “optimism” in using data twice (Gelman, 2004; Plummer, 2008) or to other general criteria such as the pseudo marginal likelihood, the Bayes factor (BF) and its BIC approximation. I was glad to see a section on the Lindley paradox showing opposite conclusions drawn by Bayesians and frequentists in testing sharp null hypotheses. But I would have expected more comments on why this conflict is only “apparent” and why it makes many Bayesians so uncomfortable (Gelman and Shalizi, 2013). It would have been helpful to remind us that setting the significance level for rejecting H0 at a constant value (e.g. the famous 5%!) no matter what the sample size N is does not make sense.
This has been recognized for a long time by many authors who noted “the diminishing significance of a fixed p-value when sample size increased” (Zellner, 1987; Good, 1988). I would also have liked to see the connection between the BF and the penalty measure p log N in the BIC which personally intrigued me for a long time until I realized that it has to do with the need for the deviance as a measure of evidence to be adjusted for N. Finally, as shown by the authors, much care ought to be devoted to the choice of priors under the alternative H1 regarding in particular their magnitude of variation (avoiding a value of c too large as shown in § 8.7.1). Anyway, computing Bayes factors remains a challenge full of traps (e.g. the harmonic mean formula) requiring specialized algorithms (path and bridge sampling, nested sampling, power posteriors, Chib’s method,…). The section on Bayesian model uncertainty and averaging also opens interesting perspectives relating to bootstrapping techniques and expanded models that can efficiently supplement the toolbox of practitioners.
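The Lindley paradox and the "diminishing significance of a fixed p-value" can be illustrated numerically. The Python sketch below assumes a textbook setting that is not taken from the book: a normal mean with known σ, H0: μ = 0 against H1: μ ~ N(0, τ²), and a z-statistic held fixed at 1.96 while the sample size grows. The function names and the values of τ and σ are hypothetical.

```python
import math


def two_sided_p_value(z: float) -> float:
    """Two-sided p-value of a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def bayes_factor_01(z: float, n: int, tau: float = 1.0, sigma: float = 1.0) -> float:
    """Bayes factor of H0: mu = 0 against H1: mu ~ N(0, tau^2).

    The sample mean is N(mu, sigma^2 / n) and z = sqrt(n) * mean / sigma.
    With r = n * tau^2 / sigma^2:  BF01 = sqrt(1 + r) * exp(-z^2 / 2 * r / (1 + r)).
    """
    r = n * tau**2 / sigma**2
    return math.sqrt(1 + r) * math.exp(-0.5 * z**2 * r / (1 + r))


z = 1.96  # "just significant at 5%" whatever the sample size
sample_sizes = [10, 100, 1_000, 10_000, 100_000]
bayes_factors = [bayes_factor_01(z, n) for n in sample_sizes]
for n, bf in zip(sample_sizes, bayes_factors):
    print(f"n = {n:>6}: p-value = {two_sided_p_value(z):.3f}, BF01 = {bf:.2f}")
```

The p-value stays at about 0.05 while the Bayes factor increasingly favours H0 as n grows, since BF01 scales like √n for fixed z. Increasing τ has the same effect, which is why an overly diffuse prior under H1 is dangerous.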
In practice, most statisticians are confronted with real data sets requiring non-standard statistical techniques. Chapter 9 illustrates how Bayesian statistics and BUGS can easily address such issues. I especially liked the sections on missing covariates, measurement errors (both classical and Berkson-type) and censored vs. truncated data, the difference between these two types of data being not obvious as well as the way to handle each of them in BUGS. For sure, this chapter will make many of us happy thanks to all the examples which help a lot to figure out exactly what to do using a battery of specialized functions and several ingenious devices such as cut(.), C(.), I(.), rank(.), pick, zeros and ones tricks.
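Since the difference between censoring and truncation is easy to blur, here is a small Python sketch of the two log-likelihoods for normal data with an upper limit. It describes the statistical distinction only, not BUGS syntax. The function names and the numerical values are hypothetical.

```python
import math


def norm_logpdf(y: float, mu: float, sigma: float) -> float:
    """Log density of N(mu, sigma^2) at y."""
    z = (y - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2 * math.pi)


def norm_cdf(y: float, mu: float, sigma: float) -> float:
    """Cumulative distribution function of N(mu, sigma^2) at y."""
    return 0.5 * math.erfc(-(y - mu) / (sigma * math.sqrt(2)))


def loglik_right_censored(observed, n_censored, limit, mu, sigma):
    """Censoring: units above `limit` are known to exist but their values are not.

    Each censored unit contributes log P(Y > limit).
    """
    ll = sum(norm_logpdf(y, mu, sigma) for y in observed)
    return ll + n_censored * math.log(1.0 - norm_cdf(limit, mu, sigma))


def loglik_right_truncated(observed, limit, mu, sigma):
    """Truncation: units above `limit` never enter the sample at all.

    Each observed unit's density is renormalised by P(Y <= limit).
    """
    ll = sum(norm_logpdf(y, mu, sigma) for y in observed)
    return ll - len(observed) * math.log(norm_cdf(limit, mu, sigma))


observed = [0.2, 0.9, 1.4]  # hypothetical values below the limit
limit, mu, sigma = 2.0, 1.0, 1.0
plain = sum(norm_logpdf(y, mu, sigma) for y in observed)
censored = loglik_right_censored(observed, n_censored=2, limit=limit, mu=mu, sigma=sigma)
truncated = loglik_right_truncated(observed, limit=limit, mu=mu, sigma=sigma)
print(plain, censored, truncated)
```

Censoring adds a tail-probability term for each unit known to lie beyond the limit. Truncation instead rescales the density of every observed unit, so the two likelihoods generally lead to different inferences.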
Hierarchical models occupy a special niche in Bayesian statistics. They take advantage of the hierarchical structure of many populations organized as clusters. They are also well suited to take into account uncertainty in the parameters of priors by specifying additional priors on them at possibly several nested levels. Therefore, including a special chapter (Chapter 10) on them was an apposite initiative even though many tools deployed here have already been presented. I especially appreciated the complements given here on how to specify priors for variance components e.g. by using Gelman’s half-Cauchy on standard deviation or by modeling them as in regression with explanatory variables via log-links (Foulley et al, 1992). However, many of us will still regret that BUGS does not offer more choice (as an alternative to the inverse Wishart) for priors on variance covariance matrices except for special models. Notice also the short paragraph (10.5) on the value of Bayesian Parameter Expansion (PX) implementation to improve the convergence of MCMC chains. The chapter ends with a thorough discussion and illustration of the difficult issues of checking and comparing such models with pertinent guidelines regarding which criteria to use according to the purposes of validation and comparison.
Chapter 11 is a long chapter (44 pages) devoted to what the authors call “specialized models”. It will delight advanced users of BUGS as it covers no fewer than eight different areas (including survival data, time series, spatial statistics, ODE, mixtures, semi parametric regression and Dirichlet processes). It is rather impressive that BUGS can handle such diverse and sometimes sophisticated models although, as honestly pointed out by the authors, the BUGS machinery might not be optimum for running some problems. Some families of models have motivated the development of dedicated interfaces such as GeoBUGS for spatial statistics and WBDiff for ODE models. However, implementing and mastering such models requires a very good knowledge of both model theory and “idiosyncrasies” of the language. A good example of this lies in the finite mixture models that require more care than anticipated (influence of identifiability constraints chosen on parameters to avoid label switching; DIC not allowed in model comparison with standard coding in WinBUGS and OpenBUGS).
The final chapter (Chapter #12) is devoted to the BUGS software. This is actually a generic name covering three different “engines”: the original one, WinBUGS, now frozen in its 1.4.3 version but still available and functional; its child, OpenBUGS, on which new development is concentrated; and the last one, JAGS, developed completely independently from the first two by Martin Plummer. Common and different features of these three versions are reviewed in this chapter concerning the syntax, the data format and the interfaces, emphasizing both positive and negative sides. The main differences are between Win-OpenBUGS and JAGS. The latter includes some functions and distributions that allow more flexibility of model specification. OpenBUGS, however, offers the largest collection of samplers and will soon contain an executable program for parallel programming. Each engine can be run either interactively or directly via script facilities or using dedicated R interfaces (R2WinBUGS, BRugs and rjags respectively), which opens a lot of possibilities for further manipulations of outputs.
The book ends with three appendices (A, B and C) which describe in detail the language, functions and distributions in BUGS and usefully supplement the core of the book.
One may discuss the organization of the book but, as always, a sequential volume cannot perfectly reflect the complexity of the knowledge network involved in this topic. The approach adopted here has the merit of gradually guiding people to solving more and more complex models. In addition, the book benefits from its homogeneous, crystal clear, and concise style and strikes the right balance between advanced theory and pure practice. I especially like the numerous examples given in the successive chapters which always help readers to figure out what is going on and give them new ideas to improve their BUGS skills.
In conclusion, it turns out that “The BUGS book” is not only a major textbook on a topical subject, but it is also a mandatory one for all statisticians willing to learn and analyze data with Bayesian statistics at any level. It will be the companion and reference book for all users (beginners or advanced) of the BUGS software. I have no doubt it will meet the same success as BUGS and become very soon a classic in the literature of computational Bayesian statistics.