Archive for programming

the BUGS Book [guest post]

Posted in Books, R, Statistics on February 25, 2013 by xi'an

(My colleague Jean-Louis Fouley, now at I3M, Montpellier, kindly agreed to write a review of the BUGS book for CHANCE. Here is the review, as a preview! Watch out, it is fairly long and exhaustive! References will be available in the published version. The additions of book covers with BUGS in the title and of the corresponding Amazon links are mine!)

If ever a book was eagerly awaited in the world of statistics, it is surely this one. Many people have been waiting for it for more than 20 years, ever since the WinBUGS software came into use. The tens of thousands of users of WinBUGS are therefore indebted to the leading team of the BUGS project (D. Lunn, C. Jackson, N. Best, A. Thomas and D. Spiegelhalter) for having finally completed this book and for making sure that such long-held expectations are not dashed.

As well explained in the Preface, the BUGS project initiated at Cambridge was a very ambitious one, at the forefront of the MCMC movement that revolutionized the development of Bayesian statistics in the early 1990s, after the pioneering publication of Gelfand and Smith on Gibbs sampling.

This book comes out after several textbooks have already been published in the area of computational Bayesian statistics using BUGS and/or R (Gelman and Hill, 2007; Marin and Robert, 2007; Ntzoufras, 2009; Congdon, 2003, 2005, 2006, 2010; Kéry, 2010; Kéry and Schaub, 2011; and others). It is neither a theoretical book on the foundations of Bayesian statistics (e.g., Bernardo and Smith, 1994; Robert, 2001) nor an academic textbook on Bayesian inference (Gelman et al., 2004; Carlin and Louis, 2008). Instead, it reflects very well the aims and spirit of the BUGS project and is meant to be a manual “for anyone who would like to apply Bayesian methods to real-world problems”.

In spite of its appearance, the book is not elementary. On the contrary, it addresses most of the critical issues faced by statisticians who want to apply Bayesian statistics in a clever and autonomous manner. Although very dense, its typically fluid British style of exposition, based on real examples and simple arguments, helps the reader digest without too much pain such ingredients as regression and hierarchical models, model checking and comparison, and all kinds of more sophisticated modelling approaches (spatial, mixture, time series, nonlinear with differential equations, nonparametric, etc.).

The book consists of twelve chapters and three appendices specifically devoted to BUGS (A: syntax; B: functions; C: distributions), which are very helpful for practitioners. The book is illustrated with numerous examples. The exercises are well presented and explained, and the corresponding code is made available on a web site.
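To give readers a concrete feel for what “applying Bayesian methods to real-world problems” looks like in practice, here is a minimal sketch of my own (not an example from the book): a Bayesian linear regression written in the BUGS modelling language and run through rjags, whose dialect is essentially that of WinBUGS.

library(rjags)   # JAGS understands (essentially) the BUGS modelling language

# a toy BUGS-style model: normal linear regression with vague priors
model_string <- "
model {
  for (i in 1:N) {
    mu[i] <- alpha + beta * x[i]
    y[i] ~ dnorm(mu[i], tau)   # BUGS parameterises the normal by its precision
  }
  alpha ~ dnorm(0, 1.0E-6)
  beta  ~ dnorm(0, 1.0E-6)
  tau   ~ dgamma(0.001, 0.001)
}"

set.seed(1)
N <- 50
x <- runif(N)
y <- 1 + 2 * x + rnorm(N, sd = 0.3)   # simulated data with known truth

jm <- jags.model(textConnection(model_string),
                 data = list(y = y, x = x, N = N), n.chains = 3)
update(jm, 1000)                                      # burn-in
out <- coda.samples(jm, c("alpha", "beta", "tau"), n.iter = 5000)
summary(out)

The declarative style, where the model is stated once and the software works out the sampling scheme, is precisely what made BUGS so popular with applied users.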

ISBA 2012 [#3]

Posted in Statistics, Travel, University life on June 29, 2012 by xi'an

A third and again very intense day at ISBA 2012: as Steve Scott said, “we are getting Bayes-ed out”… It started for me with Robert Kohn's particle filter session, where Julien Cornebise gave us programming recommendations to improve our code, its performance, and the overall impact of our research, passionately pleading for an object-oriented approach that would make everything we program much more portable. Scott Sisson presented a new approach to density estimation for ABC purposes, using first a marginal estimation for each component of the statistic vector, then a normal mixture copula on the normal transforms of the inverse cdfs, and Robert concluded with an extension of PMCMC that eliminates nuisance parameters by importance sampling, a topic we will discuss again when I visit Sydney in two weeks. The second session of the morning was ABC II, where David Nott spoke about the combination of ABC with Bayes linear tools, a paper Scott had presented in Banff last spring, Michael Blum summarised the survey on the selection of summary statistics discussed earlier on the 'Og, Jean-Michel spoke about our (recently accepted) LDA paper, acknowledging our initial (2005) misgivings about ABC (!), and Olie Ratmann concluded the session with a fairly exciting new notion of using a testing perspective to define acceptable draws. While I clearly enjoyed the number of “ABC talks” during this meeting, several attendees mentioned to me that it was a bit overwhelming… Well, my impression is that this conveyed loud and clear the message that ABC is now truly part of the Bayesian toolbox, and that further theoretical exploration would be most welcome.
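For readers curious about the construction, here is a rough sketch of my own of such a marginal-then-copula density estimator, only an illustration of the general idea and certainly not Sisson's actual procedure: empirical marginal cdfs, normal-score transforms, and a normal mixture fitted on the transformed scale with the mclust package.

library(mclust)   # densityMclust() fits the normal mixture

set.seed(1)
n <- 500
x <- cbind(rgamma(n, shape = 2), rexp(n))   # toy 2-d "statistic" sample

# step 1: marginal estimates, one per component of the vector
Fhat <- apply(x, 2, ecdf)      # marginal cdfs
fhat <- apply(x, 2, density)   # marginal kernel densities

# step 2: normal scores z = qnorm(F(x)) and a normal mixture on that scale
z   <- qnorm(apply(x, 2, rank) / (n + 1))
mix <- densityMclust(z)

# joint density at a new point: copula density (fitted mixture over the
# product of standard normals) times the product of marginal densities
joint_dens <- function(x0) {
  u    <- mapply(function(F, v) F(v), Fhat, x0)
  z0   <- qnorm(pmin(pmax(u, 1 / (n + 1)), n / (n + 1)))
  cop  <- predict(mix, newdata = matrix(z0, 1)) / prod(dnorm(z0))
  marg <- prod(mapply(function(d, v) approx(d$x, d$y, v)$y, fhat, x0))
  cop * marg
}

joint_dens(c(1.5, 0.7))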

The afternoon saw another session I was involved in organising, along with Marc Suchard, on parallel computing for Bayesian calculations. Marc motivated the use of GPUs for a huge medical dataset, showing impressive gains in time for a MAP calculation, with promises of a more complete Bayesian processing. Steve Scott gave the distributed-computing version of the session, with Google's requirements for a huge and superfast logistic regression, Jarad Niemi went into the (highly relevant!) details of random processors on GPUs, and Kenichiro McAlinn described an application to portfolio selection using GPUs. (The topic attracted a huge crowd and the room was packed!) I am sorry the parallel session on Bayesian success stories was taking place at the same time, as it related very much to our on-going project with Kerrie Mengersen (we are currently waiting for the return from selected authors). Then it was time for a bit of joint work, along with a succulent matcha ice-cream in Kyoto station, and another fairly exhausting, if high-quality, poster session.
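Nothing of the GPU implementations can be reproduced here, but the embarrassingly parallel nature of many Bayesian computations is easy to convey in plain R; the toy sketch below (mine, using only the base parallel package, and nothing like the GPU code shown in the session) spreads independent simulation batches over several cores.

library(parallel)   # base R, no GPU required

# toy target: posterior mean of theta for theta | y ~ N(y, 1), by simulation
one_batch <- function(batch_size, y = 2.5) {
  mean(rnorm(batch_size, mean = y, sd = 1))   # one independent batch estimate
}

n_cores <- max(1, detectCores() - 1)
cl <- makeCluster(n_cores)
batches <- clusterApply(cl, rep(1e6, n_cores), one_batch)
stopCluster(cl)
mean(unlist(batches))   # combine the independent batch estimates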

I am sorry to miss Friday's sessions (and got “flak” from Arnaud for missing his lecture!), as these were promising as well. (Again, anyone for a guest post?!) Overall, I come home exhausted but richer for the exchanges and for all I learnt from a very good and efficient meeting. Not even mentioning this first experience of Japan. (Written from Kansai Osaka airport on a local machine.)

the Art of R Programming [guest post]

Posted in Books, R, Statistics, University life on January 31, 2012 by xi'an

(This post is the preliminary version of a book review by Alessandra Iacobucci, to appear in CHANCE. Enjoy [both the review and the book]!)

As Rob J. Hyndman enthusiastically declares in his blog, “this is a gem of a book”. I would go even further and argue that The Art of R Programming is a whole mine of gems. The book is well constructed and has a very coherent structure.

After an introductory chapter, where the reader gets a quick overview of R basics that allows her to work through the examples in the following chapters, the rest of the book can be divided into three main parts. In the first part (Chapters 2 to 6), the reader is introduced to the main R objects and to the functions built to handle and operate on each of them. The second part (Chapters 7 to 13) is focused on general programming issues: R's structures and object-oriented nature, I/O, string handling and manipulation, and graphics. Chapter 13 is entirely devoted to debugging. The third part deals with more advanced topics, such as speed of execution and performance issues (Chapter 14), interfacing R with functions written in C (or Python), and parallel processing with R. Even though this last part is intended for more experienced programmers, the overall programming skills of the intended reader “may range anywhere from those of a professional software developer to ‘I took a programming course in college’.” (p. xxii).

With a fluent style, Matloff is able to deal with a large number of topics in a relatively limited number of pages, resulting in an astonishingly complete yet handy guide. On almost every page we discover a new command, most likely the command we had always looked for and had done without, by means of more or less cumbersome workarounds. As a matter of fact, it is possible that there exists a ready-made and perfectly suited R function for nearly anything that comes to one's mind. Users coming from compiled programming languages may find it difficult to get used to this wealth of functions, just as they may feel uncomfortable not declaring variable types, not initializing vectors and arrays, or getting rid of loops. Nevertheless, through numerous examples and a precise knowledge of its strengths and limitations, Matloff masterfully introduces the reader to the flexibility of R. He repeatedly underlines the functional nature of R in every part of the book and stresses from the outset how this feature has to be exploited for effective programming.
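A small example of the style the book promotes (my illustration, not Matloff's): the same computation written first as the loop a C programmer would reach for, then in idiomatic, vectorised and functional R.

# loop version, familiar to programmers coming from compiled languages
squares <- numeric(100)
for (i in 1:100) squares[i] <- i^2

# idiomatic R: vectorised arithmetic, no declarations, no initialisation
squares <- (1:100)^2

# functional constructs for less regular operations
sapply(1:5, function(i) sum(rnorm(i)))   # apply a function over an index
Map(`+`, 1:3, 4:6)                       # element-wise over several lists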

Computational Statistics

Posted in Books, R, Statistics on May 10, 2010 by xi'an

Do not resort to Monte Carlo methods unnecessarily.

When I received this 2009 Springer-Verlag book, Computational Statistics by James Gentle, a while ago, I briefly took a look at the table of contents and decided to have a better look later… Now that I have gone through the whole book, I can write a short review on its scope and contents (to be submitted). Despite its title, the book aims at covering both computational statistics and statistical computing. (With 752 pages at his disposal, Gentle can indeed afford to do both!)

The book Computational Statistics is separated into four parts:

  • Part I: Mathematical and statistical preliminaries.
  • Part II: Statistical Computing (computer storage and arithmetic; algorithms and programming; approximation of functions and numerical quadrature; numerical linear algebra; solution of nonlinear equations and optimization; generation of random numbers).
  • Part III: Methods of Computational Statistics (graphical methods in computational statistics; tools for identification of structure in data; estimation of functions; Monte Carlo methods for statistical inference; data randomization, partitioning, and augmentation; bootstrap methods).
  • Part IV: Exploring Data Density and Relationship (estimation of probability density functions using parametric models; nonparametric estimation of probability density functions; statistical learning and data mining; statistical models of dependencies).

Computational inference, together with exact inference and asymptotic inference, is an important component of statistical methods.

The first part of Computational Statistics is indeed preliminary, containing essentials of mathematics, probability, and statistics. A reader unfamiliar with too many topics within this part should first consider improving his or her background in the corresponding area! This is a rather large part, at 82 pages, and it will not be extremely useful to readers except to signal deficiencies in their background, as noted above. Given this purpose, I am not certain the selected exercises of this part are necessary (especially when considering that some involve tools introduced much later in the book).

The form of a mathematical expression and the way the expression should be evaluated in actual practice may be quite different.

The second part of Computational Statistics is truly about computing, i.e. about using computers for numerical approximation, with discussions of the representation of numbers in computers, approximation errors, and of course random number generators. While I judge George Fishman's Monte Carlo to provide a deeper and more complete coverage of those topics, I appreciate the need for reminding our students of those hardware subtleties, as they often seem unaware of them despite their advanced computer skills. This second part is thus a 250-page crash course on numerical methods (like function approximation by basis functions and …) and on random generators, covering the same ground as Gentle's earlier books, Random Number Generation and Monte Carlo Methods and Numerical Linear Algebra for Applications in Statistics, while the more recent Elements of Computational Statistics looks very much like a shorter entry on the same topics as those of Parts III and IV of Computational Statistics. This part could certainly sustain a whole-semester undergraduate course, while only advanced graduate students could be expected to gain from self-study of those topics. It is nonetheless the most coherent and attractive part of the book, and constitutes a must-read for all students and researchers engaging in any kind of serious programming. Obviously, some notions are introduced a bit too superficially, given the scope of this section (for instance Monte Carlo methods, in particular MCMC techniques, which are introduced in less than six pages), but I came to realise this is the point of the book, which provides an entry into “all” necessary topics, along with links to the relevant literature (if missing Monte Carlo Statistical Methods!). I however deplore that the important issue of Monte Carlo experiments, whose construction is often a hardship for students, is postponed until the 100-page appendix. (That students do not read appendices is, I suspect, another of those folk theorems!)
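Since the construction of such experiments is the stumbling block, here is the generic skeleton I have in mind, a sketch of mine rather than anything taken from the book: a Monte Carlo evaluation of the coverage of the usual t confidence interval when the data are skewed.

set.seed(1)
n_rep <- 1e4   # number of Monte Carlo replications
n     <- 20    # sample size within each replication
mu    <- 1     # true mean of the Exp(1) sampling distribution

covered <- replicate(n_rep, {
  x  <- rexp(n)                       # skewed data with known mean
  ci <- t.test(x, mu = mu)$conf.int   # nominal 95% interval
  ci[1] <= mu && mu <= ci[2]
})
mean(covered)              # Monte Carlo estimate of the actual coverage
sd(covered) / sqrt(n_rep)  # and its Monte Carlo standard error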

Monte Carlo methods differ from other methods of numerical analysis in yielding an estimate rather than an approximation.

The third and fourth parts of the book cover methods of computational statistics, including Monte Carlo methods, randomization and cross-validation, the bootstrap, probability density estimation, and statistical learning. Unfortunately, I find the level of Part III quite uneven: all chapters are short and rather superficial, because they try to be all-encompassing. (For instance, Chapter 8 includes two pages on the RGB colour coding.) Part IV does a better job of presenting machine learning techniques, if not with the thoroughness of Hastie et al.'s The Elements of Statistical Learning: Data Mining, Inference, and Prediction… It seems to me that the relevant sections of Part III would have fitted better in Part IV, where they belong. For instance, Chapter 10 on estimation of functions only covers the evaluation of estimators of functions, postponing the construction of those estimators until Chapter 15. The jackknife is introduced on its own in Chapter 12 (not that I find this introduction ultimately necessary), separately from the bootstrap, which is covered in eight pages in Chapter 13 (the bootstrap for non-iid data being dismissed rather quickly, given the current research in the area). The first chapter of Part IV covers some (non-Bayesian) estimation approaches for parametric families, but I find this chapter somewhat superfluous, as it does not belong to the computational statistics domain (except as an approximation method, as stressed in Section 14.4). While Chapter 16 is a valuable entry on clustering and data-analysis tools like PCA, the final section on high dimensions feels out of context (and mentioning the curse of dimensionality only this close to the end of the book does not seem appropriate). Chapter 17 on dependent data misses the rich literature on graphical models and their use in the determination of dependence structures.
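For the record, the basic bootstrap covered in those eight pages fits in a few lines of R; a bare-bones version of mine, for the standard error of a median, with the jackknife of Chapter 12 alongside for comparison:

set.seed(1)
x <- rexp(50)   # the observed sample
B <- 2000       # number of bootstrap replicates

boot_medians <- replicate(B, median(sample(x, replace = TRUE)))
sd(boot_medians)   # bootstrap standard error of the median

# leave-one-out jackknife analogue
jack_medians <- vapply(seq_along(x), function(i) median(x[-i]), numeric(1))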

Programming is the best way to learn programming (read that again).

In conclusion, Computational Statistics is a very diverse book that can be used at several levels as a textbook, as well as a reference for researchers (even if mostly as an entry towards further and deeper references). The book is well written, in a lively and personal style. (I however object to the reduction of the notion of Markov chains to discrete state-spaces!) There is no requirement for a specific programming language, although R is introduced in a somewhat dismissive way (“R['s] most serious flaw is usually lack of robustness since some [packages] are not of high-quality”) and some exercises start with “Design and write either a C or a Fortran subroutine”. BUGS is not mentioned at all. The appendices of Computational Statistics also contain the solutions to some exercises, even though the level of detail is highly variable, from one word (“1”) to one page (see, e.g., Exercise 11.4). The 20-page list of references is preceded by a few pages on available journals and webpages, which could become obsolete rather quickly. Despite the reservations raised above about some parts of Computational Statistics that would benefit from a deeper coverage, I think this is a reference book that should appear in the shortlist of any computational statistics/statistical computing graduate course, as well as on the shelves of any researcher supporting his or her statistical practice with a significant dose of computing backup.