Archive for statistical computing

the most important statistical ideas of the past 50 years

Posted in Books, pictures, Statistics, Travel on January 10, 2020 by xi'an

A grand building entrance near the train station in Helsinki

Aki and Andrew are celebrating the New Year in advance by composing a list of the most important statistical ideas occurring (roughly) since they were born (or since Fisher died)! Like

  • substitution of computing for mathematical analysis (incl. bootstrap)
  • fitting a model with a large number of parameters, using some regularization procedure to get stable estimates and good predictions (e.g., Gaussian processes, neural networks, generative adversarial networks, variational autoencoders)
  • multilevel or hierarchical modelling (incl. Bayesian inference)
  • advances in statistical algorithms for efficient computing (with a long list of innovations since 1970, including ABC!), pointing out that a large fraction was of the divide & conquer flavour (in connection with large, if not necessarily Big, data)
  • statistical decision analysis (e.g., Bayesian optimization and reinforcement learning, getting beyond classical experimental design)
  • robustness (under partial specification, misspecification or in the M-open world)
  • EDA à la Tukey and statistical graphics (and R!)
  • causal inference (via counterfactuals)
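The first idea on the list, substituting computing for mathematical analysis, is epitomised by the bootstrap. A minimal sketch in Python (the exponential sample and the number of replicates are arbitrary choices, purely for illustration): estimating the standard error of a median, a quantity with no convenient closed form.

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.exponential(scale=2.0, size=100)  # toy sample, for illustration only

# Bootstrap: resample with replacement and recompute the statistic each time;
# the spread of the replicates substitutes for a mathematical derivation
# of the sampling distribution of the median
boot_medians = np.array([
    np.median(rng.choice(data, size=data.size, replace=True))
    for _ in range(2000)
])
print(f"median = {np.median(data):.3f}, "
      f"bootstrap SE = {boot_medians.std(ddof=1):.3f}")
```

The same template works for any statistic: only the `np.median` call changes.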

Now, had I been painfully arm-bent into coming up with such a list, it would have certainly been shorter, for lack of opinion about some of these directions (even the Biometrika deputy editorship has certainly helped in reassessing the popularity of different branches!), and I would have presumably been biased towards Bayes as well as more mathematical flavours. Hence objecting to the witty comment that “theoretical statistics is the theory of applied statistics” (p.10) and including Ghosal and van der Vaart (2017) as a major reference. Also bemoaning the lack of long-term structure and theoretical support of a branch of the machine-learning literature.

Maybe also more space and analysis could have been spent on “debates remain regarding appropriate use and interpretation of statistical methods” (p.11), in that a major difficulty with the latest in data science is not so much the method(s) as the data on which they are based, which, in a large fraction of cases, is not representative and is poorly, if at all, corrected for this bias. The “replication crisis” is thus only one (tiny) aspect of the challenge.

I am cold all over…

Posted in Books, Kids, Statistics, University life on October 29, 2014 by xi'an

unusual snowfall on Bois de Boulogne, March 12, 2013

An email from one of my Master students who sent his problem sheet (taken from Monte Carlo Statistical Methods) late:

Bonsoir Professeur
Je « suis » votre cours du mercredi dont le formalisme mathématique me fait froid partout
Avec beaucoup de difficulté je vous envoie mes exercices du premier chapitre de votre livre.

which translates as

Good evening Professor,
I “follow” your Wednesday class, whose mathematical formalism makes me cold all over. With much hardship, I send you the first batch of problems from your book.

I know that winter is coming, but, still, making students shudder from mathematical cold is not my primary goal when teaching Monte Carlo methods!

future of computational statistics

Posted in Books, pictures, R, Statistics, University life on September 29, 2014 by xi'an

I am currently preparing a survey paper on the present state of computational statistics, reflecting on the massive evolution of the field since my early Monte Carlo simulations on an Apple //e, which would take a few days to return a curve of approximate expected squared error losses… It seems to me that MCMC is attracting more attention nowadays than in the past decade, both because of methodological advances linked with better theoretical tools, as for instance in the handling of stochastic processes, and because of new forays in accelerated computing via parallel and cloud computing. The breadth and quality of talks at MCMski IV is testimony to this.

A second trend, not unrelated to the first one, is the development of new techniques, and the rehabilitation of older ones, to handle complex models by approximations, witness ABC, Expectation-Propagation, variational Bayes, &tc. With a corollary being a healthy questioning of the models themselves, as illustrated for instance in Chris Holmes’ talk last week. While those simplifications are inevitable when faced with hardly imaginable levels of complexity, I remain wary of the “inevitability” of turning statistics into an “optimize+penalize” tunnel vision…

A third characteristic is the emergence of new languages and meta-languages intended to handle complexity, both of problems and of solutions, towards a wider audience of users. STAN obviously comes to mind. And JAGS. But it may be that another scale of language is now required…
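For concreteness, the basic MCMC machinery that all these accelerations build upon fits in a dozen lines. A hedged sketch of a random-walk Metropolis sampler, with a standard normal target and a proposal scale of 2.5 picked arbitrarily for this toy case:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_target(x):
    return -0.5 * x**2  # standard normal density, up to an additive constant

n, scale = 10_000, 2.5
chain = np.empty(n)
chain[0] = 0.0
accepted = 0
for t in range(1, n):
    # Gaussian random-walk proposal around the current state
    prop = chain[t - 1] + scale * rng.standard_normal()
    # Metropolis acceptance step, on the log scale for numerical safety
    if np.log(rng.uniform()) < log_target(prop) - log_target(chain[t - 1]):
        chain[t] = prop
        accepted += 1
    else:
        chain[t] = chain[t - 1]

print(f"acceptance rate = {accepted / n:.2f}, "
      f"mean = {chain[n//2:].mean():.2f}, sd = {chain[n//2:].std():.2f}")
```

The parallel and GPU accelerations mentioned above typically attack the loop (many chains at once) or the target evaluation (subsampled or distributed data), not this core logic.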

If you have any suggestions of novel directions in computational statistics, or, conversely, of dead ends, I would be most interested in hearing them! So please do comment or send emails to my gmail address bayesianstatistics

Foundations of Statistical Algorithms [book review]

Posted in Books, Linux, R, Statistics, University life on February 28, 2014 by xi'an

There is computational statistics and there is statistical computing. And then there is statistical algorithmics. Not the same thing, by far. This 2014 book by Weihs, Mersmann and Ligges, from TU Dortmund, the latter being also a member of the R Core team, stands at one end of this wide spectrum of techniques required by modern statistical analysis. In short, it provides the necessary skills to construct statistical algorithms and hence to contribute to statistical computing. And I wish I had the luxury to teach from Foundations of Statistical Algorithms to my graduate students, if only we could afford an extra yearly course…

“Our aim is to enable the reader (…) to quickly understand the main ideas of modern numerical algorithms [rather] than having to memorize the current, and soon to be outdated, set of popular algorithms from computational statistics.” (p.1)

The book is built around the above aim, first presenting the reasons why computers can produce answers different from what we want, using least squares as a means to check for (in)stability; then, second, establishing the ground for Monte Carlo methods by discussing (pseudo-)random generation, including MCMC algorithms; before moving, third, to bootstrap and resampling techniques; and concluding with parallelisation and scalability. The text is highly structured, with frequent summaries, a division of chapters all the way down to sub-sub-sub-sections, an R implementation section in each chapter, and a few exercises.
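The least-squares-as-stability-probe idea is easy to reproduce. A sketch under arbitrary choices (a degree-11 polynomial design on 50 points): on such an ill-conditioned matrix, solving the normal equations, which squares the condition number, degrades visibly compared with an orthogonalisation-based solver.

```python
import numpy as np

x = np.linspace(0.0, 1.0, 50)
A = np.vander(x, 12, increasing=True)  # columns 1, x, ..., x^11: nearly collinear
beta_true = np.ones(12)
y = A @ beta_true  # noise-free response, so the exact solution is known

# Normal equations: cond(A'A) = cond(A)^2, so accuracy suffers
beta_ne = np.linalg.solve(A.T @ A, A.T @ y)
# SVD-based least squares operates on A directly
beta_ls, *_ = np.linalg.lstsq(A, y, rcond=None)

print(f"cond(A) = {np.linalg.cond(A):.1e}")
print(f"normal-equations error = {np.linalg.norm(beta_ne - beta_true):.1e}, "
      f"lstsq error = {np.linalg.norm(beta_ls - beta_true):.1e}")
```

The same experiment, run across increasing polynomial degrees, is a classic classroom demonstration of why the form of the estimator matters computationally even when the two forms are algebraically identical.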

Computational Statistics

Posted in Books, R, Statistics on May 10, 2010 by xi'an

Do not resort to Monte Carlo methods unnecessarily.

When I received this 2009 Springer-Verlag book, Computational Statistics, by James Gentle a while ago, I briefly took a look at the table of contents and decided to have a better look later… Now that I have gone through the whole book, I can write a short review on its scope and contents (to be submitted). Despite its title, the book aims at covering both computational statistics and statistical computing. (With 752 pages at his disposal, Gentle can afford to do both indeed!)

The book Computational Statistics is separated into four parts:

  • Part I: Mathematical and statistical preliminaries.
  • Part II: Statistical Computing (Computer storage and arithmetic; Algorithms and programming; Approximation of functions and numerical quadrature; Numerical linear algebra; Solution of nonlinear equations and optimization; Generation of random numbers)
  • Part III: Methods of Computational Statistics (Graphical methods in computational statistics; Tools for identification of structure in data; Estimation of functions; Monte Carlo methods for statistical inference; Data randomization, partitioning, and augmentation; Bootstrap methods)
  • Part IV: Exploring Data Density and Relationship (Estimation of probability density functions using parametric models; Nonparametric estimation of probability density functions; Statistical learning and data mining; Statistical models of dependencies)

Computational inference, together with exact inference and asymptotic inference, is an important component of statistical methods.

The first part of Computational Statistics is indeed a preliminary containing essentials of mathematics, probability, and statistics. A reader unfamiliar with too many topics within this chapter should first consider improving his or her background in the corresponding area! This is a rather large chapter, with 82 pages, and it is unlikely to be of much use to readers, except to signal deficiencies in their background, as noted above. Given this purpose, I am not certain the selected exercises of this chapter are necessary (especially when considering that some involve tools introduced much later in the book).

The form of a mathematical expression and the way the expression should be evaluated in actual practice may be quite different.

The second part of Computational Statistics is truly about computing, meaning the theory of computation, i.e. of using computers for numerical approximation, with discussions about the representation of numbers in computers, approximation errors, and of course random number generators. While I judge George Fishman’s Monte Carlo to provide a deeper and more complete coverage of those topics, I appreciate the need for reminding our students of those hardware subtleties, as they often seem unaware of them despite their advanced computer skills. This second part is thus a crash course of 250 pages on numerical methods (like function approximations by basis functions and …) and on random generators, i.e. covering the same ground as Gentle’s earlier books, Random Number Generation and Monte Carlo Methods and Numerical Linear Algebra for Applications in Statistics, while the more recent Elements of Computational Statistics looks very much like a shorter entry on the same topics as those of Parts III and IV of Computational Statistics.

This part could certainly sustain a whole-semester undergraduate course, while only advanced graduate students could be expected to gain from a self-study of those topics. It is nonetheless the most coherent and attractive part of the book, and constitutes a must-read for all students and researchers engaging into any kind of serious programming. Obviously, some notions are introduced a bit too superficially, given the scope of this section (as for instance Monte Carlo methods, in particular MCMC techniques, which are introduced in less than six pages), but I came to realise this is the point of the book, which provides an entry into “all” necessary topics, along with links to the relevant literature (if missing Monte Carlo Statistical Methods!). I however deplore that the important issue of Monte Carlo experiments, whose construction is often a hardship for students, is postponed till the 100-page-long appendix. (That students do not read appendices is, I suspect, another of those folk theorems!)
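Gentle’s point about expression form versus evaluation deserves a two-line demonstration. The textbook shortcut Σx² − n·x̄², although algebraically identical to the two-pass sample variance, collapses under catastrophic cancellation once the data sit far from zero (the offset of 10⁸ below is an arbitrary choice to trigger the effect):

```python
import numpy as np

# Four points with true sample variance 5/3, shifted far from the origin
x = 1e8 + np.array([1.0, 2.0, 3.0, 4.0])

# Shortcut formula: mathematically equal to the variance, but it subtracts
# two nearly equal 10^16-sized quantities, destroying all significant digits
naive = (np.sum(x**2) - x.size * x.mean()**2) / (x.size - 1)

# Two-pass formula: subtract the mean first, then square the small deviations
two_pass = np.sum((x - x.mean())**2) / (x.size - 1)

print(naive, two_pass)  # two_pass is the correct 5/3 ≈ 1.6667
```

In double precision the shortcut version is off by an amount comparable to the answer itself, exactly the kind of hardware subtlety the book wants students to anticipate.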

Monte Carlo methods differ from other methods of numerical analysis in yielding an estimate rather than an approximation.
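The estimate-versus-approximation distinction in that quote can be made concrete on a toy integral, ∫₀¹ x² dx = 1/3 (integrand, grid size, and sample size all arbitrary): a trapezoidal rule returns a deterministic approximation with a fixed error, while Monte Carlo returns a random estimate that comes with its own standard error.

```python
import numpy as np

f = lambda x: x**2  # integral over [0, 1] is exactly 1/3

# Deterministic approximation: composite trapezoidal rule, error O(h^2)
grid = np.linspace(0.0, 1.0, 101)
vals = f(grid)
trap = ((vals[:-1] + vals[1:]) / 2 * np.diff(grid)).sum()

# Monte Carlo: a statistical estimate, reported with a standard error
rng = np.random.default_rng(7)
u = f(rng.uniform(size=100_000))
mc, mc_se = u.mean(), u.std(ddof=1) / np.sqrt(u.size)

print(f"trapezoid = {trap:.6f}, MC = {mc:.4f} ± {mc_se:.4f}")
```

Rerunning the first half always gives the same number; rerunning the second half with a new seed gives a different draw from the estimator’s sampling distribution, which is Gentle’s point.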

The third and fourth parts of the book cover methods of computational statistics, including Monte Carlo methods, randomization and cross-validation, the bootstrap, probability density estimation, and statistical learning. Unfortunately, I find the level of Part III to be quite uneven, where all chapters are short and rather superficial because they try to be all-encompassing. (For instance, Chapter 8 includes two pages on the RGB colour coding.) Part IV does a better job of presenting machine learning techniques, if not with the thoroughness of Hastie et al.’s The Elements of Statistical Learning: Data Mining, Inference, and Prediction… It seems to me that the relevant sections of Part III would have fitted better where they belong in Part IV. For instance, Chapter 10 on estimation of functions only covers the evaluation of estimators of functions, postponing the construction of those estimators till Chapter 15. The jackknife is introduced on its own in Chapter 12 (not that I find this introduction ultimately necessary), while the bootstrap is covered in eight pages in Chapter 13 (the bootstrap for non-iid data being dismissed rather quickly, given the current research in the area). The first chapter of Part IV covers some (non-Bayesian) estimation approaches for parametric families, but I find this chapter somehow superfluous as it does not belong to the computational statistics domain (except as an approximation method, as stressed in Section 14.4). While Chapter 16 is a valuable entry on clustering and data-analysis tools like PCA, the final section on high dimensions feels out of context (and mentioning the curse of dimensionality only that close to the end of the book does not seem appropriate). Chapter 17 on dependent data is missing the rich literature on graphical models and their use in the determination of dependence structures.

Programming is the best way to learn programming (read that again).

In conclusion, Computational Statistics is a very diverse book that can be used at several levels as a textbook, as well as a reference for researchers (even if as an entry towards further and deeper references). The book is well-written, in a lively and personal style. (I however object to the reduction of the notion of Markov chains to discrete state-spaces!) There is no requirement for a specific programming language, although R is introduced in a somewhat dismissive way (“R[’s] most serious flaw is usually lack of robustness since some [packages] are not of high quality”) and some exercises start with “Design and write either a C or a Fortran subroutine”. BUGS is not mentioned at all. The appendices of Computational Statistics also contain the solutions to some exercises, even though the level of detail is highly variable, from one word (“1”) to one page (see, e.g., Exercise 11.4). The 20-page list of references is preceded by a few pages on available journals and webpages, which could get obsolete rather quickly. Despite the reservations raised above about some parts of Computational Statistics that would benefit from a deeper coverage, I think this is a reference book that should appear in the shortlist of any computational statistics/statistical computing graduate course as well as on the shelves of any researcher supporting his or her statistical practice with a significant dose of computing backup.