Archive for Hadoop

years (and years) of data science

Posted in Books, Statistics, Travel, University life on January 4, 2016 by xi'an

In preparation for the round table at the start of the MCMSkv conference this afternoon, Anto sent us a paper written by David Donoho for the Tukey Centennial workshop, held in Princeton last September. Entitled 50 Years of Data Science. And which attracted a whole round of comments, judging from the Google search results. So much so that I decided not to read any of them before going through the paper. Which means I am almost certainly reproducing here, with my own two cents, some of the previous comments.

“John Tukey’s definition of `Big Data’ was `anything that won’t fit on one device’.”

The complaint that data science is essentially statistics that does not dare to spell out its own name, as if statistics were a ten-letter word (p.5), is not new, if appropriate. In this paper, David Donoho dismisses the memes that supposedly separate data science from statistics, like “big data” (although I doubt non-statisticians would accept the quick rejection that easily, wondering at the ability of statisticians to develop big models), skills like parallel programming (which ineluctably leads to more rudimentary algorithms and inferential techniques), and jobs requiring such a vast array of skills and experience that no graduate student sounds properly trained for them…

“A call to action, from a statistician who feels `the train is leaving the station’.” (p.12)

One point of the paper is to see John Tukey’s 1962 “The Future of Data Analysis” as prophetic of the “Big Data” and “Data Science” crises. Which makes a lot of sense when considering the four driving forces advanced by Tukey (p.11):

  1. formal statistics
  2. advanced computing and graphical devices
  3. the ability to face ever-growing data flows
  4. its adoption by an ever-wider range of fields

“Science about data science will grow dramatically in significance.”

David Donoho then moves on to incorporate Leo Breiman’s 2001 Two Cultures paper. Which separates machine learning and prediction from statistics and inference, leading to the “big chasm”! And he sees the combination of prediction with the “common task framework” as the “secret sauce” of machine learning, because of the possibility of objective comparison of methods on a testing dataset. Which does not seem to me to be the explanation for the current (real or perceived) disaffection for statistics and the correlated attraction to more computer-related solutions. Code that wins a Kaggle challenge clearly has some efficient characteristics, but this tells me nothing of the abilities of the methodology behind that code. If any. Self-learning how to play chess within 72 hours is great, but is the principle behind it able to handle Go at the same level? Plus, I remain worried about the (screaming) absence of a model (or models) in predictive approaches. Or at least skeptical. For the same reason, it does not help in producing a generic approach to problems. Nor an approximation to the underlying mechanism. I thus see nothing but a black box in many “predictive models”, which tells me nothing about the uncertainty, imprecision or reproducibility of such tools. “Tool evaluation” cannot be reduced to a final score on a testing benchmark. The paper concludes with the prediction that the validation of scientific methodology will solely be empirical (p.37). This leaves little ground, if any, for probability and uncertainty quantification, as reflected by their absence from the paper.
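
For what it is worth, here is a minimal R sketch (mine, not the paper's) of what the common task framework boils down to: all competing methods are trained on the same data and ranked by a single score on the same held-out test set. The simulated data, the two toy “methods” (a linear fit versus a constant-mean baseline), and the RMSE score are purely illustrative choices.

    ## common task framework in miniature: one shared train/test split, one metric
    set.seed(1)
    n <- 200
    dat <- data.frame(x = runif(n))
    dat$y <- 1 + 2 * dat$x + rnorm(n, sd = 0.5)
    train <- sample(n, 150)

    fit      <- lm(y ~ x, data = dat[train, ])           # method A: linear fit
    pred_lm  <- predict(fit, newdata = dat[-train, ])
    pred_avg <- rep(mean(dat$y[train]), n - 150)          # method B: constant mean baseline

    ## both methods are judged only by their score on the common test set
    rmse <- function(truth, pred) sqrt(mean((truth - pred)^2))
    c(lm = rmse(dat$y[-train], pred_lm), baseline = rmse(dat$y[-train], pred_avg))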

Foundations of Statistical Algorithms [book review]

Posted in Books, Linux, R, Statistics, University life on February 28, 2014 by xi'an

There is computational statistics and there is statistical computing. And then there is statistical algorithmics. Not the same thing, by far. This 2014 book by Weihs, Mersmann and Ligges, from TU Dortmund, the latter also being a member of the R Core team, stands at one end of this wide spectrum of techniques required by modern statistical analysis. In short, it provides the necessary skills to construct statistical algorithms and hence to contribute to statistical computing. And I wish I had the luxury of teaching from Foundations of Statistical Algorithms to my graduate students, if only we could afford an extra yearly course…

“Our aim is to enable the reader (…) to quickly understand the main ideas of modern numerical algorithms [rather] than having to memorize the current, and soon to be outdated, set of popular algorithms from computational statistics.” (p.1)

The book is built around the above aim: first, presenting the reasons why computers can produce answers different from what we want, using least squares as a means to check for (in)stability; second, establishing the ground for Monte Carlo methods by discussing (pseudo-)random generation, including MCMC algorithms; third, moving to bootstrap and resampling techniques; and concluding with parallelisation and scalability. The text is highly structured, with frequent summaries, a division of chapters all the way down to sub-sub-sub-sections, an R implementation section in each chapter, and a few exercises.
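
As an aside, here is a small R illustration of my own (not taken from the book) of the kind of least-squares (in)stability check mentioned above: forming the normal equations squares the condition number of the design matrix, whereas the QR factorisation that lm() relies on works on the design matrix directly and loses far less accuracy. The monomial design and unit coefficients below are arbitrary choices.

    ## least squares on a nearly collinear monomial design, with an exact response
    n <- 100
    x <- seq(1, 2, length.out = n)
    X <- cbind(1, x, x^2, x^3, x^4, x^5)     # ill-conditioned design matrix
    beta <- rep(1, ncol(X))
    y <- X %*% beta                          # no noise, so any error is purely numerical

    kappa(crossprod(X), exact = TRUE)        # condition number of X'X: very large

    beta_normal <- solve(crossprod(X), crossprod(X, y))   # normal equations X'X b = X'y
    beta_qr     <- qr.solve(X, y)                         # QR factorisation, as in lm.fit()

    max(abs(beta_normal - beta))   # accuracy degraded by the squared condition number
    max(abs(beta_qr - beta))       # typically several orders of magnitude closer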