(more) years of data science
Here is David Draper’s discussion on David Donoho’s 50 Years of Data Science:
This was a good choice for a jumping-off point for a round-table discussion on the Future of Data Science; David Donoho hits a number of cogent nails on the head, and also leaves room for other perspectives (if all the round-table participants had written their own versions of Donoho’s paper, the 9 different experiential paths the participants have taken would have resulted in 9 quite different versions). The same issue applies to Donoho’s paper: he’s superb on things in his experiential path about which he’s thought carefully, but (like all of us) he has experientially-driven blind spots.
I write from the point of view of an academic statistician — working in all three of theory, methodology, applications — with a total of about 3 years of industrial experience in Data Science at research labs in eBay and Amazon; I’ve also talked at length with statisticians at Google and Facebook, and I’ve given Data Science seminars at all four companies. To date I’ve worked on the following Data Science problems:
• optimal design and analysis of A/B tests (randomized controlled experiments) with 10–100 million subjects in each of the treatment and control groups;
• optimal design and analysis of observational studies (because randomized experiments are not always possible), again with 10–100 million subjects in each arm of the study;
• one-step-ahead forecasts of 1–100 million (related) non-stationary time series, for the purpose of anomaly detection; and
• multi-step-ahead forecasts of 30 million (related) non-stationary time series, to support optimal inventory control decisions.
My blind spots (at least the ones I know about) include (a) less familiarity with machine learning and econometrics than I would like and (b) no personal experience with what below is called 2016-style High-Performance Computing (although two of my Ph.D. students are currently dragging me into the 21st century). My comments are aimed primarily at statisticians who want to become Data Scientists.
1.1 Donoho is certainly right when he makes the following points:
• Data Science is currently sexy, and this is likely to remain true for some time. My personal definition of a Good Data Scientist looks like this:
Good Data Scientist: A methodological and applied statistician who
(a) completely understands both the strengths and weaknesses of both the statistics and machine-learning paradigms;
(b) completely understands both the strengths and weaknesses of both the Bayesian and frequentist statistical paradigms;
(c) has had wide experience in All 4 of the Things Statisticians Do:
∗ description of an existing data set D;
∗ inference, both about the process leading to D and (causally) counterfactual data sets arising from alternative processes, in both experimental and observational settings;
∗ prediction of new data sets D∗, together with well-calibrated uncertainty assessments for these predictions; and
∗ decision-making under uncertainty; and
(d) knows how to data-wrangle and efficiently code at Big-Data/Big-Models scale.
(Under this definition, Good Data Scientists are not thick on the ground.) A more cynical definition of Data Science might look like this:
Data Science: A two-word phrase that, when introduced into the title of Your grant proposal and liberally sprinkled about in its text, increases Your chances of funding.
Substitute the phrase Big Data for Data Science to get another currently correct definition.
• Most academic statistics Departments currently do not train Data Scientists in anything like the full range of necessary skills. See the definition of Good Data Scientist above. At present, Good Academic Statistics Departments
– thoroughly teach the theory and methodology of statistical inference from samples to populations, in one or the other (but generally not both) of the Bayesian and frequentist paradigms;
– briefly discuss description, prediction, causal inference, and real-world decision-making (here I’m talking about working with non-statistician colleagues to construct real-world-relevant action spaces and utility functions, not (e.g.) treating point estimation as a fake decision problem);
– train their graduate students in the analysis of small data sets, typically in R, by writing bespoke R code and calling routines from the CRAN library; and
– encourage their students to learn how to code in a language such as C++, for those occasions when R code runs too slowly.
At present, Really Good Academic Statistics Departments do all of the above and also
– thoroughly teach, from both the Bayesian and frequentist perspectives, the theory and methodology supporting All 4 of the Things Statisticians Do;
– offer mandatory courses in statistical consulting, so that students can learn problem formulation by watching good statistical consultants in action;
– briefly mention that there’s a discipline called Machine Learning, and briefly cover a few of its current highlights (e.g., Support Vector Machines, neural networks, …);
– perhaps offer an “introduction to High-Performance Computing (HPC)” based on parallel-processing technology — such as MPI — that was cutting-edge in the 1990s.
(Under this definition, Really Good Academic Statistics Departments are also not thick on the ground.) In my experience it’s currently rare to find academic statistics Departments that properly prepare statisticians for 2016-style HPC:
– Hardware: Multiple cores (possibly massively so), typically with hyper-threading; CPUs versus GPUs; how much RAM? How is RAM shared between the cores? How big is the cache available to each CPU/GPU? How much faster is the cache than the rest of RAM? How much slower is writing to disk than accessing RAM? …
– Software: What distributed-computing software environment is available? Hadoop Distributed File System + MapReduce? Apache Spark? Interrupt-driven (synchronous/sequential), or actor-model (asynchronous), or functional, or object-oriented programming, or a mix? Python, Java (Java Virtual Machine (JVM)), Perl, Lisp, Haskell, Scala, Akka, …?
– Hardware + Software: What’s the best set of hardware and software tools to solve this problem? How about that one? More generally, what would an accurate partition of Data Science problem space look like, when the partition is organized by optimal use of hardware and software?
• The goal is not Statistics versus Machine Learning, it’s Statistics + Machine Learning + Econometrics + …. In every industrial research lab at which I’ve given talks about statistics and Data Science, I’ve included a slide that proselytizes for a world in which all such labs need statisticians, machine learners, econometricians, … all working side by side, because each Data Science specialization area brings a different set of skills to the mix.
• The idea of Reproducible Research is key to accelerated progress in Science, not just Data Science. This is an extremely important point; kudos to Donoho for treating it at such length and with such care. Here’s a caricature of a bad Data-Science analysis:
(1) I open an edit window and an R (not RStudio) execution window on my laptop.
(2) I write some complicated bespoke code in the edit window, copying and pasting bits of it into the R window until it appears to run correctly; I don’t comment my code because “it’s just me.” I use my code to create some PDF files containing interesting graphs.
(3) I open a third window and type up my results in LaTeX, bringing in the PDF graph files with \includegraphics and laboriously creating tables by hand, into which I type (or paste) results from the R code. I submit the resulting manuscript for review by a journal or by my colleagues at my company. I do not create a Wiki documenting my work or post my LaTeX and R code on GitHub, again because “it’s just me.”
(4) If it’s a journal publication or my company colleagues are slow, I get constructive criticism weeks or months later, requiring new analyses. I go back to my R code and find that I cannot understand it, because I didn’t comment it the first time around. I repeat steps (1)–(3), with difficulty and much grumbling, as often as needed until my work is accepted or I give up.
(This caricature is embarrassingly close to how I conducted research until students and colleagues dragged me into the worlds of RStudio, Wikis and GitHub.) Now imagine instead a world in which everybody routinely uses tools such as knitr to integrate plain text, formulas, tables, static and dynamic graphs and executable code (with data) into two products:
(i) a static journal article or company white paper, and
(ii) a Wiki linked to (i), available to everybody (journal article) or everybody with need-to-know (company white paper), so that Your analyses and results are completely reproducible.
As Donoho points out, this would make meta-research — studies harvesting patterns across studies — much simpler; I agree with him that this offers the potential for a significant leap forward in Science in general and Data Science in particular.
1.2 Donoho’s perspective is incomplete or outdated along the following dimensions:
• The death of Moore’s Law. In 1965 Gordon Moore, the director of research and development at Fairchild Semiconductor, was asked to forecast the future course of miniaturization in the semiconductor industry. He predicted a linear relationship between log (transistor count per 0.25 inch × 0.25 inch CPU, a measure of calculational through-put per square area) and time, with a doubling of through-put every two years for at least the next 10 years; this is Moore’s Law. It proved remarkably accurate from 1971 to 2011, as the transistor count went from 2.3 · 10³ to 2.6 · 10⁹. However, by 2015 the CEO of Intel revised the doubling time to 2.5 years and suggested that this would only be accurate through 2017, after which further slowing with current technology is inevitable: at present we don’t know how to make transistors any smaller than 14 · 10⁻⁹ meters (14 nanometres (nm)) at commercial scale, and it’s speculated that an as-yet-undreamed-of technology would be needed to go below 5 nm.
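As a quick check on the arithmetic behind this claim, the sketch below computes the average doubling time implied by the two transistor counts quoted above (about 2,300 in 1971 and about 2.6 billion in 2011). It assumes smooth exponential growth between those two endpoints; the function name `doubling_time` is introduced here purely for illustration.

```python
import math

# Transistor counts quoted in the text for 1971 and 2011.
count_1971 = 2.3e3
count_2011 = 2.6e9
years = 2011 - 1971


def doubling_time(start_count, end_count, elapsed_years):
    """Average doubling time implied by exponential growth between two counts."""
    doublings = math.log2(end_count / start_count)
    return elapsed_years / doublings


implied = doubling_time(count_1971, count_2011, years)
print(f"number of doublings: {math.log2(count_2011 / count_1971):.1f}")
print(f"implied doubling time: {implied:.2f} years")
```

The result is roughly 20 doublings in 40 years, that is, a doubling time very close to the two years of Moore’s Law, which is consistent with the description of the law as remarkably accurate over that period.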
• Why the death of Moore’s Law is important. Although Markov Chain Monte Carlo (MCMC) methods (the current Bayesian computational algorithmic workhorse) date to the mid-1940s, statisticians didn’t stumble onto them until about 1990, a point in computing-hardware history when 5,000 sequential (synchronous) MCMC iterations in a simple hierarchical model necessitated an overnight run on Your single-CPU desktop machine. For the past 25 years Bayesian statisticians have been counting on Moore’s Law to bail them out when they wanted to fit more complex models to bigger data sets, and it’s true that Your desktop can now give You 100,000 synchronous MCMC iterations in the same simple hierarchical model with today’s single CPU in a few seconds. But the fatal slowing-down of Moore’s Law means that sequential computation will soon no longer be good enough for small-to-moderate-scale Bayesian calculations (and is already not good enough for such calculations at large scale); the current computing world is increasingly asynchronous, multi-core and distributed. Teaching statistics graduate students how to write single-thread R code as the capstone of their computing training is already obsolete, and will soon become laughably so; giving them facility with foreach and %dopar% on a machine with 4 cores and 8 threads is a small step forward, but will not properly prepare them for the current Data Science computing environment.
If statisticians want to become Good Data Scientists, they have to begin paying closer attention to all properties/capabilities of the available hardware (cores, threads, CPU/GPU, RAM, cache, disk, …) and how the available software (MPI, Hadoop, MapReduce, Scala, Akka, OpenCL, …) makes use of those capabilities. Machine-learners are miles ahead of statisticians along this dimension of knowledge; statisticians need to catch up fast or incur substantial risk of being left behind. To the argument “But universities don’t have Hadoop clusters” the obvious response is “Yes, they do; they just have to rent time in a cloud computing environment, or (better yet) get the company that owns the cloud to donate time to them as a tax write-off.”
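To make the "small step forward" of multi-core computing concrete, here is a minimal Python sketch that is roughly analogous to foreach with %dopar% in R: several independent random-walk Metropolis chains are run in separate processes on one machine. Everything in it is an assumption chosen for illustration (the toy data, the Normal model with known unit variance and a flat prior, the tuning constants and the function names); it is not drawn from the discussion itself, and it does not address the distributed, asynchronous settings described above.

```python
import math
import random
from concurrent.futures import ProcessPoolExecutor

# Hypothetical toy data; assumed model: y_i ~ Normal(mu, 1), flat prior on mu.
DATA = [1.2, 0.7, 1.9, 1.4, 0.8, 1.1, 1.6, 0.9]


def log_posterior(mu):
    """Log posterior of mu up to an additive constant."""
    return -0.5 * sum((y - mu) ** 2 for y in DATA)


def run_chain(seed, n_iter=5000, step=0.5):
    """One sequential random-walk Metropolis chain; returns its estimate of E[mu | data]."""
    rng = random.Random(seed)
    mu = 0.0
    current = log_posterior(mu)
    total = 0.0
    for _ in range(n_iter):
        proposal = mu + rng.gauss(0.0, step)
        candidate = log_posterior(proposal)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            mu, current = proposal, candidate
        total += mu
    return total / n_iter


def run_chains_in_parallel(seeds, workers=4):
    """Run one independent chain per seed, each in its own worker process."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_chain, seeds))


if __name__ == "__main__":
    estimates = run_chains_in_parallel([1, 2, 3, 4])
    print(estimates)
```

Each chain here is still sequential internally; only the chains run side by side. That is exactly why this kind of parallelism is a modest gain: it multiplies the number of chains per unit of wall-clock time but does not speed up any single chain on a large model or data set.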
• Donoho’s emphasis (his Section 6.3) on the Predictive Modelling culture, while necessary, is not sufficient for Good Data Science. It’s certainly true that, from a Data Science perspective, statisticians currently spend too much effort on inference and not enough on prediction. But
– Good predictive modelling requires a search for methods that produce not only good point predictions but also good predictive calibration: (point) predictions without well-calibrated interval uncertainty assessments cannot serve as the basis of good scientific or business decisions; and
– While I’m on the topic of decision — as noted under the definition of a Good Data Scientist — Donoho sharply understates the importance of
∗ valid causal inference and
∗ good decision-making under uncertainty in Data Science.
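The calibration point above can be illustrated with a small simulation. In this sketch two forecasters issue the same point prediction (zero) but different nominal 90% predictive intervals; only the one whose assumed spread matches the data-generating process achieves close to 90% empirical coverage. The simulated outcomes, the standard deviations and the function name `empirical_coverage` are all hypothetical choices made for illustration.

```python
import random


def empirical_coverage(intervals, outcomes):
    """Fraction of outcomes that fall inside their (lower, upper) predictive interval."""
    hits = sum(lo <= y <= hi for (lo, hi), y in zip(intervals, outcomes))
    return hits / len(outcomes)


# Hypothetical setup: outcomes are Normal(0, sd = 2), but one forecaster assumes sd = 1.
rng = random.Random(0)
true_sd, assumed_sd = 2.0, 1.0
z90 = 1.645  # standard Normal quantile for a central 90% interval
outcomes = [rng.gauss(0.0, true_sd) for _ in range(10_000)]

# Same point prediction (0) for both forecasters; only the interval widths differ.
overconfident = [(-z90 * assumed_sd, z90 * assumed_sd)] * len(outcomes)
calibrated = [(-z90 * true_sd, z90 * true_sd)] * len(outcomes)

print("overconfident coverage:", empirical_coverage(overconfident, outcomes))
print("calibrated coverage:   ", empirical_coverage(calibrated, outcomes))
```

The two forecasters are indistinguishable on point-prediction accuracy, yet a decision that relies on the overconfident intervals would badly understate the risk, which is the sense in which point predictions alone cannot support good decisions.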
• Donoho places too many eggs in the Common Task Framework (CTF) basket. I agree wholeheartedly with the idea that there is substantial value in documenting the successes and failures of Data Science methodologies in actual applications; the discipline of statistics has been burdened with too many asymptotic theorems, of dubious value in assessing the performance of a method in the actual problem You’re working on, with the finite sample size You actually have. And I accept Donoho’s summary (in his Section 6.2) of CTF successes with canned datasets. My difficulty with this as a proposed Data Science panacea is that, in the large-scale problems on which I’ve worked, too much important information about the problem context is lost in translation from the Problem to the Canned Dataset.
As a related and final point, Breiman (2001), discussed in Donoho’s Section 5, is certainly correct in identifying two different Cultures (Generative Modelling and Predictive Modelling) and in noting that academic statistics Departments don’t place enough emphasis on the latter. However, this does not of course mean that we should switch the weights on the two Cultures from Breiman’s estimate of (0.98, 0.02) to (0, 1.0).
It’s an inescapable fact that a Predictive-Modelling black box (such as a deep-learning neural network) works precisely when, and only when, the new instance to be predicted is similar to (exchangeable with) previously observed instances, conditional on relevant covariates (features). The Generative-Modelling outlook, driven by Problem Context, is extremely helpful in forming valid conditional exchangeability judgements that lead to good Predictive-Modelling results; I propose a Data Science world in which the weights are (0.5, 0.5).