Archive for data science

truth or truthiness [book review]

Posted in Books, Kids, pictures, Statistics, University life on March 21, 2017 by xi'an

This 2016 book by Howard Wainer has been sitting (!) on my desk for quite a while and it took a long visit to Warwick to find a free spot to quickly read it and write my impressions. The subtitle is, as shown on the picture, “Distinguishing fact from fiction by learning to think like a data scientist”. With all due respect to the book, which illustrates quite pleasantly the dangers of (pseudo-)data mis-, over- (or even under-)interpretation, and to the author, who has repeatedly emphasised those points in his books and opinion columns, including those in CHANCE, I do not think the book teaches how to think like a data scientist, in that an arbitrary neophyte reader would not manage to handle a realistic data-centric situation without deeper training. But this collection of essays, some of which first appeared as opinion columns, makes for a nice read nonetheless.

I presume that in this post-truth and alternative-facts [dark] era, the notion of truthiness is familiar to most readers! It is often based on a misunderstanding or a misappropriation of data leading to dubious and unfounded conclusions. The book runs through dozens of examples (some of them quite short and mostly appealing to common sense) to show how this happens and, to some extent, how this can be countered, if not avoided, as people will always try to bend the data, willingly or not, to their conclusions.

There are several parts and several themes in Truth or Truthiness, with different degrees of depth and novelty. The more involved part is in my opinion the one about causality, with illustrations in educational testing, psychology, and medical trials. (The illustration about fracking and the resulting impact on Oklahoma earthquakes should not be in the book, except that there exist officials publicly denying the facts. The same remark applies to the testing cheat controversy, which would be laughable had someone not ended up the victim!) The section on graphical representation and data communication is less exciting, presumably because it comes after Tufte’s books and message. I also feel the 1854 cholera map of John Snow is somewhat over-exploited, since he only drew the map after the epidemic declined. The final chapter, Don’t Try this at Home, is quite anecdotal, and at the same time this may be the whole point, namely that in mundane questions thinking like a data scientist is feasible and leads to sometimes surprising conclusions!

“In the past a theory could get by on its beauty; in the modern world, a successful theory has to work for a living.” (p.40)

The book reads quite nicely, as a whole and as a collection of pieces, from which class and talk illustrations can be borrowed. I like the “learned” tone of it, with plenty of citations and witticisms, some in Latin, Yiddish and even French. (Even though the latter is somewhat inaccurate! Si ça avait pu se produire, ça avait dû se produire [if it could have happened, it must have happened] (p.152) would have sounded more vernacular in my Gallic opinion!) I thus enjoyed unreservedly Truth or Truthiness, for its rich style and critical message, all the more needed in the current times, and far from comparing it with a bag of potato chips as Andrew Gelman did, I would like to stress its classical tone, in the sense of being immersed in a broad and deep culture that seems to be receding fast.

(more) years of data science

Posted in Mountains, Statistics, University life on January 4, 2016 by xi'an

Here is David Draper’s discussion on David Donoho’s 50 Years of Data Science:

This was a good choice for a jumping-off point for a round-table discussion on the Future of Data Science; David Donoho hits a number of cogent nails on the head, and also leaves room for other perspectives (if all the round-table participants had written their own versions of Donoho’s paper, the 9 different experimental paths the participants have taken would have resulted in 9 quite different versions). The same issue applies to Donoho’s paper: he’s superb on things in his experimental path about which he’s thought carefully, but (like all of us) he has experientially-driven blind spots.

I write from the point of view of an academic statistician — working in all three of theory, methodology, applications — with a total of about 3 years of industrial experience in Data Science at research labs in eBay and Amazon; I’ve also talked at length with statisticians at Google and Facebook, and I’ve given Data Science seminars at all four companies. To date I’ve worked on the following Data Science problems:
• optimal design and analysis of A/B tests (randomized controlled experiments) with 10–100 million subjects in each of the treatment and control groups;
• optimal design and analysis of observational studies (because randomized experiments are not always possible), again with 10–100 million subjects in each arm of the study;
• one-step-ahead forecasts of 1–100 million (related) non-stationary time series, for the purpose of anomaly detection; and
• multi-step-ahead forecasts of 30 million (related) non-stationary time series, to support optimal inventory control decisions.
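The first bullet can be sketched with a minimal example. The numbers below are invented for illustration and this is the textbook pooled two-proportion z-test, not Draper's actual methodology:

```python
from math import sqrt

def ab_test_z(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z-test for a randomized A/B experiment.

    conv_*: conversion counts, n_*: subjects per arm.
    |z| > 1.96 is significant at the 5% level (two-sided).
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)          # common rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se

# 50 million subjects per arm: a 0.3% relative lift is already detectable
z = ab_test_z(1_000_000, 50_000_000, 1_003_000, 50_000_000)
```

At 10–100 million subjects per arm, even minute lifts clear the significance bar, which is one reason the *design* questions Draper lists (what to randomise, when randomisation is impossible) dominate the analysis itself.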

My blind spots (at least the ones I know about) include (a) less familiarity with machine learning and econometrics than I would like and (b) no personal experience with what below is called 2016-style High-Performance Computing (although two of my Ph.D. students are currently dragging me into the 21st century). My comments are aimed primarily at statisticians who want to become Data Scientists. Continue reading

years (and years) of data science

Posted in Books, Statistics, Travel, University life on January 4, 2016 by xi'an

In preparation for the round table at the start of the MCMSkv conference, this afternoon, Anto sent us a paper written by David Donoho for the Tukey Centennial workshop, held in Princeton last September. Entitled 50 years of Data Science. And which attracted a whole round of comments, judging from the Google search results. So much so that I decided not to read any of them before parsing through the paper, although I am almost certainly reproducing here, with my two cents, some of the earlier comments.

“John Tukey’s definition of `Big Data’ was `anything that won’t fit on one device’.”

The complaint that data science is essentially statistics that does not dare to spell out statistics as if it were a ten letter word (p.5) is not new, if appropriate. In this paper, David Donoho evacuates the memes that supposedly separate data science from statistics, like “big data” (although I doubt non-statisticians would accept the quick rejection that easily, wondering at the ability of statisticians to develop big models), skills like parallel programming (which ineluctably leads to more rudimentary algorithms and inferential techniques), jobs requiring such a vast array of skills and experience that no graduate student sounds properly trained for it…

“A call to action, from a statistician who feels `the train is leaving the station’.” (p.12)

One point of the paper is to read John Tukey’s 1962 “The Future of Data Analysis” as prophetic of the “Big Data” and “Data Science” crises. Which makes a lot of sense when considering the four driving forces advanced by Tukey (p.11):

  1. formal statistics
  2. advanced computing and graphical devices
  3. the ability to face ever-growing data flows
  4. its adoption by an ever-wider range of fields

“Science about data science will grow dramatically in significance.”

David Donoho then moves on to incorporate Leo Breiman’s 2001 Two Cultures paper, which separates machine learning and prediction from statistics and inference, leading to the “big chasm”! And he sees the combination of prediction with the “common task framework” as the “secret sauce” of machine learning, because of the possibility of objective comparison of methods on a testing dataset. Which does not seem to me to be the explanation for the current (real or perceived) disaffection for statistics and the correlated attraction for more computer-related solutions. A code that wins a Kaggle challenge clearly has some efficient characteristics, but this tells me nothing of the abilities of the methodology behind that code. If any. Self-learning how to play chess within 72 hours is great, but is the principle behind it able to handle Go at the same level? Plus, I remain worried about, or at least skeptical of, the (screaming) absence of a model (or models) in predictive approaches. For the same reason it does not help in producing a generic approach to problems, nor an approximation to the underlying mechanism. I thus see nothing but a black box in many “predictive models”, which tells me nothing about the uncertainty, imprecision or reproducibility of such tools. “Tool evaluation” cannot be reduced to a final score on a testing benchmark. The paper concludes with the prediction that the validation of scientific methodology will solely be empirical (p.37). This leaves little ground, if any, for probability and uncertainty quantification, as reflected by their absence in the paper.
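Donoho's “common task framework” amounts to: fix a train/test split and a metric, and let any method compete on the held-out score. A toy sketch, with invented data and two invented competitors:

```python
import random

# A toy "common task": one fixed train/test split, one fixed metric (MSE),
# and any method may compete on the held-out score.
random.seed(0)
data = [(x, 2.0 * x + random.gauss(0, 1)) for x in range(200)]
train, test = data[:150], data[150:]

def fit_mean(train):
    """Baseline: always predict the training-set mean of y."""
    m = sum(y for _, y in train) / len(train)
    return lambda x: m

def fit_ols(train):
    """Least-squares line through the training data."""
    n = len(train)
    mx = sum(x for x, _ in train) / n
    my = sum(y for _, y in train) / n
    b = (sum((x - mx) * (y - my) for x, y in train)
         / sum((x - mx) ** 2 for x, _ in train))
    a = my - b * mx
    return lambda x: a + b * x

def mse(model, test):
    return sum((model(x) - y) ** 2 for x, y in test) / len(test)

scores = {name: mse(fit(train), test)
          for name, fit in [("mean", fit_mean), ("ols", fit_ols)]}
```

The leaderboard ranks the two methods unambiguously, yet, as the critique above notes, the winning score by itself says nothing about what either method reveals of the underlying mechanism.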

methods for quantifying conflict casualties in Syria

Posted in Books, Statistics, University life on November 3, 2014 by xi'an

On Monday November 17, 11am, Amphi 10, Université Paris-Dauphine, Rebecca Steorts from CMU will give a talk at the GT Statistique et imagerie seminar:

Information about social entities is often spread across multiple large databases, each degraded by noise, and without unique identifiers shared across databases. Entity resolution—reconstructing the actual entities and their attributes—is essential to using big data and is challenging not only for inference but also for computation.

In this talk, I motivate entity resolution by the current conflict in Syria. The conflict has been tremendously well documented; however, we still do not know how many people have been killed by conflict-related violence. We describe a novel approach towards estimating death counts in Syria and the challenges that are unique to this database. We first introduce computational speed-ups to avoid all-to-all record comparisons, based upon locality-sensitive hashing from the computer science literature. We then introduce a novel approach to entity resolution that discovers a bipartite graph linking manifest records to a common set of latent entities. Our model quantifies the uncertainty in the inference and propagates this uncertainty into subsequent analyses. Finally, we speak to the successes and challenges of solving a problem that is at the forefront of national headlines and news.
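The locality-sensitive hashing step mentioned in the abstract can be sketched as min-hash signatures banded into buckets, so that only records sharing a bucket are ever compared. The records below are invented for illustration and this is a generic min-hash/banding sketch, not the authors' actual code:

```python
import hashlib
from collections import defaultdict

def h(seed, token):
    """Deterministic integer hash of (seed, token)."""
    return int(hashlib.md5(f"{seed}:{token}".encode()).hexdigest(), 16)

def candidate_pairs(records, n_hashes=8, band_size=2):
    """Band min-hash signatures into buckets; only records sharing a
    bucket become candidate pairs, avoiding the all-to-all comparison."""
    buckets = defaultdict(list)
    for rid, tokens in records.items():
        sig = tuple(min(h(s, t) for t in tokens) for s in range(n_hashes))
        for b in range(0, n_hashes, band_size):
            buckets[(b, sig[b:b + band_size])].append(rid)
    pairs = set()
    for ids in buckets.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pairs.add(tuple(sorted((ids[i], ids[j]))))
    return pairs

records = {                                        # invented, not real data
    "r1": {"ahmad", "khalil", "aleppo", "2013"},
    "r2": {"ahmad", "khalil", "aleppo", "2013"},   # same victim, second list
    "r3": {"sara", "haddad", "homs", "2014"},
}
pairs = candidate_pairs(records)
```

Records with identical (or highly similar) token sets land in the same buckets and become candidates, while dissimilar records are never compared, which is what makes the comparison step scale to large databases.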

This is joint work with Rob Hall (Etsy), Steve Fienberg (CMU), and Anshu Shrivastava (Cornell University).

[Note that Rebecca will visit the maths department in Paris-Dauphine for two weeks and give a short course in our data science Master on data confidentiality, privacy and statistical disclosure (syllabus).]

position at Warwick

Posted in Statistics, University life on September 17, 2014 by xi'an

the pond in front of the Zeeman building, University of Warwick, July 01, 2014

A new position of Professor of Statistics and Data Science / Director of the [newly created] Warwick Data Science Institute has been posted. To quote from the job description, “the position arises from the Department of Statistics’ commitment, in collaboration with the Warwick Mathematics Institute and the Department of Computer Science, to a coherent methodological approach to the fundamentals of Data Science and the challenges of complex data sets (for example big data).” The interview date is November 27, 2014. All details available here.

OxWaSP (The Oxford-Warwick Statistics Programme)

Posted in Kids, Statistics, University life on January 21, 2014 by xi'an

This is an official email promoting OxWaSP, our joint doctoral training programme, which I [objectively] think is definitely worth considering if planning a PhD in Statistics. Anywhere.

The Statistics Department – University of Oxford and the Statistics Department – University of Warwick, supported by the EPSRC, will run a joint Centre of Doctoral Training in the theory, methods and applications of Statistical Science for 21st Century data-intensive environments and large-scale models. This is the first centre of its type in the World and will equip its students to work in an area in growing demand both in academia and industry.

Each year from October 2014 OxWaSP will recruit at least 5 students attached to Warwick and at least 5 attached to Oxford. Each student will be funded with a grant for four years of study. Students spend the first year at Oxford developing advanced skills in statistical science. In the first two terms students are given research training through modular courses: Statistical Inference in Complex Models; Multivariate Stochastic Processes; Bayesian Analyses for Complex Structural Information; Machine Learning and Probabilistic Graphical Models; Stochastic Computation for Intractable Inference. In the third term, students carry out two small research projects. At the end of year 1, students begin a three-year research project with a chosen supervisor, five continuing at Oxford and five moving to the University of Warwick.

Training in years 2-4 includes annual retreats, workshops and a research course in machine learning at Amazon (Berlin). There are funded opportunities for students to work with our leading industrial partners and to travel in their third year to an international summer placement in some of the strongest Statistics groups in the USA, Europe and Asia including UC Berkeley, Columbia University, Duke University, the University of Washington in Seattle, ETH Zurich and NUS Singapore.

Applications will be considered on a gathered-field basis, with the next deadline of 24 January 2014 (non-EU applicants should apply by this date to maximise their chance of funding). Interviews for successful applicants who submit by the January deadline will take place at the end of February 2014. There will be a second deadline for applications at the end of February (Warwick) and 14 March (Oxford).