Archive for data-analyst

Le Monde lacks data scientists!

Posted in Books, Statistics on July 11, 2017 by xi'an

In an article in Le Monde today, a journalist is quite critical of statistical analyses of voting behaviours regressed on socio-economic patterns, warning that correlation is not causation and so on and so forth… But the analysis of the votes as presented in the article is itself quite appalling! Just judging from the graph in the article: the vertical and horizontal axes are somewhat inverted (as predicting the proportion of over 65 in the population from their votes does not seem that relevant), and there is an incomprehensible drop in the over 65 proportion within a district between the votes for the fascist party and the other ones, both indicating an inversion of the axes! Further, the curves are apparently derived from four points [correction at the end explaining they used the whole data collection to draw the curve], the variability in the curves is not compared with the overall variability in the population, more advanced tools than mere correlation are not broached, and so on… They should have asked Andrew. Or YouGov!
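To make the point about the axes concrete, here is a minimal sketch in Python on made-up data (nothing to do with the actual Le Monde figures): regressing the vote share on the proportion of over 65 and regressing the proportion of over 65 on the vote share give two different lines, so which variable sits on which axis does matter. The function names, the sample size, and the coefficients are all assumptions picked for illustration.

```python
import random


def ols_slope(x, y):
    """Least-squares slope of the regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def correlation(x, y):
    """Pearson correlation between x and y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


random.seed(0)

# Synthetic districts: hypothetical numbers, NOT the data behind the article.
share_over_65 = [random.uniform(10, 35) for _ in range(500)]
vote_share = [20 + 0.4 * s + random.gauss(0, 6) for s in share_over_65]

# Natural direction: explain the vote by the age structure.
slope_vote_on_age = ols_slope(share_over_65, vote_share)
# Inverted direction: "predict" the age structure from the vote.
slope_age_on_vote = ols_slope(vote_share, share_over_65)
r = correlation(share_over_65, vote_share)

print("vote on age:", slope_vote_on_age)
print("age on vote, drawn on the same axes:", 1 / slope_age_on_vote)
print("squared correlation:", r ** 2)
```

The product of the two slopes equals the squared correlation, so the two lines only coincide when the correlation is perfect.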

numbersense (book review)

Posted in Books, Statistics on August 22, 2013 by xi'an

While I got an advance reader’s copy of numbersense, Kaiser Fung’s latest book, sent to me by the publisher McGraw-Hill, I did not manage to write a review until the book had been out for two months. The title of the book is clear enough about the purpose of the author, but the subtitle “How to use Big Data to your advantage” stresses it even further. And it includes the sesame “Big Data”, much more likely to appeal to the general reader than “statistics”…!

“I wouldn’t blame you if you are ready to burn this book, and vow never to talk to the lying statisticians ever again.” (p.4)

So why did it take me such a long while to compose this review?! Besides the break induced by The Accident (I took the book to the hospital but ended up reviewing R for Dummies instead!), I figure I got rather taken aback by the style and intended audience of numbersense, given my earlier reading and enjoying of Numbers rule your world. While the book remains of interest for statisticians (and other CHANCE readers!), providing examples to use in the classroom, the statistical connection is anything but visible to the casual reader, who may well conclude that numbersense is a form of numerical common sense, or about fighting innumeracy, rather than about modelling uncertainty through statistical models.

“In analyzing data, there is no way to avoid having theoretical assumptions (…) The world has never run out of theoreticians; in the era of Big Data, the bar of evidence is reset lower, making it tougher to tell right from wrong.” (p.11)

Overall, the intended audience of numbersense seems even further away from statistically savvy readers than Numbers rule your world. The book is divided into four sections: social data (Chap. 1 & 2), marketing data (Chap. 3-5), economic data (Chap. 6 & 7), and sport data (Chap. 8). Plus a prologue on the Simpson paradox (in marketing), involving Howard Wainer whose Uneducated Guesses: Using Evidence to Uncover Misguided Education Policies I reviewed a while ago. The first (more marketing than social) chapter is about doctoring admission policies against GPA and LSAT scores (whatever that means!) to improve the ranking of a school. This does not sound such a major numerical issue (once the trick is uncovered) and the chapter meanders too much to my taste. The second chapter goes back to Quetelet's impossible average man, asking the reader to question the role of indices in definitions (like obesity). And mentioning the “significant result” bias in medical journals in passing. As well as causality. As in the previous chapter, I finished it waiting for a conclusion that never came. Chapters 3 and 4 focus on Groupon, without much of a statistical model (except maybe a second-order Simpson paradox?). Chapter 5 is about how companies like Amazon target their suggestions to customers. Not elaborating on the logit or whatever model is behind, though, and drifting aside on the breach of data secrecy by most of “those” companies. The economics chapters are more to my liking, presumably because they are more standard, covering the subtleties of unemployment and inflation (official) statistics. They fall into what I call the Gini index branch of statistics. At last, the sport chapter is about fantasy football (FF) and not about Moneyball (even though it has links, obviously). I did not go further than a quick perusal of the chapter as I did not understand (most of) the point of the chapter (or of playing FF). For instance, the conclusion seemed quite distanced from the actual story…

“today’s computers do not understand languages. All they do is match text: they can tell me whether the words “empirical Bayes model” are found on a specific Web page.” (p. 209)

The epilogue is of a different nature as it describes two examples of the tasks undertaken by Kaiser Fung as a data analyst. A nasty data transfer. And a manual classification of some Google queries. This may be the part of numbersense that I enjoyed the most. Again, let me stress I have no scientific complaint about the book: it just sounds too low-tech for my taste. And I find it does not help readers go beyond the first level of scepticism about raw and processed data. Because they are not data-analysts.