Archive for CHANCE

9 pitfalls of data science [book review]

Posted in Books, Kids, Statistics, Travel, University life with tags , , , , , , , , , , , , , on September 11, 2019 by xi'an

I received The 9 pitfalls of data science by Gary Smith [who has written a significant number of general public books on personal investment, statistics and AIs] and Jay Cordes from OUP for review a few weeks ago and read it on my trip to Salzburg. This short book contains a lot of anecdotes and what I would qualify of small talk on job experiences and colleagues’ idiosyncrasies…. More fundamentally, it reads as a sequence of examples of bad or misused statistics, as many general public books on statistics do, but with little to say on how to spot such misuses of statistics. Its title (It seems like the 9 pitfalls of… is a rather common début for a book title!) however started a (short) conversation with my neighbour on the train to Salzburg as she wanted to know if the job opportunities in data sciences were better in Germany than in Austria. A practically important question for which I had no clue. And I do not think the book would have helped either! (My neighbour in the earlier plane to München had a book on growing lotus, which was not particularly enticing for launching a conversation either.)

Chapter I “Using bad data” is made of examples of truncated or cherry picked data often associated with poor graphics. Only one dimensional outcome and also very US centric. Chapter II “Data before theory” highlights spurious correlations and post hoc predictions, criticism of data mining, some examples being quite standard. Chapter III “Worshiping maths” sounds like the perfect opposite of the previous cahpter: it discusses the fact that all models are wrong but some may be more wrong than others. And gives examples of over fitting, p-value hacking, regression applied to longitudinal data. With the message that (maths) assumptions are handy and helpful but not always realistic. Chapter IV “Worshiping computers” is about the new golden calf and contains rather standard stuff on trusting the computer output because it is a machine. However, the book is somewhat falling foul of the same mistake by trusting a Monte Carlo simulation of a shortfall probability for retirees since Monte Carlo also depends on a model! Computer simulations may be fine for Bingo night or poker tournaments but much more uncertain for complex decisions like retirement investments. It is also missing the biasing aspects in constructing recidivism prediction models pointed out in Weapons of math destruction. Until Chapter 9 at least. The chapter is also mentioning adversarial attacks if not GANs (!). Chapter V “Torturing data” mentions famous cheaters like Wansink of the bottomless bowl and pizza papers and contains more about p-hacking and reproducibility. Chapter VI “Fooling yourself” is a rather weak chapter in my opinion. Apart from Ioannidis take on Theranos’ lack of scientific backing, it spends quite a lot of space on stories about poker gains in the unregulated era of online poker, with boasts of significant gains that are possibly earned from compulsive gamblers playing their family savings, which is not particularly praiseworthy. And about Brazilian jiu-jitsu. Chapter VII “Correlation vs causation” predictably mentions Judea Pearl (whose book of why I just could not finish after reading one rant too many about statisticians being unable to get causality right! Especially after discussing the book with Andrew.). But not so much to gather from the chapter, which could have instead delved into deep learning and its ways to avoid overfitting. The first example of this chapter is more about confusing conditionals (what is conditional on what?) than turning causation around. Chapter VII “Regression to the mean” sees Galton’s quincunx reappearing here after Pearl’s book where I learned (and checked with Steve Stiegler) that the device was indeed intended for that purpose of illustrating regression to the mean. While the attractive fallacy is worth pointing out there are much worse abuses of regression that could be presented. CHANCE’s Howard Wainer also makes an appearance along SAT scores. Chapter IX “Doing harm” does engage into the issue that predicting social features like recidivism by a (black box) software is highly worrying (and just plain wrong) if only because of this black box nature. Moving predictably to chess and go with the right comment that this does not say much about real data problems. A word of warning about DNA testing containing very little about ancestry, if only because of the company limited and biased database. With further calls for data privacy and a rather useless entry on North Korea. Chapter X “The Great Recession“, which discusses the subprime scandal (as in Stewart’s book), contains a set of (mostly superfluous) equations from Samuelson’s paper (supposed to scare or impress the reader?!) leading to the rather obvious result that the expected concave utility of a weighted average of iid positive rvs is maximal when all the weights are equal, result that is criticised by laughing at the assumption of iid-ness in the case of mortgages. Along with those who bought exotic derivatives whose construction they could not understand. The (short) chapter keeps going through all the (a posteriori) obvious ingredients for a financial disaster to link them to most of the nine pitfalls. Except the second about data before theory, because there was no data, only theory with no connection with reality. This final chapter is rather enjoyable, if coming after the facts. And containing this altogether unnecessary mathematical entry. [Usual warning: this review or a revised version of it is likely to appear in CHANCE, in my book reviews column.]

prime suspects [book review]

Posted in Books, Kids, University life with tags , , , , , , , , , , , , , , on August 6, 2019 by xi'an

I was contacted by Princeton University Press to comment on the comic book/graphic novel Prime Suspects (The Anatomy of Integers and Permutations), by Andrew Granville (mathematician) & Jennifer Granville (writer), and Robert Lewis (illustrator), and they sent me the book. I am not a big fan of graphic book entries to mathematical even less than to statistical notions (Logicomix being sort of an exception for its historical perspective and nice drawing style) and this book did nothing to change my perspective on the subject. First, the plot is mostly a pretense at introducing number theory concepts and I found it hard to follow it for more than a few pages. The [noires maths] story is that “forensic maths” detectives are looking at murders that connects prime integers and permutations… The ensuing NCIS-style investigation gives the authors the opportunity to skim through the whole cenacle of number theorists, plus a few other mathematicians, who appear as more or less central characters. Even illusory ones like Nicolas Bourbaki. And Alexander Grothendieck as a recluse and clairvoyant hermit [who in real life did not live in a Pyrénées cavern!!!]. Second, I [and nor is Andrew who was in my office when the book arrived!] am not particularly enjoying the drawings or the page composition or the colours of this graphic novel, especially because I find the characters drawn quite inconsistently from one strip to the next, to the point of being unrecognisable, and, if it matters, hardly resembling their real-world equivalent (as seen in the portrait of Persi Diaconis). To be completely honest, the drawings look both ugly and very conventional to me, in that I do not find much of a characteristic style to them. To contemplate what Jacques TardiFrançois Schuiten or José Muñoz could have achieved with the same material… (Or even Edmond Baudoin, who drew the strips for the graphic novels he coauthored with Cédric Villani.) The graphic novel (with a prime 181 pages) is postfaced with explanations about the true persons behind the characters, from Carl Friedriech Gauß to Terry Tao, and of course on the mathematical theory for the analogies between the prime and cycles frequencies behind the story. Which I find much more interesting and readable, obviously. (With a surprise appearance of Kingman’s coalescent!) But also somewhat self-defeating in that so much has to be explained on the side for the links between the story, the characters and the background heavily loaded with “obscure references” to make sense to more than a few mathematician readers. Who may prove to be the core readership of this book.

There is also a bit of a Gödel-Escher-and-Bach flavour in that a piece by Robert Schneider called Réverie in Prime Time Signature is included, while an Escher’s infinite stairway appears in one page, not far from what looks like Milano Vittorio Emmanuelle gallery (On the side, I am puzzled by the footnote on p.208 that “I should clarify that selecting a random permutation and a random prime, as described, can be done easily, quickly, and correctly”. This may be connected to the fact that the description of Bach’s algorithm provided therein is incomplete.)

[Disclaimer about potential self-plagiarism: this post or an edited version will eventually appear in my Books Review section in CHANCE. As appropriate for a book about Chance!]

chance call for book reviewers

Posted in Statistics with tags , , , , , , , , on May 14, 2019 by xi'an

Since I have been unable to find local reviewers for my CHANCE review column of the above recent CRC Press books, namely

I am calling for volunteers among ‘Og’s readers. Please contact me if interested.

Statistics and Health Care Fraud & Measuring Crime [ASA book reviews]

Posted in Books, Statistics with tags , , , , , , , , , , , , , , , , on May 7, 2019 by xi'an

From the recently started ASA books series on statistical reasoning in science and society (of which I already reviewed a sequel to The Lady tasting Tea), a short book, Statistics and Health Care Fraud, I read at the doctor while waiting for my appointment, with no chances of cheating! While making me realise that there is a significant amount of health care fraud in the US, of which I had never though of before (!), with possibly specific statistical features to the problem, besides the use of extreme value theory, I did not find me insight there on the techniques used to detect these frauds, besides the accumulation of Florida and Texas examples. As  such this is a very light introduction to the topic, whose intended audience of choice remains unclear to me. It is stopping short of making a case for statistics and modelling against more machine-learning options. And does not seem to mention false positives… That is, the inevitable occurrence of some doctors or hospitals being above the median costs! (A point I remember David Spiegelhalter making a long while ago, during a memorable French statistical meeting in Pau.) The book also illustrates the use of a free auditing software called Rat-stats for multistage sampling, which apparently does not go beyond selecting claims at random according to their amount. Without learning from past data. (I also wonder if the criminals can reduce the chances of being caught by using this software.)

A second book on the “same” topic!, Measuring Crime, I read, not waiting at the police station, but while flying to Venezia. As indicated by the title, this is about measuring crime, with a lot of emphasis on surveys and census and the potential measurement errors at different levels of surveying or censusing… Again very little on statistical methodology, apart from questioning the data, the mode of surveying, crossing different sources, and establishing the impact of the way questions are stated, but also little on bias and the impact of policing and preventing AIs, as discussed in Weapons of Math Destruction and in some of Kristin Lum’s papers.Except for the almost obligatory reference to Minority Report. The book also concludes on an history chapter centred at Edith Abbott setting the bases for serious crime data collection in the 1920’s.

[And the usual disclaimer applies, namely that this bicephalic review is likely to appear later in CHANCE, in my book reviews column.]

Bayes for good

Posted in Books, Mountains, pictures, Running, Statistics, Travel, University life with tags , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , on November 27, 2018 by xi'an

A very special weekend workshop on Bayesian techniques used for social good in many different sense (and talks) that we organised with Kerrie Mengersen and Pierre Pudlo at CiRM, Luminy, Marseilles. It started with Rebecca (Beka) Steorts (Duke) explaining [by video from Duke] how the Syrian war deaths were processed to eliminate duplicates, to be continued on Monday at the “Big” conference, Alex Volfonsky (Duke) on a Twitter experiment on the impact of being exposed to adverse opinions as depolarising (not!) or further polarising (yes), turning into network causal analysis. And then Kerrie Mengersen (QUT) on the use of Bayesian networks in ecology, through observational studies she conducted. And the role of neutral statisticians in case of adversarial experts!

Next day, the first talk of David Corlis (Peace-Work), who writes the Stats for Good column in CHANCE and here gave a recruiting spiel for volunteering in good initiatives. Quoting Florence Nightingale as the “first” volunteer. And presenting a broad collection of projects as supports to his recommendations for “doing good”. We then heard [by video] Julien Cornebise from Element AI in London telling of his move out of DeepMind towards investing in social impacting projects through this new startup. Including working with Amnesty International on Darfour village destructions, building evidence from satellite imaging. And crowdsourcing. With an incoming report on the year activities (still under embargo). A most exciting and enthusiastic talk!

Continue reading

Is that a big number? [book review]

Posted in Books, Kids, pictures, Statistics with tags , , , , , , , , , on July 31, 2018 by xi'an

A book I received prior to its publication a few days ago from OXford University Press (OUP), as a book editor for CHANCE (usual provisions apply: the contents of this post will be more or less reproduced in my column in CHANCE when it appears). Copy that I found in my mailbox in Warwick last week and read over the (very hot) weekend.

The overall aim of this book by Andrew Elliott is to encourage numeracy (or fight innumeracy) by making sense of absolute quantities by putting them in perspective, teaching about log scales, visualisation, and divide-and-conquer techniques. And providing a massive list of examples and comparisons, sometimes for page after page… The book is associated with a fairly rich website, itself linked with the many blogs of the author and a myriad of other links and items of information (among which I learned of the recent and absurd launch of Elon Musk’s Tesla car in space! A première in garbage dumping…). From what I can gather from these sites, some (most?) of the material in the book seems to have emerged from the various blog entries.

“Length of River Thames (386 km) is 2 x length of the Suez Canal (193.3 km)”

Maybe I was too exhausted by heat and a very busy week in Warwick for our computational statistics week, the football  2018 World Cup having nothing to do with this, but I could not keep reading the chapters of the book in a continuous manner, suffering from massive information overdump! Being given thousands of entries kills [for me] the appeal of outing weight or sense to large and very large and humongous quantities. And the final vignette in each chapter of pairing of numbers like the one above or the one below

“Time since earliest writing (5200 y) is 25 x time since birth of Darwin (208 y)”

only evokes the remote memory of some kid journal I read from time to time as a kid with this type of entries (I cannot remember the name of the journal!). Or maybe it was a journal I would browse while waiting at the hairdresser’s (which brings back memories of endless waits, maybe because I did not like going to the hairdresser…) Some of the background about measurement and other curios carry a sense of Wikipediesque absolute in their minute details.

A last point of disappointment about the book is the poor graphical design or support. While the author insists on the importance of visualisation on grasping the scales of large quantities, and the webpage is full of such entries, there is very little backup with great graphs to be found in “Is that a big number?” Some of the pictures seem taken from an anonymous databank (where are the towers of San Geminiano?!) and there are not enough graphics. For instance, the fantastic graphics of xkcd conveying the xkcd money chart poster. Or about future. Or many many others

While the style is sometimes light and funny, an overall impression of dryness remains and in comparison I much more preferred Kaiser Fung’s Numbers rule your world and even more both Guesstimation books!

chance meeting

Posted in Statistics with tags , , , , , , , , , on July 10, 2018 by xi'an

As I was travelling to Coventry yesterday, I spotted this fellow passenger on the train from Birmingham with a Valencia 9 bag, and a chat with him. It was a pure chance encounter as he was not attending our summer school, but continued down the line. (These bags are quite sturdy and I kept mine until a zipper broke.)