## Archive for Oxford University Press

## on-line parameter estimation with Wasserstein

Posted in Books, Statistics, University life with tags ABC, consistency of ABC methods, empirical distribution, IMA, Information and Inference, misspecified model, Oxford University Press, publication, Wasserstein distance on November 27, 2019 by xi'an

**J**ust found out that our paper *On parameter estimation with the Wasserstein distance*, with Espen Bernton, Pierre Jacob, and Mathieu Gerber, has now appeared on-line in *Information and Inference: A Journal of the IMA*.

## 9 pitfalls of data science [book review]

Posted in Books, Kids, Statistics, Travel, University life with tags Austria, book review, CHANCE, Germany, jiu-jitsu, lotus, OUP, Oxford University Press, poker, Salzburg, The Book of Why, Theranos, train travel, USA on September 11, 2019 by xi'an

**I** received *The 9 pitfalls of data science* by Gary Smith [who has written a significant number of general public books on personal investment, statistics and AIs] and Jay Cordes from OUP for review a few weeks ago and read it on my trip to Salzburg. This short book contains a lot of anecdotes and what I would qualify as small talk on job experiences and colleagues’ idiosyncrasies… More fundamentally, it reads as a sequence of examples of bad or misused statistics, as many general public books on statistics do, but with little to say on how to spot such misuses of statistics. Its title (it seems like *the 9 pitfalls of…* is a rather common début for a book title!) however started a (short) conversation with my neighbour on the train to Salzburg as she wanted to know if the job opportunities in data science were better in Germany than in Austria. A practically important question about which I had no clue. And I do not think the book would have helped either! (My neighbour in the earlier plane to München had a book on growing lotus, which was not particularly enticing for launching a conversation either.)

Chapter I “*Using bad data*” is made of examples of truncated or cherry-picked data, often associated with poor graphics. Only one-dimensional outcomes and also very US-centric. Chapter II “*Data before theory*” highlights spurious correlations and post hoc predictions, criticism of data mining, some examples being quite standard. Chapter III “*Worshiping maths*” sounds like the perfect opposite of the previous chapter: it discusses the fact that all models are wrong but some may be more wrong than others. And gives examples of overfitting, p-value hacking, regression applied to longitudinal data. With the message that (maths) assumptions are handy and helpful but not always realistic. Chapter IV “*Worshiping computers*” is about the new golden calf and contains rather standard stuff on trusting the computer output because it is a machine. However, the book somewhat falls foul of the same mistake by trusting a Monte Carlo simulation of a shortfall probability for retirees, since Monte Carlo also depends on a model! Computer simulations may be fine for Bingo night or poker tournaments but are much more uncertain for complex decisions like retirement investments. It is also missing the biasing aspects in constructing recidivism prediction models pointed out in *Weapons of math destruction*. Until Chapter 9 at least. The chapter also mentions adversarial attacks if not GANs (!). Chapter V “*Torturing data*” mentions famous cheaters like Wansink of the bottomless bowl and pizza papers and contains more about p-hacking and reproducibility. Chapter VI “*Fooling yourself*” is a rather weak chapter in my opinion. Apart from Ioannidis’ take on Theranos’ lack of scientific backing, it spends quite a lot of space on stories about poker gains in the unregulated era of online poker, with boasts of significant gains that are possibly earned from compulsive gamblers playing their family savings, which is not particularly praiseworthy. And about Brazilian jiu-jitsu.
Chapter VII “*Correlation vs causation*” predictably mentions Judea Pearl (whose *Book of Why* I just could not finish after reading one rant too many about statisticians being unable to get causality right! Especially after discussing the book with Andrew.). But there is not so much to gather from the chapter, which could have instead delved into deep learning and its ways to avoid overfitting. The first example of this chapter is more about confusing conditionals (what is conditional on what?) than turning causation around. Chapter VIII “*Regression to the mean*” sees Galton’s quincunx reappearing here after Pearl’s book, where I learned (and checked with Steve Stigler) that the device was indeed intended for that purpose of illustrating regression to the mean. While the attractive fallacy is worth pointing out, there are much worse abuses of regression that could be presented. CHANCE’s Howard Wainer also makes an appearance along with SAT scores. Chapter IX “*Doing harm*” does engage with the issue that predicting social features like recidivism by a (black box) software is highly worrying (and just plain wrong) if only because of this black box nature. Moving predictably to chess and go with the right comment that this does not say much about real data problems. A word of warning about DNA testing containing very little about ancestry, if only because of the company’s limited and biased database. With further calls for data privacy and a rather useless entry on North Korea. Chapter X “*The Great Recession*”, which discusses the subprime scandal (as in Stewart’s book), contains a set of (mostly superfluous) equations from Samuelson’s paper (supposed to scare or impress the reader?!) leading to the rather obvious result that the expected concave utility of a weighted average of iid positive rvs is maximal when all the weights are equal, a result that is criticised by laughing at the assumption of iid-ness in the case of mortgages.
Along with those who bought exotic derivatives whose construction they could not understand. The (short) chapter keeps going through all the (a posteriori) obvious ingredients for a financial disaster to link them to most of the nine pitfalls. Except the second about data before theory, because there was no data, only theory with no connection with reality. This final chapter is rather enjoyable, if coming after the facts. And containing this altogether unnecessary mathematical entry. *[Usual warning: this review or a revised version of it is likely to appear in CHANCE, in my book reviews column.]*
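The equal-weights result mentioned for Chapter X follows from a short symmetry argument. What follows is only a sketch under stated assumptions, in notation introduced here for illustration (it is not taken from the book or from Samuelson’s paper): $X_1,\dots,X_n$ are iid, $u$ is concave, and the weights $w_1,\dots,w_n$ are non-negative and sum to one. Since the $X_i$ are exchangeable, permuting the weights by any permutation $\sigma$ does not change the expected utility:

$$
\mathbb{E}\Big[u\Big(\sum_{i=1}^n w_{\sigma(i)} X_i\Big)\Big] = \mathbb{E}\Big[u\Big(\sum_{i=1}^n w_i X_i\Big)\Big].
$$

Averaging the weights over all $n!$ permutations returns the equal weights, $\frac{1}{n!}\sum_\sigma w_{\sigma(i)} = \frac{1}{n}$, so Jensen’s inequality applied to the concave $u$ gives

$$
u\Big(\frac{1}{n}\sum_{i=1}^n X_i\Big) \;\ge\; \frac{1}{n!}\sum_{\sigma} u\Big(\sum_{i=1}^n w_{\sigma(i)} X_i\Big),
$$

and taking expectations on both sides yields

$$
\mathbb{E}\Big[u\Big(\frac{1}{n}\sum_{i=1}^n X_i\Big)\Big] \;\ge\; \mathbb{E}\Big[u\Big(\sum_{i=1}^n w_i X_i\Big)\Big].
$$

The argument only uses exchangeability and concavity, which is why the iid assumption is the weak point in the case of mortgages.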

## Is that a big number? [book review]

Posted in Books, Kids, pictures, Statistics with tags big numbers, Book, book review, CHANCE, counting, Guesstimation, innumeracy, measurement, Oxford University Press, xkcd on July 31, 2018 by xi'an

**A** book I received prior to its publication a few days ago from Oxford University Press (OUP), as a book editor for CHANCE (*usual provisions apply:* the contents of this post will be more or less reproduced in my column in CHANCE when it appears). Copy that I found in my mailbox in Warwick last week and read over the (very hot) weekend.

The overall aim of this book by Andrew Elliott is to encourage numeracy (or fight innumeracy) by making sense of absolute quantities by putting them in perspective, teaching about log scales, visualisation, and divide-and-conquer techniques. And providing a massive list of examples and comparisons, sometimes for page after page… The book is associated with a fairly rich website, itself linked with the many blogs of the author and a myriad of other links and items of information (among which I learned of the recent and absurd launch of Elon Musk’s Tesla car in space! A première in garbage dumping…). From what I can gather from these sites, some (most?) of the material in the book seems to have emerged from the various blog entries.

“Length of River Thames (386 km) is 2 x length of the Suez Canal (193.3 km)”

Maybe I was too exhausted by heat and a very busy week in Warwick for our computational statistics week, the football 2018 World Cup having nothing to do with this, but I could not keep reading the chapters of the book in a continuous manner, suffering from massive information overdump! Being given thousands of entries kills [for me] the appeal of putting weight or sense to large and very large and humongous quantities. And the final vignette in each chapter, a pairing of numbers like the one above or the one below

“Time since earliest writing (5200 y) is 25 x time since birth of Darwin (208 y)”

only evokes the remote memory of some kid journal I read from time to time as a kid with this type of entries (I cannot remember the name of the journal!). Or maybe it was a journal I would browse while waiting at the hairdresser’s (which brings back memories of endless waits, maybe because I did not like going to the hairdresser…) Some of the background about measurement and other curios carry a sense of Wikipediesque absolute in their minute details.

A last point of disappointment about the book is the poor graphical design or support. While the author insists on the importance of visualisation for grasping the scales of large quantities, and the webpage is full of such entries, there is very little backup with great graphs to be found in *“Is that a big number?”* Some of the pictures seem taken from an anonymous databank (where are the towers of San Gimignano?!) and there are not enough graphics. Compare, for instance, with the fantastic graphics of xkcd, such as the money chart poster. Or the one about the future. Or many many others…

While the style is sometimes light and funny, an overall impression of dryness remains and in comparison I much preferred Kaiser Fung’s *Numbers rule your world* and even more both *Guesstimation* books!

## expectation-propagation from Les Houches

Posted in Books, Mountains, pictures, Statistics, University life with tags belief propagation, book review, Chamonix, CHANCE, expectation-propagation, French Alps, Lasso, Les Houches, Oxford University Press, sparsity, summer school on February 3, 2016 by xi'an

**A**s CHANCE book editor, I received the other day from Oxford University Press the proceedings of an École de Physique des Houches on Statistical Physics, Optimisation, Inference, and Message-Passing Algorithms that took place there from September 30 to October 11, 2013. While it is mostly unrelated to Statistics, and since Igor Caron already reviewed the book a year and more ago, I skimmed through the few chapters connected to my interests, from Devavrat Shah’s chapter on graphical models and belief propagation, to Andrea Montanari’s on denoising and sparse regression, including LASSO, and only read in some detail Manfred Opper’s expectation propagation chapter. This chapter made me realise (or re-realise as I had presumably forgotten an earlier explanation!) that expectation propagation can be seen as a sort of variational approximation that produces, by a sequence of iterations, the distribution within a certain parametric (exponential) family that is the closest to the distribution of interest. By writing the Kullback-Leibler divergence the opposite way from the usual variational approximation, the solution equates the expectation of the natural sufficient statistic under both models… Another interesting aspect of this chapter is the connection with estimating normalising constants. (I noticed a slight typo on p.269 in the final form of the Kullback approximation q() to p().)
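The moment-matching property can be sketched in a few lines. The notation is introduced here for illustration and is an assumption, not the chapter’s own: the approximating family is written as $q_\theta(x) = h(x)\exp\{\theta^\top T(x) - A(\theta)\}$, with natural sufficient statistic $T$ and log-normaliser $A$, and $p$ is the distribution of interest. The divergence taken “the opposite way” is then

$$
\mathrm{KL}(p\,\|\,q_\theta) = \int p(x)\log\frac{p(x)}{h(x)}\,\mathrm{d}x \;-\; \theta^\top\,\mathbb{E}_p[T(X)] \;+\; A(\theta).
$$

Only the last two terms depend on $\theta$. Setting the gradient to zero and using the standard exponential-family identity $\nabla A(\theta) = \mathbb{E}_{q_\theta}[T(X)]$ gives

$$
\nabla_\theta\,\mathrm{KL}(p\,\|\,q_\theta) = -\,\mathbb{E}_p[T(X)] + \nabla A(\theta) = 0
\quad\Longleftrightarrow\quad
\mathbb{E}_{q_\theta}[T(X)] = \mathbb{E}_p[T(X)].
$$

Since $A$ is convex, this stationary point is a minimiser. For a Gaussian family, for instance, it amounts to matching the mean and the variance of $p$.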

## The Unimaginable Mathematics of Borges’ Library of Babel [book review]

Posted in Books, Statistics, Travel, University life with tags book review, Boston, cohomology, combinatorics, infinity, information theory, Jorge Luis Borges, JSM 2014, Library of Babel, Oxford University Press, Turing's machine on September 30, 2014 by xi'an

**T**his is a book I carried away from JSM in Boston as the Oxford University Press representative kindly provided me with a copy at the end of the meeting. After I asked for it, as I was quite excited to see a book linking Jorge Luis Borges’ great Library of Babel short story with mathematical concepts. Even though many other short stories by Borges have a mathematical flavour and are bound to fascinate mathematicians, the Library of Babel is particularly prone to mathematisation as it deals with the notions of infinite, periodicity, permutation, randomness… As it happens, William Goldbloom Bloch [a patronym that would surely have inspired Borges!], professor of mathematics at Wheaton College, Mass., published *The Unimaginable Mathematics of Borges’ Library of Babel* in 2008, so this is not a recent publication. But I had managed to miss it through the several conferences where I stopped at the OUP exhibit booth. (Interestingly, William Bloch has also published a mathematical paper on Neal Stephenson’s Cryptonomicon.)

**N**ow, what is unimaginable in the maths behind Borges’ great Library of Babel??? The obvious line of entry to the mathematical aspects of the book is combinatorics: how many different books are there in total? [Ans. 10¹⁸³⁴⁰⁹⁷…] how many hexagons are needed to shelve that many books? [Ans. 10⁶⁸¹⁵³¹…] how long would it take to visit all those hexagons? how many librarians are needed for a Library containing all volumes once and only once? how many different libraries are there? [Ans. 10^{10⁶}…] Then the book embarks upon some cohomology, Cavalieri’s infinitesimals (mentioned by Borges in a footnote), Zeno’s paradox, topology (with Klein’s bottle), graph theory (and the important question as to whether or not each hexagon has one or two stairs), information theory, Turing’s machine. The concluding chapters are comments about other mathematical analyses of Borges’ Grand Œuvre and a discussion on how much maths Borges knew.
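The first of these counts is easy to check on the back of an envelope. Here is a minimal sketch, assuming the book format given in Borges’ short story (410 pages, 40 lines per page, about 80 characters per line, 25 orthographic symbols), which is not stated in the review above; the function and variable names are made up for illustration. It works on the log scale rather than building the integer itself, which would have close to two million digits.

```python
import math

# Book format as described in Borges' short story (an assumption here,
# since the review above does not restate it).
SYMBOLS = 25
PAGES, LINES_PER_PAGE, CHARS_PER_LINE = 410, 40, 80


def log10_number_of_books(symbols: int, chars_per_book: int) -> float:
    """Return log10(symbols ** chars_per_book) without forming the huge integer."""
    return chars_per_book * math.log10(symbols)


chars_per_book = PAGES * LINES_PER_PAGE * CHARS_PER_LINE
log10_books = log10_number_of_books(SYMBOLS, chars_per_book)

# Split into scientific notation: mantissa x 10^exponent.
exponent = math.floor(log10_books)
mantissa = 10 ** (log10_books - exponent)

print(f"characters per book: {chars_per_book:,}")
print(f"distinct books: about {mantissa:.2f} x 10^{exponent:,}")
```

Under these assumptions the exponent agrees with the 10¹⁸³⁴⁰⁹⁷ quoted above, every book being one of 25 symbols repeated over 1,312,000 positions.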

**S**o a nice escapade through some mathematical landscapes with more or less connection with the original masterpiece. I am not convinced it brings any further dimension or insight about it, or even that one should try to dissect it that way, because it kills the poetry in the story, especially the play around the notion(s) of infinite. The fact that the short story is incomplete [and short on details] makes its beauty: if one starts wondering at the possibility of the Library or at the daily life of the librarians [like, what do they eat? why are they there? where are the readers? what happens when they die? &tc.] the intrusion of realism closes the enchantment! Nonetheless, the unimaginable mathematics of Borges’ Library of Babel provides a pleasant entry into some mathematical concepts and as such may initiate a layperson not too shy of maths formulas to the beauty of mathematics.

## straightforward statistics [book review]

Posted in Books, Kids, Statistics, University life with tags hypothesis testing, introductory textbooks, multiple tests, Oxford University Press, p-values, power, psychology, tests on July 3, 2014 by xi'an

“I took two different statistics courses as an undergraduate psychology major [and] four different advanced statistics classes as a PhD student.” G. Geher

*Straightforward Statistics: Understanding the Tools of Research* by Glenn Geher and Sara Hall is an introductory textbook for psychology and other social science students. (That Oxford University Press sent me for review in CHANCE. Nice cover, by the way!) I can spot the purpose behind the title, a purpose heavily stressed anew in the preface and the first chapter, but it nonetheless irks me as conveying the message that one semester of reasonable diligence in class will suffice for any college student to reach *“not only understanding research findings from psychology, but also to uncovering new truths about the world and our place in it”* (p.9). Nothing less. While, in essence, it covers the basics found in all introductory textbooks, from descriptive statistics to ANOVA models. The inclusion of “real research examples” in the chapters of the book rather demonstrates how far from real research a reader of the book would stand…