## Archive for the Books Category

## Vidiadhar Surajprasad Naipaul (1932-2018)

Posted in Books, pictures, Travel with tags A House for Mister Biswas, Amadeus, Dubliners, English literature, James Joyce, John Huston, José de Sam Lazaro, Man Booker Prize, Nobel Prize, The Dead, Vidiadhar Surajprasad Naipaul on August 13, 2018 by xi'an

I heard of the death of the writer V.S. Naipaul late today, after arriving on the North coast of Vancouver Island. While not familiar with many of his books, I remember reading A House for Mr Biswas and A Bend in the River in the mid-80’s, following a suggestion by my late friend José de Sam Lazaro, who was a professor in Rouen when I was doing my PhD there and with whom I would travel from Paris to Rouen by the first morning train… Like most suggestions from José, it was an eye-opener on different views and different stories, as well as a pleasure to read the crisp style of Naipaul. Who thus remains inextricably linked with my memories of José. I also remember later discussing with him, by postal letters, while at Purdue, the strength of Huston’s The Dead, adapted from the last and possibly best story of Joyce’s Dubliners, which struck me as expressing so clearly and deeply the final feelings of utter failure of Conroy, Gretta’s husband. As well as his defense of Forman’s Amadeus!

## running shoes

Posted in Books, Running, Statistics with tags 4%, Annals of Applied Statistics, half-marathon, long distance running, New York City Marathon, Nike, NYT, The New York Times, three-hour marathon, USA, vaporfly on August 12, 2018 by xi'an

**A** few days ago, when back from my morning run, I spotted a NYT article on Nike shoes that are supposed to bring on average a 4% gain in speed. Meaning for instance a 3 to 4 minute gain in a half-marathon.
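The conversion from a 4% speed gain to minutes is a one-line piece of arithmetic. The sketch below assumes illustrative half-marathon finishing times of 80, 90 and 100 minutes (these values are mine, not the article's) and computes the time saved when the same distance is covered 4% faster.

```r
# Time saved over a fixed distance when speed increases by a relative gain.
# time_min: finishing time in minutes without the gain (illustrative values)
# gain: relative speed gain, 0.04 for the 4% quoted in the article
time_saved <- function(time_min, gain = 0.04) time_min * (1 - 1 / (1 + gain))

half_times <- c(80, 90, 100)  # hypothetical half-marathon times, in minutes
saved <- time_saved(half_times)
print(round(saved, 2))
```

For these finishing times the saving stays between 3 and 4 minutes, in line with the figure quoted above.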

“Using public race reports and shoe records from Strava, a fitness app that calls itself the social network for athletes, The Times found that runners in Vaporflys ran 3 to 4 percent faster than similar runners wearing other shoes, and more than 1 percent faster than the next-fastest racing shoe.”

What is interesting in this NYT article is that the two journalists who wrote it have analysed their own data, taken from Strava. Using a statistical model or models (linear regression? non-linear regression? neural net?) to predict the impact of the shoe make, against “all” other factors contributing to the overall time or position or percentage gain or yet something else. In most analyses produced in the NYT article, the 4% gain is reproduced (with a 2% gain for female shoe switchers and a 7% gain for slow runners).

“Of course, these observations do not constitute a randomized control trial. Runners choose to wear Vaporflys; they are not randomly assigned them. One statistical approach that seeks to address this uses something called propensity scores, which attempt to control for the likelihood that someone wears the shoes in the first place. We tried this, too. Our estimates didn’t change.”

The statistical analysis (or analyses) seems rather thorough, from what is reported in the NYT article, with several attempts at controlling for confounders. Still, the data itself is observational, even if providing a lot of variables to run the analyses, as it only covers runners using Strava (from 5% in Tokyo to 25% in London!) and indicating the type of shoes they wear during the race. There is also the issue that the shoes are quite expensive, at $250 a pair, especially if the effect wears out after 100 miles (this was not tested in the study), as I would hesitate to use them unless the race conditions look optimal (and they never do!). There is certainly a new shoes effect on top of that, between the real impact of a better response and a placebo effect. As shown by a similar effect of many other shoe makes. Hence, a moderating impact on the NYT conclusion that these Nike Vaporflys (flies?!) are an “outlier”. But nonetheless a fairly elaborate and careful statistical study that could potentially make it to a top journal like Annals of Applied Statistics!
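To make the propensity score idea in the quote concrete, here is a minimal sketch on simulated data, not on the Strava records. Every quantity in it is an assumption of mine: a single confounder called `ability`, a true 4% gain on the log finishing time, and a logistic model for who wears the shoes. It contrasts the naive comparison of wearers against non-wearers with an inverse-probability-weighted comparison based on the estimated propensity score.

```r
set.seed(42)
n <- 5000
ability <- rnorm(n)                     # hypothetical confounder (fitness)
shoe <- rbinom(n, 1, plogis(ability))   # fitter runners wear the shoes more often
true_gain <- 0.04                       # assumed effect on the log finishing time
logtime <- log(100) - 0.1 * ability - true_gain * shoe + rnorm(n, sd = 0.05)

# naive comparison of shoe wearers against the others
naive <- mean(logtime[shoe == 1]) - mean(logtime[shoe == 0])

# propensity score: estimated probability of wearing the shoes
ps <- fitted(glm(shoe ~ ability, family = binomial))

# inverse-probability weighting with normalised weights
w <- ifelse(shoe == 1, 1 / ps, 1 / (1 - ps))
ipw <- weighted.mean(logtime[shoe == 1], w[shoe == 1]) -
  weighted.mean(logtime[shoe == 0], w[shoe == 0])

print(c(naive = naive, ipw = ipw))
```

In this toy setting the naive gap overstates the gain, because the wearers are fitter to begin with, while the weighted estimate should land close to the assumed 4%. When, as the journalists report, the adjustment leaves the estimate unchanged, this suggests that the measured covariates were not driving the result, although unmeasured confounders remain possible.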

## Le Monde puzzle [#1063]

Posted in Books, Kids, R with tags arithmetics, competition, Le Monde, mathematical puzzle, prime factor decomposition, primeFactors(), R on August 9, 2018 by xi'an

**A** simple (summertime?!) arithmetic Le Monde mathematical puzzle:

*A “powerful integer” is such that all its prime divisors have multiplicity at least 2. Are there two powerful integers in a row, i.e. such that both n and n+1 are powerful?*

*Are there odd integers n such that n² – 1 is a powerful integer?*

**T**he first question can be solved by brute force. Here is an R code that leads to the solution:

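    library(numbers)  # primeFactors() is assumed to come from the numbers package
    # TRUE when every prime factor of n has multiplicity at least 2
    isperfz <- function(n){
      divz=primeFactors(n)
      facz=unique(divz)
      ordz=rep(0,length(facz))
      for (i in 1:length(facz))
        ordz[i]=sum(divz==facz[i])
      return(min(ordz)>1)}
    lesperf=NULL
    for (t in 4:1e5)
      if (isperfz(t)) lesperf=c(lesperf,t)
    # first members of the pairs of consecutive powerful integers
    twinz=lesperf[diff(lesperf)==1]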

with solutions 8, 288, 675, 9800, 12167.

The second puzzle means rerunning the code only on integers n²-1…

    [1] 8
    [1] 288
    [1] 675
    [1] 9800
    [1] 235224
    [1] 332928
    [1] 1825200
    [1] 11309768

except that I cannot exceed n²=10⁸. (The Le Monde puzzles will now stop for a month, just like about everything in France!, and then a new challenge will take place. Stay tuned.)
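The code for the second question is not shown, so here is a minimal sketch of what the rerun could look like in base R, with a hand-written trial-division test in place of `primeFactors()`. The function name `is_powerful` and the bound of 2000 on n are my own choices, the latter only to keep the run short.

```r
# TRUE when every prime factor of n has multiplicity at least 2
is_powerful <- function(n){
  p <- 2
  while (p * p <= n){
    if (n %% p == 0){
      k <- 0
      while (n %% p == 0){ n <- n / p; k <- k + 1 }
      if (k < 2) return(FALSE)   # p divides n exactly once
    }
    p <- p + 1
  }
  n == 1   # a leftover factor larger than 1 is a prime of multiplicity one
}

odd_n <- seq(3, 1999, by = 2)    # odd candidates only, as in the puzzle
vals <- odd_n^2 - 1
sols <- vals[sapply(vals, is_powerful)]
print(sols)
```

Restricting to odd n matters: the value 675 in the list above is 26² – 1, hence comes from an even n, while 8, 288 and 9800 correspond to n = 3, 17 and 99.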

## Is that a big number? [book review]

Posted in Books, Kids, pictures, Statistics with tags big numbers, Book, book review, CHANCE, counting, Guesstimation, innumeracy, measurement, Oxford University Press, xkcd on July 31, 2018 by xi'an

**A** book I received prior to its publication a few days ago from Oxford University Press (OUP), as a book editor for CHANCE (*usual provisions apply:* the contents of this post will be more or less reproduced in my column in CHANCE when it appears). Copy that I found in my mailbox in Warwick last week and read over the (very hot) weekend.

The overall aim of this book by Andrew Elliott is to encourage numeracy (or fight innumeracy) by making sense of absolute quantities by putting them in perspective, teaching about log scales, visualisation, and divide-and-conquer techniques. And providing a massive list of examples and comparisons, sometimes for page after page… The book is associated with a fairly rich website, itself linked with the many blogs of the author and a myriad of other links and items of information (among which I learned of the recent and absurd launch of Elon Musk’s Tesla car in space! A première in garbage dumping…). From what I can gather from these sites, some (most?) of the material in the book seems to have emerged from the various blog entries.

“Length of River Thames (386 km) is 2 x length of the Suez Canal (193.3 km)”

Maybe I was too exhausted by heat and a very busy week in Warwick for our computational statistics week, the 2018 football World Cup having nothing to do with this, but I could not keep reading the chapters of the book in a continuous manner, suffering from massive information overdump! Being given thousands of entries kills [for me] the appeal of putting weight or sense to large and very large and humongous quantities. And the final vignette in each chapter of pairing of numbers like the one above or the one below

“Time since earliest writing (5200 y) is 25 x time since birth of Darwin (208 y)”

only evokes the remote memory of some kid journal I read from time to time as a kid with this type of entry (I cannot remember the name of the journal!). Or maybe it was a journal I would browse while waiting at the hairdresser’s (which brings back memories of endless waits, maybe because I did not like going to the hairdresser…) Some of the background about measurement and other curios carries a sense of Wikipediesque absolute in its minute details.

A last point of disappointment about the book is the poor graphical design or support. While the author insists on the importance of visualisation in grasping the scales of large quantities, and the webpage is full of such entries, there is very little backup with great graphs to be found in *“Is that a big number?”* Some of the pictures seem taken from an anonymous databank (where are the towers of San Gimignano?!) and there are not enough graphics. For instance, the fantastic graphics of xkcd, like the xkcd money chart poster. Or about the future. Or many many others…

While the style is sometimes light and funny, an overall impression of dryness remains and in comparison I much preferred Kaiser Fung’s Numbers Rule Your World and even more both Guesstimation books!

## Le Monde puzzle [#1062]

Posted in Books, Kids, pictures, R with tags Heron's formula, Le Monde, mathematical puzzle, Pythagorean theorem, R, triangle geometry on July 28, 2018 by xi'an

**A** simple Le Monde mathematical puzzle none too geometric:

*Find right-angled triangles whose sides are all integers and whose area equals the perimeter.*

*Extend to non-right-angled triangles.*

No visible difficulty by virtue of Pythagoras’ formula:

    for (a in 1:1e4)
      for (b in a:1e4){
        z=sqrt(a*a+b*b)  # hypotenuse, required to be an integer
        if ((z==round(z))&&(a*b==2*(a+b+z))) print(c(a,b))}

produces two answers

    5 12
    6 8

and in the more general case, Heron’s formula to the rescue!,

    for (a in 1:1e2)
      for (b in a:1e2)
        for (z in b:1e2){
          s=(a+b+z)/2
          # area=perimeter means (s-a)(s-b)(s-z)=4s, with a positive product
          if (abs(4*s-(s-a)*(s-b)*(s-z))<1e-4) print(c(a,b,z))}

returns

    5 12 13
    6 8 10
    6 25 29
    7 15 20
    9 10 17
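Any candidate triple can be checked directly against Heron's formula. The helper below is a sketch of mine (the name `area_is_perimeter` is hypothetical): it first rejects triples that violate the triangle inequality, which I assume is required of a valid answer, then compares the squared area s(s-a)(s-b)(s-z) with the squared perimeter (2s)².

```r
# TRUE when (a, b, z) is a genuine triangle whose area equals its perimeter
area_is_perimeter <- function(a, b, z){
  sides <- sort(c(a, b, z))
  if (sides[1] + sides[2] <= sides[3]) return(FALSE)  # triangle inequality fails
  s <- sum(sides) / 2                                 # half-perimeter
  area2 <- s * (s - a) * (s - b) * (s - z)            # Heron: squared area
  isTRUE(all.equal(area2, (2 * s)^2))
}

area_is_perimeter(5, 12, 13)   # right-angled case: TRUE
area_is_perimeter(7, 15, 20)   # general case: TRUE
area_is_perimeter(4, 15, 21)   # 4 + 15 < 21, not a triangle: FALSE
```

The triangle inequality test matters because, for a degenerate triple such as (4, 15, 21), the product (s-a)(s-b)(s-z) equals -80, which matches 4s = 80 only in absolute value.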

## more multiple proposal MCMC

Posted in Books, Statistics with tags delayed rejection sampling, directed acyclic graphs, Gibbs sampler, multiple-try Metropolis algorithm, parallelisation, prefetching, pseudo-posterior, subsampling on July 26, 2018 by xi'an

**L**uo and Tjelmeland just arXived a paper on a new version of multiple-try Metropolis-Hastings, the addendum being in defining the additional proposed copies via a dependence graph like (a) above, with one version from the target and all others from operational and conditional proposal kernels. Respecting the dependence graph, as in (b). As I did, you may then wonder where both the graph and the conditionals come from. Which reminds me of the pseudo-posteriors of Carlin and Chib (1995), even though this is not terribly connected. (But not disconnected either since the authors mention Green (1995).) But, given the graph, following a Gibbs scheme, one of the 17 nodes is chosen as generated from the target, using the proper conditional on that index [which is purely artificial from the point of view of the original simulation problem!]. As noted above, the graph is an issue, but since it is artificial, it can be devised to either allow for quasi-independence between the proposed values or on the opposite to induce long range dependence, which corresponds to conducting multiple MCMC steps before reaching the end nodes, a feature that is very appealing in my opinion. And reminds me of prefetching. (As I am listening to Nicolas Chopin’s lecture in Warwick at the moment, there also seems to be a connection with pMCMC.) Still, I remain unclear as to the devising of the graph of dependent proposals, as its depth should be somehow connected with the mixing properties of the original proposal. Gains in convergence may thus come at a high cost at the construction stage.
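For readers unfamiliar with the baseline, here is a minimal sketch of the basic multiple-try Metropolis algorithm of Liu, Liang and Wong, which is *not* the graph-based scheme of Luo and Tjelmeland. It assumes a standard Normal target, symmetric Gaussian random-walk proposals and weights equal to the target density, and the function name `mtm` and its tuning values are mine.

```r
set.seed(1)
target <- function(x) dnorm(x)   # illustrative target: standard Normal density

# basic multiple-try Metropolis with k symmetric random-walk proposals
mtm <- function(niter = 5000, k = 5, scale = 2, x0 = 0){
  x <- numeric(niter)
  cur <- x0
  for (t in 1:niter){
    y <- rnorm(k, cur, scale)                  # k proposals around the current value
    wy <- target(y)                            # weights of the proposals
    pick <- y[sample.int(k, 1, prob = wy)]     # select one proportionally to its weight
    ref <- c(rnorm(k - 1, pick, scale), cur)   # reference set drawn from the selected value
    # generalised acceptance ratio: sum of proposal weights over sum of reference weights
    if (runif(1) < sum(wy) / sum(target(ref))) cur <- pick
    x[t] <- cur
  }
  x
}

chain <- mtm()
print(c(mean = mean(chain), var = var(chain)))
```

In this baseline all k proposals are conditionally independent given the current value, which is the structure the dependence graph is meant to generalise.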

## troubling trends in machine learning

Posted in Books, pictures, Running, Statistics, University life with tags academic research, arXiv, Coventry, Crayfield Grange, ICML, Kenilworth, machine learning, mathiness, NIPS, PCI Evol Biol, proceedings, sunrise, University of Warwick, Warwickshire on July 25, 2018 by xi'an

**T**his morning, in Coventry, while having an n-th cup of tea after a very early morning run (light comes early at this time of the year!), I spotted an intriguing title in the arXivals of the day, by Zachary Lipton and Jacob Steinhardt. Addressing the academic shortcomings of machine learning papers. While I first thought little of the attempt to address poor scholarship in the machine learning literature, I read it with growing interest and, although I am pessimistic about the chances of inverting the trend, considering the relentless pace and massive production of the community, I consider the exercise worth conducting, if only to launch a debate on the excesses found in the literature.

“…desirable characteristics: (i) provide intuition to aid the reader’s understanding, but clearly distinguish it from stronger conclusions supported by evidence; (ii) describe empirical investigations that consider and rule out alternative hypotheses; (iii) make clear the relationship between theoretical analysis and intuitive or empirical claims; and (iv) use language to empower the reader, choosing terminology to avoid misleading or unproven connotations, collisions with other definitions, or conflation with other related but distinct concepts”

The points made by the authors are (p.1)

1. *Failure to distinguish between explanation and speculation*
2. *Failure to identify the sources of empirical gains*
3. *Mathiness*
4. *Misuse of language*

Again, I had misgivings about point 3., but this is not an anti-maths argument, rather about the recourse to vaguely connected or oversold mathematical results as a way to support a method.

Most interestingly (and living dangerously!), the authors select specific papers to illustrate their point, picking from well-established authors *and from their own papers*, rather than from junior authors. And also include counter-examples of papers going the(ir) right way. Among the recommendations for emerging from the morass of poor scholarship papers, they suggest favouring critical writing and retrospective surveys (provided authors can be found for these!). And mention open reviews before I can mention these myself. One would think that published anonymous reviews are a step in the right direction; I would actually say that this should be the norm (plus or minus anonymity) for all journals or successors of journals (PCIs coming strongly to mind). But requiring more work from the referees implies rewards for said referees, as done in some biology and hydrology journals I refereed for (and PCIs of course).