## a computational approach to statistical learning [book review]

Posted in Books, R, Statistics, University life on April 15, 2020 by xi'an

This book was sent to me by CRC Press for review for CHANCE. I read it over a few mornings while [confined] at home and found it much more computational than statistical, in the sense that the authors go quite thoroughly into the construction of standard learning procedures, including home-made R codes that obviously help in understanding the nitty-gritty of these procedures (what they call try and tell), while the statistical meaning and uncertainty of these procedures remain barely touched by the book. This is not uncommon in the machine-learning literature, where prediction error on the testing data often appears to be the final goal, but it is not so traditionally statistical. The authors introduce their work as a (computational?) supplement to Elements of Statistical Learning, although I would find it hard to either squeeze both books into one semester or dedicate two semesters to the topic, especially at the undergraduate level.

Each chapter includes an extended analysis of a specific dataset and this is an asset of the book, if sometimes over-reaching in selling the predictive power of the procedures. Printed extensive R scripts may prove tiresome in the long run, at least to me, but this may simply be a generational gap! And the learning models are mostly unidimensional, see e.g. the chapter on linear smoothers with imho a profusion of methods. (Could someone please explain the point of Figure 4.9 to me?) The chapter on neural networks has a fairly intuitive introduction that should reach fresh readers, although meeting the handwritten digit data made me shift back to the late 1980’s, when my wife was working on automatic character recognition. But I found the visualisation of the learning weights for character classification hinting at their shape (p.254) most alluring!

Among the things I am missing when reading through this book: a life-line on the meaning of a statistical model beyond prediction; attention to misspecification, uncertainty and variability, especially when reaching outside the range of the learning data, and further especially when returning regression outputs with significance stars; discussions on the assessment tools like the distance used in the objective function (for instance lacking in scale invariance when adding errors on the regression coefficients) or on the unprincipled multiplication of calibration parameters; some asymptotics, with at least one remark on the information loss due to splitting the data into chunks and some (asymptotic) substance when using “consistent”; and an earlier mention of the “data quality issues”, which have to wait for a single page 319. While the methodology is defended by algebraic and calculus arguments, there is very little on the probability side, which explains why the authors consider that the students need “be familiar with the concepts of expectation, bias and variance”. And only that. A few paragraphs on the Bayesian approach are doing more harm than good, especially with so little background in probability and statistics.

The book possibly contains the most unusual introduction to the linear model I can remember reading: Coefficients as derivatives… Followed by a very detailed coverage of matrix inversion and singular value decomposition. (Would not sound like the #1 priority were I to give such a course.)

The inevitable typo “the the” was found on page 37! A less common typo was Jensen’s inequality spelled as “Jenson’s inequality”, both in the text (p.157) and in the index, followed by a repetition of the same formula in (6.8) and (6.9). A “stwart” (p.179) that made me search a while for this unknown verb. Another typo in the Nadaraya-Watson kernel regression, when the bandwidth h suddenly turns into n (and I had to check twice because of my poor eyesight!). An unusual use of partition where the sets in the partition are called partitions themselves. Similarly, a fluctuating use of dots for products in dimension one, including a form of ⊗ for matricial product (in equation (8.25)) followed on the next page by the notation for the Hadamard product. (I also suspect the matrix K in (8.68) is missing 1’s, or I am missing the point, since K is the number of kernels on the next page, just after a picture of the Eiffel Tower…) A surprising number of references for an undergraduate textbook, with authors sometimes cited with full name and sometimes cited with last name. And technical reports that do not belong to this level of books. Let me add the pedantic remark that Conan Doyle wrote more novels “that do not include his character Sherlock Holmes” than novels which do include Sherlock.

[Disclaimer about potential self-plagiarism: this post or an edited version will eventually appear in my Books Review section in CHANCE.]

## nonparametric Bayesian clay for robust decision bricks

Posted in Statistics on January 30, 2017 by xi'an

Just received an email today that our discussion with Judith of Chris Holmes and James Watson’s paper was now published as Statistical Science 2016, Vol. 31, No. 4, 506-510… While it is almost identical to the arXiv version, it can be read on-line.

## comments on Watson and Holmes

Posted in Books, pictures, Statistics, Travel on April 1, 2016 by xi'an

“The world is full of obvious things which nobody by any chance ever observes.” The Hound of the Baskervilles

In connection with the upcoming publication of James Watson’s and Chris Holmes’ Approximating models and robust decisions in Statistical Science, Judith Rousseau and I wrote a discussion on the paper that was arXived yesterday.

“Overall, we consider that the calibration of the Kullback-Leibler divergence remains an open problem.” (p.18)

While the paper connects with earlier ones by Chris and coauthors, and possibly despite the overall critical tone of the comments!, I really appreciate the renewed interest in robustness advocated in this paper. I was going to write Bayesian robustness but to differ from the perspective adopted in the 90’s where robustness was mostly about the prior, I would say this is rather a Bayesian approach to model robustness from a decisional perspective. With definitive innovations like considering the impact of posterior uncertainty over the decision space, uncertainty being defined e.g. in terms of Kullback-Leibler neighbourhoods. Or with a Dirichlet process distribution on the posterior. This may step out of the standard Bayesian approach but it remains of definite interest! (And note that this discussion of ours [reluctantly!] refrained from capitalising on the names of the authors to build easy puns linked with the most Bayesian of all detectives!)

## Sherlock [#3]

Posted in Books on March 14, 2015 by xi'an

After watching the first two seasons of the BBC TV Series Sherlock while at the hospital, I found myself looking forward to further adventures of Holmes and Watson and eventually “bought” the third season. And watched it over the past weekends. I liked it very much as this new season distanced itself from the sheer depiction of Sherlock’s amazing powers to a quite ironic and self-parodic story, well in tune with a third season where the audience is now utterly familiar with the main characters. They all put on weight (mostly figuratively!), from Sherlock’s acknowledgement of his psychological shortcomings, to Mrs. Hudson’s revealing her drug trafficking past and expressing her dislike of Mycroft, to John Watson’s engagement and acceptance of Sherlock’s idiosyncrasies, making him the central character of the series in a sort of fatherly figure. Some new characters are also terrific, including Mary Morstan and the new archvillain, C.A. Magnussen. Paradoxically, this makes the detective part of the stories secondary, which is all for the best as, in my opinion, the plots are rather weak and the resolutions hardly rely on high intellectual powers, albeit always surprising. More sleuthing in the new season would be most welcome! As an aside, the wedding place sounded somewhat familiar to me, until I realised it was Goldney Hall, where the recent workshops I attended in Bristol took place.

## Unusual timing shows how random mass murder can be (or even less)

Posted in Books, R, Statistics, Travel on November 29, 2013 by xi'an

This post follows the original one on the headline of the USA Today I read during my flight to Toronto last month. I remind you that the unusual pattern was about observing four U.S. mass murders happening within four days, “for the first time in at least seven years”. Which means that the difference between the four dates is at most 3, not 4!

I asked my friend Anirban Das Gupta from Purdue University about the exact value of this probability and the first thing he pointed out was that I used a different meaning of “within 4”. He then went into an elaborate calculation to find an upper bound on this probability, an upper bound that was way above my Monte Carlo approximation and my rough calculation of last post. I rechecked my R code and found it was not achieving the right approximation since it required one date to be within 3 days of three other days, at least… I thus rewrote the following R code

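T=10^6 #number of Monte Carlo replications
four=rep(0,T)
for (t in 1:T){
  day=sort(sample(1:365,30,rep=TRUE)) #30 random days
  day=c(day,day[day>362]-365) #account for toric difference: days 363-365 are within 3 days of day 1
  tem=outer(day,day,"-")
  #does some date close a run of at least four dates spanning at most 3 days?
  four[t]=max(apply((tem>-1)&(tem<4),1,sum))>3
}
mean(four)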


[checked it was ok for two dates within 1 day, resulting in the birthday problem probability] and found 0.070214, which is much larger than the earlier value and shows it takes an average of 14 years for the “unlikely” event to happen! And the chance that it happens within seven years is about 40%.
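The two figures quoted above can be recovered with a few lines of R. This is only a sketch: it assumes that years behave independently, so that the waiting time is geometric, and the names `p_hat`, `wait_years`, `within_seven` and `birthday_30` are mine, not part of the original code. The last line gives the classical birthday-problem value for 30 dates, which is the reference a check on exact coincidences should approach.

```r
# Monte Carlo estimate reported in the post (taken as given)
p_hat <- 0.070214

# assuming independent years, the waiting time is geometric with mean 1/p
wait_years <- 1 / p_hat

# probability of at least one occurrence over seven independent years
within_seven <- 1 - (1 - p_hat)^7

# exact probability that at least two of 30 uniform dates coincide
birthday_30 <- 1 - prod((365 - 0:29) / 365)

round(c(wait_years, within_seven, birthday_30), 3)
```

The first two entries come out at roughly 14 years and 0.40, in line with the text.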

Another coincidence relates to this evaluation, namely the fact that two elderly couples in France committed couple suicide within three days, last week. I however could not find the figures for the number of couple suicides per year. Maybe because it is extremely rare. Or undetected…

## Unusual timing shows how random mass murder can be (or not)

Posted in Books, R, Statistics, Travel on November 4, 2013 by xi'an

This was one headline in the USA Today I picked from the hotel lobby on my way to Pittsburgh airport and then Toronto this morning. The unusual pattern was about observing four U.S. mass murders happening within four days, “for the first time in at least seven years”. The article did not explain why this was unusual. And reported one mass murder expert’s opinion instead of a statistician’s…

Now, there are about 30 mass murders in the U.S. each year (!), so the probability of finding at least four of those 30 events within 4 days of one another should be related to von Mises’ birthday problem. For instance, Abramson and Moser derived in 1970 that the probability that at least two people (among n) have birthdays within k days of one another (for an m-day year) is

$p(n,k,m) = 1 - \dfrac{(m-nk-1)!}{m^{n-1}(m-nk-n)!}$

but I did not find an extension to the case of the four (to borrow from Conan Doyle!)… A quick approximation would be to turn the problem into a birthday problem with 364/4=91 days and count the probability that four share the same birthday

${30 \choose 4} \frac{90^{26}}{91^{29}}=0.0273$

which is surprisingly large. So I checked with a R code in the plane:

T=10^5
four=rep(0,T)
for (t in 1:T){
day=sample(1:365,30,rep=TRUE)
four[t]=(max(apply((abs(outer(day,day,"-"))<4),1,sum))>4)}
mean(four)


and found 0.0278, which means the above approximation is far from terrible! I think it may actually be “exact” in the sense that observing exactly four murders within four days of one another is given by this probability. The cases of five, six, &tc. murders are omitted but they are also highly negligible. And from this number, we can see that there is an 18% probability that the case of the four occurs within seven years. Not so unlikely, then.
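For readers who want to replay the arithmetic of this post, here is a minimal R sketch. The function name `near_birthday` and the variable names are hypothetical, the factorials are evaluated on the log scale through `lgamma` to avoid overflow, and the seven-year figure assumes independent years and takes the simulated 0.0278 at face value.

```r
# Abramson-Moser probability that at least two of n birthdays fall within
# k days of one another in an m-day year (requires m - n*k - n >= 0);
# (m-nk-1)! = gamma(m-nk) and (m-nk-n)! = gamma(m-nk-n+1)
near_birthday <- function(n, k, m = 365) {
  log_none <- lgamma(m - n * k) - (n - 1) * log(m) - lgamma(m - n * k - n + 1)
  1 - exp(log_none)
}

# k = 0 gives back the classical birthday problem, about 0.507 for n = 23
classic_23 <- near_birthday(23, 0)

# crude approximation of the post: a 91-"day" year with four shared birthdays
approx_four <- choose(30, 4) * 90^26 / 91^29

# chance of at least one occurrence in seven years, from the simulated 0.0278
seven_years <- 1 - (1 - 0.0278)^7

round(c(classic_23, approx_four, seven_years), 4)
```

The last two entries are close to the 0.0273 and 18% quoted in the text.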