## a computational approach to statistical learning [book review]

Posted in Books, R, Statistics, University life with tags , , , , , , , , , , , , , , , , on April 15, 2020 by xi'an

This book was sent to me by CRC Press for review for CHANCE. I read it over a few mornings while [confined] at home and found it much more computational than statistical. In the sense that the authors go quite thoroughly into the construction of standard learning procedures, including home-made R codes that obviously help in understanding the nitty-gritty of these procedures, what they call try and tell, but that the statistical meaning and uncertainty of these procedures remain barely touched by the book. This is not uncommon to the machine-learning literature where prediction error on the testing data often appears to be the final goal but this is not so traditionally statistical. The authors introduce their work as (a computational?) supplementary to Elements of Statistical Learning, although I would find it hard to either squeeze both books into one semester or dedicate two semesters on the topic, especially at the undergraduate level.

Each chapter includes an extended analysis of a specific dataset and this is an asset of the book. If sometimes over-reaching in selling the predictive power of the procedures. Printed extensive R scripts may prove tiresome in the long run, at least to me, but this may simply be a generational gap! And the learning models are mostly unidimensional, see eg the chapter on linear smoothers with imho a profusion of methods. (Could someone please explain the point of Figure 4.9 to me?) The chapter on neural networks has a fairly intuitive introduction that should reach fresh readers. Although meeting the handwritten digit data made me shift back to the late 1980’s, when my wife was working on automatic character recognition. But I found the visualisation of the learning weights for character classification hinting at their shape (p.254) most alluring!

Among the things I am missing when reading through this book, a life-line on the meaning of a statistical model beyond prediction, attention to misspecification, uncertainty and variability, especially when reaching outside the range of the learning data, and further especially when returning regression outputs with significance stars, discussions on the assessment tools like the distance used in the objective function (for instance lacking in scale invariance when adding errors on the regression coefficients) or the unprincipled multiplication of calibration parameters, some asymptotics, at least one remark on the information loss due to splitting the data into chunks, giving some (asymptotic) substance when using “consistent”, waiting for a single page 319 to see the “data quality issues” being mentioned. While the methodology is defended by algebraic and calculus arguments, there is very little on the probability side, which explains why the authors consider that the students need “be familiar  with the concepts of expectation, bias and variance”. And only that. A few paragraphs on the Bayesian approach are doing more harm than well, especially with so little background in probability and statistics.

The book possibly contains the most unusual introduction to the linear model I can remember reading: Coefficients as derivatives… Followed by a very detailed coverage of matrix inversion and singular value decomposition. (Would not sound like the #1 priority were I to give such a course.)

The inevitable typo “the the” was found on page 37! A less common typo was Jensen’s inequality spelled as “Jenson’s inequality”. Both in the text (p.157) and in the index, followed by a repetition of the same formula in (6.8) and (6.9). A “stwart” (p.179) that made me search a while for this unknown verb. Another typo in the Nadaraya-Watson kernel regression, when the bandwidth h suddenly turns into n (and I had to check twice because of my poor eyesight!). An unusual use of partition where the sets in the partition are called partitions themselves. Similarly, fluctuating use of dots for products in dimension one, including a form of ⊗ for matricial product (in equation (8.25)) followed next page by the notation for the Hadamard product. I also suspect the matrix K in (8.68) is missing 1’s or am missing the point, since K is the number of kernels on the next page, just after a picture of the Eiffel Tower…) A surprising number of references for an undergraduate textbook, with authors sometimes cited with full name and sometimes cited with last name. And technical reports that do not belong to this level of books. Let me add the pedant remark that Conan Doyle wrote more novels “that do not include his character Sherlock Holmes” than novels which do include Sherlock.

[Disclaimer about potential self-plagiarism: this post or an edited version will eventually appear in my Books Review section in CHANCE.]

## the worst possible proof [X’ed]

Posted in Books, Kids, Statistics, University life with tags , , , , , , on July 18, 2015 by xi'an

Another surreal experience thanks to X validated! A user of the forum recently asked for an explanation of the above proof in Lynch’s (2007) book, Introduction to Applied Bayesian Statistics and Estimation for Social Scientists. No wonder this user was puzzled: the explanation makes no sense outside the univariate case… It is hard to fathom why on Earth the author would resort to this convoluted approach to conclude about the posterior conditional distribution being a normal centred at the least square estimate and with σ²X’X as precision matrix. Presumably, he has a poor opinion of the degree of matrix algebra numeracy of his readers [and thus should abstain from establishing the result]. As it seems unrealistic to postulate that the author is himself confused about matrix algebra, given his MSc in Statistics [the footnote ² seen above after “appropriately” acknowledges that “technically we cannot divide by” the matrix, but it goes on to suggest multiplying the numerator by the matrix

$(X^\text{T}X)^{-1} (X^\text{T}X)$

which does not make sense either, unless one introduces the trace tr(.) operator, presumably out of reach for most readers]. And this part of the explanation is unnecessarily confusing in that a basic matrix manipulation leads to the result. Or even simpler, a reference to Pythagoras’  theorem.

## beware, nefarious Bayesians threaten to take over frequentism using loss functions as Trojan horses!

Posted in Books, pictures, Statistics with tags , , , , , , , , , , , , on November 12, 2012 by xi'an

“It is not a coincidence that textbooks written by Bayesian statisticians extol the virtue of the decision-theoretic perspective and then proceed to present the Bayesian approach as its natural extension.” (p.19)

“According to some Bayesians (see Robert, 2007), the risk function does represent a legitimate frequentist error because it is derived by taking expectations with respect to [the sampling density]. This argument is misleading for several reasons.” (p.18)

During my R exam, I read the recent arXiv posting by Aris Spanos on why “the decision theoretic perspective misrepresents the frequentist viewpoint”. The paper is entitled “Why the Decision Theoretic Perspective Misrepresents Frequentist Inference: ‘Nuts and Bolts’ vs. Learning from Data” and I found it at the very least puzzling…. The main theme is the one caricatured in the title of this post, namely that the decision-theoretic analysis of frequentist procedures is a trick brought by Bayesians to justify their own procedures. The fundamental argument behind this perspective is that decision theory operates in a “for all θ” referential while frequentist inference (in Spanos’ universe) is only concerned by one θ, the true value of the parameter. (Incidentally, the “nuts and bolt” refers to the only case when a decision-theoretic approach is relevant from a frequentist viewpoint, namely in factory quality control sampling.)

“The notions of a risk function and admissibility are inappropriate for frequentist inference because they do not represent legitimate error probabilities.” (p.3)

“An important dimension of frequentist inference that has not been adequately appreciated in the statistics literature concerns its objectives and underlying reasoning.” (p.10)

“The factual nature of frequentist reasoning in estimation also brings out the impertinence of the notion of admissibility stemming from its reliance on the quantifier ‘for all’.” (p.13)

One strange feature of the paper is that Aris Spanos seems to appropriate for himself the notion of frequentism, rejecting the choices made by (what I would call frequentist) pioneers like Wald, Neyman, “Lehmann and LeCam [sic]”, Stein. Apart from Fisher—and the paper is strongly grounded in neo-Fisherian revivalism—, the only frequentists seemingly finding grace in the eyes of the author are George Box, David Cox, and George Tiao. (The references are mostly to textbooks, incidentally.) Modern authors that clearly qualify as frequentists like Bickel, Donoho, Johnstone, or, to mention the French school, e.g., Birgé, Massart, Picard, Tsybakov, none of whom can be suspected of Bayesian inclinations!, do not appear either as satisfying those narrow tenets of frequentism. Furthermore, the concept of frequentist inference is never clearly defined within the paper. As in the above quote, the notion of “legitimate error probabilities” pops up repeatedly (15 times) within the whole manifesto without being explicitely defined. (The closest to a definition is found on page 17, where the significance level and the p-value are found to be legitimate.) Aris Spanos even rejects what I would call the von Mises basis of frequentism: “contrary to Bayesian claims, those error probabilities have nothing to to do with the temporal or the physical dimension of the long-run metaphor associated with repeated samples” (p.17), namely that a statistical  procedure cannot be evaluated on its long term performance… Continue reading