The foundations of Statistics: a simulation-based approach

“We have seen that a perfect correlation is perfectly linear, so an imperfect correlation will be ‘imperfectly linear’.” page 128

This book was written by two linguists, Shravan Vasishth and Michael Broe, in order to teach statistics “in areas that are traditionally not mathematically demanding” at a deeper level than traditional textbooks “without using too much mathematics”, towards building “the confidence necessary for carrying more sophisticated analyses” through R simulation. This is a praiseworthy goal, one that should have produced a great book. However, and most sadly, I find that the book does not live up to expectations. As with the books in Radford Neal’s recent coverage of introductory probability books with R, it contains statements that show a deep misunderstanding of the topic… (This post has also been published on the Statistics Forum.)

“The least that you need to know about is LaTeX, Emacs, and Emacs Speaks Statistics. Other tools that will further enhance your working experience with LaTeX are AucTeX, RefTeX, preview-latex, and Python.” page 1

The above recommendation is cool (along with the point that these tools are “already pre-installed in Linux”, while, for Windows or Macintosh, users “will need to read the manual”!) but ultimately rather daunting given the intended audience. While I use LaTeX and only LaTeX in my everyday work, the recommendation to learn LaTeX before one can “understand the principles behind inferential statistics” sounds inappropriate. The book clearly does not require an understanding of LaTeX to be read, understood, and practiced. (The same goes for Python!)

The authors advertise a blog about the book that contains very little information. (The last entry is from December 2010: “The book is out”.) This would have been a neat idea, had it been implemented.

“Let us convince ourselves of the observation that the sum of the deviations from the mean always equals zero.” page 5

What I dislike the most about this book is the waste of space dedicated to expository developments that aim at bypassing mathematical formulae, only to provide the very same formula at the end of the argument. Then there is the permanent confusion between the distribution and the sample, between the true parameters and their estimates. (Plus the many foundational mistakes, such as those reported below.) If a reader has had some earlier exposure to statistics, the style and pace are likely to unsettle or infuriate her. If not, she will be left with gaping holes in her statistical foundations: no proper definition of unbiasedness (hence a murky justification of the degrees of freedom whenever they appear), of the Central Limit theorem, or of the t distribution, and no mention of the Law of Large Numbers (although a connection is made in the summary, page 63). This does not seem sufficient preparation for reading Gelman and Hill (2007), as suggested at the end of the book… Having the normal density defined as the “somewhat intimidating-looking function” (page 39)

f(x) = \dfrac{1}{\sigma\sqrt{2\pi}}\,e^{-(x-\mu)^2/(2\sigma^2)}

certainly does not help! (Nor does the call to integrate, rather than pnorm, to compute normal tail probabilities on pages 69-70.)
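To make the integrate-versus-pnorm point concrete, here is a minimal sketch in Python rather than R (my choice, not the book's). It assumes a standard normal and the cut-off 1.96, both picked only for illustration, and the function names are hypothetical. The closed-form route goes through the complementary error function, which is the kind of computation a pnorm-style routine performs, while the second route integrates the density numerically over a truncated range.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Normal density with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def upper_tail_closed_form(x, mu=0.0, sigma=1.0):
    """P(X > x) through the complementary error function."""
    return 0.5 * math.erfc((x - mu) / (sigma * math.sqrt(2.0)))

def upper_tail_numeric(x, mu=0.0, sigma=1.0, width=10.0, steps=20000):
    """P(X > x) by composite Simpson integration of the density.

    Assumption: the infinite upper limit is truncated at x + width * sigma,
    and steps is an even number.
    """
    a, b = x, x + width * sigma
    h = (b - a) / steps
    total = normal_pdf(a, mu, sigma) + normal_pdf(b, mu, sigma)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * normal_pdf(a + i * h, mu, sigma)
    return total * h / 3.0

closed = upper_tail_closed_form(1.96)
numeric = upper_tail_numeric(1.96)
print(closed, numeric)
```

Both routes agree to many decimal places, but the numerical one requires choosing a truncation point and a grid, which is precisely the detour a dedicated distribution function spares the reader.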

“The key idea for inferential statistics is as follows: If we know what a ‘random’ distribution looks like, we can tell random variation from non-random variation.” page 9

The above quote gives a rather obscure and confusing entry to statistical inference, especially as it appears at the beginning of a chapter (Chapter 2) centred on the binomial distribution. As the authors seem reluctant to introduce the binomial probability function from the start, they resort to an intuitive discourse based on (rather repetitive) graphs, with an additional potential confusion induced by the choice of a binomial probability of p=0.5, since p^k(1-p)^(n-k) is then constant in k. In Section 2.3, the distinction between binomial and hypergeometric sampling is not mentioned, i.e. the binomial approximation is used without any warning that it is an approximation. The fact that the mean of the binomial distribution B(n,p) is np is not established, and the variance np(1-p) is not stated (except in the appendix). (However, the book spends four pages [36-39] showing through an R experiment that “the sum of squared deviations from the mean are [sic!] smaller than from any other number”.)
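The two binomial facts mentioned above can be checked in a few lines. The following is only a sketch in Python (not the book's R), with n=10 and p=0.3 chosen arbitrarily and hypothetical function names: it computes the mean and variance directly from the probability function and compares them with np and np(1-p), then shows that for p=0.5 the factor p^k(1-p)^(n-k) takes a single value whatever k.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def binomial_moments(n, p):
    """Mean and variance computed by summing over the probability function."""
    pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]
    mean = sum(k * w for k, w in enumerate(pmf))
    var = sum((k - mean) ** 2 * w for k, w in enumerate(pmf))
    return mean, var

n, p = 10, 0.3  # illustrative values, not taken from the book
mean, var = binomial_moments(n, p)
print(mean, n * p)            # both should agree up to rounding
print(var, n * p * (1 - p))   # both should agree up to rounding

# With p = 0.5 the factor p^k (1-p)^(n-k) equals 0.5^n for every k,
# so only the binomial coefficient varies with k.
weights = {0.5**k * 0.5 ** (n - k) for k in range(n + 1)}
print(weights)
```

This is why graphs drawn at p=0.5 hide the role of the probability factor: the shape of the distribution is then driven by the binomial coefficients alone.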

“The mean of a sample is more likely to be close to the population mean than not.” page 49

The above is the concluding summary about the Central Limit theorem, after a histogram with 8 bins showing that “the distribution of the means is normal!”… It is then followed by a section on “s is an Unbiased Estimator of σ”, nothing less!!! This completely false result (s being the usual sample standard deviation: it is s², not s, that is unbiased, and for σ² rather than σ) is again based on the “fact” that it is “more likely than not to get close to the right value”. The introduction of the t distribution is motivated by the “fact that the sampling distribution of the sample mean is no longer be modeled by the normal distribution” (page 55). With such flaws in the presentation, it is difficult to recommend the book at any level. Especially the most introductory level.
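Since the book favours simulation, the bias of s can be exhibited in the same spirit. The sketch below is in Python rather than R and rests on assumptions of my own: normal samples of size n=5 with σ=1, a fixed seed, and hypothetical function names. For normal samples the expectation of s is c4(n)σ, with c4(n) = sqrt(2/(n-1)) Γ(n/2)/Γ((n-1)/2) < 1, while s² averages to σ².

```python
import math
import random

def c4(n):
    """Factor such that E[s] = c4(n) * sigma for i.i.d. normal samples of size n."""
    return math.sqrt(2.0 / (n - 1)) * math.gamma(n / 2) / math.gamma((n - 1) / 2)

def simulate_s(n, sigma=1.0, reps=20000, seed=1):
    """Monte Carlo averages of s and s^2 over reps normal samples of size n."""
    rng = random.Random(seed)
    sum_s, sum_s2 = 0.0, 0.0
    for _ in range(reps):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        xbar = sum(sample) / n
        s2 = sum((x - xbar) ** 2 for x in sample) / (n - 1)  # n-1 divisor
        sum_s2 += s2
        sum_s += math.sqrt(s2)
    return sum_s / reps, sum_s2 / reps

mean_s, mean_s2 = simulate_s(5)
print(mean_s, c4(5))   # average of s sits below sigma = 1, near c4(5)
print(mean_s2)         # average of s^2 sits near sigma^2 = 1
```

The downward bias of s follows from Jensen's inequality applied to the square root, and it vanishes only as n grows, which is a different statement from being “more likely than not to get close to the right value”.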

“We know that the value is within 6 of 20, 95% of the time.” page 27

I am also dissatisfied with the way confidence and testing are handled (and not only because of my Bayesian inclinations!). The above quote, which replicates the usual fallacy about the interpretation of confidence intervals, is found a few lines away from a warning about the inversion of confidence statements! A warning only repeated later: “it’s a statement about the probability that the hypothetical confidence intervals (that would be computed from the hypothetical repeated samples) will contain the population mean” (page 59). The book spends a large number of pages on hypothesis testing, presumably because of the authors’ own interests; however, it is unclear whether a neophyte could gain enough expertise from those pages to conduct her own tests. Worse, statements like (page 75)

H_0: \bar x = \mu_0

show a deep misunderstanding of the nature of both testing and random variables. How can one test a property about the observed sample mean?! A similar confusion appears in the ANOVA chapter (e.g. (5.51) on page 112).
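Returning to the interpretation of confidence intervals discussed above, the repeated-sampling reading can be illustrated by a short simulation. This is a sketch in Python under assumptions of my own: a known σ, so that a z-interval applies, and arbitrary values μ=20, σ=3, n=25 that are not taken from the book. The population mean stays fixed throughout; it is the interval that changes from sample to sample.

```python
import math
import random
from statistics import NormalDist

def coverage(mu=20.0, sigma=3.0, n=25, reps=5000, level=0.95, seed=2):
    """Fraction of z-intervals (known sigma) that contain the fixed mean mu."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + level / 2.0)   # about 1.96 for level 0.95
    half_width = z * sigma / math.sqrt(n)
    hits = 0
    for _ in range(reps):
        xbar = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        # mu is fixed; the random object is the interval around xbar
        if xbar - half_width <= mu <= xbar + half_width:
            hits += 1
    return hits / reps

rate = coverage()
print(rate)  # close to the nominal level over repeated samples
```

The 95% is thus a property of the procedure across hypothetical samples, not a probability statement about where the parameter lies once a single interval has been computed.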

“The research goal is to find out if the treatment is effective or not; if it is not, the difference between the means should be ‘essentially’ equivalent.” page 92

The following chapters are about analysis of variance (5), linear models (6), and linear mixed models (7), all of which suffer from fatal deficiencies similar to the ones noted above. The book would have greatly benefited from a statistician’s review before being published. (I cannot judge whether or not the book belongs to a particular series.) As is, it cannot deliver the expected outcome to its readers and train them towards more sophisticated statistical analyses. As a non-expert on linguistics, I cannot judge the requirements of the field or the complexity of the statistical models it involves. However, even the most standard models and procedures should be treated with the appropriate statistical rigour. While the goals of the book were quite commendable, it seems to me it cannot endow its intended readers with the proper perspective on statistics…

7 Responses to “The foundations of Statistics: a simulation-based approach”

  1. annoporci Says:

    excellent, small typo: gaping hole

  2. […] and Michael Broe’s The Foundations of Statistics: A Simulation-Based Approach, reviewed in a post of July 12, with a reply from the first author on July 18. (We have been in touch since then towards a revised […]

  3. As you say somewhere else, the real disappointment is with Springer, to let such an “oeuvre” slip out…

    The fact that “using R” sells is backfiring now, very sadly.
    So, we see victims of R’s success, besides all the beautiful power it has brought us — says one of the founding members of the R Core team:

    • Thank you Martin for the comments (and for the involvement with R). I agree with the (surprising?) lack of professionalism of Springer-Verlag (or of their referee[s]). I do not think, however, that this reflects poorly on R. (No more than a poor book on Bayesian Statistics [no name!] reflects poorly on BUGS.) The way the authors use R is not objectionable per se; it is rather in the transition from the R output to the probabilistic interpretation that they fail, for lack of a proper background. On the contrary, I find the whole thing interesting in that, thanks to the availability of R, people with little prior training can experiment with probability concepts and statistical methods. Obviously, this only works up to a point, beyond which theory becomes a necessary step! In Dauphine, we start the statistics courses with an R class and the students react rather well to this experimental introduction, in that they can later check more theoretical notions against their computational intuitions…

  4. […] Xi’an reviews a statistics book for students with little mathematical background (in short: I don’t think he liked it very much). You can read it also here, with a response of one of the authors. […]

  5. […] Vasishth has written a response to my review both published on the Statistics Forum. His response is quite straightforward and honest. In […]
