## learning base R [book review]

Posted in Books, Kids, Statistics, University life on February 26, 2022 by xi'an

This second edition of an introductory R book was sent to me by the author for a potential CHANCE book review.  As there are many (many) books in the same spirit, the main question behind my reading it (in one go) was on the novelty it brings. The topics Learning Base R covers are

• arithmetic with R
• data structures
• built-in and user-written R functions
• R utilities
• more data structures
• comparison and coercion
• lists and data frames
• resident R datasets
• R interface
• probability calculations in R
• R graphics
• R programming
• simulations
• statistical inference in R
• linear algebra
• use of R packages

within as many short chapters. The style is rather standard, that is, short paragraphs with mostly raw reproductions of command lines and their outcome, sometimes a whole page of code examples (albeit with comments). All in all I feel there are rather too few tables when compared with examples, at least for my own taste. The exercises are mostly short and, while they vary in depth, they show that the book is rather intended for students with some mathematical background (e.g., with a chapter on complex numbers and another one on linear algebra that do not seem immediately relevant for most intended readers). Or even more than that, considering one (of several) exercises (19.30) on the Black-Scholes process that mentions Brownian motion. Possibly less appealing for would-be statisticians.

I also wonder at the pedagogical choice of not including and involving more clearly graphical interfaces like RStudio, as students are usually not big fans of “old-style” [their wording, not mine!] command-line languages. For instance, the chapter on packages would have benefited from this perspective. Nothing on R Markdown either. Apparently nothing on handling big data, more advanced database manipulation, the related realistic dangers of memory freeze and compulsory reboot, or the intricacies of managing different directories and earlier sessions; little on the urgency of avoiding loops (p.233) through vectorised programming; an if function paradoxically introduced after ifelse; and again not that much on statistics (with density only occurring in exercises). The chapter on customising R graphics may possibly scare the intended reader when considering the all-in-one example of p.193! As we advance through the book, the more advanced examples often are fairly standard programming ones (found in other language manuals), like creating Fibonacci numbers, implementing Eratosthenes' sieve, or playing the Tower of Hanoi game… (At least they remind me of examples read in the language manuals I read as a student.) The simulation chapter could have been merged into the one (Chap. 19) on probability calculations, rather than superfluously redefining standard distributions. (Except when defining a random number as a uniformly random number (p.162).) This chapter also spends an unusual amount of space on linear congruential pseudo-random generators, while failing to point out the trivia that the randu dataset mentioned twice earlier is actually an outcome of the infamous RANDU Fortran generator. The following section in that chapter is written in such a way that it may give the wrong impression that one can find the analytic solution from repeated Monte Carlo experiments, and hence the error. Which is rarely the case, even in finite environments with rational expectations, as one usually does not know of which unit fraction the expectation should be a multiple. (Remember the Squid Game paradox!) And no mention is made of the prescription of always returning an error estimate along with the numerical approximation.

The statistics chapter is obviously more developed, with descriptive statistics and the ecdf, but no bootstrap, a t.test curiously applied to the Michelson measurements of the speed of light (how could it be zero?!), ANOVA, regression handled via lm and glm, and time series analysis by ARIMA models, which I hope will not be the sole exposure of readers to these concepts.
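To make that last prescription concrete, here is a minimal sketch in R, with my own toy integrand rather than one from the book, which also illustrates the vectorised style that replaces explicit loops:

```r
# Monte Carlo approximation of E[exp(U)], U ~ U(0,1), with exact value e - 1;
# the point is to report a standard error along with the estimate,
# using vectorised code instead of a for loop
set.seed(123)
n <- 1e5
u <- runif(n)
h <- exp(u)                # vectorised evaluation of the integrand
est <- mean(h)
se <- sd(h) / sqrt(n)      # the error estimate to return with the approximation
c(estimate = est, std.error = se)
```

With the standard error at hand, one can tell whether the approximation is compatible with a conjectured value, instead of mistaking Monte Carlo output for an analytic solution.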

In conclusion, there is nothing critically wrong with this manual introducing R to newcomers and I would not mind having my undergraduate students reading it (rather than our shorter and home-made handout, polished over the years) before my first mathematical statistics lab. However I do not find it massively innovative in its presentation or choice of concepts, even though the most advanced examples are not necessarily standard, and it may not appeal to all categories of students.

[Disclaimer about potential self-plagiarism: this post or an edited version will eventually appear in my Book Review section in CHANCE.]

## complex Cauchys

Posted in Books, pictures, Statistics, Travel, University life on February 8, 2018 by xi'an

During a visit of Don Fraser and Nancy Reid to Paris-Dauphine, where Nancy gave a nice introduction to confidence distributions, Don pointed out to me a 1992 paper by Peter McCullagh on the Cauchy distribution, which came timely after my recent foray into the estimation of the Cauchy location parameter. Among several most interesting aspects of the Cauchy, Peter re-expressed the density of a Cauchy C(θ¹,θ²) as

f(x;θ¹,θ²) = |θ²| / π|x-θ|²

when θ=θ¹+ιθ² [a complex number in the upper half-plane]. Denoting the Cauchy C(θ¹,θ²) as C(θ), the property that the ratio (aX+b)/(cX+d) follows a Cauchy distribution for all real numbers a, b, c, d,

C((aθ+b)/(cθ+d))

[when X is C(θ)] follows rather readily. But then comes the remark that

“those properties follow immediately from the definition of the Cauchy as the ratio of two correlated normals with zero mean.”

which seems to relate to the conjecture solved by Natesh Pillai and Xiao-Li Meng a few years ago. But the fact that a ratio of two correlated centred Normals is Cauchy has actually been known at least since the 1930's, as shown by Geary (1930, JRSS) and Fieller (1932, Biometrika).
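Both properties are easily checked by simulation; the sketch below is mine, with arbitrary numerical values, and verifies the medians implied by the complex parametrisation:

```r
# if X ~ C(theta) with theta = theta1 + i*theta2, then (aX+b)/(cX+d) is
# Cauchy with parameter (a*theta+b)/(c*theta+d); the median is the real part
set.seed(42)
theta <- 1 + 2i
x <- rcauchy(1e6, location = Re(theta), scale = Im(theta))
a <- 2; b <- 1; cc <- 1; dd <- 3
y <- (a * x + b) / (cc * x + dd)
theta.new <- (a * theta + b) / (cc * theta + dd)  # complex arithmetic in R
median(y) - Re(theta.new)                         # close to zero

# and the ratio of two correlated centred Normals is Cauchy,
# with location rho when both variances are equal to one
rho <- 0.5
z1 <- rnorm(1e6)
z2 <- rho * z1 + sqrt(1 - rho^2) * rnorm(1e6)
median(z1 / z2)                                   # close to rho
```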

## Bayes at the Bac’ [again]

Posted in Kids, Statistics on June 19, 2014 by xi'an

When my son took the mathematics exam of the baccalauréat a few years ago, the probability problem was a straightforward application of Bayes' theorem. (A problem that was later cancelled due to a minor leak…) Surprise, surprise, Bayes is back this year for my daughter's exam. Once again, the topic is a pharmaceutical lab with a test, a test with different positive rates on two populations (healthy vs. sick), and the very basic question is to derive the probability that a person is sick given that the test is positive. Then comes a (predictable) application of the CLT-based confidence interval for a binomial proportion, and the derivation of a normal confidence interval, once again compounded by a CLT-based confidence interval for a binomial proportion… Fairly straightforward, with no combinatoric difficulty.
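For the record, the CLT-based interval for a binomial proportion amounts to a few lines of R; the counts below are made up for illustration, not taken from the exam:

```r
# normal (CLT-based) confidence interval for a binomial proportion p,
# with hypothetical data: x successes out of n trials
x <- 88; n <- 400
phat <- x / n
se <- sqrt(phat * (1 - phat) / n)
ci <- phat + c(-1, 1) * qnorm(0.975) * se
ci  # approximate 95% confidence interval for p
```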

The other problems were on (a) a sequence defined by the integral

$\int_0^1 (x+e^{-nx})\text{d}x$

(b) solving the equation

$z^4+4z^2+16=0$

in the complex plane and (c) Cartesian 2-D and 3-D geometry, again avoiding abstruse geometric questions… A rather conventional exam from my biased perspective.
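Neither of the first two questions resists a few lines of R (my own check, obviously not part of the exam): the integral has the closed form $1/2+(1-e^{-n})/n$, and the quartic can be handed to polyroot:

```r
# (a) the integral defining the sequence equals 1/2 + (1 - exp(-n))/n
n <- 5
num <- integrate(function(x) x + exp(-n * x), 0, 1)$value
num - (0.5 + (1 - exp(-n)) / n)   # numerically zero

# (b) the four complex roots of z^4 + 4z^2 + 16 = 0,
# with coefficients passed to polyroot in increasing degree
roots <- polyroot(c(16, 0, 4, 0, 1))
Mod(roots)                        # all four roots lie on the circle of radius 2
```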

## Bayes at the Bac’ [and out!]

Posted in Kids, Statistics on June 24, 2011 by xi'an

In the mathematics exam of the baccalauréat my son (and 160,000 other students) took on Tuesday, the probability problem was a straightforward application of Bayes' theorem. Given a viral test with 99% positives for infected patients and 97% negatives for non-infected patients, in a population with 2% of infected patients, what is the probability that the patient is infected given that the test is positive? (It looks like another avatar of Exercise 1.7 in The Bayesian Choice!) A lucky occurrence, given that I had explained Bayes' formula to my son earlier this year (neither the math book nor the math teacher mentioned Bayes, incidentally!) and even more so given that, in a crash revision Jean-Michel Marin gave him the evening before, they went over it once again! The other problems were a straightforward multiple choice about complex numbers (with one mistake!), some calculus around the functional sequence $x^ne^{-x}$, and some arithmetic questions around Gauss's and Bezout's theorems. A few hours after I wrote the above, the (official) news came that this question had been posted on the web prior to the exam and thus that it would be cancelled from the exam by the Ministry for Education! The grade will then be computed on the other problems, which is rather unfair to the students. (On the side, the press release from the Ministry contains a highly specious argument that regulation allows for three to five exercises in the exam, hence that there is nothing wrong with reducing the number of exercises to three!) Not so lucky an occurrence, then, and I dearly hope this will not drastically impact my son's result! (Most likely, the grading will be more tolerant and students will not unduly suffer from the action of a very few…)
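The cancelled question takes one line of R, and shows why the answer surprises students: despite the apparently accurate test, a positive result leaves the patient more likely non-infected than infected:

```r
# P(infected | positive) by Bayes' theorem, with the exam's numbers:
# 99% positives among infected, 97% negatives among non-infected, 2% infected
prev <- 0.02
sens <- 0.99          # P(positive | infected)
spec <- 0.97          # P(negative | non-infected)
post <- sens * prev / (sens * prev + (1 - spec) * (1 - prev))
post                  # about 0.40, less than an even chance of infection
```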

## Typos in Chapters 1, 4 & 8

Posted in Books, R, Statistics on February 10, 2010 by xi'an

Thomas Clerc from Fribourg pointed out an embarrassing typo in Chapter 8 of “Introducing Monte Carlo Methods with R”, namely that I defined on page 247 the complex number $\iota$ as the square root of 1 and not of -1! Not that this impacts much on the remainder of the book, but still an embarrassment!!!
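R itself, at least, gets the definition right:

```r
# the imaginary unit in R squares to -1, as iota should in the book
(1i)^2              # -1+0i
sqrt(-1 + 0i)       # 0+1i, once -1 is declared complex
```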

An inconsistent notation was uncovered by Bastien Boussau, from Berkeley this time, for the book The Bayesian Choice. In Example 1.1.3, on page 3, I consider a hypergeometric $\mathcal{H}(30,N,20/N)$ distribution, while in Appendix A, I denote hypergeometric distributions as $\mathcal{H}(N;n;p)$, inverting the roles of the population size and of the sample size. Sorry about that, inconsistencies in notation alas occur in my books… In case I have not mentioned it so far, Example 4.3.3 further involves a typo (detected by Cristiano Passerini from Pontecchio Marconi), again with the hypergeometric distribution $\mathcal{H}(N;n;p)$! The ratio should be

$\dfrac{{n_1\choose n_{11}} {n-n_1\choose n_2-n_{11}}\big/ {n\choose n_2}\pi(N=n)}{\sum_{k=36}^{50} {n_1\choose n_{11}} {k-n_1\choose n_2-n_{11}}\big/ {k\choose n_2}\pi(N=k)}$
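The ratio is easily implemented in R; in the sketch below, the capture-recapture counts n₁, n₂, n₁₁ and the uniform prior on {36,…,50} are hypothetical stand-ins, not the actual values of Example 4.3.3:

```r
# posterior distribution of the population size N in a capture-recapture
# model, following the corrected ratio; the counts here are hypothetical
n1 <- 20; n2 <- 25; n11 <- 12     # first catch, second catch, recaptures
support <- 36:50                   # support of the prior on N
prior <- rep(1 / length(support), length(support))  # uniform prior
lik <- choose(n1, n11) * choose(support - n1, n2 - n11) / choose(support, n2)
post <- lik * prior / sum(lik * prior)
sum(post)  # a proper posterior, summing to one
```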