the Art of R Programming [guest post]

(This post is the preliminary version of a book review by Alessandra Iacobucci, to appear in CHANCE. Enjoy [both the review and the book]!)

As Rob J. Hyndman enthusiastically declares in his blog, “this is a gem of a book”. I would go even further and argue that The Art of R Programming is a whole mine of gems. The book is well constructed and has a very coherent structure.

After an introductory chapter, where the reader gets a quick overview of R basics that allows her to work through the examples in the following chapters, the rest of the book can be divided into three main parts. In the first part (Chapters 2 to 6) the reader is introduced to the main R objects and to the functions built to handle and operate on each of them. The second part (Chapters 7 to 13) focuses on general programming issues: R structures and object-oriented nature, I/O, string handling and manipulation, and graphics. Chapter 13 is entirely devoted to debugging. The third part deals with more advanced topics, such as speed of execution and performance issues (Chapter 14), mixing functions written in R and C (or Python), and parallel processing with R. Even though this last part is intended for more experienced programmers, the overall programming skills of the intended reader “may range anywhere from those of a professional software developer to ‘I took a programming course in college’.” (p.xxii).

With a fluent style, Matloff is able to deal with a large number of topics in a relatively limited number of pages, resulting in an astonishingly complete yet handy guide. On almost every page we discover a new command, most likely the one we had always looked for and had done without by resorting to more or less cumbersome workarounds. As a matter of fact, there seems to be a ready-made and perfectly suited R function for nearly anything that comes to mind. Users coming from compiled programming languages may find it difficult to get used to this wealth of functions, just as they may feel uncomfortable not declaring variable types, not initializing vectors and arrays, or getting rid of loops. Nevertheless, through numerous examples and a precise knowledge of the language’s strengths and limitations, Matloff masterfully introduces the reader to the flexibility of R. He repeatedly underlines the functional nature of R in every part of the book and stresses from the outset how this feature has to be exploited for effective programming.

“One of the most effective ways to achieve speed in R code is to use operations that are vectorized, meaning that a function applied to a vector is actually applied individually to each element.” (p.40).

The result is so convincing that it pushes even the strictest code purist to free herself from prejudices and surrender to the pleasures of an interpreted language. This probably was the hardest challenge in writing The Art of R Programming, and the author brilliantly met it.
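As a minimal sketch of what the quoted passage on vectorization means in practice (the vector x and the variable names are illustrative assumptions, not taken from the book), the same computation can be written with an explicit loop or as a single vectorized call:

```r
# Illustrative data (an assumption, not from the book)
x <- c(1, 4, 9, 16)

# Loop version: visit each element explicitly
roots_loop <- numeric(length(x))
for (i in seq_along(x)) {
  roots_loop[i] <- sqrt(x[i])
}

# Vectorized version: sqrt() is applied to every element in one call
roots_vec <- sqrt(x)

# Both approaches give the same values
stopifnot(identical(roots_loop, roots_vec))
```

The vectorized form is shorter, and on long vectors it is usually faster as well, since the element-wise work is done inside R's compiled internals rather than in interpreted code.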

The climax is unquestionably attained in the final chapters, where Matloff introduces some advanced and unusual topics with remarkable clarity and briskness. Within a few pages, he manages to tackle the object-oriented side of R, to advise and instruct the reader on debugging and performance issues, to show how to deal with R and C (or Python) mixed codes, and finally to open new perspectives by presenting the different approaches to parallel R. There is even a mention of GPU programming, a short paragraph that is certainly not exhaustive, but still instructive. To my knowledge, this is the only R handbook in which parallel programming with R is tackled with some degree of detail (I only found a hint of it in R in a Nutshell, yet no programming details are given therein). Also, the importance and prominence given to debugging are commendable, since this topic is often and mistakenly disregarded in most programming handbooks, except those explicitly written on the subject. Among the sharpest passages of the book, I definitely include the ones on scope and environment issues, to which are devoted both a long section in Chapter 7 and a tiny, simple yet enlightening example as early as page 9.

“Note carefully the role of w. The R interpreter found that there was no local variable of that name, so it ascended to the next level […] where it found a variable w with value 12. […]. It is possible (though not desirable) to deliberately allow name conflicts in this hierarchy. […] In such a situation the innermost environment is used first.” (p.153).

The message is clear: know exactly what you want to implement, keep track of all your objects, and scoping will not be an issue but another tool.
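The lookup rule described in the quotation can be sketched as follows. This is only an illustration under stated assumptions: apart from the variable w with value 12 mentioned in the quotation, the function names f, g and h and the other values are hypothetical.

```r
w <- 12                    # defined at the top level

f <- function(y) {
  d <- 8
  h <- function() {
    # w is local to neither h() nor f(), so R ascends level by level
    # until it finds w at the top level
    d * (w + y)
  }
  h()
}
res_f <- f(2)              # 8 * (12 + 2) = 112

g <- function(y) {
  w <- 1                   # deliberate name conflict with the top-level w
  h <- function() w + y    # the innermost enclosing w is used first
  h()
}
res_g <- g(2)              # 1 + 2 = 3

w                          # still 12: the top-level w is untouched
```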

“In C, we would not have functions defined within functions […]. Yet, since functions are objects, it is possible–and sometimes desirable from the point of view of the encapsulation goal of object-oriented programming—to define a function within a function; we are simply creating an object, which we can do anywhere.” (p.152-3).
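A small sketch of the encapsulation idea in this quotation, assuming a hypothetical make_counter() helper that does not appear in the book: the inner function is an ordinary object that can be returned, and the variable count stays hidden inside the enclosing call.

```r
make_counter <- function() {
  count <- 0                 # private state, reachable only through increment()
  increment <- function() {
    count <<- count + 1      # <<- updates count in the enclosing function
    count
  }
  increment                  # the inner function is returned like any other object
}

counter <- make_counter()
first  <- counter()          # 1
second <- counter()          # 2

other <- make_counter()      # an independent counter with its own count
other_first <- other()       # 1
```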

Another little gem is Section 7.9 on recursion, a concept that Matloff presents in a very clear and intuitive way. This section ends with one of the most inspired extended examples proposed in the book, where recursion is used to implement a binary search tree. Other interesting extended examples are those about discrete-event simulation (Section 7.8.3), Markov chains (Section 8.4.2) and polynomial regression (Section 9.1.7), though these applications may be a little too challenging for readers lacking a solid background in Statistics.
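To give a flavor of recursion applied to a binary search tree, here is a minimal sketch. It is not Matloff's implementation: the nested-list representation and the function names bst_insert and bst_inorder are assumptions made purely for illustration.

```r
# A node is a list with a key and two subtrees; an empty tree is NULL
bst_insert <- function(node, key) {
  if (is.null(node)) {
    return(list(key = key, left = NULL, right = NULL))
  }
  if (key <= node$key) {
    node$left <- bst_insert(node$left, key)    # recurse into the left subtree
  } else {
    node$right <- bst_insert(node$right, key)  # recurse into the right subtree
  }
  node
}

# In-order traversal returns the keys in increasing order
bst_inorder <- function(node) {
  if (is.null(node)) {
    return(NULL)
  }
  c(bst_inorder(node$left), node$key, bst_inorder(node$right))
}

tree <- NULL
for (k in c(8, 3, 10, 1, 6)) {
  tree <- bst_insert(tree, k)
}
sorted_keys <- bst_inorder(tree)   # 1 3 6 8 10
```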

Although The Art of R Programming is a book of many virtues, there are in my opinion some flaws:

The presence of lines of R code starting from the first few pages encourages the user to test her understanding straight away while reading, making The Art of R Programming a sort of plug-and-play guide through R. Unfortunately, the pleasure of real-time testing is spoiled by two things. First, the reader has to copy the code line by line. This is unquestionably useful for the many simple examples scattered throughout the book. However, it may become an inexhaustible source of typos, both pointless and annoying (not to mention time-consuming), when it comes to more complicated programs like those expounded in the many Extended Example sections. Second, the data sets are unavailable, so some applications are simply unusable (I managed to find the abalone data set for the extended examples of Sections 2.9.2 and 4.4.3, thus discovering this interesting repository, but for the rest my search was rather inconclusive). I am referring here to virtually all the extended examples in Chapters 5 and 6 on data frames, factors and tables. In particular, I find the application on the aids for learning Chinese dialects (Section 5.4.3) so over-elaborate as to be nearly worthless. I would certainly suggest designing a dedicated package assembling all the necessary material for a fully profitable training with the book, like the package mcsm conceived by Robert and Casella for reproducing the results contained in their book on Monte Carlo methods with R.

In addition, surely R can handle huge data sets with great ease, and maybe I am giving way to my personal preferences here, but I find that two whole chapters on data frames and factors (adding up to almost 40 pages!) are perhaps too much. Conversely, I believe that the “traditional” graphics package would have deserved more space and consideration, not only in the dedicated chapter (Chapter 12) but generally throughout the book. Indeed, the author suggests some good handbooks on the subject by Murrell and Wickham, but these are too detailed and advanced to be used for general purposes.

Despite an overall concise style, there are some long-winded passages and repetitions, especially in the applications, where certain lines of code are definitely redundant. I was likewise puzzled by the total absence in the book of the command separator ;, which would have considerably shortened and lightened some unnecessarily long examples. Also, a separate and more detailed index of R commands and functions would be helpful.

Finally, a minor but curious point about the assignment operator. I find the issue of <- vs. = particularly fascinating and a bit perturbing, since it leaves an ambiguity in the definition of such a fundamental operator. Still, there seem to be two main schools of thought and no general agreement. Reading various blogs and discussion forums, I found no decisive or robust argument in favor of either. Matloff approaches the issue of <- vs. = in assignments as early as page 4. As he says, “The standard assignment operator in R is <-. You can also use =, but this is discouraged, as it does not work in some special situations.” I was really eager to see these “special situations” shown in concrete examples. Unfortunately, they are nowhere to be found in the book.

Notwithstanding these minor flaws, The Art of R Programming is enriching, enjoyable and definitely worth keeping as a reference while working with R. I highly recommend it to programmers, academic researchers and students in computational statistics who wish to become quickly operational in writing R software. And it is undoubtedly useful reading for any R user.

15 Responses to “the Art of R Programming [guest post]”

  1. I tend to disagree with your statement that the use of semicolons “would have considerably shortened and lightened some unnecessarily long examples”.

    Granted, your code will be “shorter” if by that you mean fewer lines. But in terms of code length it is all the same. I personally prefer tall and slim to short and fat… (Now if you were talking about making the book smaller to save trees, that’s a different story.)

    I also find code following the “one statement per line” rule easier to read than a mix of single and multiple statements.

    Finally, semicolons are dangerous in a way. For consistency, people using semicolons will also want to put one at the end of the last statement on a line. Those people might be fooled into thinking that this last semicolon has a role, that it is needed to end a statement (as in Perl, for example). This will increase the chances of falling into this trap:

    sum.one.to.six <- 1 + 2 + 3
    + 4 + 5 + 6;
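    To spell out why this is a trap, here is a small sketch (the name sum.fixed is an assumption added for illustration). The first line is already a complete statement, so R ends it at the line break, whatever the semicolon placement might suggest:

```r
# The first line is a complete statement, so the assignment stops there
sum.one.to.six <- 1 + 2 + 3
+ 4 + 5 + 6;      # parsed as a separate expression (unary plus); its value is discarded
sum.one.to.six    # 6, not 21

# Ending the line with an operator tells the parser that the statement continues
sum.fixed <- 1 + 2 + 3 +
  4 + 5 + 6
sum.fixed         # 21
```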

  2. […] agree with the review here that “The Art of R Programming” is a nice book, but the lack of data for some of the […]

  3. Nice review. BTW, I made a post here that explains how to get the data for the extended example of Chapter 5.

  4. Nice review. I made a post here that provides a way to get the data for the extended example (on Chinese) from Chapter 5.

  5. Try the following:

    a <- 1:5

    b <- 4

    a[b = 3]

    b

    a[b <- 3]

    b

    And you'll see that while you get the same answer in both subscripted cases, b is changed only in the second case. As the R help page says, "The operator <- can be used anywhere, whereas the operator = is only allowed at the top level (e.g., in the complete expression typed at the command prompt) or as one of the subexpressions in a braced list of expressions." And we've only discovered two such situations (function call, array subscript) so far in this discussion. I have a feeling there are more cases.

    Please don't encourage the use of = for assignment. It makes some potential users happy… until they do something more advanced and suddenly it all blows up. People already deal routinely with = versus ==, and some languages have used := for assignment, so it's not all that hard.

    • Thanks, Wayne. I do not really see the point in writing
      a[b=3]
      nor in writing
      a[b<-3]
      and while my students can handle <- as well as =, I do not think it is a subtlety they can grasp (at least most of them).

    • Another thing that is confusing for new R users (like me) is that <- is one place where whitespace really matters. Your example:

      a[b <- 3]

      reminded me of a problem where I was searching for elements that were less than some negative number, and it took me a while to figure out why it didn't work. I had used no whitespace:

      a[b<-3]

      but needed to use:

      a[b< -3]

  6. It’s R standard that = must be used for argument/parameter assignment.

    Neither this, nor Chambers (worse) nor any other R “programming” text I’ve seen has done an adequate job of disambiguating the language’s claims of OO and Functional programming semantics/syntax. It would be more useful, if less Cool, to start with the way the language works, function/data as FORTRAN, and then proceed to explain the syntactic pixy dust that’s been pasted on over the years. I mean, S3 structs are Classes? They’re just named data lumps.

  7. Moyenne Armorique Says:

    I personally think that declaring variables when calling a function is not a “clean” practice and should be avoided. Of course, the case of system.time() is different, but it is the only issue I see here.

    In any case, adding parentheses, e.g.

    > system.time((a = 1:10))
    user system elapsed
    0 0 0

    would do perfectly. The two assignment operators are totally interchangeable and can be chosen according to the programmer’s personal taste. Meaning: one of them is redundant. And the whole thing is indeed ambiguous.

    In his post http://csgillespie.wordpress.com/2010/11/16/assignment-operators-in-r-vs/ C. S. Gillespie gives a number of reasons why ‘=’ should be preferred to ‘<-’:

    – The other languages I program in (python, C and occasionally JavaScript) use the “=” operator.
    – It’s quicker to type “=” than “<-”.
    – Typically, when I declare a variable – I only want it to exist in the current workspace.
    – Since I have the pleasure of teaching undergraduates their first course in programming, using “=” avoids misleading expressions like if (x[1]<-2)

    Apart from the fact that it makes the syntax of the super-assignment operator ‘<<-’ look natural, the only reason I see for keeping the old ‘<-’ assignment operator is a “historical” one.

  8. A special situation where ‘=’ does not do the same as ‘<-’ can be found in The R Inferno (http://www.burns-stat.com/pages/Tutor/R_inferno.pdf), page 94:

    system.time(result <- my.test.function(100))

    I also bought and read "The Art of R Programming". IMHO, it is an OK book, certainly not the bible that many reviews (including yours) depict it as. I, for example, have learnt a lot more about R's pitfalls and good programming by reading Patrick Burns' "The R Inferno" and some sections of his previous book "S Poetry", which are both free.

  9. Steve Lianoglou Says:

    Hi,

    Thanks for the review — for the curious, here is a “special situation” when `<-` and `=` are not interchangeable.

    Say you want to speed test the runtime of a long running operation (or function call) — the example I'm using definitely is *not* a long running operation, but you get the idea:

    system.time(a <- 1:10)
    user system elapsed
    0 0 0

    vs.

    system.time(a = 1:10)
    Error in system.time(a = 1:10) : unused argument(s) (a = 1:10)

    Too late for inclusion in the book, but perhaps helpful for the internet traveler (and the reviewer) who stumble upon these comments …

  10. Here is an example where the operators = and <- do not produce the same results:

    table(x <- ceiling(10*runif(100)), y <- rbinom(100,1,.5))

    table(x2 = ceiling(10*runif(100)), y2 = rbinom(100,1,.5))

    • Well, they do produce the same resulting tables if you call set.seed() before each. The first, however, clutters your workspace with x and y (and potentially overwrites variables, a dangerous side-effect). I find that use of “<-” makes code more readable.
