Archive for Ross Ihaka

Extending R

Posted in Books, Kids, R, Statistics on July 13, 2016 by xi'an

As I was previously unaware of this book coming up, my surprise and excitement were both extreme when I received it from CRC Press a few weeks ago! John Chambers, one of the fathers of S, the precursor of R, had just published a book about extending R. It covers some reflections of the author on programming and the story of R (Parts 2 and 1, respectively), and then focuses on object-oriented programming (Part 3) and the interfaces from R to other languages (Part 4). While this is “only” a programming book, and thus not strictly appealing to statisticians, reading one of the original actors’ thoughts on the past, present, and future of R is simply fantastic!!! And John Chambers is definitely not calling for simply starting over and building something better, as Ross Ihaka did in this [most read] post a few years ago. (It is also great to see the names of friends appearing at times, like Julie, Luke, and Duncan!)

“I wrote most of the original software for S3 methods, which were useful for their application, in the early 1990s.”

In the (hi)story part, Chambers delves into the details of the evolution of S at Bell Labs, as described in his [first] “blue book” (which I kept on my shelf until very recently, next to the “white book”!) and of the emergence of R in the mid-1990s. I find those sections fascinating, maybe all the more because I am somewhat of a contemporary, having first learned Fortran (and Pascal) in the mid-1980s, before moving in the early 1990s to C (which I mostly coded as translated Pascal!), S-Plus, and eventually R, in conjunction with a (forced) migration from Unix to Linux, as my local computer managers abandoned Unix and mainframes in favour of some virtual Windows machines, and as I started running R on laptops with the help of friends more skilled than I (again keeping some of the early R manuals on my shelf until recently). Maybe one of the most surprising things about those reminiscences is that the first official version of R, 1.0.0, was dated Feb 29, 2000! Not because of Feb 29, 2000 itself (which, as Chambers points out, is the first use of the third-order correction to the Gregorian calendar, although I would have thought 1600 was the first one), but because I would have thought R appeared earlier, in conjunction with my first Linux laptop; alas, this memory is getting too vague!

As indicated above, the book is mostly about programming, which means in my case that some sections are definitely beyond my reach! For instance, a sentence like “the onus is on the person writing the calling function to avoid using a reference object as the argument to an existing function that expects a named list” is not immediately clear to me… Nonetheless, most sections are readable [at my level] and enlightening about the mottoes “everything that exists is an object” and “everything that happens is a function” that are repeated throughout. (And about my psycho-rigid ways of translating Pascal into every other language!) I obviously learned about new commands and notions, like the difference between

x <- 3

and

x <<- 3

(but I was disappointed to learn that the number of <‘s was not related to the depth or height of the assignment!) In particular, I found the part about replacement fascinating, explaining how a command like

diag(x)[i] = 3

could modify x directly. (While definitely worth reading, the chapter on R packages could have benefited from more details. But as Chambers points out there are whole books about this.) Overall, I am afraid the book will not improve my (limited) way of programming in R but I definitely recommend it to anyone even moderately skilled in the language.
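
As a postscript for the curious, here are two minimal sketches of these two points, assuming nothing beyond base R (the names bump and counter are invented for the illustration, and the *tmp* rewriting follows the convention documented for replacement functions in the R language definition):

# <<- assigns in an enclosing environment instead of creating a local binding
counter <- 0
bump <- function() counter <<- counter + 1   # updates the global counter
bump(); bump()
counter                                      # 2

# what diag(x)[i] = 3 roughly expands to: the replacement functions `diag<-`
# and `[<-` rebuild x through a hidden temporary, then rebind the name x
x <- matrix(0, 3, 3)
i <- 2
`*tmp*` <- x
x <- `diag<-`(`*tmp*`, value = `[<-`(diag(`*tmp*`), i, value = 3))
rm(`*tmp*`)
diag(x)                                      # 0 3 0, same effect as diag(x)[i] = 3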

1500th, 3000th, &tc

Posted in Books, R, Statistics, University life on January 8, 2012 by xi'an

As the ‘Og reached its 1500th post and its 3000th comment at exactly the same time, here is a wee and only mildly interesting Sunday morning foray into what has been posted so far and what attracted the most attention (using the statistics provided by WordPress). The most visited posts:

Title Views
Home page 203,727
In{s}a(ne)!! 7,422
“simply start over and build something better” 6,264
Julien on R shortcomings 2,676
Sudoku via simulated annealing 2,402
About 1,876
Of black swans and bleak prospects 1,768
Solution manual to Bayesian Core on-line 1,628
Parallel processing of independent Metropolis-Hastings algorithms 1,625
Bayesian p-values 1,595
Bayes’ Theorem 1,537
#2 blog for the statistics geek?! 1,526
Do we need an integrated Bayesian/likelihood inference? 1,501
Coincidence in lotteries 1,396
Solution manual for Introducing Monte Carlo Methods with R 1,340
Julian Besag 1945-2010 1,293
Tornado in Central Park 1,093
The Search for Certainty 1,016

Hence, three R posts (incl. one by Julien and one by Ross Ihaka), three (critical) book reviews, two solution manuals, two general Bayesian posts, two computational entries, one paper (with Pierre Jacob and Murray Smith), one obituary, and one photograph news report… Altogether in line with the main purpose of the ‘Og. The most commented posts:

Post Comments
In{s}a(ne)!! 31
“simply start over and build something better” 30
That the likelihood principle does not hold… 23
Incoherent inference 23
Lack of confidence in ABC model choice 20
Parallel processing of independent Metropolis-Hastings algorithms 19
ABC model choice not to be trusted 17
MCMC with errors 16
Coincidence in lotteries 16
Bessel integral 14
Numerical analysis for statisticians 14

Not exactly the same as above! In particular, the posts about ABC model choice and our PNAS paper made it into the list. Lastly, the top search terms:

Search Views
surfers paradise 1,050
benidorm 914
introducing monte carlo methods with r 514
andrew wyeth 398
mistborn 352
abele blanc 350
nested sampling 269
particle mcmc 269
bayesian p-value 263
julian besag 257
rites of love and math 249
millenium 237
bayesian p value 222
marie curie 221
bonsai 200

(out of which I removed the dozens of variations on xian’s blog). I find it rather sad that both top entries are beach towns that are completely unrelated to my lifestyle and to my vacation places. Overall, more than half of those entries do not strongly relate to the contents of the ‘Og (even though I did post at length about Sanderson’s Mistborn and Larsson’s Millennium trilogies). Lastly, the most popular clicks are:

URL Clicks
amazon.com/gp/product/1441915753?ie=UTF8&tag=chrprobboo-20&linkCode=as2&camp=1789&creative=390957&creativeASIN=1441915753 1,243
stat.columbia.edu/~cook/movabletype/mlm 1,039
terrytao.wordpress.com 583
amazon.com/gp/product/0387389792?ie=UTF8&tag=chrprobboo-20&linkCode=as2&camp=1789&creative=9325&creativeASIN=0387389792 575
arxiv.org/abs/1012.2184 531
radfordneal.wordpress.com/2010/08/15/two-surpising-things-about-r 529
romainfrancois.blog.free.fr 505
statisfaction.wordpress.com 404
ceremade.dauphine.fr/~xian/basudo.R 395
stackoverflow.com/questions/3706990/is-r-that-bad-that-it-should-be-rewritten-from-scratch 372
amazon.com/gp/product/0387212396?ie=UTF8&tag=chrprobboo-20&linkCode=as2&camp=1789&creative=9325&creativeASIN=0387212396 298
radfordneal.wordpress.com/2010/09/03/fourteen-patches-to-speed-up-r 298
cs.ubc.ca/~cornebis 288
statisticsforum.wordpress.com 282
arxiv.org/abs/1001.2906 279
arxiv.org/abs/1010.1595 257
amazon.com/gp/redirect.html?ie=UTF8&location=http://www.amazon.com/gp/entity/-/B001H6GSKC&tag=chrprobboo-20&linkCode=ur2&camp=1789&creative=390957 256
ceremade.dauphine.fr/~xian/BCS/solutions.pdf 253
rss.org.uk/main.asp?page=3005 243
www3.interscience.wiley.com/cgi-bin/fulltext/119424936/PDFSTART 216
stat.auckland.ac.nz/~ihaka/downloads/Compstat-2008.pdf 203

which include links to my books on Amazon, Andrew Gelman’s, Terry Tao’s, Radford Neal’s, and Romain François’s blogs, the CREST stat students’ collective blog, and a few arXiv papers of mine…

Posts of the year

Posted in Books, R, Statistics, University life on August 31, 2011 by xi'an

Like last year, here are the most popular posts since last August:

  1. Home page 92,982
  2. In{s}a(ne)!! 6,803
  3. “simply start over and build something better” 5,834
  4. Julien on R shortcomings 2,373
  5. Parallel processing of independent Metropolis-Hastings algorithms 1,455
  6. Do we need an integrated Bayesian/likelihood inference? 1,361
  7. Coincidence in lotteries 1,256
  8. #2 blog for the statistics geek?! 863
  9. ABC model choice not to be trusted 814
  10. Sudoku via simulated annealing 706
  11. Bayes on the Beach 2010 [2] 704
  12. News about speeding R up 688
  13. Solution manual for Introducing Monte Carlo Methods with R 688
  14. R exam 617
  15. Bayesian p-values 607
  16. Monte Carlo Statistical Methods third edition 577
  17. Le Monde puzzle [49] 499
  18. The foundations of Statistics: a simulation-based approach 493
  19.  The mistborn trilogy 492
  20. Lack of confidence in ABC model choice 487
  21. Solution manual to Bayesian Core on-line 481
  22. Bayes’ Theorem 459
  23. Julian Besag 1945-2010 452
  24. Millenium 1 [movie] 448
  25. ABC lectures [finale] 436

No major surprise in this ranking: R-related posts keep the upper part, partly thanks to being syndicated on R-bloggers, partly thanks to the guest columns contributed by Ross Ihaka and Julien Cornebise, even though I am surprised a rather low-key Le Monde puzzle made it to the list (maybe because it became part of my latest R exam?). Controversial book reviews are great traffic generators, even though the review of The foundations of Statistics: a simulation-based approach was posted less than a month ago. Lastly, it is comforting to see two of our major research papers for the 2010-2011 period on the list: Parallel processing of independent Metropolis-Hastings algorithms with Pierre and Murray, and the more controversial Lack of confidence in ABC model choice with Jean-Michel and Natesh (a topic which appears twice). The outlier in the list is undoubtedly Bayes on the Beach 2010 [2], which got undeserved traffic for pointing to Surfers Paradise, a highly popular search entry! On the unscientific side, Sanderson’s Mistborn and Larsson’s Millennium made the cut, with McCarthy’s Border trilogy missing the top list by three entries…

News about speeding R up

Posted in R, Running, Statistics on May 24, 2011 by xi'an

The most visited post ever on the ‘Og was In{s}a(ne), my report on Radford Neal’s experiments with speeding up R by using different brackets (the second most popular was Ross Ihaka’s comment, “simply start over and build something better”). I just spotted two new entries by Radford on his blog that are bound to rekindle the debate about the speed of R. The latest one shows that matrix multiplication can be made close to ten times faster by changing the way the test for the presence of NaNs in a matrix is carried out. This gain is not as shocking as producing a 25% improvement by replacing x=1/(1+x) with x=1/{1+x}, but a factor of ten is such a major gain…
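
For readers who want to try the bracket experiment at home, here is a minimal sketch of the kind of timing comparison involved (only a sketch: the exact figures depend on the machine and the R version, and later interpreters have narrowed the bracket gap):

n <- 1e6
x <- 1
system.time(for (i in 1:n) x <- 1/(1 + x))   # parentheses
system.time(for (i in 1:n) x <- 1/{1 + x})   # curly braces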

“simply start over and build something better”

Posted in R, Statistics, University life on September 13, 2010 by xi'an

The post on the shortcomings of R has attracted a huge number of readers, and Ross Ihaka has now posted a detailed comment that is fairly pessimistic… Given the radical directions drafted in this comment from the father of R (along with Robert Gentleman), I am once again re-posting it as a main entry to advertise its contents more broadly. (Obviously, the whole debate is now far beyond my reach! Please comment on the most current post, i.e. this one.)

Since (something like) my name has been taken in vain here, let me chip in.

I’ve been worried for some time that R isn’t going to provide the base that we’re going to need for statistical computation in the future. (It may well be that the future is already upon us.) There are certainly efficiency problems (speed and memory use), but there are more fundamental issues too. Some of these were inherited from S and some are peculiar to R.

One of the worst problems is scoping. Consider the following little gem.

f = function() {
    if (runif(1) > .5)
        x = 10
    x
}

The x being returned by this function is randomly local or global. There are other examples where variables alternate between local and non-local throughout the body of a function. No sensible language would allow this. It’s ugly and it makes optimisation really difficult. This isn’t the only problem: even weirder things happen because of interactions between scoping and lazy evaluation.
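
[A minimal illustration of the behaviour described above, with the global value and the seed chosen purely for the example:]

x <- 0                      # a global x
f <- function() {
    if (runif(1) > .5)
        x <- 10             # creates a local x only when this branch is taken
    x                       # otherwise the global x is silently returned
}
set.seed(1)
replicate(5, f())           # 0 0 10 10 0 -- local and global results mixed at random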

In light of this, I’ve come to the conclusion that rather than “fixing” R, it would be much more productive to simply start over and build something better. I think the best you could hope for by fixing the efficiency problems in R would be to boost performance by a small multiple, or perhaps as much as an order of magnitude. This probably isn’t enough to justify the effort (Luke Tierney has been working on R compilation for over a decade now).

To try to get an idea of how much speedup is possible, a number of us have been carrying out some experiments to see how much better we could do with something new. Based on prototyping we’ve been doing at Auckland, it looks like it should be straightforward to get two orders of magnitude speedup over R, at least for those computations which are currently bottle-necked. There are a couple of ways to make this happen.

First, scalar computations in R are very slow. This is in part because the R interpreter is very slow, but also because there are no scalar types. By introducing scalars and using compilation it looks like it’s possible to get a speedup by a factor of several hundred for scalar computations. This is important because it means that many ghastly uses of array operations and the apply functions could be replaced by simple loops. The cost of these improvements is that scope declarations become mandatory and (optional) type declarations are necessary to help the compiler.
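
[As a rough illustration of the scalar gap in current R, the kind of difference compilation would aim at closing (timings are machine- and version-dependent, and the variable names are arbitrary):]

n <- 1e6
v <- runif(n)
system.time({ s <- 0; for (i in seq_len(n)) s <- s + v[i] })   # interpreted scalar loop
system.time(sum(v))                                            # the same reduction in vectorised C code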

As a side-effect of compilation and the use of type-hinting it should be possible to eliminate dispatch overhead for certain (sealed) classes (scalars and arrays in particular). This won’t bring huge benefits across the board, but it will mean that you won’t have to do foreign language calls to get efficiency.

A second big problem is that computations on aggregates (data frames in particular) run at glacial rates. This is entirely down to unnecessary copying because of the call-by-value semantics. Preserving call-by-value semantics while eliminating the extra copying is hard. The best we can probably do is to take a conservative approach. R already tries to avoid copying where it can, but fails in an epic fashion. The alternative is to abandon call-by-value and move to reference semantics. Again, prototyping indicates that several hundredfold speedup is possible (for data frames in particular).
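
[A small illustration of the copying issue, assuming an R build with memory profiling enabled so that tracemem() is available; the data frame d and the function g are made up for the example:]

d <- data.frame(x = runif(1e6))
tracemem(d)                               # report whenever d gets duplicated
g <- function(df) { df$x[1] <- 0; df }    # call-by-value: touching the argument
invisible(g(d))                           # triggers copies reported by tracemem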

The changes in semantics mentioned above mean that the new language will not be R. However, it won’t be all that far from R and it should be easy to port R code to the new system, perhaps using some form of automatic translation.

If we’re smart about building the new system, it should be possible to make use of multi-cores and parallelism. Adding this to the mix might just make it possible to get a three order-of-magnitude performance boost with just a fraction of the memory that R uses. I think it’s something really worth putting some effort into.

I also think one other change is necessary. The license will need to do a better job of protecting work donated to the commons than GPL2 seems to have done. I’m not willing to have any more of my work purloined by the likes of Revolution Analytics, so I’ll be looking for better protection from the license (and being a lot more careful about who I work with).