“simply start over and build something better”

The post on the shortcomings of R has attracted a huge number of readers and Ross Ihaka has now posted a detailed comment that is fairly pessimistic… Given the radical directions drafted in this comment from the father of R (along with Robert Gentleman), I once again re-post it as a main entry to advertise more broadly its contents. (Obviously, the whole debate is now far beyond my reach! Please comment on the most current post, i.e. this one.)

Since (something like) my name has been taken in vain here, let me chip in.

I’ve been worried for some time that R isn’t going to provide the base that we’re going to need for statistical computation in the future. (It may well be that the future is already upon us.) There are certainly efficiency problems (speed and memory use), but there are more fundamental issues too. Some of these were inherited from Sand some are peculiar to R.

One of the worst problems is scoping. Consider the following little gem.

f =function() {
if (runif(1) > .5)
x = 10
x
}

The x being returned by this function is randomly local or global. There are other examples where variables alternate between local and non-local throughout the body of a function. No sensible language would allow this. It’s ugly and it makes optimisation really difficult. This isn’t the only problem, even weirder things happen because of interactions between scoping and lazy evaluation.

In light of this, I’ve come to the conclusion that rather than “fixing” R, it would be much more productive to simply start over and build something better. I think the best you could hope for by fixing the efficiency problems in R would be to boost performance by a small multiple, or perhaps as much as an order of magnitude. This probably isn’t enough to justify the effort (Luke Tierney has been working on R compilation for over a decade now).

To try to get an idea of how much speedup is possible, a number of us have been carrying out some experiments to see how much better we could do with something new. Based on prototyping we’ve been doing at Auckland, it looks like it should be straightforward to get two orders of magnitude speedup over R, at least for those computations which are currently bottle-necked. There are a couple of ways to make this happen.

First, scalar computations in R are very slow. This in part because the R interpreter is very slow, but also because there are a no scalar types. By introducing scalars and using compilation it looks like its possible to get a speedup by a factor of several hundred for scalar computations. This is important because it means that many ghastly uses of array operations and the apply functions could be replaced by simple loops. The cost of these improvements is that scope declarations become mandatory and (optional) type declarations are necessary to help the compiler.

As a side-effect of compilation and the use of type-hinting it should be possible to eliminate dispatch overhead for certain (sealed) classes (scalars and arrays in particular). This won’t bring huge benefits across the board, but it will mean that you won’t have to do foreign language calls to get efficiency.

A second big problem is that computations on aggregates (data frames in particular) run at glacial rates. This is entirely down to unnecessary copying because of the call-by-value semantics. Preserving call-by-value semantics while eliminating the extra copying is hard. The best we can probably do is to take a conservative approach. R already tries to avoid copying where it can, but fails in an epic fashion. The alternative is to abandon call-by-value and move to reference semantics. Again, prototyping indicates that several hundredfold speedup is possible (for data frames in particular).

The changes in semantics mentioned above mean that the new language will not be R. However, it won’t be all that far from R and it should be easy to port R code to the new system, perhaps using some form of automatic translation.

If we’re smart about building the new system, it should be possible to make use of multi-cores and parallelism. Adding this to the mix might just make it possible to get a three order-of-magnitude performance boost with just a fraction of the memory that R uses. I think it’s something really worth putting some effort into.

I also think one other change is necessary. The license will need to a better job of protecting work donated to the commons than GPL2 seems to have done. I’m not willing to have any more of my work purloined by the likes of Revolution Analytics, so I’ll be looking for better protection from the license (and being a lot more careful about who I work with).

This entry was posted on September 13, 2010 at 12:36 am and is filed under R, Statistics, University life with tags CRAN, R, Robert Gentleman, Ross Ihaka. You can follow any responses to this entry through the RSS 2.0 feed. You can leave a response, or trackback from your own site.

41 Responses to ““simply start over and build something better””

R, Julia and the shiny new thing – Palimpsest Says:
March 5, 2023 at 3:57 am

[…] Temple Long have mused on starting over (PDF, 2008). Similar comments were voiced by Ihaka in Christian Robert’s blog (2010) and were probably at the root of the development of Incanter (based on Clojure). Vince […]

Reply

Eugene Says:
September 18, 2017 at 4:09 am

It is now possible in 2017 to use quasiquotation, à la rlang (http://rlang.tidyverse.org/), to fix the scope of function variables, and resolve the problem of scope indeterminacy.

The nofrills package (https://cran.r-project.org/package=nofrills) provides an alternative to `function()` that facilitates this, by allowing you to fix the default value of local ‘x’ to the value of global ‘x’ at the _time of function creation_:

– either via an argument default value:
f .5) x <- 10; x})

– or via a local assignment:
f <- nofrills::fn(~ {x .5) x <- 10; x})

Now the 'x' in 'f' is purely local, and may assume the value that global 'x' has at the time of f's creation. Even if global 'x' changes, 'f' itself won't.

Eugene Says:
September 18, 2017 at 4:15 am

Seems my example code hasn’t rendered properly. In its place, see

	library(nofrills)

	x <- 0

	# make x local by pegging it to an argument
	f1 <- fn(x = !!x ~ {
	if (runif(1) > .5)
	x <- 10
	x
	})

	# assign local x to global x at the time of f2's creation (not call)
	f2 <- fn(~ {
	x <- !!x
	if (runif(1) > .5)
	x <- 10
	x
	})

	# even when x changes in the lexical scope, f1 and f2 don't
	x <- 1
	replicate(10, f1()) # only contains 0's or 10's
	replicate(10, f2()) # ditto

view raw

comment.R

hosted with ❤ by GitHub

Escopo Léxico e Lazy-evaluation no R | Blog Pra falar de coisas Says:
July 19, 2014 at 5:47 pm

[…] nem sempre isso funciona. Os problemas de escopo no R foram ilustrados muito bem, eu acho, pelo Christian Robert no blog dele há algum tempo. Considere o código […]

Reply
Stupid R Tricks: Random Scope « Statistical Modeling, Causal Inference, and Social Science Statistical Modeling, Causal Inference, and Social Science Says:
March 15, 2014 at 11:30 am

[…] Simply Start Over and Build Something Better […]

Reply
Is R “that bad” that it should be rewritten from scratch? | Ask Programming & Technology Says:
November 4, 2013 at 12:48 am

[…] In the past week I’ve been following a discussion where Ross Ihaka wrote: […]

Reply
Is R "that bad" that it should be rewritten from scratch? - Tech Forum Network Says:
May 29, 2013 at 5:50 am

[…] In the past week I’ve been following a discussion where Ross Ihaka wrote: […]

Reply
R, Julia and the shiny new thing | Quantum Forest Says:
April 12, 2012 at 12:37 pm

[…] Temple Long have mused on starting over (PDF, 2008). Similar comments were voiced by Ihaka in Christian Robert’s blog (2010) and were probably at the root of the development of Incanter (based on Clojure). Vince […]

Reply
crowding Says:
March 7, 2012 at 11:26 am

“Preserving call-by-value semantics while eliminating the extra copying is hard.”

Have you looked at the “chunked arrays” recently added to Clojure? Basically instead of arrays allocated in a single block, you chop the array up into blocks and organize it in a b-tree.

So, you can have an array of thousands of items and pass it to a function which changes one item — you no longer re-copy the entire array, all that happens is you allocate a single new chunk and O(log N) nodes to return a new tree. The old tree remains where it was and shares memory on the overlapping chunks/nodes.

Appends, inserts, deletes, concatenations all use less time and memory, while normal array operations remain fast. Indexing changes from O(1) to O(log N) but in practice you’ve gained way more than you’ve lost.

And all the chunks and nodes are immutable, the reference graph is a DAG so simple reference counting suffices for memory management, and you keep call-by-value semantics. I think that is important, especially with the increasing importance of parallel computing. Keeping pass-by-reference arrays synched across threads would be a nightmare.

Reply
- xi'an Says:
  March 8, 2012 at 6:41 pm
  
  Thanks for the comment! Overall, it hardly makes sense to me… but may do [make sense] for some portion of the ‘Og’s readership.
  
  Reply
1500th, 3000th, &tc « Xi'an's Og Says:
January 8, 2012 at 12:15 am

[…] three R posts (incl. one by Julien and one by Ross Ihaka), three (critical) book reviews, two solution manuals, two general Bayesian posts, two […]

Reply
John Says:
October 14, 2011 at 7:42 am

Has anyone just tried compiling and running R with the scoping explicit? Is there anywhere in base R or a package that really depends on this behaviour that can’t easily be fixed with explicit calls (i.e. <<-)? How hard is it really to just change this scoping rule so that R never looks outside the function unless explicitly told to do so?

Reply