“simply start over and build something better”

The post on the shortcomings of R has attracted a huge number of readers and Ross Ihaka has now posted a detailed comment that is fairly pessimistic… Given the radical directions drafted in this comment from the father of R (along with Robert Gentleman), I once again re-post it as a main entry to advertise more broadly its contents. (Obviously, the whole debate is now far beyond my reach! Please comment on the most current post, i.e. this one.)

Since (something like) my name has been taken in vain here, let me chip in.

I’ve been worried for some time that R isn’t going to provide the base that we’re going to need for statistical computation in the future. (It may well be that the future is already upon us.) There are certainly efficiency problems (speed and memory use), but there are more fundamental issues too. Some of these were inherited from S and some are peculiar to R.

One of the worst problems is scoping. Consider the following little gem.

f = function() {
    if (runif(1) > .5)
        x = 10
    x
}

The x being returned by this function is randomly local or global. There are other examples where variables alternate between local and non-local throughout the body of a function. No sensible language would allow this. It’s ugly and it makes optimisation really difficult. This isn’t the only problem; even weirder things happen because of interactions between scoping and lazy evaluation.
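A minimal sketch of the behaviour described above, assuming a global x has been defined before the function is called. The global value -1, the seed, and the number of repetitions are arbitrary choices for illustration and are not part of the original example.

```r
# Assumed global variable; the original example leaves it implicit.
x <- -1

f <- function() {
  if (runif(1) > .5)
    x <- 10   # a local x is created only when this branch is taken
  x           # refers to the local x if it exists, otherwise the global x
}

set.seed(1)   # arbitrary seed, only to make the run repeatable
results <- replicate(20, f())

# Both the local value (10) and the global value (-1) show up,
# depending on the random draw in each call.
print(table(results))
```

Which x a given call returns cannot be determined by reading the function body; it depends on a value drawn at run time, which is what makes static analysis and compilation hard.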

In light of this, I’ve come to the conclusion that rather than “fixing” R, it would be much more productive to simply start over and build something better. I think the best you could hope for by fixing the efficiency problems in R would be to boost performance by a small multiple, or perhaps as much as an order of magnitude. This probably isn’t enough to justify the effort (Luke Tierney has been working on R compilation for over a decade now).

To try to get an idea of how much speedup is possible, a number of us have been carrying out some experiments to see how much better we could do with something new. Based on prototyping we’ve been doing at Auckland, it looks like it should be straightforward to get two orders of magnitude speedup over R, at least for those computations which are currently bottle-necked. There are a couple of ways to make this happen.

First, scalar computations in R are very slow. This is in part because the R interpreter is very slow, but also because there are no scalar types. By introducing scalars and using compilation it looks like it’s possible to get a speedup by a factor of several hundred for scalar computations. This is important because it means that many ghastly uses of array operations and the apply functions could be replaced by simple loops. The cost of these improvements is that scope declarations become mandatory and (optional) type declarations are necessary to help the compiler.
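One rough way to look at the scalar overhead in current R is to time an explicit loop against the equivalent built-in vectorised call. This is only a sketch: the function name loop_cumsum and the vector length are hypothetical choices, and the timings will vary with the R version (recent versions byte-compile functions) and the machine, so no particular ratio is claimed here.

```r
n <- 1e5
v <- runif(n)

# Scalar-style loop: every iteration is handled by the interpreter,
# and each "scalar" is really a length-one vector.
loop_cumsum <- function(v) {
  out <- numeric(length(v))
  total <- 0
  for (i in seq_along(v)) {
    total <- total + v[i]
    out[i] <- total
  }
  out
}

t_loop <- system.time(a <- loop_cumsum(v))   # explicit loop
t_vec  <- system.time(b <- cumsum(v))        # built-in, implemented in C

print(t_loop)
print(t_vec)
print(all.equal(a, b))   # the two approaches agree up to rounding
```

The loop is the form many people find most natural to write; the point made above is that a language with true scalars and compilation could make it cheap enough that the vectorised workaround is no longer needed.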

As a side-effect of compilation and the use of type-hinting it should be possible to eliminate dispatch overhead for certain (sealed) classes (scalars and arrays in particular). This won’t bring huge benefits across the board, but it will mean that you won’t have to do foreign language calls to get efficiency.

A second big problem is that computations on aggregates (data frames in particular) run at glacial rates. This is entirely down to unnecessary copying because of the call-by-value semantics. Preserving call-by-value semantics while eliminating the extra copying is hard. The best we can probably do is to take a conservative approach. R already tries to avoid copying where it can, but fails in an epic fashion. The alternative is to abandon call-by-value and move to reference semantics. Again, prototyping indicates that several hundredfold speedup is possible (for data frames in particular).
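The difference between the two semantics can be sketched in present-day R by using an environment as a stand-in for a reference. The function names add_col_value and add_col_ref are hypothetical, and this only illustrates the semantics; it says nothing about how much copying R performs internally or about the speedups mentioned above.

```r
# Call-by-value: the function works on its own copy of the data frame,
# so the caller's object is left unchanged.
add_col_value <- function(df) {
  df$z <- df$x * 2
  df
}

# Reference-style: the data frame lives inside an environment, and
# environments are not copied when passed to a function.
add_col_ref <- function(env) {
  env$df$z <- env$df$x * 2
  invisible(NULL)
}

d <- data.frame(x = 1:3)
add_col_value(d)                 # result discarded
print("z" %in% names(d))         # the caller's d has no column z

e <- new.env()
e$df <- data.frame(x = 1:3)
add_col_ref(e)
print("z" %in% names(e$df))      # the change is visible to the caller
```

With value semantics the caller must reassign the result (d <- add_col_value(d)) to see the change, which is where the extra copies tend to come from.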

The changes in semantics mentioned above mean that the new language will not be R. However, it won’t be all that far from R and it should be easy to port R code to the new system, perhaps using some form of automatic translation.

If we’re smart about building the new system, it should be possible to make use of multi-cores and parallelism. Adding this to the mix might just make it possible to get a three order-of-magnitude performance boost with just a fraction of the memory that R uses. I think it’s something really worth putting some effort into.

I also think one other change is necessary. The license will need to do a better job of protecting work donated to the commons than GPL2 seems to have done. I’m not willing to have any more of my work purloined by the likes of Revolution Analytics, so I’ll be looking for better protection from the license (and being a lot more careful about who I work with).

40 Responses to ““simply start over and build something better””

  1. […] part, partly thanks to being syndicated on R-bloggers, partly thanks to the tribunes contributed by Ross Ihaka and Julien Cornebise, even though I am surprised a rather low-key Le Monde puzzle made it to the […]

  2. On reflection, perhaps my points were lost in my long-winded post!
    They were:
    1. That with an active community, the GPL is adequate (GPL-3 may help if redistribution is required). Unless financial return is important, I favour MIT or BSD licenses, as do a lot of developers and businesses (but I guess they can afford to pay something!)
    2. That a well-managed, parallel redevelopment model (keep current version with bugfix releases) would cause least problems.
    3. There are suitable reference models and communities to refer to when planning redevelopment – both how it should be done and how it shouldn’t.
    BTW, it is my understanding, IANL :), that Revolution are compliant with the terms of the GPL by releasing the combined source code of their ‘Enterprise’ version. Need to be careful about code pollution with their source/algorithmic methods if there is likely to be a dispute, but as they’re effectively GPL’d it shouldn’t be a problem?
    Anthony

  3. Hi xi’an!

    Delighted to contribute. Sorry about the typos :(

    Really like this blog.
    Anthony

  4. I felt the need to add something here.

    Look at the WordPress project (GPL2) run by Automattic.
    WordPress originated with the b2 blogging system. Automattic took over the project and redeveloped the core, keeping the GPL license, which they had to do anyway. The project is huge. Automattic make their money from consultancy and custom development work, and some plugins such as Akismet (comment spam checker), which they charge a fee to use (not to install) when you need a certain level of spam checks.
    The project is large and active and developed through Automattic.
    There have been problems with plugin and theme developers. As the core system is GPL’d, so too must the libraries and themes be. The rationale in this case is that neither themes nor libraries can function without the core system, so they are not themselves standalone and thus are GPL. Through community effort rather than legal action, most developers have been persuaded to comply with the GPL. So what I am saying is that the GPL is effective with a bit of effort.

    Also, as Python has been alluded to, it is worth mentioning the way that Python has been redeveloped. Versions 1 and 2 of the language were developed over a long period, and it was realised that there were shortcomings in the language, not unlike those of R. A parallel development of Python 3 occurred whilst still developing and patching version 2, which is still the most widely used version. Many libraries have still to be converted to version 3. However, there is now a stable, faster and more up-to-date (technologically) version 3 with no disruption to users of version 2. Contrast this with Perl, which has had a fairly disastrous evolution towards version 6, due mainly to disagreements, overreaching objectives and generally bad management.
    R can be redeveloped in the same manner as Python if the will is there. Now is the time to do so, as if you leave it too long something else will evolve to make up for its shortcomings. I particularly notice R as being a recent requirement in financial modeling job advertisements, alongside the likes of C++, Java, and Matlab. Perhaps here is a location for some financial sponsorship. Many investment organisations make use of Open Source software and also contribute back to the community.
    In this direction think of Quantlib, which is widely used.
    Just one final point: Python is also becoming widely used in the finance community, largely due to the efforts of a few developers/analysts and a company called Enthought. They have an interesting pedigree and business model.
    Sorry this reply was a bit on the long side.
    Anthony

  6. […] the past week I’ve been following a discussion where Ross Ihaka wrote (here ): I’ve been worried for some time that R isn’t going to provide the base that we’re going to […]

  8. Kevin Wright Says:

    He said, “I’m not willing to have any more of my work purloined by the likes of Revolution Analytics”.

    Let’s get this straight. He purloins the intellectual property of Bell Labs by re-inventing S and then makes this claim?

    One has to wonder…

    • ivorytowerkiwi Says:

      @Kevin Wright

      The R software team developed their own implementation of S and released it to the community under the GPL. The criticism of Revolution Analytics comes from the fact that they are building proprietary extensions on top of GPL-licensed software that the free software (as in freedom) community built. They are not contributing their code back to the community, which is the condition of using GPL software.

    • @ivorytowerkiwi

      My understanding, and please correct me if I’m wrong because most of this unfortunately comes from rumor & innuendo, is that the modifications RA made to the core engine and the core modules have indeed been distributed publicly in source code form as required by the GPL, and (some of) the proprietary extensions they’ve made have not (though some have). This satisfies the GPL just fine. If people aren’t allowed to build tools on top of R and sell them anymore, this boat goes down and we’re all in it.

      As for the point about R being a re-implementation of S – I hadn’t considered that but it’s good food for thought and certainly relevant.
