Julien on R shortcomings

Julien Cornebise posted a rather detailed set of comments (from Jasper!) that I thought was interesting and thought-provoking enough (!) to promote to a guest post. Here it is, then, to keep the debate rolling (with my only censoring being the removal of smileys!). (Please keep in mind that I do not endorse everything stated in this guest post! Especially the point on “Use R!”)

On C vs R
As a reply to Duncan: indeed C (at least for the bottlenecks) will probably always be faster for the final, mainstream use of an algorithm [e.g. as a distributed R library, or a standalone program]: machine-level code, smart compilers, and so on. The same goes for Matlab, and even for Python: e.g. Pierre Jacob (Xian’s great PhD student) uses Weave to inline C in his Python code for the bottlenecks — simple, and fast. Some hedge funds even hire coders to recode the Matlab code of their consulting academic statisticians.

Point taken. But, as Radford Neal points out, that doesn’t justify R being much slower than it could be:

  • When statisticians (cf. Xian) want to develop/prototype new algorithms and methods while focusing on the math/stat/algo more than on the language-dependent implementation, it is still a shame to waste 50% (or even 25%) of the running time (see the sketch after this list). The same goes for memory management, or even for some language features [1].
  • Even less computer-savvy users of R, who want to apply existing algorithms (not develop new ones) to big/intricate real-world datasets, can be put off by the slow speed — or even by memory failures.
  • And the library is BRILLIANT.
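
To make the overhead concrete, a minimal sketch comparing an interpreted loop to the equivalent C-level primitive (timings are indicative only and vary by machine and R version):

    x <- runif(1e6)

    # an interpreted loop pays the interpreter's dispatch cost at every iteration
    slow_sum <- function(v) {
      s <- 0
      for (i in seq_along(v)) s <- s + v[i]
      s
    }

    system.time(slow_sum(x))   # typically orders of magnitude slower ...
    system.time(sum(x))        # ... than the C-level primitive sum()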

On Future Language vs R
Thanks David and Martyn for the link to Ihaka’s great paper on a Lisp-based, R-like system. It says things better than I could, and with an expertise on R that I don’t have. I also didn’t know about Robert Gentleman and his success at Harvard (but then he *invented* the thing, he didn’t merely tune it up).

Developing a whole new language and concept, as advocated in Ihaka’s paper and as suggested by gappy3000, would be a great leap forward, and a needed breakthrough to change the way we do computational stats. I would *love* to see that, as I personally think (as Ihaka advocates in the paper you link to) that R, as a language, is a hell of a pain [2], and I am saddened to see a lot of “Use R” books that will entrench its inadequate use for needs where the language hardly fits the bill — although the library does.

But R is here and in everyday use, and the matter is more one of making it worth using, to its full potential. I have no special attachment to R, but any breakthrough language that was not compatible with the massive library contributed over the years would be doomed to fail to win over the everyday statistician — and we are talking here about far-fetched, long-term moves. A salutary breakthrough, but one that is harder to make happen when such an anchor is in place.
I would say that R has turned into the Fortran of statistics: here to stay, anchored by the inertia that stems from its intrinsic (and widely acknowledged) merits (I’ve been nice, I didn’t say Cobol).

So until the great leap forward comes (or until we make it happen as a community), I second Radford Neal’s call for optimization of the existing core of R.

Rejoinder
As a rejoinder to the comments here, I think we need to consider separately:

  1. R’s brilliant library
  2. R’s not-so-brilliant language and/or interpreter.

It seems to me from this discussion that the community needs/should push for, in chronological order:

  1. First, a speed-up of R’s existing interpreter, as called for by Radford Neal: an “easy”, short-term task for well-meaning amateur coders or, better, for solid CS people.
  2. Team up with CS experts interested in developing tools for computational statistics.
  3. With them, get out of the now dead-ended R language and embark on a new statistical framework based on an *existing*, proven language. It *must* be able to reuse the brilliant R library/codes contributed by the community over the years; failing that would mean losing the userbase and dying in limbo. That is more or less what Ihaka calls for (except for his doubts on backward compatibility; see Section 7 of his paper). Much harder and longer-term, but worth it.

From then on
Who knows the R community well enough to relay this call and make it happen? I’m out of my league.

Uninteresting footnotes:
[1] I have twitched several times when trying R, feeling the coding was somewhat unnatural from a CS point of view. [Mind, I twitch all the same, although on other points, with Matlab]
[2] again, I speak only from the few tries I gave it; since I gave up using it for my everyday work, I am biased — and ignorant


13 Responses to “Julien on R shortcomings”

  1. [...] Paris statisticians are recipients of the Savage award this year: Julien Cornebise (PhD from Telecom-Paristech with Eric Moulines, now at UCL in Mark Girolami’s group) is the [...]

  2. [...] lectures and meetings. Once again, this evaluation is fairly special to the local conditions. As Julien commented, this reflects more on the faculty than on the Ph.D. [...]

  3. [...] a new discussion has been sparked, followed by several replies, among which those of Christian Robert, Dirk Eddelbuettel, and Andrew [...]

  4. [...] Given the directions drafted in this comment from the father of R (along with Robert Gentleman), I once again re-post this comment as a main entry to advertise more broadly its contents. (Obviously, the whole [...]

  5. Honestly, I think the chances of moving R to another language are slim at best.

    The real value of R is CRAN. Basically, you would need a fully functioning interpreter in the language of choice. There is little value in half of the packages working half of the time, so that means that every single oddity and quirk in R/S that accumulated over the last decades has to be reimplemented (you may be able to exclude well-defined subsets of the language, such as S4, but you still need to make very sure that you either produce the same results or produce an error).

    This is a Herculean task. R, being intended for interactive use, lets you define, redefine and modify everything at runtime (class defs, environments, function definitions, …). That is a lot of state to keep track of, and even small changes in behavior will cause hard-to-find bugs (and judging from the discussions on R-devel, the inner workings of S4 et al are not straightforward, even to the pros on the list).
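
    To make that concrete, a minimal sketch of the runtime mutability any reimplementation would have to reproduce faithfully:

      f <- function() "original"
      f()                                # [1] "original"
      f <- function() "redefined"        # functions are ordinary, rebindable values
      f()                                # [1] "redefined"

      e <- new.env()                     # environments are mutable state
      assign("state", 1, envir = e)
      local(state <- state + 1, envir = e)
      get("state", envir = e)            # [1] 2

      "+" <- function(a, b) .Primitive("+")(a, b) * 2   # even operators can be shadowed
      1 + 1                              # [1] 4
      rm("+")                            # restore the built-in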

    To see just how hard that is, read some of the Perl 6 discussions (by the way, it’s still in development after 10 years, and Perl has pretty much fallen off the radar during that time).

    If anything, a reverse approach may be promising:
    Ship R-3.0 with a Scala (or whatever language fits best) runtime and a seamless Scala-to-R interface, so that people can start writing their code in Scala. Ideally, add an R-to-Scala interface as well, so that packages written in Scala can be used by “old” R code. This is basically the third approach from the paper.
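
    For a taste of what such a bridge looks like in practice, the existing rJava package already exposes the JVM (and hence anything compiled to it, Scala included) from the R prompt. A minimal sketch, assuming rJava and a JVM are installed:

      library(rJava)
      .jinit()                                   # start an in-process JVM
      s <- .jnew("java/lang/String", "hello from the JVM")
      .jcall(s, "I", "length")                   # call a Java method: [1] 18
      .jcall(s, "Ljava/lang/String;", "toUpperCase")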

    It goes without saying that this is technically non-trivial, but imo still less of a nightmare than reimplementing R from scratch (the code translation approach also mentioned in the paper is no less difficult).

    This kind of “soft transition” also helps with the social aspect, i.e. you offer the new language as a future path without leaving behind your userbase. (A good example of how not to do it is the Netscape-Mozilla transition: they decided to start from scratch with Netscape 6/Mozilla, and because that meant no updates for Netscape 4.7 for a couple of years, they also had to start from scratch with their userbase.)

  6. We begin to enter the deep waters of CS flame wars, but I did a bit of poking around, and if I were to name a new language on which to build New R, I would propose Scala instead of Lisp (Clojure/Incanter) or Python.

    To simplify it into a movie pitch: Scala is Java with its warts removed, put on steroids, and with the best things that R/S gained from their Lisp heritage sprinkled throughout. You can use it as an interactive command line and as a scripting language, as with R, but it also compiles down to bytecode that runs anywhere Java can, so it’s familiar to the many Java programmers out there, and it can directly call Java and be called by Java. (That gives it all of the advantages of Clojure (Incanter), but without the Lisp syntax that turns many people off.) And it’s well designed for multi-threaded, multi-core programming, to give us even more speed. Last, it has a core development team (like R, and unlike Clojure and Incanter, which are well done but apparently one-man shows).

    The only thing against it is that some have painted it as “too complex”. (I disagree completely, but that would venture too far afield of the “R replacement” topic.)

    I’m still skeptical that you could port R’s two greatest assets — its user community and the CRAN library — to another foundation. But if we would want to try, Scala feels like the best candidate for current technological reasons and also because it really strikes me as what S would have been if it were designed today.

  7. [...] Julien Cornebise (via Christian Robert): On R Shortcomings [...]

  8. 1. I find R’s lazy evaluation plus dynamic scoping really confusing. For instance, in ggplot2, if you call qplot(x, y*y, data=f), it evaluates x and y*y lazily. The documented convention in ggplot2 is to evaluate them in the frame f if they are defined there, and otherwise in the calling context; but what’s really going on in R is that the frame f is attached as the local environment, nested within the function’s environment, nested within the calling environment.

    I find the lexical scoping of languages like C much easier to understand from a programmer’s perspective.
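
    To illustrate the mechanism (eval_in_data() is a hypothetical helper, not ggplot2’s actual code): the unevaluated expression is captured with substitute() and evaluated with the data frame as the innermost scope, falling back to the caller’s environment for anything the frame does not define:

      eval_in_data <- function(expr, data) {
        # `data` is searched first; `enclos` supplies the fallback scope
        eval(substitute(expr), data, enclos = parent.frame())
      }

      d <- data.frame(x = 1:3)
      k <- 10                     # defined in the calling environment
      eval_in_data(x * k, d)      # x from d, k from the caller: [1] 10 20 30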

    2. When we’re talking about R’s brilliant libraries, are these libraries written in R, or libraries written in C or Fortran and linked from R? For the former, you need to translate arbitrary R programs.

  9. Patrick Caldon Says:

    From a CS point of view, one important fact about R is lazy evaluation. It’s functional like Lisp, but lazy like (extended) ML and Haskell.

    There are CS people thinking hard about parallelism and large data sets in these languages.

    Personally, I find it much easier to write correct code in Haskell or ML than in C-like or OO languages (or untyped Lisp). And the compilers are surprisingly good.

    I know there is a tendency with languages, whenever someone says “we should redo everything in obscure language X”, for someone else to say “no! we should redo everything in obscure language Y”. But laziness is powerful and already in R: you will need some kind of lazy evaluation to make a “translator” work.
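
    To see that laziness in action, a minimal sketch of R’s promise mechanism (each argument is evaluated only when its value is first needed):

      f <- function(x, y) x           # y is never used ...
      f(1, stop("never raised"))      # ... so the stop() never fires: [1] 1

      g <- function(x) {
        cat("before forcing\n")
        force(x)                      # force() evaluates the promise now
        cat("after forcing\n")
      }
      g({ cat("evaluating x\n"); 42 })
      # prints: before forcing / evaluating x / after forcing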

  10. Thanks for the honour, Xi’an! Since the posting of what was originally a comment, Dirk Eddelbuettel has seconded the mention of inline coding for speeding up existing code (although that does not answer everything), by means of Rcpp (which he and Romain Francois developed).
    And sorry for the smileys.
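
    For readers who have not seen the pattern, a minimal sketch of inline coding with Rcpp (assuming the inline and Rcpp packages and a C++ toolchain are installed):

      library(inline)

      # compile a small C++ kernel at the prompt and get back an R function
      src <- '
        Rcpp::NumericVector v(x);
        double s = 0;
        for (int i = 0; i < v.size(); i++) s += v[i];
        return Rcpp::wrap(s);
      '
      csum <- cxxfunction(signature(x = "numeric"), body = src, plugin = "Rcpp")

      csum(runif(1e6))   # the hot loop now runs at C++ speed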
