Archive for open source

limited shelf validity

Posted in Books, pictures, Statistics, University life on December 11, 2019 by xi'an

A great article by Steve Stigler in the new, multi-scaled, and so exciting Harvard Data Science Review, magisterially operated by Xiao-Li Meng, on the limitations of old datasets. Illustrated by three famous datasets used by three equally famous statisticians: Quetelet, Bortkiewicz, and Gosset, none of whom was fundamentally interested in the data for their own sake. First, Quetelet (wrongly) reconstructed his data and thereby missed the opportunity to beat Galton at discovering correlation. Second, Bortkiewicz went looking (or even cherry-picking!) for rare events in yearly mortality tables minutely divided between causes such as military horse kicks. The third dataset is not Guinness', but a comparison of two sleeping pills, run rather crudely on inmates of a psychiatric institution in Kalamazoo, with further mishandling by Gosset himself. Manipulations that turn the data into dead data, as Steve puts it. (And which he illustrates with the skull collection picture above, along with a warning against attempts at resuscitating dead data into what could be called "zombie data".)

“Successful resurrection is only slightly more common than in Christian theology.”

His global perspective on dead data is that they should stop being used rather than have their (shelf) life extended by turning into benchmarks recycled over and over as proofs of concept. If only (my two cents) because this leads to calibrating (and choosing) methods that do well on these very benchmarks. Another example that could have been added to the skulls above is the Galaxy Velocity Dataset that makes frequent appearances in papers estimating Gaussian mixtures. Which Radford Neal pointed out at the 2001 ICMS workshop on mixture estimation as an inappropriate use of the dataset, since astrophysical arguments weighed against mixture modelling.
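
To make the recycling concrete, here is a minimal sketch in R (assuming the MASS and mclust packages are installed) of the kind of Gaussian mixture fit routinely applied to these 82 galaxy velocities, the very use Radford Neal objected to:

library(MASS)    # ships the 'galaxies' dataset (velocities in km/s)
library(mclust)  # model-based clustering via Gaussian mixtures

data(galaxies)
vel <- galaxies / 1000  # rescale to 1000 km/s, as most analyses do
# (the MASS copy reportedly contains a typo relative to Roeder, 1990,
#  a fitting illustration of dead data drifting from its source)

fit <- Mclust(vel, G = 1:10)  # BIC-based choice of the number of components
summary(fit)                  # reports the selected mixture

plot(fit, what = "density")   # fitted mixture density
rug(vel)                      # data points along the axis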

“…the role of context in shaping data selection and form—context in temporal, political, and social as well as scientific terms—has been shown to be a powerful and interesting phenomenon.”

The potential for "dead-er" data (my neologism!) increases with the epoch, in that the careful sleuthing Steve (and others) conducted on these historical datasets is absolutely impossible with the current massive datasets. Massive and proprietary. And presumably discarded once the associated neural net is designed and sold. Leaving the burden of unmasking the potential (or highly probable?) biases to others. Most interestingly, this intersects with a "comment" in Nature of 17 October by Sabina Leonelli on the transformation of data from a national treasure into a commodity whose "ownership can confer and signal power". But her call for openness and governance of research data seems as illusory as other attempts to sever the GAFAs from their extra-territorial privileges…

R wins COPSS Award!

Posted in Statistics on August 4, 2019 by xi'an

Hadley Wickham of RStudio has won the 2019 COPSS Award, a rather radical departure from the traditional recipients of this award in that it recognises his many contributions to the R language and in particular to RStudio. The full citation for the nomination is his "influential work in statistical computing, visualisation, graphics, and data analysis", including "making statistical thinking and computing accessible to a large audience". With the last part possibly a recognition of the appeal of Open Source… (I was not in Denver for the awards ceremony, having left after the ABC session on Monday morning. Unfortunately, this session attracted only a few souls, due to the competition from twentysome other sessions, including, no less!, David Dunson's Medallion Lecture and Michael Lavine's Introductory Overview Lecture on the likelihood principle. And Marco Ferreira's short course on Bayesian time series. This is the way the joint meeting goes, but it is disappointing to reach so few people.)

Journal of Open Source Software

Posted in Books, R, Statistics, University life on October 4, 2016 by xi'an

A week ago, I received a request to referee a paper for the Journal of Open Source Software, which I had never seen (or heard of) before. The concept is quite interesting, with a scope much broader than statistical computing (indeed, I do not know anyone on the board and no one there seems affiliated with a Statistics department). Papers are very terse, describing the associated code in a page or two, and the purpose of refereeing is to check the code. (I was asked to evaluate an MCMC R package but declined for lack of time.) This is a pretty light task if the code is friendly enough to run right away and comes with demos. Best of luck to this endeavour!
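
As a rough illustration of what such a light-touch check might involve (the package name "mcmcpkg" and the demo and function names below are purely hypothetical), the whole review can boil down to a few lines of R:

install.packages("mcmcpkg")  # hypothetical package under review
library(mcmcpkg)
demo(package = "mcmcpkg")    # list the demos shipped with the package
demo("quickstart", package = "mcmcpkg", ask = FALSE)  # run one end to end
example("sample_posterior", package = "mcmcpkg")      # run a help-page example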

grateful if you could give us your expert opinion [give as in gift]

Posted in Statistics on December 12, 2015 by xi'an

I received this mail today about refereeing a paper for yet another open-source "publisher" and went and checked that the F1000Research business model was, as suspected, another of those websites charging large amounts for publishing. At least they ask real referees…

Dear Christian,

You have been recommended by so-and-so as being an expert referee for their article “dis-and-dat” published in F1000Research. Please would you provide a referee report for this article? The abstract is included at the end of this email and the full article is available here.

F1000Research is a unique open science publishing platform that was set up as part of Faculty of 1000 (by the same publisher who created BioMed Central and previously the Current Opinion journals). Our advisors include the Nobel Prize winners Randy Schekman and Sir Tim Hunt, Steve Hyman, Edward Benz, and many more.

F1000Research is aiming to reshape scientific publishing: articles are published rapidly after a careful editorial check, and formal peer review takes place openly after publication. Articles that pass peer review are indexed in PubMed and PubMed Central. Referees receive full credit for their contribution as their names, affiliations and comments are permanently attached to the article and each report is assigned a DOI and therefore easily citable.

We understand that you have a lot of other commitments, but we would be very grateful if you could give us your expert opinion on this article. We would of course be happy for a colleague (for example, someone in your group) to help prepare the report and be named as a co-referee with you.

run my code [guest post]

Posted in Statistics on July 18, 2012 by xi'an

(This guest post was written by Nicolas Chopin.)

I have been contacted by Christophe Pérignon, a professor of Finance at HEC and co-founder of RunMyCode.org, a very interesting initiative that deserves to be publicised widely. Essentially, it is arXiv for scientific code. You can create a "companion web-site" for each of your projects, post your code (with links to the corresponding paper), and let users run your code in the "cloud", with their own data. All of this through a simple web-page interface.

I have not tried it yet, and I still wonder whether it is not too good to be true; for instance, I wonder what happens if too many people post computer-intensive programs that take hours to complete. But Christophe tells me there is some badass hardware behind the project (a big server from CNRS); they are also backed by prestigious institutions (Columbia, NSF, CNRS, etc.).

But the idea is certainly excellent, and looks like the next step in reproducible research. (One of the co-founders is Victoria Stodden, an assistant professor of Statistics at Columbia and a well-known advocate of reproducible and open research.) One could also use it to illustrate an idea at a conference, or during a course.

The project was started by people in Economics and Business. There are still some references to this on the web site, and indirectly through the list of currently supported languages (Matlab, R, and… RATS!). But Christophe tells me they want to reach further: they already have projects in image analysis, for instance, and are apparently open to other computer languages (e.g., Python) if there is some demand.

It is going to be really interesting to see how much steam this project gathers, in our field and beyond. Perhaps this is the start of a new trend where we run more and more of our programs "in the cloud", with the added benefits of openness and simplicity. We live in exciting times!