Archive for R

another R new trick [new for me!]

Posted in Books, Kids, R, Statistics, University life with tags , , , , , , , on July 16, 2014 by xi'an

La Defense, Dec. 10, 2010While working with Andrew and a student from Dauphine on importance sampling, we wanted to assess the distribution of the resulting sample via the Kolmogorov-Smirnov measure

\max_x |\hat{F_n}(x)-F(x)|

where F is the target.  This distance (times √n) has an asymptotic distribution that does not depend on n, called the Kolmogorov distribution. After searching for a little while, we could not figure where this distribution was available in R. It had to, since ks.test was returning a p-value. Hopefully correct! So I looked into the ks.test function, which happens not to be entirely programmed in C, and found the line

PVAL <- 1 - if (alternative == "two.sided") 
                .Call(C_pKolmogorov2x, STATISTIC, n)

which means that the Kolmogorov distribution is coded as a C function C_pKolmogorov2x in R. However, I could not call the function myself.

> .Call(C_pKolmogorov2x,.3,4)
Error: object 'C_pKolmogorov2x' not found

Hence, as I did not want to recode this distribution cdf, I posted the question on stackoverflow (long time no see!) and got a reply almost immediately as to use the package kolmim. Followed by the extra comment from the same person that calling the C code only required to add the path to its name, as in

> .Call(stats:::C_pKolmogorov2x,STAT=.3,n=4)
[1] 0.2292

implementing reproducible research [short book review]

Posted in Books, Kids, pictures, R, Statistics, Travel, University life with tags , , , , , , , , , , , on July 15, 2014 by xi'an

As promised, I got back to this book, Implementing reproducible research (after the pigeons had their say). I looked at it this morning while monitoring my students taking their last-chance R exam (definitely last chance as my undergraduate R course is not reconoduced next year). The book is in fact an edited collection of papers on tools, principles, and platforms around the theme of reproducible research. It obviously links with other themes like open access, open data, and open software. All positive directions that need more active support from the scientific community. In particular the solutions advocated through this volume are mostly Linux-based. Among the tools described in the first chapter, knitr appears as an alternative to sweave. I used the later a while ago and while I like its philosophy. it does not extend to situations where the R code within takes too long to run… (Or maybe I did not invest enough time to grasp the entire spectrum of sweave.) Note that, even though the book is part of the R Series of CRC Press, many chapters are unrelated to R. And even more [unrelated] to statistics.

This limitation is somewhat my difficulty with [adhering to] the global message proposed by the book. It is great to construct such tools that monitor and archive successive versions of code and research, as anyone can trace back the research steps conducting to the published result(s). Using some of the platforms covered by the book establishes for instance a superb documentation principle, going much further than just providing an “easy” verification tool against fraudulent experiments. The notion of a super-wiki where notes and preliminary versions and calculations (and dead ends and failures) would be preserved for open access is just as great. However this type of research processing and discipline takes time and space and human investment, i.e. resources that are sparse and costly. Complex studies may involve enormous amounts of data and, neglecting the notions of confidentiality and privacy, the cost of storing such amounts is significant. Similarly for experiments that require days and weeks of huge clusters. I thus wonder where those resources would be found (journals, universities, high tech companies, …?) for the principle to hold in full generality and how transient they could prove. One cannot expect the research time to garantee availability of those meta-documents for remote time horizons. Just as a biased illustration, checking the available Bayes’ notebooks meant going to a remote part of London at a specific time and with a preliminary appointment. Those notebooks are not available on line for free. But for how long?

“So far, Bob has been using Charlie’s old computer, using Ubuntu 10.04. The next day, he is excited to find the new computer Alice has ordered for him has arrived. He installs Ubuntu 12.04″ A. Davison et al.

Putting their principles into practice, the authors of Implementing reproducible research have made all chapters available for free on the Open Science Framework. I thus encourage anyone interesting in those principles (and who would not be?!) to peruse the chapters and see how they can benefit from and contribute to open and reproducible research.

R/Rmetrics in Paris [alas!]

Posted in Mountains, pictures, R, Statistics, Travel, University life with tags , , , , , , , , , , , , , , , , , , , on June 30, 2014 by xi'an

Bernard1Today I gave a talk on Bayesian model choice in a fabulous 13th Century former monastery in the Latin Quarter of Paris… It is the Collège des Bernardins, close to Jussieu and Collège de France, unbelievably hidden to the point I was not aware of its existence despite having studied and worked in Jussieu since 1982… I mixed my earlier San Antonio survey on importance sampling approximations to Bayes factors with an entry to our most recent work on ABC with random forests. This was the first talk of the 8th R/Rmetrics workshop taking place in Paris this year. (Rmetrics is aiming at aggregating R packages with econometrics and finance applications.) And I had a full hour and a half to deliver my lecture to the workshop audience. Nice place, nice people, new faces and topics (and even andouille de Vire for lunch!): why should I complain with an alas in the title?!Bernard2What happened is that the R/Rmetrics meetings have been till this year organised in Meielisalp, Switzerland. Which stands on top of Thuner See and… just next to the most famous peaks of the Bernese Alps! And that I had been invited last year but could not make it… Meaning I lost a genuine opportunity to climb one of my five dream routes, the Mittelegi ridge of the Eiger. As the future R/Rmetrics meetings will not take place there.

A lunch discussion at the workshop led me to experiment the compiler library in R, library that I was unaware of. The impact on the running time is obvious: recycling the fowler function from the last Le Monde puzzle,

> bowler=cmpfun(fowler)
> N=20;n=10;system.time(fowler(pred=N))
   user  system elapsed 
 52.647   0.076  56.332 
> N=20;n=10;system.time(bowler(pred=N))
   user  system elapsed 
 51.631   0.004  51.768 
> N=20;n=15;system.time(bowler(pred=N))
   user  system elapsed 
 51.924   0.024  52.429 
> N=20;n=15;system.time(fowler(pred=N))
   user  system elapsed 
 52.919   0.200  61.960 

shows a ten- to twenty-fold gain in system time, if not in elapsed time (re-alas!).

Statistical modeling and computation [apologies]

Posted in Books, R, Statistics, University life with tags , , , , , , , , , , , on June 11, 2014 by xi'an

In my book review of the recent book by Dirk Kroese and Joshua Chan,  Statistical Modeling and Computation, I mistakenly and persistently typed the name of the second author as Joshua Chen. This typo alas made it to the printed and on-line versions of the subsequent CHANCE 27(2) column. I am thus very much sorry for this mistake of mine and most sincerely apologise to the authors. Indeed, it always annoys me to have my name mistyped (usually as Roberts!) in references.  [If nothing else, this typo signals it is high time for a change of my prescription glasses.]

Le Monde puzzle [#865]

Posted in Books, Kids, R with tags , , on May 6, 2014 by xi'an

A Le Monde mathematical puzzle in combinatorics:

Given a permutation σ of {1,…,5}, if σ(1)=n, the n first values of σ are inverted. If the process is iterated until σ(1)=1, does this always happen and if so what is the maximal  number of iterations? Solve the same question for the set {1,…,2014}.

I ran the following basic R code:

N=5
tm=0
for (i in 1:10^6){
   sig=sample(1:N) #random permutation
   t=0;while (sig[1]>1){
     n=sig[1]
     sig[1:n]=sig[n:1]
     t=t+1}
   tm=max(t,tm)}

obtaining 7 as the outcome. Here is the evolution of the maximum as a function of the number of terms in the set. If we push the regression to N=2014, the predicted value is around 600,000… Running a million simulations of the above only gets me to 23,871!lemonde865A wee minutes of reflection lead me to conjecture that the maximum number of steps wN should be satisfy wN=wN-1+N-2. However, the values resulting from the simulations do not grow as fast. (And, as Jean-Louis Fouley commented, it does not even work for N=6.) Monte Carlo effect or true discrepancy?

lemonde865b

Le Monde puzzle [#869]

Posted in Books, Kids, R, Statistics, University life with tags , , , , , , , on April 27, 2014 by xi'an

A Le Monde mathematical puzzle once again in a Sudoku mode:

In an nxn table, all integers between 1 and n appear n times. If max denotes the maximum over the numbers of different integers on all rows and columns,  what is the minimum value of max when n=7? when n=11?

I tried to solve it by the following R code (in a pre-breakfast mode in my Reykjavik Airbnb flat!):

#pseudoku
n=7
T=10^4
vals=rep(1:n,n)
minmax=n
for (t in 1:T){
  psudo=matrix(sample(vals),ncol=n)
  maxc=maxr=max(sapply(apply(psudo,1,unique),length))
  if (maxc<minmax)
     maxr=max(sapply(apply(psudo,2,unique),length))
  minmax=min(minmax,max(maxc,maxr))
  }

but later realised that (a) the

sapply(apply(psudo,1,unique),length)

failed when all rows or all columns had the same number of unique terms and (b) I did not have to run the whole matrix:

vals=rep(1:n,n)
minmax=n
for (t in 1:T){
  psudo=matrix(sample(vals),ncol=n)
  maxc=max(length(unique(psudo[1,])),length(unique(psudo[,1])))
  i=1
  while((i<n)&(maxc<minmax)){
   i=i+1
   maxc=max(maxc,
        length(unique(psudo[i,])),
        length(unique(psudo[,i])))}
  minmax=min(minmax,maxc)
  }

gaining a factor of 3 in the R execution. With this random exploration, the minimum value returned was 2,2,2,3,4,5,5,6,7,8 for n=2,3,4,5,6,7,8,9,10,11. Half-hearted simulating annealing during one of the sessions of AISTATS 2014 did not show any difference…

Le Monde sans puzzle [& sans penguins]

Posted in Books, Kids, R, University life with tags , , , , , on April 12, 2014 by xi'an

As the Le Monde mathematical puzzle of this week was a geometric one (the quadrangle ABCD is divided into two parts with the same area, &tc…) , with no clear R resolution, I chose to bypass it. In this April 3 issue, several items of interest: first, a report by Etienne Ghys on Yakov Sinaï’s Abel Prize for his work “between determinism and randomness”, centred on ergodic theory for dynamic systems, which sounded like the ultimate paradox the first time I heard my former colleague Denis Bosq give a talk about it in Paris 6. Then a frightening fact: the summer conditions have been so unusually harsh in Antarctica (or at least near the Dumont d’Urville French austral station) that none of the 15,000 Adélie penguin couples studied there managed to keep their chick alive. This was due to an ice shelf that did not melt at all over the summer, forcing the penguins to walk an extra 40k to reach the sea… Another entry on the legal obligation for all French universities to offer a second chance exam, no matter how students are evaluated in the first round. (Too bad, I always find writing a second round exam a nuisance.)

Follow

Get every new post delivered to your Inbox.

Join 641 other followers