Archive for R

Le Monde puzzle [#887]

Posted in Books, Kids, R, Statistics on November 15, 2014 by xi'an

A simple combinatorics Le Monde mathematical puzzle:

N is a golden number if the sequence {1,2,…,N} can be reordered so that the sum of any consecutive pair is a perfect square. What are the golden numbers between 1 and 25?

Indeed, from an R programming point of view, all I have to do is to go over all possible permutations of {1,2,…,N} until one works or until I have exhausted all possible permutations for a given N. However, 25!≈1.55×10²⁵ is a wee bit too large… Instead, I resorted once again to brute force simulation, by first introducing possible neighbours of the integers

  for (perm in 1:N){

and then proceeding to construct the permutation one integer at a time by picking from its remaining potential neighbours until there is none left or the sequence is complete

for (perm in 1:N){
while (t<N){
  if (length(friends[[orderin[t]]])==0)
  if (length(friends[[orderin[t]]])>1){
  for (perm in 1:N){

and then repeating this attempt until a full sequence is produced or a certain number of failed attempts has been reached. I gained in efficiency by proposing a second completion on the left of the first integer once a break occurs:

while (t<N){
  if (length(friends[[orderin[1]]])==0)
  if (length(friends[[orderin[2]]])>1){
  for (perm in 1:N){

(An alternative would have been to complete left and right by squared numbers taken at random…) The result of running this program showed there exist permutations with the above property for N=15,16,17,23,25,26,…,77.  Here is the solution for N=49:

25 39 10 26 38 43 21 4 32 49 15 34 30 6 3 22 42 7 9 27 37 12 13 23 41 40 24 1 8 28 36 45 19 17 47 2 14 11 5 44 20 29 35 46 18 31 33 16 48
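The construction described above (neighbour lists, random extension on the right, then a second completion on the left, repeated until success) can be sketched as follows. This is a minimal reconstruction under my own assumptions, not the code from the post: the names `square_path`, `nbrs`, `pick` and the default `max_tries` are all hypothetical.

```r
# Sketch of the randomised search; all names are illustrative assumptions.
square_path <- function(N, max_tries = 1e4) {
  # perfect squares that a sum of two integers in 1..N can reach
  squares <- (1:floor(sqrt(2 * N)))^2
  # nbrs[[i]]: integers j in 1..N, j != i, such that i + j is a perfect square
  nbrs <- lapply(1:N, function(i) {
    cand <- squares - i
    cand[cand >= 1 & cand <= N & cand != i]
  })
  # uniform draw from a vector, safe when the vector has length one
  pick <- function(v) v[sample.int(length(v), 1)]
  for (try in seq_len(max_tries)) {
    path <- sample.int(N, 1)                 # random starting integer
    repeat {
      # extend on the right with an unused neighbour of the last integer
      free <- setdiff(nbrs[[path[length(path)]]], path)
      if (length(free) > 0) { path <- c(path, pick(free)); next }
      # right end is stuck: try to extend on the left of the first integer
      free <- setdiff(nbrs[[path[1]]], path)
      if (length(free) > 0) { path <- c(pick(free), path); next }
      break                                  # stuck on both ends
    }
    if (length(path) == N) return(path)      # full sequence found
  }
  NULL                                       # nothing found within max_tries
}

set.seed(1)
square_path(15)
```

A `NULL` result only means that no sequence was found within `max_tries` attempts, which is not a proof that none exists.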

As an aside, the authors of Le Monde puzzle claimed (in the Tuesday, Nov. 12, edition) that there was no solution for N=23, while this sequence

22 3 1 8 17 19 6 10 15 21 4 12 13 23 2 14 11 5 20 16 9 7 18

sounds fine enough to me… More generally, I wonder about the principle behind the existence of such sequences. It sounds quite probable that they exist for all N>24. (The published solution does not shed any light on this issue, so I assume the authors have no mathematical analysis to provide.)
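Checking a candidate sequence is straightforward. The helper below is a small sketch (the name `is_square_chain` is mine, not from the post) that tests whether a vector is a permutation of 1,…,N with all consecutive sums being perfect squares, applied to the N=23 sequence quoted above.

```r
# TRUE when s is a permutation of 1..length(s) whose consecutive sums are squares
is_square_chain <- function(s) {
  sums <- s[-1] + s[-length(s)]     # sums of consecutive pairs
  roots <- round(sqrt(sums))
  all(sort(s) == seq_along(s)) && all(roots^2 == sums)
}

s23 <- c(22, 3, 1, 8, 17, 19, 6, 10, 15, 21, 4, 12, 13, 23,
         2, 14, 11, 5, 20, 16, 9, 7, 18)
is_square_chain(s23)   # TRUE
```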

The winds of Winter [Bayesian prediction]

Posted in Books, Kids, R, Statistics, University life on October 7, 2014 by xi'an

A surprising entry on arXiv this morning: Richard Vale (from Christchurch, NZ) has posted a paper about the characters appearing in the yet hypothetical next volume of George R.R. Martin’s Song of ice and fire series, The winds of Winter [not even up for pre-sale on amazon!]. Using the previous five books in the series and the frequency of occurrence of characters’ point of view [each chapter being told from the point of view of one single character], Vale proceeds to model the number of occurrences in a given book by a truncated Poisson model,

x_{it} \sim \mathcal{P}(\lambda_i)\text{ if }|t-\beta_i|<\tau_i

in order to account for [most] characters dying at some point in the series. All parameters are endowed with prior distributions, including the terrible “large” hyperpriors familiar to BUGS users… Despite the code being written in R by the author. The modelling does not use anything but the frequencies of the previous books, so knowledge that characters like Eddard Stark had died is not exploited. (Nonetheless, the prediction gives zero chapters to this character in the coming volumes.) Interestingly, a character who seemingly died at the end of the last book is still given a 60% probability of having at least one chapter in The winds of Winter [no spoiler here, but many in the paper itself!]. As pointed out by the author, the model as such does not allow for prediction of new-character chapters, which remain likely given Martin’s storytelling style! Vale still predicts 11 new-character chapters, which seems high considering the series should be over in two more books [and an unpredictable number of years!].
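To make the model concrete, here is a minimal simulation sketch. It assumes, which the displayed formula does not state, that a character gets zero chapters in any book t outside the window |t−β_i|<τ_i. The function name `sim_pov` and the parameter values are made up for illustration; none of this is Vale's code.

```r
# Simulate chapter counts for one character over a set of books.
# Assumption: zero chapters whenever |t - beta| >= tau.
sim_pov <- function(lambda, beta, tau, books = 1:5) {
  active <- abs(books - beta) < tau          # books where the character is "alive"
  ifelse(active, rpois(length(books), lambda), 0L)
}

set.seed(2014)
sim_pov(lambda = 8, beta = 1, tau = 1.5, books = 1:7)  # zero from book 3 onwards
```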

As an aside, this paper makes use of the truncnorm R package, which I did not know and which is based on John Geweke’s accept-reject algorithm for truncated normals that I (independently) proposed a few years later.
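For readers unfamiliar with this kind of algorithm, here is a minimal sketch of an accept-reject step for a standard normal truncated to [a,∞) with a>0, using a translated exponential proposal. It is only an illustration of the idea: the function name `rtnorm_tail` is hypothetical and this is not the code of the truncnorm package.

```r
# Standard normal restricted to [a, Inf), a > 0, by accept-reject with a
# translated exponential proposal (illustrative sketch only).
rtnorm_tail <- function(n, a) {
  stopifnot(a > 0)
  alpha <- (a + sqrt(a^2 + 4)) / 2           # rate maximising the acceptance probability
  out <- numeric(n)
  for (i in seq_len(n)) {
    repeat {
      z <- a + rexp(1, rate = alpha)         # exponential proposal shifted to start at a
      if (runif(1) <= exp(-(z - alpha)^2 / 2)) break   # accept
    }
    out[i] <- z
  }
  out
}

set.seed(42)
x <- rtnorm_tail(2000, a = 2)
c(mean(x), dnorm(2) / pnorm(2, lower.tail = FALSE))    # empirical vs exact mean
```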

a weird beamer feature…

Posted in Books, Kids, Linux, R, Statistics, University life on September 24, 2014 by xi'an

As I was preparing my slides for my third year undergraduate stat course, I got a weird error that took a search on the Web to unravel:

! Extra }, or forgotten \endgroup.
\endframe ->\egroup
  \begingroup \def \@currenvir {frame}
l.23 \end{frame}

which was related to a fragile environment

\begin{frame}[fragile]
\frametitle{simulation in practice}
\begin{itemize}
\item For a given distribution $F$, call the corresponding 
pseudo-random generator in an arbitrary computer language
\begin{verbatim}
> x=rnorm(10)
> x
 [1] -0.021573 -1.134735  1.359812 -0.887579
 [7] -0.749418  0.506298  0.835791  0.472144
\end{verbatim}
\item use the sample as a statistician would
\begin{verbatim}
> mean(x)
[1] 0.004892123
> var(x)
[1] 0.8034657
\end{verbatim}
to approximate quantities related with $F$
\end{itemize}\end{frame}

but not directly the verbatim part: the reason for the bug was that the \end{frame} command did not have a line by itself! Which is one rare occurrence where the carriage return has an impact in LaTeX, as far as I know… (The same bug appears when there is an indentation at the beginning of the line. Weird!) [Another annoying feature is wordpress turning > into &gt; in the sourcecode environment...]

another R new trick [new for me!]

Posted in Books, Kids, R, Statistics, University life on July 16, 2014 by xi'an

[Photo: La Defense, Dec. 10, 2010]

While working with Andrew and a student from Dauphine on importance sampling, we wanted to assess the distribution of the resulting sample via the Kolmogorov-Smirnov measure

\max_x |\hat{F}_n(x)-F(x)|

where F is the target. This distance (times √n) has an asymptotic distribution that does not depend on n, called the Kolmogorov distribution. After searching for a little while, we could not figure out where this distribution was available in R. It had to be, since ks.test was returning a p-value. Hopefully correct! So I looked into the ks.test function, which happens not to be entirely programmed in C, and found the line

PVAL <- 1 - if (alternative == "two.sided") 
                .Call(C_pKolmogorov2x, STATISTIC, n)

which means that the Kolmogorov distribution is coded as a C function C_pKolmogorov2x in R. However, I could not call the function myself.

> .Call(C_pKolmogorov2x,.3,4)
Error: object 'C_pKolmogorov2x' not found

Hence, as I did not want to recode the cdf of this distribution, I posted the question on stackoverflow (long time no see!) and got a reply almost immediately, suggesting the package kolmim. Followed by the extra comment from the same person that calling the C code only required prefixing its name with the package namespace, as in

> .Call(stats:::C_pKolmogorov2x,STAT=.3,n=4)
[1] 0.2292
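As a complement, the limiting Kolmogorov distribution has a classical series representation, K(x) = 1 − 2 Σ_{k≥1} (−1)^{k−1} exp(−2k²x²), which is easy to code directly. The sketch below is an illustration under that formula only: it is the asymptotic cdf, not the exact finite-n computation called above, and the names `pkolm_asym` and `kmax` are mine.

```r
# Asymptotic Kolmogorov cdf via its alternating series (truncated at kmax terms)
pkolm_asym <- function(x, kmax = 100) {
  k <- 1:kmax
  sapply(x, function(u) {
    if (u <= 0) return(0)
    max(0, 1 - 2 * sum((-1)^(k - 1) * exp(-2 * k^2 * u^2)))
  })
}

# Kolmogorov-Smirnov distance of a sample to its target cdf, here N(0,1)
set.seed(1)
x <- rnorm(200)
n <- length(x)
Fx <- pnorm(sort(x))
Dn <- max(seq_len(n) / n - Fx, Fx - (seq_len(n) - 1) / n)

1 - pkolm_asym(sqrt(n) * Dn)   # asymptotic two-sided p-value
```

The value of `Dn` can be compared with `ks.test(x, "pnorm")$statistic`, which computes the same distance.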

implementing reproducible research [short book review]

Posted in Books, Kids, pictures, R, Statistics, Travel, University life on July 15, 2014 by xi'an

As promised, I got back to this book, Implementing reproducible research (after the pigeons had their say). I looked at it this morning while monitoring my students taking their last-chance R exam (definitely last chance as my undergraduate R course is not renewed next year). The book is in fact an edited collection of papers on tools, principles, and platforms around the theme of reproducible research. It obviously links with other themes like open access, open data, and open software. All positive directions that need more active support from the scientific community. In particular the solutions advocated through this volume are mostly Linux-based. Among the tools described in the first chapter, knitr appears as an alternative to Sweave. I used the latter a while ago and, while I like its philosophy, it does not extend to situations where the R code within takes too long to run… (Or maybe I did not invest enough time to grasp the entire spectrum of Sweave.) Note that, even though the book is part of the R Series of CRC Press, many chapters are unrelated to R. And even more [unrelated] to statistics.

This limitation is somewhat my difficulty with [adhering to] the global message proposed by the book. It is great to construct such tools that monitor and archive successive versions of code and research, as anyone can trace back the research steps leading to the published result(s). Using some of the platforms covered by the book establishes for instance a superb documentation principle, going much further than just providing an “easy” verification tool against fraudulent experiments. The notion of a super-wiki where notes and preliminary versions and calculations (and dead ends and failures) would be preserved for open access is just as great. However this type of research processing and discipline takes time and space and human investment, i.e. resources that are scarce and costly. Complex studies may involve enormous amounts of data and, neglecting the notions of confidentiality and privacy, the cost of storing such amounts is significant. Similarly for experiments that require days and weeks of huge clusters. I thus wonder where those resources would be found (journals, universities, high tech companies, …?) for the principle to hold in full generality and how transient they could prove. One cannot expect the research team to guarantee availability of those meta-documents for remote time horizons. Just as a biased illustration, checking the available Bayes’ notebooks meant going to a remote part of London at a specific time and with a prior appointment. Those notebooks are now available on line for free. But for how long?

“So far, Bob has been using Charlie’s old computer, using Ubuntu 10.04. The next day, he is excited to find the new computer Alice has ordered for him has arrived. He installs Ubuntu 12.04.” A. Davison et al.

Putting their principles into practice, the authors of Implementing reproducible research have made all chapters available for free on the Open Science Framework. I thus encourage anyone interested in those principles (and who would not be?!) to peruse the chapters and see how they can benefit from and contribute to open and reproducible research.

R/Rmetrics in Paris [alas!]

Posted in Mountains, pictures, R, Statistics, Travel, University life on June 30, 2014 by xi'an

Today I gave a talk on Bayesian model choice in a fabulous 13th Century former monastery in the Latin Quarter of Paris… It is the Collège des Bernardins, close to Jussieu and Collège de France, unbelievably hidden to the point I was not aware of its existence despite having studied and worked in Jussieu since 1982… I mixed my earlier San Antonio survey on importance sampling approximations to Bayes factors with an entry to our most recent work on ABC with random forests. This was the first talk of the 8th R/Rmetrics workshop taking place in Paris this year. (Rmetrics is aiming at aggregating R packages with econometrics and finance applications.) And I had a full hour and a half to deliver my lecture to the workshop audience. Nice place, nice people, new faces and topics (and even andouille de Vire for lunch!): why should I complain with an alas in the title?!

What happened is that the R/Rmetrics meetings have until this year been organised in Meielisalp, Switzerland. Which stands above Thuner See and… just next to the most famous peaks of the Bernese Alps! And that I had been invited last year but could not make it… Meaning I lost a genuine opportunity to climb one of my five dream routes, the Mittellegi ridge of the Eiger. As the future R/Rmetrics meetings will not take place there.

A lunch discussion at the workshop led me to experiment with the compiler library in R, a library I was unaware of. The impact on the running time is obvious: recycling the fowler function from the last Le Monde puzzle,

> bowler=cmpfun(fowler)
> N=20;n=10;system.time(fowler(pred=N))
   user  system elapsed 
 52.647   0.076  56.332 
> N=20;n=10;system.time(bowler(pred=N))
   user  system elapsed 
 51.631   0.004  51.768 
> N=20;n=15;system.time(bowler(pred=N))
   user  system elapsed 
 51.924   0.024  52.429 
> N=20;n=15;system.time(fowler(pred=N))
   user  system elapsed 
 52.919   0.200  61.960 

shows a ten- to twenty-fold gain in system time, if not in elapsed time (re-alas!).
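For anyone wanting to repeat the comparison without the fowler function, here is a self-contained sketch on a made-up loop-heavy function (`slow_sum` is a hypothetical stand-in). Note that since R 3.4.0 closures are byte-compiled by the just-in-time compiler by default, so on a recent R the two timings may be nearly identical.

```r
library(compiler)                # ships with base R

# Hypothetical stand-in for a loop-heavy function
slow_sum <- function(n) {
  s <- 0
  for (i in seq_len(n)) s <- s + i %% 7
  s
}
fast_sum <- cmpfun(slow_sum)     # explicitly byte-compiled version

system.time(slow_sum(1e6))
system.time(fast_sum(1e6))
```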

Statistical modeling and computation [apologies]

Posted in Books, R, Statistics, University life on June 11, 2014 by xi'an

In my book review of the recent book by Dirk Kroese and Joshua Chan,  Statistical Modeling and Computation, I mistakenly and persistently typed the name of the second author as Joshua Chen. This typo alas made it to the printed and on-line versions of the subsequent CHANCE 27(2) column. I am thus very much sorry for this mistake of mine and most sincerely apologise to the authors. Indeed, it always annoys me to have my name mistyped (usually as Roberts!) in references.  [If nothing else, this typo signals it is high time for a change of my prescription glasses.]

