## Bangalore workshop [ಬೆಂಗಳೂರು ಕಾರ್ಯಾಗಾರ]

Posted in pictures, R, Running, Statistics, Travel, University life, Wines with tags , , , , , , on July 31, 2014 by xi'an

Second day at the Indo-French Centre for Applied Mathematics and the workshop. Maybe not the most exciting day in terms of talks (as I missed the first two plenary sessions by (a) oversleeping and (b) running across the campus!). However I had a neat talk with another conference participant that led to [what I think are] interesting questions… (And a very good meal in a local restaurant as the guest house had not booked me for dinner!)

To wit: given a target like

$\lambda \exp(-\lambda) \prod_{i=1}^n \dfrac{1-\exp(-\lambda y_i)}{\lambda}\quad (*)$

the simulation of λ can be demarginalised into the simulation of

$\pi (\lambda,\mathbf{z})\propto \lambda \exp(-\lambda) \prod_{i=1}^n \exp(-\lambda z_i) \mathbb{I}(z_i\le y_i)$

where z is a latent (and artificial) variable. This means a Gibbs sampler simulating λ given z and z given λ can produce an outcome from the target (*). Interestingly, another completion is to consider that the zi‘s are U(0,yi) and to see the quantity

$\pi(\lambda,\mathbf{z}) \propto \lambda \exp(-\lambda) \prod_{i=1}^n \exp(-\lambda z_i) \mathbb{I}(z_i\le y_i)$

as an unbiased estimator of the target. What’s quite intriguing is that the quantity remains the same but with different motivations: (a) demarginalisation versus unbiasedness and (b) zi ∼ Exp(λ) versus zi ∼ U(0,yi). The stationary is the same, as shown by the graph below, the core distributions are [formally] the same, … but the reasoning deeply differs.

Obviously, since unbiased estimators of the likelihood can be justified by auxiliary variable arguments, this is not in fine a big surprise. Still, I had not thought of the analogy between demarginalisation and unbiased likelihood estimation previously. Continue reading

## another R new trick [new for me!]

Posted in Books, Kids, R, Statistics, University life with tags , , , , , , , on July 16, 2014 by xi'an

While working with Andrew and a student from Dauphine on importance sampling, we wanted to assess the distribution of the resulting sample via the Kolmogorov-Smirnov measure

$\max_x |\hat{F_n}(x)-F(x)|$

where F is the target.  This distance (times √n) has an asymptotic distribution that does not depend on n, called the Kolmogorov distribution. After searching for a little while, we could not figure where this distribution was available in R. It had to, since ks.test was returning a p-value. Hopefully correct! So I looked into the ks.test function, which happens not to be entirely programmed in C, and found the line

PVAL <- 1 - if (alternative == "two.sided")
.Call(C_pKolmogorov2x, STATISTIC, n)


which means that the Kolmogorov distribution is coded as a C function C_pKolmogorov2x in R. However, I could not call the function myself.

> .Call(C_pKolmogorov2x,.3,4)


Hence, as I did not want to recode this distribution cdf, I posted the question on stackoverflow (long time no see!) and got a reply almost immediately as to use the package kolmim. Followed by the extra comment from the same person that calling the C code only required to add the path to its name, as in

> .Call(stats:::C_pKolmogorov2x,STAT=.3,n=4)
[1] 0.2292


## implementing reproducible research [short book review]

Posted in Books, Kids, pictures, R, Statistics, Travel, University life with tags , , , , , , , , , , , on July 15, 2014 by xi'an

As promised, I got back to this book, Implementing reproducible research (after the pigeons had their say). I looked at it this morning while monitoring my students taking their last-chance R exam (definitely last chance as my undergraduate R course is not reconoduced next year). The book is in fact an edited collection of papers on tools, principles, and platforms around the theme of reproducible research. It obviously links with other themes like open access, open data, and open software. All positive directions that need more active support from the scientific community. In particular the solutions advocated through this volume are mostly Linux-based. Among the tools described in the first chapter, knitr appears as an alternative to sweave. I used the later a while ago and while I like its philosophy. it does not extend to situations where the R code within takes too long to run… (Or maybe I did not invest enough time to grasp the entire spectrum of sweave.) Note that, even though the book is part of the R Series of CRC Press, many chapters are unrelated to R. And even more [unrelated] to statistics.

This limitation is somewhat my difficulty with [adhering to] the global message proposed by the book. It is great to construct such tools that monitor and archive successive versions of code and research, as anyone can trace back the research steps conducting to the published result(s). Using some of the platforms covered by the book establishes for instance a superb documentation principle, going much further than just providing an “easy” verification tool against fraudulent experiments. The notion of a super-wiki where notes and preliminary versions and calculations (and dead ends and failures) would be preserved for open access is just as great. However this type of research processing and discipline takes time and space and human investment, i.e. resources that are sparse and costly. Complex studies may involve enormous amounts of data and, neglecting the notions of confidentiality and privacy, the cost of storing such amounts is significant. Similarly for experiments that require days and weeks of huge clusters. I thus wonder where those resources would be found (journals, universities, high tech companies, …?) for the principle to hold in full generality and how transient they could prove. One cannot expect the research time to garantee availability of those meta-documents for remote time horizons. Just as a biased illustration, checking the available Bayes’ notebooks meant going to a remote part of London at a specific time and with a preliminary appointment. Those notebooks are not available on line for free. But for how long?

“So far, Bob has been using Charlie’s old computer, using Ubuntu 10.04. The next day, he is excited to find the new computer Alice has ordered for him has arrived. He installs Ubuntu 12.04″ A. Davison et al.

Putting their principles into practice, the authors of Implementing reproducible research have made all chapters available for free on the Open Science Framework. I thus encourage anyone interesting in those principles (and who would not be?!) to peruse the chapters and see how they can benefit from and contribute to open and reproducible research.

## Le Monde puzzle [#875]

Posted in Kids, R, Statistics, University life with tags , , , , on July 12, 2014 by xi'an

I learned something in R today thanks to Le Monde mathematical puzzle:

A two-player game consists in A picking a number n between 1 and 10 and B and A successively choosing and applying one of three transforms to the current value of n

• n=n+1,
• n=3n,
• n=4n,

starting with B, until n is larger than 999. Which value(s) of n should A pick if both players act optimally?

Indeed, I first tested the following R code

sole=function(n){
if (n>999){ return(TRUE)
}else{
return((!sole(3*n))&(!sole(4*n))&(!sole(n+1)))
}}


which did not work because of too many calls to sole:

>sole(1)
Error: evaluation nested too deeply: infinite recursion
/ options(expressions=)?


So I included a memory in the calls to sole so that good and bad entries of n were saved for later calls:

visit=rep(-1,1000) #not yet visited
sole=function(n){
if (n>999){ return(TRUE)
}else{
if (visit[n]>-1){ return(visit[n]==1)
}else{
visit[n]<<-((!sole(3*n))&(!sole(4*n))&
(!sole(n+1)))
return(visit[n]==1)
}}}


Trying frontal attack

>sole(1)
Error: evaluation nested too deeply: infinite recursion
/ options(expressions=)?


did not work, but one single intermediary was sufficient:

> sole(500)
[1] FALSE
> for (i in 1:10)
+ print(sole(i))
[1] FALSE
[1] FALSE
[1] FALSE
[1] TRUE
[1] FALSE
[1] TRUE
[1] FALSE
[1] FALSE
[1] FALSE
[1] FALSE


which means that the only winning starters for A are n=4,6. If one wants the winning moves on top, a second counter can be introduced:

visit=best=rep(-1,1000)
sole=function(n){
if (n>999){ return(TRUE)
}else{
if (visit[n]>-1){ return(visit[n]==1)
}else{
visit[n]<<-((!sole(3*n))&(!sole(4*n))&
(!sole(n+1)))
if (visit[n]==0) best[n]<<-max(
3*n*(sole(3*n)),
4*n*(sole(4*n)),
(n+1)*(sole(n+1)))
return(visit[n]==1)
}}}


From which we can deduce the next values chosen by A or B as

> best[1:10]
[1]  4  6  4 -1  6 -1 28 32 36 40


(where -1 means no winning choice is possible).

Now, what is the R trick I learned from this toy problem? Simply the use of the double allocation symbol that allows to change global variables within functions. As visit and best in the latest function. (The function assign would have worked too.)

## ABC in Cancún

Posted in Books, Kids, R, Statistics, Travel, University life with tags , , , , , , , , on July 11, 2014 by xi'an

Here are our slides for the ABC [very] short course Jean-Michel and I give at ISBA 2014 in Cancún next Monday (if your browser can manage Slideshare…) Although I may switch the pictures from Iceland to Mexico, on Sunday, there will be not much change on those slides we both have previously used in previous short courses. (With a few extra slides borrowed from Richard Wilkinson’s tutorial at NIPS 2013!) Jean-Michel will focus his share of the course on software implementations, from R packages like abc and abctools and our population genetics software DIYABC. With an illustration on SNPs data from pygmies populations.