## amazing Gibbs sampler

Posted in Books, pictures, R, Statistics, University life with tags , , , , , , on February 19, 2015 by xi'an

When playing with Peter Rossi’s bayesm R package during a visit of Jean-Michel Marin to Paris, last week, we came up with the above Gibbs outcome. The setting is a Gaussian mixture model with three components in dimension 5 and the prior distributions are standard conjugate. In this case, with 500 observations and 5000 Gibbs iterations, the Markov chain (for one component of one mean of the mixture) has two highly distinct regimes: one that revolves around the true value of the parameter, 2.5, and one that explores a much broader area (which is associated with a much smaller value of the component weight). What we found amazing is the Gibbs ability to entertain both regimes, simultaneously.

## Le Monde puzzle [#899]

Posted in Books, Kids, Statistics, University life with tags , , , , , on February 8, 2015 by xi'an

An arithmetics Le Monde mathematical puzzle:

For which n’s are the averages of the first n squared integers integers? Among those, which ones are perfect squares?

An easy R code, for instance

n=10^3
car=as.integer(as.integer(1:n)^2)
sumcar=as.integer((cumsum(car)%/%as.integer(1:n)))
diff=as.integer(as.integer(cumsum(car))-as.integer(1:n)*sumcar)
print((1:n)[diff==00])


which produces 333 values

  [1]   1   5   7  11  13  17  19  23  25  29  31  35  37  41  43  47  49  53
[19]  55  59  61  65  67  71  73  77  79  83  85  89  91  95  97 101 103 107
[37] 109 113 115 119 121 125 127 131 133 137 139 143 145 149 151 155 157 161
[55] 163 167 169 173 175 179 181 185 187 191 193 197 199 203 205 209 211 215
[73] 217 221 223 227 229 233 235 239 241 245 247 251 253 257 259 263 265 269
[91] 271 275 277 281 283 287 289 293 295 299 301 305 307 311 313 317 319 323
[109] 325 329 331 335 337 341 343 347 349 353 355 359 361 365 367 371 373 377
[127] 379 383 385 389 391 395 397 401 403 407 409 413 415 419 421 425 427 431
[145] 433 437 439 443 445 449 451 455 457 461 463 467 469 473 475 479 481 485
[163] 487 491 493 497 499 503 505 509 511 515 517 521 523 527 529 533 535 539
[181] 541 545 547 551 553 557 559 563 565 569 571 575 577 581 583 587 589 593
[199] 595 599 601 605 607 611 613 617 619 623 625 629 631 635 637 641 643 647
[217] 649 653 655 659 661 665 667 671 673 677 679 683 685 689 691 695 697 701
[235] 703 707 709 713 715 719 721 725 727 731 733 737 739 743 745 749 751 755
[253] 757 761 763 767 769 773 775 779 781 785 787 791 793 797 799 803 805 809
[271] 811 815 817 821 823 827 829 833 835 839 841 845 847 851 853 857 859 863
[289] 865 869 871 875 877 881 883 887 889 893 895 899 901 905 907 911 913 917
[307] 919 923 925 929 931 935 937 941 943 947 949 953 955 959 961 965 967 971
[325] 973 977 979 983 985 989 991 995 997


which are made of all odd integers that are not multiple of 3. (I could have guessed the exclusion of even numbers since the numerator is always odd. Why are the triplets excluded, now?! Jean-Louis Fouley gave me the answer: the sum of squares is such that

$\frac{1+2^2+\cdots+m^2}{m}=\frac{m(m+1)(2m+1)}{6m}=\frac{(m+1)(2m+1)}{6}$

and hence m must be odd and 2m+1 a multiple of 3, which excludes multiples of 3.)

The second part is as simple:

sole=sumcar[(1:n)[diff==0]]
scar=as.integer(as.integer(sqrt(sole))^2)-sole
sum(scar==0)


with the final result

> sum(scar==0)
[1] 2
> ((1:n)[diff==0])[scar==0]
[1] 1 337


since  38025=195² is a perfect square. (I wonder if there is a plain explanation for that result!)

## Le Monde puzzle [#887]

Posted in Books, Kids, R, Statistics with tags , , , on November 15, 2014 by xi'an

A simple combinatorics Le Monde mathematical puzzle:

N is a golden number if the sequence {1,2,…,N} can be reordered so that the sum of any consecutive pair is a perfect square. What are the golden numbers between 1 and 25?

Indeed, from an R programming point of view, all I have to do is to go over all possible permutations of {1,2,..,N} until one works or until I have exhausted all possible permutations for a given N. However, 25!=10²⁵ is a wee bit too large… Instead, I resorted once again to brute force simulation, by first introducing possible neighbours of the integers

  perfect=(1:trunc(sqrt(2*N)))^2
friends=NULL
le=1:N
for (perm in 1:N){
amis=perfect[perfect>perm]-perm
amis=amis[amis<N]
le[perm]=length(amis)
friends=c(friends,list(amis))
}


and then proceeding to construct the permutation one integer at time by picking from its remaining potential neighbours until there is none left or the sequence is complete

orderin=0*(1:N)
t=1
orderin[1]=sample((1:N),1)
for (perm in 1:N){
friends[[perm]]=friends[[perm]]
[friends[[perm]]!=orderin[1]]
le[perm]=length(friends[[perm]])
}
while (t<N){
if (length(friends[[orderin[t]]])==0)
break()
if (length(friends[[orderin[t]]])>1){
orderin[t+1]=sample(friends[[orderin[t]]],1)}else{
orderin[t+1]=friends[[orderin[t]]]
}
for (perm in 1:N){
friends[[perm]]=friends[[perm]]
[friends[[perm]]!=orderin[t+1]]
le[perm]=length(friends[[perm]])
}
t=t+1}


and then repeating this attempt until a full sequence is produced or a certain number of failed attempts has been reached. I gained in efficiency by proposing a second completion on the left of the first integer once a break occurs:

while (t<N){
if (length(friends[[orderin[1]]])==0)
break()
orderin[2:(t+1)]=orderin[1:t]
if (length(friends[[orderin[2]]])>1){
orderin[1]=sample(friends[[orderin[2]]],1)}else{
orderin[1]=friends[[orderin[2]]]
}
for (perm in 1:N){
friends[[perm]]=friends[[perm]]
[friends[[perm]]!=orderin[1]]
le[perm]=length(friends[[perm]])
}
t=t+1}


(An alternative would have been to complete left and right by squared numbers taken at random…) The result of running this program showed there exist permutations with the above property for N=15,16,17,23,25,26,…,77.  Here is the solution for N=49:

25 39 10 26 38 43 21 4 32 49 15 34 30 6 3 22 42 7 9 27 37 12 13 23 41 40 24 1 8 28 36 45 19 17 47 2 14 11 5 44 20 29 35 46 18 31 33 16 48

As an aside, the authors of Le Monde puzzle pretended (in Tuesday, Nov. 12, edition) that there was no solution for N=23, while this sequence

22 3 1 8 17 19 6 10 15 21 4 12 13 23 2 14 11 5 20 16 9 7 18

sounds fine enough to me… I more generally wonder at the general principle behind the existence of such sequences. It sounds quite probable that they exist for N>24. (The published solution does not bring any light on this issue, so I assume the authors have no mathematical analysis to provide.)

## The winds of Winter [Bayesian prediction]

Posted in Books, Kids, R, Statistics, University life with tags , , , , , , , , , , on October 7, 2014 by xi'an

A surprising entry on arXiv this morning: Richard Vale (from Christchurch, NZ) has posted a paper about the characters appearing in the yet hypothetical next volume of George R.R. Martin’s Song of ice and fire series, The winds of Winter [not even put for pre-sale on amazon!]. Using the previous five books in the series and the frequency of occurrence of characters’ point of view [each chapter being told as from the point of view of one single character], Vale proceeds to model the number of occurrences in a given book by a truncated Poisson model,

$x_{it} \sim \mathcal{P}(\lambda_i)\text{ if }|t-\beta_i|<\tau_i$

in order to account for [most] characters dying at some point in the series. All parameters are endowed with prior distributions, including the terrible “large” hyperpriors familiar to BUGS users… Despite the code being written in R by the author. The modelling does not use anything but the frequencies of the previous books, so knowledge that characters like Eddard Stark had died is not exploited. (Nonetheless, the prediction gives zero chapter to this character in the coming volumes.) Interestingly, a character who seemingly died at the end of the last book is still given a 60% probability of having at least one chapter in  The winds of Winter [no spoiler here, but many in the paper itself!]. As pointed out by the author, the model as such does not allow for prediction of new-character chapters, which remains likely given Martin’s storytelling style! Vale still predicts 11 new-character chapters, which seems high if considering the series should be over in two more books [and an unpredictable number of years!].

As an aside, this paper makes use of the truncnorm R package, which I did not know and which is based on John Geweke’s accept-reject algorithm for truncated normals that I (independently) proposed a few years later.

## a weird beamer feature…

Posted in Books, Kids, Linux, R, Statistics, University life with tags , , , , , , , , , , , , on September 24, 2014 by xi'an

As I was preparing my slides for my third year undergraduate stat course, I got a weird error that got a search on the Web to unravel:

! Extra }, or forgotten \endgroup.
\endframe ->\egroup
\begingroup \def \@currenvir {frame}
l.23 \end{frame}
\begin{slide}
?


which was related with a fragile environment

\begin{frame}[fragile]
\frametitle{simulation in practice}
\begin{itemize}
\item For a given distribution $F$, call the corresponding
pseudo-random generator in an arbitrary computer language
\begin{verbatim}
> x=rnorm(10)
> x
[1] -0.021573 -1.134735  1.359812 -0.887579
[7] -0.749418  0.506298  0.835791  0.472144
\end{verbatim}
\item use the sample as a statistician would
\begin{verbatim}
> mean(x)
[1] 0.004892123
> var(x)
[1] 0.8034657
\end{verbatim}
to approximate quantities related with $F$
\end{itemize}
\end{frame}\begin{frame}


but not directly the verbatim part: the reason for the bug was that the \end{frame} command did not have a line by itself! Which is one rare occurrence where the carriage return has an impact in LaTeX, as far as I know… (The same bug appears when there is an indentation at the beginning of the line. Weird!) [Another annoying feature is wordpress turning > into &gt; in the sourcecode environment…]

## another R new trick [new for me!]

Posted in Books, Kids, R, Statistics, University life with tags , , , , , , , on July 16, 2014 by xi'an

While working with Andrew and a student from Dauphine on importance sampling, we wanted to assess the distribution of the resulting sample via the Kolmogorov-Smirnov measure

$\max_x |\hat{F_n}(x)-F(x)|$

where F is the target.  This distance (times √n) has an asymptotic distribution that does not depend on n, called the Kolmogorov distribution. After searching for a little while, we could not figure where this distribution was available in R. It had to, since ks.test was returning a p-value. Hopefully correct! So I looked into the ks.test function, which happens not to be entirely programmed in C, and found the line

PVAL <- 1 - if (alternative == "two.sided")
.Call(C_pKolmogorov2x, STATISTIC, n)


which means that the Kolmogorov distribution is coded as a C function C_pKolmogorov2x in R. However, I could not call the function myself.

> .Call(C_pKolmogorov2x,.3,4)


Hence, as I did not want to recode this distribution cdf, I posted the question on stackoverflow (long time no see!) and got a reply almost immediately as to use the package kolmim. Followed by the extra comment from the same person that calling the C code only required to add the path to its name, as in

> .Call(stats:::C_pKolmogorov2x,STAT=.3,n=4)
[1] 0.2292


## implementing reproducible research [short book review]

Posted in Books, Kids, pictures, R, Statistics, Travel, University life with tags , , , , , , , , , , , on July 15, 2014 by xi'an

As promised, I got back to this book, Implementing reproducible research (after the pigeons had their say). I looked at it this morning while monitoring my students taking their last-chance R exam (definitely last chance as my undergraduate R course is not reconoduced next year). The book is in fact an edited collection of papers on tools, principles, and platforms around the theme of reproducible research. It obviously links with other themes like open access, open data, and open software. All positive directions that need more active support from the scientific community. In particular the solutions advocated through this volume are mostly Linux-based. Among the tools described in the first chapter, knitr appears as an alternative to sweave. I used the later a while ago and while I like its philosophy. it does not extend to situations where the R code within takes too long to run… (Or maybe I did not invest enough time to grasp the entire spectrum of sweave.) Note that, even though the book is part of the R Series of CRC Press, many chapters are unrelated to R. And even more [unrelated] to statistics.

This limitation is somewhat my difficulty with [adhering to] the global message proposed by the book. It is great to construct such tools that monitor and archive successive versions of code and research, as anyone can trace back the research steps conducting to the published result(s). Using some of the platforms covered by the book establishes for instance a superb documentation principle, going much further than just providing an “easy” verification tool against fraudulent experiments. The notion of a super-wiki where notes and preliminary versions and calculations (and dead ends and failures) would be preserved for open access is just as great. However this type of research processing and discipline takes time and space and human investment, i.e. resources that are sparse and costly. Complex studies may involve enormous amounts of data and, neglecting the notions of confidentiality and privacy, the cost of storing such amounts is significant. Similarly for experiments that require days and weeks of huge clusters. I thus wonder where those resources would be found (journals, universities, high tech companies, …?) for the principle to hold in full generality and how transient they could prove. One cannot expect the research time to garantee availability of those meta-documents for remote time horizons. Just as a biased illustration, checking the available Bayes’ notebooks meant going to a remote part of London at a specific time and with a preliminary appointment. Those notebooks are not available on line for free. But for how long?

“So far, Bob has been using Charlie’s old computer, using Ubuntu 10.04. The next day, he is excited to find the new computer Alice has ordered for him has arrived. He installs Ubuntu 12.04″ A. Davison et al.

Putting their principles into practice, the authors of Implementing reproducible research have made all chapters available for free on the Open Science Framework. I thus encourage anyone interesting in those principles (and who would not be?!) to peruse the chapters and see how they can benefit from and contribute to open and reproducible research.