Archive for the R Category

Florid’AISTATS

Posted in pictures, R, Statistics, Travel, University life on August 31, 2016 by xi'an

The next AISTATS conference is taking place in Fort Lauderdale, Florida, on April 20-22, 2017. (The website keeps the same address from one conference to the next, which means all my links to the AISTATS 2016 conference in Cadiz are no longer valid. And that the above sunset from Florida is named… cadiz.jpg!) The deadline for paper submission is October 13, and there are two novel features:

  1. Fast-track for Electronic Journal of Statistics: Authors of a small number of accepted papers will be invited to submit an extended version for fast-track publication in a special issue of the Electronic Journal of Statistics (EJS) after the AISTATS decisions are out. Details on how to prepare such extended journal paper submission will be announced after the AISTATS decisions.
  2. Review-sharing with NIPS: Papers previously submitted to NIPS 2016 are required to declare their previous NIPS paper ID, and optionally supply a one-page letter of revision (similar to a revision letter to journal editors; anonymized) in supplemental materials. AISTATS reviewers will have access to the previous anonymous NIPS reviews. Other than this, all submissions will be treated equally.

I find both initiatives worth applauding and replicating in other machine-learning conferences, particularly with regard to the recent debate we had at the Annals of Statistics.

Bayesian Essentials with R [book review]

Posted in Books, R, Statistics, University life on July 28, 2016 by xi'an

[A review of Bayesian Essentials that appeared in Technometrics two weeks ago, with the first author being rechristened Jean-Michael!]

“Overall this book is a very helpful and useful introduction to Bayesian methods of data analysis. I found the use of R, the code in the book, and the companion R package, bayess, to be helpful to those who want to begin using Bayesian methods in data analysis. One topic that I would like to see added is the use of Bayesian methods in change point problems, a topic that we found useful in a recent article and which could be added to the time series chapter. Overall this is a solid book and well worth considering by its intended audience.”
David E. BOOTH
Kent State University

the curious incident of the inverse of the mean

Posted in R, Statistics, University life on July 15, 2016 by xi'an

As I figured out while working with astronomer colleagues last week, a strange if understandable difficulty proceeds from the simplest and most studied statistical model, namely the Normal model

x~N(θ,1)

Indeed, if one reparametrises this model as x~N(υ⁻¹,1) with υ>0, a single observation x brings very little information about υ! (This is not a toy problem, as it corresponds to estimating distances from observations of parallaxes.) If x gets large, υ is very likely to be small, but if x is small or negative, υ is almost certainly large, with no power to discriminate between highly different values. For instance, Fisher’s information for this model and parametrisation is (dθ/dυ)²=υ⁻⁴, with θ=υ⁻¹, and thus collapses to zero as υ grows.

While one can always hope for Bayesian miracles, they do not automatically occur. For instance, working with a Gamma prior Ga(3,10³) on υ [as informed by a large astronomy dataset] leads to a posterior expectation hardly impacted by the value of the observation x:

And using an alternative estimate, like the harmonic posterior mean associated with the relative squared error loss, shows hardly any more impact from the observation:

There is simply not enough information contained in one datapoint (or even several datapoints, for that matter) to draw inference about υ.
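For a quick numerical check, here is a minimal R sketch of mine (a reconstruction rather than the original code, reading Ga(3,10³) as shape 3 and scale 10³) that computes both estimates by summation over a grid and shows how little they move with the observation:

# posterior of v when x ~ N(1/v, 1), under a Ga(3, scale=1e3) prior on v;
# log-scale weights over a grid avoid numerical underflow
post_estimates <- function(x, shape=3, scale=1e3){
  v <- seq(1e-2, 2e4, length.out=2e5)   # grid covering the bulk of the prior
  lw <- dnorm(x, mean=1/v, log=TRUE) +
        dgamma(v, shape=shape, scale=scale, log=TRUE)
  w <- exp(lw - max(lw))                # stabilised unnormalised weights
  c(mean=sum(w*v)/sum(w),               # posterior expectation E[v|x]
    harm=sum(w/v)/sum(w/v^2))           # Bayes estimate under (d-v)^2/v^2 loss
}
round(sapply(c(-3, -1, 0, 1, 3), post_estimates))  # near-constant in x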

Extending R

Posted in Books, Kids, R, Statistics on July 13, 2016 by xi'an

As I was previously unaware of this book coming out, my surprise and excitement were both extreme when I received it from CRC Press a few weeks ago! John Chambers, one of the fathers of S, the precursor of R, had just published a book about extending R. It covers some reflections of the author on programming and on the story of R (Parts 2 and 1), and then focuses on object-oriented programming (Part 3) and on the interfaces from R to other languages (Part 4). While this is “only” a programming book, and thus not strictly appealing to statisticians, reading one of the original actors’ thoughts on the past, present, and future of R is simply fantastic!!! And John Chambers is definitely not calling to simply start over and build something better, as Ross Ihaka did in this [most read] post a few years ago. (It is also great to see the names of friends appearing at times, like Julie, Luke, and Duncan!)

“I wrote most of the original software for S3 methods, which were useful for their application, in the early 1990s.”

In the (hi)story part, Chambers delves into the details of the evolution of S at Bell Labs, as described in his [first] “blue book” (which I kept on my shelf until very recently, next to the “white book”!), and of the appearance of R in the mid-1990s. I find those sections fascinating, maybe all the more because I am somewhat of a contemporary, having first learned Fortran (and Pascal) in the mid-1980s, before moving in the early 1990s to C (which I mostly coded as translated Pascal!), S-Plus, and eventually R, in conjunction with a (forced) migration from Unix to Linux, as my local computer managers abandoned Unix and mainframes in favour of virtual Windows machines, and as I started running R on laptops with the help of friends more skilled than I (again keeping some of the early R manuals on my shelf until recently). Maybe one of the most surprising things about those reminiscences is that the very first version of R was dated Feb 29, 2000! Not because of the date itself (which, as Chambers points out, is the first use of the third-order correction to the Gregorian calendar, although I would have thought 1600 was the first such occurrence), but because I would have thought R appeared earlier, in conjunction with my first Linux laptop; alas, this memory is getting too vague!

As indicated above, the book is mostly about programming, which means in my case that some sections are definitely beyond my reach! For instance, reading “the onus is on the person writing the calling function to avoid using a reference object as the argument to an existing function that expects a named list” is not immediately clear… Nonetheless, most sections are readable [at my level] and enlightening about the mottoes “everything that exists is an object” and “everything that happens is a function” repeated throughout. (And about my psycho-rigid ways of translating Pascal into every other language!) I obviously learned about new commands and notions, like the difference between

x <- 3

and

x <<- 3

(but I was disappointed to learn that the number of <‘s was not related to the depth or height of the assignment!) In particular, I found the part about replacement fascinating, explaining how a command like

diag(x)[i] = 3

could modify x directly. (While definitely worth reading, the chapter on R packages could have benefited from more details. But, as Chambers points out, there are whole books about this.) Overall, I am afraid the book will not improve my (limited) way of programming in R, but I definitely recommend it to anyone even moderately skilled in the language.
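To make these notions concrete, here is a small sketch of mine (not from the book) contrasting the two assignment operators and spelling out how R evaluates such a replacement call through the `diag<-` and `[<-` replacement functions:

counter <- 0
bump <- function() counter <- counter + 1   # '<-' only binds a local copy
jump <- function() counter <<- counter + 1  # '<<-' reassigns in the enclosing scope
bump(); counter  # still 0
jump(); counter  # now 1

# a call like diag(x)[i] <- 3 is evaluated roughly as
x <- matrix(0, 3, 3); i <- 2
`*tmp*` <- x
x <- `diag<-`(`*tmp*`, `[<-`(diag(`*tmp*`), i, 3))
x[2, 2]  # now 3: x itself has been modified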

correlation matrices on copulas

Posted in R, Statistics, University life on July 4, 2016 by xi'an

Following my post of yesterday about the missing condition in Lynch’s R code, Gérard Letac sent me a paper he recently wrote with Luc Devroye on correlation matrices and copulas, written for the memorial volume in honour of Marc Yor. It considers the neat problem of the existence of a copula (on [0,1]×…×[0,1]) with a given correlation matrix R, and establishes this existence up to dimension n=9. The proof is based on the consideration of the extreme points of the set of correlation matrices. The authors conjecture the existence of 10×10 correlation matrices that cannot be the correlation matrix of a copula. The paper also contains a result that answers an (idle) puzzle of mine of many years, namely how to set the correlation matrix of a Gaussian copula to achieve a given correlation matrix R for the copula. More precisely, the paper links the [correlation] matrix R of X~N(0,R) with the [correlation] matrix R⁰ of Φ(X) by

r^0_{ij}=\frac{6}{\pi}\arcsin\{r_{ij}/2\}

A side consequence of this result is that there exist correlation matrices of copulas that cannot be associated with Gaussian copulas, such as

R=\left[\begin{matrix} 1 &-1/2 &-1/2\\-1/2 &1 &-1/2\\-1/2 &-1/2 &1 \end{matrix}\right]
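Both facts are easy to check numerically; here is a small R sketch of mine (not from the paper): a Monte Carlo confirmation of the arcsine relation, followed by the observation that achieving the above R would require a Gaussian correlation of 2sin(−π/12)≈−0.518, for which the 3×3 equicorrelation matrix is no longer positive semi-definite:

set.seed(1)
# Monte Carlo check of r0 = (6/pi) arcsin(r/2) for a Gaussian pair
r <- .7; n <- 1e6
x1 <- rnorm(n); x2 <- r*x1 + sqrt(1-r^2)*rnorm(n)
cor(pnorm(x1), pnorm(x2))  # empirical correlation of the Gaussian copula
6/pi*asin(r/2)             # theoretical value, about 0.683

# inverting the relation for a target copula correlation of -1/2
rho <- 2*sin(pi*(-1/2)/6)       # about -0.518
R <- matrix(rho, 3, 3); diag(R) <- 1
min(eigen(R)$values)            # negative: not a valid correlation matrix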

another wrong entry

Posted in Books, Kids, R, Statistics, University life on June 27, 2016 by xi'an

Quite a coincidence! I just came across another bug in Lynch’s (2007) book, Introduction to Applied Bayesian Statistics and Estimation for Social Scientists, already discussed here and on X validated. While working with one participant in the post-ISBA softshop, we were looking for efficient approaches to simulating correlation matrices and came across [by a Google search] the above R code associated with a 3×3 correlation matrix, which misses the additional constraint that the determinant must be positive. As shown, e.g., by the example

> eigen(matrix(c(1,-.8,.7,-.8,1,.6,.7,.6,1),ncol=3))
$values
[1] 1.8169834 1.5861960 -0.4031794

having all correlations between -1 and 1 is not enough. Just. Not. Enough.
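For the record, here is a minimal corrected sketch (my own code, not Lynch’s) that enforces the missing constraint by rejection; for a 3×3 matrix with unit diagonal and off-diagonal entries in (-1,1), a positive determinant is equivalent to positive definiteness:

# simulate a valid 3x3 correlation matrix by rejection
rcorr3 <- function(){
  repeat{
    r <- runif(3, -1, 1)
    R <- matrix(c(1, r[1], r[2],
                  r[1], 1, r[3],
                  r[2], r[3], 1), ncol=3)
    if (det(R) > 0) return(R)  # the condition missing from the book's code
  }
}
eigen(rcorr3())$values  # all eigenvalues now positive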

Le Monde puzzle [#965]

Posted in Kids, R on June 14, 2016 by xi'an

A game-related Le Monde mathematical puzzle:

Starting with a pile of 10⁴ tokens, Bob plays the following game: at each round, he picks one of the existing piles with at least 3 tokens, takes away one of the tokens in this pile, and separates the remaining ones into two non-empty piles of arbitrary size. Bob stops when all piles have identical size. What is this size and what is the maximal number of piles?

First, Bob can easily reach a configuration that prevents all piles from having the same size: for instance, a move producing a pile of 1 and a pile of 2 (as when starting from 4 tokens) leaves no further legal move. Looking at the problem in general, an odd pile of n=2k+1 tokens can be turned into the piles (1,2k−1) after one round, and iterating shows that the decomposition (1,1,…,1) involving k+1 ones can always be achieved. For an even number, n=2k, this is not feasible. If the 2k tokens can be partitioned into piles of equal size u, this means that the sequence 2k−(u+1), 2k−2(u+1), … ends up at u, hence that there exists m such that 2k−m(u+1)=u, that is, 2k+1 is a multiple of u+1. (Equivalently, p piles of size u require 2k=pu+p−1 tokens, since each round removes one token and adds one pile.) Therefore, the smallest pile size is the smallest factor of 2k+1. Minus one. For 2k=10⁴, this value is 72 (as 10001=73×137), with a maximal number of 10001/73=137 piles, while it is 6 for 10³ (as 1001=7×11×13). The decomposition is impossible for 2k=100, since 101 is prime. Here are the R functions used to check this analysis (with small integers, if not 10⁴):

# play one random game: recursively pick a splittable pile, remove one
# token, and divide the rest, until no move is left or all piles are equal
solvant <- function(piles){
 # stop once there are several piles and either none can be split
 # (all of size at most 2) or all have the same size
 if ((length(piles)>1)&((max(piles)==2)||(min(piles)==max(piles)))){
  return(piles)}else{
   # pick a pile with at least 3 tokens, favouring larger piles
   # (the rep(...,2) trick guards against sample()'s scalar behaviour)
   i=sample(rep(1:length(piles),2),1,prob=rep(piles-min(piles)+.1,2))
   while (piles[i]<3)
    i=sample(rep(1:length(piles),2),1,prob=rep(piles-min(piles)+.1,2))
   # remove one token and split the remaining piles[i]-1 tokens into
   # two non-empty piles, favouring unbalanced splits
   split=sample(rep(2:(piles[i]-1),2),1,
        prob=rep(abs(2:(piles[i]-1)-piles[i]/2)+.1,2))
   piles=c(piles[-i],split-1,piles[i]-split)
   solvant(piles)}}

# repeat random games until one ends with all piles of equal size
# (beware: this loops forever when no equal decomposition exists,
# as when starting from 100 tokens)
disolvant <- function(piles){
 sol=solvant(piles)
 while (min(sol)<max(sol))
  sol=solvant(piles)
 return(sol)}

# repeat the search (piles times, a crude budget) and keep the
# solution with the largest number of piles
resolvant <- function(piles){
 sol=disolvant(piles)
 lasol=sol;maxle=length(sol)
 for (t in 1:piles){
  sol=disolvant(piles)
  if (length(sol)>maxle){
   lasol=sol;maxle=length(sol)}}
 return(lasol)}
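And a final sanity check of mine on the arithmetic (not part of the original solution): the minimal pile size can be read off the smallest prime factor of n+1.

# minimal pile size = smallest prime factor of n+1, minus one
smallest_factor <- function(m){
  d <- 2
  while (m %% d > 0) d <- d+1
  d}
smallest_factor(1e4+1) - 1  # 72, since 10001 = 73 x 137
smallest_factor(1e3+1) - 1  # 6, since 1001 = 7 x 11 x 13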