Archive for R

where is .5?

Posted in Statistics on September 10, 2020 by xi'an

A Riddler’s riddle on breaking the unit interval into 4 random bits (by which I understand picking 3 Uniform realisations and ordering them) and finding the length of the bit containing ½ (sparing you the chore of converting inches and feet into decimals). The result can be found by direct integration since the ordered Uniform variates are Beta’s, and so are their consecutive differences, leading to an average length of 15/32. Or by raw R simulation:

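# 1e5 draws of three Uniform cut points, each row sorted increasingly
simz=t(apply(matrix(runif(3*1e5),ncol=3),1,sort))
# average length of the bit containing .5, split by where .5 falls among the cuts
mean((simz[,1]>.5)*simz[,1]+
  (simz[,1]<.5)*(simz[,2]>.5)*(simz[,2]-simz[,1])+
  (simz[,2]<.5)*(simz[,3]>.5)*(simz[,3]-simz[,2])+
  (simz[,3]<.5)*(1-simz[,3]))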

Which can be reproduced for values other than ½, showing that ½ is the value leading to the largest expected length. I wonder if there is a faster way to reach this nice 15/32.
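One possible shortcut, offered as a sketch rather than as the official solution: the length of the bit containing p is the integral over x of the indicator that no cut falls between x and p, so (by Fubini) its expectation with k Uniform cuts is the integral of (1-|x-p|)^k over (0,1), that is, (2-p^(k+1)-(1-p)^(k+1))/(k+1). The function names piece_length and exact_length below are mine, not from the Riddler or from the code above, and the simulation simply generalises the above to an arbitrary location p.

```r
# Monte Carlo estimate of the expected length of the bit containing p
# when the unit interval is cut at k independent Uniform(0,1) points
piece_length <- function(p, k = 3, nsim = 1e5) {
  cuts <- matrix(runif(k * nsim), ncol = k)
  # nearest cut below p (0 if none) and nearest cut above p (1 if none)
  lower <- apply(cuts, 1, function(u) max(0, u[u < p]))
  upper <- apply(cuts, 1, function(u) min(1, u[u > p]))
  mean(upper - lower)
}

# Closed form obtained by integrating (1-|x-p|)^k over x in (0,1)
exact_length <- function(p, k = 3) {
  (2 - p^(k + 1) - (1 - p)^(k + 1)) / (k + 1)
}

exact_length(0.5)   # 0.46875, i.e. 15/32
piece_length(0.5)   # simulated counterpart, to be compared with the above
```

The closed form is symmetric in p and 1-p and decreases away from ½, which agrees with ½ giving the largest expected length.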

Le Monde puzzle [#1154]

Posted in Statistics on August 25, 2020 by xi'an

The weekly puzzle from Le Monde is another Sudoku challenge:

An n by n grid contains all numbers from 1 to n². Is it possible to fill the grid so that every row and every column has an integer average, for n=5, 7, 9?

By sheer random search

`?`=rowSums; `+`=sample 
o=function(n){
x=matrix(+(n^2),n) 
while(any(c(?x,?t(x))%%n))x=x/x*+x 
x}

I found solutions for n=3,4,5, quite easily,

     [,1] [,2] [,3] [,4] [,5]
[1,]   20   15   14   13    3
[2,]   21    4   25    6    9
[3,]    2    1   23   18   11
[4,]   17   12   22   24    5
[5,]   10    8   16   19    7

correction, for n=6 as well

     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]    4   12   11   23    8   32
[2,]   17   15   14   33    5   30
[3,]   35   28   27    7   13   22
[4,]   31    1    6    2   21   29
[5,]   25   36   20   34   16   19
[6,]   26   10   24    3    9   18

but larger values of n require a less frontal attack… Simulated annealing maybe.
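Since the golfed code above redefines `?` as rowSums and `+` as sample, here is an un-golfed sketch of the same random search, for readability only. The names integer_means and random_grid, as well as the max_tries safeguard, are my own additions and not part of the original code.

```r
# TRUE if every row and every column of the n x n grid x has an integer average,
# i.e. if all row and column sums are multiples of n
integer_means <- function(x) {
  n <- nrow(x)
  all(c(rowSums(x), colSums(x)) %% n == 0)
}

# Random search: reshuffle 1..n^2 until the grid qualifies
# (max_tries is a hypothetical safeguard, absent from the original loop)
random_grid <- function(n, max_tries = 1e6) {
  for (i in seq_len(max_tries)) {
    x <- matrix(sample(n^2), n)
    if (integer_means(x)) return(x)
  }
  NULL  # no solution found within max_tries
}

random_grid(3)
```

The checker can also be applied to any candidate grid, for instance to the n=5 and n=6 solutions printed above.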

the limits of R

Posted in Books, pictures, R, Statistics on August 10, 2020 by xi'an

It has been repeated many times on many platforms that the R (or R₀) number is not a great summary of the COVID-19 pandemic, see, e.g., Rossman’s warning in The Conversation, but Nature chose to stress it one more time (in its 16 Jul edition). Or twice when considering a similar piece in Nature Physics. As Boris Johnson made it a central tool of his governmental communication policy. And some mayors started asking for their own local R numbers! It is obviously tempting to turn the messy and complex reality of this planetary crisis into a single number and even a single indicator R<1, but it is unhelpful and worse, from the epidemiology models being wrong (or at least oversimplifying) to the data being wrong (i.e., incomplete, biased and late), to the predictions being wrong (except for predicting the past). Nothing outrageous from the said Nature article, pointing out diverse degrees of uncertainty and variability and stressing the need to immediately address clusters rather than using the dummy R. As an aside, the repeated use of nowcasting instead of forecasting sounds like a perfect journalist fad, given that it does not seem to be based on a different model of infection or on a different statistical technique. (There is a nowcasting package in R, though!) And a wee bit later I was pointed to an extended discussion of an R estimation paper on Radford Neal’s blog.

[The Art of] Regression and other stories

Posted in Books, R, Statistics, University life on July 23, 2020 by xi'an

CoI: Andrew sent me this new book [scheduled for 23 July on amazon] of his with Jennifer Hill and Aki Vehtari. Which I read in my garden over a few sunny morns. And as Andrew and Aki are good friends of mine, this review is definitely subjective and biased! Hence to be taken with a spoonful of salt.

The “other stories” in the title is a very nice touch. And a clever idea. As the construction of regression models comes as a story to tell, from gathering and checking the data, to choosing the model specifications, to analysing the output and setting the safety lines on its interpretation and usages. I added “The Art of” in my own title as the exercise sounds very much like an art and very little like a technical or even less mathematical practice. Even though the call to the resident stan_glm R function is ubiquitous.

The style itself is very story-like, very far from a mathematical statistics book such as, e.g., C.R. Rao’s Linear Statistical Inference and Its Applications. Or his earlier Linear Models which I got while drafted in the Navy. While this makes the “Stories” part most relevant, I also wonder how I could teach from this book to my own undergrad students without acquiring first (myself) the massive expertise represented by the opinions and advice on what is correct and what is not in constructing and analysing linear and generalised linear models. In the sense that I would find justifying or explaining opinionated sentences an amathematical challenge. On the other hand, it would make for great remote course material, leading the students through the many chapters and letting them experiment with the code provided therein, creating new datasets and checking modelling assumptions. The debate between Bayesian and likelihood solutions is quite muted, with a recommendation for weakly informative priors superseded by the call for exploring the impact of one’s assumptions. (Although the horseshoe prior makes an appearance, p.209!) The chapter on math and probability is somewhat superfluous as I hardly fathom a reader entering this book without a certain amount of math and stats background. (While the book warns about over-trusting bootstrap outcomes, I find the description in the Simulation chapter a wee bit too vague.) The final chapters about causal inference are quite impressive in their coverage but clearly require a significant amount of investment from the reader to truly ingest these 110 pages.

“One thing that can be confusing in statistics is that similar analyses can be performed in different ways.” (p.121)

Unsurprisingly, the authors warn the reader about simplistic and unquestioning usages of linear models and software, with a particularly strong warning about significance. (Remember Abandon Statistical Significance?!) And keep (rightly) arguing about the importance of fake data comparisons (although this can be overly confident at times). Great Chapter 11 on assumptions, diagnostics and model evaluation. And terrific Appendix B on 10 pieces of advice for improving one’s regression model. Although there are two or three pages on the topic, at the very end, I would have also appreciated a more balanced and constructive coverage of machine learning as it remains a form of regression, which can be evaluated by simulation of fake data and assessed by cross-validation, hence quite within the range of the book.

The document reads quite well, even pleasantly once one is over the shock at the limited amount of math formulas!, my only grumble being a terrible handwritten graph for building copters (Figure 1.9) and the numerous and sometimes gigantic square root symbols throughout the book. At a more meaningful level, it may feel somewhat US-centric, at least given the large fraction of examples dedicated to US elections. (Even though restating the precise predictions made by decent models on the eve of the 2016 election is worthwhile.) The Oscar for the best section title goes to “Cockroaches and the zero-inflated negative binomial model” (p.248)! But overall this is a very modern, stats centred, engaging and careful book on the most common tool of statistical modelling! More stories to come maybe?!

le compte est bon

Posted in Books, Kids, R on July 22, 2020 by xi'an

The Riddler asks how to derive 24 from (1,2,3,8), with each number appearing once and all operations (x,+,/,-,^) allowed. This reminded me of a very old TV show on French TV, called Le compte est bon!, where players were given 5 or 6 numbers and supposed to find a given total within 60 seconds. Unsurprisingly there is an online solver for this game, as shown above, e.g., 24=(8+3+1)x2. But it proves unable to solve the puzzle when the input is 24 and (2,3,3,4), only using 2, 3 and 4, since 24=2x3x4. Introducing powers as well, since exponentiation is allowed, leads to several solutions, (4-2)³x3=(4/2)³x3=(3²-3)x4=3/(2/4)³=24… Not fun!

I however rewrote an R code to check whether 24 was indeed a possibility allowed with such combinations but could not find an easy way to identify which combination was used, although a pedestrian version eventually worked! And exhibited the slightly less predictable 4^(3/2)x3=24!
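For the record, one way to keep track of the combination used is to carry the expression as a string alongside each intermediate value. What follows is only a minimal brute-force sketch under that assumption, not the code mentioned above: the function solve24 and its arguments are hypothetical names, and every operation is fully parenthesised so that each returned string can be re-evaluated as is.

```r
# Repeatedly replace two of the remaining values by the result of one operation,
# carrying along the matching expression; return all expressions hitting the target
solve24 <- function(vals, exprs = as.character(vals), target = 24, tol = 1e-9) {
  if (length(vals) == 1) {
    if (abs(vals - target) < tol) return(exprs)
    return(character(0))
  }
  ops <- c("+", "-", "*", "/", "^")
  found <- character(0)
  for (i in seq_along(vals)) for (j in seq_along(vals)) {
    if (i == j) next
    for (op in ops) {
      v <- suppressWarnings(do.call(op, list(vals[i], vals[j])))
      if (!is.finite(v)) next   # skip divisions by zero, overflows and NaN powers
      e <- paste0("(", exprs[i], op, exprs[j], ")")
      found <- c(found,
                 solve24(c(vals[-c(i, j)], v), c(exprs[-c(i, j)], e), target, tol))
    }
  }
  unique(found)
}

sols <- solve24(c(2, 3, 3, 4))
"((4^(3/2))*3)" %in% sols   # TRUE: the fractional power solution is recovered
```

The search is exhaustive over ordered pairs and the five operations, hence returns many equivalent rewritings of the same solution (commuted sums and products in particular), which a less pedestrian version would filter out.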