Archive for R

folded Normals

Posted in Books, Kids, pictures, R, Running, Statistics with tags , , , , , , , , , , , , on February 25, 2021 by xi'an

While having breakfast (after an early morn swim at the vintage La Butte aux Cailles pool, which let me in free!), I noticed a letter to the Editor in the Annals of Applied Statistics, which I was unaware existed. (The concept, not this specific letter!) The point of the letter was to indicate that finding the MLE for the mean and variance of a folded normal distribution was feasible without resorting to the EM algorithm. Since the folded normal distribution is a special case of mixture (with fixed weights), using EM is indeed quite natural, but the author, Iain MacDonald, remarked that an optimiser such as R nlm() could be called instead. The few lines of relevant R code were even included. While this is a correct if minor remark, I am a wee bit surprised at seeing it included in the journal, the more because the authors of the original paper using the EM approach were given the opportunity to respond, noticing EM is much faster than nlm in the cases they tested, and Iain MacDonald had a further rejoinder! The more because the Wikipedia page mentioned the use of optimisers much earlier (and pointed out at the R package Rfast as producing MLEs for the distribution).

why does rbinom(1,1) differ from sample(0:1,1) with the same seed?

Posted in Statistics with tags , , , , , , , , , on February 17, 2021 by xi'an
> set.seed(1)
> rbinom(10,1,0.5)
 [1] 0 0 1 1 0 1 1 1 1 0

> set.seed(1) > sample(c(0,1), 10, replace = TRUE) [1] 0 1 0 1 0 0 1 0 1 1

This rather legitimate question was posted on X validated last week, the answer being that the C codes behind both functions do not use pseudo-random generators in the same manner. For instance, rbinom does get involved beyond a mean value of 30 (and otherwise resorts to the inverse cdf approach). And following worries about sample biases, sample was updated in 2019 (and also seems to resort to the inverse cdf when the mean is less than 200). However, when running the above code on my machine, still using the 2018 R version 3.4.4!, I recover the same outcome:

> set.seed(1)
> rbinom(10,1,0.5)
 [1] 0 0 1 1 0 1 1 1 1 0

> set.seed(1)
> sample(c(0,1), 10, replace = TRUE)
 [1] 0 0 1 1 0 1 1 1 1 0

> set.seed(1) > qbinom(runif(10),1,0.5) [1] 0 0 1 1 0 1 1 1 1 0
> set.seed(1) > 1*(runif(10)>.5) [1] 0 0 1 1 0 1 1 1 1 0

easy and uneasy riddles

Posted in Books, Kids, R with tags , , , , , on February 2, 2021 by xi'an

On 15 January, The Riddler had both a straightforward and a challenging riddles. The first one was to optimise the choice of a real number d with the utility function U(d,θ)=d ℑ(θ>d), when θ is Uniform(0,100). Leading unsurprisingly to d=50…

The tough(er) one was to solve a form of sudoku where the 24 entries of a 8×3 table are integers in {1,…,9} and the information is provided by the row-wise and column-wise products of these integers. The vertical margins are

294, 216, 135, 98, 112, 84, 245, 40

and the horizontal margins are

8 890 560, 156 800, 55 566

After an unsuccessful brute-force (and pseudo-annealed) attempt achieving a minimum error of 127, although using the prime factor decompositions of these 11 margins, I realised that some entries were known: e.g., 7 at (1,2), 5 at (3,2), and 7 at (7,3), and (much later) that the (huge) product value for the first column implied that each term in that column had to be the maximal possible value for the corresponding rows, except for 5 on row 7. This leads to the starting grid

    [,1] [,2] [,3]
[1,]    7    7    6
[2,]    9    0    0
[3,]    9    5    3
[4,]    7    0    0
[5,]    8    0    0
[6,]    7    0    0
[7,]    5    7    7
[8,]    8    0    0

and an additional and obvious exclusion based on the absence of 3’s in the second column, of 5’s and 2’s in the third column shows there was a unique solution

    [,1] [,2] [,3]
[1,]    7    7    6
[2,]    9    8    3
[3,]    9    5    3
[4,]    7    2    7
[5,]    8    2    7
[6,]    7    4    3
[7,]    5    7    7
[8,]    8    5    1

as also demonstrated by a complete exploration with R:

Try it online!

wrong algebra for slice sampler

Posted in Books, Kids, R, Statistics with tags , , , , , , , , , , , , on January 27, 2021 by xi'an

Once more, and thrice alas!, I became aware of a typo in our “Use R!” book through a question on X validated from a reader unable to reproduce the slice of a basic 2D slice sampler for a logistic regression with coefficients (a,b). Indeed, our slice reads as the incorrect set (missing the i=1,…,n)

\left\{ (a,b): y_i(a+bx_i) > \log \frac{u_i}{1-u_i} \right\}

when it should have been

\bigcap_{i=1} \left\{ (a,b)\,:\ (-1)^{y_i}(a+bx_i) > \log\frac{u_i}{1-u_i} \right\}

which is the version I found in my LaTeX file. So I do not know what happened (unless I corrected the LaTeX file at a later date and cannot remember it, but the latest chance on the file reads October 2011…). Fortunately, the resulting slices in a and b and the following R code remain correct. Unfortunately, both French and Japanese translations reproduce the mistake…

Kempner Fi

Posted in Books, Kids, R, Statistics with tags , , , , , , , on January 19, 2021 by xi'an

A short code-golf challenge led me to learn about the Kempner series, which is the series made of the inverted integers, excluding all those containing the digit 9. Most surprisingly this exclusion is enough to see the series converging (close to 23). The explanation for this convergence is that, citing Wikipedia,

“The number of n-digit positive integers that have no digit equal to ‘9’ is 8 × 9n−1

and since the inverses of these n-digit positive integers are less than 101−n the series is bounded by 80. In simpler terms, it converges because the fraction of remaining terms in the series is geometrically decreasing as (9/10)1−n. Unsurprisingly (?) the series is also atrociously slow to converge (for instance the first million terms sum up to 11) and there exist recurrence representations that speed up its computation.  Here is the code-golf version

n=scan()+2
while(n<-n-1){
F=F+1/T
while(grepl(9,T<-T+1))0}
F

that led me to learn about the R function grepl. (The explanation for the pun in the title is that Semper Fidelis is the motto of the corsair City of Saint-Malo or Sant-Maloù, Brittany.)