Archive for sample

why does rbinom(1,1) differ from sample(0:1,1) with the same seed?

Posted in Statistics with tags , , , , , , , , , on February 17, 2021 by xi'an
> set.seed(1)
> rbinom(10,1,0.5)
 [1] 0 0 1 1 0 1 1 1 1 0

> set.seed(1) > sample(c(0,1), 10, replace = TRUE) [1] 0 1 0 1 0 0 1 0 1 1

This rather legitimate question was posted on X validated last week, the answer being that the C codes behind both functions do not use pseudo-random generators in the same manner. For instance, rbinom does get involved beyond a mean value of 30 (and otherwise resorts to the inverse cdf approach). And following worries about sample biases, sample was updated in 2019 (and also seems to resort to the inverse cdf when the mean is less than 200). However, when running the above code on my machine, still using the 2018 R version 3.4.4!, I recover the same outcome:

> set.seed(1)
> rbinom(10,1,0.5)
 [1] 0 0 1 1 0 1 1 1 1 0

> set.seed(1)
> sample(c(0,1), 10, replace = TRUE)
 [1] 0 0 1 1 0 1 1 1 1 0

> set.seed(1) > qbinom(runif(10),1,0.5) [1] 0 0 1 1 0 1 1 1 1 0
> set.seed(1) > 1*(runif(10)>.5) [1] 0 0 1 1 0 1 1 1 1 0

the strange occurrence of the one bump

Posted in Books, Kids, R, Statistics with tags , , , , , , , , on June 8, 2020 by xi'an

When answering an X validated question on running an accept-reject algorithm for the Gamma distribution by using a mixture of Beta and drifted (bt 1) Exponential distributions, I came across the above glitch in the fit of my 10⁷ simulated sample to the target, apparently displaying a wrong proportion of simulations above (or below) one.

a=.9
g<-function(T){
  x=rexp(T)
  v=rt(T,1)<0
  x=c(1+x[v],exp(-x/a)[!v])
  x[runif(T)<x^a/x/exp(x)/((x>1)*exp(1-x)+a*(x<1)*x^a/x)*a]}

It took me a while to spot the issue, namely that the output of

  z=g(T)
  while(sum(!!z)<T)z=c(z,g(T))
  z[1:T]

was favouring simulations from the drifted exponential by truncating. Permuting the elements of z before returning solved the issue (as shown below for a=½)!

CRAN does not validate R packages!

Posted in University life, pictures, R with tags , , , , , , , , , , on July 10, 2019 by xi'an

A friend called me the other day for advice on how to submit an R package to CRAN along with a proof his method was mathematically sound. I replied with some items of advice taken from my (limited) experience with submitting packages. And with the remark that CRAN would not validate the mathematical contents of the associated package manual. Nor even the validity of the R code towards delivering the right outcome as stated in the manual. This shocked him quite seriously as he thought having a package accepted by CRAN was a stamp of validation of both the method and the R code. It would be nice of course but would require so much manpower that it seems unrealistic. Some middle ground is to aim at a journal or a peer community validation where both code and methods are vetted. Which happens for instance with the Journal of Computational and Graphical Statistics. Or the Journal of Statistical Software (which should revise its instructions to authors that states “The majority of software published in JSS is written in S, MATLAB, SAS/IML, C++, or Java”. S, really?!)

As for the validity of the latest release of R (currently R-3.6.1 which came out on 2019-07-05, named Action of the Toes!), I figure the bazillion R programs currently running should be able to detect any defect pretty fast, although awareness of the incredible failure of sample() reported in an earlier post took a while to appear.

biased sample!

Posted in Statistics with tags , , , , , , , , , , , on May 21, 2019 by xi'an

A chance occurrence led me to this thread on R-devel about R sample function generating a bias by taking the integer part of the continuous uniform generator… And then to the note by Kellie Ottoboni and Philip Stark analysing the reason, namely the fact that R uniform [0,1) pseudo-random generator is not perfectly continuously uniform but discrete, by the nature of numbers on a computer. Knuth (1997) showed that in this case the range of probabilities is larger than (1,1), the largest range being (1,1.03). As noted in the note, exploiting directly the pseudo-random bits of the pseudo-random generator. Shocking, isn’t it!  A fast and bias-free alternative suggested by Lemire is available as dqsample::sample

As an update of June 2019, sample is now fixed.

a perfectly normally distributed sample

Posted in R, Statistics with tags , , , , , , , , on May 9, 2019 by xi'an

When I saw this title on R-bloggers, I was wondering how “more perfect” a Normal sample could be when compared with the outcome of rnorm(n). Hence went checking the original blog on bayestestR in search of more information. Which was stating nothing more than how to generate a sample is perfectly normal by using the rnorm_perfect function. Still unsure of the meaning, I contacted one of the contributors who replied very quickly

…that’s actually a good question. I would say an empirical sample having characteristics as close as possible to a cannonic gaussian distribution.
and again leaving me hungering for more details. I thus downloaded the package bayestestR and opened the rnorm_perfect function. Which is simply the sequence of n-quantiles
stats::qnorm(seq(1/n, 1 – 1/n, length.out = n), mean, sd)
which I would definitely not call a sample as it has nothing random. And perfect?! Not really, unless one associates randomness and imperfection.