biased sample!
A chance occurrence led me to this thread on R-devel about R’s sample function generating a bias by taking the integer part of a rescaled draw from the continuous uniform generator… And then to the note by Kellie Ottoboni and Philip Stark analysing the reason, namely the fact that R’s uniform [0,1) pseudo-random generator is not perfectly continuously uniform but discrete, by the nature of numbers on a computer. Knuth (1997) showed that in this case the range of probabilities is larger than (1,1), the largest range being (1,1.03). As noted in the note, the bias can be avoided by exploiting directly the pseudo-random bits of the pseudo-random generator. Shocking, isn’t it! A fast and bias-free alternative suggested by Lemire is available as
dqsample::sample
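To make the mechanism concrete, here is a minimal sketch of where the bias comes from. It is written in Python rather than R so that the integer arithmetic is exact. It assumes an idealised generator that returns k/2^w with all 2^w values of k equally likely, and a toy word size w = 10 so that every output can be enumerated. The function names are hypothetical and none of this is R's internal code. The first two functions count how many generator outputs the multiply-and-floor rule sends to each index. The third illustrates the bit-based idea with plain rejection sampling, which is a simple stand-in and not Lemire's algorithm as implemented in dqsample.

```python
import random
from collections import Counter
from fractions import Fraction


def rounding_counts(m, w):
    """For each index j in 0..m-1, count the k in 0..2**w - 1 with
    floor(m * k / 2**w) == j, i.e. how many equally likely generator
    outputs k / 2**w the multiply-and-floor rule maps to j."""
    if not 1 <= m <= 2 ** w:
        raise ValueError("need 1 <= m <= 2**w")
    # (m * k) >> w is floor(m * k / 2**w) for non-negative integers
    counts = Counter((m * k) >> w for k in range(2 ** w))
    return [counts[j] for j in range(m)]


def selection_ratio(m, w):
    """Exact ratio of the largest to the smallest selection probability."""
    counts = rounding_counts(m, w)
    return Fraction(max(counts), min(counts))


def sample_index_from_bits(m, rng):
    """Bias-free sketch: draw just enough random bits to cover 0..m-1
    and reject any value >= m, so each accepted index is equally likely."""
    if m < 1:
        raise ValueError("need m >= 1")
    if m == 1:
        return 0
    nbits = (m - 1).bit_length()
    while True:
        j = rng.getrandbits(nbits)
        if j < m:
            return j


if __name__ == "__main__":
    print(rounding_counts(3, 10))    # [342, 341, 341]
    print(selection_ratio(3, 10))    # 342/341
    print(selection_ratio(600, 10))  # 2
    rng = random.Random(2019)        # arbitrary seed, for reproducibility only
    print(sample_index_from_bits(600, rng))
```

Under this toy model the imbalance is negligible when m is tiny relative to 2^w (342 against 341 outputs for m = 3), but it grows as m approaches 2^w: with m = 600 and 1024 possible outputs, some indices receive two outputs and others only one. The rejection version has no such imbalance, at the cost of a random number of draws.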
July 11, 2019 at 8:11 am
[…] should be able to detect any defect pretty fast, although awareness of the incredible failure of sample() reported in an earlier post took a while to […]
May 21, 2019 at 4:18 am
I believe the new version of R (which is version 3.6.0 released on 26th April) may have addressed this issue, since one of its new features is:
“The default method for generating from a discrete uniform distribution (used in sample(), for instance) has been changed. This addresses the fact, pointed out by Ottoboni and Stark, that the previous method made sample() noticeably non-uniform on large populations.”
If so, it is good to know that this defect has (finally!) been taken into account.
May 21, 2019 at 8:25 am
Ah great news!
May 21, 2019 at 11:29 am
The relevant entry (https://bugs.r-project.org/bugzilla/show_bug.cgi?id=17494) in R’s Bugzilla explicitly mentions the thread quoted in Xian’s blog entry as a discussion of the problem fixed in 3.6.0. The NEWS entry also mentions that the previous algorithm can be selected if backward compatibility/reproducibility is an issue.
So, all in all, good news…