biased sample!

A chance occurrence led me to this thread on R-devel about the R sample function generating a bias by taking the integer part of the output of the continuous uniform generator… And then to the note by Kellie Ottoboni and Philip Stark analysing the reason, namely the fact that the R uniform [0,1) pseudo-random generator is not perfectly continuously uniform but discrete, by the nature of numbers on a computer. Knuth (1997) showed that in this case the range of the (relative) selection probabilities is wider than (1,1), the widest range being (1,1.03). As noted in the note, a better approach is to exploit directly the pseudo-random bits of the pseudo-random generator. Shocking, isn’t it! A fast and bias-free alternative suggested by Lemire is available as dqsample::sample.
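To see where the bias comes from, here is a minimal Python sketch (not R's actual code) under a toy assumption: the uniform generator returns one of the 2^w equally spaced values k/2^w, and an index is obtained as the integer part of m times that value. The function names and the word size w are purely illustrative. Enumerating all 2^w outputs shows that, unless m divides 2^w, some indices receive one more generator value than others.

```python
from collections import Counter
from fractions import Fraction


def floor_method_counts(m, w):
    """Count how many of the 2**w equally likely values U = k / 2**w
    are mapped to each index floor(m * U) in {0, ..., m-1}."""
    # (m * k) >> w is exactly floor(m * k / 2**w) in integer arithmetic
    counts = Counter((m * k) >> w for k in range(2 ** w))
    return [counts[i] for i in range(m)]


def max_min_ratio(m, w):
    """Ratio of the largest to the smallest selection probability."""
    counts = floor_method_counts(m, w)
    return Fraction(max(counts), min(counts))


# Toy case: a 3-bit generator (8 possible values) and a population of size 3
print(floor_method_counts(3, 3))  # [3, 3, 2]
print(max_min_ratio(3, 3))        # 3/2
```

With a realistic word size the imbalance is of course much smaller for small m, but it grows as m approaches 2^w, which is the "large populations" regime.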

Update, June 2019: sample is now fixed.
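For illustration, working directly from pseudo-random bits can be sketched as plain rejection sampling: draw just enough bits to cover {0, ..., m-1} and redraw whenever the result falls outside that set. This is a generic Python sketch under my own naming (`sample_index_from_bits` is hypothetical), not the code used in R or in dqsample; Lemire's method in particular is a different, multiplication-based variant of the same rejection idea.

```python
import random


def sample_index_from_bits(m, getrandbits=random.getrandbits):
    """Draw an integer uniformly from {0, ..., m-1} from random bits,
    rejecting candidates that fall outside the target range."""
    if m < 1:
        raise ValueError("m must be a positive integer")
    nbits = (m - 1).bit_length()  # fewest bits able to represent m - 1
    while True:
        # Each nbits-bit pattern is equally likely, so every accepted
        # value in {0, ..., m-1} has exactly the same probability.
        candidate = getrandbits(nbits) if nbits else 0
        if candidate < m:
            return candidate


# Usage: a 0-based index into a population of size 10
print(sample_index_from_bits(10))
```

Since at least half of the nbits-bit patterns are accepted, the expected number of draws per index is below two.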

4 Responses to “biased sample!”

  1. […] should be able to detect any defect pretty fast, although awareness of the incredible failure of sample() reported in an earlier post took a while to […]

  2. I believe the new version of R (which is version 3.6.0 released on 26th April) may have addressed this issue, since one of its new features is:

    “The default method for generating from a discrete uniform distribution (used in sample(), for instance) has been changed. This addresses the fact, pointed out by Ottoboni and Stark, that the previous method made sample() noticeably non-uniform on large populations.”

    If so, it is good to know that this defect has (finally!) been taken into account.
