**A** chance occurrence led me to this thread on R-devel about R sample function generating a bias by taking the integer part of the continuous uniform generator… And then to the note by Kellie Ottoboni and Philip Stark analysing the reason, namely the fact that R uniform [0,1) pseudo-random generator is not perfectly continuously uniform but discrete, by the nature of numbers on a computer. Knuth (1997) showed that in this case the range of probabilities is larger than (1,1), the largest range being (1,1.03). As noted in the note, exploiting directly the pseudo-random bits of the pseudo-random generator. Shocking, isn’t it! A fast and bias-free alternative suggested by Lemire is available as `dqsample::sample`

## Archive for sample

## biased sample!

Posted in Statistics with tags bias, Donald Knuth, dqsample, integers, PRNG, pseudo-random generator, R, random bit, rounding, sample, The Art of Computer Programming, uniform distribution on May 21, 2019 by xi'an## a perfectly normally distributed sample

Posted in R, Statistics with tags bayestestR, blog, R, R-bloggers, random number generator, rnorm(), rnorm_perfect(), sample, sampling on May 9, 2019 by xi'an**W**hen I saw this title on R-bloggers, I was wondering how “more perfect” a Normal sample could be when compared with the outcome of rnorm(n). Hence went checking the original blog on `bayestestR`

in search of more information. Which was stating nothing more than how to generate a sample is *perfectly* normal by using the `rnorm_perfect`

function. Still unsure of the meaning, I contacted one of the contributors who replied very quickly

…that’s actually a good question. I would say an empirical sample having characteristics as close as possible to a cannonic gaussian distribution.

`bayestestR`

and opened the `rnorm_perfect`

function. Which is simply the sequence of n-quantilesstats::qnorm(seq(1/n, 1 – 1/n, length.out = n), mean, sd)

## Le Monde puzzle [#1001]

Posted in Kids, R with tags Le Monde, mathematical puzzle, R, sample, sudoku on March 27, 2017 by xi'an**A**fter a long lag *(due to my missing the free copies distributed at Paris-Dauphine!)*, here is a Sudoku-like Le Monde mathematical puzzle:

A grid of size (n,n)holds integer values such that any entry larger than 1 is the sum of one term in the same column and one term in the same row. What is the maximal possible value observed in such a grid when n=3,4?

**T**his can be solved in R by a random exploration of such possible grids in a simulated annealing spirit:

mat=matrix(1,N,N) goal=1 targ=function(mat){ #check constraints d=0 for (i in (1:(N*N))[mat>1]){ r=(i-1)%%N+1;c=(i-1)%/%N+1 d=d+(min(abs(mat[i]-outer(mat[-r,c],mat[r,-c],"+")))>0)} return(d)} cur=0 for (t in 1:1e6){ i=sample(1:(N*N),1);prop=mat prop[i]=sample(1:(2*goal),1) d=targ(prop) if (10*log(runif(1))/t<cur-d){ mat=prop;cur=d} if ((d==0)&(max(prop)>goal)){ goal=max(prop);maxx=prop}}

returning a value of 8 for n=3 and 37 for n=4. However, the method is quite myopic and I tried instead a random filling of the grid, using each time the maximum possible sum for empty cells:

goal=1 for (v in 1:1e6){ mat=matrix(0,N,N) #one 1 per row/col for (i in 1:N) mat[i,sample(1:N,1)]=1 for (i in 1:N) if (max(mat[,i])==0) mat[sample(1:N,1),i]=1 while (min(mat)==0){ parm=sample(1:(N*N)) #random order for (i in parm[mat[parm]==0]){ r=(i-1)%%N+1;c=(i-1)%/%N+1 if ((max(mat[-r,c])>0)&(max(mat[r,-c])>0)){ mat[i]=max(mat[-r,c])+max(mat[r,-c]) break()}}} if (goal<max(mat)){ goal=max(mat);maxx=mat}}

which recovered a maximum of 8 for n=3, but reached 48 for n=4. And 211 for n=5, 647 for n=6… For instance, here is the solution for n=4:

[1,] 1 5 11 10 [2,] 2 4 1 5 [3,] 48 2 24 1 [4,] 24 1 22 11

## Le Monde puzzle [#887bis]

Posted in Kids, R, Statistics, University life with tags Le Monde, mathematical puzzle, sample on November 16, 2014 by xi'an

**A**s mentioned in the previous post, an alternative consists in finding the permutation of {1,…,N} by “adding” squares left and right until the permutation is complete or no solution is available. While this sounds like the dual of the initial solution, it brings a considerable improvement in computing time, as shown below. I thus redefined the construction of the solution by initialising the permutation at random (it could start at 1 just as well)

perfect=(1:trunc(sqrt(2*N)))^2 perm=friends=(1:N) t=1 perm[t]=sample(friends,1) friends=friends[friends!=perm[t]]

and then completing only with possible neighbours, left

out=outer(perfect-perm[t],friends,"==") if (max(out)==1){ t=t+1 perm[t]=sample(rep(perfect[apply(out,1, max)==1],2),1)-perm[t-1] friends=friends[friends!=perm[t]]}

or right

out=outer(perfect-perm[1],friends,"==") if (max(out)==1){ t=t+1 perf=sample(rep(perfect[apply(out,1, max)==1],2),1)-perm[1] perm[1:t]=c(perf,perm[1:(t-1)]) friends=friends[friends!=perf]}

(If you wonder about why the *rep* in the *sample* step, it is a trick I just found to avoid the insufferable feature that sample(n,1) is equivalent to sample(1:n,1)! It costs basically nothing but bypasses reprogramming sample() each time I use it… I am very glad I found this trick!) The gain in computing time is amazing:

> system.time(for (i in 1:50) pick(15)) utilisateur système écoulé 5.397 0.000 5.395 > system.time(for (i in 1:50) puck(15)) utilisateur système écoulé 0.285 0.000 0.287

An unrelated point is that a more interesting (?) alternative problem consists in adding a toroidal constraint, namely to have the first and the last entries in the permutation to also sum up to a perfect square. Is it at all possible?

## fine-sliced Poisson [a.k.a. sashimi]

Posted in Books, Kids, pictures, R, Running, Statistics, University life with tags convergence assessment, ENSAE, Gibbs sampling, Monte Carlo Statistical Methods, Poisson distribution, R, sample, sampling from an atomic population, slice sampling, slow convergence on March 20, 2014 by xi'an**A**s my student Kévin Guimard had not mailed me his own Poisson slice sampler of a Poisson distribution, I could not tell why the code was not working! My earlier post prompted him to do so and a somewhat optimised version is given below:

nsim = 10^4 lambda = 6 max.factorial = function(x,u){ k = x parf=1 while (parf*u<1){ k = k + 1 parf = parf * k } k = k - (parf*u>1) return (k) } x = rep(floor(lambda), nsim) for (t in 2:nsim){ v1 = ceiling((log(runif(1))/log(lambda))+x[t-1]) ranj=max(0,v1):max.factorial(x[t-1],runif(1)) x[t]=sample(ranj,size=1) } barplot(as.vector(rbind( table(x)/length(x),dpois(min(x):max(x), lambda))),col=c("sienna","gold"))

**A**s you can easily check by running the code, it does not work. My student actually majored my MCMC class and he spent quite a while pondering why the code was not working. I did ponder as well for a part of a morning in Warwick, removing causes for exponential or factorial overflows (hence the shape of the code), but not eliciting the issue… (This now sounds like lethal fugu sashimi! ) Before reading any further, can you spot the problem?!

**T**he corrected R code is as follows:

x = rep(lambda, nsim) for (t in 2:nsim){ v1=ceiling((log(runif(1))/log(lambda))+x[t-1]) ranj=max(0,v1):max.factorial(x[t-1],runif(1)) if (length(ranj)>1){ x[t] = sample(ranj, size = 1) }else{ x[t]=ranj} }

**T**he culprit is thus the R function * sample* which simply does not understand Dirac masses and the basics of probability! When running

> sample(150:150,1) [1] 23

you can clearly see where the problem stands…! Well-documented issue with * sample* that already caused me woes… Another interesting thing about this slice sampler is that it is awfully slow in exploring the tails. And to converge to the centre from the tails. This is not very pronounced in the above graph with a mean of 6. Moving to 50 makes it more apparent:

**T**his is due to the poor mixing of the chain, as shown by the raw sequence below, which strives to achieve a single cycle out of 10⁵ iterations! In any case, thanks to Kévin for an interesting morning!

## Random graphs with fixed numbers of neighbours

Posted in Books, R, Statistics with tags edges, graphs, hcl, Le Monde, nodes, R, R bug, R-graphics, sample, simulation on November 25, 2010 by xi'an**I**n connection with Le Monde puzzle #46, I eventually managed to write an R program that generates graphs with a given number *n* of nodes and a given number *k* of edges leaving each of those nodes. (My early attempt was simply too myopic to achieve any level of success when *n* was larger than 10!) Here is the core of the R code: