biased sample!

Posted in Statistics on May 21, 2019 by xi'an

A chance occurrence led me to this thread on R-devel about the R sample function generating a bias by taking the integer part of a continuous uniform generator… And then to the note by Kellie Ottoboni and Philip Stark analysing the reason, namely the fact that the R uniform [0,1) pseudo-random generator is not perfectly continuously uniform but discrete, by the nature of numbers on a computer. Knuth (1997) showed that in this case the selection probabilities are no longer all equal, the ratio between the largest and the smallest probability reaching up to 1.03. As noted in their note, an unbiased alternative consists in exploiting directly the pseudo-random bits of the generator rather than its [0,1) output. Shocking, isn't it! A fast and bias-free alternative suggested by Lemire is available as `dqsample::sample`.

Update (June 2019): the sample function has now been fixed.
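To see the mechanism at work without touching R's actual generator, here is a small sketch of my own (the 32-point grid is a deliberately coarse stand-in for the discrete support of runif, not R's real resolution): mapping a discrete uniform to 1..m by taking the integer part makes some outcomes strictly more likely than others.

```r
# Illustration of the bias with a coarse discrete "uniform" on 32 grid
# points, mapped to 1..m by taking the integer part, as sample() used
# to do with the output of the uniform generator.
m <- 5
grid <- (0:31) / 32          # low-resolution stand-in for runif()
draws <- floor(m * grid) + 1 # integer-part mapping to 1..m
table(draws)                 # outcomes 1 and 3 get 7 grid points, the others 6
```

With R's 32-bit generator the imbalance is of course far smaller, but it comes from the very same integer-part mapping; drawing the pseudo-random bits directly, as in `dqsample::sample`, avoids it altogether.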

a perfectly normally distributed sample

Posted in R, Statistics on May 9, 2019 by xi'an

When I saw this title on R-bloggers, I was wondering how "more perfect" a Normal sample could be when compared with the outcome of rnorm(n). Hence went checking the original blog on `bayestestR` in search of more information. Which was stating nothing more than how to generate a sample that is perfectly normal by using the `rnorm_perfect` function. Still unsure of the meaning, I contacted one of the contributors, who replied very quickly

…that's actually a good question. I would say an empirical sample having characteristics as close as possible to a canonical Gaussian distribution.
and again leaving me hungering for more details. I thus downloaded the package `bayestestR` and opened the `rnorm_perfect` function. Which is simply the sequence of n quantiles

```
stats::qnorm(seq(1/n, 1 - 1/n, length.out = n), mean, sd)
```

which I would definitely not call a sample as it involves nothing random. And perfect?! Not really, unless one associates randomness with imperfection.
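To make the point concrete, here is a minimal re-implementation following the one-line body quoted above (the function name and default arguments are my guesses at the package's interface): two calls return the exact same vector, which no random sample would.

```r
# Minimal sketch of rnorm_perfect as quoted above: deterministic
# Normal quantiles, not a random draw.
rnorm_perfect <- function(n, mean = 0, sd = 1)
  stats::qnorm(seq(1/n, 1 - 1/n, length.out = n), mean, sd)

x <- rnorm_perfect(100)
y <- rnorm_perfect(100)
identical(x, y) # TRUE: nothing random in there
```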

Le Monde puzzle [#1001]

Posted in Kids, R on March 27, 2017 by xi'an

After a long lag (due to my missing the free copies distributed at Paris-Dauphine!), here is a Sudoku-like Le Monde mathematical puzzle:

A grid of size (n,n) holds integer values such that any entry larger than 1 is the sum of one term in the same column and one term in the same row. What is the maximal possible value observed in such a grid when n=3,4?

This can be solved in R by a random exploration of such possible grids in a simulated annealing spirit:

```
N=3 #grid size (take N=3 or 4)
mat=matrix(1,N,N)
goal=1

targ=function(mat){ #number of constraint violations
  d=0
  for (i in (1:(N*N))[mat>1]){
    r=(i-1)%%N+1;c=(i-1)%/%N+1
    d=d+(min(abs(mat[i]-outer(mat[-r,c],mat[r,-c],"+")))>0)}
  return(d)}

cur=0
for (t in 1:1e6){
  i=sample(1:(N*N),1);prop=mat
  prop[i]=sample(1:(2*goal),1)
  d=targ(prop)
  if (10*log(runif(1))/t<cur-d){ #annealed acceptance step
    mat=prop;cur=d}
  if ((d==0)&(max(prop)>goal)){
    goal=max(prop);maxx=prop}}
```

returning a value of 8 for n=3 and 37 for n=4. However, the method is quite myopic and I tried instead a random filling of the grid, using each time the maximum possible sum for empty cells:

```
goal=1
for (v in 1:1e6){
  mat=matrix(0,N,N)
  #one 1 per row/col
  for (i in 1:N) mat[i,sample(1:N,1)]=1
  for (i in 1:N) if (max(mat[,i])==0) mat[sample(1:N,1),i]=1
  while (min(mat)==0){
    parm=sample(1:(N*N)) #random order
    for (i in parm[mat[parm]==0]){
      r=(i-1)%%N+1;c=(i-1)%/%N+1
      if ((max(mat[-r,c])>0)&(max(mat[r,-c])>0)){
        mat[i]=max(mat[-r,c])+max(mat[r,-c])
        break()}}}
  if (goal<max(mat)){
    goal=max(mat);maxx=mat}}
```

which recovered a maximum of 8 for n=3, but reached 48 for n=4. And 211 for n=5, 647 for n=6… For instance, here is the solution for n=4:

```
     [,1] [,2] [,3] [,4]
[1,]    1    5   11   10
[2,]    2    4    1    5
[3,]   48    2   24    1
[4,]   24    1   22   11
```
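As a sanity check (my own verification code, not part of the puzzle solution), one can verify that every entry larger than 1 in this grid is indeed the sum of one entry in its row and one entry in its column:

```r
# Check the n=4 grid above against the puzzle constraint.
mat <- matrix(c( 1, 5, 11, 10,
                 2, 4,  1,  5,
                48, 2, 24,  1,
                24, 1, 22, 11), nrow = 4, byrow = TRUE)
ok <- TRUE
for (i in 1:4) for (j in 1:4) if (mat[i, j] > 1)
  ok <- ok && any(outer(mat[i, -j], mat[-i, j], "+") == mat[i, j])
ok # TRUE: every entry >1 decomposes as a row term plus a column term
```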

Le Monde puzzle [#887bis]

Posted in Kids, R, Statistics, University life on November 16, 2014 by xi'an

As mentioned in the previous post, an alternative consists in finding the permutation of {1,…,N} by "adding" squares left and right until the permutation is complete or no solution is available. While this sounds like the dual of the initial solution, it brings a considerable improvement in computing time, as shown below. I thus redefined the construction of the solution by initialising the permutation at random (it could start at 1 just as well)

```
perfect=(1:trunc(sqrt(2*N)))^2
perm=friends=(1:N)
t=1
perm[t]=sample(friends,1)
friends=friends[friends!=perm[t]]
```

and then completing only with possible neighbours, left

```
out=outer(perfect-perm[t],friends,"==")
if (max(out)==1){
  t=t+1
  perm[t]=sample(rep(perfect[apply(out,1,max)==1],2),1)-perm[t-1]
  friends=friends[friends!=perm[t]]}
```

or right

```
out=outer(perfect-perm[1],friends,"==")
if (max(out)==1){
  t=t+1
  perf=sample(rep(perfect[apply(out,1,max)==1],2),1)-perm[1]
  perm[1:t]=c(perf,perm[1:(t-1)])
  friends=friends[friends!=perf]}
```
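For completeness, the fragments above can be assembled into a self-contained function (my own wrapper, not the pick() or puck() code timed below): grow a permutation of 1:N by adding, at either end, a number that completes a perfect square, restarting from scratch on dead ends.

```r
# Hedged, self-contained sketch of the left/right construction (my own
# assembly of the fragments, not the author's code).
square_perm <- function(N, tries = 1e5){
  perfect <- (1:trunc(sqrt(2 * N)))^2 # squares reachable as adjacent sums
  for (v in 1:tries){
    perm <- sample(N, 1)
    friends <- setdiff(1:N, perm)
    repeat{
      if (length(friends) == 0) return(perm)
      # numbers completing a square with either end of the current run
      left  <- intersect(perfect - perm[1], friends)
      right <- intersect(perfect - perm[length(perm)], friends)
      if (length(left) + length(right) == 0) break # dead end, restart
      k <- sample.int(length(left) + length(right), 1)
      if (k <= length(left)){
        perm <- c(left[k], perm);  friends <- setdiff(friends, left[k])
      } else {
        k <- k - length(left)
        perm <- c(perm, right[k]); friends <- setdiff(friends, right[k])
      }
    }
  }
  NULL # no solution found within the allotted tries
}
p <- square_perm(15) # e.g. 8 1 15 10 6 3 13 12 4 5 11 14 2 7 9
```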

(If you wonder why the rep in the sample step: it is a trick I just found to bypass the insufferable feature that sample(n,1) is equivalent to sample(1:n,1)! It costs basically nothing and avoids reprogramming sample() each time I use it…) The gain in computing time is amazing:
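To see the trick in isolation: duplicating the candidate values before calling sample() keeps the draw within the intended set even when a single candidate remains.

```r
# sample() treats a single number n as the range 1:n...
x <- 150
sample(x, 1)         # draws uniformly from 1:150, not 150!
sample(rep(x, 2), 1) # always returns 150
```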

```
> system.time(for (i in 1:50) pick(15))
   user  system elapsed
  5.397   0.000   5.395
> system.time(for (i in 1:50) puck(15))
   user  system elapsed
  0.285   0.000   0.287
```

An unrelated point is that a more interesting (?) alternative problem consists in adding a toroidal constraint, namely to have the first and the last entries in the permutation to also sum up to a perfect square. Is it at all possible?

fine-sliced Poisson [a.k.a. sashimi]

Posted in Books, Kids, pictures, R, Running, Statistics, University life on March 20, 2014 by xi'an

As my student Kévin Guimard had not mailed me his own slice sampler for the Poisson distribution, I could not tell why his code was not working! My earlier post prompted him to do so, and a somewhat optimised version is given below:

```
nsim = 10^4
lambda = 6

max.factorial = function(x,u){
  #largest k with u*k!/x! <= 1 (upper bound of the slice)
  k = x
  parf = 1
  while (parf*u<1){
    k = k + 1
    parf = parf * k
  }
  k = k - (parf*u>1)
  return (k)
}

x = rep(floor(lambda), nsim)
for (t in 2:nsim){
  v1 = ceiling((log(runif(1))/log(lambda))+x[t-1])
  ranj = max(0,v1):max.factorial(x[t-1],runif(1))
  x[t] = sample(ranj,size=1)
}
barplot(as.vector(rbind(
  table(x)/length(x),dpois(min(x):max(x),lambda))),
  col=c("sienna","gold"))
```

As you can easily check by running the code, it does not work. My student actually came top of my MCMC class, and he spent quite a while pondering why the code was failing. I pondered as well for part of a morning in Warwick, removing causes for exponential or factorial overflows (hence the shape of the code), but without eliciting the issue… (This now sounds like lethal fugu sashimi!) Before reading any further, can you spot the problem?! The corrected R code is as follows:

```
x = rep(lambda, nsim)
for (t in 2:nsim){
  v1 = ceiling((log(runif(1))/log(lambda))+x[t-1])
  ranj = max(0,v1):max.factorial(x[t-1],runif(1))
  if (length(ranj)>1){
    x[t] = sample(ranj, size = 1)
  }else{
    x[t] = ranj}
}
```

The culprit is thus the R function sample which simply does not understand Dirac masses and the basics of probability! When running

```
> sample(150:150,1)
[1] 23
```

you can clearly see where the problem stands…! This is a well-documented issue with sample that has already caused me woes. Another interesting thing about this slice sampler is that it is awfully slow at exploring the tails, and at converging back to the centre from the tails. This is not very pronounced with a mean of 6, but moving the mean to 50 makes it much more apparent, due to the poor mixing of the chain: the raw sequence of simulations strives to achieve a single cycle out of 10⁵ iterations! In any case, thanks to Kévin for an interesting morning!

Random graphs with fixed numbers of neighbours

Posted in Books, R, Statistics on November 25, 2010 by xi'an

In connection with Le Monde puzzle #46, I eventually managed to write an R program that generates graphs with a given number n of nodes and a given number k of edges leaving each of those nodes. (My early attempt was simply too myopic to achieve any level of success when n was larger than 10!) Here is the core of the R code: