why does rbinom(1,1) differ from sample(0:1,1) with the same seed?

Posted in Statistics with tags , , , , , , , , , on February 17, 2021 by xi'an
> set.seed(1)
> rbinom(10,1,0.5)
[1] 0 0 1 1 0 1 1 1 1 0
> set.seed(1)
> sample(c(0,1), 10, replace = TRUE)
[1] 0 1 0 1 0 0 1 0 1 1

This rather legitimate question was posted on X validated last week, the answer being that the C codes behind both functions do not use pseudo-random generators in the same manner. For instance, rbinom does get involved beyond a mean value of 30 (and otherwise resorts to the inverse cdf approach). And following worries about sample biases, sample was updated in 2019 (and also seems to resort to the inverse cdf when the mean is less than 200). However, when running the above code on my machine, still using the 2018 R version 3.4.4!, I recover the same outcome:

> set.seed(1)
> rbinom(10,1,0.5)
[1] 0 0 1 1 0 1 1 1 1 0

> set.seed(1)
> sample(c(0,1), 10, replace = TRUE)
[1] 0 0 1 1 0 1 1 1 1 0> set.seed(1)
> qbinom(runif(10),1,0.5)
[1] 0 0 1 1 0 1 1 1 1 0
> set.seed(1)
> 1*(runif(10)>.5)
[1] 0 0 1 1 0 1 1 1 1 0


inverse Gaussian trick [or treat?]

Posted in Books, Kids, R, Statistics, University life with tags , , , , , , , , , , , , , , on October 29, 2020 by xi'an

When preparing my mid-term exam for my undergrad mathematical statistics course, I wanted to use the inverse Gaussian distribution IG(μ,λ) as an example of exponential family and include a random generator question. As shown above by a Fortran computer code from Michael, Schucany and Haas, a simple version can be based on simulating a χ²(1) variate and solving in x the following second degree polynomial equation

$\dfrac{\lambda(x-\mu)^2}{\mu^2 x} = v$

since the left-hand side transform is distributed as a χ²(1) random variable. The smallest root x¹, less than μ, is then chosen with probability μ/(μ+x¹) and the largest one, x²=μ²/x¹ with probability x¹/(μ+x¹). A relatively easy question then, except when one considers asking for the proof of the χ²(1) result, which proved itself to be a harder cookie than expected! The paper usually referred to for the result, Schuster (1968), is quite cryptic on the matter, essentially stating that the above can be expressed as the (bijective) transform of Y=min(X,μ²/X) and that V~χ²(1) follows immediately. I eventually worked out a proof by the “law of the unconscious statistician” [a name I do not find particularly amusing!], but did not include the question in the exam. But I found it fairly interesting that the inverse Gaussian can be generating by “inverting” the above equation, i.e. going from a (squared) Gaussian variate V to the inverse Gaussian variate X. (Even though the name stems from the two cumulant generating functions being inverses of one another.)

not a Bernoulli factory

Posted in Books, Kids, pictures, R with tags , , , , , , , on May 20, 2020 by xi'an

A Riddler riddle I possibly misunderstood:

Four isolated persons are given four fair coins, which can be either flipped once or returned without being flipped. If all flipped coins come up heads, the team wins! Else, if any comes up tails, or if no flip at all is done, it looses. Each person is further given an independent U(0,1) realisation. What is the best strategy?

Since the players are separated, I would presume the same procedure is used by all. Meaning that a coin is tossed with probability p, ie if the uniform is less than p, and untouched otherwise. The probability of winning is then

4(1-p)³p½+6(1-p)³p½²+4(1-p)p³½³+p⁴½⁴

which is maximum for p=0.3420391, with a winning probability of 0.2848424.

solve $$x⌊x⌊x⌊x⌋⌋⌋=2020$$

Where the integral part is the integer immediately below x. Puzzle that I first fail solving by brute force, because I did not look at negative x’s… Since the fourth root of 2020 is between 6 and 7, the solution is either x=6+ε or x=-7+ε, with ε in (0,1). The puzzle then becomes either

$$(6+ε)⌊(6+ε)⌊(6+ε)⌊6+ε⌋⌋⌋ = (6+ε)⌊(6+ε)⌊36+6ε⌋⌋ = (6+ε)⌊(6+ε)(36+⌊6ε⌋)⌋ = 2020$$

where there are 6 possible integer values for $$⌊6ε⌋, with only ⌊6ε⌋=5 being possible, turning the equation into$$

$$(6+ε)⌊41(6+ε)⌋ = (6+ε)(246+⌊41ε⌋) = 2020$$

where again only $$⌊42ε⌋$$=40 being possible, ending up with

$$1716+286ε = 2020$$

which has no solution in (0,1). In the second case

(-7+$$ε$$)$$⌊(-7+ε)⌊(-7+ε)⌊-7+ε⌋⌋⌋$$ = (-7+$$ε$$)$$⌊(-7+ε)(49+⌊-7ε⌋)⌋ = 2020$$

shows that only $$⌊-7ε⌋$$=-3 is possible, leading to

(-7+$$ε$$)$$⌊46(-7+ε))⌋$$ = (-7+$$ε$$) ($$-322+⌊46ε⌋)=2020$$

with only $$⌊46ε⌋$$=17 possible, hence

2135-305$$ε$$=2020

and

$$ε$$=115/305.

A brute force simulated annealing resolution returns x=-6.622706 after 10⁸ iterations. A more interesting question is to figure out the discontinuity points of the function

$$ℵ(x) = x$$$$⌊x⌊x⌊x⌋⌋⌋$$

as they seem to be numerous:

For instance, only 854 of the first 2020 integers enjoy a solution to $$ℵ(x)$$=n.

Markov chain quasi-Monte Carlo

Posted in Statistics with tags , , , , on April 29, 2020 by xi'an

“It is known that Tausworthe generators can be viewed as polynomial Korobov lattice point sets with a denominator polynomial p(x) and a numerator polynomial q(x) over IF2

A recently arXived paper by Shin Harase, “A table of short-period Tausworthe generators for Markov chain quasi-Monte Carlo”, discusses the use of [quasi-Monte Carlo] Tausworthe generators rather than iid uniform sampling. As shown by Owen and Tribble, it is indeed legit to replace a sequence of iid (pseudo-random) uniforms with its quasi-Monte Carlo (qMC) version if the sequence keeps a sufficient degree of uniformity. The current paper optimises the parameters of the Tausworthe generators in terms of the t-value of the generator, an indicator of the uniform occupancy of the qMC sequence.

For a range of values of m, if 2m-1 is the period of the pseudo-random generator, the author obtains the optimal weights in the Tausworthe generator, which is a linear feedback shift register generator over {0,1}, ie shifting all the bits of the current uniform realisation by linear combination modulo 2. The comparison with other qMC and MC is provided on a Gibbs sampler for a bidimensional Gaussian target, which presents the advantage of requiring exactly one uniform per simulation and the disadvantage of … requiring exactly one uniform per simulation! Since this is harder to envision for simulation methods requiring a random number of uniforms.

Regarding the complexity of the approach, I do not see any gap between using these Tausworthe generators and something like the Mersenne generator. I just wonder at the choice of m, that is, whether or not it makes sense to pick any value lower than 2³² for the period.

really random generators [again!]

Posted in Books, Statistics with tags , , , , , , , , , on March 2, 2020 by xi'an

A pointer sent me to Chemistry World and an article therein about “really random numbers“. Or “truly” random numbers. Or “exactly” random numbers. Not particularly different from the (in)famous lava lamp generator!

“Cronin’s team has developed a robot that can automatically grow crystals in a 10 by 10 array of vials, take photographs of them, and use measurements of their size, orientation, and colour to generate strings of random numbers. The researchers analysed the numbers generated from crystals grown in three solutions – including a solution of copper sulfate – and found that they all passed statistical tests for the quality of their randomness.” Chemistry World, 18 February 2020

The validation of this truly random generator is thus exactly the same as a (“bad”) pseudo-random generator, namely that in the law of large number sense, it fits the predicted behaviour. And thus the difference between them cannot be statistical, but rather cryptographic:

“…we considered the encryption capability of this random number generator versus that of a frequently used pseudorandom number generator, the Mersenne Twister.” Lee et al., Matter, February 10, 2020

Meaning that the knowledge of the starting point and of the deterministic transform for the Mersenne Twister makes it feasible to decipher, which is not the case for a physical and non-reproducible generator as the one advocated. One unclear aspect of the proposed generator is the time required to produce 10⁶, even though the authors mention that “the bit-generation rate is significantly lower than that in other methods”.