Archive for the R Category

an express riddle

Posted in Books, Kids, R with tags , , on January 20, 2017 by xi'an

A quick puzzle on The Riddler this week that enjoys a quick solution once one writes it out. The core of the puzzle is about finding the average number of draws one need to empty a population of size T if each draw is uniform over the remaining number of individuals between one and the number that remain. It is indeed easy to see that this average satisfies

\epsilon^T=1+\frac{1}{T}\sum_{i=1}^{T-1} \epsilon^i

since all draws but one require an extra draw. A recursion then leads by elimination to deduce that

\epsilon^T=\frac{1}{T}+\frac{1}{T-1}+\ldots+\frac{1}{2}+1

which is the beginning of the (divergent) harmonic series. In the case T=30, the solution is (almost) equal to 4.

> sum(1/(1:30))*1e10
[1] 39949871309

A second riddle the same week reminded me of a result in Devroye’s Non-Uniform Random Variate Generation, namely to find the average number of draws from a Uniform until the sequence goes down. Actually, the real riddle operates with a finite support Uniform, but I find the solution with the continuous Uniform more elegant. And it only took a few metro stops to solve. The solution goes as follows: the probability to stop after two Uniform draws is 1/2, after n uniform draws, it is (n-1)/n!, which does sum up to 1:

\sum_{n=2}^\infty \frac{n-1}{n!} = \sum_{n=2}^\infty \frac{n}{n!} - \sum_{n=2}^\infty \frac{1}{n!} = \sum_{n=1}^\infty \frac{1}{n!} - \sum_{n=2}^\infty \frac{1}{n!}=1

and the expectation of this distribution is e-1 by a very similar argument, as can be checked by a rudimentary Monte Carlo experiment

> over(1e7) #my implementation of the puzzle
[1] 1.7185152

weakly informative reparameterisations for location-scale mixtures

Posted in Books, pictures, R, Statistics, University life with tags , , , , , , on January 19, 2017 by xi'an

fitted_density_galaxy_data_500itersWe have been working towards a revision of our reparameterisation paper for quite a while now and too advantage of Kate Lee visiting Paris this fortnight to make a final round: we have now arXived (and submitted) the new version. The major change against the earlier version is the extension of the approach to a large class of models that include infinitely divisible distributions, compound Gaussian, Poisson, and exponential distributions, and completely monotonic densities. The concept remains identical: change the parameterisation of a mixture from a component-wise decomposition to a construct made of the first moment(s) of the distribution and of component-wise objects constrained by the moment equation(s). There is of course a bijection between both parameterisations, but the constraints appearing in the latter produce compact parameter spaces for which (different) uniform priors can be proposed. While the resulting posteriors are no longer conjugate, even conditional on the latent variables, standard Metropolis algorithms can be implemented to produce Monte Carlo approximations of these posteriors.

truncated normal algorithms

Posted in Books, pictures, R, Statistics, University life with tags , , , , on January 4, 2017 by xi'an

Nicolas Chopin (CREST) just posted an entry on Statisfaction about the comparison of truncated Normal algorithms run by Alan Rogers, from the University of Utah. Nicolas wrote a paper in Statistics and Computing about a simulation method, which proposes a Ziggurat type of algorithm for this purpose, and which I do not remember reading, thanks to my diminishing memory buffer!  As shown in the picture below, when truncating to the half-line (a,∞), this method improves upon my accept-reject approach except in the far tails.

truncanormOn the top graph, made by Alan Rogers, my uniform proposal (r) seems to be doing better for a Normal truncated to (a,b) when b<0, or when a gets large and close to b. Nicolas’ ziggurat (c) works better than the Gaussian accept-reject method (c) on the positive part. (I wonder what the exponential proposal (e) stands for, in terms of scale parameter.)

a Galton-Watson riddle

Posted in R, Travel with tags , , , on December 30, 2016 by xi'an

riddleThe Riddler of this week has an extinction riddle which summarises as follows:

One observes a population of N individuals, each with a probability of 10⁻⁴ to kill the observer each day. From one day to the next, the population decreases by one individual with probability

K√N 10⁻⁴

What is the value of K that leaves the observer alive with probability ½?

Given the sequence of population sizes N,N¹,N²,…, the probability to remain alive is

(1-10^{-4})^{N+N^1+\ldots}

where the sum stops with the (sure) extinction of the population. Which is the moment generating function of the sum. At x=1-10⁻⁴. Hence the problem relates to a Galton-Watson extinction problem. However, given the nature of the extinction process I do not see a way to determine the distribution of the sum, except by simulation. Which returns K=27 for the specific value of N=9.

N=9
K=3*N
M=10^4
vals=rep(0,M)
targ=0
ite=1
while (abs(targ-.5)>.01){

for (t in 1:M){
  gen=vals[t]=N
  while (gen>0){
   gen=gen-(runif(1)<K*sqrt(gen)/10^4)
   vals[t]=vals[t]+gen}
  }
  targ=mean(exp(vals*log(.9999)))
print(c(as.integer(ite),K,targ))
  if (targ<.5){ K=K*ite/(1+ite)}else{
     K=K/(ite/(1+ite))}
  ite=ite+1}

The solution proposed on The Riddler is more elegant in that the fixed point equation is

\prod_{J=1}^9 \frac{K \cdot \sqrt{J}}{K \cdot \sqrt{J} + J}=\frac{1}{2}

with a solution around K=27.

puzzled by harmony [not!]

Posted in Books, Kids, Mountains, pictures, R, Running, Statistics, Travel with tags , , , , , on December 13, 2016 by xi'an

In answering yet another question on X validated about the numerical approximation of the marginal likelihood, I suggested using an harmonic mean estimate as a simple but worthless solution based on an MCMC posterior sample. This was on a toy example with a uniform prior on (0,π) and a “likelihood” equal to sin(θ) [really a toy problem!]. Simulating an MCMC chain by a random walk Metropolis-Hastings algorithm is straightforward, as is returning the harmonic mean of the sin(θ)’s.

f <- function(x){
    if ((0<x)&(x<pi)){
        return(sin(x))}else{
        return(0)}}

n = 2000 #number of iterations
sigma = 0.5
x = runif(1,0,pi) #initial x value
chain = fx = f(x)   
#generates an array of random x values from norm distribution
rands = rnorm(n,0, sigma) 
#Metropolis - Hastings algorithm
for (i in 2:n){
    can = x + rands[i]  #candidate for jump
    fcan=f(can)
    aprob = fcan/fx #acceptance probability
    if (runif(1) < aprob){
        x = can
        fx = fcan}
    chain=c(chain,fx)}
I = pi*length(chain)/sum(1/chain) #integral harmonic approximation

However, the outcome looks remarkably stable and close to the expected value 2/π, despite 1/sin(θ) having an infinite integral on (0,π). Meaning that the average of the 1/sin(θ)’s has no variance. Hence I wonder why this specific example does not lead to an unreliable output… But re-running the chain with a smaller scale σ starts producing values of sin(θ) regularly closer to zero, which leads to an estimate of I both farther away from 2 and much more variable. No miracle, in the end!

ratio-of-uniforms [-1]

Posted in Books, pictures, R, Statistics, University life with tags , , , on December 12, 2016 by xi'an

Luca Martino pointed out to me my own and forgotten review of a 2012 paper of his, “On the Generalized Ratio of Uniforms as a Combination of Transformed Rejection and Extended Inverse of Density Sampling” that obviously discusses a generalised version of Kinderman and Monahan’s (1977) ratio-of-uniform method. And further points out the earlier 1991 paper by Jon Wakefield, Alan Gelfand and Adrian Smith that contains the general form I rediscovered a few posts ago… Called the GRoU in Martino et al.. While the generalisation in the massive arXiv document is in finding Φ such that the above region is bounded and can be explored by uniform sampling over a box.

Neither reference mentions using the cdf transform, though, which does guarantee a bounded ratio-of-uniform set in u. (An apparent contradiction with Martino et al.  statement (34), unless I am confused. Maybe due to using Φ⁻¹ instead of Φ?) But I still wonder at the usefulness of my derivations those past weeks!

flea circus

Posted in Books, Kids, pictures, R, Statistics with tags , , , , , , , , , , , on December 8, 2016 by xi'an

gribAn old riddle found on X validated asking for Monte Carlo resolution  but originally given on Project Euler:

A 30×30 grid of squares contains 30² fleas, initially one flea per square. When a bell is rung, each flea jumps to an adjacent square at random. What is the expected number of unoccupied squares after 50 bell rings, up to six decimal places?

The debate on X validated is whether or not a Monte Carlo resolution is feasible. Up to six decimals, certainly not. But with some lower precision, certainly. Here is a rather basic R code where the 50 steps are operated on the 900 squares, rather than the 900 fleas. This saves some time by avoiding empty squares.

xprmt=function(n=10,T=50){

 mean=0
 for (t in 1:n){

   board=rep(1,900)
   for (v in 1:T){

    beard=rep(0,900)
    if (board[1]>0){
      poz=c(0,1,0,30)
      ne=rmultinom(1,board[1],prob=(poz!=0))
      beard[1+poz]=beard[1+poz]+ne}
    #
    for (i in (2:899)[board[-1][-899]>0]){
     u=(i-1)%%30+1;v=(i-1)%/%30+1
     poz=c(-(u>1),(u<30),-30*(v>1),30*(v<30))      
     ne=rmultinom(1,board[i],prob=(poz!=0))      
     beard[i+poz]=beard[i+poz]+ne} 
    #     
    if (board[900]>0){
     poz=c(-1,0,-30,0)
     ne=rmultinom(1,board[900],prob=(poz!=0))
     beard[900+poz]=beard[900+poz]+ne}
     board=beard}
   mean=mean+sum(board==0)}
 return(mean/n)}

The function returns an empirical average over n replications. With a presumably awkward approach to the borderline squares, since it involves adding zeros to keep the structure the same… Nonetheless, it produces an approximation that is rather close to the approximate expected value, in about 3mn on my laptop.

> exprmt(n=1e3)
[1] 331.082
> 900/exp(1)
[1] 331.0915

Further gains follow from considering only half of the squares, as there are two independent processes acting in parallel. I looked at an alternative and much faster approach using the stationary distribution, with the stationary being the Multinomial (450,(2/1740,3/1740…,4/1740,…,2/1740)) with probabilities proportional to 2 in the corner, 3 on the sides, and 4 in the inside. (The process, strictly speaking, has no stationary distribution, since it is periodic. But one can consider instead the subprocess indexed by even times.) This seems to be the case, though, when looking at the occupancy frequencies, after defining the stationary as:

inva=function(B=30){
return(c(2,rep(3,B-2),2,rep(c(3,
  rep(4,B-2),3),B-2),2,rep(3,B-2),2))}

namely

> mn=0;n=1e8 #14 clock hours!
> proz=rep(c(rep(c(0,1),15),rep(c(1,0),15)),15)*inva(30)
> for (t in 1:n)
+ mn=mn+table(rmultinom(1,450,prob=rep(1,450)))[1:4]
> mn=mn/n
> mn[1]=mn[1]-450
> mn
     0      1      2     3
166.11 164.92  82.56 27.71
> exprmt(n=1e6) #55 clock hours!!
[1] 165.36 165.69 82.92 27.57

my original confusion being that the Poisson approximation had not yet taken over… (Of course, computing the first frequency for the stationary distribution does not require any simulation, since it is the sum of the complement probabilities to the power 450, i.e., 166.1069.)