Archive for R

on approximations of Φ and Φ⁻¹

Posted in Books, Kids, R, Statistics on June 3, 2021 by xi'an

As I was working on a research project with graduate students, I became interested in fast and not necessarily very accurate approximations to the normal cdf Φ and its inverse. I read through this 2010 paper of Richards et al., which uses for instance Polya’s

F_0(x) =\frac{1}{2}(1+\sqrt{1-\exp(-2x^2/\pi)})

(with another version replacing 2/π with the square root of π/8) and a logistic approximation, not to mention a rational fraction. All of which are more efficient (in R), if barely, than the resident pnorm() function.

      test replications elapsed relative user.self 
3 logistic       100000   0.410    1.000     0.410 
2    polya       100000   0.411    1.002     0.411 
1 resident       100000   0.455    1.110     0.455 
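
To give an idea of the accuracy traded for this small gain in speed, here is a minimal sketch comparing Polya's approximation with pnorm() on a grid. The function name `polya` and the extension to negative x by symmetry are assumptions of this illustration, not taken from the paper.

```r
# Polya's approximation to the normal cdf, extended to x < 0 by symmetry
# (the name `polya` and the symmetric extension are illustrative assumptions)
polya <- function(x) 0.5 * (1 + sign(x) * sqrt(1 - exp(-2 * x^2 / pi)))

x <- seq(-4, 4, by = 0.01)
# largest absolute error against the resident pnorm() on the grid
max_err <- max(abs(polya(x) - pnorm(x)))
max_err
```

On this grid the largest error should be of the order of a few thousandths, which fits the "not necessarily very accurate" brief.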

For the inverse cdf, the approximations there involve numerical inversion, except for

F_0^{-1}(p) =(-\pi/2 \log[1-(2p-1)^2])^{\frac{1}{2}}

which proves slightly faster than qnorm()

       test replications elapsed relative user.self 
2 inv-polya       100000   0.401    1.000     0.401
1  resident       100000   0.450    1.122     0.450
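
The closed-form inverse can be checked against qnorm() in the same spirit. This is only a sketch: the name `inv_polya` and the extension to p < 1/2 by symmetry are assumptions made here for illustration.

```r
# inverse of Polya's approximation, extended to p < 1/2 by symmetry
# (the name `inv_polya` and the symmetric extension are illustrative assumptions)
inv_polya <- function(p) sign(p - 0.5) * sqrt(-pi / 2 * log(1 - (2 * p - 1)^2))

p <- seq(0.01, 0.99, by = 0.01)
# largest absolute gap to the resident qnorm() on the grid
max_gap <- max(abs(inv_polya(p) - qnorm(p)))
max_gap
```

The gap widens towards the tails, so the range of p matters when deciding whether the approximation is good enough.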

scale matters [maths as well]

Posted in pictures, R, Statistics on June 2, 2021 by xi'an

A question from X validated on why an independent Metropolis sampler of a three component Normal mixture based on a single Normal proposal was failing to recover the said mixture…

When looking at the OP’s R code, I did not notice anything amiss at first glance (I was about to drive back from Annecy, hence did not look too closely) and reran the attached code with a larger variance in the proposal, which returned the above picture for the MCMC sample, close enough (?) to the target. Later, from home, I checked the code further and noticed that the Metropolis ratio was only using the ratio of the targets. Dividing by the ratio of the proposals made a significant (?) difference to the representation of the target.

More interestingly, the OP was fundamentally confused between independent and random-walk Rosenbluth algorithms, from using the wrong ratio to aiming at the wrong scale factor and average acceptance ratio, and furthermore challenged by the very notion of Hessian matrix, which is often suggested as a default scale.
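
As a minimal sketch of the independent version with the complete ratio, target ratio divided by proposal ratio, consider the code below. The mixture weights, means and standard deviations, as well as the proposal scale, are hypothetical values chosen for illustration and are not those of the OP.

```r
set.seed(1)
# hypothetical three-component Normal mixture (not the OP's values)
w <- c(.3, .4, .3); mu <- c(-3, 0, 3); sig <- c(1, 1, 1)
target <- function(x) sum(w * dnorm(x, mu, sig))
# single Normal proposal, with a scale wide enough to cover all three modes
prop_mean <- 0; prop_sd <- 4
niter <- 1e4
x <- numeric(niter)
x[1] <- rnorm(1, prop_mean, prop_sd)
for (t in 2:niter) {
  y <- rnorm(1, prop_mean, prop_sd)   # proposal does not depend on x[t - 1]
  # independent Metropolis ratio: target(y) q(x) / (target(x) q(y))
  ratio <- target(y) * dnorm(x[t - 1], prop_mean, prop_sd) /
    (target(x[t - 1]) * dnorm(y, prop_mean, prop_sd))
  x[t] <- if (runif(1) < ratio) y else x[t - 1]
}
c(mean = mean(x), sd = sd(x))   # exact values for this mixture: 0 and sqrt(6.4)
```

Dropping the two dnorm() proposal terms from the ratio turns this into a sampler with a different stationary distribution, which is the kind of discrepancy described above.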

bean bag win

Posted in Books, Kids, pictures, R on May 19, 2021 by xi'an

A quick riddle from The Riddler, where a multi-step game sees a probability of a 3 point increase of .4 and a probability of a 1 point increase of .3 with a first strategy (A), versus a probability of a 3 point increase of .1 and a probability of a 1 point increase of .8 with a second strategy (B), and a sure miss third strategy (C). The goal is to optimise the probability of hitting exactly 3 points after 4 steps.

The optimal strategy is to follow A while the score is zero, C when the score is 3, and B otherwise. The corresponding winning probability is 0.8548, as checked by the following code
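
One way to run such a check is a backward recursion over the four throws. This is a minimal sketch rather than the original code. It assumes that strategy B scores 3 points with probability .1 and 1 point with probability .8 (values under which the recursion returns 0.8548) and that overshooting 3 points loses the game.

```r
# rows: strategies; columns: probabilities of gaining 3, 1 and 0 points
# (B's values and the overshoot rule are assumptions of this sketch)
strat <- rbind(A = c(.4, .3, .3), B = c(.1, .8, .1), C = c(0, 0, 1))
gain <- c(3, 1, 0)
nsteps <- 4
win <- c(0, 0, 0, 1)   # winning probability with no throw left, scores 0..3
best <- matrix("", 4, nsteps,
               dimnames = list(score = 0:3, throws_left = 1:nsteps))
for (k in 1:nsteps) {
  new <- numeric(4)
  for (s in 0:3) {
    nxt <- s + gain
    # overshooting 3 is a loss; otherwise continue from the new score
    cont <- ifelse(nxt <= 3, win[pmin(nxt, 3) + 1], 0)
    vals <- drop(strat %*% cont)          # value of A, B, C at this state
    new[s + 1] <- max(vals)
    best[s + 1, k] <- names(which.max(vals))
  }
  win <- new
}
win[1]           # winning probability from score 0 with 4 throws left
best[, nsteps]   # best strategy by current score with 4 throws left
```

Ties (for instance when no strategy can win any more) are broken in favour of the first strategy listed, which has no impact on the winning probability.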


unbalanced sampling

Posted in pictures, R, Statistics on May 17, 2021 by xi'an

A question from X validated on sampling from an unknown density f when given both a sample from the density f restricted to a (known) interval A, say, and a sample from f restricted to the complement of A. Or at least on producing an estimate of the mass of A under f, p(A).

The problem sounds impossible to solve without an ability to compute the density value at a given point, since any convex combination αf¹+(1-α)f² of the two restricted densities would return the same two samples. Assuming continuity of the density f at the boundary point a between A and its complement, a desperate solution for p(A)/(1-p(A)) is to take the ratio of the density estimates at the value a, which turns out not so poor an approximation if seemingly biased. This was surprising to me as kernel density estimates are notoriously bad at boundary points.
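
The boundary ratio can be sketched as follows. Since the restriction to A has density f/p(A) on A and the restriction to the complement has density f/(1-p(A)), the ratio of the complement estimate to the A estimate at a approximates p(A)/(1-p(A)). The standard Normal target, the choice A=(-∞,0.5], the sample sizes and the helper `kde_at` are all assumptions made for this illustration.

```r
set.seed(42)
a <- 0.5         # boundary between A = (-Inf, a] and its complement
n <- 1e4
pA <- pnorm(a)   # true mass, used here only to simulate and to compare
# inverse-cdf simulation from the two restrictions of f = N(0,1)
xA <- qnorm(runif(n, 0, pA))   # sample from f restricted to A
xB <- qnorm(runif(n, pA, 1))   # sample from f restricted to the complement
# Gaussian kernel density estimate at a single point x0 (hypothetical helper)
kde_at <- function(x0, x, h = bw.nrd0(x)) mean(dnorm((x0 - x) / h)) / h
# f_A(a) = f(a)/p(A) and f_B(a) = f(a)/(1 - p(A)), hence the odds of A
odds <- kde_at(a, xB) / kde_at(a, xA)
p_hat <- odds / (1 + odds)
c(estimate = p_hat, truth = pA)
```

A plausible reason why the ratio behaves better than each estimate does: both kernel estimates lose roughly half of their mass at the boundary, so that the leading part of the bias cancels in the ratio.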

If f(x) can be computed [up to a constant] at an arbitrary x, it is obviously feasible to simulate from f and approximate p(A). But the problem is then moot as a resolution would not even need the initial samples. If those samples are nonetheless exploited to construct a single kernel density estimate, this estimate can be used as a proposal in an MCMC algorithm. Surprisingly (?), using instead the empirical cdf as proposal does not work.
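
A minimal sketch of that MCMC route, assuming a standard Normal target known up to a constant, A=(-∞,0.5] and 500 points in each of the two samples (all hypothetical choices), with an independent Metropolis step whose proposal is the pooled Gaussian kernel density estimate:

```r
set.seed(7)
a <- 0.5; n <- 500
pA <- pnorm(a)   # true mass, for simulation and comparison only
# pooled, unbalanced samples from f restricted to A and to its complement
dat <- c(qnorm(runif(n, 0, pA)), qnorm(runif(n, pA, 1)))
h <- bw.nrd0(dat)
f <- function(x) exp(-x^2 / 2)                   # target up to a constant
q <- function(x) mean(dnorm(x, dat, h))          # pooled KDE density
rq <- function() sample(dat, 1) + h * rnorm(1)   # draw from the pooled KDE
niter <- 5000
x <- numeric(niter)
x[1] <- rq()
for (t in 2:niter) {
  y <- rq()
  # independent Metropolis ratio with the KDE as proposal
  ratio <- f(y) * q(x[t - 1]) / (f(x[t - 1]) * q(y))
  x[t] <- if (runif(1) < ratio) y else x[t - 1]
}
p_hat <- mean(x <= a)
c(estimate = p_hat, truth = pA)
```

The Gaussian kernel has lighter tails than the target beyond the range of the data, so this is only meant as an illustration on a short run, not as a proposal with guaranteed good behaviour.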

one-way random walks

Posted in Kids, R, Statistics on May 2, 2021 by xi'an

A rather puzzling riddle from The Riddler on a 3×3 directed grid and the probability of getting from the North-West to the South-East nodes following the arrows. Puzzling because while the solution could be reasonably computed with an R code like

for(i in 1:2^12){
  for(j in 1:12)sol=max(sol,

where paz is the list of the 12 possible paths from North-West to South-East (excluding loops!), leading to a probability of 1135/2¹², I could not find a logical reasoning to reach this number. The paths of length 4, 6, 8 are valid in 2⁸, 2⁶, 2⁴ of the cases, respectively and logically!, but this does not help as they are dependent.
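
The count can also be checked without listing the paths, by enumerating the 2¹² orientations of the 12 edges and testing reachability with a breadth-first search. This is a different technique from the path-based loop above, offered as a self-contained sketch; the node numbering (1 to 9, row by row) and the helper `reachable` are assumptions of this illustration.

```r
# undirected edges of the 3x3 grid, nodes numbered 1..9 row by row
edges <- rbind(
  cbind(c(1, 2, 4, 5, 7, 8), c(2, 3, 5, 6, 8, 9)),   # horizontal edges
  cbind(c(1, 2, 3, 4, 5, 6), c(4, 5, 6, 7, 8, 9)))   # vertical edges

# is node 9 (South-East) reachable from node 1 (North-West)
# when the directed edges are from[k] -> to[k]?
reachable <- function(from, to) {
  seen <- rep(FALSE, 9)
  seen[1] <- TRUE
  repeat {
    new <- to[seen[from] & !seen[to]]   # nodes one arrow away from the seen set
    if (length(new) == 0) break
    seen[new] <- TRUE
  }
  seen[9]
}

count <- 0
for (i in 0:(2^12 - 1)) {
  flip <- (i %/% 2^(0:11)) %% 2 == 1    # binary digits of i: which edges to reverse
  from <- ifelse(flip, edges[, 2], edges[, 1])
  to   <- ifelse(flip, edges[, 1], edges[, 2])
  count <- count + reachable(from, to)
}
c(count = count, probability = count / 2^12)
```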