Archive for FiveThirtyEight

Bernoulli factory in the Riddler

Posted in Books, Kids, R, Statistics with tags , , , , , , , , , , on December 1, 2020 by xi'an

“Mathematician John von Neumann is credited with figuring out how to take a p biased coin and “simulate” a fair coin. Simply flip the coin twice. If it comes up heads both times or tails both times, then flip it twice again. Eventually, you’ll get two different flips — either a heads and then a tails, or a tails and then a heads, with each of these two cases equally likely. Once you get two different flips, you can call the second of those flips the outcome of your “simulation.” For any value of p between zero and one, this procedure will always return heads half the time and tails half the time. This is pretty remarkable! But there’s a downside to von Neumann’s approach — you don’t know how long the simulation will last.” The Riddler

The associated riddle (first one of the post-T era!) is to figure out what are the values of p for which an algorithm can be derived for simulating a fair coin in at most three flips. In one single flip, p=½ sounds like the unique solution. For two flips, p²,(1-p)^2,2p(1-p)=½ work, but so do p+(1-p)p,(1-p)+p(1-p)=½, and the number of cases grows for three flips at most. However, since we can have 2³=8 different sequences, there are 2⁸ ways to aggregate these events and thus at most 2⁸ resulting probabilities (including 0 and 1). Running a quick R code and checking for proximity to ½ of any of these sums leads to

[1] 0.2062997 0.7937005 #p^3
[1] 0.2113249 0.7886753 #p^3+(1-p)^3
[1] 0.2281555 0.7718448 #p^3+p(1-p)^2
[1] 0.2372862 0.7627143 #p^3+(1-p)^3+p(1-p)^2
[1] 0.2653019 0.7346988 #p^3+2p(1-p)^2
[1] 0.2928933 0.7071078 #p^2
[1] 0.3154489 0.6845518 #p^3+2p^2(1-p)
[1] 0.352201  0.6477993 #p^3+p(1-p)^2+p^2(1-p)
[1] 0.4030316 0.5969686 #p^3+p(1-p)^2+3(1-p)p^2
[1] 0.5

which correspond to 1-p³=½, p³+(1-p)³=½,(1-p)³+(1-p)p²=½,p³+(1-p)³+p²(1-p),(1-p)³+2(1-p)p²=½,1-p²=½, p³+(1-p)³+p²(1-p)=½,(1-p)³+p(1-p)²+p²(1-p)=½,(1-p)³+p²(1-p)+3p(1-p)²=½,p³+p(1-p)²+3(p²(1-p)=½,p³+2p(1-p)²+3(1-p)p²=½,p=½, (plus the symmetric ones), leading to 19 different values of p producing a “fair coin”. Missing any other combination?! Another way to look at the problem is to find all roots of the 2^{2^n} equations

a_0p^n+a_1p^{n-1}(1-p)+\cdots+a_{n-1}p(1-p)^{n-1}+a_n(1-p)^n=1/2\quad\text{where}\quad 0\le a_i\le{n \choose i}

(None of these solutions is rational, by the way, except p=½.) I also tried this route with a slightly longer R code, calling polyroot, and finding the same 19 roots for three flips, [at least] 271 for four, and [at least] 8641 for five (The Riddler says 8635!). With an imprecision due to numerical rounding by polyroot. (Since the coefficients of the above are not directly providing those of the polynomial, I went through an alternate representation as a polynomial in (1-p)/p, with a straightforward derivation of the coefficients.)

sampling w/o replacement except when replacing

Posted in Books, Kids, R with tags , , , , , , , on November 3, 2020 by xi'an

Another Riddle(r), considering a box with M myrtle balls and D dandelion balls. Drawing balls without replacement while they stay of the same color as the initial draw, else put back the last ball and repeat the process until all balls are drawn. The funny thing is that, unless M=0 or D=0, the probability to draw a myrtle ball at the end is always ½..! This can be easily checked by simulation (when M=2 and D=8)

r=function()sample(0:1,1,p=c(d,m))
for(t in 1:1e6){
  m=2;d=8
  i=r();m=m-!!i;d=d-!i
  while(!!m*d){
    j=r();i=ifelse(i==j,j,r())
    m=m-!!i;d=d-!i}
  F=F+(m>0)}
F/1e6

Now the proof that the probability is ½ is quite straightforward, for M=1 (or D=1). But I cannot find a quick fix for larger values. I thus reasoned by recursion, with the probability of emptying a given colour first is d!m!/(d+m)!, whatever the colour and whatever d>0,m>0. Hence half a chance to finish with myrtle. Any shorter sequence of a given colour reduces the value of either d or m, at which point we are using the recursion assumption that the probability is ½…

coronavirus counts do not count

Posted in Books, pictures, Statistics with tags , , , , , , , , on April 8, 2020 by xi'an

Somewhat by chance I came across Nate Silver‘s tribune on FiveThirtyEight about the meaninglessness of COVID-19 case counts. As it reflects on sampling efforts and available resources rather than actual cases, furthermore sampling efforts from at least a fortnight.

“The data, at best, is highly incomplete, and often the tip of the iceberg for much larger problems. And data on tests and the number of reported cases is highly nonrandom. In many parts of the world today, health authorities are still trying to triage the situation with a limited number of tests available. Their goal in testing is often to allocate scarce medical care to the patients who most need it — rather than to create a comprehensive dataset for epidemiologists and statisticians to study.”

This article runs four different scenarios, with the same actual parameters for the epidemics, and highly different and mostly misleading perceptions based on the testing strategies. This is a highly relevant warning but I am surprised Nate Silver does not move to the rather obvious conclusion that some form of official survey or another, for instance based on capture-recapture and representative samples, testing for present and past infections, should be implemented on a very regular basis, even with a limited number of tested persons to get a much more reliable vision of the status of the epidemics. Here, the French official institute of statistics, INSEE, would be most suited to implement such a scheme.

one or two?

Posted in Books, Kids, R with tags , , , , , , on March 12, 2020 by xi'an

A superposition of two random walks from The Riddler:

Starting from zero, a random walk is produced by choosing moves between ±1 and ±2 at each step. If the choice between both is made towards maximising the probability of ending up positive after 100 steps, what is this probability?

Although the optimal path is not necessarily made of moves that optimise the probability of ending up positive after the remaining steps, I chose to follow a dynamic programming approach by picking between ±1 and ±2 at each step based on that probability:

bs=matrix(0,405,101) #best stategy with value i-203 at time j-1
bs[204:405,101]=1
for (t in 100:1){
  tt=2*t
  bs[203+(-tt:tt),t]=.5*apply(cbind(
     bs[204+(-tt:tt),t+1]+bs[202+(-tt:tt),t+1],
     bs[201+(-tt:tt),t+1]+bs[205+(-tt:tt),t+1]),1,max)}

resulting in the probability

> bs[203,1]
[1] 0.6403174

Just checking that a simple strategy of picking ±1 above zero and ±2 below leads to the same value

ga=rep(0,T)
for(v in 1:100) ga=ga+(1+(ga<1))*sample(c(-1,1),T,rep=TRUE)

or sort of

> mean(ga>0)
[1] 0.6403494

With highly similar probabilities when switching at ga<2

> mean(ga>0)
[1] 0.6403183

or ga<0

> mean(ga>0)
[1] 0.6403008

and too little difference to spot a significant improvement between the three boundaries.

multiplying the bars

Posted in Kids, R with tags , , , , , , , on February 25, 2020 by xi'an

The latest Riddler makes the remark that the expression

|-1|-2|-3|

has no unique meaning (and hence value) since it could be

| -1x|-2|-3 | = 5   or   |-1| – 2x|-3| = -5

depending on the position of the multiplication sign and asks for all the possible values of

|-1|-2|…|-9|

which can be explored by a recursive R function for computing |-i|-(i+1)|…|-(i+2j)|

vol<-function(i,j){x=i
  if(j){x=c(i-(i+1)*vol(i+2,j-1),abs(i*vol(i+1,j-1)+i+2*j))
  if(j>1){for(k in 1:(j-2))
        x=c(x,vol(i,k)-(i+2*k+1)*vol(i+2*k+2,j-k-1))}}
  return(x)}

producing 40 different values for the ill-defined expression. However, this is incorrect as the product(s) hidden in the expression only involve a single term in vol(i,j)… I had another try with the decomposition of the expression vol(i,j) into a first part and a second part

prod<-function(a,b) a*b[,1]+b[,2]

val<-function(i,j){
  x=matrix(c(i,0),ncol=2)
  if(j){x=rbind(cbind(i,prod(-(i+1),val(i+2,j-1))),
          cbind(abs(prod(-i,val(i+1,j-1))-i-2*j),0))
    if(j-1){for(k in 2:(j-1)){
      pon=val(i,k-1)
      for(m in 1:dim(pon)[1])
          x=rbind(x,cbind(pon[m,1],pon[m,2]+prod(-(i+2*k-1),val(i+2*k,j-k))))}}}
  return(x)}

but it still fails to produce the right version.