Archive for the R Category

RNG impact on MCMC [or lack thereof]

Posted in Books, R, Statistics, Travel, University life with tags , , , , , , , on July 13, 2017 by xi'an

Following the talk at MCM 2017 about the strange impact of the random generator on the outcome of an MCMC generator, I tried in Montréal airport the following code on the banana target of Haario et al. (1999), copied from Soetaert and Laine and using the MCMC function of the FME package:

library(FME)
Banana <- function (x1, x2) {
 return(x2 - (x1^2+1)) }
pmultinorm <- function(vec, mean, Cov) {
 diff <- vec - mean
 ex <- -0.5*t(diff) %*% solve(Cov) %*% diff
 rdet <- sqrt(det(Cov))
 power <- -length(diff)*0.5
 return((2.*pi)^power / rdet * exp(ex)) }
BananaSS <- function (p) {
 P <- c(p[1], Banana(p[1], p[2]))
 Cov <- matrix(nr = 2, data = c(1, 0.9, 0.9, 1))
N=1e3
ejd=matrix(0,4,N)
RNGkind("Mars")
for (t in 1:N){
  MCMC <- modMCMC(f = BananaSS, p = c(0, 0.7), 
  jump = diag(nrow = 2, x = 5), niter = 1e3)
  ejd[1,t]=mean((MCMC$pars[-1,2]-MCMC$pars[1,2])^2)}

since this divergence from the initial condition seemed to reflect the experiment of the speaker at MCM 2017. Unsurprisingly, no difference came from using the different RNGs in R (which may fail to contain those incriminated by the study)…

easy riddle

Posted in Books, Kids, R with tags , , , , , on July 12, 2017 by xi'an

From the current Riddler, a problem that only requires a few lines of code and a few seconds of reasoning. Or not.

N households each stole the earnings from one of the (N-1) other households, one at a time. What is the probability that a given household is not burglarised? And what are the expected final earnings of each household in the list, assuming they all start with $1?

The first question is close to Feller’s enveloppe problem in that

\left(1-\frac{1}{N-1}\right)^{N-1}

is close to exp(-1) for N large. The second question can easily be solved by an R code like

N=1e3;M=1e6
fina=rep(1,N)
for (v in 1:M){
 ordre=sample(1:N)
 vole=sample(1:N,N,rep=TRUE)
 while (min(abs(vole-(1:N)))==0)
  vole[abs(vole-(1:N))==0]=sample(1:N,
     sum(vole-(1:N)==0))
 cash=rep(1,N)
 for (t in 1:N){
  cash[ordre[t]]=cash[ordre[t]]+cash[vole[t]];cash[vole[t]]=0}
 fina=fina+cash[ordre]}

which returns a pretty regular exponential-like curve, although I cannot figure the exact curve beyond the third burglary. The published solution gives the curve

{\frac{N-2}{N-1}}^{999}\times 2+{\frac{1}{N-1}}^{t-1}\times{\frac{N-1}{N}}^{N-t}\times\frac{N}{N-1}

corresponding to the probability of never being robbed (and getting on average an extra unit from the robbery) and of being robbed only before robbing someone else (with average wealth N/(N-1)).

[un]solved riddles

Posted in Books, Kids, R with tags , , on July 4, 2017 by xi'an

On the Riddler of last week, first a birthday puzzle:

Given a group of 23 persons, what is the probability of observing three pairs of identical birthdays?

which can be found by a quick simulation as

ave=0
for (t in 1:1e6){
 dupz=dates[duplicated(sample(1:365,23,rep=TRUE))]
 ave=ave+as.integer((length(dupz)==3)&
     (length(unique(dupz))==3))}}
ave/M

returning a value of 0.0183, but which combinatoric resolution I could not fully fathom without a little help from a friend (-ly blog). I had the multinomial coefficient

{23\choose 2\ 2\ 2}

for the allocation of the 23 persons to one of the three pairs or none, as well as the probability

\dfrac{365\times\cdots\times(365-(23-4)+1)}{365^{23}}

but I had forgotten the 3! in the denominator for the permutations of the three pairs, which leads again to

{23 \choose 2\ 2\ 2}\dfrac{365\times\cdots\times(365-(23-4)+1)}{3!\times365^{23}} = 0.0183

A question that also led to an unsolved question: in the even this probability was much smaller, what is an easy way into constructing a more efficient importance sampler?

The second riddle was just as easy to code in R:

A game of tag goes by the following rules: (i) anyone untagged can tag anyone untagged; (ii) anyone tagged by a player tagged gets untagged; (iii) the winner is the last untagged player. What is the expected number of runs for N players?

The outcome of

game=function(N=12){
  a=rep(0,N);T=0
  while (sum(a==0)>1){
   ij=sample((1:N)[a==0],2)
   a[ij[2]]=ij[1];a[a==ij[2]]=0
   T=T+1}
  return(T)}

leads to an average value of

2^{N-1}-1

but I had no clear quick explanation for the doubling phenomenon. Until I picked a pen and a sheet of paper and drew the last steps of the game: to decrease down to 1, the size of the untagged survivors has to get through …,3,2 and each time the eliminated player needs to have tagged no other player since otherwise the population grows again. This has to apply all the way to the second round, where N-1 players remain and the one tagged needs to be anyone but the one who tagged the first one. And so on…

thinning a Markov chain, statistically

Posted in Books, pictures, R, Statistics with tags , , , , , , on June 13, 2017 by xi'an

Art Owen has arXived a new version of his thinning MCMC paper, where he studies how thinning or subsampling can improve computing time in MCMC chains. I remember quite well the message set by Mark Berliner and Steve MacEachern in an early 1990’s paper that subsampling was always increasing the variance of the resulting estimators. We actually have this result in our Monte Carlo Statistical Methods book. Now, there are other perspectives on this, as for instance cases when thinning can be hard-wired by simulating directly a k-step move, delaying rejection or acceptance, prefetching, or simulating directly the accepted values as in our vanilla Rao-Blackwellisation approach. Here, Art considers the case when there is a cost θ of computing a transform of the simulation [when the transition cost a unit] and when those transforms are positively correlated with correlation ρ. Somewhat unsurprisingly, when θ is large enough, thinning becomes worth implementing. But requires extra computations in evaluating the correlation ρ and the cost θ, which is rarely comparable with the cost of computing the likelihood itself, a requirement for the Metropolis-Hastings or Hamiltonian Monte Carlo step(s).  Subsampling while keeping the right target (which is a hard constraint!) should thus have a much more effective impact on computing budgets.

datazar

Posted in R, Statistics, University life with tags , , , , , , , on June 4, 2017 by xi'an

A few weeks ago and then some, I [as occasional blogger!] got contacted by datazar.com to write a piece on this data-sharing platform. I then went and checked what this was all about, having the vague impression this was a platform where I could store and tun R codes, besides dropping collective projects, but from what I quickly read, it sounds more like being able to run R scripts from one’s machine using data and code stored on datazar.com. But after reading just one more blog entry I finally understood it is also possible to run R, SQL, NotebookJS (and LaTeX) directly on that platform, without downloading code or data to one’s machine. Which makes it a definitive plus with this site, as users can experiment with no transfer to their computer. Hence on a larger variety of platforms. While personally I do not [yet?] see how to use it for my research or [limited] teaching, it seems like an [yet another] interesting exploration of the positive uses of Internet to collaborate and communicate on scientific issues! With no opinion on privacy and data protection offered by the site, of course.

continental divide

Posted in Books, Kids, pictures, R with tags , , , , on May 19, 2017 by xi'an

While the Riddler puzzle this week was anticlimactic,  as it meant filling all digits in the above division towards a null remainder, it came as an interesting illustration of how different division is taught in the US versus France: when I saw the picture above, I had to go and check an American primary school on-line introduction to division, since the way I was taught in France is something like that

with the solution being that 12128316 = 124 x 97809… Solved by a dumb R exploration of all constraints:

for (y in 111:143)
for (z4 in 8:9)
for (oz in 0:999){
  z=oz+7e3+z4*1e4
  x=y*z
  digx=digits(x)
  digz=digits(z)
  if ((digz[2]==0)&(x>=1e7)&(x<1e8)){ 
   r1=trunc(x/1e4)-digz[5]*y 
   if ((digz[5]*y>=1e3)&(digz[4]*y<1e4) &(r1>9)&(r1<100)){ 
    r2=10*r1+digx[4]-7*y 
    if ((7*y>=1e2)&(7*y<1e3)&(r2>=1e2)&(r2<1e3)){     
     r3=10*r2+digx[3]-digz[3]*y 
     if ((digz[3]*y>=1e2)&(digz[3]*y<1e3)&(r3>9)&(r3<1e2)){
       r4=10*r3+digx[2]
       if (r4<y) solz=rbind(solz,c(y,z,x))
  }}}}

Looking for a computer-free resolution, the constraints on z exhibited by the picture are that (a) the second digit is 0 and the fourth digit is 7.  Moreover, the first and fifth digits are larger than 7 since y times these digits is a four-digit number. Better, since the second subtraction from a three-digit number by 7y returns a three-digit number and the third subtraction from a four-digit number by ny returns a two-digit number, n is larger than 7 but less than the first and fifth digits. Ergo, z is necessarily 97809! Furthermore, 8y<10³ and 9y≥10³, which means 111<y<125. Plus the constraint that 1000-8y≤99 implies y≥112. Nothing gained there! This leaves 12 values of y to study, unless there is another restriction I missed…

Le Monde puzzle [#1006]

Posted in Kids, R with tags , , , , , , , on May 3, 2017 by xi'an

Once the pseudo-story [noise] removed, a linear programming Le Monde mathematical puzzle:

For the integer linear programming problem

max 2x¹+2x²+x³+…+x¹⁰

under the constraints

x¹>x²+x³, x²>x³+x⁴, …, x⁹>x¹⁰+x¹, x¹⁰>x¹+x²

find a solution with the maximal number of positive entries.

Expressed this way, it becomes quite straightforward to solve with the help of a linear programming R code like lpSolve. Indeed, a simple iteration of the constraints shows that positive entries are necessarily bracketed by negative entries, since, e.g.,

x²<-88x¹/55, x¹⁰<-33x¹/55

(which applies to all consecutive triplets since the constraints are invariant by transposition). Hence there are at most five positive entries but calling lpSolve with this option

> lp (direction="max", 
objective.in=c(2,2,rep(1,8)),
const.mat=A, 
const.dir=rep(">=",10), 
const.rhs=rep(1,10)+A%*%c(rep(c(20,-1),5)), 
all.int=TRUE) 
Error: no feasible solution found

shows this is not possible. (The added vector is my way of getting around the constraint that lpSolve only considers positive entries. I therefore moved the negative entries by 20, meaning they are assumed to be larger than -20. Using the larger bound 50 does not change the outcome.) From there, there are 10 possible versions of vectors with four positive entries and a simple loop returns

> masume
[1] -90
> topast
 [1] -11 1 -13 1 -15 1 -17 1 -19 -9

as the maximal criterion and argument of this maximum with four positive entries.

As an aside, the previous Le Monde puzzle [#1005] was also provided by a linear system: given 64 cubes, each of the 384 faces being painted in one of four colours, with exactly 40 of these cubes exhibiting the four colours,  the question was about the number of cubes that could be bicolour so that a mono-colour super-cube could be reconstituted for all colours.  Which amounted to solve the four equations

4a+2b=24,4c+2d=40, b+c=8,d+3a=24,

leading to 40 quadri-colour, 16 tri-colour, and 8 bi-colour cubes.