**A**rt Owen has arXived a new version of his thinning MCMC paper, where he studies how thinning or subsampling can improve computing time in MCMC chains. I remember quite well the message set by Mark Berliner and Steve MacEachern in an early 1990’s paper that subsampling was *always* increasing the variance of the resulting estimators. We actually have this result in our Monte Carlo Statistical Methods book. Now, there are other perspectives on this, as for instance cases when thinning can be hard-wired by simulating directly a k-step move, delaying rejection or acceptance, prefetching, or simulating directly the accepted values as in our vanilla Rao-Blackwellisation approach. Here, Art considers the case when there is a cost θ of computing a transform of the simulation [when the transition cost a unit] and when those transforms are positively correlated with correlation ρ. Somewhat unsurprisingly, when θ is large enough, thinning becomes worth implementing. But requires extra computations in evaluating the correlation ρ and the cost θ, which is rarely comparable with the cost of computing the likelihood itself, a requirement for the Metropolis-Hastings or Hamiltonian Monte Carlo step(s). Subsampling while keeping the right target (which is a hard constraint!) should thus have a much more effective impact on computing budgets.

## Archive for the R Category

## thinning a Markov chain, statistically

Posted in Books, pictures, R, Statistics with tags autocorrelation, computing time, MCMC, MCMC convergence, Monte Carlo Statistical Methods, thinning, vanilla Rao-Blackwellisation on June 13, 2017 by xi'an## datazar

Posted in R, Statistics, University life with tags blogging, data privacy, data science, data-sharing, datazar.com, LaTeX, R, R-bloggers on June 4, 2017 by xi'an**A** few weeks ago and then some, I [as occasional blogger!] got contacted by datazar.com to write a piece on this data-sharing platform. I then went and checked what this was all about, having the vague impression this was a platform where I could store and tun R codes, besides dropping collective projects, but from what I quickly read, it sounds more like being able to run R scripts from one’s machine using data and code stored on datazar.com. But after reading just one more blog entry I finally understood it is also possible to run R, SQL, NotebookJS (and LaTeX) directly on that platform, without downloading code or data to one’s machine. Which makes it a definitive plus with this site, as users can experiment with no transfer to their computer. Hence on a larger variety of platforms. While personally I do not [yet?] see how to use it for my research or [limited] teaching, it seems like an [yet another] interesting exploration of the positive uses of Internet to collaborate and communicate on scientific issues! With no opinion on privacy and data protection offered by the site, of course.

## continental divide

Posted in Books, Kids, pictures, R with tags arithmetics, division, FiveThirtyEight, long division, The Riddler on May 19, 2017 by xi'an**W**hile the Riddler puzzle this week was anticlimactic, as it meant filling all digits in the above division towards a null remainder, it came as an interesting illustration of how different division is taught in the US versus France: when I saw the picture above, I had to go and check an American primary school on-line introduction to division, since the way I was taught in France is something like that

with the solution being that 12128316 = 124 x 97809… Solved by a dumb R exploration of all constraints:

for (y in 111:143) for (z4 in 8:9) for (oz in 0:999){ z=oz+7e3+z4*1e4 x=y*z digx=digits(x) digz=digits(z) if ((digz[2]==0)&(x>=1e7)&(x<1e8)){ r1=trunc(x/1e4)-digz[5]*y if ((digz[5]*y>=1e3)&(digz[4]*y<1e4) &(r1>9)&(r1<100)){ r2=10*r1+digx[4]-7*y if ((7*y>=1e2)&(7*y<1e3)&(r2>=1e2)&(r2<1e3)){ r3=10*r2+digx[3]-digz[3]*y if ((digz[3]*y>=1e2)&(digz[3]*y<1e3)&(r3>9)&(r3<1e2)){ r4=10*r3+digx[2] if (r4<y) solz=rbind(solz,c(y,z,x)) }}}}

Looking for a computer-free resolution, the constraints on z exhibited by the picture are that (a) the second digit is 0 and the fourth digit is 7. Moreover, the first and fifth digits are larger than 7 since y times these digits is a four-digit number. Better, since the second subtraction from a three-digit number by 7y returns a three-digit number and the third subtraction from a four-digit number by ny returns a two-digit number, n is larger than 7 but less than the first and fifth digits. Ergo, z is necessarily 97809! Furthermore, 8y<10³ and 9y≥10³, which means 111<y<125. Plus the constraint that 1000-8y≤99 implies y≥112. Nothing gained there! This leaves 12 values of y to study, unless there is another restriction I missed…

## Le Monde puzzle [#1006]

Posted in Kids, R with tags competition, Le Monde, linear programming, linear system of equations, lpSolve, mathematical puzzle, optimisation, R on May 3, 2017 by xi'an**O**nce the pseudo-story [noise] removed, a linear programming Le Monde mathematical puzzle:

For the integer linear programming problemmax 2x¹+2x²+x³+…+x¹⁰

under the constraintsx¹>x²+x³, x²>x³+x⁴, …, x⁹>x¹⁰+x¹, x¹⁰>x¹+x²

find a solution with the maximal number of positive entries.

**E**xpressed this way, it becomes quite straightforward to solve with the help of a linear programming R code like lpSolve. Indeed, a simple iteration of the constraints shows that positive entries are necessarily bracketed by negative entries, since, e.g.,

x²<-88x¹/55, x¹⁰<-33x¹/55

(which applies to all consecutive triplets since the constraints are invariant by transposition). Hence there are at most five positive entries but calling lpSolve with this option

> lp (direction="max", objective.in=c(2,2,rep(1,8)), const.mat=A, const.dir=rep(">=",10), const.rhs=rep(1,10)+A%*%c(rep(c(20,-1),5)), all.int=TRUE) Error: no feasible solution found

shows this is not possible. (The added vector is my way of getting around the constraint that *lpSolve only considers positive entries*. I therefore moved the negative entries by 20, meaning they are assumed to be larger than -20. Using the larger bound 50 does not change the outcome.) From there, there are 10 possible versions of vectors with four positive entries and a simple loop returns

> masume [1] -90 > topast [1] -11 1 -13 1 -15 1 -17 1 -19 -9

as the maximal criterion and argument of this maximum with four positive entries.

As an aside, the previous Le Monde puzzle [#1005] was also provided by a linear system: given 64 cubes, each of the 384 faces being painted in one of four colours, with exactly 40 of these cubes exhibiting the four colours, the question was about the number of cubes that could be bicolour so that a mono-colour super-cube could be reconstituted for all colours. Which amounted to solve the four equations

4a+2b=24,4c+2d=40, b+c=8,d+3a=24,

leading to 40 quadri-colour, 16 tri-colour, and 8 bi-colour cubes.

## a secretary problem with maximum ability

Posted in Kids, R with tags mathematical puzzle, R, secretary problem, stopping rule, The Riddler on April 28, 2017 by xi'an**T**he Riddler of today has a secretary problem, where one measures sequentially N random variables until one deems the current variable to be the largest of the whole sample. The classical secretary problem has a counter-intuitive solution where one first measures N/e random variables without taking any decision and then and only then picks the first next outcome larger than the largest in the first group. (For large values of N.) The added information in the current riddle is that the distribution of those iid random variables is set to be uniform on {1,…,M}, which begs for a modification in the algorithm. As for instance when observing M on the current draw.

The approach I devised is certainly suboptimal, as I decided to pick the currently observed value if the (conditional) probability it is the largest is larger than the probability subsequent draws. This translates into the following R code:

M=100 #maximum value N=10 #total number of draws hyprob=function(m){ # m is sequence of draws so far n=length(m);mmax=max(m) if ((m[n]<mmax)||(mmax-n<N-n)){prob=0 }else{ prob=prod(sort((1:mmax)[-m],dec=TRUE) [1:(N-n)]/((M-n):(M-N+1))} return(prob)} decision=function(draz=sample(1:M,N)){ i=0 keepgoin=TRUE while ((keepgoin)&(i<N)){ i=i+1 keepgoin=(hyprob(draz[1:i])<0.5)} return(c(i,draz[i],(draz[i]<max(draz))))}

which produces a winning rate of around 62% when N=10 and M=100, hence much better than the expected performances of the (asymptotic) secretary algorithm, with a winning frequency of 1/e. (For N=10 and M=100, the winning frequency is only 27%.)

## Le Monde puzzle [#1003]

Posted in Kids, R with tags competition, Le Monde, mathematical puzzle, R on April 18, 2017 by xi'an**A** purely arithmetic Le Monde mathematical puzzle:

Find the four integers w, x, y, z such that the four smallest pairwise sums among the six pairwise sums are 59, 65, 66, and 69. Similarly, find the four smallest of the five integers v, x, y, z such that the five smallest pairwise sums among the ten pairwise sums are 56, 64 , 66, 69 and 70.

**T**he first question is rather straightforward since there are only two possible orderings when x≤y≤z≤w :

x+y≤x+z≤x+w≤y+z≤y+w≤z+w and x+y≤x+z≤y+z≤x+w≤y+w≤z+w

which means

x+y=59, x+z=65, x+w=66, y+z=69, and x+y=59, x+z=65, y+z=66, x+w=69

but since the first system does not allow for an integer solution, the only possibility is the second system, with solution (x,y,z,w)=(29,30,36,40). And the second question is of the same complexity, with, when x≤y≤z≤w≤v :

x+y=56, x+z=64, y+z=66, x+w=69, x+v=70 or x+y=56, x+z=64, x+w=66, x+v=69, y+z=70

with solutions (x,y,z,w,v)=(27,29,37,42,43) and (x,y,z,w,v)=(25,31,39,41,44).

## optimultiplication [a riddle]

Posted in Books, Kids, R, Statistics with tags coding, conditional probability, FiveThirtyEight, mathematical puzzle, R, The Riddler on April 14, 2017 by xi'an**T**he riddle of this week is about an optimisation of positioning the four digits of a multiplication of two numbers with two digits each and is open to a coding resolution:

Four digits are drawn without replacement from {0,1,…,9}, one at a time. What is the optimal strategy to position those four digits, two digits per row, as they are drawn, toward minimising the average product?

Although the problem can be solved algebraically by computing **E**[X⁴|x¹,..] and **E**[X⁴X³|x¹,..] I wrote three R codes to “optimise” the location of the first three digits: the first digit ends up as a unit if it is 5 or more and a multiple of ten otherwise, on the first row. For the second draw, it is slightly more variable: with this R code,

second<-function(i,j,N=1e5){draw drew=matrix(0,N,2) for (t in 1:N) drew[t,]=sample((0:9)[-c(i+1,j+1)],2) conmean=(45-i-j)/8 conprod=mean(drew[,1]*drew[,2]) if (i<5){ #10*i pos=c((110*i+11*j)*conmean, 100*i*j+10*(i+j)*conmean+conprod, (100*i+j)*conmean+10*i*j+10*conprod)}else{ pos=c((110*j+11*i)*conmean, 10*i*j+(100*j+i)*conmean+10*conprod, 10*(i+j)*conmean+i*j+100*conprod) return(order(pos)[1])}

the resulting digit again ends up as a unit if it is 5 (except when x¹=7,8,9, where it is 4) or more and a multiple of ten otherwise, but on the second row. Except when x¹=0, x²=1,2,3,4, when they end up on the first row together, 0 obviously in front.

For the third and last open draw, there is only one remaining random draw, which mean that the decision only depends on x¹,x²,x³ and **E**[X⁴|x¹,x²,x³]=(45-x¹-x²-x³)/7. Attaching x³ to x² or x¹ will then vary monotonically in x³, depending on whether x¹>x² or x¹<x²:

fourth=function(i,j,k){ comean=(45-i-j-k)/7 if ((i<1)&(j<5)){ pos=c(10*comean+k,comean+10*k)} if ((i<5)&(j>4)){ pos=c(100*i*comean+k*j,j*comean+100*i*k)} if ((i>0)&(i<5)&(j<5)){ pos=c(i*comean+k*j,j*comean+i*k)} if ((i<7)&(i>4)&(j<5)){ pos=c(i*comean+100*k*j,j*comean+100*i*k)} if ((i<7)&(i>4)&(j>4)){ pos=c(i*comean+k*j,j*comean+i*k)} if ((i>6)&(j<4)){ pos=c(i*comean+100*k*j,j*comean+100*i*k)} if ((i>6)&(j>3)){ pos=c(i*comean+k*j,j*comean+i*k)} return(order(pos)[1])}

Running this R code for all combinations of x¹,x² shows that, except for the cases x¹≥5 and x²=0, for which x³ invariably remains in front of x¹, there are always values of x³ associated with each position.