Archive for the R Category

amazonish thanks (& repeated warning)

Posted in Books, Kids, R, Statistics with tags , , , , , on December 9, 2014 by xi'an

As in previous years, at about this time, I want to (re)warn unaware ‘Og readers that all links to Amazon.com and more rarely to Amazon.fr found on this blog are actually susceptible to earn me an advertising percentage if a purchase is made by the reader in the 24 hours following the entry on Amazon through this link, thanks to the Amazon Services LLC Associates Program, an affiliate advertising program designed to provide a means for sites to earn advertising fees by advertising and linking to amazon.com/fr. Unlike last year, I did not benefit as much from the new edition of Andrew’s book, and the link he copied from my blog entry… Here are some of the most Og-unrelated purchases:

Once again, books I reviewed, positively or negatively, were among the top purchases… Like a dozen Monte Carlo simulation and resampling methods for social science , a few copies of Naked Statistics. And again a few of The Cartoon Introduction to Statistics. (Despite a most critical review.) Thanks to all of you using those links and feeding further my book addiction, with the drawback of inducing even more fantasy book reviews.

the Grumble distribution and an ODE

Posted in Books, Kids, R, Statistics, University life with tags , , , , , , on December 3, 2014 by xi'an

As ‘Og’s readers may have noticed, I paid some recent visits to Cross Validated (although I find this too addictive to be sustainable on a long term basis!, and as already reported a few years ago frustrating at several levels from questions asked without any preliminary personal effort, to a lack of background material to understand hints towards the answer, to not even considering answers [once the homework due date was past?], &tc.). Anyway, some questions are nonetheless great puzzles, to with this one about the possible transformation of a random variable R with density

p(r|\lambda) = \dfrac{2\lambda r\exp\left(\lambda\exp\left(-r^{2}\right)-r^{2}\right)}{\exp\left(\lambda\right)-1}

into a Gumble distribution. While the better answer is that it translates into a power law,

V=e^{e^{-R^2}}\sim q(v|\lambda)\propto v^{\lambda-1}\mathbb{I}_{(1,e)}(v),

I thought using the S=R² transform could work but obtained a wrong sign in the pseudo-Gumble density

W=S-\log(\lambda)\sim \eth(w)\propto\exp\left(\exp(-w)-w\right)

and then went into seeking another transform into a Gumbel rv T, which amounted to solve the differential equation

\exp\left(-e^{-t}-t\right)\text{d}t=\exp\left(e^{-w}-w\right)\text{d}w

As I could not solve analytically the ODE, I programmed a simple Runge-Kutta numerical resolution as follows:

solvR=function(prec=10^3,maxz=1){
z=seq(1,maxz,le=prec)
t=rep(1,prec) #t(1)=1
for (i in 2:prec)
  t[i]=t[i-1]+(z[i]-z[i-1])*exp(-z[i-1]+
  exp(-z[i-1])+t[i-1]+exp(-t[i-1]))
zold=z
z=seq(.1/maxz,1,le=prec)
t=c(t[-prec],t)
for (i in (prec-1):1)
  t[i]=t[i+1]+(z[i]-z[i+1])*exp(-z[i+1]+
  exp(-z[i+1])+t[i+1]+exp(-t[i+1]))
return(cbind(c(z[-prec],zold),t))
}

Which shows that [the increasing] t(w) quickly gets too large for the function to be depicted. But this is a fairly useless result in that a transform of the original variable and of its parameter into an arbitrary distribution is always possible, given that  W above has a fixed distribution… Hence the pun on Gumble in the title.

Le Monde puzzle [#887quater]

Posted in Books, Kids, R, Statistics, University life with tags , , on November 28, 2014 by xi'an

And yet another resolution of this combinatorics Le Monde mathematical puzzle: that puzzle puzzled many more people than usual! This solution is by Marco F, using a travelling salesman representation and existing TSP software.

N is a golden number if the sequence {1,2,…,N} can be reordered so that the sum of any consecutive pair is a perfect square. What are the golden numbers between 1 and 25?

For instance, take n=199, you should first calculate the “friends”. Save them on a symmetric square matrix:

m1 <- matrix(Inf, nrow=199, ncol=199)
diag(m1) <- 0
for (i in 1:199) m1[i,friends[i]] <- 1

Export the distance matrix to a file (in TSPlib format):

library(TSP)
tsp <- TSP(m1)
tsp
image(tsp)
write_TSPLIB(tsp, "f199.TSPLIB")

And use a solver to obtain the results. The best solver for TSP is Concorde. There are online versions where you can submit jobs:

0 2 1000000
2 96 1000000
96 191 1000000
191 168 1000000
  ...

The numbers of the solution are in the second column (2, 96, 191, 168…). And they are 0-indexed, so you have to add 1 to them:

3  97 192 169 155 101 188 136 120  49 176 148 108 181 143 113 112  84  37  63 18  31  33  88168 193  96 160 129 127 162 199  90  79 177 147  78  22 122 167 194 130  39 157  99 190 13491 198  58  23  41 128 196  60  21 100 189 172 152 73 183 106  38 131 125 164 197  59 110 146178 111 145  80  20  61 135 121  75  6  94 195166 123 133 156  69  52 144  81  40   9  72 184  12  24  57  87  82 62  19  45  76 180 109 116 173 151  74  26  95 161 163 126  43 153 17154  27 117 139  30  70  11  89 107 118 138 186103  66 159 165 124 132  93  28   8  17  32  45  44  77 179 182 142  83  86  14  50 175 114 55 141 115  29  92 104 185  71  10  15  34   27  42 154 170 191  98 158  67 102 187 137 119 25  56 65  35  46 150 174  51  13  68  53  47 149 140  85  36  64 105  16  48

an ABC experiment

Posted in Books, pictures, R, Statistics, University life with tags , , , , , , , , on November 24, 2014 by xi'an

 

ABCmadIn a cross-validated forum exchange, I used the code below to illustrate the working of an ABC algorithm:

#normal data with 100 observations
n=100
x=rnorm(n)
#observed summaries
sumx=c(median(x),mad(x))

#normal x gamma prior
priori=function(N){
 return(cbind(rnorm(N,sd=10),
  1/sqrt(rgamma(N,shape=2,scale=5))))
}

ABC=function(N,alpha=.05){

  prior=priori(N) #reference table

  #pseudo-data
  summ=matrix(0,N,2)
  for (i in 1:N){
   xi=rnorm(n)*prior[i,2]+prior[i,1]
   summ[i,]=c(median(xi),mad(xi)) #summaries
   }

  #normalisation factor for the distance
  mads=c(mad(summ[,1]),mad(summ[,2]))
  #distance
  dist=(abs(sumx[1]-summ[,1])/mads[1])+
   (abs(sumx[2]-summ[,2])/mads[2])
  #selection
  posterior=prior[dist<quantile(dist,alpha),]}

Hence I used the median and the mad as my summary statistics. And the outcome is rather surprising, for two reasons: the first one is that the posterior on the mean μ is much wider than when using the mean and the variance as summary statistics. This is not completely surprising in that the latter are sufficient, while the former are not. Still, the (-10,10) range on the mean is way larger… The second reason for surprise is that the true posterior distribution cannot be derived since the joint density of med and mad is unavailable.

sufvsinsufAfter thinking about this for a while, I went back to my workbench to check the difference with using mean and variance. To my greater surprise, I found hardly any difference! Using the almost exact ABC with 10⁶ simulations and a 5% subsampling rate returns exactly the same outcome. (The first row above is for the sufficient statistics (mean,standard deviation) while the second row is for the (median,mad) pair.) Playing with the distance does not help. The genuine posterior output is quite different, as exposed on the last row of the above, using a basic Gibbs sampler since the posterior is not truly conjugate.

Le Monde puzzle [#887bis]

Posted in Kids, R, Statistics, University life with tags , , on November 16, 2014 by xi'an

As mentioned in the previous post, an alternative consists in finding the permutation of {1,…,N} by “adding” squares left and right until the permutation is complete or no solution is available. While this sounds like the dual of the initial solution, it brings a considerable improvement in computing time, as shown below. I thus redefined the construction of the solution by initialising the permutation at random (it could start at 1 just as well)

perfect=(1:trunc(sqrt(2*N)))^2
perm=friends=(1:N)
t=1
perm[t]=sample(friends,1)
friends=friends[friends!=perm[t]]

and then completing only with possible neighbours, left

out=outer(perfect-perm[t],friends,"==")
if (max(out)==1){
  t=t+1
  perm[t]=sample(rep(perfect[apply(out,1,
   max)==1],2),1)-perm[t-1]
  friends=friends[friends!=perm[t]]}

or right

out=outer(perfect-perm[1],friends,"==")
if (max(out)==1){
  t=t+1
  perf=sample(rep(perfect[apply(out,1,
    max)==1],2),1)-perm[1]
  perm[1:t]=c(perf,perm[1:(t-1)])
  friends=friends[friends!=perf]}

(If you wonder about why the rep in the sample step, it is a trick I just found to avoid the insufferable feature that sample(n,1) is equivalent to sample(1:n,1)! It costs basically nothing but bypasses reprogramming sample() each time I use it… I am very glad I found this trick!) The gain in computing time is amazing:

> system.time(for (i in 1:50) pick(15))
utilisateur     système       écoulé
      5.397       0.000       5.395
> system.time(for (i in 1:50) puck(15))
utilisateur     système      écoulé
      0.285       0.000       0.287

An unrelated point is that a more interesting (?) alternative problem consists in adding a toroidal constraint, namely to have the first and the last entries in the permutation to also sum up to a perfect square. Is it at all possible?

Le Monde puzzle [#887]

Posted in Books, Kids, R, Statistics with tags , , , on November 15, 2014 by xi'an

A simple combinatorics Le Monde mathematical puzzle:

N is a golden number if the sequence {1,2,…,N} can be reordered so that the sum of any consecutive pair is a perfect square. What are the golden numbers between 1 and 25?

Indeed, from an R programming point of view, all I have to do is to go over all possible permutations of {1,2,..,N} until one works or until I have exhausted all possible permutations for a given N. However, 25!=10²⁵ is a wee bit too large… Instead, I resorted once again to brute force simulation, by first introducing possible neighbours of the integers

  perfect=(1:trunc(sqrt(2*N)))^2
  friends=NULL
  le=1:N
  for (perm in 1:N){
    amis=perfect[perfect>perm]-perm
    amis=amis[amis<N]
    le[perm]=length(amis)
    friends=c(friends,list(amis))
    }

and then proceeding to construct the permutation one integer at time by picking from its remaining potential neighbours until there is none left or the sequence is complete

orderin=0*(1:N)
t=1
orderin[1]=sample((1:N),1)
for (perm in 1:N){
  friends[[perm]]=friends[[perm]]
              [friends[[perm]]!=orderin[1]]
  le[perm]=length(friends[[perm]])
  }
while (t<N){
  if (length(friends[[orderin[t]]])==0)
        break()
  if (length(friends[[orderin[t]]])>1){
    orderin[t+1]=sample(friends[[orderin[t]]],1)}else{
    orderin[t+1]=friends[[orderin[t]]]
    }
  for (perm in 1:N){
    friends[[perm]]=friends[[perm]]
      [friends[[perm]]!=orderin[t+1]]
    le[perm]=length(friends[[perm]])
    }
  t=t+1}

and then repeating this attempt until a full sequence is produced or a certain number of failed attempts has been reached. I gained in efficiency by proposing a second completion on the left of the first integer once a break occurs:

while (t<N){
  if (length(friends[[orderin[1]]])==0)
        break()
  orderin[2:(t+1)]=orderin[1:t]
  if (length(friends[[orderin[2]]])>1){
    orderin[1]=sample(friends[[orderin[2]]],1)}else{
    orderin[1]=friends[[orderin[2]]]
    }
  for (perm in 1:N){
    friends[[perm]]=friends[[perm]]
       [friends[[perm]]!=orderin[1]]
    le[perm]=length(friends[[perm]])
    }
  t=t+1}

(An alternative would have been to complete left and right by squared numbers taken at random…) The result of running this program showed there exist permutations with the above property for N=15,16,17,23,25,26,…,77.  Here is the solution for N=49:

25 39 10 26 38 43 21 4 32 49 15 34 30 6 3 22 42 7 9 27 37 12 13 23 41 40 24 1 8 28 36 45 19 17 47 2 14 11 5 44 20 29 35 46 18 31 33 16 48

As an aside, the authors of Le Monde puzzle pretended (in Tuesday, Nov. 12, edition) that there was no solution for N=23, while this sequence

22 3 1 8 17 19 6 10 15 21 4 12 13 23 2 14 11 5 20 16 9 7 18

sounds fine enough to me… I more generally wonder at the general principle behind the existence of such sequences. It sounds quite probable that they exist for N>24. (The published solution does not bring any light on this issue, so I assume the authors have no mathematical analysis to provide.)

Rasmus’ socks fit perfectly!

Posted in Books, Kids, R, Statistics, University life with tags , , , , on November 10, 2014 by xi'an

nsocksFollowing the previous post on Rasmus’ socks, I took the opportunity of a survey on ABC I am currently completing to compare the outcome of his R code with my analytical derivation. After one quick correction [by Rasmus] of a wrong representation of the Negative Binomial mean-variance parametrisation [by me], I achieved this nice fit… psocks

Follow

Get every new post delivered to your Inbox.

Join 717 other followers