Archive for the R Category

Le Monde puzzle [#1000…1025]

Posted in Kids, R on March 28, 2017 by xi'an

Le Monde mathematical puzzle launched a competition to celebrate its 1000th puzzle! A fairly long-term competition, as it runs over the coming 25 puzzles (and hence weeks), starting with puzzle #1001. Here is the 1000th puzzle, which is not part of the competition:

Alice & Bob spend five (identical) vouchers in five different shops, each time buying the maximum number of items to get close to the voucher value. In these five shops, they buy sofas at 421 euros each, beds at 347 euros each, kitchen appliances at 289 euros each, tables at 251 euros each and bikes at 211 euros each, respectively. Once the buying frenzy is over, they realise that within a single shop, they would have spent exactly four vouchers for the same products. What is the value of a voucher?
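Waiting for the solution, here is a brute-force sketch in R under one tentative reading of the riddle, namely that the five purchases (each the largest multiple of the shop price that does not exceed the voucher value) add up to exactly four times the voucher value; none of this is from the original puzzle page:

#tentative brute-force search: total spent equals four voucher values,
#i.e. the five remainders v mod price sum up to v, hence v<sum(price)
price=c(421,347,289,251,211)
for (v in 421:(sum(price)-1)) #v>420 so that every shop sells at least one item
  if (sum(price*(v%/%price))==4*v) print(v)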

Le Monde puzzle [#1001]

Posted in Kids, R on March 27, 2017 by xi'an

After a long lag (due to my missing the free copies distributed at Paris-Dauphine!), here is a Sudoku-like Le Monde mathematical puzzle:

A grid of size (n,n) holds integer values such that any entry larger than 1 is the sum of one term in the same column and one term in the same row. What is the maximal possible value observed in such a grid when n=3,4?

This can be solved in R by a random exploration of such possible grids in a simulated annealing spirit:

N=3 #grid size (3 or 4 in the puzzle)
mat=matrix(1,N,N)
goal=1

targ=function(mat){ #number of violated constraints
  d=0
  for (i in (1:(N*N))[mat>1]){
    r=(i-1)%%N+1;c=(i-1)%/%N+1 #row and column of entry i
    #entry must be the sum of a same-column and a same-row term
    d=d+(min(abs(mat[i]-outer(mat[-r,c],mat[r,-c],"+")))>0)}
  return(d)}

cur=0
for (t in 1:1e6){
  i=sample(1:(N*N),1);prop=mat #perturb one entry at random
  prop[i]=sample(1:(2*goal),1)
  d=targ(prop)
  if (10*log(runif(1))/t<cur-d){ #annealed acceptance step
      mat=prop;cur=d}
  if ((d==0)&(max(prop)>goal)){ #record new best valid grid
     goal=max(prop);maxx=prop}}

returning a value of 8 for n=3 and 37 for n=4. However, the method is quite myopic, so I tried instead a random filling of the grid, each time using the maximum possible sum for an empty cell:

goal=1
for (v in 1:1e6){
  mat=matrix(0,N,N)
  #at least one 1 per row and per column
  for (i in 1:N) mat[i,sample(1:N,1)]=1
  for (i in 1:N) if (max(mat[,i])==0) mat[sample(1:N,1),i]=1
  while (min(mat)==0){
    parm=sample(1:(N*N)) #visit empty cells in random order
    for (i in parm[mat[parm]==0]){
      r=(i-1)%%N+1;c=(i-1)%/%N+1
      if ((max(mat[-r,c])>0)&(max(mat[r,-c])>0)){
        #fill with the largest possible sum of a column and a row term
        mat[i]=max(mat[-r,c])+max(mat[r,-c])
        break}}}
  if (goal<max(mat)){ #record new record value
    goal=max(mat);maxx=mat}}

which recovered the maximum of 8 for n=3, but reached 48 for n=4, then 211 for n=5 and 647 for n=6… For instance, here is the solution found for n=4:

     [,1] [,2] [,3] [,4]
[1,]    1    5   11   10
[2,]    2    4    1    5
[3,]   48    2   24    1
[4,]   24    1   22   11


what does more efficient Monte Carlo mean?

Posted in Books, Kids, R, Statistics on March 17, 2017 by xi'an

“I was just thinking that there might be a magic trick to simulate directly from this distribution without having to go for less efficient methods.”

In a simple question on X validated a few days ago [about simulating from x²φ(x)] popped up the remark that the person asking the question wanted a direct simulation method for higher efficiency, compared with an accept-reject solution. Which shows a misunderstanding of what “efficiency” means in Monte Carlo settings. If it means anything, I would think it is reflected in the average time taken to return one simulation, and possibly in the worst case. But there is no reason to call an inverse-cdf method more efficient than an accept-reject or a transform approach, since it all depends on the time the inversion takes compared with the other solutions… Since inverting the closed-form cdf in this example is much more expensive than generating a Gamma(3/2,½) [that is, a χ²₃] and taking plus or minus its square root, this is certainly the case here. Maybe a ziggurat method could be devised, especially since x²φ(x)<φ(x) when |x|≤1, but I am not sure it is worth the effort!
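For illustration, here is a minimal sketch of that direct simulation in R, relying on the χ²₃ representation above (the graphical check is only a sanity check, not an efficiency comparison):

#direct simulation from the density x²φ(x):
#if Y ~ χ²₃ = Gamma(3/2,rate=1/2), then X=±√Y has density x²φ(x)
n=1e5
Y=rchisq(n,df=3)
X=sample(c(-1,1),n,replace=TRUE)*sqrt(Y)
#sanity check against the target density
hist(X,prob=TRUE,breaks=100,main="")
curve(x^2*dnorm(x),add=TRUE,col="sienna")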

Peter Lee (1940?-2017)

Posted in Books, pictures, R, Statistics, University life, Wines on March 12, 2017 by xi'an

Just heard the sad news that Peter Lee, British Bayesian and author of Bayesian Statistics: An Introduction, passed away last night. While I did not know him well, I remember meeting him at a few conferences in the UK and spending a hilarious evening at the pub. When the book came out, I thought it was quite a fine introduction to Bayesian Statistics, with enough mathematical details and prerequisites to make it worth studying, while also including computational recommendations. Fare thee well, Peter.

testing R code [book review]

Posted in R, Statistics, Travel on March 1, 2017 by xi'an

When I saw this title among the CRC Press novelties, I immediately ordered it as I thought it fairly exciting. Now that I have gone through the book, the excitement has died. Maybe faster than need be, as I read it while stuck in a soulless Schiphol airport, missing the only ice-climbing opportunity of the year!

Testing R Code was written by Richard Cotton and is quite short: once you take out the appendices and the answers to the exercises, it is about 130 pages long, with a significant proportion of code and output. It is about tools for testing the coherence of R code, in terms of inputs more than outputs: the assertive and testthat packages (the latter developed by Hadley Wickham of RStudio), intended for run-time versus development-time testing, respectively, meaning that the outputs and the inputs are what the author of the code intends them to be. The other chapters contain advice and heuristics about writing maintainable and testable code, and about incorporating a testing feature in an R package.
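To give the flavour, here is a minimal, made-up illustration of a development-time unit test written with testthat, for a hypothetical function geo_mean (not an example taken from the book):

library(testthat)

geo_mean=function(x) exp(mean(log(x))) #toy function under test

test_that("geo_mean handles simple cases",{
  expect_equal(geo_mean(c(1,4)),2) #geometric mean of 1 and 4 is 2
  expect_equal(geo_mean(5),5)
  expect_error(geo_mean("a"))}) #non-numeric input must fail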

While I am definitely a poorly qualified reader for this type of R book, my disappointment stems from my expectation of a book about debugging R code, possibly due to a misunderstanding of the term testing. This is an unrealistic expectation, for sure, as testing that a code produces what it is supposed to produce requires some advanced knowledge of what the output should be, at least in some representative situations. Which means an interface like RStudio is capital in spotting unsavoury behaviours of some variables, if not foolproof in any case.

a riddle at the end of its tether

Posted in R, Statistics on February 24, 2017 by xi'an

A simply worded riddle this week on The Riddler, about four ropes with non-uniform and unknown burning rates, the only constraint being that they all burn completely in one hour. With the help of a lighter (or even a single match), what are the possible units of time one can measure by burning them?

While I had worked out a range of possible times for each of the ropes to complete its combustion, I went for a simulation-based confirmation. The starting point is that a new fire can only be lit at the instant when a burning rope finishes burning, which only happens for a finite number of configurations. I found 24 of them, consisting of

> total*prec
 [1] 0.0000 0.5000 0.7500 0.8750 0.9375 1.0000 1.1250 1.1875 1.2500 1.3125
[11] 1.3750 1.4375 1.5000 1.6250 1.7500 1.8750 2.0000 2.1250 2.2500 2.5000
[21] 2.7500 3.0000 3.5000 4.0000

i.e., some combinations of 1, 2⁻¹, …, 2⁻⁴, with the comment that those times cannot all be obtained within a single arson experiment.
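For the record, here is a minimal sketch of one such simulation, my own reconstruction rather than the code producing the above output: it exploits the fact that a rope with remaining one-end burning time T burns out in T/2 when lit at both ends, whatever its burning rate, and randomly lights zero, one or two ends of each surviving rope at every burn-out event:

#random exploration of the measurable times with four one-hour ropes
sim=function(){
  rem=rep(1,4) #remaining one-end burning time of each rope
  lit=rep(0,4) #number of lit ends (0, 1 or 2), can only increase
  tyme=0
  repeat{
    #at each event, possibly light more ends of the unburnt ropes
    for (i in (1:4)[rem>0]) lit[i]=max(lit[i],sample(0:2,1))
    burn=(rem>0)&(lit>0)
    if (!any(burn)) return(tyme) #no fire left: stop the clock
    dt=min(rem[burn]/lit[burn]) #next rope to burn out
    rem[burn]=rem[burn]-dt*lit[burn]
    tyme=tyme+dt}}
poss=sort(unique(round(replicate(1e5,sim()),4)))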


coauthorship and citation networks

Posted in Books, pictures, R, Statistics, University life on February 21, 2017 by xi'an

As I discovered (!) the Annals of Applied Statistics in my mailbox just prior to taking the local train to Dauphine for the first time in 2017 (!), I started reading it on the way, but did not get any further than the first discussion paper, by Pengsheng Ji and Jiashun Jin, on coauthorship and citation networks for statisticians. I found the whole exercise intriguing, I must confess, with little to support a whole discussion on the topic. I may have read the paper too superficially as a métro pastime, but to me it sounded more like a post-hoc analysis than a statistical exercise, something like looking at the network, or rather at the output of a software representing networks, and making sense of clumps and sub-networks a posteriori. (In a way this reminded me of my first SAS project at school, on the patterns of vacations in France. It was in 1983, on punched cards. And we spent a while cutting & pasting, in a literal sense, the 80-column graphs produced by SAS on endless listings.)

It may be that part of the interest in the paper is self-centred. I do not think analysing a similar dataset in another field like deconstructionist philosophy or Korean raku would have attracted the same attention. Looking at the clusters and the names on the pictures obviously makes sense, if more at a curiosity than at a scientific level, as I do not think this brings much in terms of ranking and evaluating research (despite what Bernard Silverman suggests in his preface) or understanding collaborations (beyond the fact that people in the same subfield or the same active place like Duke tend to collaborate). Speaking of curiosity, I was quite surprised to spot my name in one network and even more to see that I was part of the “High-Dimensional Data Analysis” cluster, rather than of the “Bayes” cluster. I cannot fathom how I ended up in that theme, as I cannot think of a single paper of mine pertaining to either high dimensions or data analysis [to force the trait just a wee bit!]. Maybe thanks to my joint paper with Peter Mueller. (I tried to check the data itself but could not trace my own papers in the raw datafiles.)

I also wonder what the point is of looking at only four major journals in the field, missing for instance most of computational statistics and biostatistics, not to mention machine learning or econometrics. This results in a somewhat narrow niche, even if it obviously recovers the main authors in the [corresponding] field. Some major players in computational stats still make it to the lists, like Gareth Roberts or Håvard Rue, but under the wrong categorisation of spatial statistics.