Archive for the R Category

Egyptian fractions [Le Monde puzzle #922]

Posted in Books, Kids, R on July 28, 2015 by xi'an

For its summer edition, Le Monde mathematical puzzle switched to a lighter version with an immediate solution. This #922 considers Egyptian fractions, that is, decompositions of a fraction into a sum of unit fractions with distinct denominators: every numerator is 1 and each denominator appears only once, so that 3/4 is represented as ½+¼. As I discovered when looking online, a lot of people are fascinated with this representation and have devised various algorithms to achieve decompositions with specific properties, including Fibonacci, who described the greedy algorithm in 1202 in the Liber Abaci. In the current Le Monde edition, the questions were somewhat modest and dealt with the smallest decompositions of 2/5, 5/12, and 50/77 under some additional constraint.
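For illustration, here is a minimal sketch of mine of Fibonacci's greedy algorithm, which repeatedly subtracts the largest available unit fraction 1/ceiling(b/a) from a/b (a sketch distinct from the solutions below, since the remaining numerator strictly decreases, the algorithm stops, and the denominators are automatically distinct):

greedy=function(a,b){
#Fibonacci's greedy decomposition of a/b into distinct unit fractions
 dens=NULL
 while (a>0){
  k=ceiling(b/a) #smallest k with 1/k <= a/b
  dens=c(dens,k)
  a=a*k-b;b=b*k  #a/b - 1/k = (ak-b)/(bk)
  }
 return(dens)}

greedy(3,4) #returns 2 4, i.e., 3/4 = 1/2 + 1/4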

Since the issue was covered in so many places, I just spent one hour or so constructing a basic solution à la Fibonacci and then tried to improve it against a length criterion. Here are my R codes (using the numbers library):

library(numbers) #for primeFactors() and div()

osiris=function(a,b){
#simplify the fraction a/b by removing common prime factors
 diva=primeFactors(a)
 divb=primeFactors(b)
 divc=c(unique(diva),unique(divb))
 while (sum(duplicated(divc))>0){
  n=divc[duplicated(divc)] #primes shared by a and b
  for (i in n){a=div(a,i);b=div(b,i)}
  diva=primeFactors(a)
  divb=primeFactors(b)
  divc=c(unique(diva),unique(divb))
  }
 return(list(a=a,b=b))}

presumably superfluous for simplifying fractions.
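Indeed, as an aside of mine, a single call to GCD from the same numbers package achieves the identical reduction:

simplify=function(a,b){
 g=GCD(a,b) #greatest common divisor
 return(list(a=a/g,b=b/g))}

The recursive decomposition itself is then handled by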

horus=function(a,b,teth=NULL){
#simplification
 anubis=osiris(a,b)
 a=anubis$a;b=anubis$b
#first include 1/b itself when still available
 isis=NULL
 if (!(b %in% teth)){
   a=a-1
   isis=c(isis,b)
   teth=c(teth,b)}
 if (a>0){
#simplification
   anubis=osiris(a,b)
   bet=b;a=anubis$a;b=anubis$b
   if (bet>b){ isis=c(isis,horus(a,b,teth))}else{
#greedy step: smallest unused k with 1/k <= a/b
     k=ceiling(b/a)
     while (k %in% teth) k=k+1
     a=k*a-b #a/b - 1/k = (ka-b)/(kb)
     b=k*b
     isis=c(isis,k,horus(a,b,teth=c(teth,k)))
     }}
 return(isis)}

which produces a Fibonacci solution (with the additional inclusion of the original denominator) and

nut=20 #number of random candidate denominators per recursion

seth=function(a,b,isis=NULL){
#simplification
 anubis=osiris(a,b)
 a=anubis$a;b=anubis$b
 if ((a==1)&(!(b %in% isis))){isis=c(isis,b)}else{
  ra=hapy=ceiling(b/a)
  if (max(a,b)<1e5) hapy=horus(a,b,teth=isis) #greedy reference decomposition
#candidate denominators: greedy ones plus random inflations of ceiling(b/a)
  k=unique(c(hapy,ceiling(ra/runif(nut,min=.1,max=1))))
  propa=propb=propc=propd=rep(NaN,length.out=length(k))
  bastet=1
  for (i in k[!(k %in% isis)]){
    propa[bastet]=i*a-b #numerator of a/b - 1/i
    propb[bastet]=i*b   #denominator of a/b - 1/i
    propc[bastet]=i
    propd[bastet]=length(horus(i*a-b,i*b,teth=c(isis,i))) #resulting length
    bastet=bastet+1
    }
#keep the candidate leading to the shortest decomposition
  k=propc[order(propd)[1]]
  isis=seth(k*a-b,k*b,isis=c(isis,k))
  }
 return(isis)}

which compares candidate solutions by their lengths. When calling those functions for the three fractions above, the solutions are

> seth(2,5)
[1] 15 3
> seth(5,12)
[1] 12  3
> seth(50,77)
[1]   2 154   7

with no pretension whatsoever to returning anything optimal (and with occasional crashes when the magnitude of the entries grows; try for instance 5/121). For this last counter-example, the alternative horus works quite superbly:

> horus(5,121)
[1] 121 31 3751 1876 7036876

Le Monde puzzle [#920]

Posted in Books, Kids, R, Statistics, University life on July 23, 2015 by xi'an

A puzzling Le Monde mathematical puzzle (or blame the heat wave):

A pocket calculator with ten keys (0,1,…,9) starts with a random digit n between 0 and 9. The number on the screen can then be modified into another number by two rules:
1. pressing k replaces the k-th digit v, whenever it exists, with the two digits (v+1)(v+2), where addition is modulo 10;
2. pressing 0k deletes the (k-1)-th and (k+1)-th digits if they both exist and are identical (otherwise nothing happens).
Which 9-digit numbers can always be produced, whatever the initial digit?

I did not find an easy entry to this puzzle, in particular because it did not state what to do once 9 digits had been reached: would the extra digits disappear? But then, those to the left or to the right? The description also fails to explain how to handle n=000 000 004 versus n=4.
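To fix ideas, here is one possible reading of the two rules as moves on a vector of digits (hypothetical helpers of mine, not used in the simulation below):

press=function(x,k){
#rule 1: the digit v in position k becomes the two digits (v+1)(v+2) mod 10
 v=x[k]
 c(x[seq_len(k-1)],(v+1)%%10,(v+2)%%10,x[seq_len(length(x)-k)+k])}

erase=function(x,k){
#rule 2: delete positions k-1 and k+1 when these digits coincide
 if ((k>1)&&(k<length(x))&&(x[k-1]==x[k+1])) x=x[-c(k-1,k+1)]
 return(x)}

press(4,1)        #returns 5 6
erase(c(5,6,5),2) #returns 6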

Instead, I tried to look at the numbers with fewer than 7 digits that could appear, using some extra rules of my own, like preventing numbers with more than 9 digits, rules which resulted in a sure stopping rule when applying both rules above at random:

leplein=rep(0,1e6) #occupancy table for the numbers below one million
for (v in 1:1e6){
 x=sample(1:9,1) #current number, stored as a vector of digits
 for (t in 1:1e5){
  k=length(x)
  if (k<3){
#rule 2 cannot apply: press a random key
   i=sample(rep(1:k,2),1) #rep() trick so that sample() also works when k=1
   x[i]=(x[i]+1)%%10
   y=c(x[1:i],(x[i]+1)%%10) #digit v becomes (v+1)(v+2)
   if (i<k){ x=c(y,x[(i+1):k])}else{ x=y}
  }else{
#positions where rule 2 applies, i.e., with identical digits on both sides
   prop1=prop2=NULL
   difs=(2:(k-1))[abs(x[-(1:2)]-x[-((k-1):k)])==0]
   if (length(difs)>0) prop1=sample(rep(difs,2),1)
#rule 1 move, capped at 9 digits
   if (k<9) prop2=sample(rep(1:k,2),1)

   if (length(c(prop1,prop2))>1){
#both rules available: pick one at random
    if (runif(1)<.5){
     x[prop2]=(x[prop2]+1)%%10
     y=c(x[1:prop2],(x[prop2]+1)%%10)
     if (prop2<k){ x=c(y,x[(prop2+1):k])}else{ x=y}
     }else{
      x=x[-c(prop1-1,prop1+1)]}
    while ((length(x)>1)&(x[1]==0)) x=x[-1]} #strip leading zeros

   if (length(c(prop1,prop2))==1){
#a single rule available
    if (is.null(prop2)){ x=x[-c(prop1-1,prop1+1)]
    }else{
     x[prop2]=(x[prop2]+1)%%10
     y=c(x[1:prop2],(x[prop2]+1)%%10)
     if (prop2<k){ x=c(y,x[(prop2+1):k])}else{ x=y}}
    while ((length(x)>1)&(x[1]==0)) x=x[-1]}

   if (length(c(prop1,prop2))==0) break #no rule applies: sure stop
   }

  k=length(x)
  if (k<7) leplein[sum(x*10^((k-1):0))]=
           leplein[sum(x*10^((k-1):0))]+1
 }}

code that fills an occupancy table for the numbers below one million over 10⁶ iterations. The solution as shown below (with the number of zero entries over each column) is rather surprising in that it exhibits an occupancy that is quite regular over a grid, even though it does not answer the original question…

[Figure: lemonde920, occupancy of the numbers below one million]

MCMskv, Lenzerheide, 4-7 Jan., 2016 [news #1]

Posted in Kids, Mountains, pictures, R, Statistics, Travel, University life on July 20, 2015 by xi'an

The BayesComp MCMski V [or MCMskv for short] now has its official website, once again maintained by Merrill Liechty from Drexel University, Philadelphia, and registration is even open! The call for contributed sessions is now over, while the call for posters remains open until the very end. The novelty from the previous post is that there will be a “Breaking news” session [in-between the late-news sessions at JSM and the crash poster talks at machine-learning conferences] to highlight major advances among poster submissions. And that there will be an opening talk by Steve [the Bayesian] Scott on the 4th, about the frightening prospect of MCMC death!, followed by a round-table and a welcome reception, sponsored by the Swiss Supercomputing Centre. Hence the change in dates. Which still allows for arrivals in Zürich on January 4th [be with you].

Leave the Pima Indians alone!

Posted in Books, R, Statistics, University life on July 15, 2015 by xi'an

“…our findings shall lead us to be critical of certain current practices. Specifically, most papers seem content with comparing some new algorithm with Gibbs sampling, on a few small datasets, such as the well-known Pima Indians diabetes dataset (8 covariates). But we shall see that, for such datasets, approaches that are even more basic than Gibbs sampling are actually hard to beat. In other words, datasets considered in the literature may be too toy-like to be used as a relevant benchmark. On the other hand, if one considers larger datasets (with say 100 covariates), then not so many approaches seem to remain competitive” (p.1)

Nicolas Chopin and James Ridgway (CREST, Paris) completed and arXived a paper they had “threatened” to publish for a while now, namely on why using the Pima Indians logistic or probit regression benchmark for checking a computational algorithm is not such a great idea! Given that I am definitely guilty of such a sin (in papers not reported in the survey), I was quite eager to read the reasons why! Beyond the debate on the worth of such a benchmark, the paper considers a wider perspective as to how Bayesian computation algorithms should be compared, including the murky waters of CPU time versus designer or programmer time. Which plays against most MCMC samplers.

As a first entry, Nicolas and James point out that the MAP can be derived by a standard Newton-Raphson algorithm when the prior is Gaussian, and even when the prior is Cauchy, as it seems most datasets allow for Newton-Raphson convergence. As well as the Hessian. We actually took advantage of this property in our comparison of evidence approximations published in the Festschrift for Jim Berger. Where we also noticed the awesome performances of an importance sampler based on the Gaussian or Laplace approximation. The authors call this proposal their gold standard. Because they also find it hard to beat. They also pursue this approximation to its logical (?) end by proposing an evidence approximation based on the above and Chib’s formula. Two close approximations are provided by INLA for posterior marginals and by a Laplace-EM for a Cauchy prior. Unsurprisingly, the expectation-propagation (EP) approach is also implemented. What EP lacks in theoretical backup, it seems to recover in sheer precision (in the examples analysed in the paper). And unsurprisingly as well, the paper includes a randomised quasi-Monte Carlo version of the Gaussian importance sampler. (The authors report that “the improvement brought by RQMC varies strongly across datasets” without elaborating on the reasons behind this variability. They also do not report the CPU time of the IS-QMC, maybe identical to the one for the regular importance sampling.) Maybe more surprising is the absence of a nested sampling version.
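Just to fix ideas, here is a minimal sketch of mine of such a Laplace-based importance sampler for the evidence of a logistic regression, not the authors' implementation: the N(0,25I) Gaussian prior, the helper names, and the use of the mvtnorm package are all assumptions of this illustration.

library(mvtnorm) #for rmvnorm() and dmvnorm()

logpost=function(beta,X,y,s2=25){
#log-posterior of a logistic regression under a N(0,s2 I) prior
 eta=as.vector(X%*%beta)
 sum(y*eta-log1p(exp(eta)))+sum(dnorm(beta,0,sqrt(s2),log=TRUE))}

newton=function(X,y,s2=25,niter=50){
#Newton-Raphson iterations towards the MAP, also returning the Hessian
 beta=rep(0,ncol(X))
 for (t in 1:niter){
  p=as.vector(1/(1+exp(-X%*%beta)))
  grad=t(X)%*%(y-p)-beta/s2                  #gradient of the log-posterior
  H=-t(X)%*%(X*(p*(1-p)))-diag(1/s2,ncol(X)) #Hessian of the log-posterior
  beta=as.vector(beta-solve(H,grad))
  }
 list(map=beta,H=H)}

evidence=function(X,y,s2=25,M=1e4){
#importance sampling approximation of the evidence, Laplace proposal
 fit=newton(X,y,s2)
 Sig=solve(-fit$H) #covariance of the Gaussian proposal at the mode
 sim=rmvnorm(M,fit$map,Sig)
 lw=apply(sim,1,logpost,X=X,y=y,s2=s2)-dmvnorm(sim,fit$map,Sig,log=TRUE)
 max(lw)+log(mean(exp(lw-max(lw)))) #log-evidence by log-sum-exp
 }

With a proposal this close to a near-Gaussian posterior, the importance weights remain well-behaved, which connects with the gold standard performances mentioned above.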

In the Markov chain Monte Carlo solutions, Nicolas and James compare Gibbs, Metropolis-Hastings, Hamiltonian Monte Carlo, and NUTS, plus a tempering SMC, all of which are outperformed by importance sampling for small enough datasets, but get back on competitive ground for large enough ones, since importance sampling then fails.

“…let’s all refrain from now on from using datasets and models that are too simple to serve as a reasonable benchmark.” (p.25)

This is a very nice survey on the theme of binary data (more than on the comparison of algorithms, in that the authors do not really take into account design and complexity, but resort to MSEs versus CPU times). I however do not agree with their overall message to leave the Pima Indians alone. Or at least not for the reason provided therein, namely that faster and more accurate approximation methods are available and cannot be beaten. Benchmarks always have the limitation of “what you get is what you see”, i.e., the output associated with a single dataset that only has that many idiosyncrasies. Plus, the closeness to a perfect normal posterior makes the logistic posterior too regular to pose a real challenge (even though MCMC algorithms are as usual slower than iid sampling). But having faster and more precise resolutions should on the contrary be cause for cheers, as this provides a reference value, a gold standard, to check against. In a sense, for every Monte Carlo method, there is a much better answer, namely the exact value of the integral or of the optimum! And one is hardly aiming at a more precise inference for the benchmark itself: those Pima Indians [whose actual name is Akimel O’odham] with diabetes involved in the original study are definitely beyond help from statisticians, and the model is unlikely to carry over to current populations. When the goal is to compare methods, as in our 2009 paper for Jim Berger’s 60th birthday, what matters is relative speed and relative ease of implementation (besides the obvious convergence to the proper target). In that sense, bigger and larger datasets are not always relevant. Unless one tackles really big or really large datasets, for which there is neither a benchmark method nor a reference value.

R brut

Posted in Kids, pictures, R, Statistics, University life on July 2, 2015 by xi'an

[image: R-brut]

Introduction to Monte Carlo methods with R and Bayesian Essentials with R

Posted in Books, R, Statistics, University life on June 26, 2015 by xi'an

[Figure: sales1, download figures]

Here are the download figures for my e-book with George, as sent to me last week by my publisher Springer-Verlag, with an interesting surge in the past year. Maybe simply due to new selling strategies of the publisher rather than to a wider interest in the book. (My royalties have certainly not increased!) Anyway, thanks to all readers. As an aside for WordPress wannabe bloggers, I realised it is now almost impossible to write tables with WordPress, another illustration of the move towards small-device-supported blogs. Along with a new annoying “simpler” (or more accurately dumber) interface and a default font far too small for my eyesight. So I advise alternatives to WordPress that are more sympathetic to maths contents (e.g., using MathJax) and comfortable editing.

[Figure: salesBess, download figures]

And the same for the e-book with Jean-Michel, which only appeared in late 2013 and contains more chapters than Introduction to Monte Carlo Methods with R. Incidentally, a reader recently pointed out to me the availability of a pirated version of The Bayesian Choice on a Saudi (religious) university website, and of a pirated version of Introducing Monte Carlo Methods with R on a São Paulo (Brazil) university website. This may alas be inevitable, given the diffusion by publishers of e-chapters that can be copied with no limitations…

arXiv frenzy

Posted in R, Statistics, University life on June 23, 2015 by xi'an

In the past few days, there have been so many arXiv postings of interest (presumably the NIPS submission effect!) that I cannot hope to cover them in the coming weeks! Hopefully, some will still come out on the ’Og in the near future:

  • arXiv:1506.06629: Scalable Approximations of Marginal Posteriors in Variable Selection by Willem van den Boom, Galen Reeves, David B. Dunson
  • arXiv:1506.06285: The MCMC split sampler: A block Gibbs sampling scheme for latent Gaussian models by Óli Páll Geirsson, Birgir Hrafnkelsson, Daniel Simpson, Helgi Sigurðarson [also deserves a special mention for gathering only ***son authors!]
  • arXiv:1506.06268: Bayesian Nonparametric Modeling of Higher Order Markov Chains by Abhra Sarkar, David B. Dunson
  • arXiv:1506.06117: Convergence of Sequential Quasi-Monte Carlo Smoothing Algorithms by Mathieu Gerber, Nicolas Chopin
  • arXiv:1506.06101: Robust Bayesian inference via coarsening by Jeffrey W. Miller, David B. Dunson
  • arXiv:1506.05934: Expectation Particle Belief Propagation by Thibaut Lienart, Yee Whye Teh, Arnaud Doucet
  • arXiv:1506.05860: Variational Gaussian Copula Inference by Shaobo Han, Xuejun Liao, David B. Dunson, Lawrence Carin
  • arXiv:1506.05855: The Frequentist Information Criterion (FIC): The unification of information-based and frequentist inference by Colin H. LaMont, Paul A. Wiggins
  • arXiv:1506.05757: Bayesian Inference for the Multivariate Extended-Skew Normal Distribution by Mathieu Gerber, Florian Pelgrin
  • arXiv:1506.05741: Accelerated dimension-independent adaptive Metropolis by Yuxin Chen, David Keyes, Kody J.H. Law, Hatem Ltaief
  • arXiv:1506.05269: Bayesian Survival Model based on Moment Characterization by Julyan Arbel, Antonio Lijoi, Bernardo Nipoti
  • arXiv:1506.04778: Fast sampling with Gaussian scale-mixture priors in high-dimensional regression by Anirban Bhattacharya, Antik Chakraborty, Bani K. Mallick
  • arXiv:1506.04416: Bayesian Dark Knowledge by Anoop Korattikara, Vivek Rathod, Kevin Murphy, Max Welling [a special mention for this title!]
  • arXiv:1506.03693: Optimization Monte Carlo: Efficient and Embarrassingly Parallel Likelihood-Free Inference by Edward Meeds, Max Welling
  • arXiv:1506.03074: Variational consensus Monte Carlo by Maxim Rabinovich, Elaine Angelino, Michael I. Jordan
  • arXiv:1506.02564: Gradient-free Hamiltonian Monte Carlo with Efficient Kernel Exponential Families by Heiko Strathmann, Dino Sejdinovic, Samuel Livingstone, Zoltan Szabo, Arthur Gretton [comments coming soon!]