Archive for the R Category

Le Monde puzzle [#929]

Posted in Books, Kids, R on September 29, 2015 by xi'an

A combinatorics Le Monde mathematical puzzle:

In the set {1,…,12}, numbers adjacent to i are called friends of i. How many distinct subsets of size 5 can be chosen under the constraint that each number in the subset has at least one friend within the subset?

In a brute force approach, I tried a quintuple loop to check all possible cases:

nsol <- 0
for (a in 1:(12-4))
for (b in (a+1):(12-3))
for (c in (b+1):(12-2))
for (d in (c+1):(12-1))
for (e in (d+1):12){
  s <- c(a, b, c, d, e)  # keep the subset when each member has an adjacent member (friend)
  if (all(sapply(s, function(i) any(abs(s - i) == 1)))) nsol <- nsol + 1}
nsol  # 64

which, with the friendship constraint checked in the innermost loop, returns 64 possible cases. Note that the second and last loops are useless since b=a+1 and e=d+1, necessarily: the smallest element a can only have a+1 as a friend, and the largest element e only e-1=d. And c is either (b+1) or (d-1), which means two choices for c, except when e=a+4, in which case the subset is a single run of five consecutive numbers and both choices coincide. This all adds up to

8 + 2\sum_{a=1}^{7}\sum_{e=a+5}^{12} 1 = 8 + 2\cdot7\cdot8 - 2\cdot7\cdot8/2 = 8\cdot8 = 64

A related R question: is there a generic way of programming a sequence of embedded loops like the one above without listing all of the loops one by one?
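For this specific enumeration, one workaround (a sketch of mine, not part of the original solution) is to let base R's combn() produce all size-5 subsets in a single call and then filter them, which dispenses with writing the nested loops altogether:

subsets <- combn(12, 5)  # all C(12,5)=792 subsets, one per column
friendly <- apply(subsets, 2,
  function(s) all(sapply(s, function(i) any(abs(s - i) == 1))))
sum(friendly)  # 64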

Le Monde puzzle [#928]

Posted in Books, Kids, R on September 10, 2015 by xi'an

A combinatorics Le Monde mathematical puzzle:

How many distinct integers between 0 and 16 can one pick so that all positive differences are distinct?

If k is the number of distinct integers, the number of positive differences is

1+2+…+(k-1) = k(k-1)/2,

which cannot exceed 16, because the set of differences is a subset of {1,2,…,16}; since 7·6/2=21>16, k cannot exceed 6 if all differences are distinct. From there, picking k integers at random makes it easy to check for the condition:

while (max(duplicated(y[!upper.tri(y)]))==1){

which quickly returns for k=5

> x
[1] 0 1 7 12 15

as a solution. And it is still running for k=6, meaning there is apparently no solution for k=6. (An exhaustive search shows there is indeed no solution for k=6 and N=16, while there are several for k=6 and N=17.) Now, reading the puzzle solution of Le Monde today, September 9, I discovered that the authors proposed a sequence of length 7, (0,1,2,4,5,7,11,16), which does not work since 1-0=2-1…, and proved that 8 is an impossible value by quite a convoluted argument. Did I misread again?!

In the earlier version of the R code posted today, I used

y[lower.tri(y)]

which does not include the diagonal, instead of the proper

y[!upper.tri(y)]

a mistake that led to a wrong solution for k=6, as pointed out by Stephan.
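For completeness, here is a minimal reconstruction of the full sampler (my own sketch: only the while condition is verbatim, and defining y as the table of differences x[i+1]-x[j] is an assumption, chosen so that its lower triangle plus diagonal collects all k(k-1)/2 positive differences):

k <- 5
x <- sort(sample(0:16, k))     # k distinct integers between 0 and 16
y <- outer(x[-1], x[-k], "-")  # y[i,j] = x[i+1] - x[j]
while (max(duplicated(y[!upper.tri(y)]))==1){
  # resample until all positive differences are distinct
  x <- sort(sample(0:16, k))
  y <- outer(x[-1], x[-k], "-")}
x  # e.g. 0 1 7 12 15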

debunking a (minor and personal) myth

Posted in Books, Kids, R, Statistics, University life on September 9, 2015 by xi'an

For quite a while, I entertained the idea that Beta and Dirichlet proposals were more adequate than (log-)normal random-walk proposals for parameters on (0,1) and simplicia (simplices, simplexes), respectively, when running an MCMC algorithm. For instance, with p in (0,1) the value of the Markov chain at time t-1, the proposal at time t could be a Be(εp, ε(1-p)) generator, since its mean is equal to p and its variance is proportional to 1/(1+ε). (Although I cannot find track of this notion in my books.) The parameter ε can be calibrated towards a given acceptance rate, like the golden number 0.234 of Gelman, Gilks and Roberts (1996). However, when using this proposal on a mixture model, Kaniav Kamari and I realised today that there is a catch, namely that pushing ε down to achieve an acceptance rate near 0.234 may end up in disaster, since the parameters of the Beta or of the Dirichlet may become lower than 1, which implies an infinite explosion of the proposal density on some boundaries of the parameter space. An explosion that gets more and more serious as ε decreases to zero, hence is more and more likely to decrease the acceptance rate, thus to reduce ε, which in turn concentrates even more of the proposal mass on the boundary, leading to a vicious circle and no convergence to the target acceptance rate…
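To make the catch concrete, here is a one-step sketch of the corresponding Metropolis-Hastings move (entirely my own illustration: the target targ, the value of ε, and the current state p are made up); note that with ε=0.9 both shape parameters εp and ε(1-p) already fall below 1, so the proposal density explodes at the boundaries:

targ <- function(p) dbeta(p, 3, 5)        # hypothetical target density on (0,1)
eps <- 0.9                                # small enough that eps*p < 1
p <- 0.3                                  # current value of the chain
prop <- rbeta(1, eps * p, eps * (1 - p))  # Be(eps*p, eps*(1-p)) proposal
# Metropolis-Hastings ratio, with the asymmetric proposal densities included
ratio <- targ(prop) / targ(p) *
  dbeta(p, eps * prop, eps * (1 - prop)) / dbeta(prop, eps * p, eps * (1 - p))
if (runif(1) < ratio) p <- prop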

ABC model choice via random forests [and no fire]

Posted in Books, pictures, R, Statistics, University life on September 4, 2015 by xi'an

While my arXiv newspage today had a puzzling entry about modelling UFO sightings in France, it also broadcast our revision of Reliable ABC model choice via random forests, a version that we resubmitted today to Bioinformatics after a quite thorough upgrade, the most dramatic one being the realisation that we could also approximate the posterior probability of the selected model via another random forest. (With no connection with the recent post on forest fires!) As discussed a little while ago on the 'Og, and also in conjunction with our creating the abcrf R package for running ABC model choice out of a reference table. While it has been an excruciatingly slow process (the initial version of the arXived document dates from June 2014, the PNAS submission was rejected for not being Bayesian enough, and the latest revision took the whole summer), the slow maturation of our thoughts on the model choice issues led us to modify the role of random forests in the ABC approach to model choice, in that we reverted our earlier assessment that they could only be trusted for selecting the most likely model, by realising this summer that the corresponding posterior probability could be expressed as a posterior loss and estimated by a secondary forest, as first considered in Stoehr et al. (2014). (In retrospect, this brings an answer to one of the earlier referee's comments.) The next goal is to incorporate those changes in DIYABC (and wait for the next version of the software to appear). Another best-selling innovation due to Arnaud: we added a practical implementation section in the format of an FAQ for issues related with the calibration of the algorithms.

reaching transcendence for Gaussian mixtures

Posted in Books, R, Statistics on September 3, 2015 by xi'an

[figure: nested sampling sample on top of a mixture log-likelihood]

“…likelihood inference is in a fundamental way more complicated than the classical method of moments.”

Carlos Améndola, Mathias Drton, and Bernd Sturmfels arXived a paper this Friday on “maximum likelihood estimates for Gaussian mixtures are transcendental”. By which they mean that trying to solve the five likelihood equations for a two-component Gaussian mixture does not lead to an algebraic function of the data. (When excluding the trivial global maxima spiking at any observation.) This is not highly surprising when considering two observations, 0 and x, from a mixture of N(0,1/2) and N(μ,1/2), because the likelihood equation in μ, spelled out here for the equal-weight case,

\mu e^{-\mu^2}/(1+e^{-\mu^2}) = (x-\mu)e^{-(x-\mu)^2}/(e^{-x^2}+e^{-(x-\mu)^2})

involves both exponential and algebraic terms. While this is not directly impacting (statistical) inference, this result has the computational consequence that the number of critical points “and also the maximum number of local maxima, depends on the sample size and increases beyond any bound”, which means that EM faces increasing difficulties in finding a global finite maximum as the sample size increases…
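For a quick numerical look at this two-observation example (my own sketch: the value x=2 and the equal weights are illustrative choices), the score function below is the μ-derivative of that log-likelihood, and its sign changes locate the critical points:

x <- 2                 # arbitrary second observation
score <- function(mu)  # derivative in mu of the equal-weight log-likelihood
  -2 * mu * exp(-mu^2) / (1 + exp(-mu^2)) +
   2 * (x - mu) * exp(-(x - mu)^2) / (exp(-x^2) + exp(-(x - mu)^2))
mus <- seq(-2, 4, by = .01)
plot(mus, score(mus), type = "l")  # zero crossings = critical points
abline(h = 0, lty = 2)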

likelihood-free inference in high-dimensional models

Posted in Books, R, Statistics, University life on September 1, 2015 by xi'an

“…for a general linear model (GLM), a single linear function is a sufficient statistic for each associated parameter…”

The recently arXived paper “Likelihood-free inference in high-dimensional models“, by Kousathanas et al. (July 2015), proposes an ABC resolution of the dimensionality curse [which strikes when the dimension of the parameter and of the corresponding summary statistics grows] by turning Gibbs-like and by using a component-by-component ABC-MCMC update that allows for low-dimensional statistics. In the (rare) event there exists a conditional sufficient statistic for each component of the parameter vector, the approach is just as justified as a generic ABC-Gibbs method based on the whole data. Otherwise, that is, when using a non-sufficient estimator of the corresponding component (as, e.g., in a generalised [not general!] linear model), the approach is less coherent as there is no joint target associated with the Gibbs moves, and one may therefore wonder at the convergence properties of the resulting algorithm. The only safe case [in dimension 2] is when one of the restricted conditionals does not depend on the other parameter. Note also that each Gibbs step a priori requires the simulation of a new pseudo-dataset, which may be a major imposition on computing time. And setting the tolerance for each parameter is a delicate calibration issue, because in principle the tolerance should depend on the other component values.
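A toy rendering of the component-wise update (my own sketch, not the authors' code: the normal model, the per-component sample means, and all tuning values are made up, with a flat prior and symmetric proposal so that the ABC-MCMC acceptance reduces to the tolerance check):

set.seed(1)
obs <- cbind(rnorm(50, 1), rnorm(50, -1))      # observed data, one column per parameter
sobs <- colMeans(obs)                          # one low-dimensional statistic per component
theta <- c(0, 0); eps <- c(.1, .1); tau <- .3  # start, tolerances, proposal scale
for (t in 1:1e3)
  for (j in 1:2){                              # Gibbs-like scan over the components
    prop <- theta
    prop[j] <- theta[j] + rnorm(1, sd = tau)   # move component j only
    z <- cbind(rnorm(50, prop[1]), rnorm(50, prop[2]))  # fresh pseudo-dataset at each step
    if (abs(mean(z[, j]) - sobs[j]) < eps[j])  # accept when the j-th statistic is close
      theta <- prop}
theta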

abcrf 0.9-3

Posted in R, Statistics, University life on August 27, 2015 by xi'an

In conjunction with our reliable ABC model choice via random forests paper, about to be resubmitted to Bioinformatics, we have contributed an R package called abcrf that produces the most likely model and its posterior probability out of an ABC reference table, in conjunction with the realisation that we could devise an approximation to the (ABC) posterior probability using a secondary random forest. “We” meaning Jean-Michel Marin and Pierre Pudlo, as I only acted as a beta tester!

The package abcrf consists of three functions:

  • abcrf, which constructs a random forest from a reference table and returns an object of class `abc-rf';
  • plot.abcrf, which gives both the variable-importance plot of a model-choice abc-rf object and the projection of the reference table on the LDA axes;
  • predict.abcrf, which predicts the model for new data and evaluates the posterior probability of the MAP.

An illustration from the manual:

library(abcrf)  # snp and snp.obs are the example datasets shipped with the package
mc.rf <- abcrf(snp[1:1e3, 1], snp[1:1e3, -1])  # model index ~ summary statistics
predict(mc.rf, snp[1:1e3, -1], snp.obs)        # MAP model and its posterior probability
