## Leave the Pima Indians alone!

Posted in Books, R, Statistics, University life with tags , , , , , , , , , , , , , , , , , on July 15, 2015 by xi'an

“…our findings shall lead to us be critical of certain current practices. Specifically, most papers seem content with comparing some new algorithm with Gibbs sampling, on a few small datasets, such as the well-known Pima Indians diabetes dataset (8 covariates). But we shall see that, for such datasets, approaches that are even more basic than Gibbs sampling are actually hard to beat. In other words, datasets considered in the literature may be too toy-like to be used as a relevant benchmark. On the other hand, if ones considers larger datasets (with say 100 covariates), then not so many approaches seem to remain competitive” (p.1)

Nicolas Chopin and James Ridgway (CREST, Paris) completed and arXived a paper they had “threatened” to publish for a while now, namely why using the Pima Indian R logistic or probit regression benchmark for checking a computational algorithm is not such a great idea! Given that I am definitely guilty of such a sin (in papers not reported in the survey), I was quite eager to read the reasons why! Beyond the debate on the worth of such a benchmark, the paper considers a wider perspective as to how Bayesian computation algorithms should be compared, including the murky waters of CPU time versus designer or programmer time. Which plays against most MCMC sampler.

As a first entry, Nicolas and James point out that the MAP can be derived by standard a Newton-Raphson algorithm when the prior is Gaussian, and even when the prior is Cauchy as it seems most datasets allow for Newton-Raphson convergence. As well as the Hessian. We actually took advantage of this property in our comparison of evidence approximations published in the Festschrift for Jim Berger. Where we also noticed the awesome performances of an importance sampler based on the Gaussian or Laplace approximation. The authors call this proposal their gold standard. Because they also find it hard to beat. They also pursue this approximation to its logical (?) end by proposing an evidence approximation based on the above and Chib’s formula. Two close approximations are provided by INLA for posterior marginals and by a Laplace-EM for a Cauchy prior. Unsurprisingly, the expectation-propagation (EP) approach is also implemented. What EP lacks in theoretical backup, it seems to recover in sheer precision (in the examples analysed in the paper). And unsurprisingly as well the paper includes a randomised quasi-Monte Carlo version of the Gaussian importance sampler. (The authors report that “the improvement brought by RQMC varies strongly across datasets” without elaborating for the reasons behind this variability. They also do not report the CPU time of the IS-QMC, maybe identical to the one for the regular importance sampling.) Maybe more surprising is the absence of a nested sampling version.

In the Markov chain Monte Carlo solutions, Nicolas and James compare Gibbs, Metropolis-Hastings, Hamiltonian Monte Carlo, and NUTS. Plus a tempering SMC, All of which are outperformed by importance sampling for small enough datasets. But get back to competing grounds for large enough ones, since importance sampling then fails.

“…let’s all refrain from now on from using datasets and models that are too simple to serve as a reasonable benchmark.” (p.25)

This is a very nice survey on the theme of binary data (more than on the comparison of algorithms in that the authors do not really take into account design and complexity, but resort to MSEs versus CPus). I however do not agree with their overall message to leave the Pima Indians alone. Or at least not for the reason provided therein, namely that faster and more accurate approximations methods are available and cannot be beaten. Benchmarks always have the limitation of “what you get is what you see”, i.e., the output associated with a single dataset that only has that many idiosyncrasies. Plus, the closeness to a perfect normal posterior makes the logistic posterior too regular to pause a real challenge (even though MCMC algorithms are as usual slower than iid sampling). But having faster and more precise resolutions should on the opposite be  cause for cheers, as this provides a reference value, a golden standard, to check against. In a sense, for every Monte Carlo method, there is a much better answer, namely the exact value of the integral or of the optimum! And one is hardly aiming at a more precise inference for the benchmark itself: those Pima Indians [whose actual name is Akimel O’odham] with diabetes involved in the original study are definitely beyond help from statisticians and the model is unlikely to carry out to current populations. When the goal is to compare methods, as in our 2009 paper for Jim Berger’s 60th birthday, what matters is relative speed and relative ease of implementation (besides the obvious convergence to the proper target). In that sense bigger and larger is not always relevant. Unless one tackles really big or really large datasets, for which there is neither benchmark method nor reference value.

## R brut

Posted in Kids, pictures, R, Statistics, University life with tags , , , on July 2, 2015 by xi'an

## dynamic mixtures [at NBBC15]

Posted in R, Statistics with tags , , , , , , , , , , , , on June 18, 2015 by xi'an

A funny coincidence: as I was sitting next to Arnoldo Frigessi at the NBBC15 conference, I came upon a new question on Cross Validated about a dynamic mixture model he had developed in 2002 with Olga Haug and Håvård Rue [whom I also saw last week in Valencià]. The dynamic mixture model they proposed replaces the standard weights in the mixture with cumulative distribution functions, hence the term dynamic. Here is the version used in their paper (x>0)

$(1-w_{\mu,\tau}(x))f_{\beta,\lambda}(x)+w_{\mu,\tau}(x)g_{\epsilon,\sigma}(x)$

where f is a Weibull density, g a generalised Pareto density, and w is the cdf of a Cauchy distribution [all distributions being endowed with standard parameters]. While the above object is not a mixture of a generalised Pareto and of a Weibull distributions (instead, it is a mixture of two non-standard distributions with unknown weights), it is close to the Weibull when x is near zero and ends up with the Pareto tail (when x is large). The question was about simulating from this distribution and, while an answer was in the paper, I replied on Cross Validated with an alternative accept-reject proposal and with a somewhat (if mildly) non-standard MCMC implementation enjoying a much higher acceptance rate and the same fit.

## Le Monde puzzle [#913]

Posted in Books, Kids, Statistics, University life with tags , , , , , , , on June 12, 2015 by xi'an

An arithmetics Le Monde mathematical puzzle:

Find all bi-twin integers, namely positive integers such that adding 2 to any of their dividers returns a prime number.

An easy puzzle, once the R libraries on prime number decomposition can be found!, since it is straightforward to check for solutions. Unfortunately, I could not install the recent numbers package. So I used instead the schoolmath R package. Despite its possible bugs. But it seems to do the job for this problem:

lem=NULL
for (t in 1:1e4)
if (prod(is.prim(prime.factor(t)+2)==1))
lem=c(lem,t)digin=function(n){


which returned all solutions, albeit in a lengthy fashion:

> lem
[1] 1 3 5 9 11 15 17 25 27 29 33 41 45 51 55
[16] 59 71 75 81 85 87 99 101 107 121 123 125 135 137 145
[31] 149 153 165 177 179 187 191 197 205 213 225 227 239 243 255
[46] 261 269 275 281 289 295 297 303 311 319 321 347 355 363 369
[61] 375 405 411 419 425 431 435 447 451 459 461 493 495 505 521
[76] 531 535 537 561 569 573 591 599 605 615 617 625 639 641 649
[91] 659 675 681 685 697 717 725 729 745 765 781 783 807 809 821
[106] 825 827 841 843 857 867 881 885 891 895 909 933 935 955 957
[121] 963 985 1003 1019 1025 1031 ...


## quantile functions: mileage may vary

Posted in Books, R, Statistics with tags , , , , , , on May 12, 2015 by xi'an

When experimenting with various quantiles functions in R, I was shocked [ok this is a bit excessive, let us say surprised] by how widely the execution times would vary. To the point of blaming a completely different feature of R. Borrowing from Charlie Geyer’s webpage on the topic of probability distributions in R, here is a table for some standard distributions: I ran

u=runif(1e7)
system.time(x<-qcauchy(u))


choosing an arbitrary parameter whenever needed.

Distribution Function Time
Cauchy qcauchy 2.2
Chi-Square qchisq 43.8
Exponential qexp 0.95
F qf 34.2
Gamma qgamma 37.2
Logistic qlogis 1.7
Log Normal qlnorm 2.2
Normal qnorm 1.4
Student t qt 31.7
Uniform qunif 0.86
Weibull qweibull 2.9

Of course, it does not mean much in that all the slow distributions (except for Weibull) are parameterised. Nonetheless, that a chi-square inversion take 50 times longer than a uniform inversion remains puzzling as to why it is not coded more efficiently. In particular, I was wondering why the chi-square inversion was slower than the Gamma inversion. Rerunning both inversions showed that they are equivalent:

> u=runif(1e7)
> system.time(x<-qgamma(u,sha=1.5))
utilisateur système écoulé
21.534 0.016 21.532
> system.time(x<-qchisq(u,df=3))
utilisateur système écoulé
21.372 0.008 21.361


Which also shows how variable system.time can be.

## arbitrary distributions with set correlation

Posted in Books, Kids, pictures, R, Statistics, University life with tags , , , , , , , , , , on May 11, 2015 by xi'an

A question recently posted on X Validated by Antoni Parrelada: given two arbitrary cdfs F and G, how can we simulate a pair (X,Y) with marginals  F and G, and with set correlation ρ? The answer posted by Antoni Parrelada was to reproduce the Gaussian copula solution: produce (X’,Y’) as a Gaussian bivariate vector with correlation ρ and then turn it into (X,Y)=(F⁻¹(Φ(X’)),G⁻¹(Φ(Y’))). Unfortunately, this does not work, because the correlation does not keep under the double transform. The graph above is part of my answer for a χ² and a log-Normal cdf for F amd G: while corr(X’,Y’)=ρ, corr(X,Y) drifts quite a  lot from the diagonal! Actually, by playing long enough with my function

tacor=function(rho=0,nsim=1e4,fx=qnorm,fy=qnorm)
{
x1=rnorm(nsim);x2=rnorm(nsim)
coeur=rho
rho2=sqrt(1-rho^2)
for (t in 1:length(rho)){
y=pnorm(cbind(x1,rho[t]*x1+rho2[t]*x2))
coeur[t]=cor(fx(y[,1]),fy(y[,2]))}
return(coeur)
}


Playing further, I managed to get an almost flat correlation graph for the admittedly convoluted call

tacor(seq(-1,1,.01),
fx=function(x) qchisq(x^59,df=.01),
fy=function(x) qlogis(x^59))


Now, the most interesting question is how to produce correlated simulations. A pedestrian way is to start with a copula, e.g. the above Gaussian copula, and to twist the correlation coefficient ρ of the copula until the desired correlation is attained for the transformed pair. That is, to draw the above curve and invert it. (Note that, as clearly exhibited by the graph just above, all desired correlations cannot be achieved for arbitrary cdfs F and G.) This is however very pedestrian and I wonder whether or not there is a generic and somewhat automated solution…

## corrected MCMC samplers for multivariate probit models

Posted in Books, pictures, R, Statistics, University life with tags , , , , , , , , on May 6, 2015 by xi'an

“Moreover, IvD point out an error in Nobile’s derivation which can alter its stationary distribution. Ironically, as we shall see, the algorithms of IvD also contain an error.”

Xiyun Jiao and David A. van Dyk arXived a paper correcting an MCMC sampler and R package MNP for the multivariate probit model, proposed by Imai and van Dyk in 2005. [Hence the abbreviation IvD in the above quote.] Earlier versions of the Gibbs sampler for the multivariate probit model by Rob McCulloch and Peter Rossi in 1994, with a Metropolis update added by Agostino Nobile, and finally an improved version developed by Imai and van Dyk in 2005. As noted in the above quote, Jiao and van Dyk have discovered two mistakes in this latest version, jeopardizing the validity of the output.

The multivariate probit model considered here is a multinomial model where the occurrence of the k-th category is represented as the k-th component of a (multivariate) normal (correlated) vector being the largest of all components. The latent normal model being non-identifiable since invariant by either translation or scale, identifying constraints are used in the literature. This means using a covariance matrix of the form Σ/trace(Σ), where Σ is an inverse Wishart random matrix. In their 2005 implementation, relying on marginal data augmentation—which essentially means simulating the non-identifiable part repeatedly at various steps of the data augmentation algorithm—, Imai and van Dyk missed a translation term and a constraint on the simulated matrices that lead to simulations outside the rightful support, as illustrated from the above graph [snapshot from the arXived paper].

Since the IvD method is used in many subsequent papers, it is quite important that these mistakes are signalled and corrected. [Another snapshot above shows how much both algorithm differ!] Without much thinking about this, I [thus idly] wonder why an identifying prior is not taking the place of a hard identifying constraint, as it should solve the issue more nicely. In that it would create less constraints and more entropy (!) in exploring the augmented space, while theoretically providing a convergent approximation of the identifiable parts. I may (must!) however miss an obvious constraint preventing this implementation.