## dynamic mixtures [at NBBC15]

Posted in R, Statistics with tags , , , , , , , , , , , , on June 18, 2015 by xi'an

A funny coincidence: as I was sitting next to Arnoldo Frigessi at the NBBC15 conference, I came upon a new question on Cross Validated about a dynamic mixture model he had developed in 2002 with Olga Haug and Håvård Rue [whom I also saw last week in Valencià]. The dynamic mixture model they proposed replaces the standard weights in the mixture with cumulative distribution functions, hence the term dynamic. Here is the version used in their paper (x>0)

$(1-w_{\mu,\tau}(x))f_{\beta,\lambda}(x)+w_{\mu,\tau}(x)g_{\epsilon,\sigma}(x)$

where f is a Weibull density, g a generalised Pareto density, and w is the cdf of a Cauchy distribution [all distributions being endowed with standard parameters]. While the above object is not a mixture of a generalised Pareto and of a Weibull distributions (instead, it is a mixture of two non-standard distributions with unknown weights), it is close to the Weibull when x is near zero and ends up with the Pareto tail (when x is large). The question was about simulating from this distribution and, while an answer was in the paper, I replied on Cross Validated with an alternative accept-reject proposal and with a somewhat (if mildly) non-standard MCMC implementation enjoying a much higher acceptance rate and the same fit.

## Le Monde puzzle [#913]

Posted in Books, Kids, Statistics, University life with tags , , , , , , , on June 12, 2015 by xi'an

An arithmetics Le Monde mathematical puzzle:

Find all bi-twin integers, namely positive integers such that adding 2 to any of their dividers returns a prime number.

An easy puzzle, once the R libraries on prime number decomposition can be found!, since it is straightforward to check for solutions. Unfortunately, I could not install the recent numbers package. So I used instead the schoolmath R package. Despite its possible bugs. But it seems to do the job for this problem:

lem=NULL
for (t in 1:1e4)
if (prod(is.prim(prime.factor(t)+2)==1))
lem=c(lem,t)digin=function(n){


which returned all solutions, albeit in a lengthy fashion:

> lem
[1] 1 3 5 9 11 15 17 25 27 29 33 41 45 51 55
[16] 59 71 75 81 85 87 99 101 107 121 123 125 135 137 145
[31] 149 153 165 177 179 187 191 197 205 213 225 227 239 243 255
[46] 261 269 275 281 289 295 297 303 311 319 321 347 355 363 369
[61] 375 405 411 419 425 431 435 447 451 459 461 493 495 505 521
[76] 531 535 537 561 569 573 591 599 605 615 617 625 639 641 649
[91] 659 675 681 685 697 717 725 729 745 765 781 783 807 809 821
[106] 825 827 841 843 857 867 881 885 891 895 909 933 935 955 957
[121] 963 985 1003 1019 1025 1031 ...


## quantile functions: mileage may vary

Posted in Books, R, Statistics with tags , , , , , , on May 12, 2015 by xi'an

When experimenting with various quantiles functions in R, I was shocked [ok this is a bit excessive, let us say surprised] by how widely the execution times would vary. To the point of blaming a completely different feature of R. Borrowing from Charlie Geyer’s webpage on the topic of probability distributions in R, here is a table for some standard distributions: I ran

u=runif(1e7)
system.time(x<-qcauchy(u))


choosing an arbitrary parameter whenever needed.

Distribution Function Time
Cauchy qcauchy 2.2
Chi-Square qchisq 43.8
Exponential qexp 0.95
F qf 34.2
Gamma qgamma 37.2
Logistic qlogis 1.7
Log Normal qlnorm 2.2
Normal qnorm 1.4
Student t qt 31.7
Uniform qunif 0.86
Weibull qweibull 2.9

Of course, it does not mean much in that all the slow distributions (except for Weibull) are parameterised. Nonetheless, that a chi-square inversion take 50 times longer than a uniform inversion remains puzzling as to why it is not coded more efficiently. In particular, I was wondering why the chi-square inversion was slower than the Gamma inversion. Rerunning both inversions showed that they are equivalent:

> u=runif(1e7)
> system.time(x<-qgamma(u,sha=1.5))
utilisateur système écoulé
21.534 0.016 21.532
> system.time(x<-qchisq(u,df=3))
utilisateur système écoulé
21.372 0.008 21.361


Which also shows how variable system.time can be.

## arbitrary distributions with set correlation

Posted in Books, Kids, pictures, R, Statistics, University life with tags , , , , , , , , , , on May 11, 2015 by xi'an

A question recently posted on X Validated by Antoni Parrelada: given two arbitrary cdfs F and G, how can we simulate a pair (X,Y) with marginals  F and G, and with set correlation ρ? The answer posted by Antoni Parrelada was to reproduce the Gaussian copula solution: produce (X’,Y’) as a Gaussian bivariate vector with correlation ρ and then turn it into (X,Y)=(F⁻¹(Φ(X’)),G⁻¹(Φ(Y’))). Unfortunately, this does not work, because the correlation does not keep under the double transform. The graph above is part of my answer for a χ² and a log-Normal cdf for F amd G: while corr(X’,Y’)=ρ, corr(X,Y) drifts quite a  lot from the diagonal! Actually, by playing long enough with my function

tacor=function(rho=0,nsim=1e4,fx=qnorm,fy=qnorm)
{
x1=rnorm(nsim);x2=rnorm(nsim)
coeur=rho
rho2=sqrt(1-rho^2)
for (t in 1:length(rho)){
y=pnorm(cbind(x1,rho[t]*x1+rho2[t]*x2))
coeur[t]=cor(fx(y[,1]),fy(y[,2]))}
return(coeur)
}


Playing further, I managed to get an almost flat correlation graph for the admittedly convoluted call

tacor(seq(-1,1,.01),
fx=function(x) qchisq(x^59,df=.01),
fy=function(x) qlogis(x^59))


Now, the most interesting question is how to produce correlated simulations. A pedestrian way is to start with a copula, e.g. the above Gaussian copula, and to twist the correlation coefficient ρ of the copula until the desired correlation is attained for the transformed pair. That is, to draw the above curve and invert it. (Note that, as clearly exhibited by the graph just above, all desired correlations cannot be achieved for arbitrary cdfs F and G.) This is however very pedestrian and I wonder whether or not there is a generic and somewhat automated solution…

## corrected MCMC samplers for multivariate probit models

Posted in Books, pictures, R, Statistics, University life with tags , , , , , , , , on May 6, 2015 by xi'an

“Moreover, IvD point out an error in Nobile’s derivation which can alter its stationary distribution. Ironically, as we shall see, the algorithms of IvD also contain an error.”

Xiyun Jiao and David A. van Dyk arXived a paper correcting an MCMC sampler and R package MNP for the multivariate probit model, proposed by Imai and van Dyk in 2005. [Hence the abbreviation IvD in the above quote.] Earlier versions of the Gibbs sampler for the multivariate probit model by Rob McCulloch and Peter Rossi in 1994, with a Metropolis update added by Agostino Nobile, and finally an improved version developed by Imai and van Dyk in 2005. As noted in the above quote, Jiao and van Dyk have discovered two mistakes in this latest version, jeopardizing the validity of the output.

The multivariate probit model considered here is a multinomial model where the occurrence of the k-th category is represented as the k-th component of a (multivariate) normal (correlated) vector being the largest of all components. The latent normal model being non-identifiable since invariant by either translation or scale, identifying constraints are used in the literature. This means using a covariance matrix of the form Σ/trace(Σ), where Σ is an inverse Wishart random matrix. In their 2005 implementation, relying on marginal data augmentation—which essentially means simulating the non-identifiable part repeatedly at various steps of the data augmentation algorithm—, Imai and van Dyk missed a translation term and a constraint on the simulated matrices that lead to simulations outside the rightful support, as illustrated from the above graph [snapshot from the arXived paper].

Since the IvD method is used in many subsequent papers, it is quite important that these mistakes are signalled and corrected. [Another snapshot above shows how much both algorithm differ!] Without much thinking about this, I [thus idly] wonder why an identifying prior is not taking the place of a hard identifying constraint, as it should solve the issue more nicely. In that it would create less constraints and more entropy (!) in exploring the augmented space, while theoretically providing a convergent approximation of the identifiable parts. I may (must!) however miss an obvious constraint preventing this implementation.

## take those hats off [from R]!

Posted in Books, Kids, R, Statistics, University life with tags , , , , , , , on May 5, 2015 by xi'an

This is presumably obvious to most if not all R programmers, but I became aware today of a hugely (?) delaying tactic in my R codes. I was working with Jean-Michel and Natesh [who are visiting at the moment] and when coding an MCMC run I was telling them that I usually preferred to code Nsim=10000 as Nsim=10^3 for readability reasons. Suddenly, I became worried that this representation involved a computation, as opposed to Nsim=1e3 and ran a little experiment:

> system.time(for (t in 1:10^8) x=10^3)
utilisateur     système      écoulé
30.704       0.032      30.717
> system.time(for (t in 1:1e8) x=10^3)
utilisateur     système      écoulé
30.338       0.040      30.359
> system.time(for (t in 1:10^8) x=1000)
utilisateur     système      écoulé
6.548       0.084       6.631
> system.time(for (t in 1:1e8) x=1000)
utilisateur     système      écoulé
6.088       0.032       6.115
> system.time(for (t in 1:10^8) x=1e3)
utilisateur     système      écoulé
6.134       0.029       6.157
> system.time(for (t in 1:1e8) x=1e3)
utilisateur     système      écoulé
6.627       0.032       6.654
> system.time(for (t in 1:10^8) x=exp(3*log(10)))
utilisateur     système      écoulé
60.571        0.000     57.103


So using the usual scientific notation with powers is taking its toll! While the calculator notation with e is cost free… Weird!

I understand that the R notation 10^6 is an abbreviation for a power function that can be equally applied to pi^pi, say, but still feel aggrieved that a nice scientific notation like 10⁶ ends up as a computing trap! I thus asked the question to the Stack Overflow forum, getting the (predictable) answer that the R code 10^6 meant calling the R power function, while 1e6 was a constant. Since 10⁶ does not differ from ππ, there is no reason 10⁶ should be recognised by R as a million. Except that it makes my coding more coherent.

> system.time( for (t in 1:10^8) x=pi^pi)
utilisateur     système      écoulé
44.518       0.000      43.179
> system.time( for (t in 1:10^8) x=10^6)
utilisateur     système      écoulé
38.336       0.000      37.860


Another thing I discovered from this answer to my question is that negative integers are also requesting call to a function:

> system.time( for (t in 1:10^8) x=1)
utilisateur     système      écoulé
10.561       0.801      11.062
> system.time( for (t in 1:10^8) x=-1)
utilisateur     système      écoulé
22.711       0.860      23.098


This sounds even weirder.

## the most patronizing start to an answer I have ever received

Posted in Books, Kids, R, Statistics, University life with tags , , , , , , on April 30, 2015 by xi'an

Another occurrence [out of many!] of a question on X validated where the originator (primitivus petitor) was trying to get an explanation without the proper background. On either Bayesian statistics or simulation. The introductory sentence to the question was about “trying to understand how the choice of priors affects a Bayesian model estimated using MCMC” but the bulk of the question was in fact failing to understand an R code for a random-walk Metropolis-Hastings algorithm for a simple regression model provided in a introductory blog by Florian Hartig. And even more precisely about confusing the R code dnorm(b, sd = 5, log = T) in the prior with rnorm(1,mean=b, sd = 5, log = T) in the proposal…

“You should definitely invest some time in learning the bases of Bayesian statistics and MCMC methods from textbooks or on-line courses.” X

So I started my answer with the above warning. Which sums up my feelings about many of those X validated questions, namely that primitivi petitores lack the most basic background to consider such questions. Obviously, I should not have bothered with an answer, but it was late at night after a long day, a good meal at the pub in Kenilworth, and a broken toe still bothering me. So I got this reply from the primitivus petitor that it was a patronizing piece of advice and he prefers to learn from R code than from textbooks and on-line courses, having “looked through a number of textbooks”. Good luck with this endeavour then!