## limited shelf validity

Posted in Books, pictures, Statistics, University life on December 11, 2019 by xi'an

A great article from Steve Stigler in the new, multi-scaled, and so exciting Harvard Data Science Review magisterially operated by Xiao-Li Meng, on the limitations of old datasets. It is illustrated by three famous datasets used by three equally famous statisticians, Quetelet, Bortkiewicz, and Gosset, none of whom was fundamentally interested in the data for its own sake. First, Quetelet’s data was (wrongly) reconstructed, and he missed the opportunity to beat Galton to the discovery of correlation. Second, Bortkiewicz went looking (or even cherry-picking!) for rare events in yearly tables of mortality minutely divided between causes such as military horse kicks. The third dataset is not Guinness‘, but a test between two sleeping pills, conducted rather crudely on inmates of a psychiatric institution in Kalamazoo, with further mishandling by Gosset himself. Manipulations that turn the data into dead data, as Steve puts it. (A point he illustrates with the above picture of a skull collection, along with a warning against attempts at resuscitating dead data into what could be called “zombie data”.)

“Successful resurrection is only slightly more common than in Christian theology.”

His global perspective on dead data is that their use should stop rather than have their (shelf) life extended by turning them into benchmarks recycled over and over as proofs of concept. If only (my two cents) because this leads to calibrating (and choosing) methods according to how well they do on these benchmarks. Another example that could have been added to the skulls above is the Galaxy Velocity Dataset, which makes frequent appearances in works estimating Gaussian mixtures. Which Radford Neal signaled at the 2001 ICMS workshop on mixture estimation as an inappropriate use of the dataset, since astrophysical arguments weigh against a mixture model.

“…the role of context in shaping data selection and form—context in temporal, political, and social as well as scientific terms—has been shown to be a powerful and interesting phenomenon.”

The potential for “dead-er” data (my neologism!) increases with time, in that the careful sleuth work Steve (and others) conducted on these historical datasets is absolutely impossible with current massive datasets. Massive and proprietary. And presumably discarded once the associated neural net is designed and sold, leaving the burden of unmasking the potential (or highly probable?) biases to others. Most interestingly, this overlaps with a “comment” in Nature of 17 October by Sabina Leonelli on the transformation of data from a national treasure into a commodity whose “ownership can confer and signal power”. But her call for openness and governance of research data seems as illusory as other attempts to sever the GAFAs from their extra-territorial privileges…

## best unbiased estimator of θ² for a Poisson model

Posted in Books, Kids, pictures, Statistics, Travel, University life on May 23, 2018 by xi'an

A mostly traditional question on X validated about the “best” [minimum variance] unbiased estimator of θ² from a Poisson P(θ) sample leads to the Rao-Blackwell solution

$\mathbb{E}[X_1X_2|\underbrace{\sum_{i=1}^n X_i}_S=s] = -\frac{s}{n^2}+\frac{s^2}{n^2}=\frac{s(s-1)}{n^2}$

and a similar estimator can be constructed for θ³, θ⁴, … With the interesting limitation that this procedure stops at the power equal to the number of observations (minus one?). But, since the expectation of a power of the sufficient statistic S [with distribution P(nθ)] is a polynomial in θ, there is de facto no limitation. More interestingly, there is no unbiased estimator of negative powers of θ in this context, while this neat comparison on Wikipedia (borrowed from the great book of counter-examples by Romano and Siegel, 1986, selling for a mere \$180 on amazon!) shows why looking for an unbiased estimator of exp(-2θ) is particularly foolish: the only solution is (-1) to the power S [for a single observation]. (There is however a first way to circumvent the difficulty, if one has access to an arbitrary number of generations from the Poisson, since the Forsythe–von Neumann algorithm allows for an unbiased estimation of exp(-F(x)). And, as a second way, as remarked by Juho Kokkala below, a sample of at least two Poisson observations leads to a more coherent best unbiased estimator.)
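As a quick sanity check (a sketch with arbitrary values of θ and n, not part of the original discussion), both the unbiasedness of S(S-1)/n² for θ² and the unbiasedness of the foolish (-1)^X estimator of exp(-2θ) are easy to verify by simulation, using the fact that S is a Poisson P(nθ) variate:

theta = 1.7; n = 10; nsim = 10^6
S = rpois(nsim, n*theta)       # S distributed as the sum of n Poisson(theta) draws
mean(S*(S-1)/n^2)              # should be close to theta^2 = 2.89
mean((-1)^rpois(nsim, theta))  # unbiased for exp(-2*theta), but wildly variable
exp(-2*theta)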

## an accurate variance approximation

Posted in Books, Kids, pictures, R, Statistics on February 7, 2017 by xi'an

In answering a simple question on X validated about producing Monte Carlo estimates of the variance of estimators of exp(-θ) in a Poisson model, I wanted to illustrate the accuracy of these estimates against the theoretical values. One case was easy, since the estimator was a Binomial B(n,exp(-θ)) variate [in yellow on the graph]. The other one, the exponential of the negative of the Poisson sample average, did not enjoy a closed-form variance, so I instead used a first-order (δ-method) approximation for this variance, which ended up working surprisingly well [in brown] given that the experiment is based on a sample of size n=20.

Thanks to the comments of George Henry, I stand corrected: the variance of the exponential version is easily manageable with two lines of summation! As

$\text{var}(\exp\{-\bar{X}_n\})=\exp\left\{-n\theta[1-\exp\{-2/n\}]\right\}-\exp\left\{-2n\theta[1-\exp\{-1/n\}]\right\}$

which allows for a comparison with its second order Taylor approximation:
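For completeness, here is a minimal R check (a sketch, with θ=2 chosen as an illustrative value and n=20 as in the post, not the code behind the original graph) of the exact variance above against a Monte Carlo estimate and against the first-order δ-method approximation exp(-2θ)θ/n:

theta = 2; n = 20; nsim = 10^5
xbar = rpois(nsim, n*theta)/n    # nsim replications of the Poisson sample average
var(exp(-xbar))                  # Monte Carlo estimate of the variance
exp(-n*theta*(1-exp(-2/n)))-exp(-2*n*theta*(1-exp(-1/n)))  # exact variance
exp(-2*theta)*theta/n            # first-order (delta-method) approximation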

## snapshots from Nature

Posted in Books, Kids, pictures, University life on September 19, 2016 by xi'an

Among the many interesting things I read in the pile of Nature issues that had accumulated over a month of travelling (with the warning that these are mostly “old” news by now!):

• the very special and untouched case of Cuba in terms of the Zika epidemics, thanks to a long term policy fighting mosquitoes at all levels of the society;
• an impressive map of the human cortex, whose statistical analysis would be fascinating;
• an excerpt from Nature 13 August 1966 where the Poisson distribution was said to describe the distribution of scores during the 1966 World Cup;
• an analysis of a genetic experiment on evolution involving 50,000 generations (!) of Escherichia coli;
• a look back at the great Flowers for Algernon, a novel I read eons ago;
• a Nature paper on the first soft robot, or octobot, along with an easier introduction, which did not say what kind of operations such a robot could accomplish;
• a vignette on a Science paper about the interaction between honey hunters and hunting birds, which I also heard described on the French national radio, with an experiment comparing the actual (human) hunting song, a basic sentence in the local language, and the imitation of the song of another bird. I could not understand why the experiment did not include hunting songs from other hunting groups, as they are highly different but just as effective; it would have helped in understanding how innate the reaction of the bird is;
• another literary entry, on the science behind Mary Shelley’s Frankenstein;
• a study of the Mathematical Genealogy Project in terms of the few mathematicians who started most genealogies of mathematicians, including d’Alembert, advisor to Laplace, of whom I am one of the many descendants; although the finding is not that astounding when considering that most branches of usual genealogies die off, and given the highly hierarchical structure of power in the universities of old.

## fine-sliced Poisson [a.k.a. sashimi]

Posted in Books, Kids, pictures, R, Running, Statistics, University life on March 20, 2014 by xi'an

As my student Kévin Guimard had not mailed me his own slice sampler of a Poisson distribution, I could not tell why the code was not working! My earlier post prompted him to do so, and a somewhat optimised version is given below:

nsim = 10^4
lambda = 6

# returns the largest integer k >= x such that (x+1)(x+2)...k does not exceed 1/u
# (the product is accumulated step by step to avoid factorial overflows)
max.factorial = function(x,u){
  k = x
  parf = 1
  while (parf*u < 1){
    k = k + 1
    parf = parf * k
  }
  k = k - (parf*u > 1)
  return(k)
}

x = rep(floor(lambda), nsim)
for (t in 2:nsim){
  # lower and upper bounds of the current slice
  v1 = ceiling((log(runif(1))/log(lambda)) + x[t-1])
  ranj = max(0,v1):max.factorial(x[t-1], runif(1))
  x[t] = sample(ranj, size=1)
}
barplot(as.vector(rbind(
  table(x)/length(x), dpois(min(x):max(x), lambda))),
  col=c("sienna","gold"))


As you can easily check by running it, the code does not work. My student actually came top of my MCMC class, and he spent quite a while pondering why the code was failing. I pondered as well for part of a morning in Warwick, removing possible causes of exponential or factorial overflows (hence the shape of the code), but without eliciting the issue… (This now sounds like lethal fugu sashimi!) Before reading any further, can you spot the problem?!

The corrected R code is as follows:

x = rep(lambda, nsim)
for (t in 2:nsim){
  v1 = ceiling((log(runif(1))/log(lambda)) + x[t-1])
  ranj = max(0,v1):max.factorial(x[t-1], runif(1))
  # sample(ranj,1) misbehaves when ranj has length one, hence the special case
  if (length(ranj) > 1){
    x[t] = sample(ranj, size=1)
  }else{
    x[t] = ranj}
}


The culprit is thus the R function sample, which simply does not understand Dirac masses and the basics of probability! When running

> sample(150:150,1)
[1] 23


you can clearly see where the problem stands…! It is a well-documented issue with sample that has already caused me woes: given a single value, sample treats it as the upper bound of the range 1:n rather than as a one-point support. Another interesting feature of this slice sampler is that it is awfully slow at exploring the tails, and at converging back to the centre from the tails. This is not very pronounced in the above graph, with a mean of 6; moving to 50 makes it more apparent:

This is due to the poor mixing of the chain, as shown by the raw sequence below, which struggles to complete a single cycle over 10⁵ iterations! In any case, thanks to Kévin for an interesting morning!
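As an aside (not in the original post), the special-casing in the corrected code can be avoided altogether with the resample idiom suggested in the examples of R's own help page for sample, which never interprets a length-one vector as an upper bound:

# safe version of sample() for vectors that may have length one
# (idiom taken from the examples in R's help page for sample)
resample = function(x, ...) x[sample.int(length(x), ...)]

resample(150:150, 1)   # returns 150, as intended

With this, the update in the slice sampler can simply be written x[t]=resample(ranj,1), whatever the length of ranj.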

## sliced Poisson

Posted in Books, Kids, pictures, R, Running, Statistics, University life on March 18, 2014 by xi'an

One of my students complained that his slice sampler of a Poisson distribution was not working when following the instructions in Monte Carlo Statistical Methods (Exercise 8.5). This puzzled me during my early morning run and I checked on my way back, even before attacking the fresh baguette I had brought from the bakery… The following R code is the check. And it does work! As the comparison above shows…

slice=function(el,u){
  # generate a uniform draw over the finite set of integers within the slice
  mode = floor(el)
  sli = mode
  x = mode + 1
  while (dpois(x,el) > u){
    sli = c(sli,x); x = x + 1}
  x = mode - 1
  while (dpois(x,el) > u){
    sli = c(sli,x); x = x - 1}
  return(sample(sli,1))}

#example
T=10^4
lambda=2.414

x = rep(floor(lambda), T)
for (t in 2:T)
  x[t] = slice(lambda, runif(1)*dpois(x[t-1],lambda))

barplot(as.vector(rbind(
  table(x)/length(x), dpois(0:max(x), lambda))),
  col=c("sienna","gold"))
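Beyond the visual barplot comparison, a quick numerical check of the fit (a sketch, reusing x, T, and lambda from the run above) is to compute the largest gap between the empirical frequencies and the Poisson probabilities, which should be of order 1/√T:

freqs = table(factor(x, levels=0:max(x)))/T
max(abs(freqs - dpois(0:max(x), lambda)))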


## optimising accept-reject

Posted in R, Statistics, University life on November 21, 2012 by xi'an

I spotted on R-bloggers a post discussing how to optimise the efficiency of programming accept-reject algorithms. While it is about SAS programming, and apparently supported by the SAS company, there are two interesting features to this discussion. The first one is about avoiding the dreaded loop in accept-reject algorithms. For instance, taking the case of the truncated-at-one Poisson distribution [i.e., restricted to the positive integers], the code

rtpois=function(n,lambda){
  sampl=c()
  while (length(sampl)<n){
    x=rpois(1,lambda)
    if (x!=0) sampl=c(sampl,x)}
  return(sampl)
}


is favoured by my R course students but highly inefficient:

> system.time(rtpois(10^5,.5))
user  system elapsed
61.600  27.781  98.727


both because of the stepwise increase in the size of the vector and because of the loop itself. For instance, defining the vector sampl at its final length first requires a tenth of the above time (note the switch from 10⁵ to 10⁶ simulations):

> system.time(rtpois(10^6,.5))
user  system elapsed
54.155   0.200  62.635
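The preallocated variant itself is not shown in the original post; a sketch along those lines (hypothetical code, keeping the loop but allocating sampl once at its final length) could be:

rtpois.prealloc=function(n,lambda){
  sampl=rep(0,n)      # allocate the output vector once and for all
  naccpt=0
  while (naccpt<n){
    x=rpois(1,lambda)
    if (x!=0){
      naccpt=naccpt+1
      sampl[naccpt]=x}
  }
  return(sampl)
}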


As discussed by the author of the post, a more efficient program should aim at avoiding the loop by predicting the number of proposals necessary to accept a given number of values. Since the bound M used in accept-reject algorithms is also the expected number of attempts for one acceptance, one should start with something around Mn proposed values. (Assuming of course all densities are properly normalised.) For instance, in the case of the truncated-at-one Poisson distribution based on proposals from the regular Poisson, the bound is 1/(1-exp(-λ)). A first implementation of this principle is to build the sample via a small number of recursive calls:

rtpois=function(n,lambda){
  propal=rpois(ceiling(n/(1-exp(-lambda))),lambda)
  propal=propal[propal>0]
  n0=length(propal)
  if (n0>=n)
    return(propal[1:n])
  else return(c(propal,rtpois(n-n0,lambda)))
}


with a higher efficiency:

> system.time(rtpois(10^6,.5))
user  system elapsed
0.816   0.092   0.928


Replacing the expectation with an upper bound using the variance of the negative binomial distribution does not make a significant dent in the computing time

rtpois=function(n,lambda){
  M=1/(1-exp(-lambda))
  propal=rpois(ceiling(M*(n+2*sqrt(n/M)/(M-1))),lambda)
  propal=propal[propal>0]
  n0=length(propal)
  if (n0>=n)
    return(propal[1:n])
  else return(c(propal,rtpois(n-n0,lambda)))}


since we get

> system.time(rtpois(10^6,.5))
user  system elapsed
0.824   0.048   0.877


The second point about this Poisson example is that simulating a distribution with a restricted support using another distribution with a larger support is quite inefficient, especially when λ goes to zero. By comparison, using a Poisson proposal with parameter μ and translating it by 1 may bring a considerable improvement: without getting into the gory details, it can be shown that the optimal value of μ (in terms of maximal acceptance probability) is λ, and that the corresponding probability of acceptance is

$\dfrac{1-\exp\{-\lambda\}}{\lambda}$

which is larger than the acceptance probability 1-exp(-λ) of the original approach when λ is less than one. As shown by the graph below, this allows for a lower bound on the probability of acceptance that remains tolerable.
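As an illustration (a sketch in the same recursive style as above; the function name and the size of the proposal batch are my own choices, not from the original post), with μ = λ the acceptance ratio for a proposed value y = 1 + z, z generated from P(λ), reduces to 1/y:

rtpois.shift=function(n,lambda){
  # propose y = 1 + Poisson(lambda) and accept with probability 1/y;
  # the overall acceptance rate is (1-exp(-lambda))/lambda
  propal=1+rpois(ceiling(n*lambda/(1-exp(-lambda))),lambda)
  propal=propal[runif(length(propal))<=1/propal]
  n0=length(propal)
  if (n0>=n)
    return(propal[1:n])
  else return(c(propal,rtpois.shift(n-n0,lambda)))}

For small values of λ, this version accepts almost every proposal, whereas the original scheme rejects a proportion exp(-λ) of the proposals, which is then close to one.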