## occupancy rules

Posted in Kids, R, Statistics with tags , , , , , , , on May 23, 2016 by xi'an

While the last riddle on The Riddler was rather anticlimactic, namely to find the mean of the number Y of empty bins in a uniform multinomial with n bins and m draws, with solution

$\mathbb{E}[Y]=n(1-\frac{1}{n})^m,$

[which still has a link with e in that the fraction of empty bins converges to e⁻¹ when n=m], this led me to some more involved investigation on the distribution of Y. While it can be shown directly that the probability that k bins are non-empty is

${n \choose k}\sum_{i=1}^k (-1)^{k-i}{k \choose i}(i/n)^m$

with an R representation by

miss<-function(n,m){
p=rep(0,n)
for (k in 1:n)
p[k]=choose(n,k)*sum((-1)^((k-1):0)*choose(k,1:k)*(1:k)^m)
return(rev(p)/n^m)}


I wanted to take advantage of the moments of Y, since it writes as a sum of n indicators, counting the number of empty cells. However, the higher moments of Y are not as straightforward as its expectation and I struggled with the representation until I came upon this formula

$\mathbb{E}[Y^k]=\sum_{i=1}^k {k \choose i} i! S(k,i) \left( 1-\frac{i}{n}\right)^m$

where S(k,i) denotes the Stirling number of the second kind… Or i!S(n,i) is the number of surjections from a set of size n to a set of size i. Which leads to the distribution of Y by inverting the moment equations, as in the following R code:

diss<-function(n,m){
A=matrix(0,n,n)
mome=rep(0,n)
A[n,]=rep(1,n)
mome[n]=1
for (k in 1:(n-1)){
A[k,]=(0:(n-1))^k
for (i in 1:k)
mome[k]=mome[k]+factorial(i)*as.integer(Stirling2(n,i))*
(1-(i+1)/n)^m*factorial(k)/factorial(k-i-1)}
return(solve(A,mome))}


that I still checked by raw simulations from the multinomial

zample<-function(n,m,T=1e4){
x=matrix(sample(1:n,m*T,rep=TRUE),nrow=T)
x=sapply(apply(x,1,unique),length)
return(n-x)}


## mean of an absolute Student’s t

Posted in R, Statistics, University life with tags , , , on December 1, 2011 by xi'an

Having (rather foolishly) involved myself into providing an answer for Cross Validated: “Can the standard deviation of non-negative data exceed the mean?“, I ended up having to derive the mean of the absolute value of a Student’s variate X.  (Well, not really, but then I did.) I think the following is correct:

$\mathbb{E}[|X|]=\mu(2\text{P}_{\mu,\nu}(X>0)-1)\qquad\qquad\\ \qquad\qquad+2 f(0,\nu)\dfrac{\nu}{\nu-1}\left[1+\mu^2/\nu\right]^{-(\nu-1)/2}$

where $f(x,\nu)$ is the density of the standard Student’s distribution. (I also checked by simulation. The derivation is there. And now fully arXived. Even though it is most likely already written somewhere else. See, for instance, Psarakis and Panaretos who studied the case of the absolute centred t rv, called the folded t rv.)