Sunday morning puzzle
A question from X validated that took me quite a while to fathom, until the solution suddenly became quite obvious:
If a sample taken from an arbitrary distribution on {0,1}⁶ is censored from its (0,0,0,0,0,0) elements, and if the marginal probabilities are known for all six components of the random vector, what is an estimate of the proportion of (missing) (0,0,0,0,0,0) elements?
Since the censoring modifies all probabilities by the same renormalisation, i.e. divides them by ρ, the probability to be different from (0,0,0,0,0,0), this probability can be estimated by looking at the marginal probabilities to be equal to 1 in the censored sample, which equal the original and known marginal probabilities divided by ρ. Here is a short piece of R code illustrating the approach, which I wrote in the taxi home last night:
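    #generate vectors
    N=1e5
    zprobs=c(.1,.9) #iid example: P(0)=.1, P(1)=.9
    smpl=matrix(sample(0:1,6*N,replace=TRUE,prob=zprobs),ncol=6)
    #censor the (0,0,0,0,0,0) rows
    pty=rowSums(smpl)
    smpl=smpl[pty>0,]
    #marginal frequencies of 1's in the censored sample
    ps=colMeans(smpl)
    #average ratio to the known marginals, estimating 1/ρ
    cor=mean(ps/rep(zprobs[2],6))
    #estimated original size
    nrow(smpl)*cor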
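Spelled out, the identity behind this code can be written as follows. The notation is introduced here for illustration and is not in the original question: p_i is the known marginal P(X_i=1), \tilde p_i the corresponding marginal in the censored sample, and n the size of the censored sample. The only step used is that X_i=1 already implies X≠(0,0,0,0,0,0).

```latex
\tilde p_i
  = P(X_i = 1 \mid X \neq 0)
  = \frac{P(X_i = 1,\, X \neq 0)}{P(X \neq 0)}
  = \frac{p_i}{\rho},
\qquad\text{hence}\qquad
\widehat{1/\rho} = \frac{1}{6}\sum_{i=1}^{6} \frac{\hat{\tilde p}_i}{p_i},
\qquad
\hat N = n \times \widehat{1/\rho}.
```

Averaging the six ratios is one choice among several for combining the components; each ratio is on its own an estimate of 1/ρ.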
A broader question is how many values (and which values) of the sample can be removed before this recovery becomes impossible (with the same amount of information).
December 8, 2015 at 9:00 pm
It's possible to estimate the proportion by a least squares approach, getting 1-(\sum_i f_i p_i)/(\sum_i f_i^2) as an estimator. Compared to the procedure proposed in the X validated post, both the bias and the variance of the LS estimator become smaller as p gets large. For small p both estimators seem equal. Also, this method can easily be generalized to the case with more censored states.
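A minimal R sketch of how the two estimators could be put side by side. It assumes that f_i in the comment stands for the marginal frequency of 1's in the censored sample and p_i for the known marginal probability, so that the least squares fit is a regression through the origin of p_i on f_i. The marginals in `p`, the seed and the sample size are illustrative choices, not values from the post or the comment.

```r
# Sketch only: p, the seed and N are illustrative assumptions
set.seed(1)
N <- 1e5
p <- c(.1, .2, .3, .2, .1, .3)            # known marginals P(X_i = 1)

# independent components, one column per component (N x 6 matrix)
smpl <- sapply(p, function(pr) rbinom(N, 1, pr))

# censor the all-zero rows
kept <- smpl[rowSums(smpl) > 0, ]

# marginal frequencies of 1's in the censored sample (assumed meaning of f_i)
f <- colMeans(kept)

# ratio estimator as in the post: mean(f / p) estimates 1/rho
rho_ratio <- 1 / mean(f / p)

# least squares estimator: slope of the no-intercept fit p_i = rho * f_i
rho_ls <- sum(f * p) / sum(f^2)

# proportion of censored rows actually removed in this simulation
true_missing <- 1 - nrow(kept) / N

c(true = true_missing, ratio = 1 - rho_ratio, ls = 1 - rho_ls)
```

Repeating this over many seeds and several choices of `p` would be one way to check the bias and variance comparison made in the comment.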