## preserving frequencies without resampling

Posted in Books, Kids, pictures, R, Statistics with tags , , on March 9, 2016 by xi'an

An interesting question came up on X validated a few days ago: given a probability vector p=(p¹,…,p⁷), is there a way to pick 5 values in {1,…,7} without replacement and still preserve the probability repartition in the resulting sample? In other words, is there a sampling without replacement strategy that leads to

$\mathbb{E}[\mathbb{I}_i(X^1)+\cdots+\mathbb{I}_i(X^5)]=5p^i$

for i=1,…,7..? Unless those probabilities p¹,…,p⁷ are close enough to 1/7, this is simply impossible as 5 values out of 7 have to be sampled, which imposes some minimal frequency on some of the values.

Hence a generic question:

given a vector p of k probabilities (summing up to 1), what is the constraint on this vector and on the number n of elements of the population one can draw without replacement in order to achieve a expected frequency of np on the resulting vector? That is,

$\mathbb{E}[\mathbb{I}_i(X_1)+\ldots+\mathbb{I}_i(X_n)]=np_i$

In the cases n=2,3, I managed to find and solve the system of equations satisfied by the sampling probability vector q, but I wondered if there exists a less pedestrian resolution. I then showed the problem to Robin Ryder while at CIRM for the Bayesian week and he quickly pointed out the answer by Brewer’s and Hanif’s book Sampling with unequal probabilities to this question, which does not use sampling with replacement with a fixed probability vector but instead modifies the remaining probabilities after each draw, as in the following R code:

kuh=(1:N)/sum((1:N)) #example of target
smpl=sample((1:N),1,rep=FALSE,pro=kuh*(1-kuh)/(1-n*kuh))
for (i in 2:n)
smpl=c(smpl,sample((1:N)[-smpl],1,rep=FALSE,
pro=(kuh*(1-kuh)/(1-(n-i+1)*kuh))[-smpl])


Hence the question is not completely solved, since I am still uncertain whether or not there exists a sampling without replacement that achieves the target probability! But at least this shows there is only a solution when all probabilities are less than 1/n, n being the number of draws…

## open problem

Posted in R, Statistics with tags , , , , , on October 24, 2013 by xi'an

On the plane back from Warwick, I was reading an ABC arXived paper by Umberto Picchini and Julie Forman, “Accelerating inference for diffusions observed with measurement error and large sample sizes using Approximate Bayesian Computation: A case study” and came upon this open problem:

“A closed-form expression for generating percentiles from a fi nite-components Gaussian mixture is unavailable.” (p.5)

which means solving

$\alpha\Phi(x)+(1-\alpha)\Phi(\{x-\mu)/\sigma) = \beta$

is not possible in closed form. (Of course it could also be argued that the equation Φ(x)=β is unavailable in closed-form ie that the analytic solution x=Φ-1(β) is formal…) While I can think of several numerical approaches, a few minutes with a sheet of paper let me convinced that indeed this is not solvable (hence not an open problem, contrary to the title of the post!).

Just for R practice (and my R course students!), here is a basic R code:

mixant=function(alpha=0.5,beta=0.95,mu,sig=1,prec=1/10^4){
onmal=1-alpha
qbeta=qnorm(beta)

# initial bounds
omb=min(qbeta,mu+sig*qbeta)
omB=max(qbeta,mu+sig*qbeta)
if (beta<alpha){
omB=min(omB,qnorm(beta/alpha))
}else{
omb=max(omb,mu+sig*qnorm((beta-alpha)/onmal))}
if (beta<onmal){
omB=min(omB,mu+sig*qnorm(beta/onmal))
}else{
omb=max(omb,qnorm((beta-onmal)/alpha))}

# iterations
for (t in 1:5){
ranj=seq(omb,omB,len=17)
cfs=alpha*pnorm(ranj)+onmal*pnorm((ranj-mu)/sig)
omb=max(ranj[cfs<=beta])
omB=min(ranj[cfs>=beta])

if ((omB-omb)<prec)
break()}
return(.5*(omb+omB))}