a Bernoulli factory of sorts?

Posted in Books, Kids, Statistics on May 10, 2016 by xi'an

A nice question was posted on X validated, namely how to simulate a Bernoulli B(q) variate when using only a Bernoulli B(p) generator. With the additional question of handling the special case q=a/b, a rational probability. This is not exactly a Bernoulli factory problem in that q does not write as f(p), but still a neat challenge. My solution would have been similar to the one posted by William Huber, namely to simulate a sequence of B(p) or B(1-p) variates that zoom in on q, until the information gathered on the underlying uniform U allows us to conclude about the position of U wrt q. For instance, if p>q and X~B(p) is equal to zero, the underlying uniform is more than p, hence more than q, leading to returning zero for the B(q) generation. Else, a second B(p) or B(1-p) generation means breaking the interval (0,p) into two parts, one of which allows for stopping the generation, and so on. The solution posted by William Huber contains an R code that could be easily improved by choosing at each step, between p and (1-p), the split with the maximal probability of stopping. I still wonder at the ultimate optimal solution that would minimise the (average or median) number of calls to the Bernoulli(p) generator.
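
The interval-zooming idea can be sketched in R as follows. This is only an illustration and not William Huber's code: the function name `bern_q`, its arguments and the `max_iter` safeguard are hypothetical, p is assumed known, and only B(p) draws are used (no switch to B(1-p), hence no attempt at maximising the probability of stopping).

```r
# Sketch: produce one B(q) draw using only calls to a B(p) generator.
# rbern_p is a function returning one B(p) draw; p is assumed known.
bern_q <- function(q, rbern_p, p, max_iter = 10000) {
  lo <- 0; hi <- 1                      # current interval containing the implicit uniform U
  for (i in seq_len(max_iter)) {
    cut <- lo + p * (hi - lo)           # split the interval in proportions p : (1-p)
    if (rbern_p() == 1) hi <- cut else lo <- cut
    if (hi <= q) return(1L)             # U < q for sure
    if (lo >= q) return(0L)             # U >= q for sure
  }
  stop("no decision reached within max_iter draws")
}

set.seed(1)
p <- 0.3; q <- 0.62
x <- replicate(20000, bern_q(q, function() rbinom(1, 1, p), p))
mean(x)   # empirical frequency, to be compared with q
```

Each draw shrinks the interval by a factor p or 1-p, so the number of calls is finite with probability one, although nothing here claims it is minimal.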

an integer programming riddle

Posted in Books, Kids, R on April 21, 2016 by xi'an

A puzzle on The Riddler this week that ends up as a standard integer programming problem. Removing the little story around the question, it boils down to maximising

200a+100b+50c+25d

under the constraints

400a+400b+150c+50d≤1000, b≤a, a≤1, c≤8, d≤4,

and (a,b,c,d) all non-negative integers. My first attempt was a brute force R code since there are only 3 x 9 x 5 = 135 cases:

f.obj<-c(200,100,50,25)
f.con<-matrix(c(40,40,15,5,
-1,1,0,0,
1,0,0,0,
0,0,1,0,
0,0,0,1),ncol=4,byrow=TRUE)
f.dir<-c("<=","<=","<=","<=","<=")
f.rhs<-c(100,0,1,8,4)

sol=0
for (a in 0:1)
for (b in 0:a)
for (k in 0:8)
for (d in 0:4){
cost=f.con%*%c(a,b,k,d)-f.rhs
if (max(cost)<=0){ gain=f.obj%*%c(a,b,k,d)
if (gain>sol){
sol=gain
argu=c(a,b,k,d)}}}


which returns the value:

> sol
[,1]
[1,]  425
> argu
[1] 1 0 3 3


This is confirmed by a call to an integer programming code like lpSolve:

> lp("max",f.obj,f.con,f.dir,f.rhs,all.int=TRUE)
Success: the objective function is 425
> lp("max",f.obj,f.con,f.dir,f.rhs,all.int=TRUE)$sol
[1] 1 0 3 3


which provides the same solution.
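
As a cross-check, the same enumeration can be sketched without nested loops by listing all candidate vectors with `expand.grid`. The object names (`grid`, `feasible`, `gain`, `best`) are illustrative, and the constraints are written on the original scale rather than divided by 10.

```r
# Sketch: vectorised brute force over all (a,b,c,d) candidates
obj  <- c(200, 100, 50, 25)
grid <- expand.grid(a = 0:1, b = 0:1, c = 0:8, d = 0:4)
# budget constraint and b <= a; the bounds on a, c, d are built into the grid
feasible <- with(grid, 400 * a + 400 * b + 150 * c + 50 * d <= 1000 & b <= a)
gain <- drop(as.matrix(grid) %*% obj)
gain[!feasible] <- -Inf                 # discard infeasible candidates
best <- which.max(gain)
grid[best, ]                            # maximising (a,b,c,d)
gain[best]                              # maximal objective value
```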

slice sampling revisited

Posted in Books, pictures, Statistics on April 15, 2016 by xi'an

Thanks to an X validated question, I re-read Radford Neal’s 2003 Slice sampling paper. Which is an Annals of Statistics discussion paper, and rightly so. While I was involved in the editorial processing of this massive paper (!), I had only vague memories left about it. Slice sampling has this appealing feature of being the equivalent of random walk Metropolis-Hastings for Gibbs sampling, without the drawback of setting a scale for the moves.

“These slice sampling methods can adaptively change the scale of changes made, which makes them easier to tune than Metropolis methods and also avoids problems that arise when the appropriate scale of changes varies over the distribution  (…) Slice sampling methods that improve sampling by suppressing random walks can also be constructed.” (p.706)

One major theme in the paper is fighting random walk behaviour, a fight of which Radford is a strong proponent. Even at the present time, I am a bit surprised by this feature as component-wise slice sampling is exhibiting clear features of a random walk, exploring the subgraph of the target by random vertical and horizontal moves. Hence facing the potential drawback of backtracking to previously visited places.

“A Markov chain consisting solely of overrelaxed updates might not be ergodic.” (p.729)

Overrelaxation is presented as a means to avoid the random walk behaviour by removing rejections. The proposal is actually deterministic, projecting the current value to the “other side” of the approximate slice. If it stays within the slice it is accepted. This “reflection principle” [in that it takes the symmetric wrt the centre of the slice] is also connected with antithetic sampling in that it induces rather negative correlation between the successive simulations. The last methodological section covers reflective slice sampling, which appears as a slice version of Hamiltonian Monte Carlo (HMC). Given the difficulty in implementing exact HMC (reflected in the later literature), it is no wonder that Radford proposes an approximation scheme that is valid if somewhat involved.

“We can show invariance of this distribution by showing (…) detailed balance, which for a uniform distribution reduces to showing that the probability density for x¹ to be selected as the next state, given that the current state is x⁰, is the same as the probability density for x⁰ to be the next state, given that x¹ is the current state, for any states x⁰ and x¹ within [the slice] S.” (p.718)

In direct connection with the X validated question there is a whole section of the paper on implementing single-variable slice sampling that I had completely forgotten, with a collection of practical implementations when the slice

S={x; u < f(x) }

cannot be computed in an exact manner. Like the “stepping out” procedure. The resulting set (interval) where the uniform simulation in x takes place may well miss some connected component(s) of the slice. The above quote on detailed balance (p.718) may then sound like a strange argument in that the move may well leave a part of the slice off and still satisfy this condition. Not really since it states that it must hold for any pair of states within S… The very positive side of this section is to allow for slice sampling in cases where the inversion of u < f(x) is intractable. Hence with a strong practical implication. The multivariate extension of the approximation procedure is more (potentially) fraught with danger in that it may fall victim to a curse of dimension, in that the box for the uniform simulation of x may be much too large when compared with the true slice (or slice of the slice). I had more of a memory of the “trail of crumbs” idea, mostly because of the name I am afraid!, which links with delayed rejection, as indicated in the paper, but seems awfully delicate to calibrate.
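
For illustration, here is a minimal sketch of a single-variable slice sampling update with stepping out followed by shrinkage, in the spirit of the procedures described in the paper. The function name `slice_step`, the default width `w`, the cap `m` on the number of expansion steps, and the choice of working on the log scale are all assumptions of this sketch, and the standard Normal target is just a toy example.

```r
# Sketch: one slice sampling update for a univariate target with log-density logf
slice_step <- function(x0, logf, w = 1, m = 50) {
  logy <- logf(x0) - rexp(1)            # log of u ~ U(0, f(x0)), defining the slice
  # stepping out: randomly place an interval of width w around x0, then expand it
  L <- x0 - w * runif(1); R <- L + w
  J <- floor(m * runif(1)); K <- (m - 1) - J
  while (J > 0 && logf(L) > logy) { L <- L - w; J <- J - 1 }
  while (K > 0 && logf(R) > logy) { R <- R + w; K <- K - 1 }
  # shrinkage: draw uniformly on (L,R), shrinking the interval after each failure
  repeat {
    x1 <- runif(1, L, R)
    if (logf(x1) > logy) return(x1)
    if (x1 < x0) L <- x1 else R <- x1
  }
}

set.seed(42)
logf <- function(x) dnorm(x, log = TRUE)   # toy target: standard Normal
x <- numeric(5000)
for (t in 2:5000) x[t] <- slice_step(x[t - 1], logf)
c(mean(x), sd(x))
```

The only tuning parameter is the initial width `w`, and the stepping out and shrinkage steps adapt the interval to the local scale of the slice.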

another riddle

Posted in Books, Kids, R on March 29, 2016 by xi'an

A very nice puzzle on The Riddler last week that kept me busy on train and plane rides, runs and even in between over the weekend. The core of the puzzle is about finding the optimal procedure to select k guesses about the value of a uniformly random integer x in {a,a+1,…,b}, given that each guess y produces the position of x relative to y (less, equal, or more). If y=x at one stage, the player wins x. Optimal being defined as maximising the expected gain. After some (and more) experimentation, I found that, when b-a is large enough [depending on k], the optimal guess at stage i is b-f(i) with f(k)=0 and f(i-1)=2f(i)+1. For the values given on The Riddler, a=1,b=1000,k=9, the recursion gives f(1)=255, so my solution is to first guess at y=1000-f(1)=745 and this produces a gain of 380.31 with a probability of winning of 0.510, which seems amazingly large, but not so much when considering that 2⁹ is close to 500.
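
The recursion on f can be unrolled backwards from f(k)=0 in a couple of lines. This is a small sketch with illustrative variable names, which only evaluates the offsets f(i) and the corresponding guesses b-f(i) along the branch where x always exceeds the guess (so that b is never updated). It does not compute the expected gain.

```r
# Sketch: offsets f(1),...,f(k) defined by f(k)=0 and f(i-1)=2f(i)+1
k <- 9; b <- 1000
f <- numeric(k)
f[k] <- 0
for (i in k:2) f[i - 1] <- 2 * f[i] + 1
f                # offsets, equal to 2^(k-i)-1 at stage i
guesses <- b - f # guesses at stages 1,...,k when b stays equal to 1000
guesses
```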

preserving frequencies without resampling

Posted in Books, Kids, pictures, R, Statistics on March 9, 2016 by xi'an

An interesting question came up on X validated a few days ago: given a probability vector p=(p¹,…,p⁷), is there a way to pick 5 values in {1,…,7} without replacement and still preserve the probability distribution in the resulting sample? In other words, is there a sampling without replacement strategy that leads to

$\mathbb{E}[\mathbb{I}_i(X^1)+\cdots+\mathbb{I}_i(X^5)]=5p^i$

for i=1,…,7? Unless those probabilities p¹,…,p⁷ are close enough to 1/7, this is simply impossible as 5 values out of 7 have to be sampled, which imposes some minimal frequency on some of the values.

Hence a generic question:

given a vector p of k probabilities (summing up to 1), what is the constraint on this vector and on the number n of elements of the population one can draw without replacement in order to achieve an expected frequency of np on the resulting vector? That is,

$\mathbb{E}[\mathbb{I}_i(X_1)+\ldots+\mathbb{I}_i(X_n)]=np_i$

In the cases n=2,3, I managed to find and solve the system of equations satisfied by the sampling probability vector q, but I wondered if there exists a less pedestrian resolution. I then showed the problem to Robin Ryder while at CIRM for the Bayesian week and he quickly pointed out the answer given in Brewer and Hanif’s book Sampling with unequal probabilities, which does not use sampling with replacement with a fixed probability vector but instead modifies the remaining probabilities after each draw, as in the following R code:

N=10;n=5 #illustrative sizes, chosen so that max(kuh)<1/n
kuh=(1:N)/sum((1:N)) #example of target
smpl=sample((1:N),1,rep=FALSE,pro=kuh*(1-kuh)/(1-n*kuh))
for (i in 2:n)
smpl=c(smpl,sample((1:N)[-smpl],1,rep=FALSE,
pro=(kuh*(1-kuh)/(1-(n-i+1)*kuh))[-smpl]))
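
To look at the expected frequencies empirically, the above scheme can be wrapped in a function and repeated. This is a sketch only: the names `draw_once`, `sims` and `freq`, the sizes N=10 and n=5, and the number of replications are illustrative, and no claim is made here about how closely the empirical frequencies match the target n·kuh.

```r
# Sketch: repeat the draw-by-draw scheme and tabulate inclusion frequencies
draw_once <- function(kuh, n) {
  N <- length(kuh)
  w <- function(r) kuh * (1 - kuh) / (1 - r * kuh)   # unnormalised weights
  smpl <- sample(1:N, 1, prob = w(n))
  for (i in 2:n)
    smpl <- c(smpl, sample((1:N)[-smpl], 1, prob = w(n - i + 1)[-smpl]))
  smpl
}

set.seed(101)
N <- 10; n <- 5                          # requires max(kuh) < 1/n
kuh <- (1:N) / sum(1:N)
sims <- replicate(1e4, draw_once(kuh, n))            # n x 1e4 matrix of draws
freq <- tabulate(as.vector(sims), nbins = N) / ncol(sims)
round(rbind(target = n * kuh, empirical = freq), 3)
```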


Hence the question is not completely solved, since I am still uncertain whether or not there exists a sampling without replacement that achieves the target probability! But at least this shows that a solution can only exist when all probabilities are less than 1/n, n being the number of draws…

at CIRM [#3]

Posted in Kids, Mountains, pictures, Running, Statistics, Travel, University life on March 4, 2016 by xi'an

Simon Barthelmé gave his mini-course on EP, with loads of details on the implementation of the method. Focussing on the EP-ABC and MCMC-EP versions today. Leaving open the difficulty of assessing to which limit EP is converging. But mentioning the potential for asynchronous EP (on which I would like to hear more). Ironically using several times a logistic regression example, if not on the Pima Indians benchmark! He also talked about approximate EP solutions that relate to consensus MCMC. With a connection to Mark Beaumont’s talk at NIPS [at the same time as mine!] on the comparison with ABC. While we saw several talks on EP during this week, I am still agnostic about the potential of the approach. It certainly produces a fast proxy to the true posterior and hence can be exploited ad nauseam in inference methods based on pseudo-models like indirect inference. In conjunction with other quick and dirty approximations when available. As in ABC, it would be most useful to know how far from the (ideal) posterior distribution the approximation stands. Machine learning approaches presumably allow for an evaluation of the predictive performances, but less so for the modelling accuracy, even with new sampling steps. [But I know nothing, I know!]

Dennis Prangle presented some on-going research on high dimension [data] ABC. Raising the question of what is the true meaning of dimension in ABC algorithms. Or of sample size. Because the inference relies on the event d(s(y),s(y’))≤ξ or on the likelihood l(θ|x). Both one-dimensional. Mentioning Iain Murray’s talk at NIPS [that I also missed]. Re-expressing as well the perspective that ABC can be seen as a missing or estimated normalising constant problem as in Bornn et al. (2015) I discussed earlier. The central idea is to use SMC to simulate a particle cloud evolving as the target tolerance ξ decreases. Which supposes a latent variable structure lurking in the background.

Judith Rousseau gave her talk on non-parametric mixtures and the possibility to learn parametrically about the component weights. Starting with a rather “magic” result by Allman et al. (2009) that, with three repeated observations per individual, all terms in a mixture are identifiable. Maybe related to the simpler fact that mixtures of Bernoullis are not identifiable while mixtures of Binomials are identifiable, even when n=2. As “shown” in this plot made for X validated. Actually truly related because Allman et al. (2009) prove identifiability through a finite dimensional model. (I am surprised I missed this most interesting paper!) With the side condition that a mixture of r components, each a product of p Bernoullis, is identifiable when p ≥ 2⌈log₂ r⌉+1, where log₂ is the base-2 logarithm and ⌈x⌉ the upper rounding. I also find most relevant this distinction between the weights and the remainder of the mixture as weights behave quite differently, hardly parameters in a sense.

more e’s [and R’s]

Posted in Kids, pictures, R, Statistics on February 22, 2016 by xi'an

Alex Thiéry suggested debiasing the biased estimate of e by Rhee and Glynn’s truncated series method, so I tried the method to see how much of an improvement (if any!) this would bring. I first attempted to naïvely implement the raw formula of Rhee and Glynn

$\hat{\mathfrak{e}} = \sum_{n=1}^N \{\hat{e}_{n+1}-\hat{e}_n\}\big/\mathbb{P}(N\ge n)$

with a (large) Poisson distribution on the stopping rule N, but this took ages. I then realised that the index n did not have to be absolute, i.e. to start at n=1 and proceed snailwise one integer at a time: the formula remains equally valid after a change of time, i.e. n can start at an arbitrary value and proceed by steps of arbitrary size, which obviously speeds things up!
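
To illustrate the mechanics of the randomly truncated series, here is a toy sketch in which the biased sequence is deterministic, namely the partial sums of the series Σ 1/k!, which is not the Monte Carlo estimate of e discussed above. The names `partial_e` and `rhee_glynn`, the choice N = 1 + Poisson(λ), and the addition of the starting value ê₁ in front of the weighted sum (so that the telescoping sum has the limit as expectation) are all assumptions of this sketch.

```r
# Toy biased sequence: e_n = sum_{k=0}^n 1/k!, converging to e
partial_e <- function(n) sapply(n, function(k) sum(1 / factorial(0:k)))

# Sketch of a randomly truncated, reweighted telescoping sum
rhee_glynn <- function(seq_fun, lambda = 2) {
  N <- 1 + rpois(1, lambda)                          # random stopping index, N >= 1
  n <- 1:N
  surv <- ppois(n - 2, lambda, lower.tail = FALSE)   # P(N >= n) = P(Poisson >= n-1)
  seq_fun(1) + sum((seq_fun(n + 1) - seq_fun(n)) / surv)
}

set.seed(1)
est <- replicate(5000, rhee_glynn(partial_e))
c(mean(est), exp(1))   # average of the debiased estimates versus e
```

The increments of this toy sequence decrease faster than the Poisson tail probabilities, which keeps the variance of each term finite. This need not hold for other sequences.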