## pseudo slice sampling

Posted in Books, Statistics, University life with tags , , , , , on November 26, 2015 by xi'an

The workshop in Warwick last week made me aware of (yet) another arXiv posting I had missed: Pseudo-marginal slice sampling by Iain Murray and Matthew Graham. The idea is to mix the pseudo-marginal approach of Andrieu and Roberts (2009) with a noisy slice sampling scheme à la Neal (2003). The auxiliary random variable u used in the (pseudo-marginal) unbiased estimator of the target I(θ), Î(θ,u), and with distribution q(u) is merged with the random variable of interest so that the joint is

Î(θ,u)q(u)/C

and a Metropolis-Hastings proposal on that target simulating from k(θ,θ’)q(u’) [meaning the auxiliary is simulated independently] recovers the pseudo-marginal Metropolis-Hastings ratio

Î(θ’,u‘)k(θ’,θ)/Î(θ,u)k(θ,θ’)

(which is a nice alternative proof that the method works!). The novel idea in the paper is that the proposal on the auxiliary u can be of a different form, while remaining manageable. For instance, as a two-block Gibbs sampler. Or an elliptical slice sampler for the u component. The argument being that an independent update of u may lead the joint chain to get stuck. Among the illustrations in the paper, an Ising model (with no phase transition issue?) and a Gaussian process applied to the Pima Indian data set (despite a recent prohibition!). From the final discussion, I gather that the modification should be applicable to every (?) case when a pseudo-marginal approach is available, since the auxiliary distribution q(u) is treated as a black box. Quite an interesting read and proposal!

## vertical likelihood Monte Carlo integration

Posted in Books, pictures, Running, Statistics, Travel, University life with tags , , , , , , , on April 17, 2015 by xi'an

A few months ago, Nick Polson and James Scott arXived a paper on one of my favourite problems, namely the approximation of normalising constants (and it went way under my radar, as I only became aware of it quite recently!, then it remained in my travel bag for an extra few weeks…). The method for approximating the constant Z draws from an analogy with the energy level sampling methods found in physics, like the Wang-Landau algorithm. The authors rely on a one-dimensional slice sampling representation of the posterior distribution and [main innovation in the paper] add a weight function on the auxiliary uniform. The choice of the weight function links the approach with the dreaded harmonic estimator (!), but also with power-posterior and bridge sampling. The paper recommends a specific weighting function, based on a “score-function heuristic” I do not get. Further, the optimal weight depends on intractable cumulative functions as in nested sampling. It would be fantastic if one could draw directly from the prior distribution of the likelihood function—rather than draw an x [from the prior or from something better, as suggested in our 2009 Biometrika paper] and transform it into L(x)—but as in all existing alternatives this alas is not the case. (Which is why I find the recommendations in the paper for practical implementation rather impractical, since, were the prior cdf of L(X) available, direct simulation of L(X) would be feasible. Maybe not the optimal choice though.)

“What is the distribution of the likelihood ordinates calculated via nested sampling? The answer is surprising: it is essentially the same as the distribution of likelihood ordinates by recommended weight function from Section 4.”

The approach is thus very much related to nested sampling, at least in spirit. As the authors later demonstrate, nested sampling is another case of weighting, Both versions require simulations under truncated likelihood values. Albeit with a possibility of going down [in likelihood values] with the current version. Actually, more weighting could prove [more] efficient as both the original nested and vertical sampling simulate from the prior under the likelihood constraint. Getting away from the prior should help. (I am quite curious to see how the method is received and applied.)

## fine-sliced Poisson [a.k.a. sashimi]

Posted in Books, Kids, pictures, R, Running, Statistics, University life with tags , , , , , , , , , on March 20, 2014 by xi'an

As my student Kévin Guimard had not mailed me his own Poisson slice sampler of a Poisson distribution, I could not tell why the code was not working! My earlier post prompted him to do so and a somewhat optimised version is given below:

```nsim = 10^4
lambda = 6

max.factorial = function(x,u){
k = x
parf=1
while (parf*u<1){
k = k + 1
parf = parf * k
}
k = k - (parf*u>1)
return (k)
}

x = rep(floor(lambda), nsim)
for (t in 2:nsim){
v1 = ceiling((log(runif(1))/log(lambda))+x[t-1])
ranj=max(0,v1):max.factorial(x[t-1],runif(1))
x[t]=sample(ranj,size=1)
}
barplot(as.vector(rbind(
table(x)/length(x),dpois(min(x):max(x),
lambda))),col=c("sienna","gold"))
```

As you can easily check by running the code, it does not work. My student actually majored my MCMC class and he spent quite a while pondering why the code was not working. I did ponder as well for a part of a morning in Warwick, removing causes for exponential or factorial overflows (hence the shape of the code), but not eliciting the issue… (This now sounds like lethal fugu sashimi! ) Before reading any further, can you spot the problem?!

The corrected R code is as follows:

```x = rep(lambda, nsim)
for (t in 2:nsim){
v1=ceiling((log(runif(1))/log(lambda))+x[t-1])
ranj=max(0,v1):max.factorial(x[t-1],runif(1))
if (length(ranj)>1){
x[t] = sample(ranj, size = 1)
}else{
x[t]=ranj}
}
```

The culprit is thus the R function sample which simply does not understand Dirac masses and the basics of probability! When running

```> sample(150:150,1)
[1] 23
```

you can clearly see where the problem stands…! Well-documented issue with sample that already caused me woes… Another interesting thing about this slice sampler is that it is awfully slow in exploring the tails. And to converge to the centre from the tails. This is not very pronounced in the above graph with a mean of 6. Moving to 50 makes it more apparent:

This is due to the poor mixing of the chain, as shown by the raw sequence below, which strives to achieve a single cycle out of 10⁵ iterations! In any case, thanks to Kévin for an interesting morning!

## sliced Poisson

Posted in Books, Kids, pictures, R, Running, Statistics, University life with tags , , , , on March 18, 2014 by xi'an

One of my students complained that his slice sampler of a Poisson distribution was not working when following the instructions in Monte Carlo Statistical Methods (Exercise 8.5). This puzzled me during my early morning run and I checked on my way back, even before attacking the fresh baguette I had brought from the bakery… The following R code is the check. And it does work! As the comparison above shows…

```slice=function(el,u){
#generate uniform over finite integer set
mode=floor(lambda)
sli=mode
x=mode+1
while (dpois(x,el)>u){
sli=c(sli,x);x=x+1}
x=mode-1
while (dpois(x,el)>u){
sli=c(sli,x);x=x-1}
return(sample(sli,1))}

#example
T=10^4
lambda=2.414

x=rep(floor(lambda),T)
for (t in 2:T)
x[t]=slice(lambda,runif(1)*dpois(x[t-1],lambda))

barplot(as.vector(rbind(
table(x)/length(x),dpois(0:max(x),
lambda))),col=c("sienna","gold"))
```

## optimisation via slice sampling

Posted in Statistics with tags , , , , on December 20, 2012 by xi'an

This morning, over breakfast; I read the paper recently arXived by John Birge et Nick Polson. I was intrigued by the combination of optimisation and of slice sampling, but got a wee disappointed by the paper, in that it proposes a simple form of simulated annealing, minimising f(x) by targeting a small collection of energy functions exp{-τf(x)}. Indeed, the slice sampler is used to explore each of those targets, i.e. for a fixed temperature τ. For the four functions considered in the paper, a slice sampler can indeed be implemented, but this feature could be seen as a marginalia, given that a random walk Metropolis-Hastings algorithm could be used as a proposal mechanism in other cases. The other intriguing fact is that annealing is not used in the traditional way of increasing the coefficient τ along iterations (as in our SAME algorithm), for which convergence issues are much more intricate, but instead stays at the level where a whole (Markov) sample is used for each temperature, the outcomes being then compared in terms of localisation (and maybe for starting at the next temperature value). I did not see any discussion about the selection of the successive temperatures, which usually is a delicate issue in realistic settings, nor of the stopping rule(s) used to decide the maximum has been reached.

## confusing errata in Monte Carlo Statistical Methods

Posted in Books, Statistics, University life with tags , , , , , on December 7, 2011 by xi'an

Following the earlier errata on Monte Carlo Statistical Methods, I received this email from Nick Foti:

I have a quick question about example 8.2 in Monte Carlo Statistical Methods which derives a slice sampler for a truncated N(-3,1) distribution (note, the book says it is a N(3,1) distribution, but the code and derivation are for a N(-3,1)). My question is that the slice set A(t+1) is described as

$\{y : f_1(y) \geq u f_1(x^{(t)}) \};$

which makes sense if u ~ U(0,1) as it corresponds to the previously described algorithm. However, in the errata for the book it says that u should actually be u(t+1) which according to the algorithm should be distributed as U(0,f1(x)). So unless something clever is being done with ratios of the f1‘s, it seems like the u(t+1) should be U(0,1) in this case, right?

There is indeed a typo in Example 8.4: the mean 3 should be -3… As for the errata, it addresses the missing time indices. Nick is right in assuming that those uniforms are indeed on (0,1), rather than on (0,f1(x)) as in Algorithm A.31. Apologies to all those confused by the errata!

## Andrew gone NUTS!

Posted in pictures, R, Statistics, University life with tags , , , , , , , , , , on November 24, 2011 by xi'an

Matthew Hoffman and Andrew Gelman have posted a paper on arXiv entitled “The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo” and developing an improvement on the Hamiltonian Monte Carlo algorithm called NUTS (!). Here is the abstract:

Hamiltonian Monte Carlo (HMC) is a Markov chain Monte Carlo (MCMC) algorithm that avoids the random walk behavior and sensitivity to correlated parameters that plague many MCMC methods by taking a series of steps informed by first-order gradient information. These features allow it to converge to high-dimensional target distributions much more quickly than simpler methods such as random walk Metropolis or Gibbs sampling. However, HMC’s performance is highly sensitive to two user-specified parameters: a step size ε and a desired number of steps L. In particular, if L is too small then the algorithm exhibits undesirable random walk behavior, while if L is too large the algorithm wastes computation. We introduce the No-U-Turn Sampler (NUTS), an extension to HMC that eliminates the need to set a number of steps L. NUTS uses a recursive algorithm to build a set of likely candidate points that spans a wide swath of the target distribution, stopping automatically when it starts to double back and retrace its steps. Empirically, NUTS perform at least as efficiently as and sometimes more efficiently than a well tuned standard HMC method, without requiring user intervention or costly tuning runs. We also derive a method for adapting the step size parameter ε on the fly based on primal-dual averaging. NUTS can thus be used with no hand-tuning at all. NUTS is also suitable for applications such as BUGS-style automatic inference engines that require efficient “turnkey” sampling algorithms.

Now, my suspicious and pessimistic nature always makes me wary of universality claims! I completely acknowledge the difficulty in picking the number of leapfrog steps L in the Hamiltonian algorithm, since the theory behind does not tell anything useful about L. And the paper is quite convincing in its description of the NUTS algorithm, being moreover beautifully written.  As indicated in the paper, the “doubling” solution adopted by NUTS is reminding me of Radford Neal’s procedure in his Annals of Statistics paper on slice sampling. (The NUTS algorithm also relies on a slice sampling step.) However, besides its ability to work as an automatic Hamiltonian methodology, I wonder about the computing time (and the “unacceptably large amount of memory”, p.12) required by the doubling procedure: 2j is growing fast with j! (If my intuition is right, the computing time should increase rather quickly with the dimension. And I do not get the argument within the paper that the costly part is the gradient computation: it seems to me the gradient must be computed for all of the 2j points.) The authors also mention delayed rejection à la Tierney and Mira (1999) and the scheme reminded me a wee of the pinball sampler we devised a while ago with Kerrie Mengersen. Choosing the discretisation step ε is more “traditional”, using the stochastic approximation approach we set in our unpublished-yet-often-quoted tech report with Christophe Andrieu. (I do not think I got the crux of the “dual averaging” for optimal calibration on p.11) The illustration through four benchmarks [incl. a stochastic volatility model!] is quite convincing as well, with (unsurprisingly) great graphical tools. A final grumble: that the code is “only” available in the proprietary language Matlab! Now, I bet some kind of Rao-Blackwellisation is feasible with all the intermediate simulations!