## pseudo-marginal MCMC with minimal replicas

Posted in Books, Statistics, University life with tags , , , , , , on November 16, 2016 by xi'an

A week ago, Chris Sherlock, Alexandre Thiery and Anthony Lee (Warwick) arXived a short paper where they show that it is most efficient to use only one random substitute to the likelihood when considering a pseudo-marginal algorithm. Thus generalising an earlier paper of Luke Bornn and co-authors I commented earlier. A neat side result in the paper is that the acceptance probability for m replicas [in the pseudo-marginal approximation] is at most m/s the acceptance probability for s replicas when s<m. The main result is as follows:

There is a (superficial?) connection with the fact that when running Metropolis-within-Gibbs there is no advantage in doing more than one single Metropolis iteration, as the goal is to converge to the joint posterior, rather than approximating better the full conditional…

## a pseudo-marginal perspective on the ABC algorithm

Posted in Mountains, pictures, Statistics, University life with tags , , , , , , , , on May 5, 2014 by xi'an

My friends Luke Bornn, Natesh Pillai and Dawn Woodard just arXived along with Aaron Smith a short note on the convergence properties of ABC. When compared with acceptance-rejection or regular MCMC. Unsurprisingly, ABC does worse in both cases. What is central to this note is that ABC can be (re)interpreted as a pseudo-marginal method where the data comparison step acts like an unbiased estimator of the true ABC target (not of the original ABC target, mind!). From there, it is mostly an application of Christophe Andrieu’s and Matti Vihola’s results in this setup. The authors also argue that using a single pseudo-data simulation per parameter value is the optimal strategy (as compared with using several), when considering asymptotic variance. This makes sense in terms of simulating in a larger dimensional space but what of the cost of producing those pseudo-datasets against the cost of producing a new parameter? There are a few (rare) cases where the datasets are much cheaper to produce.

## MCMC using an approximation

Posted in R, Statistics with tags , , , , on April 17, 2014 by xi'an

This paper by Andrès Christen and Colin Fox is an reexamination of an earlier paper by Fox and Nicholls (1997) where the idea of testing for acceptance before computing the exact likelihood was first suggested. In the event of a low acceptance rate, and of a good enough approximation, the overall cost can be significantly reduced obviously. The issue is then to recover the exact target. In practice, the original proposal q is used to generate a value θ’ that is tested against the approximate target f⁰. If accepted, θ’ is then tested against the true target f, using a pseudo proposal q⁰ which is simply translating the earlier step. A sort of delayed acceptance that only efficiently reduces the computing cost when the approximation is good enough. Since, overall, the probability of acceptance of a given θ’ is smaller with this scheme. (Or, as stated within the paper, the original kernel dominates the new one in Peskun’s sense.) The most interesting question about this paper is how to compute the approximation. Subsampling the original data and hence using a small enough part of the likelihood function reduces the computing cost but induces variability.

#prior p~Be(7.5,.5)
#data  x~B(N,p)
#posterior p|x~Be(x+7.5,N+.5-x)
T=10^4;N=100;y=32;a=7.5;b=.5

p=rep(.2,T)
for (t in 2:T){
pp=p[t-1]+runif(1,-.1,.1) #random walk
p[t]=p[t-1] #by default rejection
if ((pp>0)&(pp<1)){
ratio=dbinom(y,N,pp)/dbinom(y,N,p[t-1])
if ((runif(1)<ratio)&(runif(1)<
dbeta(pp,a,b)/dbeta(p[t-1],a,b)))
p[t]=pp
}}


One application of the method I checked during my stay in Leuven was to separate the likelihood from the prior and compare only one term to a uniform at a time. For instance, with a costly prior, the approximate target f⁰ could be the likelihood, in which case the first acceptance probability would be based on the ratio of the likelihoods and the second acceptance probability on the ratio of the priors. As shown by the above graph, on a standard normal-normal example, the true posterior and the density estimate of the simulated sample almost coincide (for 10⁵ MCMC iterations)…

Actually, the principle also applies to a product of individual terms as in a likelihood so each term can be evaluated separately! This increases the variability and the potential for rejection but if each likelihood term is costly to compute this could bring some improvement. The graphs and the R code below illustrate an implementation in the Beta-binomial case when the binomial observation is replaced with a sequence of Bernoulli observations. The fit is adequate on 10⁵ iterations, but the autocorrelation in the sequence is high (the ACF is for the 100 times thinned sequence) and the acceptance rate down to 5%.

#even better, use the likelihood decomposition
T=10^5
yful=rep(0,N)
yful[sample(1:N,y)]=1 #pseudo Bernoulli sequence

p=rep(.2,T)
for (t in 2:T){
pp=p[t-1]+runif(1,-.1,.1) #random walk
p[t]=p[t-1] #by default rejection

if ((pp>0)&(pp<1)){
n=1
ratio=dbinom(yful[n],1,pp)/dbinom(yful[n],1,p[t-1])
while ((n<N)&(runif(1)<ratio)){
n=n+1
ratio=dbinom(yful[n],1,pp)/dbinom(yful[n],1,p[t-1])}
if ((n==N)&(runif(1)<ratio)){
if (runif(1)<dbeta(pp,a,b)/dbeta(p[t-1],a,b))
p[t]=pp
}
}}


Posted in pictures, Statistics, Travel with tags , , , , , , , , , , , on April 15, 2014 by xi'an

“At equilibrium, we thus should not expect gains of several orders of magnitude.”

As was signaled to me several times during the MCqMC conference in Leuven, Rémi Bardenet, Arnaud Doucet and Chris Holmes (all from Oxford) just wrote a short paper for the proceedings of ICML on a way to speed up Metropolis-Hastings by reducing the number of terms one computes in the likelihood ratio involved in the acceptance probability, i.e.

$\prod_{i=1}^n\frac{L(\theta^\prime|x_i)}{L(\theta|x_i)}.$

The observations appearing in this likelihood ratio are a random subsample from the original sample. Even though this leads to an unbiased estimator of the true log-likelihood sum, this approach is not justified on a pseudo-marginal basis à la Andrieu-Roberts (2009). (Writing this in the train back to Paris, I am not convinced this approach is in fact applicable to this proposal as the likelihood itself is not estimated in an unbiased manner…)

In the paper, the quality of the approximation is evaluated by Hoeffding’s like inequalities, which serves as the basis for a stopping rule on the number of terms eventually evaluated in the random subsample. In fine, the method uses a sequential procedure to determine if enough terms are used to take the decision and the probability to take the same decision as with the whole sample is bounded from below. The sequential nature of the algorithm requires to either recompute the vector of likelihood terms for the previous value of the parameter or to store all of them for deriving the partial ratios. While the authors adress the issue of self-evaluating whether or not this complication is worth the effort, I wonder (from my train seat) why they focus so much on recovering the same decision as with the complete likelihood ratio and the same uniform. It would suffice to get the same distribution for the decision (an alternative that is easier to propose than to create of course). I also (idly) wonder if a Gibbs version would be manageable, i.e. by changing only some terms in the likelihood ratio at each iteration, in which case the method could be exact… (I found the above quote quite relevant as, in an alternative technique we are constructing with Marco Banterle, the speedup is particularly visible in the warmup stage.) Hence another direction in this recent flow of papers attempting to speed up MCMC methods against the incoming tsunami of “Big Data” problems.

## running MCMC for too long, and even longer…

Posted in Books, pictures, Running, Statistics, University life with tags , , , , , , , , , , , on October 23, 2013 by xi'an

Following my earlier post about the young astronomer who feared he was running his MCMC for too long, here is an update from his visit to my office this morning.  This visit proved quite an instructive visit for both of us. (Disclaimer: the picture of an observatory seen from across Brunel’s suspension bridge in Bristol is as earlier completely unrelated with the young astronomer!)

First, the reason why he thought MCMC was running too long was that the acceptance rate was plummeting down to zero, whatever the random walk scale. The reason for this behaviour is that he was actually running a standard simulated annealing algorithm, hence observing the stabilisation of the Markov chain in one of the (global) modes of the target function. In that sense, he was right that the MCMC was run for “too long”, as there was nothing to expect once the mode had been reached and the temperature turned down to zero. So the algorithm was working correctly.

Second, the astronomy problem he considers had a rather complex likelihood, for which he substituted a distance between the (discretised) observed data and (discretised) simulated data, simulated conditional on the current parameter value. Now…does this ring a bell? If not, here is a three letter clue: ABC… Indeed, the trick he had found to get around this likelihood calculation issue was to re-invent a version of ABC-MCMC! Except that the distance was re-introduced into a regular MCMC scheme as a substitute to the log-likelihood. And compared with the distance at the previous MCMC iteration. This is quite clever, even though this substitution suffers from a normalisation issue (that I already mentioned in the post about Holmes’ and Walker’s idea to turn loss functions into pseudo likelihoods. Regular ABC does not encounter this difficult, obviously. I am still bemused by this reinvention of ABC from scratch!

So we are now at a stage where my young friend will experiment with (hopefully) correct ABC steps, trying to derive the tolerance value from warmup simulations and use some of the accelerating tricks suggested by Umberto Picchini and Julie Forman to avoid simulating the characteristics of millions of stars for nothing. And we agreed to meet soon for an update. Indeed, a fairly profitable morning for both of us!