Alex Terenin told me during the welcoming reception of MCqMC 2016 that he, along with Shawfeng Dong and David Draper, had arXived a paper on GPU implementation of the Gibbs sampler and thanked me profusely for my acceptreject algorithm of the truncated normal distribution. Algorithm that he reprogrammed in CUDA. The paper is mostly a review on the specifics of GPU programming and of the constraints when compared with CPUs. The type of models considered therein allows for GPU implementation because of a very large number of latent variables that are independent conditional on the parameter θ. Like, e.g., the horseshoe probit regression model, which is how my sampler enters the picture. Acceptreject algorithms are not ideally suited for GPUs because of the while not_accepted in the code, but I did not get [from our discussion] why it is more efficient to wait for the while loop to exit when compared with running more proposals and subset the accepted ones later. Presumably because this is too costly when ensuring at least one is accepted. The paper also mentions the issue of ensuring random generators remain valid when stretched across many threads, advocating block skips as discussed in an earlier (or even ancient) ‘Og post. In line with earlier comparison tests, the proper GPU implementation of the Gibbs sampler in this setting leads to improvements that are order of magnitude faster. Nonetheless, I wonder at the universality of the comparison in that GPUs lack the programming interface that is now available for CPUs. Some authors, like the current ones, have been putting some effort in constructing random generators in CUDA, but the entry cost for newbies like me still sounds overwhelming.
Archive for CPU
GPUaccelerated Gibbs sampling
Posted in Statistics, Travel, University life with tags CPU, CUDA, GPU, horseshoe prior, parallel MCMC, probit model, pseudorandom generator on August 18, 2016 by xi'anLeave the Pima Indians alone!
Posted in Books, R, Statistics, University life with tags ABC, Bayes factor, benchmark, Chib's approximation, CPU, diabetes, EPABC, expectationpropagation, Gibbs sampling, Jim Berger, logistic regression, MCMC algorithms, Monte Carlo Statistical Methods, NewtonRaphson algorithm, Pima Indians, probit model, R on July 15, 2015 by xi'an“…our findings shall lead to us be critical of certain current practices. Specifically, most papers seem content with comparing some new algorithm with Gibbs sampling, on a few small datasets, such as the wellknown Pima Indians diabetes dataset (8 covariates). But we shall see that, for such datasets, approaches that are even more basic than Gibbs sampling are actually hard to beat. In other words, datasets considered in the literature may be too toylike to be used as a relevant benchmark. On the other hand, if ones considers larger datasets (with say 100 covariates), then not so many approaches seem to remain competitive” (p.1)
Nicolas Chopin and James Ridgway (CREST, Paris) completed and arXived a paper they had “threatened” to publish for a while now, namely why using the Pima Indian R logistic or probit regression benchmark for checking a computational algorithm is not such a great idea! Given that I am definitely guilty of such a sin (in papers not reported in the survey), I was quite eager to read the reasons why! Beyond the debate on the worth of such a benchmark, the paper considers a wider perspective as to how Bayesian computation algorithms should be compared, including the murky waters of CPU time versus designer or programmer time. Which plays against most MCMC sampler.
As a first entry, Nicolas and James point out that the MAP can be derived by standard a NewtonRaphson algorithm when the prior is Gaussian, and even when the prior is Cauchy as it seems most datasets allow for NewtonRaphson convergence. As well as the Hessian. We actually took advantage of this property in our comparison of evidence approximations published in the Festschrift for Jim Berger. Where we also noticed the awesome performances of an importance sampler based on the Gaussian or Laplace approximation. The authors call this proposal their gold standard. Because they also find it hard to beat. They also pursue this approximation to its logical (?) end by proposing an evidence approximation based on the above and Chib’s formula. Two close approximations are provided by INLA for posterior marginals and by a LaplaceEM for a Cauchy prior. Unsurprisingly, the expectationpropagation (EP) approach is also implemented. What EP lacks in theoretical backup, it seems to recover in sheer precision (in the examples analysed in the paper). And unsurprisingly as well the paper includes a randomised quasiMonte Carlo version of the Gaussian importance sampler. (The authors report that “the improvement brought by RQMC varies strongly across datasets” without elaborating for the reasons behind this variability. They also do not report the CPU time of the ISQMC, maybe identical to the one for the regular importance sampling.) Maybe more surprising is the absence of a nested sampling version.
In the Markov chain Monte Carlo solutions, Nicolas and James compare Gibbs, MetropolisHastings, Hamiltonian Monte Carlo, and NUTS. Plus a tempering SMC, All of which are outperformed by importance sampling for small enough datasets. But get back to competing grounds for large enough ones, since importance sampling then fails.
“…let’s all refrain from now on from using datasets and models that are too simple to serve as a reasonable benchmark.” (p.25)
This is a very nice survey on the theme of binary data (more than on the comparison of algorithms in that the authors do not really take into account design and complexity, but resort to MSEs versus CPus). I however do not agree with their overall message to leave the Pima Indians alone. Or at least not for the reason provided therein, namely that faster and more accurate approximations methods are available and cannot be beaten. Benchmarks always have the limitation of “what you get is what you see”, i.e., the output associated with a single dataset that only has that many idiosyncrasies. Plus, the closeness to a perfect normal posterior makes the logistic posterior too regular to pause a real challenge (even though MCMC algorithms are as usual slower than iid sampling). But having faster and more precise resolutions should on the opposite be cause for cheers, as this provides a reference value, a golden standard, to check against. In a sense, for every Monte Carlo method, there is a much better answer, namely the exact value of the integral or of the optimum! And one is hardly aiming at a more precise inference for the benchmark itself: those Pima Indians [whose actual name is Akimel O’odham] with diabetes involved in the original study are definitely beyond help from statisticians and the model is unlikely to carry out to current populations. When the goal is to compare methods, as in our 2009 paper for Jim Berger’s 60th birthday, what matters is relative speed and relative ease of implementation (besides the obvious convergence to the proper target). In that sense bigger and larger is not always relevant. Unless one tackles really big or really large datasets, for which there is neither benchmark method nor reference value.
GPUs in computational statistics
Posted in pictures, Statistics, Travel with tags Coventry, CPU, GPU, haggis, random number generator, Robert Burns, Stilton, University of Warwick on January 27, 2012 by xi'anThe workshop in Warwick yesterday went on very quickly! The room was packed. The three first talks were by myself, Christophe and Pierre, so less about GPUs and more about simulation techniques which could benefit from or even require implementation on GPUS. (I did manage to have complete slides this time… More seriously, Christophe’s talk set me thinking on the issue of estimating the likelihood function in ways different (?) from the one used in ABC.) The second half was more novel for me, in that the three talks went into the computing and computer aspects of GPUS, with Chris Holmes doing sparse [Lassolike] regression on a scale otherwise [i.e. w/o GPUs] impossible, Chris [fourth Chris in the list of speakers!] Barnes explaining ABC for molecular biology and design (a point I plan to discuss on a later post), with even more details about the architecture and programming of GPUs, and Michael Stumpf delivering a grand finale, with essentially three talks into one: network analysis (incl. terrific movie bits incorporated within the beamer slides!), GPUs vs. CPUs and older alternatives, and random generators on GPU, commenting on a recent paper by Salmon et al. (SC, 2011) and showing that true gains in efficiency from using GPUs involved a heavy involvement into the hardware structure… A very exciting day followed by Stilton cheese degustation and haggis (if not poems) to celebrate Burns’ night!

Some hae meat and canna eat,
And some wad eat that want it;
But we hae meat, and we can eat,
And sae let the Lord be thankit.