Alex Terenin told me during the welcoming reception of MCqMC 2016 that he, along with Shawfeng Dong and David Draper, had arXived a paper on a GPU implementation of the Gibbs sampler, and thanked me profusely for my accept-reject algorithm for the truncated normal distribution, an algorithm he reprogrammed in CUDA. The paper is mostly a review of the specifics of GPU programming and of its constraints when compared with CPUs. The type of models considered therein allows for GPU implementation because of a very large number of latent variables that are independent conditional on the parameter θ, like, e.g., the horseshoe probit regression model, which is how my sampler enters the picture. Accept-reject algorithms are not ideally suited for GPUs because of the while not_accepted loop in the code, but I did not get [from our discussion] why it is more efficient to wait for the while loop to exit than to run more proposals and subset the accepted ones later. Presumably because the latter is too costly when ensuring at least one proposal is accepted. The paper also mentions the issue of ensuring random generators remain valid when stretched across many threads, advocating block skips, as discussed in an earlier (or even ancient) ‘Og post. In line with earlier comparison tests, the proper GPU implementation of the Gibbs sampler in this setting leads to improvements that are orders of magnitude faster. Nonetheless, I wonder about the universality of the comparison, in that GPUs lack the programming interface that is now available for CPUs. Some authors, like the current ones, have put some effort into constructing random generators in CUDA, but the entry cost for newbies like me still sounds overwhelming.
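For reference, the tail accept-reject sampler in question (Robert, 1995) uses a translated-exponential proposal; here is a minimal sketch written with batched proposals rather than a per-draw while loop, which is closer in spirit to what a GPU prefers. The function name and the batch-sizing rule are mine, not from the paper:

```python
import numpy as np

def rtnorm_tail(mu_minus, size, rng=None):
    """Draw `size` samples from N(0,1) truncated to [mu_minus, +inf),
    via accept-reject with a shifted exponential proposal (Robert, 1995),
    proposing in oversized vectorised batches instead of looping per draw."""
    rng = np.random.default_rng() if rng is None else rng
    # optimal rate of the shifted exponential proposal
    alpha = 0.5 * (mu_minus + np.sqrt(mu_minus**2 + 4.0))
    out = np.empty(0)
    while out.size < size:
        n = 2 * (size - out.size) + 16          # hypothetical batch sizing
        z = mu_minus + rng.exponential(1.0 / alpha, size=n)
        # acceptance probability exp(-(z - alpha)^2 / 2)
        keep = rng.random(n) <= np.exp(-0.5 * (z - alpha) ** 2)
        out = np.concatenate((out, z[keep]))
    return out[:size]

x = rtnorm_tail(2.0, 10_000, np.random.default_rng(0))
```

Generating a surplus of proposals and subsetting the accepted ones is exactly the "run more and subset later" alternative discussed above; the cost is wasted draws, the gain is no divergent while loops across threads.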
Archive for GPU

GPU-accelerated Gibbs sampling
Posted in Statistics, Travel, University life with tags CPU, CUDA, GPU, horseshoe prior, parallel MCMC, probit model, pseudo-random generator on August 18, 2016 by xi'an

Foundations of Statistical Algorithms [book review]
Posted in Books, Linux, R, Statistics, University life with tags big data, GPU, Hadoop, MapReduce, optimisation, parallelisation, R, R package, random number generation, randomness, scalability, simulation, statistical algorithms, statistical computing on February 28, 2014 by xi'an

There is computational statistics and there is statistical computing. And then there is statistical algorithmics. Not the same thing, by far. This 2014 book by Weihs, Mersmann and Ligges, from TU Dortmund, the latter being also a member of the R Core team, stands at one end of this wide spectrum of techniques required by modern statistical analysis. In short, it provides the necessary skills to construct statistical algorithms and hence to contribute to statistical computing. And I wish I had the luxury to teach from Foundations of Statistical Algorithms to my graduate students, if only we could afford an extra yearly course…
“Our aim is to enable the reader (…) to quickly understand the main ideas of modern numerical algorithms [rather] than having to memorize the current, and soon to be outdated, set of popular algorithms from computational statistics.”(p.1)
The book is built around the above aim, first presenting the reasons why computers can produce answers different from what we want, using least squares as a means to check for (in)stability, second establishing the ground for Monte Carlo methods by discussing (pseudo-)random generation, including MCMC algorithms, third moving to bootstrap and resampling techniques, and concluding with parallelisation and scalability. The text is highly structured, with frequent summaries, a division of chapters all the way down to subsubsubsections, an R implementation section in each chapter, and a few exercises.
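The least-squares (in)stability point can be illustrated with a toy example (mine, not taken from the book): solving the "textbook" normal equations squares the condition number of the design matrix, while a QR/SVD-based solver does not.

```python
import numpy as np

# A deliberately ill-conditioned polynomial design matrix
n, p = 100, 10
t = np.linspace(0.0, 1.0, n)
X = np.vander(t, p, increasing=True)     # columns 1, t, ..., t^9
beta = np.ones(p)
y = X @ beta                             # exact, noise-free response

# SVD-based least squares (numerically stable)
b_qr, *_ = np.linalg.lstsq(X, y, rcond=None)

# "textbook" normal equations X'X b = X'y: squares the condition number
b_ne = np.linalg.solve(X.T @ X, X.T @ y)

err_qr = np.abs(b_qr - beta).max()
err_ne = np.abs(b_ne - beta).max()
print(err_qr, err_ne)   # the normal-equation error is far larger
```

Both solvers see the same data, yet only the normal-equation route loses most of its accurate digits, which is the kind of diagnosis the book's first chapters equip the reader to make.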
resampling and [GPU] parallelism
Posted in Statistics, University life with tags GPU, particle MCMC, Raftery and Lewis' number of iterations, random number generator, resampling, stratified resampling, systematic resampling on March 13, 2012 by xi'an

In a recent note posted on arXiv, Lawrence Murray compares the implementation of resampling schemes for parallel systems like GPUs. Given a system of weighted particles, (x_{i},ω_{i}), there are several ways of drawing a sample according to those weights:
 regular multinomial resampling, where each point in the (new) sample is one of the x_{i}'s, selected with probability ω_{i}, meaning a uniform is generated for each point;
 stratified resampling, where the weights are added, divided into equal pieces and a uniform is sampled on each piece, which means that points with large weights are sampled at least once and those with small weights at most once;
 systematic resampling, which is the same as the above except that the same uniform is used for each piece;
 Metropolis resampling, where a Markov chain converges to the distribution (ω_{1},…, ω_{P}) on {1,…,P}.
The first three resamplers are common in the particle system literature (incl. Nicolas Chopin’s PhD thesis), but difficult to adapt to GPUs (and I always feel uncomfortable with the fact that systematic resampling uses a single uniform!), while the last one is more unusual, but actually well-fitted for a parallel implementation. While Lawrence Murray suggests using Raftery and Lewis’ (1992) assessment of the required number of Metropolis iterations to “achieve convergence”, I would instead suggest taking advantage of the toric nature of the space (as represented above) to run a random walk and wait for the equivalent of a complete cycle. In any case, this is a cool illustration of the new challenges posed by parallel implementations (like the development of proper random generators).
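The first three schemes admit compact vectorised implementations; a minimal sketch, assuming the weights are already normalised (the function names and the `searchsorted` formulation are mine):

```python
import numpy as np

def multinomial_resample(w, rng):
    """One uniform per draw: index i chosen with probability w[i]."""
    return rng.choice(len(w), size=len(w), p=w)

def stratified_resample(w, rng):
    """Cumulative weights cut into P equal pieces, one uniform per piece."""
    P = len(w)
    cs = np.cumsum(w)
    cs[-1] = 1.0                         # guard against floating-point drift
    u = (np.arange(P) + rng.random(P)) / P
    return np.searchsorted(cs, u)

def systematic_resample(w, rng):
    """Same strata as above, but a single uniform shared by all pieces."""
    P = len(w)
    cs = np.cumsum(w)
    cs[-1] = 1.0
    u = (np.arange(P) + rng.random()) / P
    return np.searchsorted(cs, u)

w = np.array([0.5, 0.3, 0.1, 0.1])
print(systematic_resample(w, np.random.default_rng(0)))  # indices into the particle set
```

Note how stratified and systematic resampling guarantee that a particle with weight above 1/P is picked at least once, as described above; the Metropolis variant is omitted here.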
GPUs in computational statistics
Posted in pictures, Statistics, Travel with tags Coventry, CPU, GPU, haggis, random number generator, Robert Burns, Stilton, University of Warwick on January 27, 2012 by xi'an

The workshop in Warwick yesterday went on very quickly! The room was packed. The first three talks were by myself, Christophe and Pierre, so less about GPUs and more about simulation techniques which could benefit from or even require implementation on GPUs. (I did manage to have complete slides this time… More seriously, Christophe’s talk set me thinking on the issue of estimating the likelihood function in ways different (?) from the one used in ABC.) The second half was more novel for me, in that the three talks went into the computing and computer aspects of GPUs, with Chris Holmes doing sparse [Lasso-like] regression on a scale otherwise [i.e. w/o GPUs] impossible, Chris [fourth Chris in the list of speakers!] Barnes explaining ABC for molecular biology and design (a point I plan to discuss in a later post), with even more details about the architecture and programming of GPUs, and Michael Stumpf delivering a grand finale, essentially three talks in one: network analysis (incl. terrific movie bits incorporated within the beamer slides!), GPUs vs. CPUs and older alternatives, and random generators on GPUs, commenting on a recent paper by Salmon et al. (SC, 2011) and showing that true gains in efficiency from using GPUs involve a heavy investment in the hardware structure… A very exciting day, followed by Stilton cheese degustation and haggis (if not poems) to celebrate Burns’ night!

Some hae meat and canna eat,
And some wad eat that want it;
But we hae meat, and we can eat,
And sae let the Lord be thankit.
English trip (1)
Posted in Statistics, Travel, University life with tags ABC, Cambridge University, CRiSM, Edinburgh, GPU, graphics processing units, Kenilworth, model choice, parallel processing, Scotland, University of Oxford, University of Warwick, Wang-Landau algorithm, Warwick on January 25, 2012 by xi'an

Today, I am attending a workshop on the use of graphics processing units in Statistics in Warwick, supported by CRiSM, presenting our recent work with Randal Douc, Pierre Jacob and Murray Smith. (I will use the same slides as in Telecom two months ago, hopefully avoiding the loss of integral and summation signs this time!) Pierre Jacob will talk about Wang-Landau.
Then, tomorrow, I am off to Cambridge to talk about ABC and model choice on Friday afternoon. (Presumably using the same slides as in Provo.)
The (1) in the title is in anticipation of a second trip to Oxford next month and another one to Bristol two months after! (The trip to Edinburgh does not count of course, since it is in Scotland!)
GPUs in Computational Statistics [Warwick, Jan. 25]
Posted in Statistics, Travel, University life with tags ABC, ABC model choice, Bayesian model choice, computational statistics, GPU, parallel processing, University of Cambridge, University of Warwick on January 6, 2012 by xi'an

Next January 25, I will take part in a workshop at the University of Warwick (organised by CRiSM and CSC) on the theme of GPUs in Computational Statistics. Even though I have not directly worked on GPUs, I will talk about our joint work with Pierre Jacob and Murray Smith, while Pierre will talk about parallel Wang-Landau. From there I will travel to Cambridge for a seminar on ABC model choice the next Friday.
Parallel processing of independent Metropolis-Hastings algorithms
Posted in R, Statistics, University life with tags Chris Holmes, cloud computing, GPU, importance sampling, JSM 2010, Metropolis-Hastings, parallelisation, Rao-Blackwellisation, Valencia 9 on October 12, 2010 by xi'an

With Pierre Jacob, my PhD student, and Murray Smith, from the National Institute of Water and Atmospheric Research, Wellington, who actually started us on this project at the last and latest Valencia meeting, we have completed a paper on using parallel computing in independent Metropolis-Hastings algorithms. The paper is arXived and the abstract goes as follows:
In this paper, we consider the implications of the fact that parallel raw power can be exploited by a generic Metropolis–Hastings algorithm if the proposed values are independent. In particular, we present improvements to the independent Metropolis–Hastings algorithm that significantly decrease the variance of any estimator derived from the MCMC output, at a null computing cost since those improvements are based on a fixed number of target density evaluations. Furthermore, the techniques developed in this paper do not jeopardize the Markovian convergence properties of the algorithm, since they are based on the Rao–Blackwell principles of Gelfand and Smith (1990), already exploited in Casella and Robert (1996), Atchadé and Perron (2005) and Douc and Robert (2010). We illustrate those improvements both on a toy normal example and on a classical probit regression model, but insist on the fact that they are universally applicable.
I am quite excited about the results in this paper, which took advantage of (a) older works of mine on Rao-Blackwellisation, (b) Murray’s interests in costly likelihoods, and (c) our mutual excitement when hearing about GPU parallel possibilities from Chris Holmes’ talk in Valencia. (As well as directions drafted in an exciting session in Vancouver!) The (free) gains over standard independent Metropolis-Hastings estimates are equivalent to importance sampling gains, while keeping the Markov structure of the original chain. Given that 100 or more parallel threads can be handled by current GPU cards, this is clearly a field with much potential! The graph below
gives the variance improvements brought by three Rao-Blackwell estimates taking advantage of parallelisation over the initial MCMC estimate (first entry), with the importance sampling estimate (last entry), using only 10 parallel threads.
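The structural point exploited above can be sketched in a few lines: with an independence proposal, all proposals and all target-density evaluations can be produced in one vectorised batch (the part a GPU would parallelise), leaving only a cheap scalar accept/reject recursion. This is a sketch of that general idea only, not of the paper's Rao-Blackwellised estimators; names and the toy target are mine.

```python
import numpy as np

def block_imh(logpi, logq, propose, x0, n, rng):
    """Independence Metropolis-Hastings with batched proposals and
    batched density evaluations; only the accept/reject scan is serial."""
    y = propose(n, rng)                 # all proposals at once
    logw = logpi(y) - logq(y)           # batched density evaluations
    logu = np.log(rng.random(n))
    x = np.empty(n)
    cur, lw = x0, logpi(x0) - logq(x0)
    for t in range(n):                  # cheap sequential recursion
        if logu[t] <= logw[t] - lw:     # accept w.p. min(1, w(y)/w(x))
            cur, lw = y[t], logw[t]
        x[t] = cur
    return x

# toy check: N(0,1) target, N(0,4) independence proposal
rng = np.random.default_rng(0)
chain = block_imh(lambda z: -0.5 * z**2,
                  lambda z: -0.5 * (z / 2.0)**2 - np.log(2.0),
                  lambda n, r: 2.0 * r.standard_normal(n),
                  0.0, 50_000, rng)
```

Since the acceptance ratio of an independence sampler only depends on the importance weights π(y)/q(y), all the expensive work sits in the batched `logw` line, which is why the paper's Rao-Blackwellised recycling of those same evaluations comes at essentially no extra cost.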