Archive for parallel MCMC

21w5107 [day 2]

Posted in Books, Mountains, pictures, Statistics, Travel, University life on December 1, 2021 by xi'an

After a rich and local (if freezing) dinner on a rooftop facing the baroque Oaxaca cathedral, and, in my case, an early invigorating outdoor swim!, the morning session was mostly on mixtures, with Helen Ogden exploring cross-validation for (estimating the number k of components in) finite mixtures, when using the likelihood as an objective function. I was however unclear about the goal, considering that the data supporting the study was Uniform(0,1), nothing like a mixture of Normal distributions. And about the consistency attached to this objective function. The session ended with Diana Cai presenting a counter-argument of sorts, in that she proved, along with Trevor Campbell and Tamara Broderick, that the posterior on k diverges to infinity with the number n of observations if the mixture model is misspecified for said data. Which does not come as a major surprise, since there is no properly defined value of k when the data is not generated from the adopted mixture. I would love to see an extension to the case when the k-component mixture contains a non-parametric component! In-between, Alexander Ly discussed Bayes factors for multiple datasets, with some asymptotics showing consistency for some (improper!) priors as one sample size grows to infinity, while actually attaining the same rate under both hypotheses. Luis Nieto-Barajas presented an approach to uncertainty assessment through KL divergence for random probability measures, which requires a calibration of the KL in this setting, as KL does not enjoy a uniform scale, along with a prior on a Pólya tree. And Chris Holmes presented a recent work with Edwin Fong and Stephen Walker on a prediction approach to Bayesian inference, which I had had on my reading list for a while. It is a very original proposal where likelihoods and priors are replaced by the sequence of posterior predictives and only parameters of interest get simulated. The Bayesian flavour of the approach is delicate to assess, though, albeit with a form of non-parametric Bayesian perspective… (I still need to read the paper carefully.)
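Out of curiosity, here is a minimal sketch of what likelihood cross-validation for picking k could look like on a toy Gaussian mixture, using scikit-learn; this is only my own illustration of the generic idea, not Helen Ogden's procedure, and the toy data, folds, and settings are arbitrary choices of mine.

```python
# A toy illustration (mine, not the talk's method) of likelihood
# cross-validation for choosing the number k of mixture components.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
# hypothetical data: a two-component Normal mixture, unlike the Uniform(0,1) case
x = np.concatenate([rng.normal(-2, 1, 500), rng.normal(2, 1, 500)]).reshape(-1, 1)

def cv_loglik(x, k, n_splits=5):
    """Average held-out log-likelihood of a k-component Gaussian mixture."""
    scores = []
    for train, test in KFold(n_splits, shuffle=True, random_state=1).split(x):
        gm = GaussianMixture(n_components=k, n_init=3).fit(x[train])
        scores.append(gm.score(x[test]))  # mean log-likelihood on the held-out fold
    return np.mean(scores)

for k in range(1, 6):  # pick the k maximising the cross-validated criterion
    print(k, round(cv_loglik(x, k), 3))
```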

In the afternoon session, Judith Rousseau presented her recent foray into cut posteriors for semi-parametric HMMs. With interesting outcomes for efficiently estimating the transition matrix, the component distributions, and the smoothing distribution. I wonder at the connection with safe Bayes, in that cut posteriors induce a loss of information. Sinead Williamson spoke on distributed MCMC for BNP, going back to the “theme of the day”, namely clustering and finding the correct (?) number of clusters. With a collapsed versus uncollapsed division that reminded me of the marginal vs. conditional distinction María Gil-Leyva discussed yesterday. Plus a decomposition of a random measure into a finite mixture and an infinite one that also reminded me of the morning talk of Diana Cai. (And making me wonder at the choice of the number K of terms in the finite part.) Michele Guindani spoke about clustering distributions (with firecrackers as a background!), using the nDP mixture model, which was shown to suffer from degeneracy (as discussed by Federico Camerlenghi et al. in BA). The subtle difference lies in using the same (common) atoms in all random distributions at the top of the hierarchy, with independent weights, making the partitions partially exchangeable. The approach relies on Sylvia Frühwirth-Schnatter’s generalised mixtures of finite mixtures. With interesting applications to microbiome and calcium imaging data (including a mouse brain in action!). And Giovanni Rebaudo presented a generalised notion of clustering aligned on a graph, with some observations located between the nodes corresponding to clusters. Represented as a random measure with common parameters for the clusters and separate parameters outside. Interestingly playing on random partitions, Pólya urns, and species sampling.

parallel MCMC

Posted in Books, Statistics, Travel, University life on September 9, 2020 by xi'an

Yesterday, I remotely took part in the thesis defence of Balazs Nemeth, at Hasselt University, Belgium, as the pandemic conditions were alas still too uncertain to allow for travelling between France and Belgium… The thesis is about parallel strategies for speeding up MCMC, although the title is “Message passing computational methods with pharmacometrics applications”, as the thesis was supported by Johnson & Johnson. (The defence was in English, as I do not understand a word of Dutch…) Among the solutions, distributed affine-invariant sampling à la Goodman & Weare, speculative parallelism for SMC, and an automated parallelisation for hierarchical models that is at the core of the thesis. These methods were not associated with designing new MCMC algorithms, but rather intended to take advantage of parallelisation for existing MCMC algorithms, which meant issues like asynchronicity or data splitting were not considered therein. I however found the work in the thesis innovative and promising, and the PhD candidate was awarded the title by the jury at the end of the defence!
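For the curious, here is a minimal single-process sketch of the Goodman & Weare stretch move with the usual two-half split of the walkers, which is what makes the ensemble embarrassingly parallel (each half can be updated on separate processors); this is my own toy illustration and by no means the code in the thesis.

```python
# Affine-invariant "stretch move" over an ensemble of walkers, with the
# two-half split that allows parallel updates; a sketch under my own choices
# of target, ensemble size, and stretch parameter a.
import numpy as np

def stretch_move(log_post, walkers, a=2.0, rng=np.random.default_rng()):
    n, d = walkers.shape
    half = n // 2
    for first, other in [(slice(0, half), slice(half, n)),
                         (slice(half, n), slice(0, half))]:
        x = walkers[first]
        # each walker in one half picks a companion from the other half
        comp = walkers[other][rng.integers(0, walkers[other].shape[0], size=x.shape[0])]
        z = ((a - 1.0) * rng.random(x.shape[0]) + 1.0) ** 2 / a  # z ~ g(z) ∝ 1/√z
        prop = comp + z[:, None] * (x - comp)
        log_acc = (d - 1) * np.log(z) + log_post(prop) - log_post(x)
        accept = np.log(rng.random(x.shape[0])) < log_acc
        walkers[first][accept] = prop[accept]
    return walkers

# usage: standard Normal target in d=3, 20 walkers
log_post = lambda x: -0.5 * np.sum(x**2, axis=1)
walkers = np.random.default_rng(1).normal(size=(20, 3))
for _ in range(1000):
    walkers = stretch_move(log_post, walkers)
```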

distributed posteriors

Posted in Books, Statistics, Travel, University life on February 27, 2019 by xi'an

Another presentation by our OxWaSP students introduced me to the notion of distributed posteriors, following a 2018 paper by Botond Szabó and Harry van Zanten. Which corresponds to the construction of posteriors when conducting a divide & conquer strategy. The authors show that an adaptation of the prior to the division of the sample is necessary to recover the (minimax) convergence rate obtained in the non-distributed case. This is somewhat annoying, except that the adaptation amounts to taking the original prior to the power 1/m, where m is the number of divisions. They further show that when the regularity (parameter) of the model is unknown, the optimal rate cannot be recovered unless stronger assumptions are made on the non-zero parameters of the model.

“First of all, we show that depending on the communication budget, it might be advantageous to group local machines and let different groups work on different aspects of the high-dimensional object of interest. Secondly, we show that it is possible to have adaptation in communication restricted distributed settings, i.e. to have data-driven tuning that automatically achieves the correct bias-variance trade-off.”

I find the paper of considerable interest for scalable MCMC methods, even though the setting may sound overly formal, because the study incorporates parallel computing constraints. (Although I did not investigate the more theoretical aspects of the paper.)
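To make the 1/m rescaling of the prior more concrete, here is a toy check of mine (definitely not the paper's code!) on a conjugate Normal-mean model: each of the m shards uses the prior raised to the power 1/m, and multiplying the resulting Gaussian sub-posteriors recovers the full-data posterior exactly.

```python
# Divide & conquer with the prior tempered to the power 1/m on each shard;
# a conjugate illustration under my own (arbitrary) choices of sigma2, tau2, m.
import numpy as np

rng = np.random.default_rng(0)
sigma2, tau2, m = 1.0, 4.0, 5            # known data variance, prior variance, shards
x = rng.normal(1.5, np.sqrt(sigma2), 1000)

# full-data posterior for theta ~ N(0, tau2), x_i ~ N(theta, sigma2)
prec_full = 1 / tau2 + len(x) / sigma2
mean_full = (x.sum() / sigma2) / prec_full

# sharded sub-posteriors, each using the tempered prior N(0, m * tau2)
prec_combined, num_combined = 0.0, 0.0
for shard in np.array_split(x, m):
    prec_j = 1 / (m * tau2) + len(shard) / sigma2
    mean_j = (shard.sum() / sigma2) / prec_j
    prec_combined += prec_j             # multiplying Gaussians adds precisions
    num_combined += prec_j * mean_j     # and precision-weighted means

print(mean_full, prec_full)
print(num_combined / prec_combined, prec_combined)  # identical, up to rounding
```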

GPU-accelerated Gibbs sampling

Posted in Statistics, Travel, University life on August 18, 2016 by xi'an

Alex Terenin told me during the welcoming reception of MCqMC 2016 that he, along with Shawfeng Dong and David Draper, had arXived a paper on a GPU implementation of the Gibbs sampler, and thanked me profusely for my accept-reject algorithm for the truncated normal distribution. An algorithm that he reprogrammed in CUDA. The paper is mostly a review of the specifics of GPU programming and of its constraints when compared with CPUs. The type of models considered therein allows for GPU implementation because of a very large number of latent variables that are independent conditional on the parameter θ. Like, e.g., the horseshoe probit regression model, which is how my sampler enters the picture. Accept-reject algorithms are not ideally suited for GPUs because of the while not_accepted loop in the code, but I did not get [from our discussion] why it is more efficient to wait for the while loop to exit when compared with running more proposals and subsetting the accepted ones later. Presumably because the latter is too costly when ensuring at least one proposal gets accepted per variable. The paper also mentions the issue of ensuring random generators remain valid when stretched across many threads, advocating block skips as discussed in an earlier (or even ancient) ‘Og post. In line with earlier comparison tests, the proper GPU implementation of the Gibbs sampler in this setting leads to improvements that are orders of magnitude faster. Nonetheless, I wonder at the universality of the comparison, in that GPUs lack the programming interface that is now available for CPUs. Some authors, like the current ones, have been putting some effort into constructing random generators in CUDA, but the entry cost for newbies like me still sounds overwhelming.
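To illustrate the point about the while loop, here is a rough numpy stand-in (CPU only, not CUDA, and not the authors' code) contrasting the two vectorised strategies for the truncated normal accept-reject sampler: regenerating proposals only for the entries not yet accepted, versus oversampling once and subsetting; the batch factor in the second version is an arbitrary choice of mine.

```python
# Exponential accept-reject for a N(0,1) truncated to [a, +inf), vectorised
# two ways; a sketch, not the paper's CUDA implementation.
import numpy as np
rng = np.random.default_rng(0)

def truncnorm_loop(a, size):
    alpha = (a + np.sqrt(a * a + 4.0)) / 2.0          # optimal exponential rate
    out = np.empty(size)
    todo = np.arange(size)                            # indices still waiting
    while todo.size:
        x = a + rng.exponential(1.0 / alpha, todo.size)
        accept = rng.random(todo.size) < np.exp(-0.5 * (x - alpha) ** 2)
        out[todo[accept]] = x[accept]
        todo = todo[~accept]
    return out

def truncnorm_oversample(a, size, factor=3):
    alpha = (a + np.sqrt(a * a + 4.0)) / 2.0
    x = a + rng.exponential(1.0 / alpha, factor * size)
    x = x[rng.random(factor * size) < np.exp(-0.5 * (x - alpha) ** 2)]
    return x[:size]                                   # may fall short if factor is too small

print(truncnorm_loop(2.0, 5), truncnorm_oversample(2.0, 5))
```

The second version indeed wastes proposals and may return fewer draws than requested, which is presumably the cost alluded to above.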

asynchronous distributed Gibbs sampling

Posted in Books, Statistics, Travel, University life on October 13, 2015 by xi'an

Alexander Terenin, Dan Simpson, and David Draper just arXived a paper on an alternative brand of Gibbs sampler, which they think can revolutionise the sampler and overcome its well-known bottlenecks. David had sent me the paper in advance and thus I had time to read it on the plane to Calgary. (This is also the very first paper I have seen acknowledging a pair of trousers…! With no connection whatsoever with bottlenecks!)

“Note that not all updates that are sent will be received by all other workers: because of network traffic congestion and other types of failures, a significant portion of the updates will be lost along the way.”

The approach is inherently parallel in that several “workers” (processors or graphical units) run Gibbs samplers in parallel, using their current knowledge of the system. This means they update a component of the model parameter based on the information they have last received, and then send this new value back to the system. For physical reasons, this transmission is not instantaneous, and thus the workers do not necessarily all condition on the same “current” value. Perceiving this algorithm as a “garden of forking paths” where each full conditional uses values picked at random from a collection of subchains (one for each worker), I can see why the algorithm should remain valid.
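To get a feel for the role of message delays, here is a toy serial simulation of mine (not the authors' algorithm) with two workers, each owning one coordinate of a bivariate Normal with correlation ρ and conditioning on a possibly stale copy of the other coordinate, refreshed only with probability p_recv per iteration; with p_recv equal to one this reduces to the plain two-stage Gibbs sampler.

```python
# Two "workers" running Gibbs updates on stale information; a toy sketch with
# arbitrary values of rho and p_recv, meant only to mimic delayed/dropped messages.
import numpy as np

rho, p_recv, n_iter = 0.8, 0.5, 50_000
rng = np.random.default_rng(0)
x_seen, y_seen = 0.0, 0.0      # stale copies each worker conditions on
chain = np.empty((n_iter, 2))

for t in range(n_iter):
    # worker 1 draws x conditional on its (possibly stale) copy of y
    x = rng.normal(rho * y_seen, np.sqrt(1 - rho**2))
    if rng.random() < p_recv:  # the message carrying the new x may reach worker 2
        x_seen = x
    # worker 2 draws y conditional on its (possibly stale) copy of x
    y = rng.normal(rho * x_seen, np.sqrt(1 - rho**2))
    if rng.random() < p_recv:
        y_seen = y
    chain[t] = x, y

# compare the empirical correlation with rho
print(chain.mean(0), np.corrcoef(chain.T)[0, 1])
```

Lowering p_recv mimics lost or late messages and lets one check how far the empirical joint drifts from the target.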

“Thus, the quality of this [ABC] method rises and falls with the ingenuity of the user in identifying nearly-sufficient statistics.”

It is also clear that this approach allows for any degree of parallelisation. However, it is less clear to me why this should constitute an improvement. With respect to the bottlenecks mentioned at the beginning of the paper, I do not truly see how the large-data problem is bypassed, except in cases where the conditionals only depend on small parts of the data. Or why large dimensions can be more easily managed when compared with a plain Gibbs sampler or, better, parallel plain Gibbs samplers that would run on the same number of processors. (I do not think the paper runs the comparison in that manner, using instead a one-processor Gibbs sampler as its benchmark. Or fewer processors in the third example.) Since the forking paths repeatedly merge back at aperiodic stages, there is no multiplication or clear increase of the exploratory abilities of the sampler, except for having competing proposed values [or even proposals] selected randomly. So maybe reaching a wee bit farther from time to time.
