## turn-key and scalable synchronous distributed MCMC algorithm

Posted in Statistics, University life with tags , , , , , on April 29, 2022 by xi'an

Last week, I attended a Lagrange seminar where Vincent Plassier presented a ICML²¹ paper he had co-authored with Maxime Vono, Alain Durmus, and Eric Moulines. Aiming at distributed MCMC algorithms that operate on several machines, with a target distribution that breaks as a target

$\int\prod_{i=1}^b \pi_i(\theta,z_i)\,\text d\mathbf{z}=\prod_{i=1}^b e^{U_i(A_i\theta)}$

where θ is common to all terms. And each term in the product can (only) be computed locally. This setup is obviously the same as for the embarrassingly parallel approaches of Neiswanger et al. (2014) and Scott et al. (2016). And it follows an earlier proposal of Vono et al. (2020), which appears as a full Gibbs algorithm on the augmented parameters (θ,z), assuming each term is a conditional density in the latent z’s. Which requires constant communications between the b workers and the central “master” node when θ is concerned. The ICML²¹ paper overcomes this difficulty by defining an approximate target with a Normal component in z. Meaning that the (approximate) conditional distribution of θ given the latent z is Normal, i.e. considering the augmented joint

$\prod_{i=1}^b\exp\left\{u_i(z_i)-\rho_i||z_i-A_i\theta||^2\right\}$

but despite the Gaussian aspect, this is not always practical:

“When d [is large], this Gibbs sampling scheme unfortunately leads to prohibitive computational costs and hence prevents its practical use for general Bayesian inference problems.”

The authors then move to simulating from several Langevin step, more specifically running one move of the Euler-Maruyama discretisation scheme of the overdamped Langevin stochastic differential equation. Communication with the central node is then reduced. The paper proposes a proof of convergence in this unusual (since overdamped) setup. As well as bounds on the bias due to the inclusion of the latent variables. They also manage to find the required scaling of the various parameters involved (Normal variance, discretisation scale, Langevin runs) to achieve convergence, which I find rather remarkable. The table at the top illustrates the comparison with earlier methods, whenever available.

## Bayesian synthetic likelihood

Posted in Statistics with tags , , , , , , , on December 13, 2017 by xi'an

Leah Price, Chris Drovandi, Anthony Lee and David Nott published earlier this year a paper in JCGS on Bayesian synthetic likelihood, using Simon Wood’s synthetic likelihood as a substitute to the exact likelihood within a Bayesian approach. While not investigating the theoretical properties of this approximate approach, the paper compares it with ABC on some examples. In particular with respect to the number n of Monte Carlo replications used to approximate the mean and variance of the Gaussian synthetic likelihood.

Since this approach is most naturally associated with an MCMC implementation, it requires new simulations of the summary statistics at each iteration, without a clear possibility to involve parallel runs, in contrast to ABC. However in the final example of the paper, the authors reach values of n of several thousands, making use of multiple cores relevant, if requiring synchronicity and checks at every MCMC iteration.

The authors mention that “ABC can be viewed as a pseudo-marginal method”, but this has a limited appeal since the pseudo-marginal is a Monte Carlo substitute for the ABC target, not the original target. Similarly, there exists an unbiased estimator of the Gaussian density due to Ghurye and Olkin (1969) that allows to perceive the estimated synthetic likelihood version as a pseudo-marginal, once again wrt a target that differs from the original one. And the bias reappears under mis-specification, that is when the summary statistics are not normally distributed. It seems difficult to assess this normality or absence thereof in realistic situations.

“However, when the distribution of the summary statistic is highly irregular, the output of BSL cannot be trusted, while ABC represents a robust alternative in such cases.”

To make synthetic likelihood and ABC algorithms compatible, the authors chose a Normal kernel for ABC. Still, the equivalence is imperfect in that the covariance matrix need be chosen in the ABC case and is estimated in the synthetic one. I am also lost to the argument that the synthetic version is more efficient than ABC, in general (page 8). As for the examples, the first one uses a toy Poisson posterior with a single sufficient summary statistic, which is not very representative of complex situations where summary statistics are extremes or discrete. As acknowledged by the authors this is a case when the Normality assumption applies. For an integer support hidden process like the Ricker model, normality vanishes and the outcomes of ABC and synthetic likelihood differ, which makes it difficult to compare the inferential properties of both versions (rather than the acceptance rates), while using a 13-dimension statistic for estimating a 3-dimension parameter is not recommended for ABC, as discussed by Li and Fearnhead (2017). The same issue appears in the realistic cell motility example, with 145 summaries versus two parameters. (In the philogenies studied by DIYABC, the number of summary statistics is about the same but we now advocate a projection to the parameter dimension by the medium of random forests.)

Given the similarity between both approaches, I wonder at a confluence between them, where synthetic likelihood could maybe be used to devise PCA on the summary statistics and facilitate their projection on a space with much smaller dimensions. Or estimating the mean and variance functions in the synthetic likelihood towards producing directly simulations of the summary statistics.

## simple, scalable and accurate posterior interval estimation

Posted in Statistics with tags , , , , , , , on July 6, 2016 by xi'an

“There is a lack of simple and scalable algorithms for uncertainty quantification.”

A paper by Cheng Li , Sanvesh Srivastava, and David Dunson that I had missed and which was pointed out on Andrew’s blog two days ago. As recalled in the very first sentence of the paper, above, the existing scalable MCMC algorithms somewhat fail to account for confidence (credible) intervals. In the sense that handling parallel samples does not naturally produce credible intervals.Since the approach is limited to one-dimensional quantity of interest, ζ=h(θ), the authors of the paper consider the MCMC approximations of the cdf of the said quantity ζ based on the manageable subsets like as many different approximations of the same genuine posterior distribution of that quantity ζ. (Corrected by a power of the likelihood but dependent on the particular subset used for the estimation.) The estimate proposed in the paper is a Wasserstein barycentre of the available estimations, barycentre that is defined as minimising the sum of the Wasserstein distances to all estimates. (Why should this measure be relevant: the different estimates may be of different quality). Interestingly (at least at a computational level), the solution is such that the quantile function of the Wasserstein barycentre is the average of the estimated quantiles functions. (And is there an alternative loss returning the median cdf?) A confidence interval based on the quantile function can then be directly derived. The paper shows that this Wasserstein barycentre converges to the true (marginal) posterior as the sample size m of each sample grows to infinity (and faster than 1/√m), with the strange side-result that the convergence is in 1/√n when the MLE of the global parameter θ is unbiased. Strange to me because unbiasedness is highly dependent on parametrisation while the performances of this estimator should not be, i.e., should be invariant under reparameterisation. Maybe this is due to ζ being a linear transform of θ in the convergence theorem… In any case, I find this question of merging cdf’s from poorly defined approximations to an unknown cdf of the highest interest and look forward any further proposal to this effect!

## patterns of scalable Bayesian inference

Posted in Books, Statistics, University life with tags , , , , , , , , , , , , on February 24, 2016 by xi'an

Elaine Angelino, Matthew Johnson and Ryan Adams just arXived a massive survey of 118 pages on scalable Bayesian inference, which could have been entitled Bayes for Big Data, as this monograph covers state-of-the-art computational approaches to large and complex data structures. I did not read each and every line of it, but I have already recommended it to my PhD students. Some of its material unsurprisingly draws from the recent survey by Rémi Bardenet et al. (2015) I discussed a while ago. It also relates rather frequently to the somewhat parallel ICML paper of Korattikara et al. (2014). And to the firefly Monte Carlo procedure also discussed previously here.

Chapter 2 provides some standard background on computational techniques, Chapter 3 covers MCMC with data subsets, Chapter 4 gives some entries on MCMC with parallel and distributed architectures, Chapter 5 focus on variational solutions, and Chapter 6 is about open questions and challenges.

“Insisting on zero asymptotic bias from Monte Carlo estimates of expectations may leave us swamped in errors from high variance or transient bias.”

One central theme of the paper is the need for approximate solutions, MCMC being perceived as the exact solution. (Somewhat wrongly in the sense that the product of an MCMC is at best an empirical version of the true posterior, hence endowed with a residual and incompressible variation for a given computing budget.) While Chapter 3 stresses the issue of assessing the distance to the true posterior, it does not dwell at all on computing times and budget, which is arguably a much harder problem. Chapter 4 seems to be more aware of this issue since arguing that “a way to use parallel computing resources is to run multiple sequential MCMC algorithms at once [but that this] does not reduce the transient bias in MCMC estimates of posterior expectations” (p.54). The alternatives are to use either prefetching (which was the central theme of Elaine Angelino’s thesis), asynchronous Gibbs with the new to me (?) Hogwild Gibbs algorithms (connected in Terenin et al.’s recent paper, not quoted in the paper), some versions of consensus Monte Carlo covered in earlier posts, the missing links being in my humble opinion an assessment of the worth of those solutions (in the spirit of “here’s the solution, what was the problem again?”) and once again the computing time issue. Chapter 5 briefly discusses some recent developments in variational mean field approximations, which is farther from my interests and (limited) competence, but which appears as a particular class of approximate models and thus could (and should?) relate to likelihood-free methods. Chapter 6 about the current challenges of the field is presumably the most interesting in this monograph in that it produces open questions and suggests directions for future research. For instance, opposing the long term MCMC error with the short term transient part. Or the issue of comparing different implementations in a practical and timely perspective.

## Optimization Monte Carlo: Efficient and embarrassingly parallel likelihood-free inference

Posted in Books, Statistics, Travel with tags , , , , , , , , on December 16, 2015 by xi'an

Ted Meeds and Max Welling have not so recently written about an embarrassingly parallel approach to ABC that they call optimisation Monte Carlo. [Danke Ingmar for pointing out the reference to me.] They start from a rather innocuous rephrasing of the ABC posterior, writing the pseudo-observations as deterministic transforms of the parameter and of a vector of uniforms. Innocuous provided this does not involve an infinite number of uniforms, obviously. Then they suddenly switch to the perspective that, for a given uniform vector u, one should seek the parameter value θ that agrees with the observation y. A sort of Monte Carlo inverse regression: if

y=f(θ,u),

then invert this equation in θ. This is quite clever! Maybe closer to fiducial than true Bayesian statistics, since the prior does not occur directly [only as a weight p(θ)], but if this is manageable [and it all depends on the way f(θ,u) is constructed], this should perform better than ABC! After thinking about it a wee bit more in London, though, I realised this was close to impossible in the realistic examples I could think of. But I still like the idea and want to see if anything at all can be made of this…

“However, it is hard to detect if our optimization succeeded and we may therefore sometimes reject samples that should not have been rejected. Thus, one should be careful not to create a bias against samples u for which the optimization is difficult. This situation is similar to a sampler that will not mix to remote local optima in the posterior distribution.”

Now, the paper does not go that way but keeps the ε-ball approach as in regular ABC, to derive an approximation of the posterior density. For a while I was missing the difference between the centre of the ball and the inverse of the above equation, bottom of page 3. But then I realised the former was an approximation to the latter. When the authors discuss their approximation in terms of the error ε, I remain unconvinced by the transfer of the tolerance to the optimisation error, as those are completely different notions. This also applies to the use of a Jacobian in the weight, which seems out of place since this Jacobian appears in a term associated with (or replacing) the likelihood, f(θ,u), which is then multiplied by the prior p(θ). (Assuming a Jacobian exists, which is unclear when considering most simulation patterns use hard bounds and indicators.) When looking at the toy examples, it however makes sense to have a Jacobian since the selected θ’s are transforms of the u’s. And the p(θ)’s are simply importance weights correcting for the wrong target. Overall, the appeal of the method proposed in the paper remains unclear to me. Most likely because I did not spend enough time over it.