## Advances in scalable Bayesian computation [day #4]

Posted in Books, Mountains, pictures, R, Statistics, University life with tags , , , , , , , , , , , , , , , , , on March 7, 2014 by xi'an

Final day of our workshop Advances in Scalable Bayesian Computation already, since tomorrow morning is an open research time ½ day! Another “perfect day in paradise”, with the Banff Centre campus covered by a fine snow blanket, still falling…, and making work in an office of BIRS a dream-like moment.

Still looking for a daily theme, parallelisation could be the right candidate, even though other talks this week went into parallelisation issues, incl. Steve’s talk yesterday. Indeed, Anthony Lee gave a talk this morning on interactive sequential Monte Carlo, where he motivated the setting by a formal parallel structure. Then, Darren Wilkinson surveyed the parallelisation issues in Monte Carlo, MCMC, SMC and ABC settings, before arguing in favour of a functional language called Scala. (Neat entries to those topics can be found on Darren’s blog.) And in the afternoon session, Sylvia Frühwirth-Schnatter exposed her approach to the (embarrassingly) parallel problem, in the spirit of Steve’s , David Dunson’s and Scott’s (a paper posted on the day I arrived in Chamonix and hence I missed!). There was plenty to learn from that talk (do not miss the Yin-Yang moment at 25 mn!), but it also helped me to break a difficulty I had with the consensus Bayes representation for two weeks (more on that later!). And, while Marc Suchard mostly talked about flu and trees in a very pleasant and broad talk, he also had a slide on parallelisation to fit the theme! While unrelated with parallelism,  Nicolas Chopin’s talk was on sequential quasi-Monte Carlo algorithms: while I had heard previous versions of this talk in Chamonix and BigMC, I found it full of exciting stuff. And it clearly got the room truly puzzled by this possibility, in a positive way! Similarly, Alex Lenkoski spoke about extreme rain events in Norway with no trace of parallelism, but the general idea behind the examples was to question the notion of the calibrated Bayesian (with possible connections with the cut models).

This has been a wonderful week and I am sure the participants got as much as I did from the talks and the informal exchanges. Thanks to BIRS for the sponsorship and the superb organisation of the week (and to the Banff Centre for providing such a paradisical environment). I feel very privileged to have benefited from this support, even though I deadly hope to be back in Banff within a few years.

## parallel MCMC via Weirstrass sampler (a reply by Xiangyu Wang)

Posted in Books, Statistics, University life with tags , , , , , , , , , , , , on January 3, 2014 by xi'an

Almost immediately after I published my comments on his paper with David Dunson, Xiangyu Wang sent a long comment that I think worth a post on its own (especially, given that I am now busy skiing and enjoying Chamonix!). So here it is:

Thanks for the thoughtful comments. I did not realize that Neiswanger et al. also proposed the similar trick to avoid combinatoric problem as we did for the rejection sampler. Thank you for pointing that out.

For the criticism 3 on the tail degeneration, we did not mean to fire on the non-parametric estimation issues, but rather the problem caused by using the product equation. When two densities are multiplied together, the accuracy of the product mainly depends on the tail of the two densities (the overlapping area), if there are more than two densities, the impact will be more significant. As a result, it may be unwise to directly use the product equation, as the most distant sub-posteriors could be potentially very far away from each other, and most of the sub posterior draws are outside the overlapping area. (The full Gibbs sampler formulated in our paper does not have this issue, as shown in equation 5, there is a common part multiplied on each sub-posterior, which brought them close.)

Point 4 stated the problem caused by averaging. The approximated density follows Neiswanger et al. (2013) will be a mixture of Gaussian, whose component means are the average of the sub-posterior draws. Therefore, if sub-posteriors stick to different modes (assuming the true posterior is multi-modal), then the approximated density is likely to mess up the modes, and produce some faked modes (eg. average of the modes. We provide an example in the simulation 3.)

Sorry for the vague description of the refining method (4.2). The idea is kinda dull. We start from an initial approximation to θ and then do one step Gibbs update to obtain a new θ, and we call this procedure ‘refining’, as we believe such process would bring the original approximation closer to the true posterior distribution.

The first (4.1) and the second (4.2) algorithms do seem weird to be called as ‘parallel’, since they are both modified from the Gibbs sampler described in (4) and (5). The reason we want to propose these two algorithms is to overcome two problems. The first is the dimensionality curse, and the second is the issue when the subset inferences are not extremely accurate (subset effective sample size small) which might be a common scenario for logistic regression (with large parameters) even with huge data set. First, algorithm (4.1) and (4.2) both start from some initial approximations, and attempt to improve to obtain a better approximation, thus avoid the dimensional issue. Second, in our simulation 1, we attempt to pull down the performance of the simple averaging by worsening the sub-posterior performance (we allocate smaller amount of data to each subset), and the non-parametric method fails to approximate the combined density as well. However, the algorithm 4.1 and 4.2 still work in this case.

I have some problem with the logistic regression example provided in Neiswanger et al. (2013). As shown in the paper, under the authors’ setting (not fully specified in the paper), though the non-parametric method is better than simple averaging, the approximation error of simple averaging is small enough for practical use (I also have some problem with their error evaluation method), then why should we still bother to use a much more complicated method?

Actually I’m adding a new algorithm into the Weierstrass rejection sampling, which will render it thoroughly free from the dimensionality curse of p. The new scheme is applicable to the nonparametric method in Neiswanger et al. (2013) as well. It should appear soon in the second version of the draft.

## parallel MCMC via Weirstrass sampler

Posted in Books, Statistics, University life with tags , , , , , , , , on January 2, 2014 by xi'an

During O’Bayes 2013, Xiangyu Wang and David Dunson arXived a paper (with the above title) that David then presented on the 19th.  The setting is quite similar to the recently discussed embarrassingly parallel paper of Neiswanger et al., in that Xiangyu and David start from the same product representation of the target (posterior). Namely,

$p(\theta|x) = \prod_{i=1}^m p_i(\theta|x).$

However, they criticise the choice made by Neiswanger et al to use MCMC approximations to each component of the product for the following reasons:

1. Curse of dimensionality in the number of parameters p
2. Curse of dimensionality in the number of subsets m
3. Tail degeneration
4. Support inconsistency and mode misspecification Continue reading

## Asymptotically Exact, Embarrassingly Parallel MCMC

Posted in Books, Statistics, University life with tags , , , , , , on November 26, 2013 by xi'an

Willie Neiswanger, Chong Wang, and Eric Xing (from CMU) recently arXived a paper entitled as above. The “embarrassing” in the title refers to the “embarrassingly simple” solution proposed therein, namely to solve the difficulty in handling very large datasets by running completely independent parallel MCMC samplers on parallel threads or computers and using the outcomes of those samplers as density estimates, pulled together as a product towards an approximation of the true posterior density. In other words, the idea is to break the posterior as

$p(\theta|x) = \prod_{i=1}^m p_i(\theta|x)$

and to use the estimate

$\hat p(\theta|x) = \prod_{i=1}^m \hat p_i(\theta|x)$

where the individual estimates are obtained by, say, non-parametric estimates. The method is then “asymptotically exact” in the weak (and unsurprising) sense of being converging in the number of MCMC iterations. Still, there is a theoretical justification that is not found in previous parallel methods that mixed all resulting samples without accounting for the subsampling. And I also appreciate the point that, in many cases, running MCMC samplers with subsamples produces faster convergence.

In the paper, the division of p into its components is done by partitioning the iid data into m subsets. And taking a power 1/m of the prior in each case. (Which may induce improperness issues.) However, the subdivision is arbitrary and can thus be implemented in other cases than the fairly restrictive iid setting. Because each (subsample)  non-parametric estimate involves T terms, the resulting overall estimate contains Tm terms and the authors suggest using an independent Metropolis-within-Gibbs sampler to handle this complexity. Which is necessary [took me a while to realise this!] for producing a final sample from the (approximate) true posterior distribution. As an aside, I wonder why the bandwidths are all equal across all subsamples, as they should depend on those. And as it would not make much of a difference. It would also be interesting to build a typology of cases where subsampling leads to subposteriors that are close to orthogonal, preventing the implementation of the method.

As it happened, I read this paper on the very day Nial Friel (University College Dublin) gave a seminar at the Big’MC seminar on the convergence of approximations to ergodic transition kernels, based on the recent results of Mitrophanov on the stability of Markov chains, where he introduced the topic by the case of datasets large enough to prevent the computation of the likelihood function.

## another vanilla Rao-Blackwellisation

Posted in Statistics, University life with tags , , , , , on September 16, 2013 by xi'an

In the latest issue of Statistics and Computing (2013, Issue 23, pages 577-587), Iliopoulos and Malefaki published a paper that relates to our vanilla Rao-Blackwellisation paper with Randal Douc. The idea is to derive another approximation to the ideal importance sampling weight using the “accepted” Markov chain. (With Randal, we had a Bernoulli factory representation.) The density g(x) of the accepted chain being unknown; it is represented as the expectation under π of the function

$\min\left\{q(z|x)/\pi(z),q(x|z)/\pi(x)\right\}$

and hence estimated by a self-normalised average based on the whole Markov chain. This means the resulting importance estimate uses twice the output of the algorithm and that it is biased and of order O(n²), thus the same order as our original Rao-Blackwellised estimator (Robert & Casella, 1996)… This also means convergence and CLT are very hard to establish: the main convergence theorem of the paper holds only for finite state spaces, with a surprising smaller asymptotic variance for this self-normalised average than for the ideal importance sampling estimator in the independent Metropolis-Hastings case. (Both are biased by being self-normalised and the paper does not consider the magnitude of those biases.)

Interestingly, the authors also ran a comparison with our parallelised Rao-Blackwellised version (with Pierre Jacob and Murray Smith), but conclude (P.58) at a larger CPU (should be GPU!!) required by the parallelisation, which does not really make sense: when compared with the plain Metropolis-Hastings implementation, run on a single processor, the parallel version only requires an extra random permutation per thread or per processor. I thus suspect a faulty implementation that induces this CPU being linear in the size of the blocks, like maybe only saving one output per block… Also interestingly, the paper re-analyses the Pima Indian probit model Jean-Michel Marin and I (and many others) used as benchmark in several of our papers. As in the most standard examples, the outcome shows a mild reduction in variance when using this estimated importance sampling version. Maybe a comparison with the ideal importance sampler (i.e. the one that does not divide by the sum of the weights since using normalised versions of the target and importance densities) would have helped in the comparison.