parallel MCMC via Weierstrass sampler

During O’Bayes 2013, Xiangyu Wang and David Dunson arXived a paper (with the above title) that David then presented on the 19th. The setting is quite similar to the recently discussed embarrassingly parallel paper of Neiswanger et al., in that Xiangyu and David start from the same product representation of the target (posterior). Namely,

p(\theta|x) \propto \prod_{i=1}^m p_i(\theta|x_i),

where x_i denotes the i-th subset of the data and p_i the corresponding subposterior.
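As a toy check of this factorisation (my own sketch, not from either paper; the 1/m prior-splitting convention and all variable names are assumptions), take a Gaussian model where each subposterior carries the prior raised to the power 1/m, so that the product of the m subposteriors recovers the full posterior exactly:

```python
import numpy as np

# Toy model: x_j ~ N(theta, 1) with prior theta ~ N(0, 100).
# Split the n observations into m subsets; each subposterior uses the
# prior raised to the power 1/m, so the product of the m subposteriors
# equals the full posterior up to a normalising constant.
rng = np.random.default_rng(0)
n, m = 1000, 4
x = rng.normal(2.0, 1.0, size=n)
subsets = np.array_split(x, m)

prior_prec, prior_mean = 1.0 / 100.0, 0.0

def subposterior(data):
    # Gaussian conjugacy: fractional prior precision plus data precision
    prec = prior_prec / m + len(data)
    mean = (prior_prec / m * prior_mean + data.sum()) / prec
    return mean, prec

# Product of Gaussians: precisions add, means combine precision-weighted
means, precs = zip(*(subposterior(s) for s in subsets))
prec_comb = sum(precs)
mean_comb = sum(mu * p for mu, p in zip(means, precs)) / prec_comb

# Full posterior computed directly on all the data, for comparison
prec_full = prior_prec + n
mean_full = (prior_prec * prior_mean + x.sum()) / prec_full
assert np.isclose(mean_comb, mean_full) and np.isclose(prec_comb, prec_full)
```

The 1/m power on the prior is one common convention for defining the subposteriors; without it, the product would count the prior m times.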

However, they criticise the choice made by Neiswanger et al to use MCMC approximations to each component of the product for the following reasons:

  1. Curse of dimensionality in the number of parameters p
  2. Curse of dimensionality in the number of subsets m
  3. Tail degeneration
  4. Support inconsistency and mode misspecification

While I agree on point 1 (although there may be other ways than kernel estimation to mix samples from the terms in the product, terms Neiswanger et al. called the subposteriors), which is also a drawback of the current method, point 2 is not such a clear-cut drawback. I first had the same reaction about the T^m explosion in the number of terms in a product of m sums of T terms, but Neiswanger et al. use a clever trick to avoid the combinatorial explosion, namely operating on one mixture component at a time. And having non-manageable targets is not such an issue in the post-MCMC era, is it?! Point 3 is formally correct, in that the kernel tail behaviour drives the tail behaviour of the kernel estimate, most likely disconnected from the tail behaviour of the true target, but this feature is true of any non-parametric estimate, I believe, even of the Weierstraß transform, and hence maybe not so relevant in practice. In fact, my operational intuition is that, by lifting the tails up, simulation from the subposteriors should help in visiting the tails of the true target. (The proofs of convergence and the bounds rely on L1 norms, which are not particularly concerned with tails.) At last, point 4 does not seem to be life-threatening: assuming the true target can be computed up to a normalising constant, the value of the target could be computed for every simulated parameter, eliminating those outside the support of the product and highlighting modal regions. (The paper does mention the improvement brought by computing those target values.)
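For concreteness, here is a minimal sketch of my reading of that one-component-at-a-time trick (all names, tuning values and the univariate Gaussian samples are made up): the product of m Gaussian kernel density estimates is a mixture of T^m Gaussians, but a Gibbs sweep over the m component indices never needs to enumerate them:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical subposterior samples: m sets of T univariate draws
m, T, h = 3, 500, 0.1
samples = [rng.normal(mu, 1.0, T) for mu in (0.1, -0.1, 0.05)]

# For fixed indices (t_1,...,t_m), the product of the Gaussian kernels
# N(x; x_{i,t_i}, h^2) is N(x; mean of the selected points, h^2/m),
# with weight proportional to exp(-sum_i (x_{i,t_i} - mean)^2 / (2 h^2)).
t = [0] * m                      # current component indices
draws = []
for _ in range(2000):
    for i in range(m):           # Gibbs update of one index at a time
        others = [samples[j][t[j]] for j in range(m) if j != i]
        cand = samples[i]        # all T candidate points for slot i
        means = (np.sum(others) + cand) / m
        sq = (cand - means) ** 2 + sum((o - means) ** 2 for o in others)
        w = np.exp(-(sq - sq.min()) / (2 * h * h))
        t[i] = rng.choice(T, p=w / w.sum())
    sel = [samples[j][t[j]] for j in range(m)]
    draws.append(rng.normal(np.mean(sel), h / np.sqrt(m)))
draws = np.array(draws)
```

Each sweep costs O(mT) rather than O(T^m), which is the point of the trick.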

The Weierstraß transform of a density f is the convolution of f with an arbitrary kernel K. The authors propose to simulate from the product of the Weierstraß transforms, using a multi-tiered Gibbs sampler. This way, the parameter is only simulated once and from a controlled kernel, while the random effects from the convolution are each related to one subposterior. While the method requires coordination between the parallel threads, each component of the target is computed separately on its own thread. (I do not get the refinement in Section 4.2, about how “we can regard [θ] as a hyper-parameter, which allows to draw samples of [random effects] hierarchically”. Once a cycle is completed, all threads must wait for the common θ to be updated.) Maybe the clearest perspective on the Weierstraß transform is the rejection sampling version (Section 4.3), where simulations from the subposteriors are merged into a normal proposal on θ, to be accepted with a probability depending on the subposterior simulations.
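To fix ideas, here is a toy version of this Gibbs sampler under the (assumed) Gaussian-kernel case, with made-up Gaussian subposteriors so the full conditionals are available in closed form: θ given the latent z’s is Gaussian around their mean (the coordinated step), while each z_i given θ only involves its own subposterior, the step that can run on its own thread:

```python
import numpy as np

rng = np.random.default_rng(2)
# Made-up Gaussian subposteriors p_i = N(mu_i, s2_i)
mu = np.array([1.8, 2.1, 2.0, 1.9])
s2 = np.array([0.04, 0.05, 0.04, 0.06])
m, h2 = len(mu), 0.01            # h2 = squared kernel bandwidth

# Joint target: prod_i N(theta; z_i, h2) p_i(z_i); integrating out the
# latent z_i recovers the product of the Weierstrass transforms of the p_i.
theta, z = 0.0, mu.copy()
draws = []
for it in range(5000):
    # theta | z_1..z_m ~ N(mean(z), h2/m): the only coordinated step
    theta = rng.normal(z.mean(), np.sqrt(h2 / m))
    # z_i | theta  ~  p_i(z_i) N(z_i; theta, h2): embarrassingly parallel
    prec = 1.0 / s2 + 1.0 / h2
    mean = (mu / s2 + theta / h2) / prec
    z = rng.normal(mean, np.sqrt(1.0 / prec))
    if it >= 500:                # drop burn-in
        draws.append(theta)
draws = np.array(draws)
```

The chain targets the product of the smoothed subposteriors N(μ_i, σ_i² + h²), so a small bandwidth keeps the smoothing bias small at the cost of stronger coupling between θ and the z_i.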

There is something puzzling and vaguely familiar about the Weierstraß transform, something I cannot really put my finger on. It reminds me of the (great) convolution trick Jacques Neveu used to prove the Central Limit Theorem in his École Polytechnique lecture notes. But it also reminds me of West and Harrison’s notion of a time-evolving parameter (a notion that took [me] a while to sink in…). Definitely interesting!

4 Responses to “parallel MCMC via Weierstrass sampler”

  1. Hi,
    I am curious about the “post-MCMC era” statement. Do you mean that, in your opinion, advances in e.g. sequential Monte Carlo, Hamiltonian Monte Carlo and ABC are *already* fully replacing classic MCMC approaches (including adaptive MCMC)? I am not sure we are there (yet). For example, the elegant particle MCMC methodology of Andrieu, Doucet and Holenstein, which can be used as a “gold standard” (when computationally feasible) for fairly complex dynamical models, is an SMC step embedded within an MCMC step; so we are not really avoiding the MCMC toolbox altogether. I do agree that in some cases we can perform inference using other (non-MCMC) methodologies, but are we already in the post-MCMC era?

  2. Xiangyu Wang Says:

    er… sorry for typing so many words… Maybe I was just too excited, as you made so many insightful comments. Actually, I am adding a new algorithm to the Weierstrass rejection sampling that will render it entirely free from the curse of dimensionality in p. The new scheme is applicable to the non-parametric method of Neiswanger et al. (2013) as well. It should appear soon in the second version of the draft. Hopefully you will be interested.

  3. Xiangyu Wang Says:

    Hi, Xi’an, thanks for the thoughtful comments. I did not realize that Neiswanger et al. also proposed a trick similar to ours for avoiding the combinatorial problem in the rejection sampler. Thank you for pointing that out.

    Regarding criticism 3, on tail degeneration: we did not mean to take aim at non-parametric estimation issues, but rather at the problem caused by using the product equation. When two densities are multiplied together, the accuracy of the product mainly depends on the tails of the two densities (the overlapping area); with more than two densities, the impact is even more significant. As a result, it may be unwise to use the product equation directly, as the most distant sub-posteriors could potentially be very far from each other, with most of the sub-posterior draws falling outside the overlapping area. (The full Gibbs sampler formulated in our paper does not have this issue: as shown in equation 5, a common part multiplies each sub-posterior, which brings them closer.)
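A quick toy check of this overlap effect (made-up numbers, not from the paper): reweighting draws from one subposterior by a distant second one, as the product rule requires, leaves only a handful of effective samples.

```python
import numpy as np

rng = np.random.default_rng(3)
# Two hypothetical "subposteriors" with distant modes: N(-3, 1) and N(3, 1)
a = rng.normal(-3.0, 1.0, 10000)   # draws from the first

# Under the product rule, draws from the first must carry the second
# density as a weight; the weight mass lives in the thin overlap region.
logw = -0.5 * (a - 3.0) ** 2
w = np.exp(logw - logw.max())
w /= w.sum()
ess = 1.0 / np.sum(w ** 2)         # effective sample size collapses
print(f"effective sample size: {ess:.1f} out of {a.size}")
```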

    Point 4 states the problem caused by averaging. The approximate density following Neiswanger et al. (2013) is a mixture of Gaussians whose component means are averages of sub-posterior draws. Therefore, if the sub-posteriors stick to different modes (assuming the true posterior is multi-modal), the approximate density is likely to mix up the modes and produce fake modes (e.g., averages of the true modes; we provide an example in simulation 3).

    Sorry for the vague description of the refining method (4.2). The idea is rather plain: we start from an initial approximation to \theta and then perform a one-step Gibbs update to obtain a new \theta. We call this procedure ‘refining’, as we believe it brings the original approximation closer to the true posterior distribution.

    The first (4.1) and second (4.2) algorithms do seem odd to call ‘parallel’, since they are both modifications of the Gibbs sampler described in (4) and (5). The reason we propose these two algorithms is to overcome two problems. The first is the curse of dimensionality; the second is the situation where the subset inferences are not very accurate (small subset effective sample size), which might be a common scenario for logistic regression (with many parameters) even with a huge data set. First, algorithms 4.1 and 4.2 both start from an initial approximation and attempt to improve on it, thus avoiding the dimensionality issue. Second, in our simulation 1 we deliberately degrade the sub-posterior performance (by allocating a smaller amount of data to each subset) in order to pull down simple averaging, and the non-parametric method then fails to approximate the combined density as well. However, algorithms 4.1 and 4.2 still work in this case.

    PS: I have some problems with the logistic regression example provided in Neiswanger et al. (2013). As shown in the paper, under the authors’ setting (not fully specified in the paper), though the non-parametric method is better than simple averaging, the approximation error of simple averaging is small enough for practical use (I also have some problems with their error evaluation method), so why should one still bother with a much more complicated method?
