I am curious about the “post-MCMC era” statement. Do you mean that, in your opinion, advances in e.g. sequential Monte Carlo, Hamiltonian Monte Carlo and ABC are *already* fully replacing classic MCMC approaches (including adaptive MCMC)? I am not sure we are there (yet). For example, the elegant particle MCMC methodology of Andrieu-Doucet-Holenstein, which can be used as a “gold standard” (when computationally feasible) for fairly complex dynamical models, is an SMC step embedded within an MCMC step; so we are not really avoiding the MCMC toolbox entirely. I do agree that in some cases we can perform inference with other (non-MCMC) methodologies, but are we already in the post-MCMC era?

Great, thanks! I am busy skiing now but will turn it into a post asap!!!

Regarding criticism 3 on tail degeneration: we did not mean to take aim at the non-parametric estimation issues, but rather at the problem caused by using the product equation. When two densities are multiplied together, the accuracy of the product depends mainly on the tails of the two densities (the overlapping area); with more than two densities the impact is even more significant. As a result, it may be unwise to use the product equation directly, as the most distant sub-posteriors could potentially be very far from each other, with most of the sub-posterior draws falling outside the overlapping area. (The full Gibbs sampler formulated in our paper does not have this issue: as shown in equation 5, a common part multiplies each sub-posterior, which brings them close together.)
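A small numerical sketch of this tail issue, assuming (purely for illustration) two Gaussian sub-posteriors centred far apart; the values and the interval [-2, 2] are made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical sub-posteriors, N(-3, 1) and N(+3, 1).  Their product
# is proportional to N(0, 0.5), which puts essentially all its mass in
# [-2, 2] -- the far tails of both factors.
draws_a = rng.normal(-3.0, 1.0, 10_000)
draws_b = rng.normal(+3.0, 1.0, 10_000)

# Fraction of each sub-posterior's draws that land where the product
# density actually lives.
frac_a = np.mean((draws_a > -2) & (draws_a < 2))
frac_b = np.mean((draws_b > -2) & (draws_b < 2))
print(frac_a, frac_b)  # both roughly 0.16: only tail draws are usable
```

So with just two well-separated sub-posteriors, some 84% of the draws from each are wasted for the product; with more subsets the effective overlap shrinks further.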

Point 4 states the problem caused by averaging. The approximated density following Neiswanger et al. (2013) is a mixture of Gaussians whose component means are averages of the sub-posterior draws. Therefore, if the sub-posteriors stick to different modes (assuming the true posterior is multi-modal), the approximated density is likely to mix up the modes and produce spurious ones (e.g. averages of the true modes; we provide an example in Simulation 3).
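A toy illustration of the spurious-mode effect, assuming (hypothetically) a bimodal target with modes at -2 and +2 and two sub-posterior chains that each get stuck in a different mode:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sub-posterior draws: chain 1 explores only the mode at -2,
# chain 2 only the mode at +2 (mode widths chosen arbitrarily).
draws_1 = rng.normal(-2.0, 0.3, 5_000)
draws_2 = rng.normal(+2.0, 0.3, 5_000)

# Simple averaging pairs draw i of each chain by its mean, as in the
# Gaussian-mixture component means described above.
combined = (draws_1 + draws_2) / 2.0
print(combined.mean())  # ~ 0: mass piles up at 0, which is not a mode
                        # of the true bimodal posterior
```

The combined draws concentrate near 0, a "mode" the true posterior does not have, while the genuine modes at -2 and +2 vanish.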

Sorry for the vague description of the refining method (4.2). The idea is rather plain: we start from an initial approximation and then do a one-step Gibbs update to obtain a new one. We call this procedure ‘refining’, as we believe it brings the original approximation closer to the true posterior distribution.

The first (4.1) and second (4.2) algorithms do indeed seem odd to call ‘parallel’, since both are modified from the Gibbs sampler described in (4) and (5). We propose these two algorithms to overcome two problems. The first is the curse of dimensionality; the second arises when the subset inferences are not very accurate (small subset effective sample size), which may be a common scenario for logistic regression (with many parameters) even with a huge data set. First, algorithms (4.1) and (4.2) both start from some initial approximation and attempt to improve it into a better one, thus avoiding the dimensionality issue. Second, in our Simulation 1 we deliberately degrade the performance of simple averaging by worsening the sub-posterior quality (we allocate a smaller amount of data to each subset), and the non-parametric method then also fails to approximate the combined density. However, algorithms 4.1 and 4.2 still work in this case.

PS: I have some concerns about the logistic regression example in Neiswanger et al. (2013). Under the authors’ setting (not fully specified in the paper), although the non-parametric method is better than simple averaging, the approximation error of simple averaging is already small enough for practical use (I also have some reservations about their error evaluation method), so why should we bother with a much more complicated method?
