Unbiased Bayes for Big Data: Path of partial posteriors
“Data complexity is sub-linear in N, no bias is introduced, variance is finite.”
Heiko Strathman, Dino Sejdinovic and Mark Girolami have arXived a few weeks ago a paper on the use of a telescoping estimator to achieve an unbiased estimator of a Bayes estimator relying on the entire dataset, while using only a small proportion of the dataset. The idea is that a sequence converging—to an unbiased estimator—of estimators φt can be turned into an unbiased estimator by a stopping rule T:
is indeed unbiased. In a “Big Data” framework, the components φt are MCMC versions of posterior expectations based on a proportion αt of the data. And the stopping rule cannot exceed αt=1. The authors further propose to replicate this unbiased estimator R times on R parallel processors. They further claim a reduction in the computing cost of
which means that a sub-linear cost can be achieved. However, the gain in computing time means higher variance than for the full MCMC solution:
“It is clear that running an MCMC chain on the full posterior, for any statistic, produces more accurate estimates than the debiasing approach, which by construction has an additional intrinsic source of variance. This means that if it is possible to produce even only a single MCMC sample (…), the resulting posterior expectation can be estimated with less expected error. It is therefore not instructive to compare approaches in that region. “
I first got a “free lunch” impression when reading the paper, namely it sounded like using a random stopping rule was enough to overcome unbiasedness and large size jams. This is not the message of the paper, but I remain both intrigued by the possibilities the unbiasedness offers and bemused by the claims therein, for several reasons:
- the above estimator requires computing T MCMC (partial) estimators φt in parallel. All of those estimators have to be associated with Markov chains in a stationary regime and they all are associated with independent chains. While addressing the convergence of a single chain, the paper does not truly cover the simultaneous convergence assessment on a group of T parallel MCMC sequences. And the paragraph below is further confusing me as it seems to imply that convergence is not that important thanks to the de-biasing equation. In fact, further discussion with the authors (!) led me to understand this relates to the existing alternatives for handling large data, like firefly Monte Carlo: Convergence to the stationary remains essential (and somewhat problematic) for all the partial estimators.
“If a Markov chain is, in line with above considerations, used for computing partial posterior expectations 𝔼πt[ϕ(θ)], it need not be induced by any form of approximation, noise injection, or state-space augmentation of the transition kernel. As a result, the notorious difficulties of ensuring acceptable mixing and problems of stickiness are conveniently side-stepped –which is in sharp contrast to all existing approaches.”
- the impact of the distribution of the stopping time T over the performances of the estimator is uncanny! Its tail should simply decreases more slowly than the square difference between the partial estimators. This requirement is both hard to achieve [given that the variances of the—partial—MCMC estimators are hard to assess] and with negative consequences on the overall computing time. The choice of the tail rate is thus quite delicate to validate against the variance constraints (2) and (3).
- the stopping time T must have a positive probability to take the largest possible value, corresponding to using the whole sample, in which case the approach gets worse than using directly the original MCMC algorithm, as noted on the first quote above. This shows (as stated in the first quote above) the approach cannot uniformly improve upon the standard MCMC.
- the comparison in the (log-)normal toy example is difficult to calibrate. (And why differentiating a log normal from a normal sample? and why are the tails extremely wide for 2²⁶ datapoints?) The number of likelihood evaluations is 5e-4 times smaller for the de-biased version, hence means a humongous gain in computing time, but how does this partial exploration of the information contained in the data impact the final variability of the estimate? If I judge from Figure 4 (top) after 300 replications, one still observes a range of 0.1 at the end of the 300 iterations. If I could produce the Bayes estimate for the whole data, the variability would be of order 2e-4… If the goal is to find an estimate of the parameter of interest, with a predetermined error, this is fine. But then the comparison with the genuine Bayesian answer is not very meaningful.
- when averaging the unbiased estimators R times, it is unclear whether or not the same subset of nt datapoints is used to compute the partial estimator φt. On the one hand, using different subsets should improve the connection with the genuine Bayes estimator. On the other hand, this may induce higher storage and computing costs.
- cases when the likelihood does not factorise are not always favourable to the de-biasing approach (Section 5.1) in that the availability of the joint density of a subset of the whole data may prove an issue. Take for instance a time series or a graphical network. Computing the joint density of a subset requires stringent conditions on the way the subset is selected.
Overall, I think I understand the purpose of the paper better now I have read it a few times. The comparison is only relevant against other limited information solutions. Not against the full
Monty MCMC. However, in the formal examples processed in the paper, a more direct approach would be to compute (in parallel) MCMC estimates for increasing portions of the data, add a dose of bootstrap to reduce bias, and check for stabilisation of the averages.