## Unbiased Bayes for Big Data: Path of partial posteriors [a reply from the authors]

Posted in Statistics, University life on February 27, 2015 by xi'an

[Here is a reply by Heiko Strathmann to my post of yesterday. Along with the slides of a talk in Oxford mentioned in the discussion.]

Thanks for putting this up, and thanks for the discussion. Christian, as already exchanged via email, here are some answers to the points you make.

First of all, we don’t claim a free lunch — and we are honest about the limitations of the method (see the negative examples). Rather, we make the point that we can achieve computational savings in certain situations — essentially by exploiting redundancy (what Michael called “tall” data in his note on subsampling & HMC) leading to fast convergence of posterior statistics.

Dan is of course correct in noticing that if the posterior statistic does not converge nicely (i.e., all data counts), then the truncation time is “mammoth”. It is also correct that it might be questionable to aim for an unbiased Bayesian method in the presence of such redundancies. However, these are the two extreme perspectives on the topic. The message that we want to get across is that there is a trade-off in between these extremes. In particular, the GP examples illustrate this nicely, as we are able to reduce the MSE in a regime where posterior statistics have *not* yet stabilised, see e.g. figure 6.

“And the following paragraph is further confusing me as it seems to imply that convergence is not that important thanks to the de-biasing equation.”

To clarify, the paragraph refers to the additional convergence issues induced by alternative Markov transition kernels of mini-batch-based full posterior sampling methods by Welling, Bardenet, Dougal & co. For example, Firefly MC’s mixing time is increased by a factor of 1/q where q*N is the mini-batch size. Mixing of stochastic gradient Langevin gets worse over time. This is not true for our scheme as we can use standard transition kernels. It is still essential for the partial posterior Markov chains to converge (if MCMC is used). However, as this is a well studied problem, we omit the topic in our paper and refer to standard tools for diagnosis. All this is independent of the debiasing device.

Yesterday in Oxford, Pierre Jacob pointed out that if MCMC is used for estimating partial posterior statistics, the overall result is not unbiased. We had a nice discussion of how this bias could be addressed via a two-stage debiasing procedure: debiasing the MC estimates as described in the “Unbiased Monte Carlo” paper by Agapiou et al., and then plugging those into the path estimators — though it is not (yet) so clear how (and whether) this would work in our case.
In the current version of the paper, we do not address the bias due to MCMC (we have a paragraph on this in Section 3.2); rather, we start from the premise that full posterior MCMC samples are the gold standard. Furthermore, the framework we study is not necessarily linked to MCMC – it could be that the posterior expectation is available in closed form but is simply costly in N. In this case, we can still estimate this posterior expectation unbiasedly – see the GP regression example.

“The choice of the tail rate is thus quite delicate to validate against the variance constraints (2) and (3).”

It is true that the choice is crucial in order to control the variance. However, provided that partial posterior expectations converge at a rate n^{-β}, with n the size of a mini-batch, the computational complexity can be reduced to N^{1-α} (for α<β) without the variance exploding. There is a trade-off: the faster the posterior expectations converge, the more computation can be saved; β is in general unknown, but it can be roughly estimated with the “direct approach”, as described in the appendix.
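For a rough back-of-the-envelope version of this trade-off (an illustration only, under simplifying assumptions that are not a statement of the paper's exact conditions: mini-batch sizes growing geometrically as n_t ∝ 2^t, a truncation tail P(T ≥ t) ≈ 2^{-αt}, and increment variances E[(φ_t - φ_{t-1})²] decaying like n_t^{-β}), the expected cost and variance behave like

$\mathbb{E}[\text{cost}]\asymp\sum_{t}\mathbb{P}(T\ge t)\,n_t\asymp\sum_t 2^{(1-\alpha)t}=\mathcal{O}(N^{1-\alpha}),\qquad \mathrm{Var}\lesssim\sum_t\frac{\mathbb{E}[(\varphi_t-\varphi_{t-1})^2]}{\mathbb{P}(T\ge t)}\asymp\sum_t 2^{(\alpha-\beta)t},$

so the variance remains finite exactly when α<β: the faster the partial posterior statistic stabilises, the heavier the truncation (larger α) one can afford, and the larger the computational saving.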

About the “direct approach”
It is true that for certain classes of models and φ functionals, the direct averaging of expectations for increasing data sizes yields good results (see the log-normal example), and we state this. However, the GP regression experiments show that the direct averaging gives a larger MSE than with debiasing applied. This is exactly the trade-off mentioned earlier.

I also wonder what people think about the comparison to stochastic variational inference (GP for Big Data), as this hasn’t appeared in discussions yet. It is the comparison to “non-unbiased” schemes that Christian and Dan asked for.

## Unbiased Bayes for Big Data: Path of partial posteriors

Posted in Statistics, University life on February 26, 2015 by xi'an

“Data complexity is sub-linear in N, no bias is introduced, variance is finite.”

Heiko Strathmann, Dino Sejdinovic and Mark Girolami arXived a few weeks ago a paper on the use of a telescoping estimator to achieve an unbiased estimator of a Bayes estimator relying on the entire dataset, while using only a small proportion of the dataset. The idea is that a converging sequence of estimators φ_t can be turned into an unbiased estimator of its limit by a random stopping rule T:

$\sum_{t=1}^T \dfrac{\varphi_t-\varphi_{t-1}}{\mathbb{P}(T\ge t)}$

is indeed unbiased. In a “Big Data” framework, the components φ_t are MCMC approximations of posterior expectations based on a proportion α_t of the data, and the stopping rule cannot exceed the point where α_t=1, i.e., the full dataset. The authors further propose to replicate this unbiased estimator R times on R parallel processors. They further claim a reduction in the computing cost of

$\mathcal{O}(N^{1-\alpha})\qquad\text{if}\qquad\mathbb{P}(T=t)\approx e^{-\alpha t}$

which means that a sub-linear cost can be achieved. However, the gain in computing time comes at the price of a higher variance than for the full MCMC solution:

“It is clear that running an MCMC chain on the full posterior, for any statistic, produces more accurate estimates than the debiasing approach, which by construction has an additional intrinsic source of variance. This means that if it is possible to produce even only a single MCMC sample (…), the resulting posterior expectation can be estimated with less expected error. It is therefore not instructive to compare approaches in that region. “
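To fix ideas about the construction, here is a minimal sketch (mine, not the authors’) of one replicate of the debiased estimator, assuming a geometric truncation law P(T ≥ t) = (1-α)^{t-1} and a placeholder function partial_estimate(n) standing for whatever partial posterior statistic (e.g., an MCMC average over the first n observations) is being computed:

```python
import numpy as np

def debiased_estimate(partial_estimate, batch_sizes, alpha, rng):
    """One replicate of the telescoping (debiased) estimator.

    partial_estimate(n): placeholder for the partial posterior statistic
        computed from the first n observations (e.g. an MCMC average).
    batch_sizes: increasing data sizes n_1 < n_2 < ... (the last one uses all N points).
    alpha: parameter of the geometric truncation law, P(T >= t) = (1-alpha)**(t-1).
    """
    t_max = len(batch_sizes)
    # random truncation time; capping at t_max only moves mass onto the full-data
    # term, so the estimator stays unbiased for the full-data statistic
    T = min(rng.geometric(alpha), t_max)
    estimate, previous = 0.0, 0.0
    for t in range(1, T + 1):
        phi_t = partial_estimate(batch_sizes[t - 1])
        tail = (1.0 - alpha) ** (t - 1)   # P(T >= t)
        estimate += (phi_t - previous) / tail
        previous = phi_t
    return estimate

# usage sketch: average R independent replicates, e.g. one per processor
# rng = np.random.default_rng(0)
# sizes = [2 ** k for k in range(5, 15)]
# replicates = [debiased_estimate(my_partial_posterior_mean, sizes, 0.3, rng)
#               for _ in range(R)]
```

Averaging the R independent replicates then yields the final estimator, whose variance decreases as 1/R.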

I first got a “free lunch” impression when reading the paper, namely that it sounded like using a random stopping rule was enough to achieve unbiasedness and to overcome the jams caused by large sample sizes. This is not the message of the paper, but I remain both intrigued by the possibilities unbiasedness offers and bemused by the claims therein, for several reasons.

## Bayesian filtering and smoothing [book review]

Posted in Books, Statistics, Travel, University life on February 25, 2015 by xi'an

When in Warwick last October, I met Simo Särkkä, who told me he had published an IMS monograph on Bayesian filtering and smoothing the year before. I thought it would be an appropriate book to review for CHANCE and tried to get a copy from Cambridge University Press, unsuccessfully. I thus bought my own copy, which I received two weeks ago, and took the opportunity of my Czech vacations to read it… [A warning pre-empting accusations of self-plagiarism: this is a preliminary draft for a review to appear in CHANCE under my true name!]

“From the Bayesian estimation point of view both the states and the static parameters are unknown (random) parameters of the system.” (p.20)

Bayesian filtering and smoothing is an introduction to the topic that essentially starts from ground zero. Chapter 1 motivates the use of filtering and smoothing through examples and highlights the naturally Bayesian approach to the problem(s). Two graphs illustrate the difference between filtering and smoothing by plotting, for the same series of observations, the successive confidence bands. The performances are obviously poorer with filtering, but the fact that those intervals are point-wise rather than joint, i.e., that the graphs do not provide a simultaneous confidence band, goes unmentioned. (The exercise section of that chapter is superfluous in that it suggests re-reading Kalman’s original paper and rephrases the Monty Hall paradox in a story unconnected with filtering!) Chapter 2 gives an introduction to Bayesian statistics in general, with a few pages on Bayesian computational methods. A first remark is that the above quote is both correct and mildly confusing in that the parameters can be consistently estimated, while the latent states cannot. A second remark is that justifying the MAP as associated with the 0-1 loss is incorrect in continuous settings. The third chapter deals with the batch updating of the posterior distribution, i.e., the fact that the posterior at time t is the prior at time t+1, with applications to state-space systems including the Kalman filter. The fourth to sixth chapters concentrate on this Kalman filter and its extensions, and I find them somewhat unsatisfactory in that the collection of such filters is overwhelming for a neophyte. And no assessment of the estimation error when the model is misspecified appears at this stage. And, as usual, I find the unscented Kalman filter hard to fathom! The same feeling applies to the smoothing chapters, Chapters 8 to 10, which mimic the earlier ones.
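As a reminder of the recursion the book is organised around, here is my own minimal illustration (not the book’s code) of the linear-Gaussian case, where the filtering posterior at time t is pushed through the dynamics to become the prior at time t+1:

```python
import numpy as np

def kalman_filter(ys, A, Q, H, R, m0, P0):
    """Minimal linear-Gaussian Kalman filter: returns the means and covariances
    of the filtering distributions p(x_t | y_1:t) = N(m_t, P_t)."""
    m, P = m0, P0
    means, covs = [], []
    for y in ys:
        # prediction: the posterior at time t-1 becomes the prior at time t
        m_pred = A @ m
        P_pred = A @ P @ A.T + Q
        # update: condition that prior on the new observation y_t
        S = H @ P_pred @ H.T + R              # innovation covariance
        K = P_pred @ H.T @ np.linalg.inv(S)   # Kalman gain
        m = m_pred + K @ (y - H @ m_pred)
        P = P_pred - K @ S @ K.T
        means.append(m)
        covs.append(P)
    return means, covs
```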

## reading classics (The End)

Posted in Books, Kids, Statistics, University life on February 24, 2015 by xi'an

Today was the final session of our Reading Classics Seminar for the academic year 2014-2015. I have not reported much on this seminar so far because it had starting problems, namely hardly any students present in the first classes and therefore several re-starts until we reached a small group of interested students. And this is truly The End for this enjoyable experiment, as this is the final year for my TSI Master at Paris-Dauphine, which will become integrated within the new MASH Master next year.

As a last presentation for the entire series, my student picked John Skilling’s Nested Sampling, not that it was in my list of “classics”, but he had worked on the paper in a summer project and was thus reasonably fluent with the topic. As he did a good enough job (!), here are his slides.

Some of the questions that came to me during the talk were on how to run nested sampling sequentially, both in the data and in the number of simulated points, and on incorporating more deterministic moves in order to remove some of the Monte Carlo variability. I was about to ask about (!) the Hamiltonian version of nested sampling but then he mentioned his last summer internship on this very topic! I also realised during that talk that the formula (for positive random variables)

$\int_0^\infty(1-F(x))\text{d}x = \mathbb{E}_F[X]$

does not require absolute continuity of the distribution F.
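Indeed, a one-line Tonelli argument, writing X as the integral of the indicator of {x<X} and swapping expectation and integral, delivers the identity without any density assumption:

$\mathbb{E}_F[X]=\mathbb{E}_F\left[\int_0^\infty\mathbb{I}\{x<X\}\,\text{d}x\right]=\int_0^\infty\mathbb{P}(X>x)\,\text{d}x=\int_0^\infty(1-F(x))\,\text{d}x.$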

## the fundamental incompatibility of HMC and data subsampling

Posted in Books, Statistics, University life on February 23, 2015 by xi'an

Last week, Michael Betancourt, from Warwick, arXived a neat wee note on the fundamental difficulties in running HMC on a subsample of the original data. The core message is that using only a fraction of the data to run an HMC, with the hope that it will preserve the stationary distribution, does not work. The only way to recover from the bias is to include a Metropolis-Hastings step based on the whole data, a step that both kills most of the computing gain and has very low acceptance probabilities. Even the strategy that subsamples for each step in a single trajectory fails: there cannot be a significant gain in time without a significant bias in the outcome. Too bad..! Now, there are ways of accelerating HMC, for instance by parallelising the computation of gradients, but, just as in any other approach (?), the information provided by the whole data is only available when looking at the whole data.
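As I understand the argument, the heart of the difficulty is that, even though the subsampled potential

$\tilde{U}(q)=\frac{N}{n}\sum_{i\in S}U_i(q)\qquad\text{satisfies}\qquad\mathbb{E}_S[\nabla\tilde{U}(q)]=\nabla U(q),$

i.e., has an unbiased gradient at every point, the Hamiltonian trajectory is a nonlinear functional of many such evaluations, so the resulting proposal drifts away from the exact-flow proposal and the chain no longer leaves the full posterior (proportional to exp(-U)) invariant; only an accept-reject step computed on the whole data can repair this, at the price of the computing gain.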

## another Sally Clark?

Posted in Statistics, University life on February 22, 2015 by xi'an

“I don’t trust my own intuition when an apparent coincidence occurs; I have to sit down and do the calculations to check whether it’s the kind of thing I might expect to occur at some time and place.” D. Spiegelhalter

I just read in The Guardian an article on the case of the nurse Benjamin Geen, whose 2006 conviction and 30-year jail sentence for the murder of two elderly patients relied on inappropriate statistical expertise. As in the Sally Clark case, the evidence was built around “unusual patterns” of deaths associated with a particular nurse, without taking into account the possible biases in building such patterns. The case against the 2006 expertise is based on reports by David Spiegelhalter, Norman Fenton, Stephen Senn and Sheila Bird, who constitute enough of a dream team to warrant a reconsideration of the conviction. As put forward by Prof Fenton, “at least one hospital in the country would be expected to see this many events over a four-year period, purely by chance.”

## a Nice talk

Posted in Books, Statistics, Travel, University life on February 20, 2015 by xi'an

Today, I give a talk on our testing paper in Nice, in a workshop run in connection with our Calibration ANR grant.

The slides are directly extracted from the paper but it still took me quite a while to translate the paper into those, during the early hours of our Czech break this week.

One added perk of travelling to Nice is the flight there, as it parallels the entire French Alps, a terrific view in nice weather!