**T**oday I [barely made it on a delayed train from Leaminton Spa to Oxford as I] chaired my invited session at SPA 2015 on advanced MCMC methodology. The three speakers, Randal Douc, Mike Pitt and Matti Vihola, all gave talks related to the pseudo-marginal technique. For instance, Randal gave examples of guaranteed variance improvements by adding randomisation steps in the generation of the rv’s behind the unbiased estimation of the likelihood function. Mike Pitt presented the paper I discussed a little while ago about evaluating the computing performances of pseudo-marginal approximations, with a fairly compelling perspective [I may have missed from the paper] on approximating the distribution on the approximation to the log-likelihood as a normal. Which led me to ponder at the ultimate version where the log-likelihood itself would get directly simulated in an MCMC algorithm bypassing the preliminary simulation of the parameters. Sounds a bit too fantasy-like to be of any use… Matti Vihola also presented recent results with Christophe Andrieu on comparing pseudo-marginal approximations, based on convex ordering properties. They included a domination result on ABC-MCM algorithms, as noted in a recent post. Which made me musing about the overall importance of unbiasedness in the global picture, where all we need are converging approximations, *in fine*.

## Archive for unbiased estimation

## SPA 2015 Oxford [my day #2]

Posted in pictures, Statistics, Travel, University life with tags British Rail, Keble College, Leamington Spa, Oxford, particle filters, pseudo-marginal MCMC, SPA 2015, systematic resampling, unbiased estimation, University of Oxford on July 17, 2015 by xi'an## Unbiased Bayes for Big Data: Path of partial posteriors [a reply from the authors]

Posted in Statistics, University life with tags arXiv, bias vs. variance, big data, convergence assessment, de-biasing, Firefly MC, MCMC, Monte Carlo Statistical Methods, telescoping estimator, unbiased estimation on February 27, 2015 by xi'an*[Here is a reply by Heiko Strathmann to my post of yesterday. Along with the slides of a talk in Oxford mentioned in the discussion.]*

Thanks for putting this up, and thanks for the discussion. Christian, as already exchanged via email, here are some answers to the points you make.

First of all, we don’t claim a free lunch — and are honest with the limitations of the method (see negative examples). Rather, we make the point that we *can* achieve computational savings in certain situations — essentially exploiting redundancy (what Michael called “tall” data in his note on subsampling & HMC) leading to fast convergence of posterior statistics.

Dan is of course correct noticing that if the posterior statistic does not converge nicely (i.e. all data counts), then truncation time is “mammoth”. It is also correct that it might be questionable to aim for an unbiased Bayesian method in the presence of such redundancies. However, these are the two extreme perspectives on the topic. The message that we want to get along is that there is a trade-off in between these extremes. In particular the GP examples illustrate this nicely as we are able to reduce MSE in a regime where posterior statistics have *not* yet stabilised, see e.g. figure 6.

“And the following paragraph is further confusing me as it seems to imply that convergence is not that important thanks to the de-biasing equation.”

To clarify, the paragraph refers to the *additional* convergence issues induced by alternative Markov transition kernels of mini-batch-based full posterior sampling methods by Welling, Bardenet, Dougal & co. For example, Firefly MC’s mixing time is increased by a factor of 1/q where q*N is the mini-batch size. Mixing of stochastic gradient Langevin gets worse over time. This is *not* true for our scheme as we can use standard transition kernels. It is still essential for the partial posterior Markov chains to converge (*if* MCMC is used). However, as this is a well studied problem, we omit the topic in our paper and refer to standard tools for diagnosis. All this is independent of the debiasing device.

**About MCMC convergence.**

Yesterday in Oxford, Pierre Jacob pointed out that if MCMC is used for estimating partial posterior statistics, the overall result is *not* unbiased. We had a nice discussion how this bias could be addressed via a two-stage debiasing procedure: debiasing the MC estimates as described in the “Unbiased Monte Carlo” paper by Agapiou et al, and then plugging those into the path estimators — though it is (yet) not so clear how (and whether) this would work in our case.

In the current version of the paper, we do not address the bias present due to MCMC. We have a paragraph on this in section 3.2. Rather, we start from a premise that full posterior MCMC samples are a gold standard. Furthermore, the framework we study is not necessarily linked to MCMC – it could be that the posterior expectation is available in closed form, but simply costly in N. In this case, we can still unbiasedly estimate this posterior expectation – see GP regression.

“The choice of the tail rate is thus quite delicate to validate against the variance constraints (2) and (3).”

It is true that the choice is crucial in order to control the variance. However, provided that partial posterior expectations converge at a rate n^{-β} with n the size of a minibatch, computational complexity can be reduced to N^{1-α} (α<β) without variance exploding. There is a trade-off: the faster the posterior expectations converge, more computation can be saved; β is in general unknown, but can be roughly estimated with the “direct approach” as we describe in appendix.

**About the “direct approach”**

It is true that for certain classes of models and φ functionals, the direct averaging of expectations for increasing data sizes yields good results (see log-normal example), and we state this. However, the GP regression experiments show that the direct averaging gives a larger MSE as with debiasing applied. This is exactly the trade-off mentioned earlier.

I also wonder what people think about the comparison to stochastic variational inference (GP for Big Data), as this hasn’t appeared in discussions yet. It is the comparison to “non-unbiased” schemes that Christian and Dan asked for.

## Unbiased Bayes for Big Data: Path of partial posteriors

Posted in Statistics, University life with tags arXiv, bag of little bootstraps, bias vs. variance, big data, convergence assessment, de-biasing, MCMC, Monte Carlo Statistical Methods, telescoping estimator, unbiased estimation on February 26, 2015 by xi'an*“Data complexity is sub-linear in N, no bias is introduced, variance is finite.”*

**H**eiko Strathman, Dino Sejdinovic and Mark Girolami have arXived a few weeks ago a paper on the use of a telescoping estimator to achieve an unbiased estimator of a Bayes estimator relying on the entire dataset, while using only a small proportion of the dataset. The idea is that a sequence converging—to an unbiased estimator—of estimators φ_{t} can be turned into an unbiased estimator by a stopping rule T:

is indeed unbiased. In a “Big Data” framework, the components φ_{t} are MCMC versions of posterior expectations based on a proportion α_{t} of the data. And the stopping rule cannot exceed α_{t}=1. The authors further propose to replicate this unbiased estimator R times on R parallel processors. They further claim a reduction in the computing cost of

which means that a sub-linear cost can be achieved. However, the gain in computing time means higher variance than for the full MCMC solution:

“It is clear that running an MCMC chain on the full posterior, for any statistic, produces more accurate estimates than the debiasing approach, which by construction has an additional intrinsic source of variance. This means that if it is possible to produce even only a single MCMC sample (…), the resulting posterior expectation can be estimated with less expected error. It is therefore not instructive to compareapproaches in that region. “

I first got a “free lunch” impression when reading the paper, namely it sounded like using a random stopping rule was enough to overcome unbiasedness and large size jams. This is not the message of the paper, but I remain both intrigued by the possibilities the unbiasedness offers *and* bemused by the claims therein, for several reasons: Continue reading

## which parameters are U-estimable?

Posted in Books, Kids, Statistics, University life with tags cross validated, epiphany, Erich Lehmann, George Casella, mathematical statistics, Theory of Point Estimation, U-estimability, unbiased estimation on January 13, 2015 by xi'an**T**oday *(01/06)* was a double epiphany in that I realised that one of my long-time beliefs about unbiased estimators did not hold. Indeed, when checking on Cross Validated, I found this question: For which distributions is there a closed-form unbiased estimator for the standard deviation? And the presentation includes the normal case for which indeed there exists an unbiased estimator of σ, namely

which derives directly from the chi-square distribution of the sum of squares divided by σ². When thinking further about it, if *a posteriori*!, it is now fairly obvious given that σ is a *scale* parameter. Better, any power of σ can be similarly estimated in a unbiased manner, since

And this property extends to all location-scale models.

So how on Earth was I so convinced that there was no unbiased estimator of σ?! I think it stems from reading too quickly a result in, I think, Lehmann and Casella, result due to Peter Bickel and Erich Lehmann that states that, for a convex family of distributions *F*, there exists an unbiased estimator of a functional *q(F)* (for a sample size *n* large enough) if and only if *q(αF+(1-α)G)* is a polynomial in *0≤α≤1*. Because of this, I had this

This leaves open the question as to which transforms of the parameter(s) are unbiasedly estimable (or U-estimable) for a given parametric family, like the normal N(μ,σ²). I checked in Lehmann’s first edition earlier today and could not find an answer, besides the definition of U-estimability. Not only the question is interesting *per se* but the answer could come to correct my long-going impression that unbiasedness is a rare event, i.e., that the collection of transforms of the model parameter that are U-estimable is a very small subset of the whole collection of transforms.

## a neat (theoretical) Monte Carlo result

Posted in Books, Statistics, University life with tags confidence sets, importance sampling, Monte Carlo integration, unbiased estimation, variance reduction on December 19, 2014 by xi'an**M**ark Huber just arXived a short paper where he develops a Monte Carlo approach that bounds the probability of large errors

by computing a lower bound on the sample size r and I wondered at the presence of μ in the bound as it indicates the approach is not translation invariant. One reason is that the standard deviation of the simulated random variables is bounded by cμ. Another reason is that Mark uses as its estimator the median

where the S’s are partial averages of sufficient length and the R’s are independent uniforms over (1-ε,1+ε): using those uniforms may improve the coverage of given intervals but it also means that the absolute scale of the error is multiplied by the scale of S, namely μ. I first thought that some a posteriori recentering could improve the bound but since this does not impact the variance of the simulated random variables, I doubt it is possible.

## Bangalore workshop [ಬೆಂಗಳೂರು ಕಾರ್ಯಾಗಾರ]

Posted in pictures, R, Running, Statistics, Travel, University life, Wines with tags auxiliary variable, Bangalore, demarginalisation, Gibbs sampler, IFCAM, Indian Institute of Science, unbiased estimation on July 31, 2014 by xi'an**S**econd day at the Indo-French Centre for Applied Mathematics and the workshop. Maybe not the most exciting day in terms of talks (as I missed the first two plenary sessions by (a) oversleeping and (b) running across the campus!). However I had a neat talk with another conference participant that led to [what I think are] interesting questions… (And a very good meal in a local restaurant as the guest house had not booked me for dinner!)

**T**o wit: given a target like

the simulation of λ can be demarginalised into the simulation of

where **z** is a latent (and artificial) variable. This means a Gibbs sampler simulating λ given z and z given λ can produce an outcome from the target (*). Interestingly, another completion is to consider that the z_{i}‘s are U(0,y_{i}) and to see the quantity

as an unbiased estimator of the target. What’s quite intriguing is that the quantity remains the same but with different motivations: (a) demarginalisation versus unbiasedness and (b) z_{i} ∼ Exp(λ) versus z_{i} ∼ U(0,y_{i}). The stationary is the same, as shown by the graph below, the core distributions are [formally] the same, … but the reasoning deeply differs.

**O**bviously, since unbiased estimators of the likelihood can be justified by auxiliary variable arguments, this is not in fine a big surprise. Still, I had not thought of the analogy between demarginalisation and unbiased likelihood estimation previously. Continue reading

## trying to speed up Metropolis… and failing!

Posted in R, Statistics, University life with tags big data, delayed acceptance, pseudo-marginal MCMC, unbiased estimation on June 13, 2014 by xi'an**A **while ago (but still after Iceland since I used the *thorn* rune as a math symbol!), I wrote the following post draft as a memo. Now that Marco Banterle, Clara Grazian and myself have completed our delayed acceptance paper, it may be of interest to some readers to see how a first attempt proved fruitless.

**I**n the past days, I tried to speed up my student Clara’s code, reducing the number of prior evaluations in a Metropolis-Hastings algorithm by trying to reject proposals with low likelihoods *before* computing the corresponding prior. (This is one of those rare problems where the prior is the culprit.) My first idea was to start from the case when the new likelihood þ(θ’) was lower than the previous likelihood þ(θ)

þ(θ’) < þ(θ)

as the indicator

ℑ{u≤þ(θ’)/þ(θ)}

is an unbiased estimator of the ratio þ(θ’)/þ(θ) . When u is generated from a uniform U(0,1). (All u’s in this page will be uniform U(0,1), unless otherwise indicated.) Does this mean low values of þ(θ’) can be rejected *prior* to computing π(θ’)? Unfortunately no, since

ℑ{u≤þ(θ’)/þ(θ)} π(θ’)/ π(θ) ∧ 1

is not an unbiased estimator of

{þ(θ’)/þ(θ)} {π(θ’)/ π(θ)} ∧ 1

**T**hen I considered

{þ(θ’)/þ(θ)} {π(θ’)/ π(θ)} ∧ þ(θ)/þ(θ’)}

which can be associated with the unbiased estimator

ℑ{u≤þ(θ’)/þ(θ)} {π(θ’)/ π(θ) ∧ þ(θ)/þ(θ’)}

Once more unfortunately, since

{π(θ’)/ π(θ) ∧ þ(θ)/þ(θ’)}

is not easily bounded, unless π(θ’)/π(θ) itself is bounded (a self-defeating condition!), generating an unbiased estimator of this second term is not obvious.

**A**nother attempt was to consider an exact approximation à la Andrieu and Roberts (2009, Annals of Stat.), using an unbiased estimator of þ(θ’), namely

ℑ{þ(θ)u≤þ(θ’)} þ(θ) ℑ{þ(θ)>þ(θ’)} + þ(θ’) ℑ{þ(θ)<þ(θ’)}

and to plug this unbiased estimator in the Metropolis-Hastings acceptance ratio,

π(θ’) [ℑ{þ(θ)u’≤þ(θ’)}þ(θ)ℑ{þ(θ)>þ(θ’)} + þ(θ’)ℑ{þ(θ)<þ(θ’)}] / π(θ) [ℑ{þ(θ’)u≤þ(θ)}þ(θ’)ℑ{þ(θ’)>þ(θ)} + þ(θ)ℑ{þ(θ)>þ(θ’)}] ∧ 1

which leads to

π(θ’) [ℑ{þ(θ)u’≤þ(θ’)}þ(θ)] / π(θ) þ(θ) ∧ 1 = π(θ’) ℑ{þ(θ)u’≤þ(θ’)} / π(θ) ∧ 1

when þ(θ)>þ(θ’) and to

π(θ’) þ(θ’) / π(θ) ℑ{þ(θ’)u≤þ(θ)}þ(θ’) ∧ 1 = π(θ’) / π(θ) ℑ{þ(θ’)u≤þ(θ)} ∧ 1

otherwise. But this is not correct either. As shown by the above histogram versus target.

#data x~B(N,p) #posterior p|x~Be(x+7.5,N+.5-x) T=10^4 N=100;y=32 a=7.5;b=.5 chk=NULL p=rep(.2,T) for (t in 2:T){ pp=p[t-1]+runif(1,-.1,.1) #random walk p[t]=p[t-1] #by default rejection if ((pp>0)&(pp<1)){ ratio=dbinom(y,N,pp)/dbinom(y,N,p[t-1]) if (ratio<1){ if ((runif(1)<ratio)&(runif(1)< dbeta(pp,a,b)/dbeta(p[t-1],a,b))) p[t]=pp else{ if ((runif(1)*ratio>1)||(runif(1)< dbeta(pp,a,b)/dbeta(p[t-1],a,b))) p[t]=pp} }}