## borderline infinite variance in importance sampling

**A**s I was still musing about last week’s posts on infinite-variance importance sampling and its potential corrections, and in conjunction with Aki’s post, I wondered whether there is a fundamental difference between “just” having a [finite] variance and “just” having none. To get a better feeling, I ran a quick experiment with Exp(1) as the target and Exp(a) as the importance distribution. When estimating **E[X]**=1, the above graph opposes a=1.95 to a=2.05 (variance versus no variance), a=2.95 to a=3.05 (third moment versus none), and a=3.95 to a=4.05 (fourth moment versus none), in each case bright yellow versus wheat. The graph below does the same for the estimation of **E[**exp(**X**/2)**]**=2, whose integrand is not square integrable under the target and hence seems to require higher moments of the importance weight. It is hard to derive universal theories from those two graphs, but they do suggest that extra moments buy some protection against sudden drifts in the estimation sequence. As an aside [not really!], apart from our rather confidential Confidence bands for Brownian motion and applications to Monte Carlo simulation with Wilfrid Kendall and Jean-Michel Marin, I do not know of many studies that consider the sequence of averages time-wise rather than across realisations at a given time, and I still think this is a more relevant perspective for simulation purposes.
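The experiment is easy to reproduce; here is a minimal sketch in Python (the function name and sample size are mine, not from the post). A draw x from the Exp(a) proposal gets importance weight f(x)/g(x) = exp((a−1)x)/a against the Exp(1) target, and a short computation shows the estimator of E[X] has finite variance exactly when a < 2, which is why the post contrasts a=1.95 with a=2.05.

```python
import numpy as np

def is_running_mean(a, n=100_000, seed=42):
    """Running importance-sampling estimate of E[X]=1 for X ~ Exp(1),
    using n draws from an Exp(a) proposal (rate a, i.e. scale 1/a)."""
    rng = np.random.default_rng(seed)
    x = rng.exponential(scale=1.0 / a, size=n)
    # importance weight f(x)/g(x) = exp(-x) / (a * exp(-a x)) = exp((a-1)x)/a;
    # the second moment of x*w under the proposal is finite iff a < 2
    w = np.exp((a - 1.0) * x) / a
    return np.cumsum(x * w) / np.arange(1, n + 1)
```

Plotting `is_running_mean(1.95)` against `is_running_mean(2.05)` on the same axes reproduces the kind of comparison shown in the graphs: the a>2 path is still unbiased but exhibits occasional large jumps caused by isolated huge weights.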

November 23, 2015 at 5:37 am

One thing that might be interesting to try is to sample from a uniform distribution and use inverse transform sampling to get samples from the importance distribution. You could create plots like those in the post that would continuously transform from one to the other as you smoothly vary the parameter of the importance distribution. You could take the parameter smoothly across the boundary of infinite variance and see exactly what happens to those large jumps.
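This coupling proposal can be sketched in a few lines (a toy version, assuming the Exp(1)-target/Exp(a)-proposal setup of the post; the function name and rate grid are illustrative): one shared uniform sample is pushed through the inverse Exp(a) CDF, x = −log(1−u)/a, so the whole running-average path deforms continuously as the rate a varies across the finite/infinite-variance boundary.

```python
import numpy as np

def coupled_running_means(rates, n=50_000, seed=0):
    """Running IS estimates of E[X]=1 for an Exp(1) target, all driven by
    ONE shared uniform sample so that paths deform smoothly in the rate a."""
    rng = np.random.default_rng(seed)
    u = rng.uniform(size=n)                  # common random numbers
    out = {}
    for a in rates:
        x = -np.log1p(-u) / a                # inverse-CDF draw from Exp(a)
        w = np.exp((a - 1.0) * x) / a        # weight f/g against Exp(1)
        out[a] = np.cumsum(x * w) / np.arange(1, n + 1)
    return out
```

Calling this with, say, `rates=np.linspace(1.9, 2.1, 9)` gives a family of paths morphing continuously across a=2, where the large jumps appear.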

November 23, 2015 at 12:58 pm

Thanks, Corey: I actually used the same Exp(1) sample for all graphs, rescaling by the proper factor each time! Hence the pictures implement your proposal. Great foresight, isn’t it?!

November 23, 2015 at 1:49 am

Is there any way to formalise this intuition using an operator interpolation-type argument?

The standard way that this works is that you have two Banach spaces (X [say, functions with two moments] and Y [say, L^1]) that are nicely contained as subspaces of a big underlying space Z, and you have a linear operator T: X -> R and T: Y -> R.

Now interpolation allows you to define a scale of “in between” spaces (X,Y)_t (this is another Banach space nicely contained in Z) such that

||T||_{(X,Y)_t} \leq C ||T||_X^t ||T||_Y^{1-t},

where those norms are operator norms and 0<t<1.

So basically if T is some sort of error measure such that ||T||_Y < C and ||T||_X < n^{-k}, then

|| T ||_{(X,Y)_t} <= C n^{-tk}.

Now, I'm not likely to find the time to work it all the way through, but it would be quite surprising to me if there wasn't a way to do this so that (X,Y)_t was the space of all functions with 2t (<2) moments.

(This is especially true given that you can think of functions with k moments as weighted L^1 spaces, so Z = L^1 is a natural enveloping space)
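For concreteness, the classical prototype of such an inequality is Riesz–Thorin interpolation between L^1 and L^2 (stated here as the standard result, not as a proof of the conjecture about (X,Y)_t):

```latex
% Riesz--Thorin: if T is bounded on L^1 and on L^2, then for 0 < \theta < 1
% and the interpolated exponent p,
\[
  \|T\|_{L^p} \;\le\; \|T\|_{L^1}^{\,1-\theta}\,\|T\|_{L^2}^{\,\theta},
  \qquad \frac{1}{p} \;=\; (1-\theta) + \frac{\theta}{2}.
\]
% In the comment's notation, with X the two-moment space, Y = L^1 and
% t = \theta, bounds \|T\|_X \le n^{-k} and \|T\|_Y \le C then yield
\[
  \|T\|_{(X,Y)_t} \;\le\; C\, n^{-tk},
\]
% which is the decay claimed above for the intermediate spaces.
```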