As I was still musing about the posts of last week around infinite variance importance sampling and its potential corrections, I wondered at whether or not there was a fundamental difference between “just” having a [finite] variance and “just” having none. In conjunction with Aki’s post. To get a better feeling, I ran a quick experiment with Exp(1) as the target and Exp(a) as the importance distribution. When estimating E[X]=1, the above graph opposes a=1.95 to a=2.05 (variance versus no variance, bright yellow versus wheat), a=2.95 to a=3.05 (third moment versus none, bright yellow versus wheat), and a=3.95 to a=4.05 (fourth moment versus none, bright yellow versus wheat). The graph below is the same for the estimation of E[exp(X/2)]=2, which has an integrand that is not square integrable under the target. Hence seems to require higher moments for the importance weight. Hard to derive universal theories from those two graphs, however they show that protection against sudden drifts in the estimation sequence. As an aside [not really!], apart from our rather confidential Confidence bands for Brownian motion and applications to Monte Carlo simulation with Wilfrid Kendall and Jean-Michel Marin, I do not know of many studies that consider the sequence of averages time-wise rather than across realisations at a given time and still think this is a more relevant perspective for simulation purposes.
Archive for infinite variance estimators
The following is mostly based on our arXived paper with Andrew Gelman and the references mentioned there.
Koopman, Shephard, and Creal (2009) proposed to make a sample based estimate of the existence of the moments using generalized Pareto distribution fitted to the tail of the weight distribution. The number of existing moments is less than 1/k (when k>0), where k is the shape parameter of generalized Pareto distribution.
When k<1/2, the variance exists and the central limit theorem holds. Chen and Shao (2004) show further that the rate of convergence to normality is faster when higher moments exist. When 1/2≤k<1, the variance does not exist (but mean exists), the generalized central limit theorem holds, and we may assume the rate of convergence is faster when k is closer to 1/2.
In our experiments in the arXived paper and in Vehtari, Gelman, and Gabry (2015), we have observed that Pareto smoothed importance sampling (PSIS) usually converges well also with k>1/2 but k close to 1/2 (let’s say k<0.7). But if k<1 and k is close to 1 (let’s say k>0.7) the convergence is much worse and both naïve importance sampling and PSIS are unreliable.
Two figures are attached, which show the results comparing IS and PSIS in the Exp(1/2) and Exp(1/10) examples. The results were computed with repeating 1000 times a simulation with 10000 samples in each. We can see the bad performance of IS in both examples as you also illustrated. In Exp(1/2) case, PSIS is also to produce much more stable results. In Exp(1/10) case, PSIS is able to reduce the variance of the estimate, but it is not enough to avoid a big bias.
It would be interesting to have more theoretical justification why infinite variance is not so big problem if k is close to 1/2 (e.g. how the convergence rate is related to the amount of fractional moments).
I guess that max ω[t] / ∑ ω[t] in Chaterjee and Diaconis has some connection to the tail shape parameter of the generalized Pareto distribution, but it is likely to be much noisier as it depends on the maximum value instead of a larger number of tail samples as in the approach by Koopman, Shephard, and Creal (2009).A third figure shows an example where the variance is finite, with “an Exp(1) proposal for an Exp(1/1.9) target”, which corresponds to k≈0.475 < 1/2. Although the variance is finite, we are close to the border and the performance of basic IS is bad. There is no sharp change in the practical behaviour with a finite number of draws when going from finite variance to infinite variance. Thus, I think it is not enough to focus on the discrete number of moments, but for example, the Pareto shape parameter k gives us more information. Koopman, Shephard, and Creal (2009) also estimated the Pareto shape k, but they formed a hypothesis test whether the variance is finite and thus discretising the information in k, and assuming that finite variance is enough to get good performance.
Sourav Chatterjee and Persi Diaconis arXived yesterday an exciting paper where they study the proper sample size in an importance sampling setting with no variance. That’s right, with no variance. They give as a starting toy example the use of an Exp(1) proposal for an Exp(1/2) target, where the importance ratio exp(x/2)/2 has no ξ order moment (for ξ≥2). So the infinity in the variance is somehow borderline in this example, which may explain why the estimator could be considered to “work”. However, I disagree with the statement “that a sample size a few thousand suffices” for the estimator of the mean to be close to the true value, that is, 2. For instance, the picture I drew above is the superposition of 250 sequences of importance sampling estimators across 10⁵ iterations: several sequences show huge jumps, even for a large number of iterations, which are characteristic of infinite variance estimates. Thus, while the expected distance to the true value can be closely evaluated via the Kullback-Leibler divergence between the target and the proposal (which by the way is infinite when using a Normal as proposal and a Cauchy as target), there are realisations of the simulation path that can remain far from the true value and this for an arbitrary number of simulations. (I even wonder if, for a given simulation path, waiting long enough should
not lead to those unbounded jumps.) The first result is frequentist, while the second is conditional, i.e., can occur for the single path we have just simulated… As I taught in class this very morning, I thus remain wary about using an infinite variance estimator. (And not only in connection with the harmonic mean quagmire. As shown below by the more extreme case of simulating an Exp(1) proposal for an Exp(1/10) target, where the mean is completely outside the range of estimates.) Wary, then, even though I find the enclosed result about the existence of a cut-off sample size associated with this L¹ measure quite astounding. Continue reading Last week A while ago, Aki Vehtari and Andrew Gelman arXived a paper on self-normalised importance sampling estimators, Pareto smoothed importance sampling. That I commented almost immediately and then sat on, waiting for the next version. Since the two A’s are still working on that revision, I eventually decided to post the comments, before a series of posts on the same issue. Disclaimer: the above quote from and picture of Pareto are unrelated with the paper!
A major drawback with importance samplers is that they can produce infinite variance estimators. Aki and Andrew compare in this study the behaviour of truncated importance weights, following a paper of Ionides (2008) Andrew and I had proposed as a student project last year, project that did not conclude. The truncation is of order √S, where S is the number of simulation, rescaled by the average weight (which should better be the median weight in the event of infinite variance weights). While this truncation leads to finite variance, it also induces a possibly far from negligible bias, bias that the paper suggests to reduce via a Pareto modelling of the largest or extreme weights. Three possible conclusions come from the Pareto modelling and the estimation of the Pareto shape k. If k<½, there is no variance issue and truncation is not necessary; if ½<k<1, the estimator has a mean but no variance, and if k>1, it does not even has a mean. The latter case sounds counter-intuitive since the self-normalised importance sampling estimator is the ratio of an estimate of a finite integral by an estimate of a positive constant… Aki and Andrew further use the Pareto estimation to smooth out the largest weights as estimated quantiles. They also eliminate the largest weights when k comes close to 1 or higher values.
On a normal toy example, simulated with too small a variance, the method is seen to reduce the variability if not the bias. In connection with my above remark, k does never appear as significantly above 1 in this example. A second toy example uses a shifted t distribution as proposal. This setting should not induce a infinite variance problem since the inverse of a t density remains integrable under a normal distribution, but the variance grows with the bias in the t proposal and the Pareto index k as well, exceeding the boundary value 1 in the end. Similar behaviour is observed on a multidimensional example.
The issue I have with this approach is the same I set to Andrew last year, namely why would one want to use a poor importance sampler and run the risk of ending up with a worthless approximation? Detecting infinite variance estimation is obviously an essential first step step to produce reliable approximation but a second step would to seek a substitute for the proposal in an automated manner, possibly by increasing the tails of the original one, or in running a reparameterisation of the original problem with the same proposal. Towards thinner tails of the target. Automated sounds unrealistic, obviously, but so does trusting an infinite variance estimate. If worse comes to worse, we should acknowledge and signal that the current sampler cannot be trusted. As in statistical settings, we should be able to state we cannot produce a satisfactory solution (and hence need more data or different models).
while simulating from the associated density, g. While seemingly unaware of the (huge) literature in the area, he re-derived [a version of] the harmonic mean estimate by considering the [inverted importance sampling] identity
when α is a probability density and by using for α the uniform over the whole range of the simulations from g. This choice of α obviously leads to an estimator with infinite variance when the support of g is unbounded, but the idea can be easily salvaged by using instead another uniform distribution, for instance on an highest density region, as we studied in our papers with Darren Wraith and Jean-Michel Marin. (Unfortunately, the originator of the question does not seem any longer interested in the problem.)
Over a welcomed curry yesterday night in Edinburgh I read this 2008 paper by Koopman, Shephard and Creal, testing the assumptions behind importance sampling, which purpose is to check on-line for (in)finite variance in an importance sampler, based on the empirical distribution of the importance weights. To this goal, the authors use the upper tail of the weights and a limit theorem that provides the limiting distribution as a type of Pareto distribution
over (0,∞). And then implement a series of asymptotic tests like the likelihood ratio, Wald and score tests to assess whether or not the power ξ of the Pareto distribution is below ½. While there is nothing wrong with this approach, which produces a statistically validated diagnosis, I still wonder at the added value from a practical perspective, as raw graphs of the estimation sequence itself should exhibit similar jumps and a similar lack of stabilisation as the ones seen in the various figures of the paper. Alternatively, a few repeated calls to the importance sampler should disclose the poor convergence properties of the sampler, as in the above graph. Where the blue line indicates the true value of the integral.
Pierre Jacob and Alexandre Thiéry just arXived a highly pertinent paper on the most debated issue of non-negative unbiased estimators (of positive quantities). If you remember that earlier post of mine, I mentioned the issue in connection with the Russian roulette estimator(s) of Mark Girolami et al. And, as Pierre and Alexandre point out in the paper, there is also a clear and direct connection with the Bernoulli factory problem. And with our Vanilla Rao-Blackwellisation technique (sadly overlooked, once more!).
The first thing I learned from the paper is how to turn a converging sequence into an unbiased estimator. If (En) is this converging sequence, with limit μ, then
is unbiased..! Amazing. Even though the choice of the distribution of N matters towards getting a finite variance estimator, this transform is simply amazing. (Of course, once one looks at it, one realises it is the “old” trick of turning a series into a sequence and vice-versa. Still…!) And then you can reuse it into getting an unbiased estimator for almost any transform of μ.
The second novel thing in the paper is the characterisation of impossible cases for non-negative unbiased estimators. For instance, if the original sequence has an unbounded support, there cannot be such an estimator. If the support is an half-line, the transform must be
monotonous monotonic. If the support is a bounded interval (a,b), then the transform must be bounded from below by a polynomial bound
(where the extra-parameters obviously relate to the transform). (In this later case, the authors also show how to derive a Bernoulli estimator from the original unbiased estimator.)