**T**opi Paananen, Juho Piironen, Paul-Christian Bürkner and Aki Vehtari have recently arXived a work on constructing an adapted importance (sampling) distribution. The beginning is more a review than a new contribution, covering the earlier work by Vehtari, Gelman and Gabri (2017): estimating the Pareto rate for the importance weight distribution helps in assessing whether or not this distribution allows for a (necessary) second moment. In case it does not (seem to), the authors propose an affine transform of the importance distribution, using the earlier sample to match the first two moments of the distribution. Or of the targeted function. Adaptation that is controlled by the same Pareto rate technique, as in the above picture (from the paper). Predicting a natural objection as to the poor performances of the earlier samples, the paper suggests to use robust estimators of these moments, for instance via Pareto smoothing. It also suggests using multiple importance sampling as a way to regularise and robustify the estimates. While I buy the argument of fitting the target moments to achieve a better fit of the importance sampling, I remain unclear as to why an affine transform would change the (poor) tail behaviour of the importance sampler. Hence why it would apply in full generality. An alternative could consist in finding appropriate Box-Cox transforms, although the difficulty would certainly increase with the dimension.

## Archive for infinite variance estimators

## improved importance sampling via iterated moment matching

Posted in Statistics with tags curse of dimensionality, finite variance, importance sampling, infinite variance estimators, Pareto smoothed importance sampling on August 1, 2019 by xi'an## Gibbs clashes with importance sampling

Posted in pictures, Statistics with tags Amsterdam, cross validated, Gibbs sampling, importance sampling, infinite variance estimators, normalising constant on April 11, 2019 by xi'an**I**n an X validated question, an interesting proposal was made: at each (component-wise) step of a Gibbs sampler, replace simulation from the exact full conditional with simulation from an alternate density and weight the resulting simulation with a term made of a product of (a) the previous weight (b) the ratio of the true conditional over the substitute for the new value and (c) the inverse ratio for the earlier value of the same component. Which does not work for several reasons:

- the reweighting is doomed by its very propagation in that it keeps multiplying ratios of expectation one, which means an almost sure chance of degenerating;
- the weights are computed for a previous value that has not been generated from the same proposal and is anyway already properly weighted;
- due to the change in dimension produced by Gibbs, the actual target is the full conditional, which involves an intractable normalising constant;
- there is no guarantee for the weights to have finite variance, esp. when the proposal has thinner tails than the target.

as can be readily checked by a quick simulation experiment. The funny thing is that a proper importance weight can be constructed when envisioning the sequence of Gibbs steps as a Metropolis proposal (in the dimension of the target).

## importance sampling and necessary sample size

Posted in Books, Statistics with tags arXiv, efficient importance sampling, infinite variance estimators, Monte Carlo approximations, Monte Carlo Statistical Methods, Persi Diaconis on September 7, 2016 by xi'an**D**aniel Sanz-Alonso arXived a note yesterday where he analyses importance sampling from the point of view of empirical distributions. With the difficulty that unnormalised importance sampling estimators are not associated with an empirical distribution since the sum of the weights is not one. For several f-divergences, he obtains upper bounds on those divergences between the empirical cdf and a uniform version, D(w,u), which translate into lower bounds on the importance sample size. I however do not see why this divergence between a weighted sampled and the uniformly weighted version is relevant for the divergence between the target and the proposal, nor how the resulting Monte Carlo estimator is impacted by this bound. A side remark [in the paper] is that those results apply to infinite variance Monte Carlo estimators, as in the recent paper of Chatterjee and Diaconis I discussed earlier, which also discussed the necessary sample size.

## ISBA 2016 [#2]

Posted in Books, pictures, Running, Statistics, Travel, University life, Wines with tags Bruno de Finetti, Forte Village, harmonic mean estimator, infinite variance estimators, ISBA 2016, Italy, nested sampling, NIPS 2015, Persi Diaconis, probabilistic numerics, Sardinia on June 15, 2016 by xi'an**T**oday I attended Persi Diaconis’ de Finetti’s ISBA Lecture and not only because I was an invited discussant, by all means!!! Persi was discussing his views on Bayesian numerical analysis. As already expressed in his 1988 paper. Which now appears as a foundational precursor to probabilistic numerics. And which is why I had a very easy time in preparing my discussion as I mostly borrowed from my NIPS slides. With some degree of legitimacy since I was already a discussant there. Anyway, here is the most novel slide in the discussion, built upon my realisation that the principle behind nested sampling is fairly generic for integral approximation, rather than being restricted to marginal likelihood approximation.

Among many interesting things, Persi’s talk made me think anew about infinite variance importance sampling. And about the paper by Souraj Chatterjee and Persi that I discussed a few months ago. In that some regularisation of those “useless” importance estimates can stem from prior modelling. Not as an aside, let me add I am very grateful to the ISBA 2016 organisers and to the chair of the de Finetti lecture committee for their invitation to discuss this talk!

## borderline infinite variance in importance sampling

Posted in Books, Kids, Statistics with tags continuity, importance sampling, infinite variance estimators, moments, Monte Carlo experiment, Monte Carlo Statistical Methods on November 23, 2015 by xi'an**A**s I was still musing about the posts of last week around infinite variance importance sampling and its potential corrections, I wondered at whether or not there was a fundamental difference between “just” having a [finite] variance and “just” having none. In conjunction with Aki’s post. To get a better feeling, I ran a quick experiment with Exp(1) as the target and Exp(a) as the importance distribution. When estimating **E[X]**=1, the above graph opposes a=1.95 to a=2.05 (variance versus no variance, bright yellow versus wheat), a=2.95 to a=3.05 (third moment versus none, bright yellow versus wheat), and a=3.95 to a=4.05 (fourth moment versus none, bright yellow versus wheat). The graph below is the same for the estimation of **E[**exp(**X**/2)**]**=2, which has an integrand that is not square integrable under the target. Hence seems to require higher moments for the importance weight. Hard to derive universal theories from those two graphs, however they show that protection against sudden drifts in the estimation sequence. As an aside [not really!], apart from our rather confidential Confidence bands for Brownian motion and applications to Monte Carlo simulation with Wilfrid Kendall and Jean-Michel Marin, I do not know of many studies that consider the sequence of averages time-wise rather than across realisations at a given time and still think this is a more relevant perspective for simulation purposes.

## Paret’oothed importance sampling and infinite variance [guest post]

Posted in Kids, pictures, R, Statistics, University life with tags central limit theorem, generalised Pareto distribution, importance sampling, infinite variance estimators, Pareto distribution, Pareto smoothed importance sampling, smoothing on November 17, 2015 by xi'an*[Here are some comments sent to me by Aki Vehtari in the sequel of the previous posts.]*

The following is mostly based on our arXived paper with Andrew Gelman and the references mentioned there.

Koopman, Shephard, and Creal (2009) proposed to make a sample based estimate of the existence of the moments using generalized Pareto distribution fitted to the tail of the weight distribution. The number of existing moments is less than 1/k (when k>0), where k is the shape parameter of generalized Pareto distribution.

When k<1/2, the variance exists and the central limit theorem holds. Chen and Shao (2004) show further that the rate of convergence to normality is faster when higher moments exist. When 1/2≤k<1, the variance does not exist (but mean exists), the generalized central limit theorem holds, and we may assume the rate of convergence is faster when k is closer to 1/2.

In the example with “Exp(1) proposal for an Exp(1/2) target”, k=1/2 and we are truly on the border.

In our experiments in the arXived paper and in Vehtari, Gelman, and Gabry (2015), we have observed that Pareto smoothed importance sampling (PSIS) usually converges well also with k>1/2 but k close to 1/2 (let’s say k<0.7). But if k<1 and k is close to 1 (let’s say k>0.7) the convergence is much worse and both naïve importance sampling and PSIS are unreliable.

Two figures are attached, which show the results comparing IS and PSIS in the Exp(1/2) and Exp(1/10) examples. The results were computed with repeating 1000 times a simulation with 10000 samples in each. We can see the bad performance of IS in both examples as you also illustrated. In Exp(1/2) case, PSIS is also to produce much more stable results. In Exp(1/10) case, PSIS is able to reduce the variance of the estimate, but it is not enough to avoid a big bias.

It would be interesting to have more theoretical justification why infinite variance is not so big problem if k is close to 1/2 (e.g. how the convergence rate is related to the amount of fractional moments).

I guess that max ω[t] / ∑ ω[t] in Chaterjee and Diaconis has some connection to the tail shape parameter of the generalized Pareto distribution, but it is likely to be much noisier as it depends on the maximum value instead of a larger number of tail samples as in the approach by Koopman, Shephard, and Creal (2009).A third figure shows an example where the variance is finite, with “an Exp(1) proposal for an Exp(1/1.9) target”, which corresponds to k≈0.475 < 1/2. Although the variance is finite, we are close to the border and the performance of basic IS is bad. There is no sharp change in the practical behaviour with a finite number of draws when going from finite variance to infinite variance. Thus, I think it is not enough to focus on the discrete number of moments, but for example, the Pareto shape parameter k gives us more information. Koopman, Shephard, and Creal (2009) also estimated the Pareto shape k, but they formed a hypothesis test whether the variance is finite and thus discretising the information in k, and assuming that finite variance is enough to get good performance.

## importance sampling with infinite variance

Posted in pictures, R, Statistics, University life with tags arXiv, convergence diagnostics, cut-off, effective sample size, importance sampling, infinite variance estimators, Kullback-Leibler divergence, Persi Diaconis on November 13, 2015 by xi'an

“In this article it is shown that in a fairly general setting, a sample of size approximately exp(D(μ|ν)) is necessary and sufficient for accurate estimation by importance sampling.”

**S**ourav Chatterjee and Persi Diaconis arXived yesterday an exciting paper where they study the proper sample size in an importance sampling setting with no variance. That’s right, *with no variance*. They give as a starting toy example the use of an Exp(1) proposal for an Exp(1/2) target, where the importance ratio exp(x/2)/2 has no ξ order moment (for ξ≥2). So the infinity in the variance is somehow borderline in this example, which may explain why the estimator could be considered to “work”. However, I disagree with the statement “that a sample size a few thousand suffices” for the estimator of the mean to be close to the true value, that is, 2. For instance, the picture I drew above is the superposition of 250 sequences of importance sampling estimators across 10⁵ iterations: several sequences show huge jumps, even for a large number of iterations, which are characteristic of infinite variance estimates. Thus, while the expected distance to the true value can be closely evaluated via the Kullback-Leibler divergence between the target and the proposal (which by the way is infinite when using a Normal as proposal and a Cauchy as target), there are realisations of the simulation path that can remain far from the true value and this for an arbitrary number of simulations. (I even wonder if, for a given simulation path, waiting long enough should ~~not~~ lead to those unbounded jumps.) The first result is frequentist, while the second is conditional, i.e., can occur for the single path we have just simulated… As I taught in class this very morning, I thus remain wary about using an infinite variance estimator. (And not only in connection with the harmonic mean quagmire. As shown below by the more extreme case of simulating an Exp(1) proposal for an Exp(1/10) target, where the mean is completely outside the range of estimates.) Wary, then, even though I find the enclosed result about the existence of a cut-off sample size associated with this L¹ measure quite astounding. Continue reading