Archive for effective sample size

importance sampling with infinite variance

Posted in pictures, R, Statistics, University life on November 13, 2015 by xi'an

“In this article it is shown that in a fairly general setting, a sample of size approximately exp(D(μ|ν)) is necessary and sufficient for accurate estimation by importance sampling.”

Sourav Chatterjee and Persi Diaconis arXived yesterday an exciting paper where they study the proper sample size in an importance sampling setting with no variance. That's right, with no variance! They give as a starting toy example the use of an Exp(1) proposal for an Exp(1/2) target, where the importance ratio exp(x/2)/2 has no moment of order ξ (for ξ≥2). So the infinite variance is somehow borderline in this example, which may explain why the estimator could be considered to "work". However, I disagree with the statement that "a sample size of a few thousand suffices" for the estimator of the mean to be close to the true value, namely 2. For instance, the picture I drew above superimposes 250 sequences of importance sampling estimators across 10⁵ iterations: several sequences exhibit huge jumps, even after a large number of iterations, which are characteristic of infinite variance estimates.

Thus, while the expected distance to the true value can be closely evaluated via the Kullback-Leibler divergence between the target and the proposal (which, by the way, is infinite when using a Normal proposal for a Cauchy target), there are realisations of the simulation path that remain far from the true value, and this for an arbitrary number of simulations. (I even wonder whether, for a given simulation path, waiting long enough should not lead to those unbounded jumps.) The first result is frequentist, while the second is conditional, i.e., it can occur for the single path we have just simulated…

As I taught in class this very morning, I thus remain wary of using an infinite variance estimator. (And not only in connection with the harmonic mean quagmire, as shown below by the more extreme case of an Exp(1) proposal for an Exp(1/10) target, where the true mean lies completely outside the range of the estimates.) Wary, then, even though I find the enclosed result about the existence of a cut-off sample size associated with this L¹ measure quite astounding.
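A minimal sketch of the toy example, here in Python rather than the R of the original experiment (sample size and seed are arbitrary choices of mine): since E_g[W²] is infinite, the CLT does not apply to the self-normalised estimator, which is what produces the occasional huge jumps.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10**5
x = rng.exponential(scale=1.0, size=n)   # draws from the Exp(1) proposal
w = 0.5 * np.exp(x / 2.0)                # importance ratio f(x)/g(x) for the Exp(1/2) target
# running self-normalised importance sampling estimates of the mean (true value: 2)
est = np.cumsum(w * x) / np.cumsum(w)
print(est[-1])
```

Superimposing a few hundred such sequences reproduces the picture above: most paths hover near 2, but some display sudden jumps triggered by a single enormous weight, at arbitrarily late stages.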

locally weighted MCMC

Posted in Books, Statistics, University life on July 16, 2015 by xi'an

Street light near the St Kilda Road bridge, Melbourne, July 21, 2012

Last week, on arXiv, Espen Bernton, Shihao Yang, Yang Chen, Neil Shephard, and Jun Liu (all from Harvard) proposed a weighting scheme for MCMC simulations, in connection with the parallel MCMC of Ben Calderhead discussed earlier on the 'Og. The weight attached to each proposal is either the acceptance probability itself (with the rejection probability attached to the current value of the MCMC chain) or a renormalised version of the joint target × proposal density, either forward or backward. Both solutions are unbiased in that they have the same expectation as the original MCMC average, being some sort of conditional expectation. The proof of domination in the paper builds upon Calderhead's formalism.
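The first weighting scheme can be sketched as follows, on a toy N(0,1) target (my own illustration, not code from the paper; the random-walk scale and chain length are arbitrary): at each step the proposal is recorded with weight equal to its acceptance probability and the current value with the complementary weight, alongside the standard MCMC average.

```python
import numpy as np

rng = np.random.default_rng(0)
n, x = 50_000, 0.0
plain_sum, weighted_sum = 0.0, 0.0
for _ in range(n):
    y = x + rng.normal(scale=2.0)                # random-walk proposal
    a = min(1.0, np.exp(0.5 * (x * x - y * y)))  # MH acceptance probability for N(0,1)
    # proposal weighted by a, current value by 1 - a
    weighted_sum += a * y + (1.0 - a) * x
    if rng.uniform() < a:                        # standard accept/reject step
        x = y
    plain_sum += x
print(plain_sum / n, weighted_sum / n)           # both estimate E[X] = 0
```

Both running averages target the same expectation; the weighted version replaces the binary accept/reject outcome with its conditional expectation, which is where the variance reduction comes from.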

This work reminded me of several reweighting proposals we made over the years, from the global Rao-Blackwellisation strategy with George Casella to the vanilla Rao-Blackwellisation solution we wrote with Randal Douc a few years ago, both of which demonstrably improve upon the standard MCMC average, either by similarly recycling proposed but rejected values or by diminishing the variability due to the uniform draw. The slightly parallel nature of the approach also connects with our parallel MCMC version with Pierre Jacob (now at Harvard as well!) and Murray Smith (who now lives in Melbourne, hence the otherwise unrelated picture).

intrinsic quantity for a Markov chain?

Posted in Statistics on February 6, 2013 by xi'an

tree next to INSEE building, Malakoff, Jan. 31, 2012

I was attending a lecture this morning at CREST by Patrice Bertail, where he used estimated renewal parameters on a Markov chain to build (asymptotically) convergent bootstrap procedures. Estimating renewal parameters is obviously of interest for MCMC algorithms, as they can be used to assess the convergence of the associated Markov chain: that is, provided the estimation does not induce a significant bias. Another question that came to me during the talk is that, since those convergence assessment techniques formally hold for any small set, choosing the small set so as to maximise the renewal rate also maximises the number of renewal events and hence the number of terms in the control sequence. Thus, the maximal renewal rate þ is definitely a quantity of interest. Now, is this quantity þ an intrinsic parameter of the chain, i.e., a quantity that drives its mixing and/or converging behaviour(s)? For instance, an iid sequence has a renewal rate of 1, because the whole space is a "small" set. Informally, the time between two consecutive renewal events is akin to the time between two simulations from the target and stationary distribution, according to Kac's representation we used in our AAP paper with Jim Hobert. So it could be that þ is directly related to the effective sample size of the chain, hence to its autocorrelation. (A quick web search did not produce anything relevant.) Too bad this question did not pop up last week, when I had the opportunity to discuss it with Sean Meyn in Gainesville!
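As a toy illustration of the renewal rate (my own construction, not from the talk): for a discrete-state chain, any single state is an atom, renewals occur at each visit to it, and the long-run renewal rate is its stationary probability; an iid sequence renews at every step, hence rate 1.

```python
import numpy as np

rng = np.random.default_rng(0)
# two-state Markov chain; visits to state 0 act as renewal events
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])
n, x, renewals = 100_000, 0, 0
for _ in range(n):
    x = rng.choice(2, p=P[x])
    renewals += (x == 0)
rate = renewals / n   # estimates the stationary probability pi(0) = 4/7
print(rate)
```

Here pi(0) = 4/7 solves the stationarity equation, so the empirical renewal rate settles near 0.57; picking the state (or, in general, the small set) with the largest stationary mass maximises the number of renewal events.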

efficient learning in ABC

Posted in Statistics on October 11, 2012 by xi'an

Jean-Michel Marin just posted on arXiv a joint paper of ours, Efficient learning in ABC algorithms. This paper, to whose elaboration [if not redaction] I contributed, is one of the chapters of Mohammed Sedki's thesis. (Mohammed is about to defend this thesis, which is currently with the reviewers. A preliminary version of this paper was presented at ABC in London and is in revision with Statistics and Computing. Hence missing the special issue!)

The paper builds on the sequential ABC scheme of Del Moral et al. (2012), already discussed in this post, where the tolerance level at each step is adapted from the previous iterations as a quantile of the distances. (The quantile level is itself based on a current effective sample size.) In a "systematic" step, the particles that are closest to the observations are preserved and duplicated, while those farther away are sampled randomly. The resulting population of particles is then perturbed by an adaptive (random walk) kernel, and the algorithm stops when the tolerance level no longer decreases or when the acceptance rate of the Metropolis step is too low. Pierre Pudlo and Mohammed Sedki experimented with a parallel implementation of the algorithm, obtaining an almost linear speed-up in the number of cores. It is less clear whether the same would work on a GPU, because of the communication requirements. Overall, the new algorithm brings a significant improvement in computing time when compared with earlier versions, mainly because the number of simulations from the model is minimised. (It was tested on a rather complex population scenario retracing the invasion of honeybees in Europe, in connection with the previous post!)
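The adaptive tolerance step can be sketched as follows (the distances and the quantile level below are hypothetical placeholders; in the actual algorithm the level is driven by the current effective sample size):

```python
import numpy as np

rng = np.random.default_rng(0)
dist = rng.exponential(size=1_000)   # hypothetical particle-to-observation distances
alpha = 0.5                          # quantile level (ESS-driven in the real algorithm)
eps_next = np.quantile(dist, alpha)  # next tolerance: the alpha-quantile of the distances
keep = dist <= eps_next              # closest particles, to be preserved and duplicated
print(eps_next, keep.sum())
```

The surviving particles are then duplicated to restore the population size before the adaptive random-walk perturbation.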

Effective sample size

Posted in Books, R, Statistics on September 24, 2010 by xi'an

In the previous days I have received several emails asking for clarification of the effective sample size derivation in "Introducing Monte Carlo Methods with R" (Section 4.4, pp. 98-100). Formula (4.3) gives the Monte Carlo estimate of the variance of a self-normalised importance sampling estimator. (Note the change from the original version in Introducing Monte Carlo Methods with R! The weight W is unnormalised, hence the normalising constant \kappa appears in the denominator.)

\frac{1}{n}\,\mathrm{var}_f (h(X)) \left\{1+\dfrac{\mathrm{var}_g(W)}{\kappa^2}\right\}

which is estimated by

\dfrac{\sum_{i=1}^n \omega_i \left\{ h(x_i) - \delta_h^n \right\}^2 }{n\sum_{i=1}^n \omega_i} \, \left\{ 1 + n^2\,\widehat{\mathrm{var}}(W)\Bigg/ \left(\sum_{i=1}^n \omega_i \right)^2 \right\}\,.

Now, the front term is somehow obvious, so let us concentrate on the bracketed part. The empirical variance of the \omega_i's is

\frac{1}{n}\,\sum_{i=1}^n\omega_i^2-\frac{1}{n^2}\left(\sum_{i=1}^n\omega_i\right)^2 \,,

so the coefficient 1+\mathrm{var}_g(W)/\kappa^2 is estimated by

n\,\sum_{i=1}^n \omega_i^2 \bigg/ \left(\sum_{i=1}^n \omega_i\right)^2\,.

This leads to the definition of the effective sample size

\mathrm{ESS}_n = \left(\sum_{i=1}^n \omega_i\right)^2\bigg/\sum_{i=1}^n \omega_i^2\,.
The confusing part in the current version is whether or not the W's and \omega_i's are normalised. I hope this clarifies the issue!
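The derivation is easy to check numerically, here in Python rather than the book's R (arbitrary unnormalised weights): plugging the empirical mean and variance of the weights into 1 + var_g(W)/\kappa^2 reduces exactly to n\sum\omega_i^2/(\sum\omega_i)^2, and the resulting ESS is invariant to rescaling of the weights, as it must be for unnormalised \omega_i's.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000
w = rng.exponential(size=n)            # arbitrary unnormalised importance weights
kappa_hat = w.mean()                   # empirical normalising constant
var_hat = (w**2).mean() - kappa_hat**2 # empirical variance of the weights
coeff = 1.0 + var_hat / kappa_hat**2   # estimated 1 + var_g(W)/kappa^2
coeff2 = n * (w**2).sum() / w.sum()**2 # the simplified form derived above
ess = w.sum()**2 / (w**2).sum()        # effective sample size, i.e. n / coeff
print(coeff, coeff2, ess)
```

By the Cauchy-Schwarz inequality the ESS never exceeds n, with equality only for constant weights.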