Archive for importance sampling

rethinking the ESS

Posted in Statistics with tags , , , , , , , , , on September 14, 2018 by xi'an

Following Victor Elvira‘s visit to Dauphine, one and a half year ago, where we discussed the many defects of ESS as a default measure of efficiency for importance sampling estimators, and then some more efforts (mostly from Victor!) to formalise these criticisms, Victor, Luca Martino and I wrote a paper on this notion, now arXived. (Victor most kindly attributes the origin of the paper to a 2010 ‘Og post on the topic!) The starting thread of the (re?)analysis of this tool introduced by Kong (1992) is that the ESS used in the literature is an approximation to the “true” ESS, generally unavailable. Approximation that is pretty crude and hence impacts the relevance of using it as the assessment tool for comparing importance sampling methods. In the paper, we re-derive (with the uttermost precision) the resulting approximation and list the many assumptions that [would] validate this approximation. The resulting drawbacks are many, from the absurd property of always being worse than direct sampling, to being independent from the target function and from the sample per se. Since only importance weights matter. This list of issues is not exactly brand new, but we think it is worth signaling given the fact that this approximation has been widely used in the last 25 years, due to its simplicity, as a practical rule of thumb [!] in a wide variety of importance sampling methods. In continuation of the directions drafted in Martino et al. (2017), we also indicate some alternative notions of importance efficiency. Note that this paper does not cover the use of ESS for MCMC algorithms, where it is somewhat more legit, if still too rudimentary to really catch convergence or lack thereof! [Note: I refrained from the post title resinking the ESS…]

optimal approximations for importance sampling

Posted in Mountains, pictures, Statistics, Travel with tags , , , , , , , , , , , on August 17, 2018 by xi'an

“…building such a zero variance estimator is most of the times not practical…”

As I was checking [while looking at Tofino inlet from my rental window] on optimal importance functions following a question on X validated, I came across this arXived note by Pantaleoni and Heitz, where they suggest using weighted sums of step functions to reach minimum variance. However, the difficulty with probability densities that are step functions is that they necessarily have a compact support, which thus make them unsuitable for targeted integrands with non-compact support. And making the purpose of the note and the derivation of the optimal weights moot. It points out its connection with the reference paper of Veach and Guibas (1995) as well as He and Owen (2014), a follow-up to the other reference paper by Owen and Zhou (2000).

efficient adaptive importance sampling

Posted in Books, Statistics with tags , , , , , , , on June 22, 2018 by xi'an

Bernard Delyon and François Portier just recently arXived a paper on population or evolutionary importance sampling, pointed out to me by Víctor Elvira. Changing the proposal or importance sampler at each iteration. And averaging the estimates across iterations, but also mentioning AMIS. While drawing a distinction that I do not understand, since the simulation cost remains the same, while improving the variance of the resulting estimator. (But the paper points out later that their martingale technique of proof does not apply in this AMIS case.) Some interesting features of the paper are that

  • convergence occurs when the total number of simulations grows to infinity, which is the most reasonable scale for assessing the worth of the method;
  • some optimality in the oracle sense is established for the method;
  • an improvement is found by eliminating outliers and favouring update rate over simulation rate (at a constant cost). Unsurprisingly, the optimal weight of the t-th estimator is given by its inverse variance (with eqn (13) missing an inversion step). Although it relies on the normalised versions of the target and proposal densities, since it assumes the expectation of the ratio is equal to one.

When updating the proposal or importance distribution, the authors consider a parametric family with the update in the parameter being driven by moment or generalised moment matching, or Kullback reduction as in our population Monte Carlo paper. The interesting technical aspects of the paper include the use of martingale and empirical risk arguments. All in all, quite a pleasant surprise to see some follow-up to our work on that topic, more than 10 years later.

new estimators of evidence

Posted in Books, Statistics with tags , , , , , , , , , , , , on June 19, 2018 by xi'an

In an incredible accumulation of coincidences, I came across yet another paper about evidence and the harmonic mean challenge, by Yu-Bo Wang, Ming-Hui Chen [same as in Chen, Shao, Ibrahim], Lynn Kuo, and Paul O. Lewis this time, published in Bayesian Analysis. (Disclaimer: I was not involved in the reviews of any of these papers!)  Authors who arelocated in Storrs, Connecticut, in geographic and thematic connection with the original Gelfand and Dey (1994) paper! (Private joke about the Old Man of Storr in above picture!)

“The working parameter space is essentially the constrained support considered by Robert and Wraith (2009) and Marin and Robert (2010).”

The central idea is to use a more general function than our HPD restricted prior but still with a known integral. Not in the sense of control variates, though. The function of choice is a weighted sum of indicators of terms of a finite partition, which implies a compact parameter set Ω. Or a form of HPD region, although it is unclear when the volume can be derived. While the consistency of the estimator of the inverse normalising constant [based on an MCMC sample] is unsurprising, the more advanced part of the paper is about finding the optimal sequence of weights, as in control variates. But it is also unsurprising in that the weights are proportional to the inverses of the inverse posteriors over the sets in the partition. Since these are hard to derive in practice, the authors come up with a fairly interesting alternative, which is to take the value of the posterior at an arbitrary point of the relevant set.

The paper also contains an extension replacing the weights with functions that are integrable and with known integrals. Which is hard for most choices, even though it contains the regular harmonic mean estimator as a special case. And should also suffer from the curse of dimension when the constraint to keep the target almost constant is implemented (as in Figure 1).

The method, when properly calibrated, does much better than harmonic mean (not a surprise) and than Petris and Tardella (2007) alternative, but no other technique, on toy problems like Normal, Normal mixture, and probit regression with three covariates (no Pima Indians this time!). As an aside I find it hard to understand how the regular harmonic mean estimator takes longer than this more advanced version, which should require more calibration. But I find it hard to see a general application of the principle, because the partition needs to be chosen in terms of the target. Embedded balls cannot work for every possible problem, even with ex-post standardisation.

 

the [not so infamous] arithmetic mean estimator

Posted in Books, Statistics with tags , , , , , , , , , on June 15, 2018 by xi'an

“Unfortunately, no perfect solution exists.” Anna Pajor

Another paper about harmonic and not-so-harmonic mean estimators that I (also) missed came out last year in Bayesian Analysis. The author is Anna Pajor, whose earlier note with Osiewalski I also spotted on the same day. The idea behind the approach [which belongs to the branch of Monte Carlo methods requiring additional simulations after an MCMC run] is to start as the corrected harmonic mean estimator on a restricted set A as to avoid tails of the distributions and the connected infinite variance issues that plague the harmonic mean estimator (an old ‘Og tune!). The marginal density p(y) then satisfies an identity involving the prior expectation of the likelihood function restricted to A divided by the posterior coverage of A. Which makes the resulting estimator unbiased only when this posterior coverage of A is known, which does not seem realist or efficient, except if A is an HPD region, as suggested in our earlier “safe” harmonic mean paper. And efficient only when A is well-chosen in terms of the likelihood function. In practice, the author notes that P(A|y) is to be estimated from the MCMC sequence and that the set A should be chosen to return large values of the likelihood, p(y|θ), through importance sampling, hence missing somehow the double opportunity of using an HPD region. Hence using the same default choice as in Lenk (2009), an HPD region which lower bound is derived as the minimum likelihood in the MCMC sample, “range of the posterior sampler output”. Meaning P(A|y)=1. (As an aside, the paper does not produce optimality properties or even heuristics towards efficiently choosing the various parameters to be calibrated in the algorithm, like the set A itself. As another aside, the paper concludes with a simulation study on an AR(p) model where the marginal may be obtained in closed form if stationarity is not imposed, which I first balked at, before realising that even in this setting both the posterior and the marginal do exist for a finite sample size, and hence the later can be estimated consistently by Monte Carlo methods.) A last remark is that computing costs are not discussed in the comparison of methods.

The final experiment in the paper is aiming at the marginal of a mixture model posterior, operating on the galaxy benchmark used by Roeder (1990) and about every other paper on mixtures since then (incl. ours). The prior is pseudo-conjugate, as in Chib (1995). And label-switching is handled by a random permutation of indices at each iteration. Which may not be enough to fight the attraction of the current mode on a Gibbs sampler and hence does not automatically correct Chib’s solution. As shown in Table 7 by the divergence with Radford Neal’s (1999) computations of the marginals, which happen to be quite close to the approximation proposed by the author. (As an aside, the paper mentions poor performances of Chib’s method when centred at the posterior mean, but this is a setting where the posterior mean is meaningless because of the permutation invariance. As another, I do not understand how the RMSE can be computed in this real data situation.) The comparison is limited to Chib’s method and a few versions of arithmetic and harmonic means. Missing nested sampling (Skilling, 2006; Chopin and X, 2011), and attuned importance sampling as in Berkoff et al. (2003), Marin, Mengersen and X (2005), and the most recent Lee and X (2016) in Bayesian Analysis.

another version of the corrected harmonic mean estimator

Posted in Books, pictures, Statistics, University life with tags , , , , , on June 11, 2018 by xi'an

A few days ago I came across a short paper in the Central European Journal of Economic Modelling and Econometrics by Pajor and Osiewalski that proposes a correction to the infamous harmonic mean estimator that is essentially the one Darren and I made in 2009, namely to restrict the evaluations of the likelihood function to a subset A of the simulations from the posterior. Paper that relates to an earlier 2009 paper by Peter Lenk, which investigates the same object with this same proposal and that we had missed for all that time. The difference is that, while we examine an arbitrary HPD region at level 50% or 80% as the subset A, Lenk proposes to derive a minimum likelihood value from the MCMC run and to use the associated HPD region, which means using all simulations, hence producing the same object as the original harmonic mean estimator, except that it is corrected by a multiplicative factor P(A). Or rather an approximation. This correction thus maintains the infinite variance of the original, a point apparently missed in the paper.

Metropolis-Hastings importance sampling

Posted in Books, Statistics, University life with tags , , , , , , , , , on June 6, 2018 by xi'an

[Warning: As I first got the paper from the authors and sent them my comments, this paper read contains their reply as well.]

In a sort of crazy coincidence, Daniel Rudolf and Björn Sprungk arXived a paper on a Metropolis-Hastings importance sampling estimator that offers similarities with  the one by Ingmar Schuster and Ilja Klebanov posted on arXiv the same day. The major difference in the construction of the importance sampler is that Rudolf and Sprungk use the conditional distribution of the proposal in the denominator of their importance weight, while Schuster and Klebanov go for the marginal (or a Rao-Blackwell representation of the marginal), mostly in an independent Metropolis-Hastings setting (for convergence) and for a discretised Langevin version in the applications. The former use a very functional L² approach to convergence (which reminded me of the early Schervish and Carlin, 1990, paper on the convergence of MCMC algorithms), not all of it necessary in my opinion. As for instance the extension of convergence properties to the augmented chain, namely (current, proposed), is rather straightforward since the proposed chain is a random transform of the current chain. An interesting remark at the end of the proof of the CLT is that the asymptotic variance of the importance sampling estimator is the same as with iid realisations from the target. This is a point we also noticed when constructing population Monte Carlo techniques (more than ten years ago), namely that dependence on the past in sequential Monte Carlo does not impact the validation and the moments of the resulting estimators, simply because “everything cancels” in importance ratios. The mean square error bound on the Monte Carlo error (Theorem 20) is not very surprising as the term ρ(y)²/P(x,y) appears naturally in the variance of importance samplers.

The first illustration where the importance sampler does worse than the initial MCMC estimator for a wide range of acceptance probabilities (Figures 2 and 3, which is which?) and I do not understand the opposite conclusion from the authors.

[Here is an answer from Daniel and Björn about this point:]

Indeed the formulation in our paper is unfortunate. The point we want to stress is that we observed in the numerical experiments certain ranges of step-sizes for which MH importance sampling shows a better performance than the classical MH algorithm with optimal scaling. Meaning that the MH importance sampling with optimal step-size can outperform MH sampling, without using additional computational resources. Surprisingly, the optimal step-size for the MH importance sampling estimator seems to remain constant for an increasing dimension in contrast to the well-known optimal scaling of the MH algorithm (given by a constant optimal acceptance rate).

The second uses the Pima Indian diabetes benchmark, amusingly (?) referring to Chopin and Ridgway (2017) who warn against the recourse to this dataset and to this model! The loss in mean square error due to the importance sampling may again be massive (Figure 5) and setting for an optimisation of the scaling factor in Metropolis-Hastings algorithms sounds unrealistic.

[And another answer from Daniel and Björn about this point:]

Indeed, Chopin and Ridgway suggest more complex problems with a larger number of covariates as benchmarks. However, the well-studied PIMA data set is a sufficient example in order to illustrate the possible benefits but also the limitations of the MH importance sampling approach. The latter are clearly (a) the required knowledge about the optimal step-size—otherwise the performance can indeed be dramatically worse than for the MH algorithm—and (b) the restriction to a small or at most moderate number of covariates. As you are indicating, optimizing the scaling factor is a challenging task. However, the hope is to derive some simple rule of thumb for the MH importance sampler similar to the well-known acceptance rate tuning for the standard MCMC estimator.