As I was still musing about the posts of last week around infinite variance importance sampling and its potential corrections, I wondered at whether or not there was a fundamental difference between “just” having a [finite] variance and “just” having none. In conjunction with Aki’s post. To get a better feeling, I ran a quick experiment with Exp(1) as the target and Exp(a) as the importance distribution. When estimating E[X]=1, the above graph opposes a=1.95 to a=2.05 (variance versus no variance, bright yellow versus wheat), a=2.95 to a=3.05 (third moment versus none, bright yellow versus wheat), and a=3.95 to a=4.05 (fourth moment versus none, bright yellow versus wheat). The graph below is the same for the estimation of E[exp(X/2)]=2, which has an integrand that is not square integrable under the target. Hence seems to require higher moments for the importance weight. Hard to derive universal theories from those two graphs, however they show that protection against sudden drifts in the estimation sequence. As an aside [not really!], apart from our rather confidential Confidence bands for Brownian motion and applications to Monte Carlo simulation with Wilfrid Kendall and Jean-Michel Marin, I do not know of many studies that consider the sequence of averages time-wise rather than across realisations at a given time and still think this is a more relevant perspective for simulation purposes.
Archive for Monte Carlo Statistical Methods
“Within this unified context, it is possible to interpret that all the MIS algorithms draw samples from a equal-weighted mixture distribution obtained from the set of available proposal pdfs.”
In a very special (important?!) week for importance sampling!, Elvira et al. arXived a paper about generalized multiple importance sampling. The setting is the same as in earlier papers by Veach and Gibas (1995) or Owen and Zhou (2000) [and in our AMIS paper], namely a collection of importance functions and of simulations from those functions. However, there is no adaptivity for the construction of the importance functions and no Markov (MCMC) dependence on the generation of the simulations.
One first part deals with the fact that a random point taken from the conjunction of those samples is distributed from the equiweighted mixture. Which was a fact I had much appreciated when reading Owen and Zhou (2000). From there, the authors discuss the various choices of importance weighting. Meaning the different degrees of Rao-Blackwellisation that can be applied to the sample. As we discovered in our population Monte Carlo research [which is well-referred within this paper], conditioning too much leads to useless adaptivity. Again a sort of epiphany for me, in that a whole family of importance functions could be used for the same target expectation and the very same simulated value: it all depends on the degree of conditioning employed for the construction of the importance function. To get around the annoying fact that self-normalised estimators are never unbiased, the authors borrow Liu’s (2000) notion of proper importance sampling estimators, where the ratio of the expectations is returning the right quantity. (Which amounts to recover the correct normalising constant(s), I believe.) They then introduce five (5!) different possible importance weights that all produce proper estimators. However, those weights correspond to different sampling schemes, so do not apply to the same sample. In other words, they are not recycling weights as in AMIS. And do not cover the adaptive cases where the weights and parameters of the different proposals change along iterations. Unsurprisingly, the smallest variance estimator is the one based on sampling without replacement and an importance weight made of the entire mixture. But this result does not apply for the self-normalised version, whose variance remains intractable.
I find this survey of existing and non-existing multiple importance methods quite relevant and a must-read for my students (and beyond!). My reservations (for reservations there must be!) are that the study stops short of pushing further the optimisation. Indeed, the available importance functions are not equivalent in terms of the target and hence weighting them equally is sub-efficient. The adaptive part of the paper broaches upon this issue but does not conclude.
While the deadline for Breaking News! submission is now over, with close to 20 submissions!, there is a new opening for cheaper lodging: The CUBE hotel in Savognin (20km away) has an offer at 110 CHF per person and per night, including breakfast, dinner, and skipass in a room for 3 people (or more). Be sure to mention MCMski in the subject of your email. As mentioned in the previous post, there are other opportunities in nearby villages, for instance Tiefencastel, 11km away with a 19mn bus connection, or Chur, 18km away with a slower 39mn bus connection, but with a very wide range of offers.
Another (!) Cross Validated question that shed some light on the difficulties of explaining the convergence of MCMC algorithms. Or in understanding conditioning and hierarchical models. The author wanted to know why a data augmentation of his did not converge: In a simplified setting, given an observation y that he wrote as y=h(x,θ), he had built a Gibbs sampler by reconstructing x=g(y,θ) and simulating θ given x: at each iteration t,
- compute xt=g(y,θt-1)
- simulate θt~π(θ|xt)
and he attributed the lack of convergence to a possible difficulty with the Jacobian. My own interpretation of the issue was rather that condition on the unobserved x was not the same as conditioning on the observed y and hence that y was missing from step 2. And that the simulation of x is useless. Unless one uses it in an augmented scheme à la Xiao-Li… Nonetheless, I like the problem, if only because my very first reaction was to draw a hierarchical dependence graph and to conclude this should be correct, before checking on a toy example that it was not!
Last week A while ago, Aki Vehtari and Andrew Gelman arXived a paper on self-normalised importance sampling estimators, Pareto smoothed importance sampling. That I commented almost immediately and then sat on, waiting for the next version. Since the two A’s are still working on that revision, I eventually decided to post the comments, before a series of posts on the same issue. Disclaimer: the above quote from and picture of Pareto are unrelated with the paper!
A major drawback with importance samplers is that they can produce infinite variance estimators. Aki and Andrew compare in this study the behaviour of truncated importance weights, following a paper of Ionides (2008) Andrew and I had proposed as a student project last year, project that did not conclude. The truncation is of order √S, where S is the number of simulation, rescaled by the average weight (which should better be the median weight in the event of infinite variance weights). While this truncation leads to finite variance, it also induces a possibly far from negligible bias, bias that the paper suggests to reduce via a Pareto modelling of the largest or extreme weights. Three possible conclusions come from the Pareto modelling and the estimation of the Pareto shape k. If k<½, there is no variance issue and truncation is not necessary; if ½<k<1, the estimator has a mean but no variance, and if k>1, it does not even has a mean. The latter case sounds counter-intuitive since the self-normalised importance sampling estimator is the ratio of an estimate of a finite integral by an estimate of a positive constant… Aki and Andrew further use the Pareto estimation to smooth out the largest weights as estimated quantiles. They also eliminate the largest weights when k comes close to 1 or higher values.
On a normal toy example, simulated with too small a variance, the method is seen to reduce the variability if not the bias. In connection with my above remark, k does never appear as significantly above 1 in this example. A second toy example uses a shifted t distribution as proposal. This setting should not induce a infinite variance problem since the inverse of a t density remains integrable under a normal distribution, but the variance grows with the bias in the t proposal and the Pareto index k as well, exceeding the boundary value 1 in the end. Similar behaviour is observed on a multidimensional example.
The issue I have with this approach is the same I set to Andrew last year, namely why would one want to use a poor importance sampler and run the risk of ending up with a worthless approximation? Detecting infinite variance estimation is obviously an essential first step step to produce reliable approximation but a second step would to seek a substitute for the proposal in an automated manner, possibly by increasing the tails of the original one, or in running a reparameterisation of the original problem with the same proposal. Towards thinner tails of the target. Automated sounds unrealistic, obviously, but so does trusting an infinite variance estimate. If worse comes to worse, we should acknowledge and signal that the current sampler cannot be trusted. As in statistical settings, we should be able to state we cannot produce a satisfactory solution (and hence need more data or different models).
This edition of the MCMSki conference will include a Breaking News! session, covering the latest developments in the field, latest enough to be missed by the scientific committee when building the program. To be considered for this special session, please indicate you wish to compete for this distinction when submitting your poster. The deadline for submission is November 15, 2015. The selection will be made by the scientific committee and the time allocated to each talk will depend on the number of selected talks. Selected presenters will be notified by December 02, 2015, and they are expected to participate in the poster session to ensure maximal dissemination of their breaking news.
And since I got personal enquiries yesterday, the number of talks during that session will be limited to have real talks and not flash oral presentations of incoming posters. Unless the scientific committee cannot make its mind on which news to break..!
A new arXival today by Abigail Arnold and Jason Loeppky that discusses how simulations studies are and should be conducted when assessing statistical methods.
“Obviously there is no one model that will universally outperform the rest. Recognizing the “No Free Lunch” theorem, the logical question to ask is whether one model will perform best over a given class of problems. Again, we feel that the answer to this question is of course no. But we do feel that there are certain methods that will have a better chance than other methods.”
I find the assumptions or prerequisites of the paper arguable [in the sense of 2. open to disagreement; not obviously correct]—not even mentioning the switch from models to methods in the above—in that I will not be convinced that a method outperforms another method by simply looking at a series of simulation experiments. (Which is why I find some machine learning papers unconvincing, when they introduce a new methodology and run it through a couple benchmarks.) This also reminds me of Samaniego’s Comparison of the Bayesian and frequentist approaches, which requires a secondary prior to run the comparison. (And hence is inconclusive.)
“The papers above typically show the results as a series of side-by-side boxplots (…) for each method, with one plot for each test function and sample size. Conclusions are then drawn from looking at a handful of boxplots which often look very cluttered and usually do not provide clear evidence as to the best method(s). Alternatively, the results will be summarized in a table of average performance (…) These tables are usually overwhelming to look at and interpretations are incredibly inefficient.”
Agreed boxplots are terrible (my friend Jean-Michel is forever arguing against them!). Tables are worse. But why don’t we question RMSE as well? This is most often a very reductive way of comparing methods. I also agree with the point that the design of the simulation studies is almost always overlooked and induces a false sense of precision, while failing to cover a wide enough range of cases. However, and once more, I question the prerequisites for comparing methods through simulations for the purpose of ranking those methods. (Which is not the perspective adopted by James and Nicolas when criticising the use of the Pima Indian dataset.)
“The ECDF allows for quick assessments of methods over a large array of problems to get an overall view while of course not precluding comparisons on individual functions (…) We hope that readers of this paper agree with our opinions and strongly encourage everyone to rely on the ECDF, at least as a starting point, to display relevant statistical information from simulations.”
Drawing a comparison with the benchmarking of optimisation methods, the authors suggest to rank statistical methods via the empirical cdf of their performances or accuracy across (benchmark) problems. Arguing that “significant benefit is gained by [this] collapsing”. I am quite sceptical [as often] of the argument, first because using a (e)cdf means the comparison is unidimensional, second because I see no reason why two cdfs should be easily comparable, third because the collapsing over several problems only operates when the errors for those different problems do not overlap.