Archive for efficiency measures

rethinking the ESS

Posted in Statistics with tags , , , , , , , , , on September 14, 2018 by xi'an

Following Victor Elvira‘s visit to Dauphine, one and a half year ago, where we discussed the many defects of ESS as a default measure of efficiency for importance sampling estimators, and then some more efforts (mostly from Victor!) to formalise these criticisms, Victor, Luca Martino and I wrote a paper on this notion, now arXived. (Victor most kindly attributes the origin of the paper to a 2010 ‘Og post on the topic!) The starting thread of the (re?)analysis of this tool introduced by Kong (1992) is that the ESS used in the literature is an approximation to the “true” ESS, generally unavailable. Approximation that is pretty crude and hence impacts the relevance of using it as the assessment tool for comparing importance sampling methods. In the paper, we re-derive (with the uttermost precision) the resulting approximation and list the many assumptions that [would] validate this approximation. The resulting drawbacks are many, from the absurd property of always being worse than direct sampling, to being independent from the target function and from the sample per se. Since only importance weights matter. This list of issues is not exactly brand new, but we think it is worth signaling given the fact that this approximation has been widely used in the last 25 years, due to its simplicity, as a practical rule of thumb [!] in a wide variety of importance sampling methods. In continuation of the directions drafted in Martino et al. (2017), we also indicate some alternative notions of importance efficiency. Note that this paper does not cover the use of ESS for MCMC algorithms, where it is somewhat more legit, if still too rudimentary to really catch convergence or lack thereof! [Note: I refrained from the post title resinking the ESS…]

nested sampling when prior and likelihood clash

Posted in Books, Statistics with tags , , , , , , , , , on April 3, 2018 by xi'an

A recent arXival by Chen, Hobson, Das, and Gelderblom makes the proposal of a new nested sampling implementation when prior and likelihood disagree, making simulations from the prior inefficient. The paper holds the position that a single given prior is used over and over all datasets that come along:

“…in applications where one wishes to perform analyses on many thousands (or even millions) of different datasets, since those (typically few) datasets for which the prior is unrepresentative can absorb a large fraction of the computational resources.” Chen et al., 2018

My reaction to this situation, provided (a) I want to implement nested sampling and (b) I realise there is a discrepancy, would be to resort to an importance sampling resolution, as we proposed in our Biometrika paper with Nicolas. Since one objection [from the authors] is that identifying outlier datasets is complicated (it should not be when the likelihood function can be computed) and time-consuming, sequential importance sampling could be implemented.

“The posterior repartitioning (PR) method takes advantage of the fact that nested sampling makes use of the likelihood L(θ) and prior π(θ) separately in its exploration of the parameter space, in contrast to Markov chain Monte Carlo (MCMC) sampling methods or genetic algorithms which typically deal solely in terms of the product.” Chen et al., 2018

The above salesman line does not ring a particularly convincing chime in that nested sampling is about as myopic as MCMC since based on the similar notion of a local proposal move, starting from the lowest likelihood argument (the minimum likelihood estimator!) in the nested sample.

“The advantage of this extension is that one can choose (π’,L’) so that simulating from π’ under the constraint L'(θ) > l is easier than simulating from π under the constraint L(θ) > l. For instance, one may choose an instrumental prior π’ such that Markov chain Monte Carlo steps adapted to the instrumental constrained prior are easier to implement than with respect to the actual constrained prior. In a similar vein, nested importance sampling facilitates contemplating several priors at once, as one may compute the evidence for each prior by producing the same nested sequence, based on the same pair (π’,L’), and by simply modifying the weight function.” Chopin & Robert, 2010

Since the authors propose to switch to a product (π’,L’) such that π’.L’=π.L, the solution appears like a special case of importance sampling, with the added drwaback that when π’ is not normalised, its normalised constant must be estimated as well. (With an extra nested sampling implementation?) Furthermore, the advocated solution is to use tempering, which is not so obvious as it seems in small dimensions. As the mass does not always diffuse to relevant parts of the space. A more “natural” tempering would be to use a subsample in the (sub)likelihood for nested sampling and keep the remainder of the sample for weighting the evaluation of the evidence.

two correlated pseudo-marginals for the price of one!

Posted in Books, Statistics, University life with tags , , , , , , , , , on November 30, 2015 by xi'an

Within two days, two papers on using correlated auxiliary random variables for pseudo-marginal Metropolis-Hastings, one by George Deligiannidis, Arnaud Doucet, Michael Pitt, and Robert Kohn, and another one by Johan Dahlin, Fredrik Lindsten, Joel Kronander, and Thomas Schön! They both make use of the Crank-Nicholson proposal on the auxiliary variables, which is a shrunk Gaussian random walk or an autoregressive model of sorts, and possibly transform these auxiliary variables by inverse cdf or something else.

“Experimentally, the efficiency of computations is increased relative to the standard pseudo-marginal algorithm by up to 180-fold in our simulations.” Deligiannidis et al. 

The first paper by Deligiannidis et al. aims at reducing the variance of the Metropolis-Hastings acceptance ratio by correlating the auxiliary variables. While the auxiliary variable can be of any dimension, all that matters is its transform into a (univariate) estimate of the density, used as pseudo-marginal at each iteration. However, if a Markov kernel is used for proposing new auxiliary variables, the sequence of the pseudo-marginal is no longer a Markov chain. Which implies looking at the auxiliary variables. Nonetheless, they manage to derive a CLT on the log pseudo-marginal estimate, for a latent variable model with sample size growing to infinity. Based on this CLT, they control the correlation in the Crank-Nicholson proposal so that the asymptotic variance  is of order one:  this correlation has to converge to 1 as 1-exp(-χN/T), where N is the number of Monte Carlo samples for T time intervals. Those results extend to the bootstrap particle filter. Upon reflection, it makes much sense aiming for a high correlation since the unbiased estimator of the target hardly changes from one iteration to the next (but needs to move for the method to be validated by Metropolis-Hastings arguments). The two simulation experiments showed massive gains following this scheme, as reported in the above quote.

“[ABC] can be used to approximate the log-likelihood using importance sampling and particle filtering. However, in practice these estimates suffer from a large variance with results in bad mixing for the Markov chain.” Dahlin et al.

In the second paper, which came a day later, presumably induced by the first paper, acknowledged from the start, the authors also make use of the Crank-Nicholson proposal on the auxiliary variables, which is a shrunk Gaussian random walk, and possibly transform these auxiliary variables by inverse cdf or something else. The central part of the paper is about tuning the scale in the Crank-Nicholson proposal, in the spirit of Gelman, Gilks and Roberts (1996). Since all that matters is the (univariate) estimate of the density used as pseudo-marginal, The authors approximate the law of the log-density by a Gaussian distribution, despite the difficulty with the “projected” Markov chain, thus looking for the optimal scaling but only achieving a numerical optimisation rather than the equivalent of the golden number of MCMC acceptance probabilities, 0.234. Although in a sense the above should be the goal for the auxiliary variable acceptance rate, when those are of high enough dimension. One thing I could not find in this paper is how much (generic) improvement is gathered from this modification from an iid version. (Another is linked with the above quote, which I find difficult to understand as ABC is not primarily intended as a pseudo-marginal method. In a sense it is the worst possible pseudo-marginal estimator in that it uses estimators taking values in {0,1}…)