Archive for autocorrelation

thinning a Markov chain, statistically

Posted in Books, pictures, R, Statistics on June 13, 2017 by xi'an

Art Owen has arXived a new version of his thinning MCMC paper, where he studies how thinning or subsampling can improve computing time in MCMC chains. I remember quite well the message sent by Mark Berliner and Steve MacEachern in an early 1990's paper that subsampling always increases the variance of the resulting estimators. We actually have this result in our Monte Carlo Statistical Methods book. Now, there are other perspectives on this, as for instance cases when thinning can be hard-wired by simulating directly a k-step move, delaying rejection or acceptance, prefetching, or simulating directly the accepted values as in our vanilla Rao-Blackwellisation approach. Here, Art considers the case when there is a cost θ of computing a transform of the simulation [when the transition costs one unit] and when those transforms are positively correlated with correlation ρ. Somewhat unsurprisingly, when θ is large enough, thinning becomes worth implementing. But this requires extra computations to evaluate the correlation ρ and the cost θ, an extra cost that is rarely comparable with the cost of computing the likelihood itself, a requirement for the Metropolis-Hastings or Hamiltonian Monte Carlo step(s). Subsampling while keeping the right target (which is a hard constraint!) should thus have a much more effective impact on computing budgets.
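
To fix ideas, here is a crude R sketch, which is not Art's analysis and uses made-up values for θ and ρ: it tracks the work-normalised variance of an MCMC average when the transform costs θ units per evaluation, the transition one unit, and the chain is thinned by a factor k, with an AR(1) series standing in for the chain and the transform taken as the identity.

set.seed(1)
rho   <- 0.95     # assumed autocorrelation of the chain
theta <- 5        # assumed cost of one transform, in units of one transition
nsim  <- 1e5      # total number of transitions
x <- as.numeric(arima.sim(list(ar = rho), n = nsim))   # AR(1) stand-in for an MCMC chain

work_var <- function(k) {
  xk   <- x[seq(1, nsim, by = k)]                      # keep one value in k
  ac   <- acf(xk, lag.max = 50, plot = FALSE)$acf[-1]
  iact <- 1 + 2 * sum(ac[ac > 0])                      # crude integrated autocorrelation time
  (var(xk) * iact / length(xk)) * (nsim + theta * length(xk))   # variance times budget
}
round(sapply(c(1, 2, 5, 10, 20), work_var), 2)

With these made-up values the variance-times-budget product should keep decreasing over the k's tried, i.e. thinning pays off, while setting θ to zero recovers the Berliner-MacEachern message that thinning alone cannot help.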

two correlated pseudo-marginals for the price of one!

Posted in Books, Statistics, University life on November 30, 2015 by xi'an

Within two days, two papers appeared on using correlated auxiliary random variables for pseudo-marginal Metropolis-Hastings, one by George Deligiannidis, Arnaud Doucet, Michael Pitt, and Robert Kohn, and another one by Johan Dahlin, Fredrik Lindsten, Joel Kronander, and Thomas Schön! They both make use of the Crank-Nicolson proposal on the auxiliary variables, which is a shrunk Gaussian random walk or an autoregressive model of sorts, and possibly transform these auxiliary variables by an inverse cdf or something else.
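
For readers unfamiliar with the move, a minimal sketch of this Crank-Nicolson step in R (the function name, dimensions, and the inverse-cdf example are mine, not taken from either paper):

crank_nicolson <- function(u, rho) {
  # u: current auxiliary N(0,1) variates; rho: correlation of the shrunk random walk
  rho * u + sqrt(1 - rho^2) * rnorm(length(u))
}
u  <- rnorm(100)                       # auxiliary variables behind the current estimate
up <- crank_nicolson(u, rho = 0.99)    # proposed auxiliary variables, highly correlated
z  <- qexp(pnorm(up), rate = 1)        # e.g. inverse-cdf transform to drive a non-Gaussian simulator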

“Experimentally, the efficiency of computations is increased relative to the standard pseudo-marginal algorithm by up to 180-fold in our simulations.” Deligiannidis et al. 

The first paper, by Deligiannidis et al., aims at reducing the variance of the Metropolis-Hastings acceptance ratio by correlating the auxiliary variables. While the auxiliary variable can be of any dimension, all that matters is its transform into a (univariate) estimate of the density, used as pseudo-marginal at each iteration. However, if a Markov kernel is used to propose new auxiliary variables, the sequence of pseudo-marginal estimates is no longer a Markov chain, which implies working with the extended chain that includes the auxiliary variables. Nonetheless, the authors manage to derive a CLT on the log pseudo-marginal estimate, for a latent variable model with sample size growing to infinity. Based on this CLT, they control the correlation in the Crank-Nicolson proposal so that the asymptotic variance is of order one: this correlation has to converge to 1 as 1-exp(-χN/T), where N is the number of Monte Carlo samples for T time intervals. Those results extend to the bootstrap particle filter. Upon reflection, it makes much sense to aim for a high correlation, since the unbiased estimator of the target hardly changes from one iteration to the next (but it needs to move for the method to be validated by Metropolis-Hastings arguments). The two simulation experiments showed massive gains following this scheme, as reported in the above quote.
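
As an illustration of the mechanism, and nothing more (toy model, arbitrary scales and sizes, no attempt at reproducing the authors' latent variable setting), a correlated pseudo-marginal Metropolis-Hastings run in R could look as follows, with the likelihood estimated by Monte Carlo over a latent Gaussian effect driven by the auxiliary u's:

set.seed(2)
y <- rnorm(20, mean = 1)                     # toy data: y | z ~ N(z,1), z ~ N(theta,1)
loglik_hat <- function(theta, u) {           # unbiased estimate of p(y | theta)
  z <- theta + u                             # latent draws simulated from their prior
  log(mean(sapply(z, function(zi) prod(dnorm(y, mean = zi)))))
}
N   <- 128                                   # Monte Carlo samples per estimate
rho <- 0.99                                  # high correlation, as advocated above
theta <- 0; u <- rnorm(N); ll <- loglik_hat(theta, u)
out <- numeric(5e3)
for (it in seq_along(out)) {
  theta_p <- theta + 0.2 * rnorm(1)                    # random walk on theta
  u_p     <- rho * u + sqrt(1 - rho^2) * rnorm(N)      # Crank-Nicolson move on u
  ll_p    <- loglik_hat(theta_p, u_p)
  # Metropolis-Hastings ratio with a N(0,1) prior on theta
  if (log(runif(1)) < ll_p - ll + dnorm(theta_p, log = TRUE) - dnorm(theta, log = TRUE)) {
    theta <- theta_p; u <- u_p; ll <- ll_p
  }
  out[it] <- theta
}

(The validity of the move rests on the Crank-Nicolson kernel being reversible with respect to the N(0,1) distribution of the u's, so that its proposal density cancels from the Metropolis-Hastings ratio.)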

“[ABC] can be used to approximate the log-likelihood using importance sampling and particle filtering. However, in practice these estimates suffer from a large variance which results in bad mixing for the Markov chain.” Dahlin et al.

The second paper, which came a day later and was presumably prompted by the first one (which it acknowledges from the start), also makes use of the Crank-Nicolson proposal on the auxiliary variables, i.e., a shrunk Gaussian random walk, possibly transforming these auxiliary variables by an inverse cdf or something else. The central part of the paper is about tuning the scale in the Crank-Nicolson proposal, in the spirit of Gelman, Gilks and Roberts (1996). Since all that matters is the (univariate) estimate of the density used as pseudo-marginal, the authors approximate the law of the log-density by a Gaussian distribution, despite the difficulty with the “projected” Markov chain, thus looking for the optimal scaling but only achieving a numerical optimisation rather than the equivalent of the golden number of MCMC acceptance probabilities, 0.234. Although, in a sense, this should be the goal for the acceptance rate of the auxiliary variables, when those are of high enough dimension. One thing I could not find in this paper is how much (generic) improvement is gathered from this modification over an iid version. (Another is linked with the above quote, which I find difficult to understand, as ABC is not primarily intended as a pseudo-marginal method. In a sense it is the worst possible pseudo-marginal estimator in that it uses estimators taking values in {0,1}…)
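
To see what is being tuned, here is a back-of-the-envelope R experiment of mine, much cruder than the paper's optimisation: assume the error in the log pseudo-marginal estimate is Gaussian with standard deviation σ and that the Crank-Nicolson move induces a correlation ρ between successive errors, and track by Monte Carlo the average acceptance of the noise part of the ratio.

acc_noise <- function(rho, sigma, M = 1e5) {
  e  <- rnorm(M)
  z  <- -sigma^2 / 2 + sigma * e                                       # current log-error, E[exp(z)] = 1
  zp <- -sigma^2 / 2 + sigma * (rho * e + sqrt(1 - rho^2) * rnorm(M))  # correlated proposed log-error
  mean(pmin(1, exp(zp - z)))                                           # acceptance due to the noise alone
}
round(sapply(c(0, 0.9, 0.99, 0.999), acc_noise, sigma = 2), 3)

The average acceptance of the noise term climbs towards one as ρ grows, which is the qualitative motivation for correlating the auxiliary variables; the paper's numerical optimisation obviously targets a proper efficiency criterion rather than this crude proxy.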

deep learning ABC summary statistics

Posted in Books, Statistics, University life on October 19, 2015 by xi'an

“The main task of this article is to construct low-dimensional and informative summary statistics for ABC methods.”

The idea in the paper “Learning Summary Statistic for ABC via Deep Neural Network”, arXived a few days ago, is to start from the raw data and build a “deep neural network” (meaning a multiple layer neural network) to provide a non-linear regression of the parameters over the data. (There is a rather militant tone to the justification of the approach, not that unusual with proponents of deep learning approaches, I must add…) Whose calibration never seems to be an issue. The neural construct is called upon to produce an estimator (function) of θ, θ(x), which is then used as the summary statistic. Meaning, if Theorem 1 is to be taken as the proposal, that a different ABC needs to be run for every function of interest. Or, in other words, that the method is not reparameterisation invariant.
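
In R terms, and as a bare-bones stand-in for the deep network of the paper (a single hidden layer via the nnet package, a toy Normal mean model, and purely illustrative sizes), the construction amounts to:

library(nnet)
set.seed(3)
M     <- 1e4                                                   # reference table size
theta <- runif(M, -2, 2)                                       # prior draws
x     <- t(sapply(theta, function(th) rnorm(10, mean = th)))   # raw pseudo-data, one row per draw
fit   <- nnet(x, theta, size = 10, linout = TRUE, trace = FALSE)   # regression of theta on x
summary_stat <- function(xnew) predict(fit, matrix(xnew, nrow = 1))
# summary_stat() then replaces the raw data in a standard ABC rejection step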

The paper claims to achieve the same optimality properties as in Fearnhead and Prangle (2012). These are however moderate optimalities in that they are obtained for a tolerance ε equal to zero. And using the exact posterior expectation as a summary statistic, instead of a non-parametric estimate. And an infinite functional basis in Theorem 2. I thus see little added value in results like Theorem 2 and no real optimality: that the ABC distribution can be arbitrarily close to the exact posterior is not a helpful statement when implementing the method.

The first example in the paper is the posterior distribution associated with the Ising model, which enjoys a sufficient statistic of dimension one. The issue of generating pseudo-data from the Ising model is sidestepped by a call to a Gibbs sampler, but it remains an intrinsic difficulty since the convergence of the Gibbs sampler depends on the value of the parameter θ and especially on its location with respect to the critical point. Both ABC posteriors are shown to be quite close.
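
For the record, a generic single-site Gibbs sweep for the Ising model in R (free boundaries, illustrative grid size and number of sweeps, and not necessarily the sampler called in the paper):

ising_gibbs <- function(n = 32, theta = 0.4, sweeps = 100) {
  s <- matrix(sample(c(-1L, 1L), n * n, replace = TRUE), n, n)
  for (it in 1:sweeps) for (i in 1:n) for (j in 1:n) {
    nb <- 0
    if (i > 1) nb <- nb + s[i - 1, j]
    if (i < n) nb <- nb + s[i + 1, j]
    if (j > 1) nb <- nb + s[i, j - 1]
    if (j < n) nb <- nb + s[i, j + 1]
    p <- 1 / (1 + exp(-2 * theta * nb))     # P(s[i,j] = +1 | neighbours)
    s[i, j] <- if (runif(1) < p) 1L else -1L
  }
  s
}
# mixing deteriorates as theta approaches the critical value, as pointed out above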

The second example is the posterior distribution associated with an MA(2) model, apparently becoming a benchmark in the ABC literature. The comparison between an ABC based on the first two autocorrelations, an ABC based on the semi-automatic solution of Fearnhead and Prangle (2012) [for which collection of summaries?], and the neural network proposal leads to the dismissal of the semi-automatic solution, with the neural net being closest to the exact posterior [with the same tolerance quantile ε for all approaches].
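
A bare-bones R version of the ABC based on the first two autocorrelations (with my own arbitrary sample size, true values, prior triangle, and tolerance quantile, so only the mechanism is illustrated):

set.seed(4)
n    <- 100
rma2 <- function(th, n) {                      # simulate an MA(2) series
  e <- rnorm(n + 2)
  e[3:(n + 2)] + th[1] * e[2:(n + 1)] + th[2] * e[1:n]
}
s  <- function(x) acf(x, lag.max = 2, plot = FALSE)$acf[2:3]   # first two autocorrelations
x0 <- rma2(c(0.6, 0.2), n); s0 <- s(x0)        # pseudo-observed series and its summaries
prior_draw <- function() {                     # uniform over the usual triangular support
  repeat {
    th <- c(runif(1, -2, 2), runif(1, -1, 1))
    if (th[1] + th[2] > -1 && th[1] - th[2] < 1) return(th)
  }
}
M    <- 1e4
sims <- t(replicate(M, { th <- prior_draw(); c(th, s(rma2(th, n))) }))
d    <- sqrt((sims[, 3] - s0[1])^2 + (sims[, 4] - s0[2])^2)
post <- sims[d <= quantile(d, 0.01), 1:2]      # keep the 1% closest parameter draws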

A discussion crucially missing from the paper (from my perspective) is an accounting for size: First, what is the computing cost of fitting, calibrating, and storing a neural network for the sole purpose of constructing a summary statistic? Once the neural net is constructed, I would assume most users would see little need in pursuing the experiment any further. (This was also why we stopped at our random forest output rather than using it as a summary statistic.) Second, how do cost and performances evolve as the dimension of the parameter θ grows? I would deem it necessary to understand when the method fails, as for instance in latent variable models such as HMMs. Third, how does the size of the sample impact cost and performances? In many realistic cases where ABC applies, it is not possible to use the raw data, given its size, and summary statistics are a given. For such examples, neural networks should be compared with other ABC solutions, using the same reference table.

Graphical comparison of MCMC performance [arXiv:1011.445]

Posted in Books, R, Statistics on November 22, 2010 by xi'an


A new posting on arXiv by Madeleine Thompson on a graphical tool for assessing MCMC performance. She has developed a software package called SamplerCompare, implemented in R and C. The graphical evaluation plots “log density evaluations per iteration times autocorrelation time against a tuning parameter in a grid of plots where rows represent distributions and columns represent methods”. The autocorrelation time is evaluated in the same way as in coda, which is the central package used in the convergence assessment chapter of Introducing Monte Carlo Methods with R because of its array of partial (if imperfect) indicators. Note that there is an approximation factor in the evaluation of the autocorrelation time, because the MCMC output is represented as an AR(p) series, with a possible divergence artifact in the corresponding confidence interval if the AR(p) process is found to be non-stationary. When the simulation method (corresponding to columns in the above graphs) allows for an optimal value of its (cyber-)parameters, the performances exhibit a clear parabolic pattern (right graph), but this is not always the case (left graph). Graphical tools are always to be preferred to tables (a point Andrew would not rebuke!). However, I do not see the point in simultaneously graphing the performances of different MCMC algorithms for different targets. This “wasted” dimension could instead be used to increase the number of (cyber-)parameters evaluated by the method to at least three.
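
For completeness, the coda evaluation mentioned above can be reproduced on a toy chain as follows (the AR(1) series being merely a stand-in for genuine MCMC output): the spectral density at zero is estimated through an AR(p) fit and the integrated autocorrelation time follows as its ratio to the marginal variance.

library(coda)
set.seed(5)
x   <- as.numeric(arima.sim(list(ar = 0.9), n = 1e4))   # stand-in for MCMC output
mc  <- mcmc(x)
act <- spectrum0.ar(mc)$spec / var(x)   # integrated autocorrelation time, AR(p) based
ess <- effectiveSize(mc)                # equivalently, length(x) / act
c(act = act, ess = ess)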