Archive for subsampling

is there such a thing as optimal subsampling?

Posted in Books, Statistics, University life with tags , , , , , , , , , , , , , on June 12, 2020 by xi'an

This idea of optimal thinnin and burnin has been around since the early days of the MCMC revolution and did not come up with a definite answer. For instance, from a pure estimation perspective, subsampling always increases the variance of the resulting estimator. My personal approach is to ignore both burnin and thinnin and rather waste time on running several copies of the code to check for potential discrepancies and get a crude notion of the variability. And to refuse to answer to questions like is 5000 iterations long enough for burnin?

A recent arXival by Riabiz et al. readdresses the issue. In particular concerning this notion that the variance of the subsampled version is higher: this only applies to a deterministic subsampling, as opposed to an MCMC-based subsampling (although this intricacy only makes the problem harder!). I however fail to understand the argument in favour of subsampling based on storage issues (p.4), as a dynamic storage of the running mean for all quantities of interest does not cost anything if the integrand is not particularly demanding. I also disagree at the pessimistic view that the asymptotic variance of the MCMC estimate is hard to estimate: papers by Flegal, Hobert, Jones, Vat and others have rather clearly shown how batch means can produce converging estimates of this asymptotic variance.

“We do not to attempt to solve a continuous optimisation problem for selection of the next point [in the sample]. Such optimisation problems are fundamentally difficult and can at best be approximately solved. Instead, we exactly solve the discrete optimisation problem of selecting a suitable element from a supplied MCMC output.”

One definitely positive aspect of the paper is that the (thinning) method is called Stein thinning, in connection with Stein’s discrepancy, and this honours Charles Stein. The method looks at the optimal subsample, with optimality defined in terms of minimising Stein’s discrepancy from the true target over a reproducible kernel Hilbert space. And then over a subsample to minimise the distance from the empirical distribution to the theoretical distribution. The kernel (11) is based on the gradient of the target log density and the solution is determined by greedy algorithms that determine which next entry to add to the empirical distribution. Which is of complexity O(nm2) if the subsample is of size m. Some entries may appear more than once and the burnin step could be automatically included as (relatively) unlikely values are never selected (at least this was my heuristic understanding). While the theoretical backup for the construct is present and backed by earlier papers of some of the authors, I do wonder at the use of the most rudimentary representation of an approximation to the target when smoother versions could have been chosen and optimised on the same ground. And I am also surprised at the dependence of both estimators and discrepancies on the choice of the (sort-of) covariance matrix in the inner kernel, as the ODE examples provided in the paper (see, e.g., Figure 7). (As an aside and at a shallow level, the approach also reminded me of the principal points of my late friend Bernhard Flury…) Storage of all MCMC simulations for a later post-processing is of course costly in terms of storage, at O(nm). Unless a “secretary problem” approach can be proposed to get sequential. Another possible alternate would be to consider directly the chain of the accepted values (à la vanilla Rao-Blackwellisation). Overall, since the stopping criterion is based on a fixed sample size, and hence depends on the sub-efficiency of evaluating the mass of different modes, I am unsure the method is anything but what-you-get-is-what-you-see, i.e. prone to get misled by a poor exploration of the complete support of the target.

“This paper focuses on nonuniform subsampling and shows that it is more efficiency than uniform subsampling.”

Two weeks later, Guanyu Hu and Hai Ying Wang arXived their Most Likely Optimal Subsampled Markov Chain Monte Carlo, in what I first thought as an answer to the above! But both actually have little in common as this second paper considers subsampling on the data, rather than the MCMC output, towards producing scalable algorithms. Building upon Bardenet et al. (2014) and Korattikara et al. (2014).  Replacing thus the log-likelihood with a random sub-sampled version and deriving the sample size based on a large deviation inequality. By a Cauchy-Schwartz inequality, the authors find sampling probabilities proportional to the individual log-likelihooods. Which depend on the running value of the MCMC’ed parameters. And thus replaced with the values at a fixed parameter, with cost O(n) but only once, but no so much optimal. (The large deviation inequality therein is only concerned with an approximation to the log-likelihood, without examining the long term impact on the convergence of the approximate Markov chain as this is no longer pseudo-marginal MCMC. For instance, both current and prospective log-likelihoods are re-estimated at each iteration. The paper compares with uniform sampling on toy examples,  to demonstrate a smaller estimation error for the statistical problem, rather than convergence to the true posterior.)

revisiting the Gelman-Rubin diagnostic

Posted in Books, pictures, Statistics, Travel, University life with tags , , , , , , , , , , , , , on January 23, 2019 by xi'an

Just before Xmas, Dootika Vats (Warwick) and Christina Knudson arXived a paper on a re-evaluation of the ultra-popular 1992 Gelman and Rubin MCMC convergence diagnostic. Which compares within-variance and between-variance on parallel chains started from hopefully dispersed initial values. Or equivalently an under-estimating and an over-estimating estimate of the MCMC average. In this paper, the authors take advantage of the variance estimators developed by Galin Jones, James Flegal, Dootika Vats and co-authors, which are batch mean estimators consistently estimating the asymptotic variance. They also discuss the choice of a cut-off on the ratio R of variance estimates, i.e., how close to one need it be? By relating R to the effective sample size (for which we also have reservations), which gives another way of calibrating the cut-off. The main conclusion of the study is that the recommended 1.1 bound is too large for a reasonable proximity to the true value of the Bayes estimator (Disclaimer: The above ABCruise header is unrelated with the paper, apart from its use of the Titanic dataset!)

In fact, I have other difficulties than setting the cut-off point with the original scheme as a way to assess MCMC convergence or lack thereof, among which

  1. its dependence on the parameterisation of the chain and on the estimation of a specific target function
  2. its dependence on the starting distribution which makes the time to convergence not absolutely meaningful
  3. the confusion between getting to stationarity and exploring the whole target
  4. its missing the option to resort to subsampling schemes to attain pseudo-independence or scale time to convergence (albeit see 3. above)
  5. a potential bias brought by the stopping rule.

more multiple proposal MCMC

Posted in Books, Statistics with tags , , , , , , , on July 26, 2018 by xi'an

Luo and Tjelmeland just arXived a paper on a new version of multiple-try Metropolis Hastings, the addendum being in defining the additional proposed copies via a dependence graph like (a) above, with one version from the target and all others from operational and conditional proposal kernels. Respecting the dependence graph, as in (b). As I did, you may then wonder where both the graph and the conditional do come from. Which reminds me of the pseudo-posteriors of Carlin and Chib (1995), even though this is not terribly connected. Green (1995).) (But not disconnected either since the authors mention But, given the graph, following a Gibbs scheme, one of the 17 nodes is chosen as generated from the target, using the proper conditional on that index [which is purely artificial from the point of view of the original simulation problem!]. As noted above, the graph is an issue, but since it is artificial, it can be devised to either allow for quasi-independence between the proposed values or on the opposite to induce long range dependence, which corresponds to conducting multiple MCMC steps before reaching the end nodes, a feature that is very appealing in my opinion. And reminds me of prefetching. (As I am listening to Nicolas Chopin’s lecture in Warwick at the moment, there also seems to be a connection with pMCMC.) Still, I remain unclear as to the devising of the graph of dependent proposals, as its depth should be somehow connected with the mixing properties of the original proposal. Gains in convergence may thus come at a high cost at the construction stage.

minibatch acceptance for Metropolis-Hastings

Posted in Books, Statistics with tags , , , , , on January 12, 2018 by xi'an

An arXival that appeared last July by Seita, Pan, Chen, and Canny, and that relates to my current interest in speeding up MCMC. And to 2014 papers by  Korattikara et al., and Bardenet et al. Published in Uncertainty in AI by now. The authors claim that their method requires less data per iteration than earlier ones…

“Our test is applicable when the variance (over data samples) of the log probability ratio between the proposal and the current state is less than one.”

By test, the authors mean a mini-batch formulation of the Metropolis-Hastings acceptance ratio in the (special) setting of iid data. First they use Barker’s version of the acceptance probability instead of Metropolis’. Second, they use a Gaussian approximation to the distribution of the logarithm of the Metropolis ratio for the minibatch, while the Barker acceptance step corresponds to comparing a logistic perturbation of the logarithm of the Metropolis ratio against zero. Which amounts to compare the logarithm of the Metropolis ratio for the minibatch, perturbed by a logistic minus Normal variate. (The cancellation of the Normal in eqn (13) is a form of fiducial fallacy, where the Normal variate has two different meanings. In other words, the difference of two Normal variates is not equal to zero.) However, the next step escapes me as the authors seek to optimise the distribution of this logistic minus Normal variate. Which I thought was uniquely defined as such a difference. Another constraint is that the estimated variance of the log-likelihood ratio gets below one. (Why one?) The argument is that the average of the individual log-likelihoods is approximately Normal by virtue of the Central Limit Theorem. Even when randomised. While the illustrations on a Gaussian mixture and on a logistic regression demonstrate huge gains in computational time, it is unclear to me to which amount one can trust the approximation for a given model and sample size…

likelihood inflating sampling algorithm

Posted in Books, Statistics, University life with tags , , , , , , , , on May 24, 2016 by xi'an

My friends from Toronto Radu Craiu and Jeff Rosenthal have arXived a paper along with Reihaneh Entezari on MCMC scaling for large datasets, in the spirit of Scott et al.’s (2013) consensus Monte Carlo. They devised an likelihood inflated algorithm that brings a novel perspective to the problem of large datasets. This question relates to earlier approaches like consensus Monte Carlo, but also kernel and Weierstrass subsampling, already discussed on this blog, as well as current research I am conducting with my PhD student Changye Wu. The approach by Entezari et al. is somewhat similar to consensus Monte Carlo and the other solutions in that they consider an inflated (i.e., one taken to the right power) likelihood based on a subsample, with the full sample being recovered by importance sampling. Somewhat unsurprisingly this approach leads to a less dispersed estimator than consensus Monte Carlo (Theorem 1). And the paper only draws a comparison with that sub-sampling method, rather than covering other approaches to the problem, maybe because this is the most natural connection, one approach being the k-th power of the other approach.

“…we will show that [importance sampling] is unnecessary in many instances…” (p.6)

An obvious question that stems from the approach is the call for importance sampling, since the numerator of the importance sampler involves the full likelihood which is unavailable in most instances when sub-sampled MCMC is required. I may have missed the part of the paper where the above statement is discussed, but the only realistic example discussed therein is the Bayesian regression tree (BART) of Chipman et al. (1998). Which indeed constitutes a challenging if one-dimensional example, but also one that requires delicate tuning that leads to cancelling importance weights but which may prove delicate to extrapolate to other models.