Archive for subsampling

practical PDMP

Posted in Books, Statistics, University life with tags , , , , , , , , , , , on December 9, 2021 by xi'an

While in Warwick, last month, I attended a reading group on PDMPs where Filippo Pagani talked about practical PDMP, connected with a recent arXival by Bertazzi, Bierkens and Dobson. The central question when implementing PDMP is to find a realistic way of solving

\int_0^\tau \lambda(x+tv,v)\text dt = \epsilon\quad\epsilon\sim\mathcal Exp(1)

to decide on the stopping time (when the process ceases to be deterministic). The usual approach is to use Poisson thinning by finding an upper bound on λ, but this is either difficult or potentially inefficient (and sometimes both).

“finding a sharp bound M(s) [for Poisson thinning] can be an extremely challenging problem in most practical settings (…) In order to overcome this problem, we introduce discretisation schemes for PDMPs which make their
approximate simulation possible.”

Some of the solutions proposed in Bertazzi et al. are relying on

  1. using a frozen (fixed) λ
  2. discretising time and the integral (first order scheme)
  3. allowing for more than a jump over a time interval (higher order schemes)
  4. going through control variates (when gradient is Lipschitz and Hessian bounded, with known constants) as it produces a linear rate λ
  5. subsampling (at least for Zig Zag)

with theoretical guarantees that the approximations are convergent, as the time step goes to zero. They (almost obviously) remain model dependent solutions (with illustrations for the Zig Zag and bouncy particle versions), with little worse case scenarios, but this is an extended investigation into making PDMPs more manageable!

averaged acceptance ratios

Posted in Statistics with tags , , , , , , , , , , , , , on January 15, 2021 by xi'an

In another recent arXival, Christophe Andrieu, Sinan Yıldırım, Arnaud Doucet, and Nicolas Chopin study the impact of averaging estimators of acceptance ratios in Metropolis-Hastings algorithms. (It is connected with the earlier arXival rephrasing Metropolis-Hastings in terms of involutions discussed here.)

“… it is possible to improve performance of this algorithm by using a modification where the acceptance ratio r(ξ) is integrated with respect to a subset of the proposed variables.”

This interpretation of the current proposal makes it a form of Rao-Blackwellisation, explicitly mentioned on p.18, where, using a mixture proposal, with an adapted acceptance probability, it depends on the integrated acceptance ratio only. Somewhat magically using this ratio and its inverse with probability ½. And it increases the average Metropolis-Hastings acceptance probability (albeit with a larger number of simulations). Since the ideal averaging is rarely available, the authors implement a Monte Carlo averaging version. With applications to the exchange algorithm and to reversible jump MCMC. The major application is to pseudo-marginal settings with a high complexity (in the number T of terms) and where the authors’ approach does scale efficiently with T. There is even an ABC side to the story as one illustration is made of the ABC approximation to the posterior of an α-stable sample. As an encompassing proposal for handling Metropolis-Hastings environments with latent variables and several versions of the acceptance ratios, this is quite an interesting paper that I think we will study in further detail with our students.

is there such a thing as optimal subsampling?

Posted in Books, Statistics, University life with tags , , , , , , , , , , , , , on June 12, 2020 by xi'an

This idea of optimal thinnin and burnin has been around since the early days of the MCMC revolution and did not come up with a definite answer. For instance, from a pure estimation perspective, subsampling always increases the variance of the resulting estimator. My personal approach is to ignore both burnin and thinnin and rather waste time on running several copies of the code to check for potential discrepancies and get a crude notion of the variability. And to refuse to answer to questions like is 5000 iterations long enough for burnin?

A recent arXival by Riabiz et al. readdresses the issue. In particular concerning this notion that the variance of the subsampled version is higher: this only applies to a deterministic subsampling, as opposed to an MCMC-based subsampling (although this intricacy only makes the problem harder!). I however fail to understand the argument in favour of subsampling based on storage issues (p.4), as a dynamic storage of the running mean for all quantities of interest does not cost anything if the integrand is not particularly demanding. I also disagree at the pessimistic view that the asymptotic variance of the MCMC estimate is hard to estimate: papers by Flegal, Hobert, Jones, Vat and others have rather clearly shown how batch means can produce converging estimates of this asymptotic variance.

“We do not to attempt to solve a continuous optimisation problem for selection of the next point [in the sample]. Such optimisation problems are fundamentally difficult and can at best be approximately solved. Instead, we exactly solve the discrete optimisation problem of selecting a suitable element from a supplied MCMC output.”

One definitely positive aspect of the paper is that the (thinning) method is called Stein thinning, in connection with Stein’s discrepancy, and this honours Charles Stein. The method looks at the optimal subsample, with optimality defined in terms of minimising Stein’s discrepancy from the true target over a reproducible kernel Hilbert space. And then over a subsample to minimise the distance from the empirical distribution to the theoretical distribution. The kernel (11) is based on the gradient of the target log density and the solution is determined by greedy algorithms that determine which next entry to add to the empirical distribution. Which is of complexity O(nm2) if the subsample is of size m. Some entries may appear more than once and the burnin step could be automatically included as (relatively) unlikely values are never selected (at least this was my heuristic understanding). While the theoretical backup for the construct is present and backed by earlier papers of some of the authors, I do wonder at the use of the most rudimentary representation of an approximation to the target when smoother versions could have been chosen and optimised on the same ground. And I am also surprised at the dependence of both estimators and discrepancies on the choice of the (sort-of) covariance matrix in the inner kernel, as the ODE examples provided in the paper (see, e.g., Figure 7). (As an aside and at a shallow level, the approach also reminded me of the principal points of my late friend Bernhard Flury…) Storage of all MCMC simulations for a later post-processing is of course costly in terms of storage, at O(nm). Unless a “secretary problem” approach can be proposed to get sequential. Another possible alternate would be to consider directly the chain of the accepted values (à la vanilla Rao-Blackwellisation). Overall, since the stopping criterion is based on a fixed sample size, and hence depends on the sub-efficiency of evaluating the mass of different modes, I am unsure the method is anything but what-you-get-is-what-you-see, i.e. prone to get misled by a poor exploration of the complete support of the target.

“This paper focuses on nonuniform subsampling and shows that it is more efficiency than uniform subsampling.”

Two weeks later, Guanyu Hu and Hai Ying Wang arXived their Most Likely Optimal Subsampled Markov Chain Monte Carlo, in what I first thought as an answer to the above! But both actually have little in common as this second paper considers subsampling on the data, rather than the MCMC output, towards producing scalable algorithms. Building upon Bardenet et al. (2014) and Korattikara et al. (2014).  Replacing thus the log-likelihood with a random sub-sampled version and deriving the sample size based on a large deviation inequality. By a Cauchy-Schwartz inequality, the authors find sampling probabilities proportional to the individual log-likelihooods. Which depend on the running value of the MCMC’ed parameters. And thus replaced with the values at a fixed parameter, with cost O(n) but only once, but no so much optimal. (The large deviation inequality therein is only concerned with an approximation to the log-likelihood, without examining the long term impact on the convergence of the approximate Markov chain as this is no longer pseudo-marginal MCMC. For instance, both current and prospective log-likelihoods are re-estimated at each iteration. The paper compares with uniform sampling on toy examples,  to demonstrate a smaller estimation error for the statistical problem, rather than convergence to the true posterior.)

revisiting the Gelman-Rubin diagnostic

Posted in Books, pictures, Statistics, Travel, University life with tags , , , , , , , , , , , , , on January 23, 2019 by xi'an

Just before Xmas, Dootika Vats (Warwick) and Christina Knudson arXived a paper on a re-evaluation of the ultra-popular 1992 Gelman and Rubin MCMC convergence diagnostic. Which compares within-variance and between-variance on parallel chains started from hopefully dispersed initial values. Or equivalently an under-estimating and an over-estimating estimate of the MCMC average. In this paper, the authors take advantage of the variance estimators developed by Galin Jones, James Flegal, Dootika Vats and co-authors, which are batch mean estimators consistently estimating the asymptotic variance. They also discuss the choice of a cut-off on the ratio R of variance estimates, i.e., how close to one need it be? By relating R to the effective sample size (for which we also have reservations), which gives another way of calibrating the cut-off. The main conclusion of the study is that the recommended 1.1 bound is too large for a reasonable proximity to the true value of the Bayes estimator (Disclaimer: The above ABCruise header is unrelated with the paper, apart from its use of the Titanic dataset!)

In fact, I have other difficulties than setting the cut-off point with the original scheme as a way to assess MCMC convergence or lack thereof, among which

  1. its dependence on the parameterisation of the chain and on the estimation of a specific target function
  2. its dependence on the starting distribution which makes the time to convergence not absolutely meaningful
  3. the confusion between getting to stationarity and exploring the whole target
  4. its missing the option to resort to subsampling schemes to attain pseudo-independence or scale time to convergence (albeit see 3. above)
  5. a potential bias brought by the stopping rule.

more multiple proposal MCMC

Posted in Books, Statistics with tags , , , , , , , on July 26, 2018 by xi'an

Luo and Tjelmeland just arXived a paper on a new version of multiple-try Metropolis Hastings, the addendum being in defining the additional proposed copies via a dependence graph like (a) above, with one version from the target and all others from operational and conditional proposal kernels. Respecting the dependence graph, as in (b). As I did, you may then wonder where both the graph and the conditional do come from. Which reminds me of the pseudo-posteriors of Carlin and Chib (1995), even though this is not terribly connected. Green (1995).) (But not disconnected either since the authors mention But, given the graph, following a Gibbs scheme, one of the 17 nodes is chosen as generated from the target, using the proper conditional on that index [which is purely artificial from the point of view of the original simulation problem!]. As noted above, the graph is an issue, but since it is artificial, it can be devised to either allow for quasi-independence between the proposed values or on the opposite to induce long range dependence, which corresponds to conducting multiple MCMC steps before reaching the end nodes, a feature that is very appealing in my opinion. And reminds me of prefetching. (As I am listening to Nicolas Chopin’s lecture in Warwick at the moment, there also seems to be a connection with pMCMC.) Still, I remain unclear as to the devising of the graph of dependent proposals, as its depth should be somehow connected with the mixing properties of the original proposal. Gains in convergence may thus come at a high cost at the construction stage.