## is there such a thing as optimal subsampling?

Posted in Books, Statistics, University life on June 12, 2020 by xi'an

This idea of optimal thinning and burn-in has been around since the early days of the MCMC revolution, without ever reaching a definite answer. For instance, from a pure estimation perspective, subsampling always increases the variance of the resulting estimator. My personal approach is to ignore both burn-in and thinning and instead spend the time running several copies of the code, to check for potential discrepancies and get a crude notion of the variability. And to refuse to answer questions like "is 5000 iterations long enough for burn-in?"
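As a toy illustration of this variance inflation (my own sketch, unrelated to the papers discussed below), compare the variance of the full-chain average with that of a thinned average on a synthetic AR(1) "chain":

```python
import numpy as np

rng = np.random.default_rng(0)
rho, n, reps, thin = 0.5, 1000, 400, 10

def ar1_chain():
    """Stationary AR(1) with unit marginal variance, a stand-in for MCMC output."""
    x = np.empty(n)
    x[0] = rng.normal()
    for t in range(1, n):
        x[t] = rho * x[t - 1] + np.sqrt(1 - rho**2) * rng.normal()
    return x

full_means = np.empty(reps)
thin_means = np.empty(reps)
for rep in range(reps):
    x = ar1_chain()
    full_means[rep] = x.mean()          # use every draw
    thin_means[rep] = x[::thin].mean()  # keep one draw in ten

# across replications, the thinned estimator has the larger variance
print(np.var(full_means), np.var(thin_means))
```

Even with the positive autocorrelation of the chain, throwing away nine draws out of ten costs more than it saves.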

A recent arXival by Riabiz et al. readdresses the issue. In particular, concerning the notion that the variance of the subsampled version is higher: this only applies to deterministic subsampling, as opposed to MCMC-based subsampling (although this intricacy only makes the problem harder!). I however fail to understand the argument in favour of subsampling based on storage issues (p.4), as dynamic storage of the running mean for all quantities of interest costs nothing if the integrand is not particularly demanding. I also disagree with the pessimistic view that the asymptotic variance of the MCMC estimate is hard to estimate: papers by Flegal, Hobert, Jones, Vats and others have rather clearly shown how batch means can produce converging estimates of this asymptotic variance.
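For instance, a minimal batch-means estimator takes only a few lines (my own illustrative sketch, not code from those papers); on an AR(1) chain with correlation ρ and unit marginal variance, the asymptotic variance it targets is (1+ρ)/(1−ρ):

```python
import numpy as np

def batch_means_variance(chain, n_batches=50):
    """Batch-means estimate of the asymptotic variance sigma^2 in the CLT
    sqrt(n) * (mean(chain) - mu) -> N(0, sigma^2)."""
    b = len(chain) // n_batches                  # batch length
    bm = chain[:b * n_batches].reshape(n_batches, b).mean(axis=1)
    return b * bm.var(ddof=1)                    # b x variance of batch means

# sanity check on an AR(1) chain with rho = 0.5, where sigma^2 = 3
rng = np.random.default_rng(0)
rho, n = 0.5, 100_000
innov = np.sqrt(1 - rho**2) * rng.normal(size=n)
chain = np.empty(n)
chain[0] = innov[0]
for t in range(1, n):
    chain[t] = rho * chain[t - 1] + innov[t]

est = batch_means_variance(chain)
```

The estimate converges to the true asymptotic variance as both the batch length and the number of batches grow.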

“We do not attempt to solve a continuous optimisation problem for selection of the next point [in the sample]. Such optimisation problems are fundamentally difficult and can at best be approximately solved. Instead, we exactly solve the discrete optimisation problem of selecting a suitable element from a supplied MCMC output.”

One definitely positive aspect of the paper is that the (thinning) method is called Stein thinning, in connection with Stein’s discrepancy, which honours Charles Stein. The method looks for the optimal subsample, with optimality defined in terms of minimising Stein’s discrepancy from the true target over a reproducing kernel Hilbert space, that is, minimising over subsamples the distance from the empirical distribution to the theoretical target. The kernel (11) is based on the gradient of the target log density, and the solution is determined by a greedy algorithm that selects the next entry to add to the empirical distribution, of complexity O(nm²) if the subsample is of size m. Some entries may appear more than once, and the burn-in step could be automatically included, as (relatively) unlikely values are never selected (at least this was my heuristic understanding). While the theoretical backup for the construct is present and supported by earlier papers of some of the authors, I do wonder at the use of the most rudimentary representation of an approximation to the target when smoother versions could have been chosen and optimised on the same grounds. And I am also surprised at the dependence of both estimators and discrepancies on the choice of the (sort-of) covariance matrix in the inner kernel, as shown by the ODE examples provided in the paper (see, e.g., Figure 7). (As an aside and at a shallow level, the approach also reminded me of the principal points of my late friend Bernhard Flury…) Storing all MCMC simulations for later post-processing is of course costly, at O(nm), unless a “secretary problem” approach can be devised to proceed sequentially. Another possible alternative would be to consider directly the chain of the accepted values (à la vanilla Rao-Blackwellisation).
Overall, since the stopping criterion is based on a fixed sample size, and hence depends on how efficiently the mass of the different modes is evaluated, I am unsure the method is anything but what-you-get-is-what-you-see, i.e., prone to be misled by a poor exploration of the complete support of the target.
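To make the greedy mechanism concrete, here is a bare-bones one-dimensional sketch of Stein thinning, using an IMQ-type base kernel and a standard normal target whose score is s(x) = −x; this is only my reading of the generic recipe, not the authors' implementation:

```python
import numpy as np

def imq_stein_kernel(x, y, sx, sy):
    """Langevin Stein kernel k0 built on the IMQ kernel (1+(x-y)^2)^(-1/2),
    for a one-dimensional target with score values sx = s(x), sy = s(y)."""
    u = x - y
    q = 1.0 + u**2
    k = q**-0.5
    dkdx = -u * q**-1.5                    # dk/dx
    dkdy = u * q**-1.5                     # dk/dy
    d2k = q**-1.5 - 3.0 * u**2 * q**-2.5   # d2k/dxdy
    return d2k + sx * dkdy + sy * dkdx + sx * sy * k

def stein_thin(samples, scores, m):
    """Greedily pick m indices (repeats allowed) minimising the kernel
    Stein discrepancy of the thinned empirical measure."""
    K0 = imq_stein_kernel(samples[:, None], samples[None, :],
                          scores[:, None], scores[None, :])
    running = np.zeros(len(samples))   # sum of k0(., x_j) over selected x_j
    selected = []
    for _ in range(m):
        i = int(np.argmin(K0.diagonal() / 2.0 + running))
        selected.append(i)
        running += K0[:, i]
    return np.array(selected)

# un-converged early draws should be discarded automatically
rng = np.random.default_rng(3)
chain = np.concatenate([np.linspace(5.0, 2.0, 50),   # fake burn-in drift
                        rng.normal(size=500)])        # draws from the target
idx = stein_thin(chain, scores=-chain, m=40)          # score of N(0,1) is -x
```

The burn-in values incur a large diagonal penalty k0(x, x) = 1 + x², so the selected points concentrate on the stationary part of the chain.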

“This paper focuses on nonuniform subsampling and shows that it is more efficiency than uniform subsampling.”

Two weeks later, Guanyu Hu and HaiYing Wang arXived their Most Likely Optimal Subsampled Markov Chain Monte Carlo, which I first thought was an answer to the above! But the two papers actually have little in common, as this second one considers subsampling the data, rather than the MCMC output, towards producing scalable algorithms, building upon Bardenet et al. (2014) and Korattikara et al. (2014). It thus replaces the log-likelihood with a random subsampled version and derives the subsample size from a large deviation inequality. By a Cauchy-Schwarz inequality, the authors find sampling probabilities proportional to the individual log-likelihoods. These depend on the running value of the MCMC’ed parameters, and are thus replaced with the values at a fixed parameter, at a cost of O(n), but only once, hence not so optimal. (The large deviation inequality therein is only concerned with an approximation to the log-likelihood, without examining the long-term impact on the convergence of the approximate Markov chain, as this is no longer pseudo-marginal MCMC: for instance, both current and prospective log-likelihoods are re-estimated at each iteration. The paper compares with uniform sampling on toy examples, demonstrating a smaller estimation error for the statistical problem rather than convergence to the true posterior.)
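The flavour of such nonuniform data subsampling can be sketched as follows, with a Hansen-Hurwitz-type weighted estimate of the full log-likelihood and sampling probabilities frozen at a pilot value theta0 (my own toy version, not the authors' algorithm):

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(loc=1.0, scale=1.0, size=10_000)   # toy data, y_i ~ N(theta, 1)

def loglik_terms(theta):
    """Individual log-likelihood contributions l_i(theta)."""
    return -0.5 * (y - theta) ** 2 - 0.5 * np.log(2 * np.pi)

# sampling probabilities computed once, at a fixed pilot estimate theta0
theta0 = y.mean()
p = np.abs(loglik_terms(theta0))
p /= p.sum()

def subsampled_loglik(theta, m=500):
    """Unbiased (Hansen-Hurwitz-type) estimate of the full log-likelihood
    sum from m data points drawn with probabilities p."""
    idx = rng.choice(len(y), size=m, p=p)
    return np.mean(loglik_terms(theta)[idx] / p[idx])

full = loglik_terms(1.0).sum()
est = np.mean([subsampled_loglik(1.0) for _ in range(100)])
```

When the evaluation point sits close to the pilot value, the weighted terms are nearly constant and the estimator has very low variance; the catch, as noted above, is precisely that the probabilities are not updated with the running parameter.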

## [Nature on] simulations driving the world’s response to COVID-19

Posted in Books, pictures, Statistics, Travel, University life on April 30, 2020 by xi'an

Nature of 02 April 2020 has a special section on the simulation methods used to assess and predict the pandemic evolution, calling for caution as the models used therein, like the standard ODE S(E)IR models, rely on assumptions about the spread of the disease and very rarely on data, especially in the early stages of the pandemic. One epidemiologist is quoted as stating “We’re building simplified representations of reality”, but this is not dire enough, as “simplified” evokes “less precise” rather than “possibly grossly misleading”. (The graph above is unrelated to the Nature cover and appears to me as particularly appalling in mixing different types of data, time-scales, populations at risk, and discontinuous updates, essentially returning no information whatsoever.)

“[the model] requires information that can be only loosely estimated at the start of an epidemic, such as the proportion of infected people who die, and the basic reproduction number (…) rough estimates by epidemiologists who tried to piece together the virus’s basic properties from incomplete information in different countries during the pandemic’s early stages. Some parameters, meanwhile, must be entirely assumed.”
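For reference, the deterministic SIR skeleton behind such models fits in a few lines of forward-Euler code, with beta, gamma, and hence R0 = beta/gamma exactly the kind of loosely known inputs the quote refers to (the values below are purely illustrative):

```python
def sir(beta, gamma, s0=0.99, i0=0.01, days=200, dt=0.05):
    """Forward-Euler integration of the standard SIR ODEs:
    ds/dt = -beta*s*i,  di/dt = beta*s*i - gamma*i,  dr/dt = gamma*i."""
    s, i, r = s0, i0, 0.0
    peak = i
    for _ in range(int(days / dt)):
        ds = -beta * s * i
        di = beta * s * i - gamma * i
        dr = gamma * i
        s, i, r = s + dt * ds, i + dt * di, r + dt * dr
        peak = max(peak, i)          # track the epidemic peak
    return s, i, r, peak

# illustrative parameters only: R0 = beta/gamma = 3
s_end, i_end, r_end, peak = sir(beta=0.3, gamma=0.1)
```

Small changes in these two rates move the peak size and the final attack rate dramatically, which is why early projections carry so much uncertainty.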

The report mentions that the team at Imperial College, whose predictions impacted the UK Government decisions, also used an agent-based model, with more variability or stochasticity in individual actions, which requires even more assumptions or much more refined, representative, and trustworthy data.

“Unfortunately, during a pandemic it is hard to get data — such as on infection rates — against which to judge a model’s projections.”

Unfortunately, the paper was written in the early days of the rise of cases in the UK, which means its predictions could not really be confronted with actual numbers of deaths and hospitalisations. The following quote shows how far off they can fall from reality:

“the British response, Ferguson said on 25 March, makes him “reasonably confident” that total deaths in the United Kingdom will be held below 20,000.”

since the total number as of April 29 was above 21,000 (since updated to 24,000, then 29,750) and showing no sign of quickly slowing down… A quite useful general-public article, nonetheless.

## likelihood-free Bayesian design [SimStat 2019 discussion]

Posted in Statistics on September 5, 2019 by xi'an

## consistency of ABC

Posted in pictures, Statistics, Travel, University life on August 25, 2015 by xi'an

Along with David Frazier and Gael Martin from Monash University, Melbourne, we have just completed (and arXived) a paper on the (Bayesian) consistency of ABC methods, producing sufficient conditions on the summary statistics for consistency of the ABC posterior, in the sense of the prior concentrating at the true value of the parameter when the sample size and the inverse tolerance (intolerance?!) go to infinity. The conditions are essentially that the summary statistic concentrates around its mean and that this mean identifies the parameter. They are thus weaker than those found in earlier consistency results, where the authors considered convergence to the genuine posterior distribution (given the summary), as for instance in Biau et al. (2014) or Li and Fearnhead (2015). We do not require here a specific rate of decrease to zero for the tolerance ε. But these conditions still do not hold all the time, as shown by the MA(2) example and its first two autocorrelation summaries, an example we started using in the Marin et al. (2011) survey. We further propose a consistency assessment based on the main consistency theorem, namely that the ABC-based estimates of the marginal posterior densities for the parameters, estimated from simulated data, should vary little when extra components are added to the summary statistic, and that the mean of the resulting summary statistic is indeed one-to-one. This may sound somewhat similar to the stepwise search algorithm of Joyce and Marjoram (2008), but those authors aim at obtaining a vector of summary statistics that is as informative as possible. We also examine the consistency conditions when using an auxiliary model as in indirect inference, for instance an AR(2) auxiliary model for estimating an MA(2) model. And ODEs.
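A bare-bones ABC rejection sampler for this MA(2) setting, with the first two sample autocovariances as summaries, might look as follows (an illustrative sketch, not the code used in the paper):

```python
import numpy as np

rng = np.random.default_rng(2)

def ma2(theta, n=500):
    """Simulate an MA(2) series y_t = e_t + th1*e_{t-1} + th2*e_{t-2}."""
    e = rng.normal(size=n + 2)
    return e[2:] + theta[0] * e[1:-1] + theta[1] * e[:-2]

def summaries(y):
    """First two sample autocovariances, the summaries discussed above."""
    y = y - y.mean()
    return np.array([np.mean(y[1:] * y[:-1]), np.mean(y[2:] * y[:-2])])

def abc(y_obs, n_sim=10_000, quantile=0.01):
    """Plain ABC rejection: keep the parameter draws whose simulated
    summaries fall closest to the observed ones."""
    s_obs = summaries(y_obs)
    th1 = rng.uniform(-2, 2, n_sim)
    th2 = rng.uniform(-1, 1, n_sim)
    keep = (th2 + th1 > -1) & (th2 - th1 > -1)   # invertibility triangle
    thetas = np.column_stack([th1[keep], th2[keep]])
    dist = np.array([np.sum((summaries(ma2(t)) - s_obs) ** 2)
                     for t in thetas])
    return thetas[dist <= np.quantile(dist, quantile)]

post = abc(ma2(np.array([0.6, 0.2])))
```

Rerunning the same sampler with extra summary components appended and comparing the resulting marginal posteriors is exactly the kind of consistency assessment the paper proposes.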

## hypothesis testing for MCMC

Posted in Books, Statistics, University life on October 6, 2014 by xi'an

A recent arXival by Benjamin Gyori and Daniel Paulin considers sequential testing based on MCMC simulation. The test is about an expectation under the target and stationary distribution of the Markov chain (i.e., the posterior in a Bayesian setting). Hence testing whether or not the posterior expectation is below a certain bound is not directly relevant from a Bayesian perspective: one would test instead whether or not the parameter itself is below the bound… The paper is then more a study of sequential tests when the data is a Markov chain than one with any clear connection to MCMC topics, despite the paper including an example of a Metropolis-Hastings scheme for approximating the posterior on the parameters of an ODE. I am a bit puzzled by the purpose of the test, as I was rather expecting tests connected with the convergence of the Markov chain or of the empirical mean. (But, given the current hour, I may also have missed a crucial point!)