## coupling, donkeys, coins & fish meet in Paris

This entry was posted on March 22, 2021 at 2:24 pm and is filed under Statistics with tags #LedByDonkeys, Arianna Rosenbluth, Arthur Dempster, Bernoulli-Laplace diffusion, Charlie Geyer, coupling, coupling inequality, Dempster-Shafer theory, Gérard Letac, Gibbs sampler, Markov chains, MCMC convergence, number of iterations, optimal coupling, Paris, perfect sampling, random walk, rate of convergence, Séminaire Parisien de Statistique, seminar, slide, STAN, Valencia conferences.

### 10 Responses to “coupling, donkeys, coins & fish meet in Paris”

March 23, 2021 at 8:36 am

Is it recorded somewhere?

March 23, 2021 at 8:45 am

I'm afraid not…

March 23, 2021 at 9:11 am

Hello, I've put the slides here: https://www.slideshare.net/pierrejacob/couplings-of-markov-chains-and-the-poisson-equation

March 22, 2021 at 5:52 pm

I think everybody agrees with Geyer that if you can’t get a good estimate with one chain, you won’t get a good estimate with multiple chains. The only issue is whether running multiple chains gives you a better chance of diagnosing sampling problems that aren’t caught by a single-chain ESS estimate. This is obviously helpful with multimodality. Contra Gelman, it’s not so clear to me that it’s useful for much else with HMC, which doesn’t have the pathologically slow mixing of Metropolis or Gibbs where two chains both appear to be sampling just because they’re diffusing in a small region of the posterior. We’ll be able to diagnose that with a single HMC chain that is clearly non-stationary (for instance by applying the split R-hat we use in Stan to both halves of the chain). Four chains sampling in parallel from the target distribution are equally likely to find a bad spot where curvature is high and they get stuck as one long chain (assuming they’ve hit stationarity). Of course, running parallel chains is cheaper in wall time per iteration if you have the spare cores and memory bandwidth.
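To make the single-chain check concrete, here is a minimal sketch of the classic split R-hat applied to the two halves of one chain. It is an illustration only: the function name `split_rhat` and the two simulated chains are assumptions made for this example, and this is the textbook formula rather than the exact (rank-normalized) diagnostic that current Stan releases report.

```python
import numpy as np

def split_rhat(chain):
    """Classic split R-hat for one chain: cut it in half, compare the halves."""
    x = np.asarray(chain, dtype=float)
    n = len(x) // 2
    halves = np.stack([x[:n], x[n:2 * n]])          # shape (2, n)
    within = halves.var(axis=1, ddof=1).mean()      # W: mean within-half variance
    between = n * halves.mean(axis=1).var(ddof=1)   # B: n * variance of half means
    var_plus = (n - 1) / n * within + between / n   # pooled variance estimate
    return np.sqrt(var_plus / within)

# Hypothetical draws: one stationary chain and one with a slow drift.
rng = np.random.default_rng(0)
stationary = rng.normal(size=1000)
drifting = stationary + np.linspace(0.0, 3.0, 1000)

print(split_rhat(stationary))  # close to 1
print(split_rhat(drifting))    # clearly above 1
```

A chain whose second half has drifted away from its first half gets a value well above 1, which is the non-stationarity signal described above.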

We set the default to 2K iterations in Stan heuristically based on a bunch of models. That’s only 1K MCMC steps after 1K adaptation steps. But we default to running 4 chains, for a total sample size of 4K posterior draws. If models are fast and mix well, this is cheap. Some models, but not very many, take longer than this if they’re going to mix. So like with a lot of our decisions, it was a fairly conservative one.

Our target is to get an ESS of 25 per chain, for a total ESS on the order of 100. Much smaller than ESS 25 per chain and we don’t trust our ESS estimator. Much more than ESS = 100 total seems wasteful unless you really need precise expectations for something (100 estimates an expectation to within 1/10 of a posterior sd). But it’s hard convincing users and editors that ESS = 100 is sufficient.
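The "within 1/10 of a posterior sd" figure follows from the usual Monte Carlo standard error formula, written here with $\sigma$ for the posterior standard deviation of the quantity whose expectation is being estimated (the notation is an assumption of this note, not of the comment):

$$
\mathrm{MCSE} \approx \frac{\sigma}{\sqrt{\mathrm{ESS}}},
\qquad
\mathrm{ESS} = 100 \;\Rightarrow\; \mathrm{MCSE} \approx \frac{\sigma}{10},
\qquad
\mathrm{ESS} = 10{,}000 \;\Rightarrow\; \mathrm{MCSE} \approx \frac{\sigma}{100}.
$$

Gaining one more decimal digit of precision therefore costs a hundredfold increase in effective sample size.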

A simple iterative deepening algorithm lets you run to a given tolerance. Say you want ESS = 100. Set an initial number of iterations and chains, let’s say 4 chains of 50 iterations each so that, with half of each chain spent on warmup, we get 4 * 50 / 2 = 100 draws. Measure the ESS that comes out and if it’s less than your target, double the number of iterations. Repeat until it is. Because of the exponential doubling, you need at most a logarithmic number of steps, so this costs about double what it’d cost with an oracle that told you the right number of iterations. There’ll be some minor bias from using a target threshold, but I don’t see that being a problem unless the ESS requested is very low. Unfortunately, Stan runs three sequential phases of warmup, then sampling, so it’s not so easy to take up where you left off on the previous run. I built a prototype of the iterative deepening approach, but nobody seemed to like it (for reasons that still escape me).
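The doubling scheme can be sketched in a few lines. This is not the prototype mentioned above: the AR(1) sampler `ar1_chain`, the crude autocorrelation-based `ess` estimator, and all parameter names are hypothetical stand-ins chosen so the example is self-contained. Each round restarts from scratch and discards the first half of every chain as warmup.

```python
import numpy as np

def ess(x):
    """Crude single-chain ESS: n / (1 + 2 * sum of autocorrelations),
    truncating the sum at the first non-positive autocorrelation."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    var = xc @ xc / n
    if var == 0.0:
        return 0.0
    tau = 1.0
    for lag in range(1, n):
        rho = (xc[:-lag] @ xc[lag:]) / (n * var)
        if rho <= 0.0:
            break
        tau += 2.0 * rho
    return n / tau

def ar1_chain(n_iter, rng, phi=0.9):
    """Hypothetical stand-in for an MCMC sampler: AR(1) with N(0, 1) target."""
    x = np.empty(n_iter)
    x[0] = rng.normal()
    for t in range(1, n_iter):
        x[t] = phi * x[t - 1] + np.sqrt(1.0 - phi**2) * rng.normal()
    return x

def iterative_deepening(sampler, target_ess=100.0, n_chains=4,
                        n_iter=50, max_iter=100_000, seed=0):
    """Double the iterations per chain until the summed ESS hits the target."""
    rng = np.random.default_rng(seed)
    while True:
        # Rerun every chain from scratch; keep the second half of each.
        chains = [sampler(n_iter, rng)[n_iter // 2:] for _ in range(n_chains)]
        total = sum(ess(c) for c in chains)
        if total >= target_ess or n_iter >= max_iter:
            return n_iter, total, chains
        n_iter *= 2

n_iter, total, chains = iterative_deepening(ar1_chain)
print(n_iter, round(total, 1))
```

Since the run lengths form a geometric series (50, 100, 200, ...), the total work across all rounds is less than twice the length of the final round, which is where the "about double" cost comes from.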

March 23, 2021 at 10:21 am

Bob, in your view this problem of the choice of the number of iterations in MCMC is solved then? Or do you still see it as mostly open?

March 23, 2021 at 5:49 pm

Not at all! I just think running multiple chains vs. running one long chain is a non-issue, especially in light of parallel computing. I also see no harm in running convergence diagnostics vs. just blindly trusting an MCMC run. I also don’t see much danger in using something like iterative deepening to run until you get the estimated ESS you’re after. It just automates what people do outside the box anyway. It could stand to be analyzed theoretically in terms of possible bias due to variance of the ESS estimates.

I think your work on coupling provides a bound on how much time is required for an independent draw. But it’s still empirical in practice. And the results seem to require quite a few iterations relative to the number we use in practice to get fairly reliable accuracy. So there’s a practice/theory gap to explore there, I think.

What would be really great is some way to inspect a model and figure out how many iterations we’ll need to get within a given error. Even if that took some MCMC to figure out.

But even for that, we’d need to be able to establish something like geometric ergodicity to get an MCMC central limit theorem, which can be a matter of data plus model plus initialization point, not just model alone.

The real problem in practice is that convergence diagnostics don’t have 100% sensitivity. A common misdiagnosis is the limit of a hierarchical model with no data, Neal’s funnel, coded with a centered parameterization. HMC (and NUTS) fail to explore the posterior and wind up with biased log scale estimates, yet nevertheless still look OK based on effective sample size and R-hat convergence estimates. This is a practical hurdle for modelers who have to recognize the problem on first principles or in slow movement of hierarchical parameters, then figure out how to code a reparameterization with appropriate Jacobians.
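As a small illustration of the reparameterization and its Jacobian, the sketch below writes the funnel in its usual two-variable form, y ~ N(0, 3) and x ~ N(0, exp(y/2)), and checks numerically that the non-centered density on (y, x_raw), with x = exp(y/2) * x_raw, equals the centered density plus the log Jacobian y/2. The function names and the test point are assumptions made for this example.

```python
import math

def normal_logpdf(x, mu, sigma):
    """Log density of N(mu, sigma) at x, with sigma a standard deviation."""
    return (-0.5 * math.log(2.0 * math.pi) - math.log(sigma)
            - 0.5 * ((x - mu) / sigma) ** 2)

def centered_logp(y, x):
    """Centered funnel: y ~ N(0, 3), x ~ N(0, exp(y / 2))."""
    return normal_logpdf(y, 0.0, 3.0) + normal_logpdf(x, 0.0, math.exp(y / 2.0))

def noncentered_logp(y, x_raw):
    """Non-centered funnel: y ~ N(0, 3), x_raw ~ N(0, 1), independent."""
    return normal_logpdf(y, 0.0, 3.0) + normal_logpdf(x_raw, 0.0, 1.0)

def to_centered(y, x_raw):
    """Map non-centered coordinates back to the original ones."""
    return y, math.exp(y / 2.0) * x_raw

# Arbitrary point deep in the neck of the funnel.
y, x_raw = -4.0, 0.7
_, x = to_centered(y, x_raw)
log_jacobian = y / 2.0  # log |dx / dx_raw| = log exp(y / 2)
gap = noncentered_logp(y, x_raw) - (centered_logp(y, x) + log_jacobian)
print(abs(gap) < 1e-9)
```

In the non-centered coordinates the target is a product of independent normals, so a sampler no longer has to adapt its step size to the narrow neck of the funnel.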

The fact that our convergence diagnostics are imperfect and there’s no way to theoretically guarantee geometric ergodicity for a given model and data combination is what motivates three steps in practical model fitting: simulation-based calibration to make sure the algorithm can fit well specified simulated data with appropriate posterior coverage, posterior predictive checks to make sure the model can fit relevant quantities of interest in the actual data, and cross validation or held-out validation to make sure the model can fit relevant quantities of interest in held out data. And even then, none of these things guarantee you get the right answer.
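The first of these steps, simulation-based calibration, can be sketched on a toy conjugate model where the exact posterior is known: theta ~ N(0, 1) and y_i ~ N(theta, 1). In this hedged example, exact posterior draws stand in for the output of an MCMC run, and the function name `sbc_ranks` and its arguments are assumptions. If the fitting procedure is correct, the rank of the simulated theta among its posterior draws is uniform on 0, ..., n_draws.

```python
import numpy as np

def sbc_ranks(n_sims=1000, n_obs=10, n_draws=99, seed=0):
    """Rank statistics for simulation-based calibration on a toy normal model."""
    rng = np.random.default_rng(seed)
    ranks = np.empty(n_sims, dtype=int)
    for s in range(n_sims):
        theta = rng.normal()                      # draw from the prior N(0, 1)
        y = rng.normal(theta, 1.0, size=n_obs)    # simulate data given theta
        # Exact conjugate posterior; a real check would run the sampler here.
        post_mean = y.sum() / (n_obs + 1)
        post_sd = np.sqrt(1.0 / (n_obs + 1))
        draws = rng.normal(post_mean, post_sd, size=n_draws)
        ranks[s] = np.sum(draws < theta)          # rank of theta among draws
    return ranks

ranks = sbc_ranks()
counts, _ = np.histogram(ranks, bins=10, range=(0, 100))
print(counts)  # roughly equal counts indicate calibration
```

A histogram of ranks that piles up at the ends or in the middle would point to a posterior that is too narrow or too wide, even when per-run diagnostics look fine.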

March 22, 2021 at 3:27 pm

The editors did not want the title ‘Donkey business.’

March 23, 2021 at 8:44 am

Editors are killjoys! Except those who accepted “Gibbs for pigs”.

March 23, 2021 at 9:12 am

Haha. Very good paper in any case, thanks!

March 23, 2021 at 3:54 pm

I've put some explanations of the Dempster-donkey link here: https://statisfaction.wordpress.com/2021/03/23/dempsters-analysis-and-donkeys/