## more air for MCMC

**A**ki Vehtari, Andrew Gelman, Dan Simpson, Bob Carpenter, and Paul-Christian Bürkner have just published a Bayesian Analysis paper about using an improved R factor for MCMC convergence assessment. From the early days of MCMC, convergence assessment has been a recurring (and recurrent!) question in the community. First leading to a flurry of proposals, [which Kerrie, Chantal, and myself reviewwwed in the Valencia 1998 proceedings], and then slowly disintegrating under the onslaughts of reality—i.e. that none could not be 100% foolproof in full generality—…. This included the (possibly now forgotten) *single-versus-multiple-chains* debate between Charlie Geyer [for single] and Andrew Gelman and Don Rubin [for multiple]. The later introduced an analysis-of-variance R factor, which remains quite popular up to this day, in part for being part of most MCMC software, like BUGS. That this R may fail to identify convergence issues, even in the more recent split version, does not come as a major surprise, since any situation with a long-term influence of the starting distribution may well fail to identify missing (significant) parts of the posterior support. (It is thus somewhat disconcerting to me to see that the main recommendation is to move the bound on R from 1.1 to 1.01, reminding me to some extent of a recent proposal to move the null rejection boundary from 0.05 to 0.005…) Similarly, the ESS may prove a poor signal for convergence or lack thereof, especially because the approximation of the asymptotic variance relies on stationarity assumptions. While multiplying the monitoring tools (as in CODA) helps with identifying convergence issues, looking at a single convergence indicator is somewhat like looking only at a frequentist estimator! (And with greater automation comes greater responsibility—in keeping a critical perspective.)

Looking for a broader perspective, I thus wonder at what we would instead need to assess the lack of convergence of an MCMC chain without much massaging of the said chain. An evaluation of the (Kullback, Wasserstein, or else) distance between the distribution of the chain at iteration n or across iterations, and the true target? A percentage of the mass of the posterior visited so far, which relates to estimating the normalising constant, with a relatively vast array of solutions made available in the recent years? I remain perplexed and frustrated by the fact that, 30 years later, the computed values of the visited likelihoods are not better exploited. Through for instance machine-learning approximations of the target. that could themselves be utilised for approximating the normalising constant and potential divergences from other approximations.

June 10, 2021 at 4:23 am

Two quick points:

* By coincidence (and for a different problem), I’ve just been looking at the work of Gorham & Mackey that I believe Pierre is referring to. This is probably the relevant paper: “Measuring Sample Quality with Kernels” (https://arxiv.org/abs/1703.01717).

* Besides their new rank-based R-hat, bloggers on Gelman’s blog have also pointed to another R-hat replacement, R*, developed by some Stan team members; it is “based on how well a machine learning classifier model can successfully discriminate the individual chains.” See: “R*: A robust MCMC convergence diagnostic with uncertainty using decision tree classifiers” (https://arxiv.org/abs/2003.07900).

In addition, here’s an anecdote regarding your comment, “I remain perplexed and frustrated by the fact that, 30 years later, the computed values of the visited likelihoods are not better exploited.”

That has long bothered me, too. During a SAMSI program around 2006, I spent time working on one approach that tried to use the prior*likelihood (I call it q(theta), for “quasiposterior” and because it’s next to “p”!) to compute the marginal likelihood. It would take posterior samples (from MCMC or another approach) and find their Delaunay triangulation. Then, using q(theta) on the nodes of the simplices comprising the triangulation, it used a simplicial cubature rule to approximate the integral of q(theta) over the volume spanned by the samples.

As I recall, I only explored it with multivariate normal and Student-t targets. It failed, but in an interesting way. It worked well in low dimensions, but gave increasingly poor estimates as dimension grew. The problem appeared related to concentration of measure (or the location of the typical set), with the points not sufficiently covering the center or the large volume in the tails (or both; I can’t remember what diagnostics said exactly).

Another problem is that Delaunay triangulation gets expensive quickly with growing dimension. This method doesn’t need an optimal triangulation, so I wondered if there was a faster sub-optimal triangulation algorithm, but I couldn’t find one.

An interesting aspect of this approach is that the fact that the points are drawn from the prior doesn’t matter. Any set of points is a valid set of points for approximating the integral (in the spanned volume). I just used posterior samples because I presumed those would be available from MCMC. I briefly did some experiments taking the samples, and reweighting them to draw a subset for the cubature that was either over- or under-dispersed vs. the target. And one could improve things this way (I can’t remember what choice was better). This suggests that points drawn from q(theta) aren’t optimal for such cubature, but I never tried looking formally for the optimal choice.

I called the approach “adaptive simplicial cubature,” adaptive in the sense that the points are chosen in a way that depends on the integrand.

The only related work I could find at the time was work by *you* and Anne Philippe on Riemanns sums with MCMC (https://doi.org/10.1023/A:1008926514119). I later stumbled upon a paper on “random Riemann sum estimators” as an alternative to Monte Carlo that seems related but that I didn’t explore further (https://doi.org/10.1016/j.csda.2006.09.041).

I still find it hard to believe that the q values aren’t useful. Admittedly, in an n-dimensional distribution, it’s just 1 more quantity available beyond the n that comprise the sample location. But it’s a qualitatively different type of information from the sample *location*, and I can’t help but think there’s some clever way to use it (besides emulating the response surface).

June 1, 2021 at 10:12 pm

Relating to the evaluation of a notion of distance between the samples and the target distribution, the works of Jackson Gorham and Lester Mackey seem relevant.

May 30, 2021 at 8:13 am

[…] article was first published on R – Xi'an's Og, and kindly contributed to R-bloggers]. (You can report issue about the content on this page here) […]