Archive for convergence assessment

more air [&q’s] for MCMC [comments]

Posted in Books, pictures, Statistics with tags , , , , , , , , , on June 11, 2021 by xi'an

[A rich set of comments by Tom Loredo about convergence assessments for MCMC that I feel needs reposting:]

Two quick points:

  • By coincidence (and for a different problem), I’ve just been looking at the work of Gorham & Mackey that I believe Pierre is referring to. This is probably the relevant paper: “Measuring Sample Quality with Kernels“.
  • Besides their new rank-based R-hat, bloggers on Gelman’s blog have also pointed to another R-hat replacement, R, developed by some Stan team members; it is “based on how well a machine learning classifier model can successfully discriminate the individual chains.” See: “R: A robust MCMC convergence diagnostic with uncertainty using decision tree classifiers”.

In addition, here’s an anecdote regarding your comment, “I remain perplexed and frustrated by the fact that, 30 years later, the computed values of the visited likelihoods are not better exploited.”

That has long bothered me, too. During a SAMSI program around 2006, I spent time working on one approach that tried to use the prior*likelihood (I call it q(θ), for “quasiposterior” and because it’s next to “p”!) to compute the marginal likelihood. It would take posterior samples (from MCMC or another approach) and find their Delaunay triangulation. Then, using q(θ) on the nodes of the simplices comprising the triangulation, it used a simplicial cubature rule to approximate the integral of q(theta) over the volume spanned by the samples.

As I recall, I only explored it with multivariate normal and Student-t targets. It failed, but in an interesting way. It worked well in low dimensions, but gave increasingly poor estimates as dimension grew. The problem appeared related to concentration of measure (or the location of the typical set), with the points not sufficiently covering the center or the large volume in the tails (or both; I can’t remember what diagnostics said exactly).

Another problem is that Delaunay triangulation gets expensive quickly with growing dimension. This method doesn’t need an optimal triangulation, so I wondered if there was a faster sub-optimal triangulation algorithm, but I couldn’t find one.

An interesting aspect of this approach is that the fact that the points are drawn from the prior doesn’t matter. Any set of points is a valid set of points for approximating the integral (in the spanned volume). I just used posterior samples because I presumed those would be available from MCMC. I briefly did some experiments taking the samples, and reweighting them to draw a subset for the cubature that was either over- or under-dispersed vs. the target. And one could improve things this way (I can’t remember what choice was better). This suggests that points drawn from q(theta) aren’t optimal for such cubature, but I never tried looking formally for the optimal choice.

I called the approach “adaptive simplicial cubature,” adaptive in the sense that the points are chosen in a way that depends on the integrand.

The only related work I could find at the time was work by you and Anne Philippe on Riemanns sums with MCMC (https://doi.org/10.1023/A:1008926514119). I later stumbled upon a paper on “random Riemann sum estimators” as an alternative to Monte Carlo that seems related but that I didn’t explore further (https://doi.org/10.1016/j.csda.2006.09.041).

I still find it hard to believe that the q values aren’t useful. Admittedly, in an n-dimensional distribution, it’s just 1 more quantity available beyond the n that comprise the sample location. But it’s a qualitatively different type of information from the sample location, and I can’t help but think there’s some clever way to use it (besides emulating the response surface).

more air for MCMC

Posted in Books, R, Statistics with tags , , , , , , , , , , , , , , on May 30, 2021 by xi'an

Aki Vehtari, Andrew Gelman, Dan Simpson, Bob Carpenter, and Paul-Christian Bürkner have just published a Bayesian Analysis paper about using an improved R factor for MCMC convergence assessment. From the early days of MCMC, convergence assessment has been a recurring (and recurrent!) question in the community. First leading to a flurry of proposals, [which Kerrie, Chantal, and myself reviewwwed in the Valencia 1998 proceedings], and then slowly disintegrating under the onslaughts of reality—i.e. that none could not be 100% foolproof in full generality—…. This included the (possibly now forgotten) single-versus-multiple-chains debate between Charlie Geyer [for single] and Andrew Gelman and Don Rubin [for multiple]. The later introduced an analysis-of-variance R factor, which remains quite popular up to this day, in part for being part of most MCMC software, like BUGS. That this R may fail to identify convergence issues, even in the more recent split version, does not come as a major surprise, since any situation with a long-term influence of the starting distribution may well fail to identify missing (significant) parts of the posterior support. (It is thus somewhat disconcerting to me to see that the main recommendation is to move the bound on R from 1.1 to 1.01, reminding me to some extent of a recent proposal to move the null rejection boundary from 0.05 to 0.005…) Similarly, the ESS may prove a poor signal for convergence or lack thereof, especially because the approximation of the asymptotic variance relies on stationarity assumptions. While multiplying the monitoring tools (as in CODA) helps with identifying convergence issues, looking at a single convergence indicator is somewhat like looking only at a frequentist estimator! (And with greater automation comes greater responsibility—in keeping a critical perspective.)

Looking for a broader perspective, I thus wonder at what we would instead need to assess the lack of convergence of an MCMC chain without much massaging of the said chain. An evaluation of the (Kullback, Wasserstein, or else) distance between the distribution of the chain at iteration n or across iterations, and the true target? A percentage of the mass of the posterior visited so far, which relates to estimating the normalising constant, with a relatively vast array of solutions made available in the recent years? I remain perplexed and frustrated by the fact that, 30 years later, the computed values of the visited likelihoods are not better exploited. Through for instance machine-learning approximations of the target. that could themselves be utilised for approximating the normalising constant and potential divergences from other approximations.

assessing MCMC convergence

Posted in Books, Statistics, University life with tags , , , , , , , , , , , on June 6, 2019 by xi'an

When MCMC became mainstream in the 1990’s, there was a flurry of proposals to check, assess, and even guarantee convergence to the stationary distribution, as discussed in our MCMC book. Along with Chantal Guihenneuc and Kerrie Mengersen, we also maintained for a while a reviewww webpage categorising theses. Niloy Biswas and Pierre Jacob have recently posted a paper where they propose the use of couplings (and unbiased MCMC) towards deriving bounds on different metrics between the target and the current distribution of the Markov chain. Two chains are created from a given kernel and coupled with a lag of L, meaning that after a while, the two chains become one with a time difference of L. (The supplementary material contains many details on how to induce coupling.) The distance to the target can then be bounded by a sum of distances between the two chains until they merge. The above picture from the paper is a comparison a Polya-Urn sampler with several HMC samplers for a logistic target (not involving the Pima Indian dataset!). The larger the lag L the more accurate the bound. But the larger the lag the more expensive the assessment of how many steps are needed to convergence. Especially when considering that the evaluation requires restarting the chains from scratch and rerunning until they couple again, rather than continuing one run which can only brings the chain closer to stationarity and to being distributed from the target. I thus wonder at the possibility of some Rao-Blackwellisation of the simulations used in this assessment (while realising once more than assessing convergence almost inevitably requires another order of magnitude than convergence itself!). Without a clear idea of how to do it… For instance, keeping the values of the chain(s) at the time of coupling is not directly helpful to create a sample from the target since they are not distributed from that target.

[Pierre also wrote a blog post about the paper on Statisfaction that is definitely much clearer and pedagogical than the above.]

EntropyMCMC [R package]

Posted in Statistics with tags , , , , , , , , , , , , on March 26, 2019 by xi'an

My colleague from the Université d’Orléans, Didier Chauveau, has just published on CRAN a new R package called EntropyMCMC, which contains convergence assessment tools for MCMC algorithms, based on non-parametric estimates of the Kullback-Leibler divergence between current distribution and target. (A while ago, quite a while ago!, we actually collaborated with a few others on the Springer-Verlag Lecture Note #135 Discretization and MCMC convergence assessments.) This follows from a series of papers by Didier Chauveau and Pierre Vandekerkhove that started with a nearest neighbour entropy estimate. The evaluation of this entropy is based on N iid (parallel) chains, which involves a parallel implementation. While the missing normalising constant is overwhelmingly unknown, the authors this is not a major issue “since we are mostly interested in the stabilization” of the entropy distance. Or in the comparison of two MCMC algorithms. [Disclaimer: I have not experimented with the package so far, hence cannot vouch for its performances over large dimensions or problematic targets, but would as usual welcome comments and feedback on readers’ experiences.]

unbiased MCMC

Posted in Books, pictures, Statistics, Travel, University life with tags , , , , , , , on August 25, 2017 by xi'an

Two weeks ago, Pierre Jacob, John O’Leary, and Yves F. Atchadé arXived a paper on unbiased MCMC with coupling. Associating MCMC with unbiasedness is rather challenging since MCMC are rarely producing simulations from the exact target, unless specific tools like renewal can be produced in an efficient manner. (I supported the use of such renewal techniques as early as 1995, but later experiments led me to think renewal control was too rare an occurrence to consider it as a generic convergence assessment method.)

This new paper makes me think I had given up too easily! Here the central idea is coupling of two (MCMC) chains, associated with the debiasing formula used by Glynn and Rhee (2014) and already discussed here. Having the coupled chains meet at some time with probability one implies that the debiasing formula does not need a (random) stopping time. The coupling time is sufficient. Furthermore, several estimators can be derived from the same coupled Markov chain simulations, obtained by starting the averaging at a later time than the first iteration. The average of these (unbiased) averages results into a weighted estimate that weights more the later differences. Although coupling is also at the basis of perfect simulation methods, the analogy between this debiasing technique and perfect sampling is hard to fathom, since the coupling of two chains is not a perfect sampling instant. (Something obvious only in retrospect for me is that the variance of the resulting unbiased estimator is at best the variance of the original MCMC estimator.)

When discussing the implementation of coupling in Metropolis and Gibbs settings, the authors give a simple optimal coupling algorithm I was not aware of. Which is a form of accept-reject also found in perfect sampling I believe. (Renewal based on small sets makes an appearance on page 11.) I did not fully understood the way two random walk Metropolis steps are coupled, in that the normal proposals seem at odds with the boundedness constraints. But coupling is clearly working in this setting, while renewal does not. In toy examples like the (Efron and Morris!) baseball data and the (Gelfand and Smith!) pump failure data, the parameters k and m of the algorithm can be optimised against the variance of the averaged averages. And this approach comes highly useful in the case of the cut distribution,  a problem which I became aware of during MCMskiii and on which we are currently working with Pierre and others.