more air [&q’s] for MCMC [comments]

[A rich set of comments by Tom Loredo about convergence assessments for MCMC that I feel needs reposting:]

Two quick points:

  • By coincidence (and for a different problem), I’ve just been looking at the work of Gorham & Mackey that I believe Pierre is referring to. This is probably the relevant paper: “Measuring Sample Quality with Kernels”.
  • Besides their new rank-based R-hat, bloggers on Gelman’s blog have also pointed to another R-hat replacement, R*, developed by some Stan team members; it is “based on how well a machine learning classifier model can successfully discriminate the individual chains.” See: “R*: A robust MCMC convergence diagnostic with uncertainty using decision tree classifiers”.

In addition, here’s an anecdote regarding your comment, “I remain perplexed and frustrated by the fact that, 30 years later, the computed values of the visited likelihoods are not better exploited.”

That has long bothered me, too. During a SAMSI program around 2006, I spent time working on one approach that tried to use the prior*likelihood (I call it q(θ), for “quasiposterior” and because it’s next to “p”!) to compute the marginal likelihood. It would take posterior samples (from MCMC or another approach) and find their Delaunay triangulation. Then, using q(θ) on the nodes of the simplices comprising the triangulation, it used a simplicial cubature rule to approximate the integral of q(θ) over the volume spanned by the samples.
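[A minimal sketch of the idea, not Tom's actual code. It assumes NumPy and SciPy, uses the simplest possible cubature rule (simplex volume times the average of q(θ) at its vertices), and tests it on a toy two-dimensional Gaussian q(θ) whose integral is known to be 2π. The function name simplicial_cubature and the toy target are illustrative choices, not taken from the comment.]

```python
import math

import numpy as np
from scipy.spatial import Delaunay


def simplicial_cubature(points, q_values):
    """Approximate the integral of q over the convex hull of `points`.

    Assumption: a degree-1 vertex rule on each Delaunay simplex,
    i.e. simplex volume times the mean of q at its d+1 vertices.
    """
    points = np.asarray(points, dtype=float)
    q_values = np.asarray(q_values, dtype=float)
    d = points.shape[1]
    tri = Delaunay(points)                      # needs d >= 2
    verts = points[tri.simplices]               # (n_simplices, d+1, d)
    edges = verts[:, 1:, :] - verts[:, :1, :]   # (n_simplices, d, d)
    volumes = np.abs(np.linalg.det(edges)) / math.factorial(d)
    return float(np.sum(volumes * q_values[tri.simplices].mean(axis=1)))


def q(theta):
    """Toy quasiposterior: unnormalised standard normal, integral (2*pi)^(d/2)."""
    return np.exp(-0.5 * np.sum(theta**2, axis=1))


rng = np.random.default_rng(0)
d = 2
samples = rng.standard_normal((2000, d))        # stand-in for MCMC output
z_hat = simplicial_cubature(samples, q(samples))
z_true = (2 * np.pi) ** (d / 2)
print(f"estimate = {z_hat:.3f}, truth = {z_true:.3f}")
```

[Note that the estimate only covers the volume spanned by the samples, so any mass outside their convex hull is missed by construction.]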

As I recall, I only explored it with multivariate normal and Student-t targets. It failed, but in an interesting way. It worked well in low dimensions, but gave increasingly poor estimates as dimension grew. The problem appeared related to concentration of measure (or the location of the typical set), with the points not sufficiently covering the center or the large volume in the tails (or both; I can’t remember what diagnostics said exactly).
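[A small illustration of the concentration effect mentioned above, under the assumption of a standard normal target: as the dimension d grows, the distances of the samples from the mode cluster around sqrt(d), so no sample lands near the center. The helper radius_summary is a hypothetical name used only for this sketch.]

```python
import math
import random


def radius_summary(d, n=2000, seed=0):
    """Min, mean and max distance from the origin of n standard normal draws in R^d."""
    rng = random.Random(seed)
    radii = [
        math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d)))
        for _ in range(n)
    ]
    return min(radii), sum(radii) / n, max(radii)


for d in (2, 10, 50):
    r_min, r_mean, r_max = radius_summary(d)
    print(
        f"d={d:3d}  min={r_min:.2f}  mean={r_mean:.2f}  "
        f"max={r_max:.2f}  sqrt(d)={math.sqrt(d):.2f}"
    )
```

[In two dimensions some draws fall close to the mode, while in fifty dimensions even the closest of 2000 draws stays far from it, which is consistent with the triangulation leaving the center poorly covered.]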

Another problem is that Delaunay triangulation gets expensive quickly with growing dimension. This method doesn’t need an optimal triangulation, so I wondered if there was a faster sub-optimal triangulation algorithm, but I couldn’t find one.

An interesting aspect of this approach is that the fact that the points are drawn from the posterior doesn’t matter. Any set of points is a valid set of points for approximating the integral (in the spanned volume). I just used posterior samples because I presumed those would be available from MCMC. I briefly did some experiments taking the samples, and reweighting them to draw a subset for the cubature that was either over- or under-dispersed vs. the target. And one could improve things this way (I can’t remember what choice was better). This suggests that points drawn from q(θ) aren’t optimal for such cubature, but I never tried looking formally for the optimal choice.

I called the approach “adaptive simplicial cubature,” adaptive in the sense that the points are chosen in a way that depends on the integrand.

The only related work I could find at the time was work by you and Anne Philippe on Riemann sums with MCMC (https://doi.org/10.1023/A:1008926514119). I later stumbled upon a paper on “random Riemann sum estimators” as an alternative to Monte Carlo that seems related but that I didn’t explore further (https://doi.org/10.1016/j.csda.2006.09.041).

I still find it hard to believe that the q values aren’t useful. Admittedly, in an n-dimensional distribution, it’s just 1 more quantity available beyond the n that comprise the sample location. But it’s a qualitatively different type of information from the sample location, and I can’t help but think there’s some clever way to use it (besides emulating the response surface).

4 Responses to “more air [&q’s] for MCMC [comments]”

  1. Pierre Jacob Says:

    Probably a trivial remark, but when considering MCMC on target distributions that are defined as uniform distributions on some set, for example some convex set defined by linear constraints (e.g. https://arxiv.org/abs/1710.08165), it’s clear that the values of the target density (which are all equal to one another) can’t help in assessing convergence or improving estimates. So at least we can’t expect these ‘q-values’ to be useful in general.

    • This may be the exceptional counter-example, though!

      • However, isn’t there a class of functions which are not flat but have nearly flat plateaus which are comparably problematic? For example, this is a known numerical problem for Maximum Likelihood, per

        Lima, Verônica MC, and Francisco Cribari-Neto. “Penalized maximum likelihood estimation in the modified extended Weibull distribution.” Communications in Statistics-Simulation and Computation 48, no. 2 (2019): 334-349.

        and is also emphasized in

        Konishi, Sadanori, and Genshiro Kitagawa. Information criteria and statistical modeling. Springer Science & Business Media, 2008.

        In particular Konishi and Kitagawa make the point that if likelihoods are monotonic in a region but have sufficiently small absolute slopes, the numerical uncertainty in the MLE is high, even if it exists, because the likelihood itself can be perturbed by variations in parameters.

      • Indeed. But, unless the prior is flat as well, it should normalise the likelihood to some extent. And if not, MCMC should have a field day running around the parameter space…
