## more air for MCMC

Posted in Books, R, Statistics with tags , , , , , , , , , , , , , , on May 30, 2021 by xi'an

Aki Vehtari, Andrew Gelman, Dan Simpson, Bob Carpenter, and Paul-Christian Bürkner have just published a Bayesian Analysis paper about using an improved R factor for MCMC convergence assessment. From the early days of MCMC, convergence assessment has been a recurring (and recurrent!) question in the community. First leading to a flurry of proposals, [which Kerrie, Chantal, and myself reviewwwed in the Valencia 1998 proceedings], and then slowly disintegrating under the onslaughts of reality—i.e. that none could not be 100% foolproof in full generality—…. This included the (possibly now forgotten) single-versus-multiple-chains debate between Charlie Geyer [for single] and Andrew Gelman and Don Rubin [for multiple]. The later introduced an analysis-of-variance R factor, which remains quite popular up to this day, in part for being part of most MCMC software, like BUGS. That this R may fail to identify convergence issues, even in the more recent split version, does not come as a major surprise, since any situation with a long-term influence of the starting distribution may well fail to identify missing (significant) parts of the posterior support. (It is thus somewhat disconcerting to me to see that the main recommendation is to move the bound on R from 1.1 to 1.01, reminding me to some extent of a recent proposal to move the null rejection boundary from 0.05 to 0.005…) Similarly, the ESS may prove a poor signal for convergence or lack thereof, especially because the approximation of the asymptotic variance relies on stationarity assumptions. While multiplying the monitoring tools (as in CODA) helps with identifying convergence issues, looking at a single convergence indicator is somewhat like looking only at a frequentist estimator! (And with greater automation comes greater responsibility—in keeping a critical perspective.)

Looking for a broader perspective, I thus wonder at what we would instead need to assess the lack of convergence of an MCMC chain without much massaging of the said chain. An evaluation of the (Kullback, Wasserstein, or else) distance between the distribution of the chain at iteration n or across iterations, and the true target? A percentage of the mass of the posterior visited so far, which relates to estimating the normalising constant, with a relatively vast array of solutions made available in the recent years? I remain perplexed and frustrated by the fact that, 30 years later, the computed values of the visited likelihoods are not better exploited. Through for instance machine-learning approximations of the target. that could themselves be utilised for approximating the normalising constant and potential divergences from other approximations.

## MCqMC 2014 [day #4]

Posted in pictures, Running, Statistics, Travel, University life with tags , , , , , , , , , , on April 11, 2014 by xi'an

I hesitated in changing the above title for “MCqMSmaug” as the plenary talk I attended this morning was given by Wenzel Jakob, who uses Markov chain Monte Carlo methods in image rendering and light simulation. The talk was low-tech’, with plenty of pictures and animations (incl. excerpts from recent blockbusters!), but it stressed how much proper rending relies on powerful MCMC techniques. One point particularly attracted my attention, namely the notion of manifold exploration as it seemed related to my zero measure recent post. (A related video is available on Jakob’s webpage.) You may then wonder where the connection with Smaug could be found: Wenzel Jakob is listed in the credits of both Hobbit movies for his contributions to the visual effects! (Hey, MCMC made Smaug [visual effects the way they are], a cool argument for selling your next MCMC course! I will for sure include a picture of Smaug in my next R class presentation…) The next sessions of the morning opposed Sobol’s memorial to more technical light rendering and I chose Sobol, esp. because I had missed Art Owen’s tutorial on Sunday, as he gave a short presentation on using Sobol’s criteria to identify variables contributing the most to the variability or extreme values of a function, an extreme value kind of ANOVA, most interesting if far from my simulation area… The afternoon sessions saw MCMC talks by Luke Bornn and Scott Schmidler, both having connection with the Wang-Landau algorithm. Actually, Scott’s talk was the one generating the most animated discussion among all those I attended in MCqMC! (To the point of the chairman getting rather rudely making faces…)

## informative hypotheses (book review)

Posted in Books, R, Statistics with tags , , , , , , on September 19, 2013 by xi'an

The title of this book Informative Hypotheses somehow put me off from the start: the author, Hebert Hoijtink, seems to distinguish between informative and uninformative (deformative? disinformative?) hypotheses. Namely, something like

H0: μ1234

is “very informative” and unrealistic, and the alternative Ha is completely uninformative, while the “alternative null”

H1: μ1<μ23<μ4

is informative. (Hence the < signs on the cover. One of my book reviews idiosyncrasies is to find hidden meaning behind the cover design…) The idea is thus to have the researcher give some input in the construction of the null hypothesis (as if hypothesis tests usually were not about questions that mattered….).

In fact, this distinction put me off so much that I only ended up reading chapters 1 (an introduction), 3 (an introduction [to the Bayesian processing of such hypotheses]) and 10 (on Bayesian foundations of testing informative hypotheses). Hence a very biased review of Informative Hypotheses that follows….

Given an existing (but out of print?) reference like Robertson, Wright and Dykjstra (1988), that I particularly enjoyed when working on isotonic regression in the mid 90’s, I do not see much of an added value in the present book. The important references are mostly centred on works by the author and his co-authors or students (often Unpublished or In Press), which gives me the impression the book was hurriedly gathered from those papers.

“The Bayes factor (…) is default, objective, based on an appropriate quantification of complexity.” (p.197)

The first chapter of Informative Hypotheses is a motivation for the study of those informative hypotheses, with a focus on ANOVA models. There is not much in the chapter that explains what is so special about those ordering (null) hypotheses and why a whole book is required to cover their processing. A noteworthy specificity of the approach, nonetheless, is that point null hypotheses seem to be replaced with “about equality constraints” (p.9), |μ23|<d, where d is specified by the researcher as significant. This chapter also gives illustrations of ordered (or informative) hypotheses in the settings of analysis of covariance (ANCOVA) and regression models, but does not indicate (yet) how to run the tests. The concluding section is about the epistemological focus of the book, quoting Popper, Sober and Carnap, although I do not see much of a support in those quotes.

“Objective means that Bayes factors based on this prior distribution are essentially independent of this prior distribution.” (p.53)

Chapter 3 starts the introduction to Bayesian statistics with the strange idea of calling the likelihood the “density of the data”. It is indeed the probability density of the model evaluated at the data but… it conveys a confusing meaning since it is not a density when plotted against the parameters (as in Figure 1, p. 44, where, incidentally the exact probability model is not specified). The prior distribution is defined as a normal x inverse chi-square distribution on the vector of the means (in the ANOVA model) and the common variance. Due to the classification of the variance as a nuisance parameter, the author can get away with putting an improper prior on this parameter (p.46). The normal prior is chosen to be “neutral”, i.e. to give the same prior weight to the null and the alternative hypotheses. This seems logical at some initial level, but constructing such a prior for convoluted hypotheses may simply be impossible… Because the null hypothesis has a positive mass (maybe .5) under the “unconstrained prior” (p.48), the author can also get away with projecting this prior onto the constrained space of the null hypothesis. Even when setting the prior variance to oo (p.50). The Bayes factor is then the ratio of the (posterior and prior) normalising constants over the constrained parameter space. The book still mentions the Lindley-Bartlett paradox (p.60) in the case of the about equality hypotheses. The appendix to this chapter mentions the issue of improper priors and the need for accommodating infinite mass with training samples, providing a minimum training sample solution using mixtures that sound fairly ad hoc to me.

“Bayes factors for the evaluation of informative hypotheses have a simple form.” (p. 193)

Chapter 10 is the final chapter of Informative Hypotheses, on “Foundations of Bayesian evaluation of informative hypotheses”, and I was expecting a more in-depth analysis of those special hypotheses, but it is mostly a repetition of what is found in Chapter 3, the wider generality being never exploited to a useful depth. There is also this gem quoted above  that, because Bayes factors are the ratio of two (normalising) constants, fm/cm, they have a “simple form”. The reference to Carlin and Chib (1995) for computing other cases then sounds pretty obscure. (Another tiny gem is that I spotted the R software contingency spelled with three different spellings.)  The book mentions the Savage-Dickey representation of the Bayes factor, but I could not spot the connection from the few lines (p.193) dedicated to this ratio. More generally, I do not find the generality of this chapter particularly convincing, most of it replicating the notions found in Chapter 3., like the use of posterior priors. The numerical approximation of Bayes factors is proposed via simulation from the unconstrained prior and posterior (p.207) then via a stepwise decomposition of the Bayes factor (p.208) and a Gibbs sampler that relies on inverse cdf sampling.

Overall, I feel that this book came out too early, without a proper basis and dissemination of the ideas of the author: to wit, a large number of references are connected to the author, some In Press, other Unpublished (which leads to a rather abstract “see Hoijtink (Unpublished) for a related theorem” (p.195)). From my incomplete reading, I did not gather a sense of novel perspective but rather of a topic that seemed too narrow for a whole book.