## Monte Carlo Markov chains

Posted in Books, Statistics, University life on May 12, 2020 by xi'an

Darren Wraith pointed out this (currently free access) Springer book by Massimiliano Bonamente [whose family name means good spirit in Italian] to me for its use of the unusual Monte Carlo Markov chain rendering of MCMC. (Google Trends seems to restrict its use to California!) This is a graduate text for physicists, but one could nonetheless expect more rigour in the processing of the topics. Particularly of the Bayesian topics. Here is a pot-pourri of memorable quotes:

“Two major avenues are available for the assignment of probabilities. One is based on the repetition of the experiments a large number of times under the same conditions, and goes under the name of the frequentist or classical method. The other is based on a more theoretical knowledge of the experiment, but without the experimental requirement, and is referred to as the Bayesian approach.”

“The Bayesian probability is assigned based on a quantitative understanding of the nature of the experiment, and in accord with the Kolmogorov axioms. It is sometimes referred to as empirical probability, in recognition of the fact that sometimes the probability of an event is assigned based upon a practical knowledge of the experiment, although without the classical requirement of repeating the experiment for a large number of times. This method is named after the Rev. Thomas Bayes, who pioneered the development of the theory of probability.”

“The likelihood P(B/A) represents the probability of making the measurement B given that the model A is a correct description of the experiment.”

“…a uniform distribution is normally the logical assumption in the absence of other information.”

“The Gaussian distribution can be considered as a special case of the binomial, when the number of tries is sufficiently large.”

“This clearly does not mean that the Poisson distribution has no variance—in that case, it would not be a random variable!”

“The method of moments therefore returns unbiased estimates for the mean and variance of every distribution in the case of a large number of measurements.”

“The great advantage of the Gibbs sampler is the fact that the acceptance is 100 %, since there is no rejection of candidates for the Markov chain, unlike the case of the Metropolis–Hastings algorithm.”

Let me then point out (or just whine about!) the book using “statistical independence” for plain independence, the use of / rather than Jeffreys’ | for conditioning (and sometimes forgetting \ in some LaTeX formulas), the confusion between events and random variables, esp. when computing the posterior distribution, between models and parameter values, the reliance on discrete probability for continuous settings, as in the Markov chain chapter, confusing density and probability, using Mendel’s pea data without mentioning the unlikely fit to the expected values (or, as put more subtly by Fisher (1936), “the data of most, if not all, of the experiments have been falsified so as to agree closely with Mendel’s expectations”), presenting Fisher’s and Anderson’s Iris data [a motive for rejection when George was JASA editor!] as “a new classic experiment”, mentioning Pearson but not Lee for the data in the 1903 Biometrika paper “On the laws of inheritance in man” (and woman!), and not accounting for the discrete nature of this data in the linear regression chapter, the three page derivation of the Gaussian distribution from a Taylor expansion of the Binomial pmf obtained by differentiating in the integer argument, spending endless pages on deriving standard properties of classical distributions, this appalling mess of adding over the conditioning atoms with no normalisation in a Poisson experiment

$P(X=4|\mu=0,1,2) = \sum_{\mu=0}^2 \frac{\mu^4}{4!}\exp\{-\mu\}$,

botching the proof of the CLT, which is treated before the Law of Large Numbers, restricting maximum likelihood estimation to the Gaussian and Poisson cases and muddling its meaning by discussing unbiasedness, confusing a drifted Poisson random variable with a drift on its parameter, as well as using the pmf of the Poisson to define an area under the curve (Fig. 5.2), sweeping the improperness of a constant prior under the carpet, defining a null hypothesis as a range of values for a summary statistic, no mention of Bayesian perspectives in the hypothesis testing, model comparison, and regression chapters, having one-dimensional case chapters followed by two-dimensional case chapters, reducing model comparison to the use of the Kolmogorov-Smirnov test, processing bootstrap and jackknife in the Monte Carlo chapter without a mention of importance sampling, stating recurrence results without assuming irreducibility, motivating MCMC by the intractability of the evidence, resorting to the term link to designate the current value of a Markov chain, incorporating the need for a prior distribution in a terrible description of the Metropolis-Hastings algorithm, including a discrete proof for its stationarity, spending many pages on early 1990s MCMC convergence tests rather than discussing the adaptive scaling of proposal distributions, the inclusion of numerical tables [in a 2017 book] and turning Bayes (1763) into Bayes and Price (1763), or Student (1908) into Gosset (1908).

[Usual disclaimer about potential self-plagiarism: this post or an edited version of it could possibly appear later in my Books Review section in CHANCE. Unlikely, though!]

## Markov Chains [not a book review]

Posted in Books, pictures, Statistics, University life on January 14, 2019 by xi'an

As Randal Douc and Éric Moulines are both very close friends and two authors of this book on Markov chains, I cannot engage in a regular book review! Judging from the table of contents, the coverage is not too dissimilar to the now classic Markov Chains and Stochastic Stability book by Sean Meyn and the late Richard Tweedie (1993), called the Bible of Markov chains by Peter Glynn, with more emphasis on convergence matters and a more mathematical perspective. The 757-page book also includes a massive appendix on maths and probability background. As indicated in the preface, “the reason [the authors] thought it would be useful to write a new book is to survey some of the developments made during the 25 years that have elapsed since the publication of Meyn and Tweedie (1993b).” Connecting with the theoretical developments brought by MCMC methods. Like subgeometric rates of convergence to stationarity, sample paths, limit theorems, and concentration inequalities. The book also reflects on the numerous contributions of the authors to the field. Hence a perfect candidate for teaching Markov chains to mathematically well-prepared graduate audiences. Congrats to the authors!

## Gibbs for incompatible kids

Posted in Books, Statistics, University life on September 27, 2018 by xi'an

In continuation of my earlier post on Bayesian GANs, which resort to strongly incompatible conditionals, I read a 2015 paper of Chen and Ip that I had missed. (Published in the Journal of Statistical Computation and Simulation, which I first confused with JCGS and which I do not know at all. Actually, when looking at its editorial board, I recognised only one name.) But the study therein is quite disappointing and not helping, as it considers Markov chains on finite state spaces, meaning that the transition distributions are matrices, meaning also that convergence is ensured if these matrices have no null probability term. And while the paper is motivated by realistic situations where incompatible conditionals can reasonably appear, it only produces illustrations on two- and three-state Markov chains. Not that helpful, in the end… The game is still afoot!
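To make the finite-state remark concrete, here is a minimal Python sketch, not taken from the Chen and Ip paper. The two conditional matrices are hypothetical numbers chosen so that their odds ratios differ (36 versus 3.5), which rules out any joint distribution having both as conditionals. All entries being positive, the composed kernel of the x-chain is a positive stochastic matrix and power iteration still converges to a unique stationary distribution. The helper names `compose` and `stationary` are illustrative.

```python
def compose(A, B):
    """Kernel of the x-chain: K[x][x2] = sum over y of p(y|x) * p(x2|y)."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def stationary(K, n_iter=500):
    """Power iteration pi <- pi K, started from the uniform distribution."""
    n = len(K)
    pi = [1.0 / n] * n
    for _ in range(n_iter):
        pi = [sum(pi[i] * K[i][j] for i in range(n)) for j in range(n)]
    return pi


# Hypothetical (incompatible) conditionals on {0,1} x {0,1}
p_y_given_x = [[0.9, 0.1], [0.2, 0.8]]  # rows indexed by x, odds ratio 36
p_x_given_y = [[0.7, 0.3], [0.4, 0.6]]  # rows indexed by y, odds ratio 3.5

K = compose(p_y_given_x, p_x_given_y)   # one full Gibbs sweep, seen from x
pi_x = stationary(K)
print("x-chain kernel:", K)
print("stationary distribution of x:", pi_x)
```

The limit is a perfectly well-defined distribution on x, but it is a by-product of the scan rather than the marginal of a joint with these conditionals, which is the reason the two- and three-state illustrations say little about the general problem.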

## Gibbs for kidds

Posted in Books, Kids, Statistics, University life on February 12, 2018 by xi'an

A chance (?) question on X validated brought me to re-read Gibbs for Kids, 25 years after it was written (by my close friends George and Ed). The originator of the question had difficulties with the implementation, apparently missing the cyclic pattern of the sampler, as in equations (2.3) and (2.4), and with the convergence, which is only processed for a finite support in the American Statistician paper. The paper [which did not appear in American Statistician under this title!, but inspired an animal breeder, Dan Gianola, to write a “Gibbs for pigs” presentation in 1993 at the 44th Annual Meeting of the European Association for Animal Production, Aarhus, Denmark!!!] most appropriately only contains toy examples since those can be processed and compared to known stationary measures. This is for instance the case for the auto-exponential model

$f(x,y) \propto \exp(-xy)$

which is only defined as a probability density for a compact support. (The paper does not identify the model as a special case of the auto-exponential model, which apparently made the originator of the model, Julian Besag in 1974, unhappy, as George and I found out when visiting Bath, where Julian was spending the final year of his life, many years later.) I use the limiting case all the time in class to point out that a Gibbs sampler can be devised and operate without a stationary probability distribution. However, being picky!, I would like to point out that, contrary to a comment made in the paper, the Gibbs sampler does not “fail” but on the contrary still “converges” in this case, in the sense that a conditional ergodic theorem applies, i.e., the ratio of the frequencies of visits to two sets A and B with finite measure does converge to the ratio of these measures. For instance, running the Gibbs sampler 10⁶ steps and checking for the relative frequencies of x’s in (1,2) and (1,3) gives 0.685, versus log(2)/log(3)=0.63, since 1/x is the stationary measure. One important and influential feature of the paper is to stress that proper conditionals do not imply proper joints. George would work much further on that topic, in particular with his PhD student at the time, my friend Jim Hobert.
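The experiment described above can be sketched in Python along the following lines. This is an illustration under assumptions that are not in the post: the function name `gibbs_visit_ratio`, the starting value x = 1, the seed, and the choice of tracking log x rather than x. Both conditionals are exponential (y given x has rate x, and x given y has rate y), so log x performs a random walk that can wander far enough over 10⁶ steps to overflow or underflow floating point if x itself were stored.

```python
import math
import random


def gibbs_visit_ratio(n_steps=10**6, seed=0):
    """Two-stage Gibbs sampler for f(x, y) ∝ exp(-x*y) on (0, ∞)²,
    run on the log scale. Returns (visits of x to (1,2),
    visits of x to (1,3), ratio of the two counts)."""
    rng = random.Random(seed)

    def log_exp1():
        # log of an Exp(1) draw, redrawing in the negligible event of an exact 0
        e = rng.expovariate(1.0)
        while e == 0.0:
            e = rng.expovariate(1.0)
        return math.log(e)

    log_x = 0.0  # arbitrary start at x = 1
    log2, log3 = math.log(2.0), math.log(3.0)
    n12 = n13 = 0
    for _ in range(n_steps):
        # y | x ~ Exp(rate x), i.e. y = E / x with E ~ Exp(1)
        log_y = log_exp1() - log_x
        # x | y ~ Exp(rate y), i.e. x = E' / y with E' ~ Exp(1)
        log_x = log_exp1() - log_y
        if 0.0 < log_x < log3:
            n13 += 1
            if log_x < log2:
                n12 += 1
    ratio = n12 / n13 if n13 else float("nan")
    return n12, n13, ratio


if __name__ == "__main__":
    n12, n13, ratio = gibbs_visit_ratio()
    print(f"visits to (1,2): {n12}, to (1,3): {n13}, ratio: {ratio:.3f}")
    print(f"log(2)/log(3) = {math.log(2) / math.log(3):.3f}")
```

Since the chain has no stationary probability distribution, the number of visits to any bounded set grows much more slowly than the number of iterations, so the ratio fluctuates noticeably from one seed to the next and approaches log(2)/log(3) only slowly.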

With regard to the convergence issue, Gibbs for Kids points to Schervish and Carlin (1990), which came quite early considering that Gelfand and Smith published their initial paper the very same year, but which also adopts a functional approach to convergence, along the paper’s fixed point perspective, somehow complicating the matter. Later papers by Tierney (1994), Besag (1995), and Mengersen and Tweedie (1996) considerably simplified the answer, which is that irreducibility is a necessary and sufficient condition for convergence. (Incidentally, the reference list includes a technical report of mine on latent variable model MCMC implementation that never got published.)

## lemma 7.3

Posted in Statistics on November 14, 2012 by xi'an

As Xiao-Li Meng agreed to review our 2004 book Monte Carlo Statistical Methods (and I am quite grateful he managed to fit this review in an already overflowing deanesque schedule!) as part of a special book review issue of CHANCE honouring the memory of George thru his books (thanks to Sam Behseta for suggesting this!), he sent me the following email about one of our proofs, demonstrating how much effort he had put into this review:

I however have a question about the proof of Lemma 7.3
on page 273. After the expression of
E[h(x^(1)|x_0], the proof stated "and substitute
Eh(x) for h(x_1)".  I cannot think of any
justification for this substitution, given the whole
purpose is to show h(x) is a constant.

I put it on hold for a while and only looked at it on the (long) flight to Chicago. Lemma 7.3 in Monte Carlo Statistical Methods is the result that the Metropolis-Hastings algorithm is Harris recurrent (and not only recurrent). The proof is based on the characterisation of Harris recurrence as having only constants for harmonic functions, i.e. those satisfying the identity

$h(x) = \mathbb{E}[h(X_t)|X_{t-1}=x]$

The chain being recurrent, the above implies that harmonic functions are almost everywhere constant and the proof steps from almost everywhere to everywhere. The substitution above (and I also stumbled upon that very subtlety when re-reading the proof in my plane seat!) is valid because it occurs within an integral: despite sounding like using the result to prove the result, the argument is thus valid! Needless to say, we did not invent this (elegant) proof but took it from one of the early works on the theory of Metropolis-Hastings algorithms, presumably Luke Tierney’s foundational Annals paper that we should have quoted…
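For readers who want the step spelled out, here is a sketch of the argument. The notation is an assumption and may not match the book's: q is the proposal density, ρ the Metropolis-Hastings acceptance probability, r(x_0) the overall probability of accepting a move from x_0, and f the stationary density. Conditioning on whether the proposal is accepted gives

$\mathbb{E}[h(X^{(1)})|x_0] = \int \rho(x_0,x_1)\,q(x_1|x_0)\,h(x_1)\,\text{d}x_1 + \{1-r(x_0)\}\,h(x_0), \qquad r(x_0)=\int \rho(x_0,x_1)\,q(x_1|x_0)\,\text{d}x_1$

Since ρ(x_0,x_1) vanishes wherever f(x_1) does, the accepted part of the kernel is absolutely continuous with respect to f, so changing h on a set of f-measure zero does not change the integral. Replacing h(x_1) with its almost-everywhere constant value inside the integral, and using harmonicity on the left-hand side, leads to

$h(x_0) = \mathbb{E}_f[h]\,r(x_0) + \{1-r(x_0)\}\,h(x_0) \quad\text{i.e.}\quad \{h(x_0)-\mathbb{E}_f[h]\}\,r(x_0) = 0$

and, as irreducibility forces r(x_0) > 0 at every x_0, h(x_0) equals the constant for every x_0 rather than almost every x_0.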

As pointed out by Xiao-Li, the proof is also confusing for the use of two notations for the expectation (one of which is indexed by f and the other corresponding to the Markov transition) and for the change in the meaning of f, now the stationary density, when compared with Theorem 6.80.