Archive for JASA

Monte Carlo Markov chains

Posted in Books, Statistics, University life on May 12, 2020 by xi'an

Darren Wraith pointed out this (currently free access) Springer book by Massimiliano Bonamente [whose family name means good spirit in Italian] to me for its use of the unusual Monte Carlo Markov chain rendering of MCMC. (Google Trends seems to restrict its use to California!) This is a graduate text for physicists, but one could nonetheless expect more rigour in the processing of the topics, particularly of the Bayesian topics. Here is a pot-pourri of memorable quotes:

“Two major avenues are available for the assignment of probabilities. One is based on the repetition of the experiments a large number of times under the same conditions, and goes under the name of the frequentist or classical method. The other is based on a more theoretical knowledge of the experiment, but without the experimental requirement, and is referred to as the Bayesian approach.”

“The Bayesian probability is assigned based on a quantitative understanding of the nature of the experiment, and in accord with the Kolmogorov axioms. It is sometimes referred to as empirical probability, in recognition of the fact that sometimes the probability of an event is assigned based upon a practical knowledge of the experiment, although without the classical requirement of repeating the experiment for a large number of times. This method is named after the Rev. Thomas Bayes, who pioneered the development of the theory of probability.”

“The likelihood P(B/A) represents the probability of making the measurement B given that the model A is a correct description of the experiment.”

“…a uniform distribution is normally the logical assumption in the absence of other information.”

“The Gaussian distribution can be considered as a special case of the binomial, when the number of tries is sufficiently large.”

“This clearly does not mean that the Poisson distribution has no variance—in that case, it would not be a random variable!”

“The method of moments therefore returns unbiased estimates for the mean and variance of every distribution in the case of a large number of measurements.”

“The great advantage of the Gibbs sampler is the fact that the acceptance is 100 %, since there is no rejection of candidates for the Markov chain, unlike the case of the Metropolis–Hastings algorithm.”

Let me then point out (or just whine about!) the book using “statistical independence” for plain independence, the use of / rather than Jeffreys’ | for conditioning (and sometimes forgetting \ in some LaTeX formulas), the confusion between events and random variables, esp. when computing the posterior distribution, between models and parameter values, the reliance on discrete probability for continuous settings, as in the Markov chain chapter, confusing density and probability, using Mendel’s pea data without mentioning the unlikely fit to the expected values (or, as put more subtly by Fisher (1936), “the data of most, if not all, of the experiments have been falsified so as to agree closely with Mendel’s expectations”), presenting Fisher’s and Anderson’s Iris data [a motive for rejection when George was JASA editor!] as “a new classic experiment”, mentioning Pearson but not Lee for the data in the 1903 Biometrika paper “On the laws of inheritance in man” (and woman!), and not accounting for the discrete nature of this data in the linear regression chapter, the three-page derivation of the Gaussian distribution from a Taylor expansion of the Binomial pmf obtained by differentiating in the integer argument, spending endless pages on deriving standard properties of classical distributions, this appalling mess of adding over the conditioning atoms with no normalisation in a Poisson experiment

P(X=4|\mu=0,1,2) = \sum_{\mu=0}^2 \frac{\mu^4}{4!}\exp\{-\mu\},

botching the proof of the CLT, which is treated before the Law of Large Numbers, restricting maximum likelihood estimation to the Gaussian and Poisson cases and muddling its meaning by discussing unbiasedness, confusing a drifted Poisson random variable with a drift on its parameter, as well as using the pmf of the Poisson to define an area under the curve (Fig. 5.2), sweeping the improperty of a constant prior under the carpet, defining a null hypothesis as a range of values for a summary statistic, no mention of Bayesian perspectives in the hypothesis testing, model comparison, and regression chapters, having one-dimensional case chapters followed by two-dimensional case chapters, reducing model comparison to the use of the Kolmogorov-Smirnov test, processing bootstrap and jackknife in the Monte Carlo chapter without a mention of importance sampling, stating recurrence results without assuming irreducibility, motivating MCMC by the intractability of the evidence, resorting to the term link to designate the current value of a Markov chain, incorporating the need for a prior distribution in a terrible description of the Metropolis-Hastings algorithm, including a discrete proof for its stationarity, spending many pages on early 1990’s MCMC convergence tests rather than discussing the adaptive scaling of proposal distributions, the inclusion of numerical tables [in a 2017 book] and turning Bayes (1763) into Bayes and Price (1763), or Student (1908) into Gosset (1908).
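As a quick numerical check on the Poisson formula quoted above, here is a minimal Python sketch comparing the plain sum over μ=0,1,2 with a weighted version. The uniform weights of 1/3 are an assumption made purely for illustration (the book states no weights), and the helper names `poisson_pmf` and `total_mass` are hypothetical.

```python
from math import exp, factorial

def poisson_pmf(k: int, mu: float) -> float:
    """P(X = k | mu) for a Poisson variable with mean mu."""
    return mu ** k * exp(-mu) / factorial(k)

mus = [0, 1, 2]

# The quoted expression: a plain sum of the pmf over the conditioning values.
unnormalised = sum(poisson_pmf(4, mu) for mu in mus)

# Assumption (not in the book): uniform weights 1/3 on the three values of mu.
weights = {mu: 1.0 / len(mus) for mu in mus}
marginal = sum(weights[mu] * poisson_pmf(4, mu) for mu in mus)

def total_mass(weighted: bool, kmax: int = 60) -> float:
    """Sum the (weighted or plain) expression over k = 0..kmax."""
    total = 0.0
    for k in range(kmax + 1):
        for mu in mus:
            w = weights[mu] if weighted else 1.0
            total += w * poisson_pmf(k, mu)
    return total

plain_total = total_mass(weighted=False)    # mass of the plain sum over all k
weighted_total = total_mass(weighted=True)  # mass of the weighted mixture

print(unnormalised, marginal, plain_total, weighted_total)
```

Summed over all k, the plain sum carries a total mass equal to the number of conditioning atoms rather than one, which is the missing normalisation.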

[Usual disclaimer about potential self-plagiarism: this post or an edited version of it could possibly appear later in my Books Review section in CHANCE. Unlikely, though!]

What the …?!

Posted in Books, Statistics on May 3, 2020 by xi'an


informed proposals for local MCMC in discrete spaces

Posted in Books, Kids, Statistics, University life on April 17, 2020 by xi'an

Last year Giacomo Zanella published a paper entitled informed proposals for local MCMC in discrete spaces in JASA. Which I had missed somehow and only discovered through another paper, and which we recently discussed at Paris-Dauphine with graduate students, marooned by COVID-19. Probability targets in discrete spaces are intrinsically hard[er] to simulate in my opinion, if only because there is no natural distance, hence no natural neighbourhood. A random walk proposal like the reference kernel in the paper is not directly calibrated. Without demarginalisation, there is no clear version of calculus either for implementing MALA or HMC. What indeed is HMC on a discrete space? If this requires “embedding the binary space in a continuous space”, it does not sound very enticing if the construct is context dependent.

“This would allow for more moves to be accepted and longer moves to be performed, thus improving the algorithm’s efficiency.”

An interesting aspect of the paper is that, for near atomic transition kernels K, informally for small σ’s, switching the proposal to Q produces a new stationary distribution, equal to the target times the normalising constant of Q, which is close to the actual target. Which incidentally reminded me of our vanilla Rao-Blackwellisation with Randal Douc. This however begets the worry that it may prove unwieldy in continuous cases since, except for Gaussian kernels, the proposal switch to Q may prove intractable and require further MCMC steps, in a form of infinite regress. Plus a musing that, were the original kernel K to be replaced with the new Q, another informed proposal transform could be applied to Q. Further infinite regress…
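To make the stationarity remark concrete, here is a minimal Python sketch on a three-bit space. It assumes a symmetric single-flip reference kernel K, an arbitrary made-up target π, and the square-root balancing function g(t)=√t; none of these choices is taken from the paper's experiments and the function name `informed_kernel` is hypothetical. It checks numerically that the informed proposal Q, proportional to g(π(y)/π(x))K(x,y), is reversible with respect to π(x)Z(x), where Z(x) is the normalising constant of Q at x.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
d = 3
states = list(itertools.product([0, 1], repeat=d))
index = {s: i for i, s in enumerate(states)}
n = len(states)

# Hypothetical target: arbitrary positive weights, normalised to sum to one.
pi = rng.random(n) + 0.1
pi /= pi.sum()

# Reference kernel K: flip one coordinate chosen uniformly (symmetric in x, y).
K = np.zeros((n, n))
for s in states:
    for j in range(d):
        t = list(s)
        t[j] = 1 - t[j]
        K[index[s], index[tuple(t)]] = 1.0 / d

def informed_kernel(g):
    """Return Q(x, y) = g(pi[y]/pi[x]) K(x, y) / Z(x) together with Z."""
    W = np.zeros((n, n))
    for x in range(n):
        for y in range(n):
            if K[x, y] > 0:
                W[x, y] = g(pi[y] / pi[x]) * K[x, y]
    Z = W.sum(axis=1)
    return W / Z[:, None], Z

# Square-root balancing function: satisfies g(t) = t * g(1/t).
Q, Z = informed_kernel(np.sqrt)

# Probability flow under the candidate stationary measure pi(x) Z(x).
flow = (pi * Z)[:, None] * Q
detailed_balance_gap = np.abs(flow - flow.T).max()
print(detailed_balance_gap)
```

The flow matrix is symmetric up to rounding, so π(x)Z(x), once normalised, is stationary for Q in this toy setting.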

“[The optimality of the Metropolis-Hastings choice of acceptance probability] does not translate to the context of balancing functions.”

The paper indeed exhibits a setting that is rehabilitating Barker’s (1965) version of the acceptance probability, but I was never very much convinced there was a significant difference in using one or the other. During our virtual (?) discussion, we also wondered at the adaptive abilities of the approach, e.g., selecting among a finite family of g’s (according to which criterion?) or parameterising g towards an optimal choice of its parameter. And at the capacity for Rao-Blackwellisation, since the proposal has to consider the entire set of neighbours prior to moving to a likely one.
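For reference, the two acceptance probabilities can be compared directly as functions of the Metropolis-Hastings ratio t. The short Python sketch below (function names are hypothetical) checks on a grid that both satisfy the balance identity a(t)=t·a(1/t), and that Barker's choice is never larger than the Metropolis-Hastings one and never smaller than half of it.

```python
import numpy as np

def accept_mh(t):
    """Metropolis-Hastings acceptance probability min(1, t)."""
    return np.minimum(1.0, t)

def accept_barker(t):
    """Barker's acceptance probability t / (1 + t)."""
    return t / (1.0 + t)

# Grid of ratios pi(y) q(y, x) / (pi(x) q(x, y)) spanning several orders of magnitude.
ratios = np.logspace(-3, 3, 61)

mh = accept_mh(ratios)
barker = accept_barker(ratios)

# Balance identity a(t) = t * a(1/t), which ensures detailed balance.
mh_balanced = np.allclose(mh, ratios * accept_mh(1.0 / ratios))
barker_balanced = np.allclose(barker, ratios * accept_barker(1.0 / ratios))

# Barker <= MH <= 2 * Barker pointwise.
sandwich = bool(np.all(barker <= mh) and np.all(mh <= 2.0 * barker + 1e-15))
print(mh_balanced, barker_balanced, sandwich)
```

The two curves differ most at t=1, where Barker accepts with probability 1/2 against 1 for Metropolis-Hastings.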

Colin Blyth (1922-2019)

Posted in Books, pictures, Statistics, University life on March 19, 2020 by xi'an

While reading the IMS Bulletin (of March 2020), I found out that Canadian statistician Colin Blyth had died last summer. While we had never met in person, I remember his very distinctive and elegant handwriting in a few letters he sent me, including the above I have kept (along with a handwritten letter from Lucien Le Cam!). It contains suggestions about revising our Is Pitman nearness a reasonable criterion?, written with Gene Hwang and William Strawderman and which took three years to publish as it was deemed somewhat controversial. It actually appeared in JASA with discussions from Malay Ghosh, John Keating and Pranab K Sen, Shyamal Das Peddada, C. R. Rao, George Casella and Martin T. Wells, and Colin R. Blyth (with a much stronger wording than in the above letter!, like “What can be said but “It isn’t I, it’s you that are crazy?”). While I had used some of his admissibility results, including the admissibility of the Normal sample average in dimension one, e.g. in my book, I had not realised at the time that Blyth was (a) the first student of Erich Lehmann, (b) the originator of [the name] Simpson’s paradox, (c) the scribe for Lehmann’s notes that would eventually lead to Testing Statistical Hypotheses and Theory of Point Estimation, later revised with George Casella. And (d) a keen bagpipe player and scholar.

non-reversibility in discrete spaces

Posted in Books, Statistics, University life on January 3, 2020 by xi'an

Following a recent JASA paper by Giacomo Zanella (which I have not yet read but is discussed on this blog), Sam Power and Jacob Goldman have recently arXived a paper on Accelerated sampling on discrete spaces with non-reversible Markov processes, where they use continuous-time, non-reversible algorithms à la PDMP, even though differential equations do not exist on discrete spaces. More specifically, they devise discrete versions of the coordinate sampler and of the Zig-Zag sampler, using Markov jump processes instead of differential equations, with detailed balance on the jump rate rather than the Markov kernel. A use of jump processes originating at least from Peskun (1973) and connected with MCMC algorithms in Matthew Stephens’ 1999 PhD thesis. A neat thing about discrete settings is that the jump process can be implemented with no discretisation! However, as we noticed when working on birth-and-death processes with Olivier Cappé and Tobias Rydén, there is a potential for disastrous implementation if an infinite sequence of instantaneous moves (out of zero probability states) is proposed.
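To illustrate the remark that a jump process needs no discretisation, here is a minimal Python sketch of a plain (reversible) Markov jump process on a toy cycle of six states. This is not the authors' Tabu, coordinate or Zig-Zag sampler: the target, the neighbourhood structure, the square-root rate function and the name `simulate_jump_process` are all assumptions made for illustration. Jump rates λ(x,y)=g(π(y)/π(x)) with g(t)=√t satisfy detailed balance on the rates, π(x)λ(x,y)=π(y)λ(y,x), and holding times are drawn exactly from exponential distributions.

```python
import numpy as np

def simulate_jump_process(pi, neighbours, g, x0, n_jumps, rng):
    """Simulate a Markov jump process with rates g(pi[y] / pi[x]).

    Returns the fraction of time spent in each state.
    """
    occupation = np.zeros(len(pi))
    x = x0
    for _ in range(n_jumps):
        nbrs = neighbours[x]
        rates = np.array([g(pi[y] / pi[x]) for y in nbrs])
        total = rates.sum()
        # Exact exponential holding time: no time discretisation involved.
        occupation[x] += rng.exponential(1.0 / total)
        # Next state chosen with probability proportional to its rate.
        x = nbrs[rng.choice(len(nbrs), p=rates / total)]
    return occupation / occupation.sum()

# Hypothetical toy target on a cycle of six states.
pi = np.array([1.0, 2.0, 3.0, 4.0, 3.0, 2.0])
pi /= pi.sum()
neighbours = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}

rng = np.random.default_rng(1)
estimate = simulate_jump_process(pi, neighbours, np.sqrt, 0, 20000, rng)
print(np.round(estimate, 3))
print(np.round(pi, 3))
```

The time-weighted occupation frequencies should approach π as the number of jumps grows, since the states are weighted by their holding times rather than by visit counts.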

The authors make the further assumption(s) that the discrete space is endowed with a graphical structure with a group G acting upon this graph, with an involution keeping the target (or a completion of the original target) invariant. In this framework, reversibility amounts to repeatedly using (group) generators þ with a low order (as in Bayesian variable selection, binary spin systems, where þ.þ=id, and other permutation problems), since they bring the chain back to its starting point. Their first sampler is called a Tabu sampler for avoiding such behaviour, forcing the next step to use other generators þ in the generator set Þ thanks to a binary auxiliary variable that partitions Þ into forward vs backward moves. For high order generators, the discrete coordinate and Zig-Zag samplers are instead repeatedly using the same generator (although it is unclear to me why this is beneficial, given that neither graph nor generator is necessarily linked with the target). With the coordinate sampler being again much cheaper since it only looks at one direction in the generator group.

The paper contains a range of comparisons with (only) Zanella’s sampler, some presenting heavy gains in terms of ESS. Including one on hundreds of sensors in a football stadium. As I am not particularly familiar with these examples, except for the Bayesian variable selection one, I found it rather hard to determine whether or not the compared samplers were indeed exploring the entirety of the (highly complex and high-dimensional) target. The collection of examples is however quite rich and supports the use of such non-reversible schemes. It may also be that the discrete nature of the target could facilitate the theoretical study of their convergence properties.