Archive for Andrei Kolmogorov

Amy in Randomland [book review]

Posted in Statistics on August 15, 2022 by xi'an

Amy’s Luck is a short book by David Hand that I recently received for review in CHANCE. David, whom I have known for quite a while now, is a professor at Imperial College London. This is not his first book, by far! But this may be the most unusual one, if not the shortest. Written as a pastiche of Alice’s Adventures in Wonderland, it tells of the adventures of a young girl named Amy in pursuit of luck, or at least of its meaning. It has about the same number of chapters as Carroll’s book and could easily be read on a leisurely boat trip from Oxford to Godstow. While nonsensical and playing on the imprecision of the English language, its probabilistic content is both correct and rational. References to the original Alice abound and I presumably missed a fair portion of them, having read Alice (in French) decades ago. The book also contains illustrations by the author, gathered into a charm bracelet printed on the cover, and a most helpful appendix where David points out the real-world stories behind those of Amy. The appendix is itself full of gems, like Kolmogorov being a train conductor in his youth. (I missed an entry about Galton’s quincunx, especially when his cousin Darwin is more than mentioned.) Or Asimov creating the millihelen to measure how much beauty was required to launch a ship. Overall, it is quite charming and definitely enjoyable, if presumably not accessible to the same audience as Alice’s. And unlikely to overtake Alice’s! But from “She could understand the idea that coins had heads”, to a Nightingale rose renamed after Miss Starling, to the permutation of Brown, Stein, and Bachelier into Braun, Stone, and a bachelor, David must have had fun writing it. As others will while reading it and trying to separate probabilistic sense from non-sense.

[Disclaimer about potential self-plagiarism: this post or an edited version will eventually appear in my Book Review section in CHANCE.]

conditioning on insufficient statistics in Bayesian regression

Posted in Books, Statistics, University life on October 23, 2021 by xi'an

“…the prior distribution, the loss function, and the likelihood or sampling density (…) a healthy skepticism encourages us to question each of them”

A paper by John Lewis, Steven MacEachern, and Yoonkyung Lee has recently appeared in Bayesian Analysis. It starts with the great motivation of a misspecified model requiring the use of a (thus necessarily) insufficient statistic, and moves to their central concern of simulating the posterior based on that statistic.

Model misspecification remains understudied from a Bayesian perspective and this paper is thus most welcome in addressing the issue. However, when reading through, one of my criticisms is with defining misspecification as equivalent to outliers in the sample. An outlier model is an easy case of misspecification, in the end, since the original model remains meaningful. (Why should there be “good” versus “bad” data?) Furthermore, adding a non-parametric component for the unspecified part of the data would sound like a “more Bayesian” alternative. On an unrelated note, I also idly wondered whether or not normalising flows could be used in this instance.

The problem of selecting a statistic T (Darjeeling, of course!) is not really discussed there, even though each choice of T gives a different meaning to what misspecified means, and suggests a comparison with Bayesian empirical likelihood.

“Acceptance rates of this [ABC] algorithm can be intolerably low”

Erm, this is not really the issue with ABC, is it?! Especially when the tolerance is induced by the simulations themselves.

When I reached the MCMC (Gibbs?) part of the paper, I first wondered at its relevance for the misspecification issues before realising it had become the focus of the paper. Now, simulating the observations conditional on a value of the summary statistic T is a true challenge. I remember for instance George Casella mentioning it in association with a Student’s t sample in the 1990s, and Kerrie and I making an unsuccessful attempt at it in the same period. Persi Diaconis has written several papers on the problem and I am thus surprised at the dearth of references here, like the rather recent Byrne and Girolami (2013), Florens and Simoni (2015), or Bornn et al. (2019). In the present case, the linear model assumed as the true model has the exceptional feature that it leads to a feasible transform of an unconstrained simulation into a simulation with fixed statistics, with no measure-theoretic worries, if not free from considerable efforts to establish that the operation is truly valid… And, while simulating (θ,y) makes perfect sense in an insufficient setting, the cost is then precisely the same as when running a vanilla ABC. Which brings us to the natural comparison with ABC. While taking ε=0 may sound optimal for being “exact”, it is not from an ABC perspective, since the tolerance should decrease at roughly the convergence rate of the (summary) statistic (Fearnhead and Liu, Frazier et al., 2018).
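Here is a minimal sketch of the feasible transform mentioned above for the Gaussian linear model, when the fixed statistics are the OLS estimate and the residual sum of squares. This is my own illustration under these assumptions, not the exact construction of Lewis et al.: the unconstrained draw has its residual part rescaled and its fitted part replaced.

```python
import numpy as np

def match_ols_stats(X, y_star, beta0, rss0):
    """Map an unconstrained draw y_star to a sample whose OLS estimate is beta0
    and whose residual sum of squares is rss0 (hypothetical helper, not the
    algorithm of Lewis et al.)."""
    Q, _ = np.linalg.qr(X)                       # orthonormal basis of span(X)
    resid = y_star - Q @ (Q.T @ y_star)          # part of y_star orthogonal to span(X)
    resid *= np.sqrt(rss0 / np.sum(resid**2))    # rescale residuals to the target RSS
    return X @ beta0 + resid                     # replace the fitted part

rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.standard_normal((n, p))
y_star = rng.standard_normal(n)                  # unconstrained simulation
beta0, rss0 = np.array([1.0, -0.5, 2.0]), 12.0   # fixed summary statistics T⁰
y_fix = match_ols_stats(X, y_star, beta0, rss0)
beta_hat = np.linalg.lstsq(X, y_fix, rcond=None)[0]
print(np.allclose(beta_hat, beta0),                          # True: OLS estimate equals beta0
      np.isclose(((y_fix - X @ beta_hat)**2).sum(), rss0))   # True: RSS equals rss0
```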

“[The Borel Paradox] shows that the concept of a conditional probability with regard to an isolated given hypothesis whose probability equals 0 is inadmissible.” A. Колмого́ров (1933)

As a side note for measure-theoretic purists, the derivation of the conditional of y given T(y)=T⁰ is arbitrary since the event has probability zero (ie, the conditioning set is of measure zero). See the Borel-Kolmogorov paradox. The computations in the paper are undoubtedly correct, but this is only one arbitrary choice of a transform (or conditioning σ-algebra).
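A textbook illustration of this arbitrariness (not taken from the paper): for (X,Y) iid standard Normal, conditioning on the null event {X=Y} through Z=Y−X=0 or through W=Y/X=1 leads to two different conditional laws for X,

X \mid Y-X=0 \sim \mathcal{N}(0,\tfrac{1}{2}) \qquad\text{versus}\qquad f_{X\mid Y/X=1}(x) \propto |x|\,e^{-x^2},

even though both describe “the same” measure-zero event; only the choice of conditioning σ-algebra (or transform) settles the answer.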

Bayesian sufficiency

Posted in Books, Kids, Statistics on February 12, 2021 by xi'an

“During the past seven decades, an astonishingly large amount of effort and ingenuity has gone into the search for reasonable answers to this question.” D. Basu

Induced by a vaguely related question on X validated, I re-read Basu’s great 1977 JASA paper on the elimination of nuisance parameters. Besides the limitations of competing definitions of conditional, partial, and marginal sufficiency for the parameter of interest, Basu discusses various notions of Bayesian (partial) sufficiency.

“After a long journey through a forest of confusing ideas and examples, we seem to have lost our way.” D. Basu

Basu starts with Kolmogorov’s idea (published during WW II) of requiring all marginal posteriors on the parameter of interest θ to depend on the data only through a statistic S(x). But having to hold for all priors cancels the notion, as the statistic must then be jointly sufficient for θ and σ, as shown by Hájek in the early 1960s. Following this attempt, Raiffa and Schlaifer introduced a more restricted class of priors, namely those under which the nuisance and interest parameters are a priori independent. In which case a conditional factorisation theorem is a sufficient (!) condition for this Q-sufficiency. But not a necessary one, as shown by the N(θ·σ, 1) counter-example (when σ=±1 and θ>0). [When the prior on σ is uniform, the absolute average is Q-sufficient, but is this a positive feature?] This choice of prior separation is somewhat perplexing in that it does not hold under reparameterisation.
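As a quick check of the counter-example (my own verification, taking σ uniform over {−1,+1} and a priori independent of θ), the marginal posterior of θ given a N(θ·σ,1) sample with average x̄ satisfies

\pi(\theta\mid\bar x) \propto \pi(\theta)\sum_{\sigma=\pm 1}\exp\{-n(\bar x-\theta\sigma)^2/2\} \propto \pi(\theta)\,e^{-n\theta^2/2}\cosh(n\theta\bar x),

which depends on the data only through |x̄|, while |x̄| is clearly not sufficient for the pair (θ,σ), the sign of x̄ being informative about σ.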

Basu ends up with three challenges, including the multinomial M(θ·σ,½(1-θ)·(1+σ),½(1+θ)·(1-σ)), with (n¹,n²,n³) as a minimal sufficient statistic. And the joint observation of an Exponential Exp(θ) translated by σ and of an Exponential Exp(σ) translated by -θ, where the prior on σ gets eliminated in the marginal on θ.
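For the latter exponential example, a short computation (mine, assuming an a priori independent prior π(θ)π(σ) on the two positive rates) explains the elimination: since x>σ and y>−θ, the likelihood θσ·exp{−θ(x−σ)−σ(y+θ)} reduces to θσ·exp{−θx−σy} on the set {σ<x, θ>−y}, the cross terms in θσ cancelling, so that

\pi(\theta\mid x,y) \propto \pi(\theta)\,\theta\,e^{-\theta x}\,\mathbb{I}\{\theta>-y\}\int_0^{x}\pi(\sigma)\,\sigma\,e^{-\sigma y}\,\mathrm{d}\sigma \propto \pi(\theta)\,\theta\,e^{-\theta x}\,\mathbb{I}\{\theta>-y\},

whatever the prior on σ.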

Monte Carlo Markov chains

Posted in Books, Statistics, University life on May 12, 2020 by xi'an

Darren Wraith pointed out to me this (currently free access) Springer book by Massimiliano Bonamente [whose family name means good spirit in Italian] for its use of the unusual Monte Carlo Markov chain rendering of MCMC. (Google Trends seems to restrict this usage to California!) This is a graduate text for physicists, but one could nonetheless expect more rigour in the treatment of the topics, particularly the Bayesian ones. Here is a pot-pourri of memorable quotes:

“Two major avenues are available for the assignment of probabilities. One is based on the repetition of the experiments a large number of times under the same conditions, and goes under the name of the frequentist or classical method. The other is based on a more theoretical knowledge of the experiment, but without the experimental requirement, and is referred to as the Bayesian approach.”

“The Bayesian probability is assigned based on a quantitative understanding of the nature of the experiment, and in accord with the Kolmogorov axioms. It is sometimes referred to as empirical probability, in recognition of the fact that sometimes the probability of an event is assigned based upon a practical knowledge of the experiment, although without the classical requirement of repeating the experiment for a large number of times. This method is named after the Rev. Thomas Bayes, who pioneered the development of the theory of probability.”

“The likelihood P(B/A) represents the probability of making the measurement B given that the model A is a correct description of the experiment.”

“…a uniform distribution is normally the logical assumption in the absence of other information.”

“The Gaussian distribution can be considered as a special case of the binomial, when the number of tries is sufficiently large.”

“This clearly does not mean that the Poisson distribution has no variance—in that case, it would not be a random variable!”

“The method of moments therefore returns unbiased estimates for the mean and variance of every distribution in the case of a large number of measurements.”

“The great advantage of the Gibbs sampler is the fact that the acceptance is 100 %, since there is no rejection of candidates for the Markov chain, unlike the case of the Metropolis–Hastings algorithm.”

Let me then point out (or just whine about!) the book using “statistical independence” for plain independence, the use of / rather than Jeffreys’ | for conditioning (and sometimes forgetting \ in some LaTeX formulas), the confusion between events and random variables, especially when computing the posterior distribution, between models and parameter values, the reliance on discrete probability for continuous settings, as in the Markov chain chapter, confusing density and probability, using Mendel’s pea data without mentioning the unlikely fit to the expected values (or, as put more subtly by Fisher (1936), “the data of most, if not all, of the experiments have been falsified so as to agree closely with Mendel’s expectations”), presenting Fisher’s and Anderson’s Iris data [a motive for rejection when George was JASA editor!] as “a new classic experiment”, mentioning Pearson but not Lee for the data in the 1903 Biometrika paper “On the laws of inheritance in man” (and woman!), and not accounting for the discrete nature of this data in the linear regression chapter, the three-page derivation of the Gaussian distribution from a Taylor expansion of the Binomial pmf obtained by differentiating in the integer argument, spending endless pages on deriving standard properties of classical distributions, this appalling mess of adding over the conditioning atoms with no normalisation in a Poisson experiment

P(X=4|\mu=0,1,2) = \sum_{\mu=0}^2 \frac{\mu^4}{4!}\exp\{-\mu\},

botching the proof of the CLT, which is treated before the Law of Large Numbers, restricting maximum likelihood estimation to the Gaussian and Poisson cases and muddling its meaning by discussing unbiasedness, confusing a drifted Poisson random variable with a drift on its parameter, as well as using the pmf of the Poisson to define an area under the curve (Fig. 5.2), sweeping the impropriety of a constant prior under the carpet, defining a null hypothesis as a range of values for a summary statistic, no mention of Bayesian perspectives in the hypothesis testing, model comparison, and regression chapters, having one-dimensional case chapters followed by two-dimensional case chapters, reducing model comparison to the use of the Kolmogorov-Smirnov test, processing bootstrap and jackknife in the Monte Carlo chapter without a mention of importance sampling, stating recurrence results without assuming irreducibility, motivating MCMC by the intractability of the evidence, resorting to the term “link” to designate the current value of a Markov chain, incorporating the need for a prior distribution in a terrible description of the Metropolis-Hastings algorithm, including a discrete proof for its stationarity, spending many pages on early 1990s MCMC convergence tests rather than discussing the adaptive scaling of proposal distributions, the inclusion of numerical tables [in a 2017 book], and turning Bayes (1763) into Bayes and Price (1763), or Student (1908) into Gosset (1908).
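For the record, with a uniform prior over the three atoms (my reading of what the computation intends), the properly normalised version would instead read

P(X=4\mid\mu\in\{0,1,2\}) = \frac{1}{3}\sum_{\mu=0}^2 \frac{\mu^4}{4!}\exp\{-\mu\},

each atom μ being weighted by its prior mass ⅓.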

[Usual disclaimer about potential self-plagiarism: this post or an edited version of it could possibly appear later in my Books Review section in CHANCE. Unlikely, though!]

Rao-Blackwellisation, a review in the making

Posted in Statistics on March 17, 2020 by xi'an

Recently, I have been contacted by a mainstream statistics journal to write a review of Rao-Blackwellisation techniques in computational statistics, in connection with an issue celebrating C.R. Rao’s 100th birthday. As many, many techniques can be interpreted as weak forms of Rao-Blackwellisation, e.g., all auxiliary variable approaches, I am clearly facing an embarrassment of riches and would thus welcome suggestions from Og’s readers on the major advances in Monte Carlo methods that can be connected with the Rao-Blackwell-Kolmogorov theorem. (On the personal and anecdotal side, I only met C.R. Rao once, in 1988, when he came for a seminar at Purdue University where I was spending the year.)
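As a toy reminder of the variance reduction at stake (my own illustration, built on a made-up Poisson-Gamma pair and unrelated to the forthcoming review), replacing h(X) by E[h(X)|Y] in the empirical average can only reduce the variance:

```python
import numpy as np

# Estimating E[X] = a when X | Y ~ Poisson(Y) and Y ~ Gamma(a, 1).
# Since E[X | Y] = Y, averaging Y instead of X is the Rao-Blackwellised
# estimator, with variance a/n instead of 2a/n for the plain average.
rng = np.random.default_rng(1)
a, n = 3.0, 100_000
Y = rng.gamma(shape=a, scale=1.0, size=n)
X = rng.poisson(Y)

print("plain Monte Carlo:", X.mean(), "±", (X.var(ddof=1) / n) ** 0.5)
print("Rao-Blackwellised:", Y.mean(), "±", (Y.var(ddof=1) / n) ** 0.5)
```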
