## StanCon 2023 [20-23 June 2023]

Posted in Statistics with tags , , , , , , on April 8, 2023 by xi'an

## Bayes Rules! [book review]

Posted in Books, Kids, Mountains, pictures, R, Running, Statistics, University life with tags , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , , on July 5, 2022 by xi'an

Bayes Rules! is a new introductory textbook on Applied Bayesian Model(l)ing, written by Alicia Johnson (Macalester College), Miles Ott (Johnson & Johnson), and Mine Dogucu (University of California Irvine). Textbook sent to me by CRC Press for review. It is available (free) online as a website and has a github site, as well as a bayesrule R package. (Which reminds me that both our own book R packages, bayess and mcsm, have gone obsolete on CRAN! And that I should find time to figure out the issue for an upgrading…)

As far as I can tell [from abroad and from only teaching students with a math background], Bayes Rules! seems to be catering to early (US) undergraduate students with very little exposure to mathematical statistics or probability, as it introduces basic probability notions like pmf, joint distribution, and Bayes’ theorem (as well as Greek letters!) and shies away from integration or algebra (a covariance matrix occurs on page 437 with a lot . For instance, the Normal-Normal conjugacy derivation is considered a “mouthful” (page 113). The exposition is somewhat stretched along the 500⁺ pages as a result, imho, which is presumably a feature shared with most textbooks at this level, and, accordingly, the exercises and quizzes are more about intuition and reproducing the contents of the chapter than technical. In fact, I did not spot there a mention of sufficiency, consistency, posterior concentration (almost made on page 113), improper priors, ergodicity, irreducibility, &tc., while other notions are not precisely defined, like ESS, weakly informative (page 234) or vague priors (page 77), prior information—which makes the negative answer to the quiz “All priors are informative”  (page 90) rather confusing—, R-hat, density plot, scaled likelihood, and more.

As an alternative to “technical derivations” Bayes Rules! centres on intuition and simulation (yay!) via its bayesrule R package. Itself relying on rstan. Learning from example (as R code is always provided), the book proceeds through conjugate priors, MCMC (Metropolis-Hasting) methods, regression models, and hierarchical regression models. Quite impressive given the limited prerequisites set by the authors. (I appreciated the representations of the prior-likelihood-posterior, especially in the sequential case.)

Regarding the “hot tip” (page 108) that the posterior mean always stands between the prior mean and the data mean, this should be made conditional on a conjugate setting and a mean parameterisation. Defining MCMC as a method that produces a sequence of realisations that are not from the target makes a point, except of course that there are settings where the realisations are from the target, for instance after a renewal event. Tuning MCMC should remain a partial mystery to readers after reading Chapter 7 as the Goldilocks principle is quite vague. Similarly, the derivation of the hyperparameters in a novel setting (not covered by the book) should prove a challenge, even though the readers are encouraged to “go forth and do some Bayes things” (page 509).

While Bayes factors are supported for some hypothesis testing (with no point null), model comparison follows more exploratory methods like X validation and expected log-predictive comparison.

The examples and exercises are diverse (if mostly US centric), modern (including cultural references that completely escape me), and often reflect on the authors’ societal concerns. In particular, their concern about a fair use of the inferred models is preminent, even though a quantitative assessment of the degree of fairness would require a much more advanced perspective than the book allows… (In that respect, Exercise 18.2 and the following ones are about book banning (in the US). Given the progressive tone of the book, and the recent ban of math textbooks in the US, I wonder if some conservative boards would consider banning it!) Concerning the Himalaya submitting running example (Chapters 18 & 19), where the probability to summit is conditional on the age of the climber and the use of additional oxygen, I am somewhat surprised that the altitude of the targeted peak is not included as a covariate. For instance, Ama Dablam (6848 m) is compared with Annapurna I (8091 m), which has the highest fatality-to-summit ratio (38%) of all. This should matter more than age: the Aosta guide Abele Blanc climbed Annapurna without oxygen at age 57! More to the point, the (practical) detailed examples do not bring unexpected conclusions, as for instance the fact that runners [thrice alas!] tend to slow down with age.

A geographical comment: Uluru (page 267) is not a city!, but an impressive sandstone monolith in the heart of Australia, a 5 hours drive away from Alice Springs. And historical mentions: Alan Turing (page 10) and the team at Bletchley Park indeed used Bayes factors (and sequential analysis) in cracking the Enigma, but this remained classified information for quite a while. Arianna Rosenbluth (page 10, but missing on page 165) was indeed a major contributor to Metropolis et al.  (1953, not cited), but would not qualify as a Bayesian statistician as the goal of their algorithm was a characterisation of the Boltzman (or Gibbs) distribution, not statistical inference. And David Blackwell’s (page 10) Basic Statistics is possibly the earliest instance of an introductory Bayesian and decision-theory textbook, but it never mentions Bayes or Bayesianism.

[Disclaimer about potential self-plagiarism: this post or an edited version will eventually appear in my Book Review section in CHANCE.]

## efficiency of normalising over discrete parameters

Posted in Statistics with tags , , , , , , , , , on May 1, 2022 by xi'an

Yesterday, I noticed a new arXival entitled Investigating the efficiency of marginalising over discrete parameters in Bayesian computations written by Wen Wang and coauthors. The paper is actually comparing the simulation of a Gibbs sampler with an Hamiltonian Monte Carlo approach on Gaussian mixtures, when including and excluding latent variables, respectively. The authors missed the opposite marginalisation when the parameters are integrated.

While marginalisation requires substantial mathematical effort, folk wisdom in the Stan community suggests that fitting models with marginalisation is more efficient than using Gibbs sampling.

The comparison is purely experimental, though, which means it depends on the simulated data, the sample size, the prior selection, and of course the chosen algorithms. It also involves the [mostly] automated [off-the-shelf] choices made in the adopted software, JAGS and Stan. The outcome is only evaluated through ESS and the (old) R statistic. Which all depend on the parameterisation. But evacuates the label switching problem by imposing an ordering on the Gaussian means, which may have a different impact on marginalised and unmarginalised models. All in all, there is not much one can conclude about this experiment since the parameter values beyond the simulated data seem to impact the performances much more than the type of algorithm one implements.

## identifying mixtures

Posted in Books, pictures, Statistics with tags , , , , , , on February 27, 2022 by xi'an

I had not read this 2017 discussion of Bayesian mixture estimation by Michael Betancourt before I found it mentioned in a recent paper. Where he re-explores the issue of identifiability and label switching in finite mixture models. Calling somewhat abusively degenerate mixtures where all components share the same family, e.g., mixtures of Gaussians. Illustrated by Stan code and output. This is rather traditional material, in that the non-identifiability of mixture components has been discussed in many papers and at least as many solutions proposed to overcome the difficulties of exploring the posterior distribution. Including our 2000 JASA paper with Gilles Celeux and Merrilee Hurn. With my favourite approach being the label-free representations as a point process in the parameter space (following an idea of Peter Green) or as a collection of clusters in the latent variable space. I am much less convinced by ordering constraints: while they formally differentiate and therefore identify the individual components of a mixture, they partition the parameter space with no regard towards the geometry of the posterior distribution. With in turn potential consequences on MCMC explorations of this fragmented surface that creates barriers for simulated Markov chains. Plus further difficulties with inferior but attracting modes in identifiable situations.

## prior elicitation

Posted in Books, Kids, Statistics, University life with tags , , , , , , , , , , , , , on January 13, 2022 by xi'an

“We believe that an elicitation method should support elicitation both in the parameter and observable space, should be model-agnostic, and should be sample-efficient since human effort is costly.”

Petrus Mikkola et al. arXived a long paper on prior elicitation addressing the (most relevant) question: Why are we not widely use prior elicitation? With a massive bibliography that could be (partly) commented (and corrected as some references are incomplete, as eg my book chapter on priors!). I think the paper would make a terrific discussion paper.

The absence of a general procedure for prior elicitation is indeed hindering the adoption of Bayesian methods outside our core community and is thus eventually detrimental to their wider development. It also carries the dangers of misled or misleading prior choices. The authors put forward the absence of “software that integrates well with the current probabilistic programming tools used for other parts of the modelling workflow.” This requires setting principles that avoid “just-press-key” solutions. (Aside: This reminds me of my very first prospective PhD student, who was then working in a startup [although the name was not yet in use in the early 1990’s!] and had build such a software in a discretised, low dimension, conjugate prior, environment by returning a form of decision-theoretic impact of the chosen hyperparameters. He alas aborted his PhD attempt due to the short-term pressing matters in the under-staffed company…)

“We inspect prior elicitation from the perspectives of (1) properties of the prior distribution itself, (2) the model family and the prior elicitation method’s dependence on it, (3) the underlying elicitation space, (4) how the method interprets the information provided by the expert, (5) computation, (6) the form and quantity of interaction with the expert(s), and (7) the assumed capability of the expert (…)”

Prior elicitation is indeed a delicate balance between incorporating expert opinion(s) and avoiding over-standardisation. In my limited experience, experts tend to be over-confident about their own opinion and unwilling to attach uncertainty to their assessments. Even when being inconsistent. When several experts are involved (as, very briefly, in Section 3.6), building a common prior quickly becomes a challenge, esp. if their interests (or utility functions) diverge. As illustrated in the case of the whaling commission analysed by Adrian Raftery in the late 1990’s. (The above quote involves a single expert.) Actually, I dislike the term expert altogether, as it comes without any grading of the reliability of the person.To hit (!) at an early statement in the paper (p.5), should the prior elicitation always depend on the (sampling) model, as experts may ignore or misapprehend the model? The posterior already accounts for the likelihood and the parameter may pre-exist wrt the model, as eg cosmological constants or vaccine efficiency… In a sense, the model should be involved as little as possible in the elicitation as the expert could confuse her beliefs about the parameter with those about the accuracy of the model. (I realise this is not necessarily a mainstream position as illustrated by this paper by Andrew and friends!)

And isn’t the first stumbling block the inability of most to represent one’s prior knowledge in probabilistic terms? Innumeracy is a shared shortcoming in the general population (and since everyone’s an expert!), as repeatedly demonstrated since the start of the Covid-19 pandemic. (See also the above point about inconsistency. Accounting for such inconsistencies in a Bayesian way is a natural answer, albeit requiring the degree of expertise and reliability to be tested.)

Is prior elicitation feasible beyond a few dimensions? Even when using the constrictive tool of copulas one hits a wall after a few dimensions, assuming the expert is willing to set a prior correlation matrix.  Most of the methods described in Section 3.1 only apply to textbook examples. In their third dimension (!), the authors mention neural network parameters but later fail to cover this type of issue. (This was the example I had in mind indeed.) And they move from parameter space to observable space. Distinguishing predictive elicitation from observational elicitation, the former being what I would have suggested from scratch. Obviously, the curse of dimensionality strikes again unless one considers summary statistics (like in ABC).

While I am glad conjugate priors do not get the lion’s share, using as in Section 3.3.. non-parametric or machine learning solutions to construct the prior sounds unrealistic. (And including maximum entropy priors into that category seems wrong since they are definitely parametric.)

The proposed Bayesian treatment of the expert’s “data” (Section 4.1) is rational but requires an additional model construct to link the expert’s data with the parameter to reach a Bayes formula like (4.1). Plus a primary prior (which could then be one of the reference priors.) Reducing the expert’s input to imaginary observations may prove too narrow, though. The notion of an iterative elicitation is most appealing and its sequential aspect may not be particularly problematic in opposition to posteriors relying on using the data twice or more. I am much less buying the hierarchical construct of Section 4.3 because they imply a return to conjugate priors and hyperpriors, are not necessarily correctly understood by experts, do not always cater to observational elicitation, and are not an answer to high-dimension challenges.

Given the state of the art, it sounds like we are still far from seeing prior elicitation as a natural part of Bayesian software and probabilistic programming. Even when using a modular, model-agnostic strategy. But this is most certainly a worthy prospect!