Archive for Jeffreys priors

statistical modeling with R [book review]

Posted in Books, Statistics on June 10, 2023 by xi'an

Statistical Modeling with R (A dual frequentist and Bayesian approach for life scientists) is a recent book written by Pablo Inchausti, from Uruguay. It is written in a highly personal and congenial style (witness the preface), with references to (fiction) books that enticed me to buy them. The book was sent to me by the JASA book editor for review and I went through the whole of it during my flight back from Jeddah. [Disclaimer about potential self-plagiarism: this post or a likely edited version of it will eventually appear in JASA. If not in CHANCE, for once.]

The very first sentence (after the preface) quotes my late friend Steve Fienberg, which is definitely starting on the right foot. The exposition of the motivations for writing the book is quite convincing, with more emphasis than usual put on the notion and limitations of modeling. The discourse is overall inspirational and contains many relevant remarks and links that make it worth reading as a whole. While heavily connected with a few R packages and functions like fitdistrplus, brms (a front-end to Stan), glm, and glmer, the book wisely bypasses the perilous reef of recalling the basics of R, and similarly the foundations of probability and statistics. While lacking in formal definitions, in my opinion, it reads well enough to somehow compensate for this very lack. I also appreciate the coherent and sustained parallel description of Bayesian and non-Bayesian analyses, an attempt that all too often quickly disappears in other books. (As an aside, note that hardly anyone claims to be a frequentist, except maybe Deborah Mayo.) A new model is almost invariably backed by a new dataset, even if a few are somewhat inappropriate, as with the mammal sleep patterns of Chapter 5, or in Fig. 6.1.
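As an illustration of the parallel treatment mentioned above, here is my own toy sketch (not an example from the book, and assuming brms and a working Stan toolchain are installed): the same Poisson regression fitted by maximum likelihood with glm() and in a Bayesian fashion with brms, the default brms priors standing in for a genuine prior elicitation.

## toy parallel frequentist / Bayesian fits of one Poisson regression (not from the book)
library(brms)                                     # Bayesian side, front-end to Stan

set.seed(101)
d <- data.frame(x = runif(100))
d$y <- rpois(100, lambda = exp(0.5 + 1.2 * d$x))  # simulated counts

fit_freq  <- glm(y ~ x, family = poisson, data = d)        # maximum likelihood fit
fit_bayes <- brm(y ~ x, family = poisson(), data = d,      # MCMC fit, default priors
                 chains = 2, iter = 2000, refresh = 0)

summary(fit_freq)$coefficients                    # MLEs and standard errors
summary(fit_bayes)                                # posterior summaries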

Given that the main motivation for the book (when compared with references like BDA) leans heavily towards the practical implementation of statistical modelling via R packages, it is inevitable that a large fraction of Statistical Modeling with R is spent on the analysis of R outputs, even though it sometimes feels a wee bit too heavy for yours truly. The R screen-copies are however produced in moderate quantity and size, even though the variations in typography/fonts (at least on my copy?!) may prove confusing. Obviously, the fine (explosive?) distinctions between regression models may eventually prove challenging for the novice reader. The specific issue of prior input (or “defining priors”) is briefly addressed in a non-chapter (p.323), although mentions are made throughout the preceding chapters. I note the nice appearance of hierarchical models and experimental designs towards the end, but would have appreciated some discussion of missing topics such as time series, causality, connections with machine learning, non-parametrics, and model misspecification. As an aside, I appreciated being reminded about the apocryphal nature of Ockham’s much-cited quote “Pluralitas non est ponenda sine necessitate“.

The typo Jeffries is found in Fig. 2.1, along with a rather sketchy representation of the history of both frequentist and Bayesian statistics. And Jon Wakefield’s book (with the related purpose of presenting both versions of parametric inference) was mistakenly entered as Wakenfield’s in the bibliography file. Some repetitions occur. I do not like the use of ≈ to denote proportionality. And I found two occurrences of the unavoidable “the the” typo (p.174 and p.422). I also had trouble with some sentences like “long-run, hypothetical distribution of parameter estimates known as the sampling distribution” (p.27), “maximum likelihood estimates [being] sufficient” (p.28), “Jeffreys’ (1939) conjugate priors” [which were introduced by Raiffa and Schlaifer] (p.35), “A posteriori tests in frequentist models” (p.130), “exponential families [having] limited practical implications for non-statisticians” (p.190), “choice of priors being correct” (p.339), or with calling MCMC sample terms “estimates” (p.42), as well as with missing index entries for acronyms, packages, and datasets, but I did not bemoan the lack of homework sections (beyond suggesting new datasets for analysis).

A problematic MCMC entry is found when calibrating the choice of the Metropolis-Hastings proposal towards avoiding negative values “that will generate an error when calculating the log-likelihood” (p.43), since it suggests proposed values should never fall outside the support of the posterior (and indicates a poor coding of the log-likelihood!). I also find the motivation for the full conditional decomposition behind the Gibbs sampler (p.47) unnecessarily confusing. (And automatically having a Metropolis-Hastings step within Gibbs, as on Fig. 3.9, adds another layer of confusion.) The Bayes factor section is very terse. The derivation of the Kullback-Leibler representation (7.3) as an expected log-likelihood ratio seems to be missing a reference measure. Of course, seeing a detailed coverage of DIC (Section 7.4) did not suit me either, even though the issue with mixtures was alluded to (with no detail whatsoever). The Nelder presentation of generalised linear models felt somewhat antiquated, since the addition of the scale factor a(φ) sounds over-parameterised.
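To make the point concrete, here is a minimal sketch (my own toy code, not the book’s, with an assumed Exponential likelihood and a flat prior on the positive rate) of a Metropolis-Hastings step where an out-of-support proposal simply gets a -Inf log-target, hence zero acceptance probability, rather than generating an error:

## toy random-walk Metropolis for a positive rate (not the book's code)
set.seed(1)
x <- rexp(50, rate = 2)                      # simulated data

log_post <- function(rate, x) {
  if (rate <= 0) return(-Inf)                # outside the support: reject, no error
  sum(dexp(x, rate = rate, log = TRUE))      # log-likelihood, flat prior assumed
}

n_iter <- 5000
rate <- numeric(n_iter)
rate[1] <- 1
for (t in 2:n_iter) {
  prop <- rate[t - 1] + rnorm(1, sd = 0.5)   # may well propose a negative value
  log_alpha <- log_post(prop, x) - log_post(rate[t - 1], x)
  rate[t] <- if (log(runif(1)) < log_alpha) prop else rate[t - 1]
}
mean(rate[-(1:1000)])                        # posterior mean of the rate

Negative proposals are thus rejected with probability one, which is valid (if possibly wasteful) behaviour, with no need to constrain the proposal to the support.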

But those are minor quibbles in relation to a book that should attract curious minds with various levels of background knowledge and expertise in statistics, as well as work nicely to support an enthusiastic teacher of statistical modelling. I thus recommend this book most enthusiastically.

21w5107 [day 2]

Posted in Books, Mountains, pictures, Statistics, Travel, University life on December 1, 2021 by xi'an

After a rich and local (if freezing) dinner on a rooftop facing the baroque Oaxaca cathedral, and an early invigorating outdoor swim in my case!, the morning session was mostly on mixtures, with Helen Ogden exploring cross-validation for (estimating the number k of components of) finite mixtures, when using the likelihood as an objective function. I was however unclear about the goal, considering that the data supporting the study was Uniform(0,1), nothing like a mixture of Normal distributions. And about the consistency attached to the objective function. The session ended with Diana Cai presenting a counter-argument, in the sense that she proved, along with Trevor Campbell and Tamara Broderick, that the posterior on k diverges to infinity with the number n of observations if a mixture model is misspecified for said data. Which does not come as a major surprise since there is no properly defined value of k when the data is not generated from the adopted mixture. I would love to see an extension to the case when the k-component mixture contains a non-parametric component! In-between, Alexander Ly discussed Bayes factors for multiple datasets, with some asymptotics showing consistency for some (improper!) priors when one sample size grows to infinity. With actually the same rate attained under both hypotheses. Luis Nieto-Barajas presented an approach to uncertainty assessment through KL divergence for random probability measures, which requires a calibration of the KL in this setting, as KL does not enjoy a uniform scale, and a prior on a Pólya tree. And Chris Holmes presented a recent work with Edwin Fong and Steven Walker on a prediction approach to Bayesian inference. Which I had had on my reading list for a while. It is a very original proposal where likelihoods and priors are replaced by the sequence of posterior predictives and only parameters of interest get simulated. The Bayesian flavour of the approach is delicate to assess though, albeit a form of non-parametric Bayesian perspective… (I still need to read the paper carefully.)

In the afternoon session, Judith Rousseau presented her recent foray into cut posteriors for semi-parametric HMMs. With interesting outcomes for efficiently estimating the transition matrix, the component distributions, and the smoothing distribution. I wonder at the connection with safe Bayes in that cut posteriors induce a loss of information. Sinead Williamson spoke on distributed MCMC for BNP. Going back to the “theme of the day”, namely clustering and finding the correct (?) number of clusters. With a collapsed versus uncollapsed division that reminded me of the marginal vs. conditional divide María Gil-Leyva discussed yesterday. Plus a decomposition of a random measure into a finite mixture and an infinite one that also reminded me of the morning talk of Diana Cai. (And making me wonder at the choice of the number K of terms in the finite part.) Michele Guindani spoke about clustering distributions (with firecrackers as a background!). Using the nDP mixture model, which was shown to suffer from degeneracy (as discussed by Federico Camerlenghi et al. in BA). The subtle difference lies in using the same (common) atoms in all random distributions at the top of the hierarchy, with independent weights. Making the partitions partially exchangeable. The approach relies on Sylvia’s generalised mixtures of finite mixtures. With interesting applications to microbiome and calcium imaging data (including a mouse brain in action!). And Giovanni Rebaudo presented a generalised notion of clustering aligned on a graph, with some observations located between the nodes corresponding to clusters. Represented as a random measure with common parameters for the clusters and separate parameters outside. Interestingly playing on random partitions, Pólya urns, and species sampling.

21w5107 [day 1]

Posted in pictures, Statistics, Travel, University life on November 30, 2021 by xi'an

The workshop started with the bad news that our friend Michele Guindani had been hit and mugged upon arrival in Oaxaca, Saturday night. Fortunately, he was not hurt, but he lost both phone and wallet, always a major bummer when abroad… Still, this did not cast a lasting pall on the gathering of long-time no-see friends, whom I had indeed not seen for at least two years. Except for those who came to the CIRMirror!

A few hours later, we got woken up by fairly loud firecrackers (palomas? cohetes?) at 5am, for no reason I can fathom (Mexican Revolution day was a week ago), although it seemed correlated with the nearby church bells going off at full blast (for Lauds? Hanukkah? Cyber Monday? Chirac’s birthdate?). The above picture was taken in the town of Santa María del Tule, with its super-massive Montezuma cypress tree and remaining decorations from the Día de los Muertos.

Without launching (much) the debate on whether or not Bayesian non-parametrics qualify as “objective Bayesian” methods, Igor Prünster started the day with a non-parametric presentation of dependent random probability measures. With the always fascinating notion that a random discrete non-parametric prior induces a distribution on the partitions (EPPF). And applicability to mixtures and their generalisations. Realising that the highly discrete nature of such measures is not such an issue for a given sample size n, since there are at most n elements in the partition. Beatrice Franzolini discussed specific ways to create dependent distributions based on independent samples, although her practical example based on one N(-10,1) sample and another (independent) N(10,1) sample seemed to fit several of the dependent random measures she compared. And Marta Catalano (Warwick) presented her work on partial exchangeability and optimal transportation (which I had also heard in CIRM last June and in Warwick last week). One thing I had not realised earlier was the dependence of the Wasserstein distance on the parameterisation, although it now makes perfect sense. If only for the coupling. I had alas to miss Isadora Antoniano-Villalobos’ talk as I had to teach my undergrad class at Paris Dauphine at the same time… This non-parametric session was quite homogeneous and rich in perspectives.

In an all-MCMC afternoon, Julyan Arbel talked about reference priors for extreme value distributions, with the “shocking” case of a restriction on the support of one parameter, ξ. Which means in fact that the Jeffreys prior is then undefined. This reminded me somewhat of the work of Clara Grazian on Jeffreys priors for mixtures, where some models did not allow for the Fisher information to exist. The second part of this talk was about modified local versions of the Gelman & Rubin (1992) R-hats. And the recent modification proposed by Aki and co-authors. Where I thought that the multivariate challenge of defining ranks could be alleviated by directly considering the likelihood values of the chains, as sketched below. And Trevor Campbell gradually built an involved parallel tempering method where the powers of a geometric mixture are optimised as spline functions of the temperature. Next, María Gil-Leyva presented her original and ordered approach to mixture estimation, which I discussed in a blog post published two days ago (!). She corrected my impressions that (i) the methods were all impervious to label switching and (ii) required some conjugacy to operate. The final talk of the day was by Anirban Bhattacharya on high-dimensional Bayesian regression and coupling techniques for checking convergence, a paper that had been on my reading list for a long while. A very elaborate construct of coupling strategies within a Gibbs sampler, with some steps relying on optimal coupling and others on the use of common random generators.
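For the record, here is roughly what I have in mind, as my own toy sketch (not from the talk, and using a plain split R-hat rather than the rank-normalised version): monitoring a scalar summary of each chain, namely its per-iteration log-likelihood values, sidesteps the definition of multivariate ranks altogether.

## split R-hat computed on a scalar summary of the chains (toy sketch)
split_rhat <- function(x) {                  # x: iterations in rows, chains in columns
  half <- floor(nrow(x) / 2)
  x <- cbind(x[1:half, , drop = FALSE],      # split each chain into two halves
             x[(half + 1):(2 * half), , drop = FALSE])
  n <- nrow(x)
  B <- n * var(colMeans(x))                  # between-(split-)chain variance
  W <- mean(apply(x, 2, var))                # average within-chain variance
  sqrt(((n - 1) / n * W + B / n) / W)
}

loglik <- matrix(rnorm(4000), ncol = 4)      # stand-in for per-iteration log-likelihoods
split_rhat(loglik)                           # close to 1 for these independent fake chains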

Monte Carlo Markov chains

Posted in Books, Statistics, University life on May 12, 2020 by xi'an

Darren Wraith pointed out to me this (currently free access) Springer book by Massimiliano Bonamente [whose family name means good spirit in Italian] for its use of the unusual Monte Carlo Markov chain rendering of MCMC. (Google Trends seems to restrict this usage to California!) This is a graduate text for physicists, but one could nonetheless expect more rigour in the processing of the topics, particularly the Bayesian ones. Here is a pot-pourri of memorable quotes:

“Two major avenues are available for the assignment of probabilities. One is based on the repetition of the experiments a large number of times under the same conditions, and goes under the name of the frequentist or classical method. The other is based on a more theoretical knowledge of the experiment, but without the experimental requirement, and is referred to as the Bayesian approach.”

“The Bayesian probability is assigned based on a quantitative understanding of the nature of the experiment, and in accord with the Kolmogorov axioms. It is sometimes referred to as empirical probability, in recognition of the fact that sometimes the probability of an event is assigned based upon a practical knowledge of the experiment, although without the classical requirement of repeating the experiment for a large number of times. This method is named after the Rev. Thomas Bayes, who pioneered the development of the theory of probability.”

“The likelihood P(B/A) represents the probability of making the measurement B given that the model A is a correct description of the experiment.”

“…a uniform distribution is normally the logical assumption in the absence of other information.”

“The Gaussian distribution can be considered as a special case of the binomial, when the number of tries is sufficiently large.”

“This clearly does not mean that the Poisson distribution has no variance—in that case, it would not be a random variable!”

“The method of moments therefore returns unbiased estimates for the mean and variance of every distribution in the case of a large number of measurements.”

“The great advantage of the Gibbs sampler is the fact that the acceptance is 100 %, since there is no rejection of candidates for the Markov chain, unlike the case of the Metropolis–Hastings algorithm.”

Let me then point out (or just whine about!) the book using “statistical independence” for plain independence, the use of / rather than Jeffreys’ | for conditioning (and sometimes forgetting \ in some LaTeX formulas), the confusion between events and random variables, especially when computing the posterior distribution, and between models and parameter values, the reliance on discrete probability for continuous settings, as in the Markov chain chapter, confusing density and probability, using Mendel’s pea data without mentioning the unlikely fit to the expected values (or, as put more subtly by Fisher (1936), “the data of most, if not all, of the experiments have been falsified so as to agree closely with Mendel’s expectations”), presenting Fisher’s and Anderson’s Iris data [a motive for rejection when George was JASA editor!] as “a new classic experiment”, mentioning Pearson but not Lee for the data in the 1903 Biometrika paper “On the laws of inheritance in man” (and woman!), and not accounting for the discrete nature of this data in the linear regression chapter, the three-page derivation of the Gaussian distribution from a Taylor expansion of the Binomial pmf obtained by differentiating in the integer argument, spending endless pages on deriving standard properties of classical distributions, this appalling mess of adding over the conditioning atoms with no normalisation in a Poisson experiment (a normalised version is spelled out below, after this litany)

P(X=4|\mu=0,1,2) = \sum_{\mu=0}^2 \frac{\mu^4}{4!}\exp\{-\mu\},

botching the proof of the CLT, which is treated before the Law of Large Numbers, restricting maximum likelihood estimation to the Gaussian and Poisson cases and muddling its meaning by discussing unbiasedness, confusing a drifted Poisson random variable with a drift on its parameter, as well as using the pmf of the Poisson to define an area under the curve (Fig. 5.2), sweeping the impropriety of a constant prior under the carpet, defining a null hypothesis as a range of values for a summary statistic, making no mention of Bayesian perspectives in the hypothesis testing, model comparison, and regression chapters, having one-dimensional case chapters followed by two-dimensional case chapters, reducing model comparison to the use of the Kolmogorov-Smirnov test, processing bootstrap and jackknife in the Monte Carlo chapter without a mention of importance sampling, stating recurrence results without assuming irreducibility, motivating MCMC by the intractability of the evidence, resorting to the term link to designate the current value of a Markov chain, incorporating the need for a prior distribution in a terrible description of the Metropolis-Hastings algorithm, including a discrete proof of its stationarity, spending many pages on early 1990’s MCMC convergence tests rather than discussing the adaptive scaling of proposal distributions, the inclusion of numerical tables [in a 2017 book], and turning Bayes (1763) into Bayes and Price (1763), or Student (1908) into Gosset (1908).
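For the record, and assuming the intended prior puts equal weights on the three atoms μ = 0, 1, 2, a properly normalised version of the above computation would instead read

P(X=4\mid\mu\in\{0,1,2\}) = \frac{1}{3}\sum_{\mu=0}^2 \frac{\mu^4}{4!}\exp\{-\mu\}.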

[Usual disclaimer about potential self-plagiarism: this post or an edited version of it could possibly appear later in my Books Review section in CHANCE. Unlikely, though!]

O’Bayes in action

Posted in Books, Kids, Statistics, University life on November 7, 2017 by xi'an

My next-door colleague [at Dauphine] François Simenhaus shared a paradox [to be developed in an incoming test!] with Julien Stoehr and me last week, namely that, when selecting the largest number between a [observed] and b [unobserved], drawing a random boundary on a [meaning that a is chosen iff a is larger than this boundary] increases the probability of picking the largest number above ½…

When thinking about it in the wretched RER train [which got immobilised for at least two hours just a few minutes after I went through!, good luck to the passengers travelling to the airport…] to De Gaulle airport, I lost the argument: if a<b, the probability [for this random bound] to be larger than a, and hence for selecting b, is 1-Φ(a), while, if a>b, the probability [of winning] is Φ(a). Hence the only case when the probability is ½ is when a is the median of this random variable. But, when discussing the issue further with Julien, I exposed an interesting non-informative prior characterisation. Namely, if I assume a,b to be iid U(0,M) and set an improper prior 1/M on M, the conditional probability that b>a given a is ½. Furthermore, the posterior probability to pick the right [largest] number with François’s randomised rule is also ½, no matter what the distribution of the random boundary is. Now, the most surprising feature of this coffee-room derivation is that these properties only hold for the prior 1/M. Any other power of M will induce an asymmetry between a and b. (The same properties hold when a,b are iid Exp(M).) Of course, this is not absolutely unexpected since 1/M is the invariant prior and since the “intuitive” symmetry only holds under this prior. Power to O’Bayes!
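Here is a quick numerical check of that coffee-room claim (my own sketch, using the fact that, under the 1/M prior, the posterior density of M given a is a/M² over (a,∞)):

## numerical check that P(b > a | a) = 1/2 under the prior 1/M (my own sketch)
a <- 3.7                                               # any positive observed value
post_M <- function(M) a / M^2                          # posterior density of M given a
integrate(function(M) (1 - a / M) * post_M(M),
          lower = a, upper = Inf)$value                # returns 0.5 up to numerical error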

When discussing again the matter with François yesterday, I realised I had changed his wording of the puzzle. The original setting is one with two cards hiding the unknown numbers a and b and of a player picking one of the cards. If the player picks a card at random, there is indeed a probability of ½ of picking the largest number. If the decision to switch or not depends on an independent random draw being larger or smaller than the number on the observed card, the probability to get max(a,b) in the end hits 1 when this random draw falls into (a,b) and remains ½ outside (a,b). Randomisation pays.
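And a quick simulation of this original card version (again my own sketch, with arbitrary Normal choices for both the hidden numbers and the random draw, neither being specified in the puzzle):

## simulating the randomised switching rule on the two-card puzzle (my own sketch)
set.seed(42)
wins <- replicate(1e4, {
  ab   <- rnorm(2, mean = 0, sd = 2)        # the two hidden numbers (arbitrary choice)
  seen <- sample(1:2, 1)                    # card revealed at random
  z    <- rnorm(1)                          # independent random boundary
  keep <- ab[seen] > z                      # keep the revealed card iff above the boundary
  pick <- if (keep) seen else c(2, 1)[seen]
  ab[pick] == max(ab)
})
mean(wins)                                  # noticeably above 1/2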