Archive for CHANCE

statistical modeling with R [book review]

Posted in Books, Statistics on June 10, 2023 by xi'an

Statistical Modeling with R (A dual frequentist and Bayesian approach for life scientists) is a recent book written by Pablo Inchausti, from Uruguay. It is written in a highly personal and congenial style (witness the preface), with references to (fiction) books that enticed me to buy them. The book was sent to me by the JASA book editor for review and I went through the whole of it during my flight back from Jeddah. [Disclaimer about potential self-plagiarism: this post or a likely edited version of it will eventually appear in JASA. If not CHANCE, for once.]

The very first sentence (after the preface) quotes my late friend Steve Fienberg, which is definitely starting on the right foot. The exposition of the motivations for writing the book is quite convincing, with more emphasis than usual put on the notion and limitations of modeling. The discourse is overall inspirational and contains many relevant remarks and links that make it worth reading as a whole. While heavily connected with a few R packages and functions like fitdist, fitdistrplus, brms (a front end for Stan), glm, glmer, the book wisely bypasses the perilous reef of recalling R basics. Similarly for the foundations of probability and statistics. While lacking in formal definitions, in my opinion, it reads well enough to somehow compensate for this very lack. I also appreciate the coherent and thorough continuation of the parallel description of Bayesian and non-Bayesian analyses, an attempt that all too often quickly disappears in other books. (As an aside, note that hardly anyone claims to be a frequentist, except maybe Deborah Mayo.) A new model is almost invariably backed by a new dataset, if with a few being somewhat inappropriate, as in the mammal sleep patterns of Chapter 5. Or in Fig. 6.1.

Given that the main motivation for the book (when compared with references like BDA) is heavily towards the practical implementation of statistical modelling via R packages, it is inevitable that a large fraction of Statistical Modeling with R is spent on the analysis of R outputs, even though it sometimes feels a wee bit too heavy for yours truly. The R screen-copies are however produced in moderate quantity and size, even though the variations in typography/fonts (at least on my copy?!) may prove confusing. Obviously the high (explosive?) distinction between regression models may eventually prove challenging for the novice reader. The specific issue of prior input (or “defining priors”) is briefly addressed in a non-chapter (p.323), although mentions are made throughout preceding chapters. I note the nice appearance of hierarchical models and experimental designs towards the end, but would have appreciated some discussions on missing topics such as time series, causality, connections with machine learning, non-parametrics, model misspecification. As an aside, I appreciated being reminded about the apocryphal nature of Ockham’s much cited quote “Pluralitas non est ponenda sine necessitate”.

The typo “Jeffries” is found in Fig. 2.1, along with a rather sketchy representation of the history of both frequentist and Bayesian statistics. And Jon Wakefield’s book (with the related purpose of presenting both versions of parametric inference) was mistakenly entered as Wakenfield’s in the bibliography file. Some repetitions occur. I do not like the use of the equivalence symbol ≈ for proportionality. And I found two occurrences of the unavoidable “the the” typo (p.174 and p.422). I also had trouble with some sentences like “long-run, hypothetical distribution of parameter estimates known as the sampling distribution” (p.27), “maximum likelihood estimates [being] sufficient” (p.28), “Jeffreys’ (1939) conjugate priors” [which were introduced by Raiffa and Schlaifer] (p.35), “A posteriori tests in frequentist models” (p.130), “exponential families [having] limited practical implications for non-statisticians” (p.190), “choice of priors being correct” (p.339), or calling MCMC sample terms “estimates” (p.42), and issues with missing indices for acronyms, packages, datasets, but did not bemoan the lack of homework sections (beyond suggesting new datasets for analysis).

A problematic MCMC entry is found when calibrating the choice of the Metropolis-Hastings proposal towards avoiding negative values “that will generate an error when calculating the log-likelihood” (p.43) since it suggests proposed values should not exceed the support of the posterior (and indicates a poor coding of the log-likelihood!). I also find the motivation for the full conditional decomposition behind the Gibbs sampler (p.47) unnecessarily confusing. (And automatically having a Metropolis-Hastings step within Gibbs as on Fig. 3.9 brings another magnitude of confusion.) The Bayes factor section is very terse. The derivation of the Kullback-Leibler representation (7.3) as an expected log likelihood ratio seems to be missing a reference measure. Of course, seeing a detailed coverage of DIC (Section 7.4) did not suit me either, even though the issue with mixtures was alluded to (with no detail whatsoever). The Nelder presentation of the generalised linear models felt somewhat antiquated, since the addition of the scale factor a(φ) sounds over-parameterized.
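To make the remark on the log-likelihood coding concrete, here is a minimal sketch (not taken from the book) of a random-walk Metropolis-Hastings sampler where proposals outside the support are harmless: the log-posterior returns minus infinity there, so such proposals are simply rejected and no tuning of the proposal is needed to avoid them. The model (Exponential observations with an Exponential prior on the rate), the function names, and the default settings are all assumptions chosen for illustration.

```python
import math
import random


def log_posterior(rate, data, prior_rate=1.0):
    """Unnormalised log-posterior for Exponential(rate) data
    under an Exponential(prior_rate) prior on the rate."""
    if rate <= 0:
        # Outside the support the posterior density is zero:
        # return -inf instead of letting math.log raise an error.
        return -math.inf
    n = len(data)
    return n * math.log(rate) - rate * sum(data) - prior_rate * rate


def random_walk_mh(data, n_iter=5000, scale=0.5, init=1.0, seed=0):
    """Random-walk Metropolis-Hastings with a symmetric Gaussian proposal."""
    rng = random.Random(seed)
    current = init
    current_lp = log_posterior(current, data)
    chain = []
    for _ in range(n_iter):
        proposal = current + rng.gauss(0.0, scale)  # may be negative
        proposal_lp = log_posterior(proposal, data)
        # 1 - random() lies in (0, 1], so the logarithm is always defined.
        log_u = math.log(1.0 - rng.random())
        if log_u < proposal_lp - current_lp:
            current, current_lp = proposal, proposal_lp
        chain.append(current)
    return chain
```

A negative proposal yields a log acceptance ratio of minus infinity and the chain stays put, which is a valid Metropolis-Hastings move rather than an error.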

But those are minor quibbles in relation to a book that should attract curious minds of various background knowledge and expertise in statistics, as well as work nicely to support an enthusiastic teacher of statistical modelling. I thus recommend this book most enthusiastically.

Number savvy [book review]

Posted in Books, Statistics on March 31, 2023 by xi'an

“This book aspires to contribute to overall numeracy through a tour de force presentation of the production, use, and evolution of data.”

Number Savvy: From the Invention of Numbers to the Future of Data is written by George Sciadas, a statistician working at Statistics Canada. This book is mostly about data, even though it starts with the “compulsory” tour of the invention(s) of numbers and the evolution towards a mostly universal system and the issue of measurements (with a funny if illogical/anti-geographical confusion in “gare du midi in Paris and gare du Nord in Brussels” since Gare du Midi (south) is in Brussels while Gare du Nord (north) is in Paris). The chapter (Chap. 3) on census and demography is quite detailed about the hurdles preventing an exact count of a population, but much less about the methods employed to improve the estimation. (The request for me to fill the short form for the 2023 French Census actually came while I was reading the book!)

The next chapter links measurement with socio-economic notions or models, like the unemployment rate, which depends on so many criteria (pp. 77-87) that its measurement sounds impossible or arbitrary. Almost as arbitrary as the reported number of protesters in a French demonstration! Same difficulty with the GDP, whose interpretation seems beyond the grasp of the common reader. And which does not significantly cover missing(-not-at-random) data like tax evasion, money laundering, and the grey economy. (Nitpicking: if GDP got down by 0.5% one year and up by 0.5% the year after, this does not exactly compensate!) Chapter 5 reflects upon the importance of definitions and boundaries in creating official statistics and categorical data. A chapter (Chap. 6) on the gathering of data in the past (read prior to the “Big Data” explosion) prepares the ground for the chapter on the current setting. It is mostly about surveys, presented as definitely from the past, “shadows of their old selves”. And with anecdotes reminding me of my only experience as a survey interviewer (on Xmas practices!). And about administrative data, progressively moving from collected by design to available for any prospection (or “farming”). A short chapter compared with the one (Chap. 7) on new data (types), mostly customer, private sector, data. Covering the data accumulated by big tech companies, but not particularly illuminating (with bar-room remarks like “Facebook users tend to portray their lives as they would like them to be. Google searches may reflect more truthfully what people are looking for.”)
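The nitpick about successive GDP changes comes down to compounding: percentage changes multiply rather than add. A tiny sketch, with a hypothetical helper name, makes the arithmetic explicit.

```python
def compound(changes_percent):
    """Level reached after applying successive percentage changes
    to an initial level of 1."""
    level = 1.0
    for pct in changes_percent:
        level *= 1.0 + pct / 100.0
    return level


# A 0.5% drop followed by a 0.5% rise: 0.995 * 1.005 = 0.999975,
# so the level ends slightly below where it started.
after_two_years = compound([-0.5, 0.5])
shortfall_percent = (1.0 - after_two_years) * 100.0
```

The shortfall is 0.0025% of the initial level, tiny but not zero, and the same holds whichever change comes first.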

The following Chapter 8 is somewhat confusing in its defence of microdata, by which I understand keeping the raw data rather than averaging through summary statistics. Synthetic data is mentioned there, but without reference to a reference model, while machine learning makes a very brief appearance (p.222). In Chapter 9, (statistical) data analysis is [at last!] examined, but mostly through descriptive statistics. Except for a regression model and a discussion of the issues around hypothesis testing and Bayesian testing making its unique visit, albeit confusedly in-between references to Taleb’s Black swan, Gödel’s incompleteness theorem (which always seems to fascinate authors of general public science books!), and Kahneman and Tversky’s prospect theory. Somewhat surprisingly, the chapter also includes a Taoist tale about the farmer getting in turns lucky and unlucky… A tale that was already used in What are the chances? that I reviewed two years ago. As this is a very established parable dating back at least to the 2nd century B.C., there is no copyright involved, but what are the chances the story finds its way that quickly into another book?!

The last and final chapter is about the future, unsurprisingly. With prediction of “plenty of black boxes”, “statistical lawlessness”, “data pooling” and data as a commodity (which relates to some themes of our OCEAN ERC-Synergy grant). Although the solution favoured by the author is centralised, through a (national) statistics office or another “trusted third party”. The last section is about the predicted end of theory, since “simply looking at data can reveal patterns”, but resisting the prophets of doom and idealising the Rise of the (AI) machines… The lyrical conclusion that “With both production consolidation and use of data increasingly in the ‘hands’ of machines, and our wise interventions, the more distant future will bring complete integrations” sounds too much like Brave New World for my taste!

“…the privacy argument is weak, if not hypocritical. Logically, it’s hard to fathom what data that we share with an online retailer or a delivery company we wouldn’t share with others (…) A naysayer will say nay.” (p.190)

The way the book reads and unrolls is somewhat puzzling to this reader, as it sounds like a sequence of common sense remarks with a Guesstimation flavour on the side, and tiny historical or technical facts, some unknown and most of no interest to me, while lacking in the larger picture. For instance, the long-winded tale on evaluating the cumulated size of a neighbourhood’s lawns (p.34-38) does not seem to be getting anywhere. The inclusion of so many warnings, misgivings, and alternatives in the collection and definition of data may have the counter-effect of discouraging readers from making sense of numeric concepts and trusting the conclusions of data-based analyses. The constant switch in perspective(s) and the apparent absence of definite conclusions are also exhausting. Furthermore, I feel that the author and his rosy prospects are repeatedly minimizing the risks of data collection on individual privacy and freedom, when presenting the platforms as a solution to a real time census (as, e.g., p.178), as exemplified by the high social control exercised by some number savvy dictatorships! And he is highly critical of EU regulations such as GDPR, “less-than-subtle” (p.267), “with its huge impact on businesses” (p.268). I am thus overall uncertain which audience this book will eventually reach.

[Disclaimer about potential self-plagiarism: this post or an edited version will potentially appear in my Books Review section in CHANCE.]

The Effect [book review]

Posted in Books, R, Running, Statistics, University life on March 10, 2023 by xi'an

While it sounds like the title of a science-fiction catastrophe novel or of a (of course) convoluted nouveau roman, this book by Nick Huntington-Klein is a massive initiation to econometrics and causality. As explained by the subtitle, An Introduction to Research Design and Causality.

This is a hüûüge book, actually made of two parts that could have been books (volumes?). And covering three languages, R, Stata, and Python, which should have led to three independent books. (Seriously, why print three versions when you need at best one?!) I carried it with me during my vacations in Central Québec, but managed to lose my notes on the first part, which means missing the opportunity for biased quotes! It was mostly written during the COVID lockdown(s), which may account for a certain amount of verbosity and rambling around.

“My mom loved the first part of the book and she is allergic to statistics.”

The first half (which is in fact a third!) is conceptual (and chatty) and almost formula free, based on the postulate that “it’s a pretty slim portion of students who understand a method because of an equation” (p.xxii). For this reader (or rather reviewer), relying on explanations through examples makes the reading much harder, as spotting the main point gets harder (and requires reading most sentences!). And it makes for a very slow start since notations and mathematical notions have to be introduced with an excess of caution (as in the distinction between Latin and Greek symbols, p.36). Moving through single variable models, conditional distributions, with a lengthy explanation of how OLS are derived, data generating process and identification (of causes), causal diagrams, back and front doors (a recurrent notion within the book), treatment effects and a conclusion chapter.

“Unlike statistical research, which is completely made of things that are at least slightly false, statistics itself is almost entirely true.” (p.327)

The second part, called the Toolbox, is closer to a classical introduction to econometrics, albeit with a shortage of mathematics (and no proof whatsoever), although [warning!] logarithms, polynomials, partial derivatives and matrices are used. Along with a substantial (3x) chunk allocated to printed codes, the density of the footnotes significantly increases in this section. It includes an extensive chapter on regression (including testing practice, non-linear and generalised linear models, as well as basic bootstrap without much warning about its use in… regression settings, and LASSO), one on matching (with propensity scores, kernel weighting, Mahalanobis weighting), one on simulation, yes simulation! in the sense of producing pseudo-data from known generating processes to check methods, as well as bootstrap (with resampling residuals making at last an appearance!), fixed and random effects (where the author “feels the presence of Andrew Gelman reaching through time and space to disagree”, p.405). The chapter on event studies is about time dependent data with a bit of ARIMA prediction (but nothing on non-stationary series and unit root issues). The more exotic chapters cover (18) difference-in-differences models (control vs treated groups, with John Snow pumping his way in), (19) instrumental variables (aka the minor bane of my 1980’s econometrics courses), with two-stage least squares and the generalised method of moments (if not the simulated version), (20) discontinuity (i.e., changepoints), with the limitation of having a single variate explaining the change, rather than an unknown combination of them, and a rather pedestrian approach to the issue, (21) other methods (including the first mention of machine learning regression/prediction and some causal forests), concluding with an “Under the rug” portmanteau.

Nothing (afaict) on multivariate regressed variates and simultaneous equations. Hardly an occurrence of Bayesian modelling (p.581), vague enough to remind me of my first course of statistics and the one-line annihilation of the notion.

Duh cover, but nice edition, except for the huge margins that could have been cut to reduce the 622 pages by a third (and reined in the tendency of the author towards excessive footnotes!). And an unintentional white line on p.238! Cute and vaguely connected little drawings at the head of every chapter (like the head above). A rather terse subject index (except for the entry “The first reader to spot this wins ten bucks”!), which should have been completed with an acronym index.

“Calculus-heads will recognize all of this as taking integrals of the density curve. Did you know there’s calculus hidden inside statistics? The things your professor won’t tell you until it’s too late to drop the class.”

Obviously I am biased in that I cannot negatively comment on an author running 5:37 a mile as, by now, I could only just compete, far from the 5:15 of decades past! I am just a wee bit suspicious of the reported time, however, given that it happens exactly on page 537… (And I could have clearly taken issue with his 2014 paper, Is Robert anti-teacher? Or with the populist catering to anti-math attitudes such as the above, found in a footnote!) But I enjoyed reading the conceptual chapter on causality as well as the (more) technical chapter on instrumental variables (a notion I have consistently found confusing all the [long] way from graduate school). And while repeated references are made to Scott Cunningham’s Causal Inference: The Mixtape, I think I will stop there with 500⁺ page introductory econometrics books!

[Disclaimer about potential self-plagiarism: this post or an edited version will potentially appear in my Books Review section in CHANCE.]

Bayesian thinking for toddlers & Bayesian probabilities for babies [book reviews]

Posted in Statistics on January 27, 2023 by xi'an

My friend E.-J.  Wagenmakers sent me a copy of Bayesian Thinking for Toddlers, “a must-have for any toddler with even a passing interest in Ockham’s razor and the prequential principle.” E.-J. wrote the story and Viktor Beekman (of thesis’ cover fame!) drew the illustrations. The book can be read for free on https://psyarxiv.com/w5vbp/, but not purchased as publishers were not interested and self-publishing was not available at a high enough quality level. Hence, in the end, 200 copies were made as JASP material, with me being the happy owner of one of these. The story follows two young girls competing for dinosaur expertise, and being rewarded by cookies, in proportion to the probability of providing the correct answer to two dinosaur questions. Toddlers may get less enthusiastic than grown-ups about the message, but they will love the drawings (and the questions if they are into dinosaurs).

This reminded me of the Bayesian probabilities for babies book, by Chris Ferrie, which details the computation of the probability that a cookie contains candy when the first bite holds none. It is more genuinely intended for young kids, in shape and design, as can be checked on a YouTube video, with a hypothetical population of cookies (with and without candy) being the proxy for the prior distribution. I hope no baby will be traumatised from being exposed too early to the notions of prior and posterior. Only data can tell, twenty years from now, if the book induced a spike or a collapse in the proportion of Bayesian statisticians!
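For grown-ups wishing to redo the cookie computation, here is a minimal sketch of the corresponding Bayes update. The numbers (half of the cookies containing candy, and a candy cookie yielding a candy-free bite one time in three) are invented for illustration and are not those of the book; the function name is likewise hypothetical.

```python
from fractions import Fraction


def prob_candy_given_plain_bite(prior_candy, plain_bite_if_candy):
    """Posterior probability that a cookie contains candy, given that
    the first bite holds none. A candy-free cookie always yields a
    candy-free bite."""
    joint_candy = prior_candy * plain_bite_if_candy
    joint_no_candy = (1 - prior_candy) * 1
    return joint_candy / (joint_candy + joint_no_candy)


# Hypothetical inputs: prior 1/2, candy-free bite with probability 1/3.
posterior = prob_candy_given_plain_bite(Fraction(1, 2), Fraction(1, 3))
```

With these made-up inputs the posterior is (1/6) / (1/6 + 1/2) = 1/4, down from the prior of 1/2, as a disappointing first bite should imply.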

[Disclaimer about potential self-plagiarism: this post or an edited version will potentially appear in my Books Review section in CHANCE.]

Casanova’s Lottery [book review]

Posted in Books, Statistics, University life on January 12, 2023 by xi'an

This “history of a revolutionary game of chance” is the latest book by Stephen Stigler and is indeed of an historical nature, following the French Lottery from its inception as Loterie royale in 1758 to the Loterie Nationale in 1836 (with the intermediate names of Loterie de France, Loterie Nationale, Loterie impériale, Loterie royale reflecting the agitated history of the turn of that Century!).

The incentive for following this State lottery is that it is exceptional by its mathematical foundations. Contrary to other lotteries of the time, it was indeed grounded on the averaging of losses and gains on the long run (for the State). The French (Royal) State thus accepted the possibility of huge losses at some draws since they would be compensated by even larger gains. The reasoning proved most correct since the Loterie went on to provide as much as 4% of the overall State budget, despite the running costs of maintaining a network of betting places and employees, who had to be mathematically savvy in order to compute the exact gains of the winners. This is rather amazing as the understanding of the Law of Large Numbers was quite fresh (on an historical scale) thanks to the considerable advances made by Pascal, Fermat, (Jakob) Bernoulli and a few others. (The book mentions the Encyclopedist and mathematician Jean d’Alembert as being present at the meeting that decided on the creation of the Loterie in 1757.)

One may wonder why Casanova gets the credit for this lottery. In true agreement with Stigler’s Law, it is directly connected with the Genoan lottery and subsequent avatars in some Italian cities, including Casanova’s Venezia. But jack-of-all-trades Casanova was instrumental in selling the notion to the French State, having landed in Paris after a daring flight from the Serenissima’s jails. After succeeding in convincing the King’s officers to launch the scheme crafted by a certain Ranieri (de’) Calzabigi (not to be confused with the much maligned Salieri!), who would later collaborate with Gluck on Orfeo ed Euridice and Alceste, Casanova received a salary from the Loterie administration and further ran several betting offices. Until he left Paris for further adventures! Including an attempt to reproduce the lottery in Berlin, where Frederick II proved less receptive than Louis XV. (Possibly due to Euler’s cautionary advice.) The final sentence of the book stands by its title: “It was indeed Casanova’s lottery” (p.210).

Unsurprisingly, given Stephen’s fascination for Pierre-Simon Laplace, the great man plays a role in the history, first by writing in 1774 one of his earliest papers on a lottery problem, namely the distribution of the number of draws needed for all 90 numbers to appear. His (correct) solution is an alternating sum whose derivation proved a numerical challenge. Thirty years later, Laplace came up with a good and manageable approximation (see Appendix Two). Laplace also contributed to the end of the Loterie by arguing on moral grounds against this “voluntary” tax, along with Talleyrand, a fellow adept at perpetually adapting to the changing political regimes. It is a bit of a surprise to read that this rather profitable venture ended in 1836, more under bankers’ than moralists’ pressure. (A new national lottery, based on printed tickets rather than bets on results, was created a century later, in 1933, and survived the second World War, with the French Loto appearing in 1974 as a direct successor to Casanova’s lottery.)
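What was a numerical challenge in 1774 is now a few lines of exact integer arithmetic. The sketch below evaluates the standard inclusion-exclusion (alternating) sum for the probability that all numbers have appeared within n draws. It assumes, as a detail not stated above, that each draw picks 5 distinct numbers out of 90, and it is not claimed to reproduce Laplace's own derivation or notation; the function names are hypothetical.

```python
from fractions import Fraction
from math import comb


def prob_all_seen(n_draws, n_numbers=90, per_draw=5):
    """P(every number has appeared within n_draws draws), by
    inclusion-exclusion over the sets of k numbers never drawn."""
    total = 0
    for k in range(n_numbers + 1):
        # Number of ways for n_draws draws to all avoid k given numbers
        # (comb returns 0 when fewer than per_draw numbers remain).
        ways_avoiding_k = comb(n_numbers - k, per_draw) ** n_draws
        total += (-1) ** k * comb(n_numbers, k) * ways_avoiding_k
    return Fraction(total, comb(n_numbers, per_draw) ** n_draws)


def smallest_draws_for(level, n_numbers=90, per_draw=5):
    """Smallest number of draws for which the probability of having
    seen every number reaches the given level."""
    n = 0
    while prob_all_seen(n, n_numbers, per_draw) < level:
        n += 1
    return n
```

Using exact rationals sidesteps the massive cancellation between terms of opposite signs, which is presumably what made the sum so painful by hand. As a sanity check, the probability is exactly zero up to 17 draws, since 17 draws of 5 numbers cover at most 85 of them.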

The book covers many fascinating aspects, from the daily run of the Loterie, to the various measures (successfully) taken against fraud, to the survival during the Révolution and its extension through (the Napoleonic) Empire, to tests for fairness thanks to numerous data from almanacs, to the behaviour of bettors and the sale of “helping” books, to (Daniel) Bernoulli, Buffon, Condorcet, and Laplace modelling rewards and supporting decreasing marginal utility. Note that there are hardly any mathematical formulae, except for an appendix on the probabilities of wins and the returns, as well as Laplace’s (and Legendre’s) derivations. Which makes the book eminently suited for a large audience, all the more thanks to Stephen Stigler’s perfect style.

This (paperback) book is also very pleasantly designed by the University of Chicago Press, with a pleasant font (Adobe Caslon Pro) and a very nice cover involving Laplace undercover, taken from a painting owned by the author. The many reproductions of epoch documents are well-done and easily readable. And, needless to say given the scholarship of Stephen, the reference list is impressive.

The book is testament to the remarkable skills of Stephen, who searched for material over thirty years, from Parisian specialised booksellers to French, English, and American archives. He manages to bring into the story a wealth of connections and characters, as for instance Voltaire’s scheme to take advantage of an earlier French State lottery aimed at reimbursing State debtors. (Voltaire actually made a fortune of several million francs out of this poorly designed lottery.) For my personal instruction, the book also brought life to several Métro stations like Pereire and Duverney. But the book’s contents will prove fascinating way beyond Parisian locals and francophiles. Enjoy!

[Disclaimer about potential self-plagiarism: this post or an edited version will eventually appear in my Books Review section in CHANCE. As appropriate for a book about capitalising on chance beliefs!]
