## [The Art of] Regression and other stories

Posted in Books, R, Statistics, University life with tags , , , , , , , , , , , , , , , , , , , on July 23, 2020 by xi'an

CoI: Andrew sent me this new book [scheduled for 23 July on amazon] of his with Jennifer Hill and Aki Vehtari. Which I read in my garden over a few sunny morns. And as Andrew and Aki are good friends on mine, this review is definitely subjective and biased! Hence to take with a spoonful of salt.

The “other stories’ in the title is a very nice touch. And a clever idea. As the construction of regression models comes as a story to tell, from gathering and checking the data, to choosing the model specifications, to analysing the output and setting the safety lines on its interpretation and usages. I added “The Art of” in my own title as the exercise sounds very much like an art and very little like a technical or even less mathematical practice. Even though the call to the resident stat_glm R function is ubiquitous.

The style itself is very story-like, very far from a mathematical statistics book as, e.g., C.R. Rao’s Linear Statistical Inference and Its Applications. Or his earlier Linear Models which I got while drafted in the Navy. While this makes the “Stories” part most relevant, I also wonder how I could teach from this book to my own undergrad students without acquiring first (myself) the massive expertise represented by the opinions and advice on what is correct and what is not in constructing and analysing linear and generalised linear models. In the sense that I would find justifying or explaining opinionated sentences an amathematical challenge. On the other hand, it would make for a great remote course material, leading the students through the many chapters and letting them experiment with the code provided therein, creating new datasets and checking modelling assumptions. The debate between Bayesian and likelihood solutions is quite muted, with a recommendation for weakly informative priors superseded by the call for exploring the impact of one’s assumption. (Although the horseshoe prior makes an appearance, p.209!) The chapter on math and probability is somewhat superfluous as I hardly fathom a reader entering this book without a certain amount of math and stats background. (While the book warns about over-trusting bootstrap outcomes, I find the description in the Simulation chapter a wee bit too vague.) The final chapters about causal inference are quite impressive in their coverage but clearly require a significant amount of investment from the reader to truly ingest these 110 pages.

“One thing that can be confusing in statistics is that similar analyses can be performed in different ways.” (p.121)

Unsurprisingly, the authors warn the reader about simplistic and unquestioning usages of linear models and software, with a particularly strong warning about significance. (Remember Abandon Statistical Significance?!) And keep (rightly) arguing about the importance of fake data comparisons (although this can be overly confident at times). Great Chapter 11 on assumptions, diagnostics and model evaluation. And terrific Appendix B on 10 pieces of advice for improving one’s regression model. Although there are two or three pages on the topic, at the very end, I would have also appreciated a more balanced and constructive coverage of machine learning as it remains a form of regression, which can be evaluated by simulation of fake data and assessed by X validation, hence quite within the range of the book.

The document reads quite well, even pleasantly once one is over the shock at the limited amount of math formulas!, my only grumble being a terrible handwritten graph for building copters(Figure 1.9) and the numerous and sometimes gigantic square root symbols throughout the book. At a more meaningful level, it may feel as somewhat US centric, at least given the large fraction of examples dedicated to US elections. (Even though restating the precise predictions made by decent models on the eve of the 2016 election is worthwhile.) The Oscar for the best section title goes to “Cockroaches and the zero-inflated negative binomial model” (p.248)! But overall this is a very modern, stats centred, engaging and careful book on the most common tool of statistical modelling! More stories to come maybe?!

## Le Monde lacks data scientists!

Posted in Books, Statistics with tags , , , , , , , on July 11, 2017 by xi'an

In a paper in Le Monde today, a journalist is quite critical of statistical analyses of voting behaviours regressed on socio-economic patterns. Warning that correlation is not causation and so on and so forth…But the analysis of the votes as presented in the article is itself quite appalling! Just judging from the above graph, where the vertical and horizontal axes are somewhat inverted (as predicting the proportion of over 65 in the population from their votes does not seem that relevant), with an incomprehensible drop in the over 65 proportion within a district between the votes for the fascist party and the other ones, both indicators of an inversion of the axes!, where the curves are apparently derived from four points [correction at the end explaining they used the whole data collection to draw the curve],  where the variability in the curves is not opposed to the overall variability in the population, where more advanced tools than mere correlation are not broached upon, and so on… They should have asked Andrew. Or YouGov!

## gerrymandering detection by MCMC

Posted in Books, Statistics with tags , , , , , , , on June 16, 2017 by xi'an

In the latest issue of Nature I read (June 8), there is a rather long feature article on mathematical (and statistical) ways of measuring gerrymandering, that is the manipulation of the delimitations of a voting district toward improving the chances of a certain party. (The name comes from Elbridge Gerry (1812) and the salamander shape of the district he created.) The difficulty covered by the article is about detecting gerrymandering, which leads to the challenging and almost philosophical question of defining a “fair” partition of a region into voting districts, when those are not geographically induced. Since each partition does not break the principles of “one person, one vote” and of majority rule. Having a candidate or party win at the global level and loose at every local level seems to go against this majority rule, but with electoral systems like in the US, this frequently happens (with dire consequences in the latest elections). Just another illustration of Simpson’s paradox, essentially. And a damning drawback of multi-tiered electoral systems.

“In order to change the district boundaries, we use a Markov Chain Monte Carlo algorithm to produce about 24,000 random but reasonable redistrictings.”

In the arXiv paper that led to this Nature article (along with other studies), Bagiat et al. essentially construct a tail probability to assess how extreme the current district partition is against a theoretical distribution of such partitions. Finding that the actual redistrictings of 2012 and 2016 in North Carolina are “extremely atypical”.  (The generation of random partitions obeyed four rules, namely equal population, geographic compacity and connexity, proximity to county boundaries, and a majority of Afro-American voters in at least two districts, the latest being a requirement in North Carolina. A score function was built by linear combination of four corresponding scores, mostly χ² like, and turned into a density, simulated annealing style. The determination of the final temperature β=1 (p.18) [or equivalently of the weights (p.20)] remains unclear to me. As does the use of more than 10⁵ simulated annealing iterations to produce a single partition (p.18)…

From a broader perspective, agreeing on a method to produce random district allocations could be the way to go towards solving the judicial dilemma in setting new voting maps as what is currently under discussion in the US.