Archive for Python

python [book review]

Posted in Books, Kids, R on April 19, 2024 by xi'an

A fellow coder shared with me this recent manual (in French) entitled python (for the computer language, not the snake), written by Nathalie Azoulai, as he found it an interesting literary (if not computer) program. It parses rather quickly and I compiled it in one single run on my way to Bristol [Mecca of punched-card coders!] last week. The core idea of this manual is that of a middle-aged, clueless (about coding), single-mother writer engaging in an experiment to understand coding and coders. She does not succeed. And the manual also fails to implement said idea, at least for me. The writer calls upon a series of tutors to get into understanding coding and learning python, but she does not get much further than exiting the “hello world” instruction that seems to bug the early chapters of most programming manuals! It sounds as if the compiler (character or author) is mostly uninterested in the fundamental function of coding, namely that it delivers, that it is used and useful for solving real problems. Instead, the manual leans towards a massive and falsely naïve infodump on the history of coding, with very few surprises there, along with a gradually overwhelming personal introspection, frankly of very limited interest, on a past and ambiguous love (or sexual attraction) that returned too many error messages. The end of the manual is a predictable mise-en-abîme where the writer finds Chat-GPT unsurprisingly lacking in its ability to deliver a novel that goes beyond the (re)production of earlier codes, e.g., clichés and conventional situations. I would certainly not recommend this shallow, unrealistic, and inherently flawed python manual to anyone…

Since I mentioned ChatGPT, here is its output when repeatedly insisting on a computer manual style, but funnily missing some obvious programming synonyms:

Received from a fellow enthusiast, this recent manuscript by Nathalie Azoulai is touted as a captivating computational experiment. It swiftly parses, and I sequentially traversed its contents during my recent journey, processing it in one uninterrupted session. At its core, the narrative embarks on an intriguing quest to decrypt the intricacies of programming paradigms and the programmer’s mindset. However, its execution fails to trigger the desired engagement, encountering crashes and bugs along the way. Azoulai’s documentation chronicles her [character’s] endeavor to master Python, facilitated by a chain of mentors who, regrettably, encounter runtime errors in their attempts to elevate her beyond the rudimentary “hello world” commands—a fundamental subroutine in the lexicon of programming vernacular. The narrative raises questions about the intrinsic motivations driving one’s pursuit of programming prowess, encountering recursive loops of introspection. Rather than delving into the depths of coding intricacies, the manuscript deviates into an extensive discourse on the historical evolution of programming—a divergence that, while intellectually stimulating, detracts from the central narrative focus. Interspersed amidst this cerebral exploration are fragments of introspection, reflections of past affections interwoven delicately into the fabric of Azoulai’s discourse, yet they fail to meet the conditionals of relevance. Yet, as the narrative unfolds, it descends into a labyrinth of self-reflection, where the author grapples with the constraints of artificial intelligence—a poignant reflection on the interplay between creativity and the digital frontier, encountering exception handling challenges along the way. In conclusion, Azoulai’s manuscript presents an intriguing blend of literary and computational exploration, yet falls short of achieving transcendence. While it beckons towards the precipice of profound inquiry, its meandering narrative trajectory ultimately leaves the reader yearning for a more cohesive and resonant exploration of its thematic landscape, perhaps requiring debugging and refactoring for optimal performance.

ChatGPT also produced python code that I do not want to inflict in its entirety upon ‘Og’s readers…

combining normalizing flows and QMC

Posted in Books, Kids, Statistics on January 23, 2024 by xi'an

My PhD student Charly Andral [who presented at the mostly Monte Carlo seminar] arXived a new preprint yesterday, on training a normalizing flow network as an importance sampler (as in Gabrié et al.) or as an independent Metropolis proposal, and exploiting its invertibility to call upon quasi-Monte Carlo low-discrepancy sequences to boost its efficiency. (Training the flow is not covered by the paper.) This extends the recent study of He et al. (presented at MCM 2023 in Paris) to the normalising flow setting. In the current experiments, the randomized QMC samples are computed using the SciPy package (Roy et al. 2023), where the Sobol’ sequence is based on Joe and Kuo (2008), with scrambling based on Matoušek (1998), and where the Halton sequence is based on Owen (2017). (No pure QMC was harmed in the process!) The flows are constructed using the package FlowMC. As expected, the QMC version brings a significant improvement in the quality of the Monte Carlo approximations for equivalent computing times, with however a rapid decrease in efficiency as the dimension of the targeted distribution increases. On the other hand, the architecture of the flow appears to matter little. And the type of RQMC sequence makes a difference, the advantage apparently going to a scrambled Sobol’ sequence.
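To fix ideas, here is a minimal sketch (mine, not Charly’s code, and with a toy affine map standing in for a trained FlowMC flow) of how a scrambled Sobol’ sequence can drive a flow-based importance sampler, by pushing the RQMC uniforms through the inverse Gaussian cdf and then through the flow:

import numpy as np
from scipy.stats import norm, qmc

dim, n = 2, 2**12                        # powers of two suit Sobol' best
sobol = qmc.Sobol(d=dim, scramble=True)  # scrambling as in Matousek (1998)
u = sobol.random(n)                      # RQMC uniforms in (0,1)^dim
z = norm.ppf(u)                          # base Gaussian variates

# toy affine "flow": x = mu + sigma * z, a stand-in for the trained network
mu, sigma = np.array([1.0, -1.0]), np.array([2.0, 0.5])
x = mu + sigma * z
log_q = norm.logpdf(z).sum(axis=1) - np.log(sigma).sum()  # proposal log-density

# toy unnormalised target, a shifted Gaussian
log_pi = norm.logpdf(x, loc=0.5, scale=1.5).sum(axis=1)

w = np.exp(log_pi - log_q)               # importance weights
print(np.average(x[:, 0], weights=w))    # self-normalised IS estimate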

probably overthinking it [book review]

Posted in Books, Statistics, University life on December 13, 2023 by xi'an

Probably overthinking it, written by Allen B. Downey (the author of a series of books with titles starting with Think, like Think Python, Think Bayes, and Think Stats), belongs to the numerous collection of introductory books that aim at making statistics more palatable and enticing to the general public, by making the fundamental concepts more intuitive and building upon real-life examples. I would thus stop short of calling it an “essential guide”, as on the first flap of the dust jacket, since there exist many published books with a similar goal, some of which were actually reviewed here. Now, there are ideas and examples therein I could borrow for my introductory stats course, except that I will cease teaching it next year! For instance, there are lots of examples related to COVID, which is great to engage (enrage?) the readers.

The book is quite pleasant to read, does not shy away from mathematical formulae, and covers notions such as probability distributions and the Simpson, Preston, inspection, and Berkson paradoxes, with even some words on causality, sometimes at excessive length. (I have always belonged to the concise church when it comes to textbook examples and fear that the multiplication of illustrations of a given concept may prove counterproductive.) The early chapters are heavily focussed on the Gaussian (or Normal) distribution, making it appear essential for conducting a statistical analysis. When the Gaussian does not fit, as in the ELO example, the explanations for a correction are less convincing.

I appreciated the book’s approach to model fit via the comparison of empirical cdfs with hypothetical ones (see the sketch below). Also of primary interest is the systematic recourse to simulation, aka generative models, albeit without a proper systematic description. In the chapter about durations (Chap 5), I think there are missed opportunities, like the distributions of extremes (p 82) or the memoryless property of the Exponential distribution. Instead, the focus diverges slightly towards non-statistical issues in demography by the end of the chapter, with a potential for confusion between the Gompertz law and the Gompertz distribution. The Berkson paradox (Chap 6) is well explained in terms of non-random populations (and reminded me of when, years ago, we tried to predict the first-year success probability of undergrad applicants from their high-school maths grade, and the regression coefficient estimate ended up negative). Distributions of extremes do appear in Chap 8, even if seeking an ideal generic distribution again seems to me rather misguided and misguiding. I would also argue that the author is missing the point of Taleb’s black swans by arguing in favour of better modelling, when the latter argues against the very predictability of extreme events in a non-stationary financial world… The chapter on fairness and fallacy (Chap 9) is actually about false positive/negative rates in different populations and the ensuing unfairness (or the base rate fallacy). In that chapter there is no mention of Bayes (reserved for Think Bayes?!), but it hits hard enough at anti-vaxers (who will most likely not read the book). And does so again in the Simpson paradox chapter (Chap 10), the proliferation of such paradoxes being further stressed in the following chapter, on people becoming less racist or sexist or homophobic as they age, despite the proportion of racist/sexist/homophobic responses to a specific survey (GSS/Pew) increasing with age. This is prolonged into the rather minor final chapter.
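As an illustration of this ecdf-versus-model comparison (my own sketch, not Downey’s code), one can simulate a sample and measure the largest gap between its empirical cdf and the fitted Normal cdf:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
sample = rng.normal(loc=170.0, scale=8.0, size=500)  # say, heights in cm

xs = np.sort(sample)
ecdf = np.arange(1, len(xs) + 1) / len(xs)           # empirical cdf at xs
model = norm.cdf(xs, loc=sample.mean(), scale=sample.std())

print("max |ecdf - model|:", np.abs(ecdf - model).max())  # KS-type distance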

Now that I have read the book, during a balmy afternoon in St Kilda (after an early start in the train to De Gaulle airport in freezing temperatures), I am a bit uncertain about what to make of it in terms of impact on the general public. For sure, the stories that accumulate chapter after chapter are nice and well argued, while introducing useful statistical concepts, but I do not see readers coming out equipped enough to handle daily statistics with more than a healthy dose of scepticism, which obviously is a first step in the right direction!

Some nitpicking: the book misses the historical connection to Quetelet’s “average man” when referring to the notion. And a potential explanation for the (approximate) log-Gaussianity of the weights of individuals in a population through the fact that weight is a volume, hence a third power of a sort. Although birth weights are roughly Normal, which kills my argument. I remain puzzled by the title, possibly missing a cultural reference (as there are tee-shirts sold with this sentence). It is the same as the name of a blog run by the author since 2011 and fodder for the book. And the cover is terrible, breaking words to fit the width in a way that makes no sense, if I am not overthinking it! As often, the book is rather US-centric, although it makes no mention of the US having much higher infant death rates than countries with similar GDPs when these data are discussed.

[Disclaimer about potential self-plagiarism: this post or an edited version will eventually appear in my Books Review section in CHANCE.]

grasshoppers for pythons

Posted in Books, Kids, pictures, R, Statistics on May 18, 2023 by xi'an

Following my earlier post on the terrible performance of ChatGPT on a grasshopping riddle, Bob Carpenter sent me his interaction with GPT4, using the same entry. Here is the python code he obtained as outcome. The code runs and the answer is correct.

BC: Thanks. Can you write a python program to simulate the final position of the grasshopper to within machine tolerance using single-precision floating point arithmetic?

GPT4: Sure, here’s a simple Python program that simulates the grasshopper’s journey using single-precision floating point arithmetic, specifically the float32 data type from the numpy library. It continues until the jump size is less than the machine epsilon for float32, which is the smallest representable positive number such that 1.0 + eps != 1.0.

import numpy as np
import random

# Initialize position and jump size
position = np.float32(0.0)
jump = np.float32(0.5)

# Get machine epsilon for float32
eps = np.finfo(np.float32).eps

# Perform jumps until jump size is less than machine epsilon
while jump > eps:
    # Choose direction randomly
    direction = random.choice([-1, 1])
    # Update position
    position += direction * jump
    # Halve the jump size
    jump /= 2

print("Final position:", position)

Remember that this program only simulates a single instance of the grasshopper’s journey. If you ran it multiple times, you would get different results each time due to the randomness in the choice of direction. However, on average, you should expect the final positions to be around 0, as per our mathematical analysis earlier. The absolute distance from the origin, however, will vary and, on average, should be close to 1/2.

BC: can you write a program to calculate the expected length of the final position from the origin?

import numpy as np
import random

# Define the number of simulations
num_simulations = 100000

# Initialize an array to store the final positions
final_positions = np.zeros(num_simulations, dtype=np.float32)

# Perform each simulation
# Perform each simulation
for i in range(num_simulations):
    # Initialize position and jump size
    position = np.float32(0.0)
    jump = np.float32(0.5)

    # Get machine epsilon for float32
    eps = np.finfo(np.float32).eps

    # Perform jumps until jump size is less than machine epsilon
    while jump > eps:
        # Choose direction randomly
        direction = random.choice([-1, 1])
        # Update position
        position += direction * jump
        # Halve the jump size
        jump /= 2

    # Store the final position
    final_positions[i] = position

# Calculate the expected length of the final position from the origin
expected_length = np.mean(np.abs(final_positions))

print("Expected length of the final position from the origin:", expected_length)

The Effect [book review]

Posted in Books, R, Running, Statistics, University life on March 10, 2023 by xi'an

While it sounds like the title of a science-fiction catastrophe novel or of a (of course) convoluted nouveau roman, this book by Nick Huntington-Klein is a massive initiation to econometrics and causality, as explained by its subtitle, An Introduction to Research Design and Causality.

This is a hüûüge book, actually made of two parts that could have been books (volumes?) of their own. And covering three languages, R, Stata, and Python, which should have led to three independent books. (Seriously, why print three versions when you need at best one?!) I carried it with me during my vacations in Central Québec, but managed to lose my notes on the first part, which means missing the opportunity for biased quotes! It was mostly written during the COVID lockdown(s), which may explain a certain amount of verbosity and rambling around.

“My mom loved the first part of the book and she is allergic to statistics.”

The first half (which is in fact a third!) is conceptual (and chatty) and almost formula-free, based on the postulate that “it’s a pretty slim portion of students who understand a method because of an equation” (p.xxii). For this reader (or rather reviewer), the reliance on explanations through examples makes the reading much harder, as spotting the main point gets harder (and requires reading most sentences!). And the start is very slow, since notations and mathematical notions have to be introduced with an excess of caution (as in the distinction between Latin and Greek symbols, p.36). The part then moves through single-variable models, conditional distributions, a lengthy explanation of how OLS estimates are derived (see the sketch below), data generating processes and the identification (of causes), causal diagrams, back and front doors (a recurrent notion within the book), and treatment effects, before a conclusion chapter.
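Since the OLS derivation takes so many pages in the book, here is a minimal reminder (my own sketch, not the book’s code) that it boils down to solving the normal equations X′Xβ = X′y:

import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=200)
y = 1.5 + 0.7 * x + rng.normal(scale=2.0, size=200)  # true (1.5, 0.7)

X = np.column_stack([np.ones_like(x), x])  # design matrix with intercept
beta = np.linalg.solve(X.T @ X, X.T @ y)   # closed-form OLS estimate
print("intercept, slope:", beta)           # close to (1.5, 0.7)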

“Unlike statistical research, which is completely made of things that are at least slightly false, statistics itself is almost entirely true.” (p.327)

The second part, called the Toolbox, is closer to a classical introduction to econometrics, albeit with a shortage of mathematics (and no proof whatsoever), although [warning!] logarithms, polynomials, partial derivatives, and matrices are used. Along with a substantial (3x) chunk allocated to printed code, the density of the footnotes significantly increases in this section. It covers an extensive chapter on regression (including testing practice, non-linear and generalised linear models, basic bootstrap without much warning about its use in… regression settings, and LASSO), one on matching (with propensity scores, kernel weighting, and Mahalanobis weighting), one on simulation, yes simulation!, in the sense of producing pseudo-data from known generating processes to check methods, as well as bootstrap (with resampling of residuals making at last an appearance!), and one on fixed and random effects (where the author “feels the presence of Andrew Gelman reaching through time and space to disagree”, p.405). The chapter on event studies is about time-dependent data, with a bit of ARIMA prediction (but nothing on non-stationary series and unit root issues). The more exotic chapters cover (18) difference-in-differences models (control vs. treated groups, with John Snow pumping his way in), (19) instrumental variables (aka the minor bane of my 1980’s econometrics courses), with two-stage least squares (sketched below) and the generalised method of moments (if not the simulated version), (20) discontinuity (i.e., changepoints), with the limitation of having a single variate explaining the change, rather than an unknown combination of them, and a rather pedestrian approach to the issue, and (21) other methods (including the first mention of machine learning regression/prediction and some causal forests), concluding with an “Under the rug” portmanteau.
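Since two-stage least squares can be stated in a few lines, here is a minimal sketch (a toy of mine, not the book’s code) of how instrumenting removes the confounding bias that plagues the naive regression:

import numpy as np

rng = np.random.default_rng(0)
n = 10_000
u = rng.normal(size=n)                      # unobserved confounder
z = rng.normal(size=n)                      # instrument: shifts x only
x = 0.8 * z + u + rng.normal(size=n)        # endogenous regressor
y = 2.0 * x + 3.0 * u + rng.normal(size=n)  # true causal slope: 2

X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), z])

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]  # naive OLS, biased by u

# stage 1: project x on the instrument; stage 2: regress y on the fit
x_hat = Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
X_hat = np.column_stack([np.ones(n), x_hat])
beta_2sls = np.linalg.lstsq(X_hat, y, rcond=None)[0]

print("OLS slope :", beta_ols[1])   # noticeably above 2
print("2SLS slope:", beta_2sls[1])  # close to 2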

Nothing (afaict) on multivariate regressed variates and simultaneous equations. And hardly an occurrence of Bayesian modelling (p.581), vague enough to remind me of my first statistics course and its one-line annihilation of the notion.

Duh cover, but nice edition, except for the huge margins that could have been cut to reduce the 622 pages by a third (and curbed the tendency of the author towards excessive footnotes!). And an unintentional white line on p.238! Cute and vaguely connected little drawings at the head of every chapter (like the head above). A rather terse subject index (except for the entry “The first reader to spot this wins ten bucks”!), which should have been completed with an acronym index.

“Calculus-heads will recognize all of this as taking integrals of the density curve. Did you know there’s calculus hidden inside statistics? The things your professor won’t tell you until it’s too late to drop the class.”

Obviously I am biased in that I cannot negatively comment on an author running a 5:37 mile as, by now, I am far from the 5:15 of yester decades! I am just a wee bit suspicious of the reported time, however, given that it appears exactly on page 537… (And I could have clearly taken issue with his 2014 paper, Is Robert anti-teacher? Or with the populist catering to anti-math attitudes, like the one above, found in a footnote!) But I enjoyed reading the conceptual chapter on causality, as well as the (more) technical chapter on instrumental variables (a notion I have consistently found confusing all the [long] way from graduate school). And while repeated references are made to Scott Cunningham’s Causal Inference: The Mixtape, I think I will stop there with 500⁺ page introductory econometrics books!

[Disclaimer about potential self-plagiarism: this post or an edited version will potentially appear in my Books Review section in CHANCE.]