Archive for data

Number savvy [book review]

Posted in Books, Statistics on March 31, 2023 by xi'an

“This book aspires to contribute to overall numeracy through a tour de force presentation of the production, use, and evolution of data.”

Number Savvy: From the Invention of Numbers to the Future of Data is written by George Sciadas, a statistician working at Statistics Canada. This book is mostly about data, even though it starts with the “compulsory” tour of the invention(s) of numbers and the evolution towards a mostly universal system, along with the issue of measurements (with a funny if illogical/anti-geographical confusion in “gare du midi in Paris and gare du Nord in Brussels”, since Gare du Midi (south) is in Brussels while Gare du Nord (north) is in Paris). The chapter on census and demography (Chap. 3) is quite detailed about the hurdles preventing an exact count of a population, but much less so about the methods employed to improve the estimation. (The request for me to fill in the short form for the 2023 French Census actually came while I was reading the book!)

The next chapter links measurement with socio-economic notions or models, like the unemployment rate, which depends on so many criteria (pp. 77-87) that its measurement sounds impossible or arbitrary. Almost as arbitrary as the reported number of protesters in a French demonstration! The same difficulty arises with the GDP, whose interpretation seems beyond the grasp of the common reader, and which does not cover significant missing(-not-at-random) data like tax evasion, money laundering, and the grey economy. (Nitpicking: if GDP went down by 0.5% one year and up by 0.5% the year after, this does not exactly compensate, since 0.995 × 1.005 = 0.999975 < 1!) Chapter 5 reflects upon the importance of definitions and boundaries in creating official statistics and categorical data. A chapter (Chap. 6) on the gathering of data in the past (read: prior to the “Big Data” explosion) prepares the ground for the chapter on the current setting. It is mostly about surveys, presented as definitely belonging to the past, “shadows of their old selves”, with anecdotes reminding me of my only experience as a survey interviewer (on Xmas practices!), and about administrative data, progressively moving from being collected by design to being available for any prospection (or “farming”). A short chapter compared with the one (Chap. 7) on new data (types), mostly customer and private-sector data, covering the data accumulated by big tech companies, but not particularly illuminating (with bar-room remarks like “Facebook users tend to portray their lives as they would like them to be. Google searches may reflect more truthfully what people are looking for.”)

The following Chapter 8 is somewhat confusing in its defence of microdata, by which I understand keeping the raw data rather than averaging through summary statistics. Synthetic data is mentioned there, but without reference to any underlying model, while machine learning makes a very brief appearance (p. 222). In Chapter 9, (statistical) data analysis is [at last!] examined, although mostly through descriptive statistics, except for a regression model and a discussion of the issues around hypothesis testing, where Bayesian testing makes its unique visit, albeit confusedly, in-between references to Taleb’s Black Swan, Gödel’s incompleteness theorem (which always seems to fascinate authors of general-public science books!), and Kahneman and Tversky’s prospect theory. Somewhat surprisingly, the chapter also includes a Taoist tale about the farmer getting in turn lucky and unlucky… A tale that was already used in What Are the Chances?, which I reviewed two years ago. As this is a very established parable dating back at least to the 2nd century B.C., there is no copyright involved, but what are the chances the story finds its way that quickly into another book?!

The final chapter is about the future, unsurprisingly, with predictions of “plenty of black boxes”, “statistical lawlessness”, “data pooling”, and data as a commodity (which relates to some themes of our OCEAN ERC-Synergy grant). The solution favoured by the author is however centralised, through a (national) statistics office or another “trusted third party”. The last section is about the predicted end of theory, since “simply looking at data can reveal patterns”, while resisting the prophets of doom and idealising the Rise of the (AI) machines… The lyrical conclusion that “With both production consolidation and use of data increasingly in the ‘hands’ of machines, and our wise interventions, the more distant future will bring complete integrations” sounds too much like Brave New World for my taste!

“…the privacy argument is weak, if not hypocritical. Logically, it’s hard to fathom what data that we share with an online retailer or a delivery company we wouldn’t share with others (…) A naysayer will say nay.” (p.190)

The way the book reads and unrolls is somewhat puzzling to this reader, as it sounds like a sequence of common-sense remarks with a Guesstimation flavour on the side, plus tiny historical or technical facts, some unknown to me and most of no interest to me, while lacking a larger picture. For instance, the long-winded tale of evaluating the cumulated size of a neighbourhood’s lawns (pp. 34-38) does not seem to be getting anywhere. The inclusion of so many warnings, misgivings, and alternatives in the collection and definition of data may have the counter-effect of discouraging readers from making sense of numeric concepts and from trusting the conclusions of data-based analyses. The constant switch in perspective(s) and the apparent absence of definite conclusions are also exhausting. Furthermore, I feel that the author and his rosy prospects repeatedly minimise the risks of data collection for individual privacy and freedom, when presenting the platforms as a solution to a real-time census (as, e.g., p. 178), as exemplified by the high social control exercised by some number-savvy dictatorships! And he is highly critical of EU regulations such as GDPR, “less-than-subtle” (p. 267), “with its huge impact on businesses” (p. 268). I am thus overall uncertain which audience this book will eventually reach.

[Disclaimer about potential self-plagiarism: this post or an edited version will potentially appear in my Books Review section in CHANCE.]

the biggest bluff [not a book review]

Posted in Books on August 14, 2020 by xi'an

It came as a surprise to me that the book reviewed in the book review section of Nature of 25 June was a personal account of a professional poker player, The Biggest Bluff by Maria Konnikova. (Surprise enough to write a blog entry!) Indeed, I see very little scientific impetus in studying the psychology of poker players and the associated decision making. Obviously, this is not a book review, but a review of the book review. (Although the NYT published a rather extensive extract of the book, from which I cannot detect anything deep from a game-theory viewpoint. Apart from the maybe-not-so-deep message that psychology matters a lot in poker…) Which does not bring much incentive for those uninterested (or worse) in money games like poker. Even when “a heap of Bayesian model-building [is] thrown in”, as the review mixes randomness and luck, while seeing the book as teaching the reader “how to play the game of life”, a type of self-improvement sales line one hardly expects to read in a scientific journal. (But again I have never understood the point of playing poker…)

politics coming [too close to] statistics [or the reverse]

Posted in Books, pictures, Statistics, University life on May 9, 2020 by xi'an

On 30 April, David Spiegelhalter wrote an opinion column in The Guardian, Coronavirus deaths: how does Britain compare with other countries?, where he pointed out the difficulty, even “for a bean-counting statistician to count deaths”, as the reported figures are undercounts, and stated that “many feel that excess deaths give a truer picture of the impact of an epidemic”. Which, as an aside, I indeed believe to be more objective material, as also reported by INSEE and INED in France.

“…my cold, statistical approach is to wait until the end of the year, and the years after that, when we can count the excess deaths. Until then, this grim contest won’t produce any league tables we can rely on.” D. Spiegelhalter

My understanding of the column is that the quick accumulation of raw numbers, even of deaths, and their use in comparisons of procedures and countries does not help in understanding the impact of policies and of the actions and reactions from a week earlier. Starting with the delays in reporting death certificates, as again illustrated by the ten-day lag in the INSEE reports. And accounting for covariates such as population density, and economic and health indicators. (The graph below for instance relies on deaths so far attributed to COVID-19 rather than on excess deaths, while these attributions depend on the country’s policy and the capacities of its official statistics.)
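To make the excess-deaths notion concrete, here is a minimal sketch in Python of the usual computation, observed all-cause deaths minus a baseline averaged over the same weeks of previous years; all the weekly counts below are entirely made up, not actual INSEE data.

```python
# Toy illustration of excess deaths: observed deaths minus a baseline
# built from the same calendar weeks in previous years (numbers invented).
import statistics

# hypothetical weekly all-cause deaths for the same four weeks, 2015-2019
past_years = [
    [11800, 11950, 12100, 12000],  # 2015
    [12000, 12200, 12150, 12300],  # 2016
    [11900, 12050, 12250, 12100],  # 2017
    [12100, 12300, 12400, 12200],  # 2018
    [12050, 12150, 12300, 12250],  # 2019
]
observed_2020 = [13500, 16200, 18900, 17400]  # hypothetical epidemic weeks

# baseline: average deaths for each week across the previous years
baseline = [statistics.mean(week) for week in zip(*past_years)]
excess = [obs - base for obs, base in zip(observed_2020, baseline)]

for week, (obs, base, exc) in enumerate(zip(observed_2020, baseline, excess), 1):
    print(f"week {week}: observed {obs}, baseline {base:.0f}, excess {exc:+.0f}")
print(f"total excess deaths over the period: {sum(excess):+.0f}")
```

The point of the toy numbers is only to show that the computation bypasses the cause-of-death attribution issue altogether, which is precisely why excess deaths are deemed a truer picture.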

“Polite request to PM and others: please stop using my Guardian article to claim we cannot make any international comparisons yet. I refer only to detailed league tables—of course we should now use other countries to try and learn why our numbers are high.” D. Spiegelhalter

However, when on 6 May Boris Johnson used this Guardian article during prime minister’s questions in the UK Parliament to defuse a question from the Labour leader, Keir Starmer, David Spiegelhalter reacted with the above tweet, the point being that even with poor and undercounted data, the toll is much worse than predicted by the earlier models and deadlier than in neighbouring countries. Anyway, three other fellow statisticians, Phil Brown, Jim Smith (Warwick), and Henry Wynn, also reacted to David’s column by complaining about the lack of statistical modelling behind it and the fatalistic message it carries, advocating for model-based decision-making, which would be fine if the data were not so unreliable… or if the proposed models were equipped with uncertainty bumpers accounting for misspecification and erroneous data.

data is everywhere

Posted in Kids, pictures, Statistics, University life on November 25, 2018 by xi'an

agent-based models

Posted in Books, pictures, Statistics on October 2, 2018 by xi'an

An August issue of Nature I recently browsed [on my NUS trip] contained a news feature on agent-based models applied to understanding the opioid crisis in the US. (With a rather sordid picture of a drug injection in Philadelphia, hence my own picture.)

To create an agent-based model, researchers first ‘build’ a virtual town or region, sometimes based on a real place, including buildings such as schools and food shops. They then populate it with agents, using census data to give each one its own characteristics, such as age, race and income, and to distribute the agents throughout the virtual town. The agents are autonomous but operate within pre-programmed routines — going to work five times a week, for instance. Some behaviours may be more random, such as a 5% chance per day of skipping work, or a 50% chance of meeting a certain person in the agent’s network. Once the system is as realistic as possible, the researchers introduce a variable such as a flu virus, with a rate and pattern of spread based on its real-life characteristics. They then run the simulation to test how the agents’ behaviour shifts when a school is closed or a vaccination campaign is started, repeating it thousands of times to determine the likelihood of different outcomes.
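To make the recipe above concrete, here is a deliberately tiny sketch of such a simulation in Python; the 5% work-skipping and 50% contact probabilities come from the quote, but every other size and rate is invented, and it bears no relation to the actual models discussed in the article.

```python
# Minimal agent-based epidemic sketch following the recipe above:
# build agents, give them routines with random deviations, seed an
# infection, and repeat the run to estimate outcome frequencies.
import random

N_AGENTS, N_DAYS, N_RUNS = 500, 60, 200
P_SKIP_WORK = 0.05    # daily chance an agent skips work (from the quote)
P_MEET = 0.5          # chance of meeting a given network contact (from the quote)
P_TRANSMIT = 0.1      # per-meeting transmission probability (invented)
INFECTIOUS_DAYS = 7   # duration of infectiousness (invented)

def run_once(close_school: bool) -> int:
    # days of infectiousness left per agent; 0 = susceptible or recovered
    infected = [0] * N_AGENTS
    # each agent has a small fixed social network
    network = [random.sample(range(N_AGENTS), 5) for _ in range(N_AGENTS)]
    infected[0] = INFECTIOUS_DAYS  # seed one case
    ever_infected = {0}            # recovered agents stay immune
    for _ in range(N_DAYS):
        new_cases = []
        for a in range(N_AGENTS):
            if infected[a] == 0:
                continue
            # workplace/school mixing adds contacts, unless closed or skipped
            goes_to_work = (not close_school) and random.random() > P_SKIP_WORK
            contacts = network[a] + (
                random.sample(range(N_AGENTS), 5) if goes_to_work else [])
            for b in contacts:
                if b not in ever_infected and random.random() < P_MEET * P_TRANSMIT:
                    new_cases.append(b)
        for a in range(N_AGENTS):
            if infected[a] > 0:
                infected[a] -= 1   # progress towards recovery
        for b in new_cases:
            infected[b] = INFECTIOUS_DAYS
            ever_infected.add(b)
    return len(ever_infected)

for closed in (False, True):
    sizes = [run_once(closed) for _ in range(N_RUNS)]
    print(f"school closed: {closed}, mean outbreak size: {sum(sizes)/N_RUNS:.1f}")
```

Averaging the outbreak sizes over repeated runs, with and without the closure, mimics the “repeating it thousands of times” step of the recipe, if on a far smaller scale.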

While I am obviously supportive of simulation-based solutions, I cannot but express some reservations about the outcome, given that it is the product of the assumptions in the model. In Bayesian terms, this is purely prior predictive rather than posterior predictive. There is no hard data to create “realism”, apart from the census data. (The article also mixes the outcome of the simulation with real data, or rather with epidemiological data, not yet available according to the authors.)
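For readers unfamiliar with the distinction, here is a toy Beta-Binomial contrast (all numbers invented): the prior predictive distribution simulates outcomes without conditioning on any observation, which is the agent-based situation above, while the posterior predictive first updates the parameters with hard data.

```python
# Toy contrast between prior predictive and posterior predictive simulation,
# using a conjugate Beta-Binomial model; all numbers are made up.
import random

n_trials = 100                 # hypothetical number of observations
observed_successes = 63        # hypothetical "hard data"
prior_a, prior_b = 1.0, 1.0    # flat Beta(1,1) prior on the success rate

def predictive_mean(a: float, b: float, n_sims: int = 5000) -> float:
    # draw a rate from Beta(a, b), then a replicated count of successes
    total = 0
    for _ in range(n_sims):
        theta = random.betavariate(a, b)
        total += sum(random.random() < theta for _ in range(n_trials))
    return total / n_sims

# prior predictive: no conditioning on data (the agent-based situation)
print("prior predictive mean count:", predictive_mean(prior_a, prior_b))
# posterior predictive: conjugacy updates Beta(a, b) to Beta(a+s, b+n-s)
print("posterior predictive mean count:",
      predictive_mean(prior_a + observed_successes,
                      prior_b + n_trials - observed_successes))
```

The prior predictive mean hovers around 50 whatever the data say, while the posterior predictive concentrates near the observed 63: only the latter has been confronted with reality.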

In response to the opioid epidemic, Bobashev’s group has constructed Pain Town — a generic city complete with 10,000 people suffering from chronic pain, 70 drug dealers, 30 doctors, 10 emergency rooms and 10 pharmacies. The researchers run the model over five simulated years, recording how the situation changes each virtual day.
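Just to fix ideas, the headline numbers from this quote fit in a few lines of configuration; needless to say, the real model is far richer than this placeholder.

```python
# The Pain Town headline numbers from the quote, gathered into a small
# configuration object; the actual model is of course far more detailed.
from dataclasses import dataclass

@dataclass
class PainTownConfig:
    chronic_pain_patients: int = 10_000
    drug_dealers: int = 70
    doctors: int = 30
    emergency_rooms: int = 10
    pharmacies: int = 10
    simulated_years: int = 5

config = PainTownConfig()
simulated_days = config.simulated_years * 365  # the model records each virtual day
print(f"running {simulated_days} virtual days over "
      f"{config.chronic_pain_patients} patients")
```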

This is not to criticise the use of such tools to experiment with social, medical or political interventions, which practically and ethically cannot be tested in real life; working with such targeted versions of the Sims game can paradoxically be more convincing when dealing with policy makers, provided they do not object to the artificiality of the outcome, as they often do for climate-change models. Just from reading this general-public article, I thus wonder whether model selection and validation tools are implemented in conjunction with agent-based models…
