Archive for Annals of Applied Statistics

how many migrants died trying to reach Europe?

Posted in Books, Kids, pictures, Statistics, Travel, University life on February 6, 2024 by xi'an

“We estimated that about 40,000 human beings have died trying to enter the European Union, during about 5500 tragic attempts, in the period between January 1993 and March 2019.”

A paper by Alessio Farcomeni, published last year in the Annals of Applied Statistics [which is delivered to me by regular mail, but which I missed], is about estimating the number of deaths on migration routes. I have been interested in this question for moral and statistical reasons since the beginning of the Syrian civil war and the induced massive increase in the number of people attempting to cross the Mediterranean Sea to reach Europe. Unfortunately, despite different attempts to contact governmental agencies (like the French Minister of the Interior), NGOs (like Amnesty), friends, academics (incl. a Dutch group on that very topic but with no data or data scientist), and connections (with the Italian Navy, the Tunisian government and Frontex), I could not access more than newspaper-level data or highly local data that did not allow for a general picture.


“The law of large numbers guarantees consistency as long as the population size estimator is consistent and the model is well-specified.”

As was my intention, the paper uses capture-recapture. The data is obtained from UNITED for Intercultural Action, which recorded 4333 attempts to enter Europe with at least one death, between January 1993 and March 2019, as reported by one or several sources, like newspaper articles. These sources are most often not independent, but rather copy one another, which is obviously problematic for the extrapolation to the unreported cases and for resorting to capture-recapture estimation. And the number of deaths per event is itself most likely inexact, since casualties are not always recovered or identified as such. Since migration routes, migrant flows, and smuggler policies keep massively changing, the homogeneity of the observations over nearly 30 years is low to nonexistent, which makes invoking consistency rather inappropriate. This is also the reason why I find the approach followed by the paper too strongly model-based, as for instance when relying on a Horvitz-Thompson estimator or using a GLM to link the number of deaths in one crossing and the number of sources reporting the tragedy. This fundamental difficulty in modelling or inferring from such unreliable and untrustworthy data sources, and the absence of record linkage with other datasets like the entries in the border countries (e.g., Turkey) or the number of crossings prevented by the local coastguards, alas make the final estimate of 40,000 deaths at sea close to impossible to calibrate from a model-free perspective. The actual figure is not only higher, but maybe considerably so. Unfortunately, the datasets that would allow linkage and recapture are unavailable or nonexistent in departure countries, while arrival countries mostly abstain from storing data about the migrant flows and histories…
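To make the capture-recapture logic concrete, here is a minimal sketch of the textbook version, which is not the model used in the paper. It assumes that the number of sources reporting an event is Poisson with a common rate, that events with zero sources are the unobserved ones, and that sources report independently, which are precisely the homogeneity and independence assumptions questioned above. The function names and the frequency counts are made up for illustration.

```python
import math


def fit_truncated_poisson(freq):
    """MLE of the Poisson rate from zero-truncated counts.

    freq maps k (number of sources reporting an event, k >= 1)
    to the number of events reported by exactly k sources.
    """
    n = sum(freq.values())
    mean_k = sum(k * c for k, c in freq.items()) / n
    if mean_k <= 1.0:
        raise ValueError("need some events reported by more than one source")

    # The MLE solves lam / (1 - exp(-lam)) = mean_k; the left-hand side is
    # increasing in lam and exceeds lam, so the root lies in (0, mean_k).
    def truncated_mean(lam):
        return lam / -math.expm1(-lam)

    lo, hi = 1e-9, mean_k
    for _ in range(200):  # plain bisection
        mid = 0.5 * (lo + hi)
        if truncated_mean(mid) < mean_k:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def horvitz_thompson_size(freq):
    """Horvitz-Thompson estimate: each observed event is weighted by
    1 / P(reported by at least one source)."""
    lam = fit_truncated_poisson(freq)
    p_seen = -math.expm1(-lam)  # 1 - exp(-lam)
    return sum(freq.values()) / p_seen


def chao_lower_bound(freq):
    """Chao's lower bound n + f1^2 / (2 f2), less reliant on the Poisson shape."""
    f1, f2 = freq.get(1, 0), freq.get(2, 0)
    if f2 == 0:
        raise ValueError("Chao's bound needs events reported by exactly two sources")
    return sum(freq.values()) + f1 * f1 / (2 * f2)


if __name__ == "__main__":
    # Hypothetical counts, NOT the UNITED data.
    toy = {1: 300, 2: 120, 3: 40, 4: 10}
    print(horvitz_thompson_size(toy), chao_lower_bound(toy))
```

When sources copy one another, events reported once tend to be reported several times, which inflates the apparent detection probability and pulls such estimates down towards the observed count.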

folded Normals

Posted in Books, Kids, pictures, R, Running, Statistics on February 25, 2021 by xi'an

While having breakfast (after an early morn swim at the vintage La Butte aux Cailles pool, which let me in free!), I noticed a letter to the Editor in the Annals of Applied Statistics, which I was unaware existed. (The concept, not this specific letter!) The point of the letter was to indicate that finding the MLE for the mean and variance of a folded normal distribution was feasible without resorting to the EM algorithm. Since the folded normal distribution is a special case of a mixture (with fixed weights), using EM is indeed quite natural, but the author, Iain MacDonald, remarked that an optimiser such as R's nlm() could be called instead. The few lines of relevant R code were even included. While this is a correct if minor remark, I am a wee bit surprised at seeing it included in the journal, all the more because the authors of the original paper using the EM approach were given the opportunity to respond, noticing EM is much faster than nlm in the cases they tested, and Iain MacDonald had a further rejoinder! And all the more because the Wikipedia page mentioned the use of optimisers much earlier (and pointed to the R package Rfast as producing MLEs for the distribution).
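For readers who want to see what is at stake, here is a minimal Python sketch (not the R code of the letter, nor the code of the original paper) of the folded normal log-likelihood and of one EM update, treating the lost sign as the missing data. The function names and the simulated sample are my own assumptions. A generic optimiser in the spirit of nlm() would simply be applied to the negative of `folded_normal_loglik`.

```python
import math
import random


def folded_normal_loglik(xs, mu, sigma):
    """Log-likelihood of x = |y| with y ~ N(mu, sigma^2)."""
    total = 0.0
    for x in xs:
        a = -0.5 * ((x - mu) / sigma) ** 2
        b = -0.5 * ((x + mu) / sigma) ** 2
        m = max(a, b)  # log-sum-exp for numerical stability
        total += m + math.log(math.exp(a - m) + math.exp(b - m))
    return total - len(xs) * math.log(sigma * math.sqrt(2.0 * math.pi))


def em_step(xs, mu, sigma):
    """One EM update. Given x, the sign of y is +1 with probability w,
    and 2w - 1 = tanh(mu * x / sigma^2), hence E[y | x] = x * tanh(...)."""
    n = len(xs)
    mu_new = sum(x * math.tanh(mu * x / sigma ** 2) for x in xs) / n
    var_new = sum(x * x for x in xs) / n - mu_new ** 2
    return mu_new, math.sqrt(var_new)


def fit_em(xs, max_iter=500, tol=1e-10):
    """Run EM from moment-based starting values; return (mu, sigma, trace)."""
    n = len(xs)
    mu = sum(xs) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / n)
    trace = [folded_normal_loglik(xs, mu, sigma)]
    for _ in range(max_iter):
        mu_next, sigma_next = em_step(xs, mu, sigma)
        converged = abs(mu_next - mu) + abs(sigma_next - sigma) < tol
        mu, sigma = mu_next, sigma_next
        trace.append(folded_normal_loglik(xs, mu, sigma))
        if converged:
            break
    return mu, sigma, trace


if __name__ == "__main__":
    random.seed(1)
    # Simulated sample with mu = 2 and sigma = 1 (illustrative values).
    sample = [abs(random.gauss(2.0, 1.0)) for _ in range(2000)]
    mu_hat, sigma_hat, trace = fit_em(sample)
    print(mu_hat, sigma_hat, trace[-1])
```

The sign of the mean is not identifiable from folded data, so the sketch starts from a positive value and stays there.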

Xmas tree at UCL, with a special gift

Posted in Books, pictures, Statistics, Travel, University life on November 26, 2019 by xi'an

Ph.D. students at UCL Statistics have made this Xmas tree out of bound and unbound volumes of statistics journals, not too hard to spot (especially the Current Indexes which I abandoned when I left my INSEE office a few years ago). An invisible present under the tree is the opening of several positions, namely two permanent lectureships and two three-year research fellowships, all in Statistics or Applied Probability, with the fellowship deadline being the 1st of December 2019!

running shoes

Posted in Books, Running, Statistics on August 12, 2018 by xi'an

A few days ago, when back from my morning run, I spotted a NYT article on Nike shoes that are supposed to bring on average a 4% gain in speed. Meaning for instance a 3 to 4 minute gain in a half-marathon.

“Using public race reports and shoe records from Strava, a fitness app that calls itself the social network for athletes, The Times found that runners in Vaporflys ran 3 to 4 percent faster than similar runners wearing other shoes, and more than 1 percent faster than the next-fastest racing shoe.”

What is interesting in this NYT article is that the two journalists who wrote it analysed their own data, taken from Strava, using a statistical model or models (linear regression? non-linear regression? neural net?) to predict the impact of the shoe make, against “all” other factors contributing to the overall time or position or percentage gain or yet something else. In most analyses produced in the NYT article, the 4% gain is reproduced (with a 2% gain for female shoe switchers and a 7% gain for slow runners).

“Of course, these observations do not constitute a randomized control trial. Runners choose to wear Vaporflys; they are not randomly assigned them. One statistical approach that seeks to address this uses something called propensity scores, which attempt to control for the likelihood that someone wears the shoes in the first place. We tried this, too. Our estimates didn’t change.”

The statistical analysis (or analyses) seems rather thorough, from what is reported in the NYT article, with several attempts at controlling for confounders. Still, the data itself is observational, even if providing a lot of variables to run the analyses, as it only covers runners using Strava (from 5% in Tokyo to 25% in London!) and indicating the type of shoes they wear during the race. There is also the issue that the shoes are quite expensive, at $250 a pair, especially if the effect wears out after 100 miles (this was not tested in the study), as I would hesitate to use them unless the race conditions look optimal (and they never do!). There is certainly a new shoes effect on top of that, between the real impact of a better response and a placebo effect. As shown by a similar effect of many other shoe makes. Hence, a moderating impact on the NYT conclusion that these Nike Vaporflys (flies?!) are an “outlier”. But nonetheless a fairly elaborate and careful statistical study that could potentially make it to a top journal like Annals of Applied Statistics!
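As an illustration of what a propensity score adjustment does, here is a toy sketch that has nothing to do with the actual NYT analysis or the Strava data. It assumes a single discrete confounder (a hypothetical "fast" versus "slow" runner label), in which case the propensity score is just the within-group frequency of shoe wearers, and it uses inverse probability weighting with normalised weights. All names and numbers are invented.

```python
from collections import defaultdict


def naive_effect(records):
    """Difference of raw mean outcomes, wearers minus non-wearers."""
    treated = [y for _, t, y in records if t]
    control = [y for _, t, y in records if not t]
    return sum(treated) / len(treated) - sum(control) / len(control)


def ipw_effect(records):
    """Inverse-probability-weighted difference in mean outcomes.

    records is a list of (stratum, treated, outcome) triples. The propensity
    score is estimated by the share of treated units within each stratum,
    which requires both treated and control units in every stratum.
    """
    counts = defaultdict(lambda: [0, 0])  # stratum -> [n_treated, n_total]
    for stratum, treated, _ in records:
        counts[stratum][0] += int(treated)
        counts[stratum][1] += 1
    propensity = {s: n_t / n for s, (n_t, n) in counts.items()}

    # For each arm: [weighted outcome sum, sum of weights]
    sums = {True: [0.0, 0.0], False: [0.0, 0.0]}
    for stratum, treated, y in records:
        p = propensity[stratum] if treated else 1.0 - propensity[stratum]
        weight = 1.0 / p
        sums[bool(treated)][0] += weight * y
        sums[bool(treated)][1] += weight
    return sums[True][0] / sums[True][1] - sums[False][0] / sums[False][1]


if __name__ == "__main__":
    # Invented half-marathon times in minutes: the shoe cuts 4% off every time,
    # but fast runners are much more likely to wear it.
    toy = (
        [("fast", True, 90 * 0.96)] * 80
        + [("fast", False, 90.0)] * 20
        + [("slow", True, 120 * 0.96)] * 20
        + [("slow", False, 120.0)] * 80
    )
    print(naive_effect(toy), ipw_effect(toy))
```

In this noiseless toy case the raw comparison mostly measures who chooses the shoes, while the weighted one recovers the built-in 4% gain, but only because the single confounder is observed, which is never guaranteed with observational data.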

coauthorship and citation networks

Posted in Books, pictures, R, Statistics, University life on February 21, 2017 by xi'an

As I discovered (!) the Annals of Applied Statistics in my mailbox just prior to taking the local train to Dauphine for the first time in 2017 (!), I started reading it on the way, but did not get any further than the first discussion paper by Pengsheng Ji and Jiashun Jin on coauthorship and citation networks for statisticians. I found the whole exercise intriguing, I must confess, with little to support a whole discussion on the topic. I may have read the paper too superficially as a métro pastime, but to me it sounded more like a post-hoc analysis than a statistical exercise, something like looking at the network or rather at the output of a software representing networks and making sense of clumps and sub-networks a posteriori. (In a way this reminded me of my first SAS project at school, on the patterns of vacations in France. It was in 1983 on punched cards. And we spent a while cutting & pasting in a literal sense the 80 column graphs produced by SAS on endless listings.)

It may be that part of the interest in the paper is self-centred. I do not think analysing a similar dataset in another field like deconstructionist philosophy or Korean raku would have attracted the same attention. Looking at the clusters and the names on the pictures is obviously making sense, if more at a curiosity than at a scientific level, as I do not think this brings much in terms of ranking and evaluating research (despite what Bernard Silverman suggests in his preface) or understanding collaborations (beyond the fact that people in the same subfield or same active place like Duke tend to collaborate). Speaking of curiosity, I was quite surprised to spot my name in one network and even more to see that I was part of the “High-Dimensional Data Analysis” cluster, rather than of the “Bayes” cluster. I cannot fathom how I ended up in that theme, as I cannot think of a single paper of mine pertaining to either high dimensions or data analysis [to force the trait just a wee bit!]. Maybe thanks to my joint paper with Peter Mueller. (I tried to check the data itself but cannot trace my own papers in the raw datafiles.)

I also wonder what is the point of looking solely at four major journals in the field, missing for instance most of computational statistics and biostatistics, not to mention machine learning or econometrics. This results in a somewhat narrow niche, if obviously recovering the main authors in the [corresponding] field. Some major players in computational stats still make it to the lists, like Gareth Roberts or Håvard Rue, but under the wrong categorisation of spatial statistics.