Archive for INSEE

updated (over)mortality curves

Posted in Statistics with tags , , , , , , , on May 26, 2020 by xi'an

another surmortality graph

Posted in Books, R, Statistics with tags , , , , , , , , on April 27, 2020 by xi'an

Another graph showing the recent peak in daily deaths throughout France as recorded by INSEE and plotted by Baptiste Coulmont from Paris 8 Sociology Department. And further discussed by Arthur Charpentier on Freakonometrics. With a few days off due to reporting, this brings an objective perspective on the impact of the epidemics (and of the quarantine) compared with the other years since 2001, without requiring tests or even surveys. (The huge peak in August 2003 was an heat wave that decimated elderly citizens throughout France.)

coronavirus counts do not count

Posted in Books, pictures, Statistics with tags , , , , , , , , on April 8, 2020 by xi'an

Somewhat by chance I came across Nate Silver‘s tribune on FiveThirtyEight about the meaninglessness of COVID-19 case counts. As it reflects on sampling efforts and available resources rather than actual cases, furthermore sampling efforts from at least a fortnight.

“The data, at best, is highly incomplete, and often the tip of the iceberg for much larger problems. And data on tests and the number of reported cases is highly nonrandom. In many parts of the world today, health authorities are still trying to triage the situation with a limited number of tests available. Their goal in testing is often to allocate scarce medical care to the patients who most need it — rather than to create a comprehensive dataset for epidemiologists and statisticians to study.”

This article runs four different scenarios, with the same actual parameters for the epidemics, and highly different and mostly misleading perceptions based on the testing strategies. This is a highly relevant warning but I am surprised Nate Silver does not move to the rather obvious conclusion that some form of official survey or another, for instance based on capture-recapture and representative samples, testing for present and past infections, should be implemented on a very regular basis, even with a limited number of tested persons to get a much more reliable vision of the status of the epidemics. Here, the French official institute of statistics, INSEE, would be most suited to implement such a scheme.

mortality rates as sounder indicators

Posted in pictures, Statistics, Travel with tags , , , , , , on April 6, 2020 by xi'an

on anonymisation

Posted in Books, pictures, Statistics, University life with tags , , , , , , , , , , , on August 2, 2019 by xi'an

An article in the New York Times covering a recent publication in Nature Communications on the ability to identify 99.98% of Americans from almost any dataset with fifteen covariates. And mentioning the French approach of INSEE, more precisely CASD (a branch of GENES, as ENSAE and CREST to which I am affiliated), where my friend Antoine worked for a few years, and whose approach is to vet researchers who want access to non-anonymised data, by creating local working environments on the CASD machines  so that data does not leave the site. The approach is to provide the researcher with a dedicated interface, which “enables access remotely to a secure infrastructure where confidential data is safe from harm”. It further delivers reproducibility certificates for publications, a point apparently missed by the New York Times which advances the lack of reproducibility as a drawback of the method. It also mentions the possibility of doing cryptographic data analysis, again missing the finer details with a lame objection.

“Our paper shows how the likelihood of a specific individual to have been correctly re-identified can be estimated with high accuracy even when the anonymized dataset is heavily incomplete.”

The Nature paper is actually about the probability for an individual to be uniquely identified from the given dataset, which somewhat different from the NYT headlines. Using a copula for the distribution of the covariates. And assessing the model with a mean square error evaluation when what matters are false positives and false negatives. Note that the model need be trained for each new dataset, which reduces the appeal of the claim, especially when considering that individuals tagged as uniquely identified about 6% are not. The statistic of 99.98% posted in the NYT is actually a count on a specific dataset,  the 5% Public Use Microdata Sample files, and Massachusetts residents, and not a general statistic [which would not make much sense!, as I can easily imagine 15 useless covariates] or prediction from the authors’ model. And a wee bit anticlimactic.