Archive for capture-recapture

Cédric Villani on COVID-19 [and Zoom for the local COVID-19 seminar]

Posted in Statistics, University life on June 19, 2020 by xi'an

From the “start” of the COVID-19 crisis in France (or more accurately after the lockdown on March 13), the math department at Paris-Dauphine has run an internal webinar around this crisis, not solely focusing on the math or stats aspects but also involving speakers from other domains, from epidemiology to sociology to economics. The speaker today was [Fields medalist, then elected member of Parliament] Cédric Villani, as a member of the French Parliament's science and technology committee, l'Office parlementaire d'évaluation des choix scientifiques et technologiques (OPECST), which adds its recommendations to those of the several committees advising the French government. The discussion was interesting as an insight into the political processing of the crisis and the difficulties caused by the heavy-handed French bureaucracy, which still required filling in form A-3-b6 in emergency situations. And into the huge delays in launching a genuine survey of the range and diffusion of the epidemic. Which, as far as I understand, has not yet started…

deduplication and population size estimation [discussion]

Posted in Books, Statistics on April 23, 2020 by xi'an

[Here is my discussion of the paper “A Unified Framework for De-Duplication and Population Size Estimation” by [my friends] Andrea Tancredi, Rebecca Steorts, and Brunero Liseo, to appear in the June 2020 issue of Bayesian Analysis. The deadline is 24 April. Discussions are to be submitted to BA as regular submissions.]

Congratulations to the authors for this paper, which expands the modelling of populations investigated by faulty surveys, a poor-quality feature that applies to extreme cases like Syrian casualty counts. And possibly COVID-19 victims.

The model considered in this paper, as given by (2.1), is a latent variable model which appears heavily parameterised, in the sense that it involves a large number of parameters and latent variables. First, this means it is essentially intractable outside a Bayesian resolution. Second, within the Bayesian perspective, it calls for identifiability and consistency questions, namely which fraction of the unknown entities is identifiable and which fraction can be consistently estimated, ultimately severing the dependence on the prior modelling. Personal experience with capture-recapture models on social data, like drug-addict populations, showed me that prior choices often significantly drive posterior inference on the population size. Here, it seems that the generative distortion mechanism between the registry of individuals and the actual records is paramount.
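To illustrate the prior-sensitivity point above, here is a minimal sketch with entirely made-up numbers, using a far simpler two-list capture-recapture model than the paper's: the posterior on the population size N under a hypergeometric likelihood, compared under a flat prior and a 1/N prior on N.

```python
# Minimal two-list capture-recapture sketch (hypothetical data, not the
# paper's model): how the prior on N can drive the posterior on N.
import numpy as np
from scipy.stats import hypergeom

n1, n2, m = 60, 55, 15                  # list sizes and overlap (made up)
N_grid = np.arange(n1 + n2 - m, 2001)   # N is at least the number of distinct records

# Hypergeometric likelihood: m recaptures when drawing n2 from N with n1 "marked"
lik = hypergeom.pmf(m, N_grid, n1, n2)

means = {}
for name, prior in [("uniform", np.ones_like(N_grid, dtype=float)),
                    ("1/N", 1.0 / N_grid)]:
    post = lik * prior
    post /= post.sum()
    means[name] = float(np.sum(N_grid * post))
    print(f"{name:8s} prior: posterior mean of N = {means[name]:.0f}")
```

With these (hypothetical) counts, the flat prior pulls the posterior mean of N visibly above the 1/N version, which is the kind of prior-driven shift alluded to above.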

“We now investigate an alternative aspect of the uniform prior distribution of λ given N.”

Given the practical application stressed in the title, namely civilian casualties in Syria, these interrogations take a more topical flavour, as one wonders about the connection between the model and the actual data, and between the prior modelling and the available prior information. This is however not the strategy adopted in the paper, which instead proposes a generic prior modelling that could be deemed non-informative. I find the property that conditioning on the list sizes eliminates the capture probabilities and the duplication rates quite amazing, reminding me of similar properties for conjugate mixtures, although we found that property hard to exploit from a computational viewpoint. And likewise that the hit-miss model provides computationally tractable marginal distributions for the cluster observations.

“Several records of the VDC data set represent unidentified victims and report only the date of death or do not have the first name and report only the relationship with the head of the family.”

This non-informative choice is however quite informative about the misreporting mechanism, and it does not address the issue that this mechanism is presumably misspecified. It indeed assumes that the individual label and the type of record are jointly enough to explain the probability of misreporting the exact record. In practical cases, it seems more realistic that the probability of appearing in a list depends on the characteristics of the individual, hence is far from uniform, as well as far from independent from one list to the next. The same applies to the probability of being misreported. The alternative to the uniform allocation of individuals to lists found in (3.3) remains neutral about the reasons why (some) individuals are missing from (some) lists. No informative input is indeed made here on how duplicates could appear or on how errors are made when registering individuals. Furthermore, given the high variability observed in inferring the number of actual deaths covered by the combination of the two lists, it would have been of interest to include a model-comparison assessment, especially when contemplating the clash between the four posteriors in Figure 4.

The implementation of a manageable Gibbs sampler in such a convoluted model is quite impressive, and one would welcome further comments from the authors on its convergence properties, since it is facing a high-dimensional space. Are there theoretical or numerical irreducibility issues, for instance created by the discrete nature of some latent variables, as in mixture models?

coronavirus counts do not count

Posted in Books, pictures, Statistics on April 8, 2020 by xi'an

Somewhat by chance I came across Nate Silver's piece on FiveThirtyEight about the meaninglessness of COVID-19 case counts, as they reflect testing efforts and available resources rather than actual cases, and moreover testing efforts from at least a fortnight earlier.

“The data, at best, is highly incomplete, and often the tip of the iceberg for much larger problems. And data on tests and the number of reported cases is highly nonrandom. In many parts of the world today, health authorities are still trying to triage the situation with a limited number of tests available. Their goal in testing is often to allocate scarce medical care to the patients who most need it — rather than to create a comprehensive dataset for epidemiologists and statisticians to study.”

The article runs four different scenarios with the same actual parameters for the epidemic, yielding highly different and mostly misleading perceptions depending on the testing strategies. This is a highly relevant warning, but I am surprised Nate Silver does not reach the rather obvious conclusion that some form of official survey, for instance based on capture-recapture and representative samples, testing for present and past infections, should be implemented on a very regular basis. Even with a limited number of tested persons, this would give a much more reliable vision of the status of the epidemic. Here, the French official institute of statistics, INSEE, would be most suited to implement such a scheme.
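As a back-of-the-envelope version of the capture-recapture idea just mentioned, with entirely hypothetical numbers: the Chapman (bias-corrected Lincoln-Petersen) estimator applied to two testing campaigns whose positives partially overlap.

```python
# Chapman (bias-corrected Lincoln-Petersen) estimator: a standard two-sample
# capture-recapture formula, here fed with made-up testing-campaign numbers.
def chapman_estimate(n1, n2, m):
    """Estimate the total case count from two samples of sizes n1 and n2
    that share m individuals."""
    if not 0 <= m <= min(n1, n2):
        raise ValueError("overlap m must satisfy 0 <= m <= min(n1, n2)")
    return (n1 + 1) * (n2 + 1) / (m + 1) - 1

# e.g. 300 and 280 positives in two campaigns, 40 individuals found in both
print(round(chapman_estimate(300, 280, 40)))  # → 2062
```

The overlap between the two rounds is what carries the information about the undetected cases, which is why two modest surveys can say more than one large convenience sample.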

deduplication and population size estimation [discussion opened]

Posted in Books, pictures, Running, Statistics, University life on March 27, 2020 by xi'an

A call (worth disseminating) for discussions of the paper “A Unified Framework for De-Duplication and Population Size Estimation” by [my friends] Andrea Tancredi, Rebecca Steorts, and Brunero Liseo, to appear in the June 2020 issue of Bayesian Analysis. The deadline is 24 April.

Data de-duplication is the process of detecting records in one or more datasets which refer to the same entity. In this paper we tackle the de-duplication process via a latent entity model, where the observed data are perturbed versions of a set of key variables drawn from a finite population of N different entities. The main novelty of our approach is to consider the population size N as an unknown model parameter. As a result, a salient feature of the proposed method is the capability of the model to account for the de-duplication uncertainty in the population size estimation. As by-products of our approach we illustrate the relationships between de-duplication problems and capture-recapture models and we obtain a more adequate prior distribution on the linkage structure. Moreover we propose a novel simulation algorithm for the posterior distribution of the matching configuration based on the marginalization of the key variables at population level. We apply our method to two synthetic data sets comprising German names. In addition we illustrate a real data application, where we match records from two lists which report information about people killed in the recent Syrian conflict.
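For readers unfamiliar with the distortion model behind the abstract above, here is a toy sketch of a hit-miss mechanism (hypothetical names and parameters, not the authors' implementation): each recorded field is a faithful copy of the true key with some probability, and is otherwise redrawn from the empirical distribution of that field, which is what creates duplicates and near-matches across lists.

```python
# Toy hit-miss distortion (hypothetical values, not the paper's code):
# a record keeps the true key with probability alpha ("hit"), otherwise
# it is resampled from the empirical pool of values ("miss").
import random

random.seed(1)
true_names = ["Omar", "Ali", "Sara", "Omar", "Lina"]  # keys of 5 entities
alpha = 0.8                                           # hit probability

def hit_miss(value, pool, alpha):
    """Return the observed field: the true value, or one resampled from the pool."""
    return value if random.random() < alpha else random.choice(pool)

records = [hit_miss(v, true_names, alpha) for v in true_names]
print(records)
```

Under this mechanism the observed records are exchangeable perturbations of the latent keys, which is the feature that makes the record-level marginals tractable in such models.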

RSS honours recipients for 2020

Posted in Statistics on March 16, 2020 by xi'an

Just read the news that my friend [and co-author] Arnaud Doucet (Oxford) is the winner of the 2020 Guy Medal in Silver from the Royal Statistical Society. I was also pleased to learn about David Spiegelhalter's Guy Medal in Gold (I first met David at the fourth Valencia Bayesian meeting in 1991, where he had a poster on the very early stages of BUGS) and Byron Morgan's Barnett Award for his indeed remarkable work on statistical ecology, and in particular on Bayesian capture-recapture models. Congrats to all six recipients!