Archive for record linkage

deduplication and population size estimation [discussion]

Posted in Books, Statistics on April 23, 2020 by xi'an

[Here is my discussion of the paper “A Unified Framework for De-Duplication and Population Size Estimation” by [my friends] Andrea Tancredi, Rebecca Steorts, and Brunero Liseo, to appear in the June 2020 issue of Bayesian Analysis. The deadline is 24 April. Discussions are to be submitted to BA as regular submissions.]

Congratulations to the authors for this paper, which expands the modelling of populations investigated through faulty surveys, a poor-quality feature that applies to extreme cases like Syrian casualties. And possibly COVID-19 victims.

The model considered in this paper, as given by (2.1), is a latent variable model which appears as hyper-parameterised, in the sense that it involves a large number of parameters and latent variables. First, this means it is essentially intractable outside a Bayesian resolution. Second, within the Bayesian perspective, it calls for identifiability and consistency questions, namely which fraction of the unknown entities is identifiable and which fraction can be consistently estimated, eventually severing the dependence on the prior modelling. Personal experience with capture-recapture models on social data, like drug-addict populations, showed me that prior choices often significantly drive posterior inference on the population size. Here, it seems that the generative distortion mechanism between the registry of individuals and the actual records is paramount.
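This prior sensitivity is easily illustrated in the simplest two-list capture-recapture setting. The sketch below (all counts and the truncation bound are invented for illustration) computes the posterior mean of the population size N under a truncated uniform prior and under a 1/N prior, for the same hypergeometric likelihood of the overlap between two lists:

```python
from math import comb, exp, log

def log_overlap_lik(m, N, n1, n2):
    """Log-likelihood of observing m records shared by two lists of
    sizes n1 and n2, drawn from a closed population of size N."""
    if N < n1 + n2 - m:
        return float("-inf")
    return (log(comb(n1, m)) + log(comb(N - n1, n2 - m))
            - log(comb(N, n2)))

def posterior_mean(n1, n2, m, log_prior, N_max=2000):
    """Posterior mean of N over a truncated grid, given a log-prior."""
    Ns = range(n1 + n2 - m, N_max + 1)
    logs = [log_prior(N) + log_overlap_lik(m, N, n1, n2) for N in Ns]
    top = max(logs)  # stabilise the exponentiation
    weights = [exp(l - top) for l in logs]
    return sum(N * w for N, w in zip(Ns, weights)) / sum(weights)

# fictitious lists: 60 and 50 records, 10 of which appear in both
mean_uniform = posterior_mean(60, 50, 10, lambda N: 0.0)
mean_inverse = posterior_mean(60, 50, 10, lambda N: -log(N))
print(mean_uniform, mean_inverse)
```

The gap between the two posterior means is a crude but telling measure of how much the prior on N drives the inference; in the much richer model of the paper, the same question arises with many more latent variables in between.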

“We now investigate an alternative aspect of the uniform prior distribution of λ given N.”

Given the practical application stressed in the title, namely the counting of civil casualties in Syria, these interrogations take a more topical flavour, as one wonders about the connection between the model and the actual data, and between the prior modelling and the available prior information. This is however not the strategy adopted in the paper, which instead proposes a generic prior modelling that could be deemed non-informative. I find quite amazing the property that conditioning on the list sizes eliminates the capture probabilities and the duplication rates, reminding me indeed of similar properties for conjugate mixtures, although we found such properties hard to exploit from a computational viewpoint. And that the hit-miss model provides computationally tractable marginal distributions for the cluster observations.

“Several records of the VDC data set represent unidentified victims and report only the date of death or do not have the first name and report only the relationship with the head of the family.”

This non-informative choice is however quite informative about the misreporting mechanism, and does not address the issue that it presumably is misspecified. It indeed assumes that the individual label and the type of record are jointly sufficient to explain the probability of misreporting the exact record. In practical cases, it seems more realistic that the probability of appearing in a list depends on the characteristics of an individual, hence is far from uniform, as well as far from independent from one list to the next. The same applies to the probability of being misreported. The alternative to the uniform allocation of individuals to lists found in (3.3) remains neutral as to the reasons why (some) individuals are missing from (some) lists. No informative input is indeed made here on how duplicates could appear or on how errors are made in registering individuals. Furthermore, given the high variability observed in inferring the number of actual deaths covered by the combination of the two lists, it would have been of interest to include a model comparison assessment, especially when contemplating the clash between the four posteriors in Figure 4.
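The bias induced by heterogeneous list-inclusion probabilities is easy to check by simulation. The sketch below (all numbers invented for illustration) draws individual capture probabilities from a Beta distribution, shared across two lists, and shows the naive two-list Lincoln-Petersen estimator undershooting the true population size relative to the homogeneous case:

```python
import random

random.seed(0)

def lincoln_petersen(N, p_draw, reps=200):
    """Average two-list Lincoln-Petersen estimate of N when each
    individual's capture probability is drawn from p_draw and shared
    across the two lists."""
    estimates = []
    for _ in range(reps):
        probs = [p_draw() for _ in range(N)]
        list1 = [random.random() < p for p in probs]
        list2 = [random.random() < p for p in probs]
        n1, n2 = sum(list1), sum(list2)
        m = sum(a and b for a, b in zip(list1, list2))  # overlap
        if m > 0:
            estimates.append(n1 * n2 / m)
    return sum(estimates) / len(estimates)

N = 1000
# heterogeneous capture: individual probabilities drawn from a Beta(2,5)
hetero = lincoln_petersen(N, lambda: random.betavariate(2, 5))
# homogeneous capture at the same average probability 2/7
homo = lincoln_petersen(N, lambda: 2 / 7)
print(round(hetero), round(homo))
```

Since individuals with high capture probability tend to appear in both lists, the overlap is inflated and the estimated population size shrinks, which is the kind of covariate-driven effect the uniform allocation cannot capture.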

The implementation of a manageable Gibbs sampler in such a convoluted model is quite impressive, and one would welcome further comments from the authors on its convergence properties, since the sampler faces a high-dimensional space. Are there, for instance, theoretical or numerical irreducibility issues created by the discrete nature of some latent variables, as in mixture models?

deduplication and population size estimation [discussion opened]

Posted in Books, pictures, Running, Statistics, University life on March 27, 2020 by xi'an

A call (worth disseminating) for discussions of the paper “A Unified Framework for De-Duplication and Population Size Estimation” by [my friends] Andrea Tancredi, Rebecca Steorts, and Brunero Liseo, to appear in the June 2020 issue of Bayesian Analysis. The deadline is 24 April.

Data de-duplication is the process of detecting records in one or more datasets which refer to the same entity. In this paper we tackle the de-duplication process via a latent entity model, where the observed data are perturbed versions of a set of key variables drawn from a finite population of N different entities. The main novelty of our approach is to consider the population size N as an unknown model parameter. As a result, a salient feature of the proposed method is the capability of the model to account for the de-duplication uncertainty in the population size estimation. As by-products of our approach we illustrate the relationships between de-duplication problems and capture-recapture models and we obtain a more adequate prior distribution on the linkage structure. Moreover we propose a novel simulation algorithm for the posterior distribution of the matching configuration based on the marginalization of the key variables at population level. We apply our method to two synthetic data sets comprising German names. In addition we illustrate a real data application, where we match records from two lists which report information about people killed in the recent Syrian conflict.

Bayes for good

Posted in Books, Mountains, pictures, Running, Statistics, Travel, University life on November 27, 2018 by xi'an

A very special weekend workshop on Bayesian techniques used for social good, in many different senses (and talks), which we organised with Kerrie Mengersen and Pierre Pudlo at CIRM, Luminy, Marseilles. It started with Rebecca (Beka) Steorts (Duke) explaining [by video from Duke] how the Syrian war deaths were processed to eliminate duplicates, to be continued on Monday at the “Big” conference; then Alexander Volfovsky (Duke) on a Twitter experiment on whether exposure to adverse opinions is depolarising (it is not!) or further polarising (it is), turning into a network causal analysis. And then Kerrie Mengersen (QUT) on the use of Bayesian networks in ecology, through observational studies she conducted. And the role of neutral statisticians in cases of adversarial experts!

The next day started with a talk by David Corliss (Peace-Work), who writes the Stats for Good column in CHANCE and here gave a recruiting spiel for volunteering in good initiatives, quoting Florence Nightingale as the “first” volunteer and presenting a broad collection of projects in support of his recommendations for “doing good”. We then heard [by video] Julien Cornebise from Element AI in London telling of his move out of DeepMind towards investing in socially impactful projects through this new startup, including working with Amnesty International on Darfur village destructions, building evidence from satellite imaging. And crowdsourcing. With an incoming report on the year's activities (still under embargo). A most exciting and enthusiastic talk!


congrats!

Posted in Statistics, University life on August 24, 2015 by xi'an

Two items of news reached my mailbox at about the same time: my friends and CMU coauthors Rebecca (Beka) Steorts and Steve Fienberg both received a major award in the past few days. Congrats to both of them!!! At JSM 2015, Steve got the 2015 Jerome Sacks Award for Cross-Disciplinary Research “for a remarkable career devoted to the development and application of statistical methodology to solve problems for the benefit of society, including aspects of human rights, privacy and confidentiality, forensics, survey and census-taking, and more; and for exceptional leadership in a variety of professional and governmental organizations, including in the founding of NISS.” The award is delivered by the National Institute of Statistical Sciences (NISS) in honour of Jerry Sacks. And Beka has been selected as one of the 35 Innovators Under 35 for 2015, a list published yearly by the MIT Technology Review, in particular for her record-linkage work on estimating the number of casualties in the Syrian civil war. (Which led the Review to classify her as a humanitarian rather than a visionary, a list that includes two other machine learners.) Great!

JSM 2014, Boston [#4]

Posted in Books, Statistics, Travel, University life on August 9, 2014 by xi'an

Last and final day and post at and about JSM 2014! It is very rare that I stay till the last day, and it is solely due to family constraints that I attended the very last sessions. It was a bit eerie, walking through the huge structure of the Boston Convention Centre, which could easily house several A380s, and meeting a few souls dragging a suitcase to the mostly empty rooms… Getting scheduled on the final day of the conference is not the nicest thing, and I offer my condolences to all speakers ending up speaking today! Including my former Master's student Anne Sabourin.

I first attended the Frontiers of Computer Experiments: Big Data, Calibration, and Validation session, with a talk by David Higdon on the extrapolation limits of computer models, a talk that linked very nicely with Stephen Stigler’s Presidential Address and stressed the need to incorporate the often neglected fact that models are not reality. Jared Niemi also presented an approximate way of handling Gaussian process modelling for large datasets. It was only natural to link this talk with David’s and to wonder about the extrapolability of the modelling, the risk of over-fitting, and the potential for detecting sudden drops in the function.

The major reason why I made the one-hour trip back to the Boston Convention Centre was however the Human Rights Violations: How Do We Begin Counting the Dead? session. It was of direct interest to me, as I had wondered in the past days about statistically assessing the number of political kidnappings and murders in Eastern Ukraine. Of methodological relevance, as the techniques were connected with capture-recapture and random forests. And of close connection with two speakers who alas could not make it and were replaced by co-authors. The first talk, by Samuel Ventura, considered ways of accelerating the comparison of entries across multiple lists in order to identify unique individuals, with the open methodological question of handling populations of probabilities, as produced by random forests. My virtual question related to this talk was why the causes of duplications and errors in the records were completely ignored. At least in the example of the Syrian deaths, some analysis could be conducted on the reasons for differences between entries, and maybe a prior model constructed. The second talk, by Daniel Manrique-Vallier, was about using non-parametric capture-recapture to count the number of dead from several lists, once again bypassing the use of potential covariates to explain the differences. As I noticed a while ago when analysing the population of (police-)captured drug addicts in the greater Paris area, the prior modelling has a strong impact on the estimated population size. Another point I would have liked to discuss was the repeated argument that Arabic (script?) made the identification of individuals more difficult: my naïve reaction was to wonder whether this was due to the absence of fluent Arabic speakers in the team, who could have helped build a model of the potential alternative spellings and derivations of Arabic names. But I may have missed more subtle difficulties.
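On the alternative-spellings point, even a crude string-similarity pass illustrates how transliteration variants could be clustered before any counting takes place. The sketch below (names and threshold entirely invented for illustration, and no substitute for actual Arabic-language expertise) uses Python's difflib to group near-duplicate records:

```python
from difflib import SequenceMatcher

def similar(a, b, threshold=0.7):
    """Crude similarity test between two transliterated names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def dedup(records, threshold=0.7):
    """Greedy single-link clustering: each record joins the first
    cluster whose representative it resembles closely enough."""
    clusters = []
    for name in records:
        for cluster in clusters:
            if similar(name, cluster[0], threshold):
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters

# invented transliteration variants of two underlying names
records = ["Mohammed", "Muhammad", "Mohamad", "Fatima", "Fatimah"]
clusters = dedup(records)
print(clusters)
```

In a serious application the pairwise similarities would feed a probabilistic model (as in the record-linkage work discussed in the session) rather than a hard threshold, which is precisely where populations of probabilities arise.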