Archive for record linkage

Contextual Integrity for Differential Privacy #3 [23w5106]

Posted in Books, Mountains, pictures, Running, Statistics, Travel, University life, Wines on August 4, 2023 by xi'an

Morning of diverse short talks. First talk by Bei Jiang (Edmonton) on local privacy for quantile estimation, which relates very much to our ongoing research with Stan, who is starting his ERC-funded PhD on privacy. The mechanism is a form of randomised response, replacing with positive probability the indicators entering the empirical cdf by random or perturbed versions, whose bias can then be corrected. I may have overdone the similarity though, in confusing users with agents. Followed by a hacking foray by Joel Reardon (Calgary) into how much information is transmitted by apps about completely unrelated phone activity. (Moral: Never send a bug report.)
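A minimal sketch of the flavour of mechanism I took away (my own toy version, not Bei Jiang's actual procedure): each user reports the indicator 1{x ≤ t} through randomised response with a keep-probability p, and the aggregator inverts the induced affine bias; sweeping t over a grid then recovers the empirical cdf and hence the quantiles.

import numpy as np

rng = np.random.default_rng(0)

def rr_cdf_estimate(x, t, p=0.75, rng=rng):
    """Locally private estimate of F(t) = P(X <= t) via randomised response."""
    ind = (x <= t).astype(float)        # true indicator each user holds
    keep = rng.random(x.size) < p       # report the true bit with probability p
    coin = rng.integers(0, 2, x.size)   # otherwise report a fair coin flip
    report = np.where(keep, ind, coin)
    # E[report] = p F(t) + (1 - p)/2, hence the affine bias correction below
    return (report.mean() - (1 - p) / 2) / p

x = rng.normal(size=10_000)
print(rr_cdf_estimate(x, t=0.0))        # should land close to F(0) = 0.5

With p = 3/4, each reported bit satisfies ε-local DP with exp(ε) = (1 + p)/(1 − p) = 7.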

The afternoon break saw us visiting the Frind Estate winery on the other side of the lake. Meaning not only wine tasting (great Syrah!) and discovering a hybrid grape called Maréchal Foch, but also entering the lab with its mass spectrometer. (But no glimpse of the winemaking process per se…)

Contextual Integrity for Differential Privacy #2 [23w5106]

Posted in Books, Mountains, pictures, Running, Statistics, Travel, University life, Wines on August 3, 2023 by xi'an

Morning of diverse short talks. First one on What are the chances? Explaining ε to end-users, by presenting odds and illustrating the impact of including one potential user's data. Then one on re-placing DP within CI in terms of causality. And multi-agent models, illustrated by the Cambridge Analytica scandal. I am still not getting the point of the CI perspective, which sounds to me like an impossibility theorem. A bit as if Statistics had stopped at "All models are wrong" (as Keynes did, in a way). And a talk on uses & misuses of DP inference, with nice drawings explaining that publicly available information (e.g., smoking causes cancer) may create breaches of privacy (Alice may have cancer). Last talk of the morning on framing effects, as privileging data processors and being overly technical? Fundamental law of information privacy? Got me wondering about the lack (?) of a dynamic perspective so far, in the (simplistic?) sense that DP does not seem to account for potential breaches were a secondary dataset to become available with shared subjects and record linkage. (A bit of a go at GDPR, for the second time within a week.)
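For the record, here is the sort of toy computation I imagine behind the odds presentation (my own figures, nothing from the talk): ε-differential privacy bounds, for any event, the ratio of output probabilities with and without one user's record by exp(ε), which translates directly into a bound on how much the odds of any adverse decision can move once Alice's data is included.

import numpy as np

eps = 0.5
p_without = 0.10                                   # chance of an adverse decision without Alice's data
p_with_worst = min(1.0, np.exp(eps) * p_without)   # worst case once Alice's data is included

odds = lambda p: p / (1 - p)
print(f"probability inflation factor: {np.exp(eps):.2f}")
print(f"odds without Alice: {odds(p_without):.3f}, worst-case odds with Alice: {odds(p_with_worst):.3f}")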

Before, I had a rather nice early morning in the woods above Okanagan Lake, crossing paths with many white-tailed deer, hopefully no ticks!, as well as No trespassing signs. And a quick and c…ool swim in the lake's 20° waters. No sign of the large wildfires raging south in Osoyoos or north in Kamloops. We had a fantastic lunch break at the nearby Arrowleaf Cellars winery, with a stellar pinot noir, although this rather made the following working session harder to engage with (not mentioning the lingering jetlag)!

how to count excess deaths?

Posted in Books, Kids, pictures, Statistics on February 17, 2022 by xi'an

Another terrible graph from Nature… With vertical bars meaning nothing, nothing more than a list of three values and their confidence intervals. But the associated article is quite interesting in investigating the difficulties in assessing the number of deaths due to COVID-19, when official death statistics are (almost) as shaky as the official COVID-19 death counts, even in countries with sound mortality statistics and trustworthy official statistics institutes. The article opposes prediction models run by the Institute for Health Metrics and Evaluation and by The Economist, the latter being a machine-learning prediction procedure based on a large number of covariates. Without looking under the hood, it is unclear to me how poor entries across the array of covariates can be corrected to return a meaningful prediction. It is also striking that the model predicts far fewer excess deaths than the official COVID-19 death count in a developed country like Japan. Survey methods are briefly mentioned at the end of the article, with interesting attempts to use satellite images of burial grounds, but no further techniques like capture-recapture or record linkage and entity resolution.
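As a reminder of what the omitted capture-recapture route would look like on two incomplete death registers, here is a bare-bones sketch with made-up counts (not data from the article):

# two incomplete, (assumed) independent death lists of sizes n1 and n2,
# with m records matched between them; the Chapman (bias-corrected
# Lincoln-Petersen) estimator then accounts for the deaths missed by both lists
n1, n2, m = 1200, 900, 450                      # made-up list sizes and overlap
n_hat = (n1 + 1) * (n2 + 1) / (m + 1) - 1       # Chapman estimator
print(f"estimated total deaths: {n_hat:.0f}")   # versus only n1 + n2 - m = 1650 observed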

deduplication and population size estimation [discussion]

Posted in Books, Statistics on April 23, 2020 by xi'an

[Here is my discussion of the paper "A Unified Framework for De-Duplication and Population Size Estimation" by [my friends] Andrea Tancredi, Rebecca Steorts, and Brunero Liseo, to appear in the June 2020 issue of Bayesian Analysis. The deadline is 24 April. Discussions are to be submitted to BA as regular submissions.]

Congratulations to the authors for this paper, which expands the modelling of populations investigated through faulty surveys, a poor-data-quality feature that applies to extreme cases like the Syrian casualty lists. And possibly to COVID-19 victims.

The model considered in this paper, as given by (2.1), is a latent variable model which appears hyper-parameterised, in the sense that it involves a large number of parameters and latent variables. First, this means it is essentially intractable outside a Bayesian resolution. Second, within the Bayesian perspective, it calls for identifiability and consistency questions, namely which fraction of the unknown entities is identifiable and which fraction can be consistently estimated, eventually severing the dependence on the prior modelling. Personal experience with capture-recapture models on social data, like drug-addict populations, showed me that prior choices often significantly drive posterior inference on the population size. Here, it seems that the generative distortion mechanism between the registry of individuals and the actual records is paramount.
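As a back-of-the-envelope check of this prior sensitivity, here is a toy two-list capture-recapture computation (my own, not the paper's model): the capture probabilities are integrated out under Beta(1,1) priors and the posterior on N is compared under two default priors.

import numpy as np
from scipy.special import gammaln, betaln

n11, n10, n01 = 5, 95, 45            # made-up overlap counts between two lists
n1, n2 = n11 + n10, n11 + n01        # list sizes
n_obs = n11 + n10 + n01              # distinct observed individuals

N = np.arange(n_obs, 20_000)
# marginal likelihood of N, with the two capture probabilities integrated out
loglik = (gammaln(N + 1) - gammaln(N - n_obs + 1)
          + betaln(n1 + 1, N - n1 + 1)
          + betaln(n2 + 1, N - n2 + 1))

for name, logprior in [("uniform prior on N", np.zeros(N.size)),
                       ("1/N prior on N", -np.log(N))]:
    logpost = loglik + logprior
    post = np.exp(logpost - logpost.max())
    post /= post.sum()
    print(f"{name}: posterior mean of N about {np.sum(N * post):.0f}")

With such a thin overlap between the lists, the posterior on N visibly reacts to the choice of prior, which is the kind of prior impact I have in mind.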

“We now investigate an alternative aspect of the uniform prior distribution of λ given N.”

Given the practical application stressed in the title, namely the civilian casualties in Syria, these interrogations take a more topical flavour, as one wonders at the connection between the model and the actual data, between the prior modelling and the available prior information. This is however not the strategy adopted in the paper, which instead proposes a generic prior modelling that could be deemed non-informative. I find the property that conditioning on the list sizes eliminates the capture probabilities and the duplication rates quite amazing, reminding me of similar properties for conjugate mixtures, although we found that property hard to exploit from a computational viewpoint. Equally remarkable is that the hit-miss model provides computationally tractable marginal distributions for the cluster observations.
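For readers unfamiliar with the hit-miss distortion idea, here is a bare-bones simulation of my understanding of it (a simplification, not the paper's full specification): each recorded field equals the true latent value with some probability (a "hit") and is otherwise redrawn from the overall distribution of that field (a "miss"), which is what keeps the marginal distribution of a distorted record tractable.

import numpy as np

rng = np.random.default_rng(1)

def hit_miss(true_values, beta, rng=rng):
    """Distort one categorical field: keep the truth with probability 1 - beta, else redraw."""
    values, counts = np.unique(true_values, return_counts=True)
    miss = rng.random(true_values.size) < beta
    redraw = rng.choice(values, size=true_values.size, p=counts / counts.sum())
    return np.where(miss, redraw, true_values)

names = np.array(["Ali", "Omar", "Sara", "Ali", "Lina", "Omar"])
print(hit_miss(names, beta=0.3))       # a distorted copy of the name field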

“Several records of the VDC data set represent unidentified victims and report only the date of death or do not have the first name and report only the relationship with the head of the family.”

This non-informative choice is however quite informative about the misreporting mechanism, and it does not address the issue that this mechanism is presumably misspecified. It indeed makes the assumption that the individual label and the type of record are jointly enough to explain the probability of misreporting the exact record. In practical cases, it seems more realistic that the probability of appearing in a list depends on the characteristics of the individual, hence is far from uniform, as well as far from independent from one list to the next. The same applies to the probability of being misreported. The alternative to the uniform allocation of individuals to lists found in (3.3) remains neutral about the reasons why (some) individuals are missing from (some) lists. No informative input is indeed made here on how duplicates could appear or on how errors are made in registering individuals. Furthermore, given the high variability observed in inferring the number of actual deaths covered by the combination of the two lists, it would have been of interest to include a model comparison assessment, especially when contemplating the clash between the four posteriors in Figure 4.

The implementation of a manageable Gibbs sampler in such a convoluted model is quite impressive, and one would welcome further comments from the authors on its convergence properties, since it faces a large-dimensional space. Are there, for instance, theoretical or numerical irreducibility issues created by the discrete nature of some latent variables, as in mixture models?

deduplication and population size estimation [discussion opened]

Posted in Books, pictures, Running, Statistics, University life on March 27, 2020 by xi'an

A call (worth disseminating) for discussions of the paper "A Unified Framework for De-Duplication and Population Size Estimation" by [my friends] Andrea Tancredi, Rebecca Steorts, and Brunero Liseo, to appear in the June 2020 issue of Bayesian Analysis. The deadline is 24 April.

Data de-duplication is the process of detecting records in one or more datasets which refer to the same entity. In this paper we tackle the de-duplication process via a latent entity model, where the observed data are perturbed versions of a set of key variables drawn from a finite population of N different entities. The main novelty of our approach is to consider the population size N as an unknown model parameter. As a result, a salient feature of the proposed method is the capability of the model to account for the de-duplication uncertainty in the population size estimation. As by-products of our approach we illustrate the relationships between de-duplication problems and capture-recapture models and we obtain a more adequate prior distribution on the linkage structure. Moreover we propose a novel simulation algorithm for the posterior distribution of the matching configuration based on the marginalization of the key variables at population level. We apply our method to two synthetic data sets comprising German names. In addition we illustrate a real data application, where we match records from two lists which report information about people killed in the recent Syrian conflict.