## surveying homelessness

Posted in Books, pictures, Statistics on February 26, 2023 by xi'an

A recent NYT article, entitled “582,462 and Counting”, describes how the US Federal Administration runs a yearly survey of homeless people, sending agents and volunteers into the streets and shelters to get an idea of the magnitude of the problem. The figure of 582,462 was the one produced by HUD (the US Department of Housing and Urban Development) for 2022. In the methodological instructions related to this survey, I could not find any mention of an advanced statistical technique such as capture-recapture. Despite a possible call to statistical software:

The next step is to enter data from the survey into a computer program, such as SPSS, SAS, Excel, or the CoCs’ HMIS.

And neither undercount nor overcount is formally addressed:

CoCs must ensure that during the PIT count homeless persons are only counted once. It is critical that the counting methods be coordinated to ensure that there is no double-counting. Therefore, CoCs must also collect sufficient information to be able to deduplicate the PIT count (i.e. ensure that the same homeless person was not counted more than once).

(where CoC stands for Continuum of Care and PIT for Point-in-Time). (I remember discussing homeless surveys in the 1990’s with researchers from INED, although they had already launched their survey and would thus be unable to “recapture” interviewees.)
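For readers unfamiliar with capture-recapture, here is a minimal R sketch of the simplest two-occasion version, under the strong assumptions of a closed population and independent counts. The function name `two_sample` and all numbers are hypothetical illustrations, not HUD data or HUD methodology.

```r
# Two-occasion capture-recapture estimates of a population size:
# n1 = people recorded on the first occasion,
# n2 = people recorded on the second occasion,
# m  = people recorded on both occasions (the "recaptures")
two_sample <- function(n1, n2, m) {
  c(lincoln_petersen = n1 * n2 / m,
    chapman = (n1 + 1) * (n2 + 1) / (m + 1) - 1)  # less biased for small m
}

# Hypothetical counts, for illustration only
est <- two_sample(n1 = 200, n2 = 150, m = 30)
est
```

The estimate only makes sense if individuals can be matched across the two occasions, which is why a survey launched without identifying information cannot be "recaptured" after the fact.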

## Bayes in Riddler mode

Posted in Books, Kids, R, Statistics on July 7, 2022 by xi'an

A very classical (textbook) question on the Riddler on inferring the contents of an urn from a hypergeometric experiment:

You have an urn with N red and white balls, but you have no information about what N might be. You draw n=19 balls at random, without replacement, and you get 8 red balls and 11 white balls. What is your best guess for the original number of balls (red and white) in the urn?

With therefore a likelihood given, up to a multiplicative constant, by

$\frac{R!}{(R-8)!}\frac{W!}{(W-11)!}\frac{(R+W-19)!}{(R+W)!}$

leading to a simple posterior derivation when choosing a 1/(RW) improper prior. The log-likelihood can be computed for a range of integer values of R and W:

L=function(R,W)lfactorial(R)+lfactorial(W)+
lfactorial(R+W-19)-lfactorial(R-8)-
lfactorial(W-11)-lfactorial(R+W)


This produces a posterior mean of 99.1 for R and of 131.2 for W, or a posterior median of 52 for R and 73 for W. And leads to the above surface for the log-likelihood. Which is unsurprisingly maximal at (8,11). The dependence on the prior is of course significant!
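A minimal sketch of how such posterior summaries can be obtained on a grid, reusing the log-likelihood `L` from the post. The truncation bounds `Rmax` and `Wmax` are my assumption, since the post does not state the range used; the values move with the truncation, so the output need not match the figures quoted here.

```r
# Log-likelihood of 8 red and 11 white balls in 19 draws (as in the post)
L <- function(R, W) lfactorial(R) + lfactorial(W) + lfactorial(R + W - 19) -
  lfactorial(R - 8) - lfactorial(W - 11) - lfactorial(R + W)

# Assumption: hypothetical truncation of the grid of (R, W) values
Rmax <- 500; Wmax <- 500
R <- 8:Rmax; W <- 11:Wmax

# Log-likelihood surface and log-posterior under the 1/(RW) prior
ll <- outer(R, W, L)
lpost <- ll - outer(log(R), log(W), "+")
post <- exp(lpost - max(lpost))  # subtract the max for numerical stability
post <- post / sum(post)         # normalise over the truncated grid

# Marginal posteriors, then means and medians
pR <- rowSums(post); pW <- colSums(post)
c(meanR = sum(R * pR), meanW = sum(W * pW))
c(medR = R[which(cumsum(pR) >= 0.5)[1]], medW = W[which(cumsum(pW) >= 0.5)[1]])
```

Rerunning with larger values of `Rmax` and `Wmax` is a quick way to see how much the summaries depend on the truncation as well as on the prior.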

However, silly me missed one word in the riddle, namely that R and W were equal… With a proper prior in 1/R², the posterior mean is 42.2 (unstable) and the posterior median 20. While an improper prior in 1/R leads to a posterior mean of 133.7 and a posterior median of 72. However, since the posterior mean increases with the number of values of R for which the posterior is computed, it may be that this mean does not exist!
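A short sketch of this instability, assuming the equal-count likelihood obtained by setting W=R in the formula above. The helper names `Leq` and `postmean` and the truncation values are mine. Since this likelihood converges to the positive constant 2⁻¹⁹ as R grows, the posterior tail behaves like the prior, which would explain why the truncated mean keeps growing under both priors.

```r
# Log-likelihood when the urn holds R red and R white balls (W = R)
Leq <- function(R) 2 * lfactorial(R) + lfactorial(2 * R - 19) -
  lfactorial(R - 8) - lfactorial(R - 11) - lfactorial(2 * R)

# Posterior mean of R under a prior proportional to R^(-a),
# with the support truncated to 11, ..., Rmax
postmean <- function(Rmax, a) {
  R <- 11:Rmax
  w <- exp(Leq(R) - a * log(R))  # unnormalised posterior weights
  sum(R * w) / sum(w)
}

# Hypothetical truncation points
Rmax <- 10^(2:5)
m2 <- sapply(Rmax, postmean, a = 2)  # prior in 1/R^2
m1 <- sapply(Rmax, postmean, a = 1)  # prior in 1/R
rbind(Rmax, m2, m1)
```

Both rows of means increase with `Rmax`, the 1/R² one slowly (roughly like a logarithm) and the 1/R one much faster.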

## RSS 2022 Honours

Posted in pictures, Statistics, University life on March 21, 2022 by xi'an

## capture-recapture rediscovered

Posted in Books, Statistics on March 2, 2022 by xi'an

A recent Science paper applies capture-recapture to estimating how much medieval literature has been lost, using ancient lists of works and comparing with the currently known corpus. To arrive at a 91% loss. Which begets the next question of how many ancient lists have been lost! Or how many of the observed ones are sheer copies of the others. At first I thought I had no access to the paper and so could not comment on the specific data or on how the uneven and non-random sampling behind this modelling is accounted for… But I still would not share the anti-modelling bias of this Harvard historian, given the superlative record of Anne Chao in capture-recapture methodology!

“The paper seems geared more toward systems theorists and statisticians, says Daniel Smail, a historian at Harvard University who studies medieval social and cultural history, and the authors haven’t done enough to establish why cultural production should follow the same rules as life systems. But for him, the bigger question is: Given that we already have catalogs of ancient texts, and previous estimates were pretty close to the model’s new one, what does the new work add?”

Once at Ca’Foscari, I realised the local network gave me access to the paper. The description of the Chao1 method, as far as I can tell, does not describe how the problematic collection of catalogs where duplicates (recaptures) can be observed is taken into account. For one thing, the collection is far from iid since some catalogs must have built on earlier ones. It is also surprising imho that the authors spend space on discussing unbiasedness when a more crucial issue is the randomness assumption behind the collected data.
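For reference, a minimal sketch of the standard Chao1 estimator, which only uses the numbers of items seen exactly once and exactly twice. The function name `chao1` and the toy counts are hypothetical and unrelated to the data of the paper; the sketch also takes the counts as if they were a random sample, which is precisely the assumption questioned above.

```r
# Chao1 lower-bound estimate of the total number of distinct items,
# where counts[i] is the number of times item i was observed (all >= 1)
chao1 <- function(counts) {
  S_obs <- length(counts)   # number of distinct items observed
  f1 <- sum(counts == 1)    # items seen exactly once
  f2 <- sum(counts == 2)    # items seen exactly twice
  if (f2 > 0) {
    S_obs + f1^2 / (2 * f2)
  } else {
    S_obs + f1 * (f1 - 1) / 2  # bias-corrected form when f2 = 0
  }
}

# Hypothetical toy counts, for illustration only
counts <- c(1, 1, 1, 1, 2, 2, 3, 5)
est <- chao1(counts)
loss <- 1 - length(counts) / est  # estimated fraction of unseen items
c(estimate = est, loss = loss)
```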

## how to count excess deaths?

Posted in Books, Kids, pictures, Statistics on February 17, 2022 by xi'an

Another terrible graph from Nature… With vertical bars meaning nothing. Nothing more than the list of three values and both confidence intervals. But the associated article is quite interesting in investigating the difficulties in assessing the number of deaths due to COVID-19, when official death statistics are (almost) as shaky as the official COVID-19 deaths. Even in countries with sound mortality statistics and trustworthy official statistics institutes. This article contrasts the prediction models run by the Institute for Health Metrics and Evaluation and by The Economist. The latter being a machine-learning prediction procedure based on a large number of covariates. Without looking under the hood, it is unclear to me how poor entries across the array of covariates can be corrected to return a meaningful prediction. It is also striking that the model predicts far fewer excess deaths than deaths attributed to COVID-19 in a developed country like Japan. Survey methods are briefly mentioned at the end of the article, with interesting attempts to use satellite images of burial grounds, but no further techniques like capture-recapture or record linkage and entity resolution.