Archive for capture-recapture

Bayes in Riddler mode

Posted in Books, Kids, R, Statistics with tags , , , , , on July 7, 2022 by xi'an

A very classical (textbook) question on the Riddler on inferring the contents of an urn from an Hypergeometric experiment:

You have an urn with N  red and white balls, but you have no information about what N might be. You draw n=19 balls at random, without replacement, and you get 8 red balls and 11 white balls. What is your best guess for the original number of balls (red and white) in the urn?

With therefore a likelihood given by


leading to a simple posterior derivation when choosing a 1/RW improper prior. That can be computed for a range of integer values of R and W:


and produces a posterior mean of 99.1 for R and of 131.2 for W, or a posterior median of 52 for R and 73 for W. And to the above surface for the log-likelihood. Which is unsurprisingly maximal at (8,11). The dependence on the prior is of course significant!

However silly me missed one word in the riddle, namely that R and W were equal… With a proper prior in 1/R², the posterior mean is 42.2 (unstable) and the posterior median 20. While an improper prior in 1/R leads to a posterior mean of 133.7 and a posterior median of 72. However, since the posterior mean increases with the number of values of R for which the posterior is computed, it may be that this mean does not exist!

RSS 2022 Honours

Posted in pictures, Statistics, University life with tags , , , , , , , , , , , , , , on March 21, 2022 by xi'an

capture-recapture rediscovered

Posted in Books, Statistics with tags , , , , , , , , , , , , on March 2, 2022 by xi'an

A recent Science paper applies capture-recapture to estimating how much medieval literature has been lost, using ancient lists of works and comparing with the currently know corpus. To deduce at a 91% loss. Which begets the next question of how many ancient lists have been lost! Or how many of the observed ones are sheer copies of the others. First I thought I had no access to the paper so could not comment on the specific data and accounting for the uneven and unrandom sampling behind this modelling… But I still would not share the anti-modelling bias of this Harvard historian, given the superlative record of Anne Chao in capture-recapture methodology!

“The paper seems geared more toward systems theorists and statisticians, says Daniel Smail, a historian at Harvard University who studies medieval social and cultural history, and the authors haven’t done enough to establish why cultural production should follow the same rules as life systems. But for him, the bigger question is: Given that we already have catalogs of ancient texts, and previous estimates were pretty close to the model’s new one, what does the new work add?”

Once at Ca’Foscari, I realised the local network gave me access to the paper. The description of the Chao1 method, as far as I can tell, does not describe how the problematic collection of catalogs where duplicates (recaptures) can be observed is taken into account. For one thing, the collection is far from iid since some catalogs must have built on earlier ones. It is also surprising imho that the authors spend space on discussing unbiasedness when a more crucial issue is the randomness assumption behind the collected data.

how to count excess deaths?

Posted in Books, Kids, pictures, Statistics with tags , , , , , , , , , , , , , , , on February 17, 2022 by xi'an

Another terrible graph from Nature… With vertical bars meaning nothing. Nothing more than the list of three values and both confidence intervals. But the associated article is quite interesting in investigating the difficulties in assessing the number of deaths due to COVID-19, when official death statistics are (almost) as shaky as the official COVID-19 deaths. Even in countries with sound mortality statistics and trustworthy official statistics institutes. This article opposes prediction models run by the Institute for Health Metrics and Evaluation and The Economist. The later being a machine-learning prediction procedure based on a large number of covariates. Without looking under the hood, it is unclear to me how poor entries across the array of covariates can be corrected to return a meaningful prediction. It is also striking that the model predicts much less excess deaths than those due to COVID-19 in a developed country like Japan. Survey methods are briefly mentioned at the end of the article, with interesting attempts to use satellite images of burial grounds, but no further techniques like capture-recapture or record linkage and entity resolution.

Measuring abundance [book review]

Posted in Books, Statistics with tags , , , , , , , , , , , , on January 27, 2022 by xi'an

This 2020 book, Measuring Abundance:  Methods for the Estimation of Population Size and Species Richness was written by Graham Upton, retired professor of applied statistics, for the Data in the Wild series published by Pelagic Publishing, a publishing company based in Exeter.

“Measuring the abundance of individuals and the diversity of species are core components of most ecological research projects and conservation monitoring. This book brings together in one place, for the first time, the methods used to estimate the abundance of individuals in nature.”

Its purpose is to provide a collection of statistical methods for measuring animal abundance or lack thereof. There are four parts: a primer on statistical methods, going no further than maximum likelihood estimation and bootstrap. The term Bayesian only occurs once, in connection with the (a-Bayesian) BIC. (I first spotted a second entry, until I realised this was not a typo and the example truly was about Bawean warty pigs!) The second part is about stationary (or static) individuals, such as trees, and it mostly exposes different recognised ways of sampling, with a focus on minimising the surveyor’s effort. Examples include forestry sampling (with a chainsaw method!) and underwater sampling. There is very little statistics involved in this part apart from the rare appearance of a MLE with an asymptotic confidence interval. There is also very little about misspecified models, except for the occasional warning that the estimates may prove completely wrong. The third part is about mobile individuals, with capture-recapture methods receiving the lion’s share (!). No lion was actually involved in the studies used as examples (but there were grizzly bears from Yellowstone and Banff National Parks). Given the huge variety of capture-recapture models, very little input is found within the book as the practical aspects are delegated to R software like the RMark and mra packages. Very little is written on using covariates or spatial features in such models, mostly dedicated to printed output from R packages with AIC as the sole standard for comparing models. I did not know of distance methods (Chapter 8), which are less invasive counting methods. They however seem to rely on a particular model of missing on individuals as the distance increases. The last section is about estimating the number of species. With again a model assumption that may prove wrong. With the inclusion of diversity measures,

The contents of the book are really down to earth and intended for field data gatherers. For instance, “drive slowly and steadily at 20 mph with headlights and hazard lights on ” (p.91) or “Before starting to record, allow fish time to acclimatize to the presence of divers” (p.91). It is unclear to me how useful the book would prove to be for general statisticians, apart from revealing the huge diversity of methods actually employed in the field. To either build upon these or expose students to their reassessment. More advanced books are McCrea and Morgan (2014), Buckland et al. (2016) and the most recent Seber and Schofield (2019).

[Disclaimer about potential self-plagiarism: this post or an edited version will eventually appear in my Book Review section in CHANCE.]

%d bloggers like this: