Bayes for good

Posted in Books, Mountains, pictures, Running, Statistics, Travel, University life on November 27, 2018 by xi'an

A very special weekend workshop on Bayesian techniques used for social good, in many different senses (and talks), that we organised with Kerrie Mengersen and Pierre Pudlo at CiRM, Luminy, Marseilles. It started with Rebecca (Beka) Steorts (Duke) explaining [by video from Duke] how the Syrian war deaths were processed to eliminate duplicates, to be continued on Monday at the “Big” conference. Then came Alex Volfovsky (Duke) on a Twitter experiment on whether being exposed to adverse opinions is depolarising (not!) or further polarising (yes), turning into network causal analysis. And then Kerrie Mengersen (QUT) on the use of Bayesian networks in ecology, through observational studies she conducted. And the role of neutral statisticians in case of adversarial experts!

Next day, the first talk was by David Corliss (Peace-Work), who writes the Stats for Good column in CHANCE and here gave a recruiting spiel for volunteering in good initiatives. Quoting Florence Nightingale as the “first” volunteer. And presenting a broad collection of projects in support of his recommendations for “doing good”. We then heard [by video] Julien Cornebise from Element AI in London telling of his move out of DeepMind towards investing in social impact projects through this new startup. Including working with Amnesty International on Darfur village destructions, building evidence from satellite imaging. And crowdsourcing. With an upcoming report on the year's activities (still under embargo). A most exciting and enthusiastic talk!

Hamiltonian MC on discrete spaces

Posted in Statistics, Travel, University life on July 3, 2017 by xi'an

Following a lively discussion with Akihiko Nishimura during a BNP11 poster session last Tuesday, I took the opportunity of the flight to Montréal to read through the arXived paper (written jointly with David Dunson and Jianfeng Lu). The issue is thus one of handling discrete valued parameters in Hamiltonian Monte Carlo. The basic “trick” in handling this complexity goes by turning the discrete support into a continuous one via the inclusion of an auxiliary continuous variable whose discretisation is the discrete parameter, hence resembling to some extent the slice sampler. This removes the discreteness blockage but creates another difficulty, namely handling a discontinuous target density. (I idly wonder why the trick cannot be iterated to second or higher order so as to achieve the right amount of smoothness. Of course, the maths behind would be less cool!) The extension of the Hamiltonian to this setting by a convolution is a trick I had not seen since the derivation of the Central Limit Theorem during Neveu’s course at Polytechnique. What I find most exciting in the resolution is the move from a Gaussian momentum to a Laplace momentum, for the reason that I always wondered about alternatives [without trying anything myself!]. The Laplace version is indeed most appropriate here in that it avoids a computation of all discontinuity points and associated values along a trajectory. Since the moves are done component-wise, the method has a Metropolis-within-Gibbs flavour, which actually happens to be a special case. What is also striking is that the approach is both rejection-free and exact, provided ergodicity occurs, which is the case when the stepsize is random.
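To make the mechanics concrete, here is a minimal one-dimensional sketch of a coordinate-wise update with a Laplace momentum, as I understand the idea and not as the authors' code. The assumptions are mine: the discrete parameter n in {1,…,K} is embedded as n = floor(θ) with unit-width intervals, the pmf is a toy example, the stepsize is drawn uniformly on (0.5, 1.5), and all function names are hypothetical. Under a Laplace momentum the velocity is sign(p), so a step either crosses to θ ± ε and pays the potential jump out of |p|, or bounces back by flipping the momentum, and U(θ) + |p| is conserved exactly.

```python
import math
import random


def potential(theta, pmf):
    """U(theta) = -log density of the embedded variable; n = floor(theta) in 1..K."""
    n = math.floor(theta)
    if n < 1 or n > len(pmf):
        return math.inf  # outside the support: infinite potential wall
    return -math.log(pmf[n - 1])


def laplace_momentum(rng):
    """Standard Laplace draw: a random sign times an Exp(1) magnitude."""
    return rng.choice((-1.0, 1.0)) * rng.expovariate(1.0)


def coordinate_step(theta, p, eps, pmf):
    """Move by eps in the direction sign(p) if the kinetic energy |p| can pay
    for the potential jump; otherwise stay put and flip the momentum."""
    sign = 1.0 if p > 0 else -1.0
    proposal = theta + eps * sign
    delta_u = potential(proposal, pmf) - potential(theta, pmf)
    if abs(p) > delta_u:
        return proposal, p - sign * delta_u
    return theta, -p


def sample(pmf, n_iter, n_steps=5, seed=0):
    """Return draws of the discrete parameter n = floor(theta)."""
    rng = random.Random(seed)
    theta = 1.5  # arbitrary start inside the first interval
    draws = []
    for _ in range(n_iter):
        p = laplace_momentum(rng)    # momentum refreshment
        eps = rng.uniform(0.5, 1.5)  # random stepsize (assumed range)
        for _ in range(n_steps):
            theta, p = coordinate_step(theta, p, eps, pmf)
        # energy is conserved exactly, so there is no accept/reject step
        draws.append(math.floor(theta))
    return draws
```

With a single step per iteration, the move is accepted with probability P(|p| > ΔU) = min(1, exp(-ΔU)), which is how the Metropolis special case shows up in this toy version.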

In addition to this resolution of the discrete parameter problem, the paper presents the further appeal of (re-)running an analysis of the Jolly-Seber capture-recapture model. Where the discrete parameter is the latent number of live animals [or whatever] in the system at any observed time. (Which we cover in Bayesian essentials with R as a neat entry to both dynamic and latent variable models.) I would have liked to see a comparison with the completion approach of Jérôme Dupuis (1995, Biometrika), since I figure the Metropolis version implemented here differs from Jérôme’s. The second example is built on Bissiri et al. (2016) surrogate likelihood (discussed earlier here) and Chopin and Ridgway (2017) catalogue of solutions for not analysing the Pima Indian dataset. (Replaced by another dataset here.)

capture-recapture with continuous covariates

Posted in Books, pictures, Statistics, University life on September 14, 2015 by xi'an

This morning, I read a paper by Roland Langrock and Ruth King in a 2013 issue of Annals of Applied Statistics that had gone too far under my desk to be noticed… This problem of using continuous covariates in capture-recapture models is a frustrating one as it is not clear what one should do at times when the subject and therefore its covariates are not observed. This is why I was quite excited by the [trinomial] paper of Catchpole, Morgan, and Tavecchia when they submitted it to JRSS Series B and I was the editor handling it. In the current paper Langrock and King build a hidden Markov model on the capture history (as in Jérôme Dupuis’ main thesis paper, 1995), as well as a discretised Markov chain model on the covariates and a logit connection between those covariates and the probability of capture. (At first, I thought the Markov model was a sheer unconstrained Markov chain on the discretised space and found it curious that increasing the number of states had a positive impact on the estimation but, blame my Métro environment!, I had not read the paper carefully.)

“The accuracy of the likelihood approximation increases with increasing m.” (p.1719)

While I acknowledge that something has to be done about the missing covariates, and that this approach may be the best one can expect in such circumstances, I nonetheless disagree with the above notion that increasing the discretisation level m will improve the likelihood approximation, simply because the model on the covariates that was chosen ex nihilo has no reason to fit the real phenomenon, especially since the value of the covariates impacts the probability of capture: the individuals are not (likely to get) missing at random, i.e., independently from the covariates. For instance, in a lizard study on which Jérôme Dupuis worked in the early 1990’s, weight and survival were unsurprisingly connected, with a higher mortality during the cold months when food was sparse. Using autoregressive-like models on the covariates misses the possibility of sudden changes in the covariates that could impact the capture patterns. I do not know whether or not this has been attempted in this area, but connecting the covariates between individuals at a specific time, so that missing covariates can be inferred from observed covariates, possibly with spatial patterns, would also make sense.

In fine, I fear there is a strong and almost damning limitation to the notion of incorporating covariates into capture-recapture models, namely, if a covariate is decisive in determining a capture or non-capture, the non-capture range of the covariate will never be observed and hence cannot be derived from the observed values.

capture mark recapture with no mark and no recapture [aka 23andmyfish]

Posted in Mountains, pictures, Statistics, Travel, University life on June 11, 2015 by xi'an

A very exciting talk at NBBC15 here in Reykjavik was delivered by Mark Bravington yesterday on Close-kin mark recapture by modern magic (!). Although Mark is from Australia, being a Hobart resident does qualify him for the Nordic branch of the conference! The exciting idea is to use genetic markers to link catches in a (fish) population as being related as parent-offspring or as siblings. This sounds like science-fantasy when you first hear of it!, but it is actually working better than standard capture-mark-recapture methods for populations of a certain size (so that the chances to find related animals are not the absolute zero!, as with, e.g., krill populations). The talk was focussed on bluefin tuna, whose survival is unlikely under the current fishing pressure… Among the advantages, a much more limited impact of the capture on the animal, since only a small amount of genetic material is needed, no tag loss, tag destruction by hunters, or tag impact on the animal survival, no recapture, a unique identification of each animal, and the potential for a detailed amount of information through the genetic record. Ideally, the entire sample could lead to a reconstruction of its genealogy all the way to the common ancestor, a wee bit like what 23andme proposes for humans, but this remains at the science-fantasy level given what is currently known about the fish species genomes.

Feller’s shoes and Rasmus’ socks [well, Karl’s actually…]

Posted in Books, Kids, R, Statistics, University life on October 24, 2014 by xi'an

Yesterday, Rasmus Bååth [of puppies’ fame!] posted a very nice blog using ABC to derive the posterior distribution of the total number of socks in the laundry when only pulling out orphan socks and no pair at all in the first eleven draws. Maybe not the most pressing issue for Bayesian inference in the era of Big data but still a challenge of sorts!

Rasmus set a prior on the total number m of socks, a negative Binomial Neg(15,1/3) distribution, and another prior on the proportion of socks that come by pairs, a Beta B(15,2) distribution, then simulated pseudo-data by picking eleven socks at random, and at last applied ABC (in Rubin’s 1984 sense) by waiting for the observed event, i.e. only orphans and no pair [of socks]. Brilliant!
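For readers who want to replay the scheme, here is a rough Python sketch of this exact-match ABC, which is my reading of the description above and not Rasmus' own (R) code. The assumptions are mine: Neg(15,1/3) is taken as the number of failures before 15 successes with success probability 1/3 (prior mean 30), the number of pairs is the rounded product of the Beta draw and ⌊m/2⌋, and the function names are hypothetical.

```python
import math
import random


def negative_binomial(rng, size, prob):
    """Failures before `size` successes, as a sum of `size` geometric draws."""
    total = 0
    for _ in range(size):
        u = 1.0 - rng.random()  # uniform on (0, 1]
        total += int(math.log(u) / math.log(1.0 - prob))
    return total


def simulate_unique_draw(rng, n_picked=11):
    """Draw (m, proportion) from the prior, pick socks, report if no pair shows up."""
    m = negative_binomial(rng, 15, 1 / 3)
    prop_pairs = rng.betavariate(15, 2)
    n_pairs = round((m // 2) * prop_pairs)  # assumed rounding rule
    n_odd = m - 2 * n_pairs
    # label each sock by a pair id; orphans get an id of their own
    socks = [i for i in range(n_pairs) for _ in range(2)]
    socks += list(range(n_pairs, n_pairs + n_odd))
    if len(socks) < n_picked:
        return m, False  # cannot even pick eleven socks
    picked = rng.sample(socks, n_picked)
    return m, len(set(picked)) == n_picked


def abc_posterior(n_sims=20000, seed=1):
    """Keep the prior draws of m whose pseudo-data reproduce the observed event."""
    rng = random.Random(seed)
    sims = (simulate_unique_draw(rng) for _ in range(n_sims))
    return [m for m, no_pair in sims if no_pair]
```

The list returned by `abc_posterior()` is a sample from the posterior on m given eleven socks with no pair, since the acceptance requires an exact match with the observed event.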

The overall simplicity of the problem set me wondering about an alternative solution using the likelihood. Cannot be that hard, can it?! After a few computations rejected by opposing them to experimental frequencies, I put the problem on hold until I was back home and with access to my Feller volume 1, one of the few [math] books I keep at home… As I was convinced one of the exercises in Chapter II would cover this case. After checking, I found a partial solution, namely Exercise 26:

A closet contains n pairs of shoes. If 2r shoes are chosen at random (with 2r<n), what is the probability that there will be (a) no complete pair, (b) exactly one complete pair, (c) exactly two complete pairs among them?

This is not exactly a solution, but rather a problem, however it leads to the value

$p_j=\binom{n}{j}2^{2r-2j}\binom{n-j}{2r-2j}\Big/\binom{2n}{2r}$

as the probability of obtaining j pairs among those 2r shoes. Which also works for an odd number t of shoes:

$p_j=2^{t-2j}\binom{n}{j}\binom{n-j}{t-2j}\Big/\binom{2n}{t}$

as I checked against my large simulations. So I solved Exercise 26 in Feller volume 1 (!), but not Rasmus’ problem, since there are those orphan socks on top of the pairs. If one draws 11 socks out of m socks made of f orphans and g pairs, with f+2g=m, the number k of socks from the orphan group is a hypergeometric H(11,m,f) rv and the probability to observe 11 orphan socks total (either from the orphan or from the paired groups) is thus the marginal over all possible values of k:

$\sum_{k=0}^{11} \dfrac{\binom{f}{k}\binom{2g}{11-k}}{\binom{m}{11}}\times\dfrac{2^{11-k}\binom{g}{11-k}}{\binom{2g}{11-k}}$

so it could be argued that we are facing a closed-form likelihood problem. Even though it presumably took me longer to achieve this formula than for Rasmus to run his exact ABC code!
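As a sanity check on the formulas above, here is a short Python sketch using exact rational arithmetic, with function names of my own choosing. It evaluates p_j for t shoes drawn from n pairs and the marginal probability of no pair among t socks drawn from f orphans and g pairs, and compares the latter with a brute-force enumeration that is only feasible for tiny values of m. It assumes t ≤ 2n and t ≤ m respectively.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def prob_pairs(n, t, j):
    """P(exactly j complete pairs) when t shoes are drawn from n pairs (t <= 2n)."""
    if t - 2 * j < 0:
        return Fraction(0)
    numerator = 2 ** (t - 2 * j) * comb(n, j) * comb(n - j, t - 2 * j)
    return Fraction(numerator, comb(2 * n, t))


def prob_no_pair(f, g, t=11):
    """P(no complete pair) when t socks are drawn from f orphans and g pairs."""
    m = f + 2 * g
    # k socks come from the orphans, t - k from distinct pairs (2 choices each)
    total = sum(comb(f, k) * 2 ** (t - k) * comb(g, t - k) for k in range(t + 1))
    return Fraction(total, comb(m, t))


def brute_force_no_pair(f, g, t):
    """Same probability by enumerating every subset of t socks (small m only)."""
    socks = [("pair", i) for i in range(g) for _ in range(2)]
    socks += [("orphan", i) for i in range(f)]
    subsets = list(combinations(socks, t))
    hits = sum(1 for s in subsets if len(set(s)) == t)
    return Fraction(hits, len(subsets))
```

For instance, `prob_no_pair(3, 3, 3)` and `brute_force_no_pair(3, 3, 3)` both return 3/4, since 21 of the 84 possible triplets contain one of the three pairs.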

capture-recapture homeless deaths

Posted in Statistics, Travel, University life on August 28, 2014 by xi'an

In the newspaper I grabbed in the corridor to my plane today (flying to Bristol to attend the SuSTaIn image processing workshop on “High-dimensional Stochastic Simulation and Optimisation in Image Processing” where I was kindly invited and most readily accepted the invitation), I found a two-page entry on estimating the number of homeless deaths using capture-recapture. Besides the sheer concern about the very high mortality rate among homeless persons (expected lifetime, 48 years; around 7000 deaths in France between 2008 and 2010) and the dreadful realisation that there is an increasing number of kids dying in the streets, I was obviously interested in this use of capture-recapture methods as I had briefly interacted with researchers from INED working on estimating the number of (living) homeless persons about 15 years ago. Glancing at the original paper once I had landed, there was alas no methodological innovation in the approach, which was based on the simplest maximum likelihood estimate. I wonder whether or not more advanced models and [Bayesian] methods of inference could [or should] be used on such data. Like introducing covariates in the process. For instance, when conditioning the probability of (cross-)detection on the cause of death.
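As an illustration only, and assuming that the “simplest” estimate is the classical two-list (Lincoln-Petersen) one, which the paper may or may not have used in this exact form, the computation fits in a few lines of Python. The function name and the counts in the usage note are hypothetical, not figures from the study.

```python
def lincoln_petersen(n1, n2, m):
    """Two-list capture-recapture estimate of a population size.

    n1, n2: number of individuals on the first and on the second list
    m: number of individuals appearing on both lists
    """
    if m <= 0:
        raise ValueError("the estimate needs at least one individual on both lists")
    if m > min(n1, n2):
        raise ValueError("the overlap cannot exceed the size of either list")
    # equate the recapture rate m / n2 with the marked fraction n1 / N
    return n1 * n2 / m
```

With made-up counts, `lincoln_petersen(100, 80, 20)` returns 400.0: if a fifth of the first list is found again in the second one, the second list is taken to cover a fifth of the whole population. Conditioning the detection probability on a covariate such as the cause of death is exactly what this simple version cannot do.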

JSM 2014, Boston [#4]

Posted in Books, Statistics, Travel, University life on August 9, 2014 by xi'an

Last and final day and post at and about JSM 2014! It is very rare that I stay till the last day and it is solely due to family constraints that I attended the very last sessions. It was a bit eerie, walking through the huge structure of the Boston Convention Centre that could easily house several A380s and meeting a few souls dragging a suitcase to the mostly empty rooms… Getting scheduled on the final day of the conference is not the nicest thing and I offer my condolences to all speakers ending up speaking today! Including my former Master student Anne Sabourin.

I first attended the Frontiers of Computer Experiments: Big Data, Calibration, and Validation session with a talk by David Higdon on the extrapolation limits of computer models, a talk that linked very nicely with Stephen Stigler’s Presidential Address and stressed the need for incorporating the often neglected fact that models are not reality. Jared Niemi also presented an approximate way of dealing with Gaussian process modelling for large datasets. It was only natural to link this talk with David’s and wonder about the extrapola-bility of the modelling and the risk of over-fitting and the potential for detecting sudden drops in the function.

The major reason why I made the one-hour trip back to the Boston Convention Centre was however the Human Rights Violations: How Do We Begin Counting the Dead? session. It was both of direct interest to me as I had wondered in the past days about statistically assessing the number of political kidnappings and murders in Eastern Ukraine. And of methodological relevance, as the techniques were connected with capture-recapture and random forests. And of close connections with two speakers who alas could not make it and were replaced by co-authors. The first talk by Samuel Ventura considered ways of accelerating the comparison of entries into multiple lists for identifying unique individuals, with the open methodological question of handling populations of probabilities. As the outcome of random forests. My virtual question related to this talk was why the causes for duplications and errors in the record were completely ignored. At least in the example of the Syrian deaths, some analysis could be conducted on the reasons for differences in the entries. And maybe a prior model constructed. The second talk by Daniel Manrique-Vallier was about using non-parametric capture-recapture to count the number of dead from several lists. Once again bypassing the use of potential covariates for explaining the differences. As I noticed a while ago when analysing the population of (police) captured drug addicts in Greater Paris, the prior modelling has a strong impact on the estimated population. Another point I would have liked to discuss was the repeated argument that Arabic (script?) made the identification of individuals more difficult: my naïve reaction was to wonder whether or not this was due to the absence of fluent Arabic speakers in the team. Who could have further helped to build a model on the potential alternative spellings and derivations of Arabic names. But I may have missed more subtle difficulties.