graph of the day & AI4good versus AI4bad

Apart from the above graph from Nature, rendering in a most appalling and meaningless way the uncertainty about the number of active genes in the human genome, I read a couple of articles in this issue of Nature relating to the biases and dangers of societal algorithms. One of which sounded very close to the editorial in the New York Times on which Kristian Lum commented on this blog. With the attached snippet on what is fair and unfair (or not).

The second article was more surprising as it defended the use of algorithms for more democracy. Nothing less. Written by Wendy Tam Cho, professor of political science, law, statistics, and mathematics at UIUC, it argued that the software that she develops to construct electoral maps produces fair maps. Which sounds over-rosy imho, as aiming to account for all social, ethnic, income, &tc., groups, i.e., most of the axes that define a human, is meaningless, if only because the structure of these groups is not frozen in time. To state that “computers are impervious to the lure of power” is borderline ridiculous, as computers and algorithms are [so far] driven by humans. This is not to say that gerrymandering should not be fought by technological means, especially and obviously by open source algorithms, as existing proposals (discussed here) demonstrate, but to entertain the notion of a perfectly representative redistricting is not only illusory, but also far from democratic as it shies away from the one person, one vote principle at the basis of democracy. And the paper leaves us in the dark as to who will decide which groups or which characteristics need be represented in the votes. Of course, this is the impression obtained by reading a one page editorial in Nature [in an overcrowded and sweltering commuter train] rather than the relevant literature. Nonetheless, I remain puzzled at why this editorial was ever published. (Speaking of democracy, the issue also contains warning reports about Hungary’s ultra-right government taking over the Hungarian Academy of Sciences.)

postdoc on Bayesian computation for statistical genomics

[An opportunity to work with Richard Everitt in Reading, UK, in a postdoc position starting this summer]

It is now possible to retrieve the complete DNA sequence of a bacterial strain relatively quickly and cheaply, and population genetics has been revolutionised in the past ten years through the availability of these data. To gain a deep understanding of sequence data, model-based statistical techniques are required. However, current approaches for performing inference in these models do not scale to whole genome sequence data. The BBSRC project “Understanding recombination through tractable statistical analysis of whole genome sequences” aims to address this issue. A position as Post-Doctoral Research Assistant is available on this project, supervised by Dr Richard Everitt in the Statistics group at the Department of Mathematics & Statistics at the University of Reading.

The deadline for applications is March 31, 2016 (details).

Cancún, ISBA 2014 [day #1]

sunrise in Cancún, July 15, 2014

The first full day of talks at ISBA 2014, Cancún, was full of goodies, from the three early talks on specifically developed software, including one by Daniel Lee on STAN that complemented the one given by Bob Carpenter a few weeks ago in Paris (which gives me the opportunity to advertise STAN tee-shirts!), to the poster session (which just started a wee bit late for my conference sleep pattern!). Sylvia Richardson gave an impressive lecture full of information on Bayesian genomics. I also enjoyed very much two sessions with young Bayesian statisticians, one on Bayesian econometrics and the other one more diverse and sponsored by ISBA. Overall, and this also applies to the programme of the following days, I found that the proportion of non-parametric talks was quite high this year, possibly signalling a switch in the community and the interest of Bayesians. And conversely very few talks on computing-related issues. (With most scheduled after my early departure…)

In the first of those sessions, Brendan Kline talked about partially identified parameters, a topic quite close to my interests, although I did not buy the overall modelling adopted in the analysis. For instance, Brendan Kline presented the example of a parameter θ that is the expectation of a random variable Y which is only indirectly observed through x̲ < Y < x̅. While he maintained that inference should be restricted to an interval around θ and that using a prior on θ was doomed to fail (and against econometrics culture), I would have preferred to see this example as a missing data one, with both x̲ and x̅ containing information about θ. I also somewhat object to the argument against the prior, as it would equally apply to any prior modelling. Although unrelated in theme, Angela Bitto presented a work on the impact of different prior modellings on the estimation of time-varying parameters in time-series models, à la Harrison and West 1994, discriminating between good and poor shrinkage in a way I could not spot. Unless it was based on the data fit (horror!). And a third talk of interest was by Andriy Norets, who (very loosely) related to Angela’s talk by presenting a framework to modify credible sets towards frequentist properties: one example was the credible interval on a positive normal mean that led to a frequency-valid confidence interval with a modified prior. This reminded me very much of the shrinkage confidence intervals of the James-Stein era.
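To make the missing data reading of the interval example concrete, here is a minimal sketch of a data augmentation Gibbs sampler. Everything specific in it is an assumption of mine and not taken from the talk: Y is modelled as N(θ, 1), θ gets a flat prior, and the function names are hypothetical. Each latent Y is drawn from its normal distribution truncated to the observed interval, then θ is drawn from its conditional given the completed data.

```python
import random
from statistics import NormalDist, fmean


def sample_truncated_normal(mean, lower, upper, rng):
    """Draw from N(mean, 1) restricted to (lower, upper) by inverting the CDF."""
    dist = NormalDist(mean, 1.0)
    u = rng.uniform(dist.cdf(lower), dist.cdf(upper))
    # Guard against u hitting 0 or 1, where inv_cdf is undefined.
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    # Clamp so the draw stays inside the observed bounds after the guard.
    return min(max(dist.inv_cdf(u), lower), upper)


def gibbs_interval_censored(lowers, uppers, n_iter=2000, burn_in=500, seed=0):
    """Posterior draws of theta when each Y_i is only known to lie in an interval.

    Assumed model (illustrative only): Y_i ~ N(theta, 1), flat prior on theta.
    """
    rng = random.Random(seed)
    n = len(lowers)
    # Start from the average interval midpoint.
    theta = fmean((lo + hi) / 2 for lo, hi in zip(lowers, uppers))
    draws = []
    for it in range(n_iter):
        # Step 1: impute the missing Y_i given theta and the bounds.
        ys = [sample_truncated_normal(theta, lo, hi, rng)
              for lo, hi in zip(lowers, uppers)]
        # Step 2: theta | completed data ~ N(mean(ys), 1/n) under the flat prior.
        theta = rng.gauss(fmean(ys), (1.0 / n) ** 0.5)
        if it >= burn_in:
            draws.append(theta)
    return draws


if __name__ == "__main__":
    # Fifty observations, all only known to lie between 1 and 3.
    draws = gibbs_interval_censored([1.0] * 50, [3.0] * 50)
    print("posterior mean of theta:", fmean(draws))
```

With identical bounds (1, 3) for every observation, the posterior should concentrate around 2 by symmetry. Under a different sampling model or prior the two conditional steps would change, but the structure (impute, then update θ) stays the same.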

PhD+postdoc grant on ABC

I have received the following email announcement about a joint INRA/INRIA PhD grant on statistical methods for high-throughput genomics, backed by an additional two-year postdoc contract:

“Identifying signatures of selection in high-throughput genomic data: development of models and statistical analysis methods”.
The rapid development of high-throughput sequencing and genotyping technologies now makes it possible to produce very large quantities of genetic polymorphism data at the population scale, including in “non-model” species. In this context, the search for molecular markers carrying signatures of selection is essential to understanding the dynamics of adaptation. This thesis will therefore aim to develop innovative statistical analysis methods to characterise the typology of genetic markers with respect to their evolutionary status. These methods will be developed in a Bayesian framework and will focus on the associated stochastic tools (MCMC methods, and the ABC approach when the likelihood is not available) and on variable selection techniques.

It involves my friend and coauthor Gilles Celeux (Paris Sud, Orsay) as one of the advisors, as well as two researchers from the place that taught me everything about ABC, the INRA CBGP (Centre de Biologie et de Gestion des Populations) lab in Montpellier: Mathieu Gautier and Renaud Vitalis. It is thus a highly interesting proposal whose deadline is April 23.

JSM 2009 impressions [day 4]

A very full day today, where I wish I could have been ubiquitous…! I first attended the particle learning session, and thus missed both Gabor Lugosi’s Medallion lecture and the memorial session for David Freedman. The particle learning session had several interesting talks, among which Raquel Prado’s with informed priors about roots in an AR model and Christian Macaro’s on an innovative construction of mixtures of AR chains as volatilities to overcome the difficulty in handling long memory processes. I then chaired the session organised by Julien Cornebise on population Monte Carlo, a quite exciting and well-attended session, where I found the results of Mark Huber on the product estimator to offer some strong potential to study nested sampling. This means I missed Charlie Geyer’s talk, among others. The afternoon session was where I talked, along with Jun Liu and Simon Tavaré, who both gave talks full of exciting directions in connection with genomics. The planning was so horrendous that both Gareth Roberts and Judea Pearl were giving special invited lectures at the same time, not to mention four Bayesian sessions in parallel… The day ended with the COPSS awards, among which the Florence Nightingale David Award went to Nancy Reid for being a role model in the profession, a well-deserved recognition indeed!