Archive for Australia
My friend and co-author of many years, Kerrie Mengersen, just received the 2016 Pitman Medal, which is the prize of the Statistical Society of Australia. Congratulations to Kerrie for a well-deserved reward for her massive contributions to Australian, Bayesian, computational, modelling statistics, and to data science as a whole. (In case you wonder about the picture above, she has not yet lost the medal, but is instead looking for jaguars in the Amazon.)
This medal is named after EJG Pitman, Australian probabilist and statistician, whose name is attached to an estimator, a lemma, a measure of efficiency, a test, and a measure of comparison between estimators. His estimator is the best equivariant (or invariant) estimator, which can be expressed as a Bayes estimator under the relevant right Haar measure, despite having no Bayesian motivation to start with. His lemma is the Pitman-Koopman-Darmois lemma, which states that outside exponential families, sufficiency is essentially useless (except for exotic distributions like the Uniform distributions). Darmois published the result first in 1935, but in French in the Comptes Rendus de l’Académie des Sciences. And the measure of comparison is Pitman nearness or closeness, on which I wrote a paper with my friends Gene Hwang and Bill Strawderman, a paper that we thought was the final paper on the measure as it pointed out several major deficiencies with this concept. But the literature continued to grow after that..!
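To make the connection between the Pitman estimator and a Bayes estimator concrete, here is the simplest case written out, under assumptions of my own choosing: a pure location model with observations x_1, ..., x_n drawn from a density f(x - θ), and squared error loss. The symbols f, x_i, θ and δ are notation introduced for this illustration. In that case the Haar measure on the translation group is the Lebesgue measure (left and right versions coincide), and the best equivariant estimator is the posterior mean under the flat prior:

```latex
\delta^{\mathrm{P}}(x_1,\ldots,x_n)
  = \frac{\displaystyle\int_{\mathbb{R}} \theta \,\prod_{i=1}^{n} f(x_i-\theta)\,\mathrm{d}\theta}
         {\displaystyle\int_{\mathbb{R}} \prod_{i=1}^{n} f(x_i-\theta)\,\mathrm{d}\theta}
  = \mathbb{E}^{\pi}\!\left[\theta \mid x_1,\ldots,x_n\right],
  \qquad \pi(\theta)\propto 1 .
```

For scale or location-scale families the same representation holds with the corresponding right Haar measure and an invariant loss, but the formula above only covers the location case.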
A mostly genetics issue of Nature this week (of October 13), as the journal contains an article on the genomes of 300 individuals from 142 diverse populations across the globe, another one on the genetic history of Aboriginal Australians, plus a third one on 483 individuals from 125 populations drawing genetic space barriers, leading to diverging opinions on the single versus multiple out-of-Africa scenario. As some of these papers are based on likelihood-based techniques, I wish I had more time to explore the statistics behind them. Another paper builds a phylogeny of violence in mammals, with violence rising as one nears the primates. I find the paper most interesting but I am not convinced by the genetic explanation of violence, in particular because it seems hard to believe that data about Palaeolithic, Mesolithic, and Neolithic periods can be that informative about the death rate due to intra-species violence. And to conclude on a “pessimistic” note, the paper that argues there is a maximum lifespan for humans, meaning that the 122 years enjoyed (?) by Jeanne Calment from France may remain a limit. However, the argument seems to be that the observed largest, second largest, etc., ages at death reached a peak in 1997, the year Jeanne Calment died, and have been declining since then. That does not sound super-convincing when considering extreme value theory, since 1997 is the extreme event and thus another extreme event of a similar magnitude is not going to happen immediately after.
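The objection about 1997 being the extreme event can be illustrated with a small simulation. This is a toy sketch under assumptions of my own, not the analysis of the Nature paper: the yearly values are i.i.d. standard Normal draws with no trend whatsoever (a stand-in for yearly maximal ages at death), the series is split at the year of its overall record, and a least-squares line is fitted on each side, the record year being included in both segments. The function names and the settings (40 years, 5000 replications) are hypothetical.

```python
import random


def ols_slope(ys):
    """Least-squares slope of ys regressed on the time index 0, 1, ..., len(ys)-1."""
    n = len(ys)
    t_bar = (n - 1) / 2
    y_bar = sum(ys) / n
    sxy = sum((t - t_bar) * (y - y_bar) for t, y in enumerate(ys))
    sxx = sum((t - t_bar) ** 2 for t in range(n))
    return sxy / sxx


def split_at_record(n_years=40, n_rep=5000, seed=1):
    """Average slopes before and after the record year in trend-free series."""
    rng = random.Random(seed)
    before, after = [], []
    for _ in range(n_rep):
        # Stationary series: no trend, no change point (assumption of this sketch).
        y = [rng.gauss(0.0, 1.0) for _ in range(n_years)]
        k = max(range(n_years), key=y.__getitem__)  # index of the record year
        if k < 1 or k > n_years - 2:
            continue  # need at least two points on each side of the split
        before.append(ols_slope(y[:k + 1]))  # segment ending at the record
        after.append(ols_slope(y[k:]))       # segment starting at the record
    return sum(before) / len(before), sum(after) / len(after)


mean_before, mean_after = split_at_record()
print("average slope up to the record year:", mean_before)
print("average slope from the record year: ", mean_after)
```

Even though the simulated series has no trend at all, choosing the break at the record year produces on average a rise before it and a decline after it, which is why a post-1997 decline is weak evidence of a limit by itself.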
Among several interesting (general public) entries and the fascinating article reconstituting the death of Lucy by a fall from a tree, I spotted in the current Sept. 22 issue of Nature two short summaries involving statistical significance, one in linguistics about repeated (and significant) links between some sounds and some concepts (like ‘n’ and ‘nose’) shared between independent languages, another about the (significant) discovery of a π meson and a K meson. The first anonymous editorial, entitled “Algorithm and blues”, was rather gloomy about the impact of proprietary algorithms on our daily life and on our democracies (or what is left of them), like the reliance on such algorithms to grant loans or to determine the length of a sentence (based on the estimated probability of re-offending). The article called for more accountability of such tools, from going completely open-source to allowing for some form of strong auditing. This reminded me of the current (regional) debate about the algorithm allocating Greater Paris high school students to local universities and colleges based on their grades, wishes, and available positions. The apparent randomness and arbitrariness of those allocations prompted many (parents) to complain about the algorithm and ask for it to be made open. (Besides the pun in the title, the paper also contained a line about “affirmative algorithmic action”!) There was also a perfectly irrelevant tribune from a representative of the Church of England about its desire to give a higher profile to science in the/their church. Whatever. And I also was bemused by a news article on the difficulty of building a genetic map of Aboriginal Australians due to the cultural reticence of Aboriginals about the use of body parts from their communities in genetic research.
While I understand and agree with the concept of data privacy, so as to refrain from exposing personal information, it is much less clear [to me] why data collected a century ago should come under such protections if it does not create a risk of exposing living individuals. It reminded me of this earlier Nature news article about North American Aboriginals claiming rights to an 8,000-year-old skeleton. On a more positive side, this news part also mentioned the first catalogue produced by the Gaia European Space Agency project, from the publication of more than a billion star positions to the open access nature of the database, in that the Gaia team had hardly any prior access to such a wealth of data. A special issue part of the journal was dedicated to the impact of social inequalities in the production of (future) scientists, but this sounds rather shallow, at least at the level of the few pages produced on the topic, and it did not mention a comparison with other areas of society, where they are also most obviously at work!
“This formulation reveals an interesting connection between multiple hypothesis testing and mixture modelling with the class labels corresponding to the accepted hypotheses in each test.”
After my seminar at Monash University last Friday, David Dowe pointed out to me the recent work by Enes Makalic and Daniel Schmidt on minimum description length (MDL) methods for multiple testing as somewhat related to our testing by mixture paper. Work which appeared in the proceedings of the 4th Workshop on Information Theoretic Methods in Science and Engineering (WITMSE-11), which took place in Helsinki, Finland, in 2011. Minimal encoding length approaches lead to choosing the model that enjoys the smallest coding length. Connected with, e.g., Rissanen’s approach. The extension in this paper consists in considering K hypotheses at once on a collection of m datasets (the multiple then bears on the datasets rather than on the hypotheses). And in associating a hypothesis index with each dataset. When the objective function is the sum of (generalised) penalised likelihoods [as in BIC], it leads to selecting the “minimal length” model for each dataset. But the authors introduce weights or probabilities for each of the K hypotheses, which indeed then amounts to a mixture-like representation on the exponentiated codelengths. This estimation by optimal coding was first proposed by Chris Wallace in his book. This approach eliminates the model parameters at an earlier stage, e.g. by maximum likelihood estimation, to return a quantity that only depends on the model index and the data. In fine, the purpose of the method differs from ours in that the former aims at identifying an appropriate hypothesis for each group of observations, rather than ranking those hypotheses for the entire dataset by considering the posterior distribution of the weights, as in the latter. The mixture has somehow more of a substance in the first case, where separating the datasets into groups is part of the inference.
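The mixture-like representation on exponentiated codelengths can be sketched in a few lines. This is a loose illustration under assumptions of my own, not the algorithm of Makalic and Schmidt: the codelengths (in nats, with model parameters already eliminated) are made-up numbers for m = 4 datasets and K = 2 hypotheses, the weights are fitted by a plain EM-type update that I picked for convenience, and the names `responsibilities`, `fit_weights` and `assign` are hypothetical.

```python
import math


def responsibilities(codelengths, weights):
    """Membership probabilities proportional to weights[k] * exp(-codelengths[i][k])."""
    out = []
    for row in codelengths:
        scores = [(math.log(w) if w > 0 else -math.inf) - length
                  for w, length in zip(weights, row)]
        top = max(scores)  # subtract the max for numerical stability
        expd = [math.exp(s - top) for s in scores]
        total = sum(expd)
        out.append([e / total for e in expd])
    return out


def fit_weights(codelengths, n_iter=200):
    """EM-type update of the hypothesis weights (a choice made for this sketch)."""
    k = len(codelengths[0])
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        resp = responsibilities(codelengths, weights)
        weights = [sum(r[j] for r in resp) / len(resp) for j in range(k)]
    return weights


def assign(codelengths, weights):
    """Hypothesis index with the largest membership probability for each dataset."""
    resp = responsibilities(codelengths, weights)
    return [max(range(len(r)), key=r.__getitem__) for r in resp]


# Hypothetical codelengths: one row per dataset, one column per hypothesis.
toy = [[10.0, 12.0], [11.0, 15.0], [9.0, 9.5], [20.0, 14.0]]
uniform_labels = assign(toy, [0.5, 0.5])
fitted = fit_weights(toy)
print(uniform_labels, fitted, assign(toy, fitted))
```

With equal weights, the assignment reduces to picking the shortest codelength for each dataset, as in the penalised likelihood case. Once the weights are fitted, a dataset with nearly tied codelengths can be pulled towards the hypothesis favoured by the other datasets, which is where the mixture structure starts to matter.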