Among several interesting (general public) entries and the fascinating article reconstituting the death of Lucy by a fall from a tree, I spotted in the current Sept. 22 issue of Nature two short summaries involving statistical significance, one in linguistics about repeated (and significant) links between some sounds and some concepts (like ‘n’ and ‘nose’) shared between independent languages, another about the (significant) discovery of a π meson and a K meson. The first anonymous editorial, entitled “Algorithm and blues“, was rather gloomy about the impact of proprietary algorithms on our daily life and on our democracies (or what is left of them), like the reliance on such algorithms to grant loan or determining the length of a sentence (based on the estimated probability of re-offending). The article called for more accountability of such tools, from going completely open-source to allowing for some form of strong auditing. This reminded me of the current (regional) debate about the algorithm allocating Greater Paris high school students to local universities and colleges based on their grades, wishes, and available positions. The apparent randomness and arbitrariness of those allocations prompted many (parents) to complain about the algorithm and ask for its move to the open. (Besides the pun in the title, the paper also contained a line about “affirmative algorithmic action”!) There was also a perfectly irrelevant tribune from a representative of the Church of England about its desire to give a higher profile to science in the/their church. Whatever. And I also was bemused by a news article on the difficulty to build a genetic map of Australia Aboriginals due to cultural reticence of Aboriginals to the use of body parts from their communities in genetic research. While I understand and agree with the concept of data privacy, so that to restrain to expose personal information, it is much less clear [to me] why data collected a century ago should come under such protections if it does not create a risk of exposing living individuals. It reminded me of this earlier Nature news article about North-America Aboriginals claiming right to a 8,000 year old skeleton. On a more positive side, this news part also mentioned the first catalogue produced by the Gaia European Space Agency project, from the publication of more than a billion star positions to the open access nature of the database, in that the Gaia team had hardly any prior access to such wealth of data. A special issue part of the journal was dedicated to the impact of social inequalities in the production of (future) scientists, but this sounds rather shallow, at least at the level of the few pages produced on the topic and it did not mention a comparison with other areas of society, where they are also most obviously at work!
Archive for Australia
“This formulation reveals an interesting connection between multiple hypothesis testing and mixture modelling with the class labels corresponding to the accepted hypotheses in each test.”
After my seminar at Monash University last Friday, David Dowe pointed out to me the recent work by Enes Makalic and Daniel Schmidt on minimum description length (MDL) methods for multiple testing as somewhat related to our testing by mixture paper. Work which appeared in the proceedings of the 4th Workshop on Information Theoretic Methods in Science and Engineering (WITMSE-11), that took place in Helsinki, Finland, in 2011. Minimal encoding length approaches lead to choosing the model that enjoys the smallest coding length. Connected with, e.g., Rissannen‘s approach. The extension in this paper consists in considering K hypotheses at once on a collection of m datasets (the multiple then bears on the datasets rather than on the hypotheses). And to associate an hypothesis index to each dataset. When the objective function is the sum of (generalised) penalised likelihoods [as in BIC], it leads to selecting the “minimal length” model for each dataset. But the authors introduce weights or probabilities for each of the K hypotheses, which indeed then amounts to a mixture-like representation on the exponentiated codelengths. Which estimation by optimal coding was first proposed by Chris Wallace in his book. This approach eliminates the model parameters at an earlier stage, e.g. by maximum likelihood estimation, to return a quantity that only depends on the model index and the data. In fine, the purpose of the method differs from ours in that the former aims at identifying an appropriate hypothesis for each group of observations, rather than ranking those hypotheses for the entire dataset by considering the posterior distribution of the weights in the later. The mixture has somehow more of a substance in the first case, where separating the datasets into groups is part of the inference.
Taking advantage of being in San Francisco, I flew yesterday to Australia over the Pacific, crossing for the first time the day line. The 15 hour Qantas flight to Sydney was remarkably smooth and quiet, with most passengers sleeping for most of the way, and it gave me a great opportunity to go over several papers I wanted to read and review. Over the next week or so, I will work with my friends and co-authors David Frazier and Gael Martin at Monash University (and undoubtedly enjoy the great food and wine scene!). Before flying back to Paris (alas via San Francisco rather than direct).
Bayes on the Beach is a yearly conference taking place in Queensland Gold Coast and organised by Kerrie Mengersen and her BRAG research group at QUT. To quote from the email I just received, the conference will be held at the Mantra Legends Hotel on Surfers Paradise, Gold Coast during November 7 – 9, 2016. The conference provides a forum for discussion on developments and applications of Bayesian statistics, and includes keynote presentations, tutorials, practical problem-based workshops, invited oral presentations, and poster presentations. Abstract submissions are now open until September 2.