Among several interesting (general public) entries and the fascinating article reconstituting the death of Lucy by a fall from a tree, I spotted in the current Sept. 22 issue of Nature two short summaries involving statistical significance, one in linguistics about repeated (and significant) links between some sounds and some concepts (like ‘n’ and ‘nose’) shared between independent languages, another about the (significant) discovery of a π meson and a K meson. The first anonymous editorial, entitled “Algorithm and blues“, was rather gloomy about the impact of proprietary algorithms on our daily life and on our democracies (or what is left of them), like the reliance on such algorithms to grant loan or determining the length of a sentence (based on the estimated probability of re-offending). The article called for more accountability of such tools, from going completely open-source to allowing for some form of strong auditing. This reminded me of the current (regional) debate about the algorithm allocating Greater Paris high school students to local universities and colleges based on their grades, wishes, and available positions. The apparent randomness and arbitrariness of those allocations prompted many (parents) to complain about the algorithm and ask for its move to the open. (Besides the pun in the title, the paper also contained a line about “affirmative algorithmic action”!) There was also a perfectly irrelevant tribune from a representative of the Church of England about its desire to give a higher profile to science in the/their church. Whatever. And I also was bemused by a news article on the difficulty to build a genetic map of Australia Aboriginals due to cultural reticence of Aboriginals to the use of body parts from their communities in genetic research. While I understand and agree with the concept of data privacy, so that to restrain to expose personal information, it is much less clear [to me] why data collected a century ago should come under such protections if it does not create a risk of exposing living individuals. It reminded me of this earlier Nature news article about North-America Aboriginals claiming right to a 8,000 year old skeleton. On a more positive side, this news part also mentioned the first catalogue produced by the Gaia European Space Agency project, from the publication of more than a billion star positions to the open access nature of the database, in that the Gaia team had hardly any prior access to such wealth of data. A special issue part of the journal was dedicated to the impact of social inequalities in the production of (future) scientists, but this sounds rather shallow, at least at the level of the few pages produced on the topic and it did not mention a comparison with other areas of society, where they are also most obviously at work!
Archive for linguistics
[I first thought this title was highly original but a google search showed me wrong…] This short post to point out to the new blog started by Ingmar Schuster on computational statistics and linguistics. Which, so far, keeps strictly to the discussion of recent research papers (rather than ratiocinating about all kinds of tangential topics like a certain ‘Og…) Some of which we may discuss in parallel. And some not. So keep posted! Ingmar came to Paris-Dauphine for a doctoral visit last Winter and is back as a postdoc (supported by the Fondation des Sciences Mathématiques de Paris) since last Fall. Working with me and Nicolas, among others.
This morning, after a nice and cool run along the river Torrens amidst almost unceasing bird songs, I attended another Bayesian ASC 2012 session with Scott Sisson presenting a simulation method aimed at correcting for biased confidence intervals and Robert Kohn giving the same talk in Kyoto. Scott’s proposal, which is rather similar to parametric bootstrap bias correction, is actually more frequentist than Bayesian as the bias is defined in terms of an correct frequentist coverage of a given confidence (or credible) interval. (Thus making the connection with Roderick Little’s calibrated Bayes talk of yesterday.) This perspective thus perceives ABC as a particular inferential method, instead of a computational approximation to the genuine Bayesian object. (We will certainly discuss the issue with Scott next week in Sydney.)
Then Peter Donnely gave a particularly exciting and well-attended talk on the geographic classification of humans, in particular of the (early 1900’s) population of the British isles, based on a clever clustering idea derived from an earlier paper of Na Li and Matthew Stephens: using genetic sequences from a group of individuals, each individual was paired with the rest of the sample as if it descended from this population. Using an HMM model, this led to clustering the sample into about 50 groups, with a remarkable geographic homogeneity: for instance, Cornwall and Devon made two distinct groups, an English speaking pocket of Wales (Little England) was identified as a specific group and so on, the central, eastern and southern England constituting an homogenous group of its own…
Shravan Vasishth has written a response to my review both published on the Statistics Forum. His response is quite straightforward and honest. In particular, he acknowledges not being a statistician and that he “should spend more time studying statistics”. I also understand the authors’ frustration at trying “to recruit several statisticians (at different points) to join [them] as co-authors for this book, in order to save [them] from [them]selves, so to speak. Nobody was willing to do join in.” (Despite the kind proposal to join as a co-author to a new edition, I would be rather unwilling as well, mostly because of the concept to avoid calculus at all cost… I will actually meet with Shravan at the end of the month to discuss specifics of the statistical flaws in this book.)
However, I still do not understand why the book was published without a proper review from a statistician. Springer is a/my serious scientific editor and book proposals usually go through several reviews, prior to and after redaction. Shravan Vasishth asks for alternative references, which I personally cannot provide for lack of teaching at this level, but this is somehow besides the point: even if a book at the intended level and for the intended audience did not exist, this would not justify the publication of a book on statistics (and only statistics) by authors not proficient enough in the topic.
One point of the response I do not get is the third item about the blog and letting my “rage get the better of [myself] (the rage is no doubt there for good reason)”. Indeed, while I readily acknowledge the review is utterly negative, I have tried to stick to facts, either statistical flaws (like the unbiasedness of s) or presentation defects. The reference to a blog in the book could be a major incentive to adopt the book, so if the blog does not live as a blog, it is both a disappointment to the reader and a sort of a breach of advertising. I perfectly understand the many reasons for not maintaining a blog (!), but then the site should have been advertised as a site rather than a blog. This was the meaning of the paragraph
that does not sound full of rage to me… Anyway, this is a minor point.
“We have seen that a perfect correlation is perfectly linear, so an imperfect correlation will be `imperfectly linear’.” page 128
This book has been written by two linguists, Shravan Vasishth and Michael Broe, in order to teach statistics “in areas that are traditionally not mathematically demanding” at a deeper level than traditional textbooks “without using too much mathematics”, towards building “the confidence necessary for carrying more sophisticated analyses” through R simulation. This is a praiseworthy goal, bound to produce a great book. However, and most sadly, I find the book does not live up to expectations. As in Radford Neal’s recent coverage of introductory probability books with R, there are statements there that show a deep misunderstanding of the topic… (This post has also been published on the Statistics Forum.) Continue reading
Robin Ryder—with whom I am sharing an office at CREST, and who is currently doing a postdoc on ABC methods—, got interviewed in the March issue of La Recherche. (The interviewer was Philippe Pajot who wrote “Parcours de mathématiciens”, reviewed in a recent post.) The interview is reproduced on Robin’s blog (in French) and gives in a few words the principles of Bayesian linguistics. This two-page interview also includes a few lines of a technical entry to MCMC (called Monte Carlo Markov chains rather than Markov chain Monte Carlo) that focus on the exploration of huge state-spaces associated with trees. Overall, a very good advertising for MCMC methods for the general public through the highly attractive story of the history of languages…