Archive for linguistics

Nature highlights

Posted in Books, Kids, pictures, Statistics on October 16, 2016 by xi'an

Among several interesting (general public) entries and the fascinating article reconstructing the death of Lucy by a fall from a tree, I spotted in the current Sept. 22 issue of Nature two short summaries involving statistical significance, one in linguistics about repeated (and significant) links between some sounds and some concepts (like ‘n’ and ‘nose’) shared between independent languages, another about the (significant) discovery of a π meson and a K meson. The first anonymous editorial, entitled “Algorithm and blues”, was rather gloomy about the impact of proprietary algorithms on our daily life and on our democracies (or what is left of them), like the reliance on such algorithms to grant loans or to determine the length of a sentence (based on the estimated probability of re-offending). The article called for more accountability of such tools, from going completely open-source to allowing for some form of strong auditing. This reminded me of the current (regional) debate about the algorithm allocating Greater Paris high school students to local universities and colleges based on their grades, wishes, and available positions. The apparent randomness and arbitrariness of those allocations prompted many (parents) to complain about the algorithm and ask for it to be made public. (Besides the pun in the title, the paper also contained a line about “affirmative algorithmic action”!) There was also a perfectly irrelevant tribune from a representative of the Church of England about its desire to give a higher profile to science in the/their church. Whatever. And I also was bemused by a news article on the difficulty of building a genetic map of Australian Aboriginals due to the cultural reluctance of Aboriginals towards the use of body parts from their communities in genetic research.
While I understand and agree with the concept of data privacy, so as to refrain from exposing personal information, it is much less clear [to me] why data collected a century ago should come under such protections if it does not create a risk of exposing living individuals. It reminded me of this earlier Nature news article about North American Aboriginals claiming rights to an 8,000-year-old skeleton. On a more positive side, this news part also mentioned the first catalogue produced by the Gaia European Space Agency project, from the publication of more than a billion star positions to the open access nature of the database, in that the Gaia team had hardly any prior access to such a wealth of data. A special section of the issue was dedicated to the impact of social inequalities on the production of (future) scientists, but this sounds rather shallow, at least at the level of the few pages produced on the topic, and it did not mention a comparison with other areas of society, where they are also most obviously at work!

new kid on the blog

Posted in Kids, Statistics, University life on January 27, 2016 by xi'an

[I first thought this title was highly original but a Google search proved me wrong…] This short post is to point to the new blog started by Ingmar Schuster on computational statistics and linguistics. Which, so far, keeps strictly to the discussion of recent research papers (rather than ratiocinating about all kinds of tangential topics like a certain ‘Og…) Some of which we may discuss in parallel. And some not. So keep posted! Ingmar came to Paris-Dauphine for a doctoral visit last Winter and has been back as a postdoc (supported by the Fondation des Sciences Mathématiques de Paris) since last Fall. Working with me and Nicolas, among others.


The synoptic problem and statistics [book review]

Posted in Books, R, Statistics, University life, Wines on March 20, 2015 by xi'an

A book that came to me for review in CHANCE and that came completely unannounced is Andris Abakuks’ The Synoptic Problem and Statistics. “Unannounced” in that I had not heard so far of the synoptic problem. This problem is one of ordering and connecting the gospels in the New Testament, more precisely the “synoptic” gospels attributed to Mark, Matthew and Luke, since the fourth canonical gospel of John is considered by experts to be posterior to those three. By considering overlaps between those texts, some statistical inference can be conducted and the book covers (some of?) those statistical analyses for different orderings of ancestry in authorship. My overall reaction after a quick perusal of the book over breakfast (sharing bread and fish, of course!) was to wonder why there was no mention made of a more global if potentially impossible approach via a phylogenetic tree considering the three (or more) gospels as current observations and tracing their unknown ancestry back just as in population genetics. Not because ABC could then be brought into the picture. Rather because it sounds to me (and to my complete lack of expertise in this field!) more realistic to postulate that those gospels were not written by a single person. Or at a single period in time. But rather that they evolved like genetic mutations across copies and transmission until they got a sort of official status.

“Given the notorious intractability of the synoptic problem and the number of different models that are still being advocated, none of them without its deficiencies in explaining the relationships between the synoptic gospels, it should not be surprising that we are unable to come up with more definitive conclusions.” (p.181)

The book by Abakuks goes instead through several modelling directions, from logistic regression using variable length Markov chains [to predict agreement between two of the three texts by regressing on earlier agreement] to hidden Markov models [representing, e.g., Matthew’s use of Mark], to various independence tests on contingency tables, sometimes bringing into the model an extra source denoted by Q. Including some R code for hidden Markov models. Once again, from my outsider viewpoint, this fragmented approach to the problem sounds problematic and inconclusive. And rather verbose in extensive discussions of descriptive statistics. Not that I was expecting a sudden Monty Python-like ray of light and booming voice to disclose the truth! Or that I crave more p-values (some may be found hiding within the book). But I still wonder about the phylogeny… Especially since phylogenies are used in text authentication as pointed out to me by Robin Ryder for Chaucer’s Canterbury Tales.

ASC 2012 (#2)

Posted in pictures, Running, Statistics, Travel, University life on July 12, 2012 by xi'an

This morning, after a nice and cool run along the river Torrens amidst almost unceasing bird songs, I attended another Bayesian ASC 2012 session with Scott Sisson presenting a simulation method aimed at correcting for biased confidence intervals and Robert Kohn giving the same talk as in Kyoto. Scott’s proposal, which is rather similar to parametric bootstrap bias correction, is actually more frequentist than Bayesian as the bias is defined in terms of a correct frequentist coverage of a given confidence (or credible) interval. (Thus making the connection with Roderick Little’s calibrated Bayes talk of yesterday.) This perspective thus perceives ABC as a particular inferential method, instead of a computational approximation to the genuine Bayesian object. (We will certainly discuss the issue with Scott next week in Sydney.)
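
For readers unfamiliar with the parametric bootstrap bias correction mentioned above, here is a minimal sketch in Python. It is not Scott's method, only the textbook device it is compared with: fit a model, simulate new datasets from the fitted model, re-estimate on each, and subtract the estimated bias. The normal model, the choice of the maximum-likelihood variance as the biased estimator, and all function names are assumptions made for illustration.

```python
import random


def mle_variance(sample):
    """Maximum-likelihood variance estimate (divides by n, so biased downward)."""
    n = len(sample)
    mean = sum(sample) / n
    return sum((x - mean) ** 2 for x in sample) / n


def bootstrap_bias_corrected_variance(sample, n_boot=2000, rng=None):
    """Parametric bootstrap bias correction under an assumed normal model."""
    if rng is None:
        rng = random.Random(0)
    n = len(sample)
    mu_hat = sum(sample) / n
    var_hat = mle_variance(sample)
    sd_hat = var_hat ** 0.5
    # Simulate datasets from the fitted normal model and re-estimate on each.
    boot_estimates = [
        mle_variance([rng.gauss(mu_hat, sd_hat) for _ in range(n)])
        for _ in range(n_boot)
    ]
    # Estimated bias: average bootstrap estimate minus the value used to simulate.
    bias_hat = sum(boot_estimates) / n_boot - var_hat
    return var_hat - bias_hat


rng = random.Random(42)
sample = [rng.gauss(0.0, 1.0) for _ in range(10)]  # hypothetical data, n = 10
raw = mle_variance(sample)
corrected = bootstrap_bias_corrected_variance(sample, rng=rng)
print(f"MLE variance: {raw:.3f}, bias-corrected: {corrected:.3f}")
```

With n = 10 the maximum-likelihood variance underestimates by a factor (n-1)/n on average, so the correction inflates the raw estimate by roughly 10%, up to Monte Carlo error.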

Then Peter Donnelly gave a particularly exciting and well-attended talk on the geographic classification of humans, in particular of the (early 1900’s) population of the British Isles, based on a clever clustering idea derived from an earlier paper of Na Li and Matthew Stephens: using genetic sequences from a group of individuals, each individual was paired with the rest of the sample as if it descended from this population. Using an HMM, this led to clustering the sample into about 50 groups, with a remarkable geographic homogeneity: for instance, Cornwall and Devon made two distinct groups, an English speaking pocket of Wales (Little England) was identified as a specific group and so on, the central, eastern and southern England constituting a homogeneous group of its own…

The foundations of Statistics [reply]

Posted in Books, R, Statistics, University life on July 19, 2011 by xi'an

Shravan Vasishth has written a response to my review, both published on the Statistics Forum. His response is quite straightforward and honest. In particular, he acknowledges not being a statistician and that he “should spend more time studying statistics”. I also understand the authors’ frustration at trying “to recruit several statisticians (at different points) to join [them] as co-authors for this book, in order to save [them] from [them]selves, so to speak. Nobody was willing to do join in.” (Despite the kind proposal to join as a co-author to a new edition, I would be rather unwilling as well, mostly because of the principle of avoiding calculus at all costs… I will actually meet with Shravan at the end of the month to discuss specifics of the statistical flaws in this book.)

However, I still do not understand why the book was published without a proper review from a statistician. Springer is a (and my) serious scientific publisher and book proposals usually go through several reviews, prior to and after writing. Shravan Vasishth asks for alternative references, which I personally cannot provide for lack of teaching at this level, but this is somehow beside the point: even if a book at the intended level and for the intended audience did not exist, this would not justify the publication of a book on statistics (and only statistics) by authors not proficient enough in the topic.

One point of the response I do not get is the third item about the blog and letting my “rage get the better of [myself] (the rage is no doubt there for good reason)”. Indeed, while I readily acknowledge the review is utterly negative, I have tried to stick to facts, either statistical flaws (like the unbiasedness of s) or presentation defects. The reference to a blog in the book could be a major incentive to adopt the book, so if the blog does not live as a blog, it is both a disappointment to the reader and a sort of a breach of advertising. I perfectly understand the many reasons for not maintaining a blog (!), but then the site should have been advertised as a site rather than a blog. This was the meaning of the paragraph

The authors advertise a blog about the book that contains very little information. (The last entry is from December 2010: “The book is out”.) This was a neat idea, had it been implemented.

that does not sound full of rage to me… Anyway, this is a minor point.
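
As an aside on the statistical flaw cited above, the unbiasedness of s: a minimal simulation sketch, assuming the point at issue is that the sample standard deviation s is not an unbiased estimator of σ even though s² (with the n-1 divisor) is unbiased for σ². It is written in Python rather than the book's R, and the function name, sample size, and seed are illustrative choices only.

```python
import random


def average_estimates(n=5, sigma=1.0, reps=20000, seed=1):
    """Average s and s^2 over many simulated normal samples of size n."""
    rng = random.Random(seed)
    total_s = 0.0
    total_s2 = 0.0
    for _ in range(reps):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        mean = sum(sample) / n
        # Sample variance with the n - 1 divisor, unbiased for sigma^2.
        s2 = sum((x - mean) ** 2 for x in sample) / (n - 1)
        total_s += s2 ** 0.5  # s is a concave function of s^2, hence biased low
        total_s2 += s2
    return total_s / reps, total_s2 / reps


mean_s, mean_s2 = average_estimates()
print(f"average s   = {mean_s:.3f} (sigma   = 1)")
print(f"average s^2 = {mean_s2:.3f} (sigma^2 = 1)")
```

For normal samples of size 5 the expectation of s is about 0.94σ, so the average of s should settle visibly below 1 while the average of s² stays close to 1.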

The foundations of Statistics: a simulation-based approach

Posted in Books, R, Statistics, University life on July 12, 2011 by xi'an

“We have seen that a perfect correlation is perfectly linear, so an imperfect correlation will be `imperfectly linear’.” page 128

This book has been written by two linguists, Shravan Vasishth and Michael Broe, in order to teach statistics “in areas that are traditionally not mathematically demanding” at a deeper level than traditional textbooks “without using too much mathematics”, towards building “the confidence necessary for carrying more sophisticated analyses” through R simulation. This is a praiseworthy goal, bound to produce a great book. However, and most sadly, I find the book does not live up to expectations. As in Radford Neal’s recent coverage of introductory probability books with R, there are statements there that show a deep misunderstanding of the topic… (This post has also been published on the Statistics Forum.)

Robin Ryder’s interview

Posted in Statistics, University life on March 9, 2011 by xi'an

Robin Ryder, with whom I am sharing an office at CREST and who is currently doing a postdoc on ABC methods, was interviewed in the March issue of La Recherche. (The interviewer was Philippe Pajot who wrote “Parcours de mathématiciens”, reviewed in a recent post.) The interview is reproduced on Robin’s blog (in French) and gives in a few words the principles of Bayesian linguistics. This two-page interview also includes a few lines of a technical entry on MCMC (called Monte Carlo Markov chains rather than Markov chain Monte Carlo) that focus on the exploration of huge state-spaces associated with trees. Overall, a very good advertisement for MCMC methods for the general public through the highly attractive story of the history of languages…