Archive for April, 2011

MAP, MLE and loss

Posted in Statistics on April 25, 2011 by xi'an

Michael Evans and Gun Ho Jang posted an arXiv paper where they discuss the connection between MAP, least relative surprise (or maximum profile likelihood) estimators, and loss functions. I posted my perspective on MAP estimators a while ago, followed by several comments on the Bayesian nature of those estimators, and hence will not reproduce them here. The core of the matter is that neither MAP estimators nor MLEs are really justified by a decision-theoretic approach, at least in a continuous parameter space, and that the dominating measure [arbitrarily] chosen on the parameter space impacts the value of the MAP, as demonstrated by Druilhet and Marin in 2007.
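To make the dependence on the dominating measure concrete, here is a minimal sketch under assumptions of my own choosing (a Beta(3, 5) posterior and a logit reparameterisation, neither of which comes from the papers mentioned above). The same posterior is maximised twice on a grid: once as a density with respect to the Lebesgue measure on θ, once with respect to the Lebesgue measure on φ = logit(θ). The Jacobian θ(1-θ) shifts the mode, so the two "MAP" values disagree.

```python
import math

def log_density_theta(theta, a, b):
    """Unnormalised log-density of a Beta(a, b) posterior w.r.t. Lebesgue measure on theta."""
    return (a - 1) * math.log(theta) + (b - 1) * math.log(1 - theta)

def log_density_logit(theta, a, b):
    """Same posterior, as a density w.r.t. Lebesgue measure on phi = logit(theta).
    The Jacobian d(theta)/d(phi) = theta * (1 - theta) adds one to each exponent."""
    return a * math.log(theta) + b * math.log(1 - theta)

def argmax_on_grid(log_density, a, b, n=100_000):
    """Brute-force maximiser over an evenly spaced grid strictly inside (0, 1)."""
    grid = [(i + 0.5) / n for i in range(n)]
    return max(grid, key=lambda t: log_density(t, a, b))

# Hypothetical posterior parameters, chosen only for illustration.
a, b = 3.0, 5.0
map_theta = argmax_on_grid(log_density_theta, a, b)  # close to (a - 1) / (a + b - 2)
map_logit = argmax_on_grid(log_density_logit, a, b)  # close to a / (a + b), mapped back to theta
print(map_theta, map_logit)
```

Both numbers are legitimate posterior modes of the same distribution; only the reference measure differs.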


166 days to half-marathon

Posted in Running on April 24, 2011 by xi'an

Last weekend, I was in Argentan and did two good training runs on the small country roads. This is in long-term preparation for the half-marathon in early October. If I do not suffer from [more] mechanical problems in the meantime, my hope is to run the half-marathon in less than 1:24, as this is a hard race with up-hill segments and most often a terrible wind (and as I am turning v2 this year). A big if and a big hope!

A kitsch Monument aux Morts

Posted in pictures, Travel on April 23, 2011 by xi'an

Close to Argentan, there is a little village called Le Pin au Haras with a beautiful national stud and a rather kitsch war memorial: the soldier bust is indeed painted in the original colours of the First World War French uniform, the famous bleu horizon that, along with red pants, made soldiers such great targets… Even the medal (presumably the Croix de Guerre), the fourragère and the helmet chinstrap are painted in the right (?) colours… This may have been the original version of the bust, only recently restored to its original colours, as Google shows there exist other painted war memorials in France.


Posted in Books on April 23, 2011 by xi'an

During my visit to Madrid I managed to finish another book by Arnaldur Indriðason, Grafarþögn (La Femme en Vert), which has been translated into English under the rather dull title of Silence of the Grave. While it is an impressive book, by its description of domestic violence and of its impact on the children and grand-children of abusive fathers, it is not exactly a detective story because there is not much in terms of police work… The book is terrifying in the spiral of physical and psychological violence suffered by the family and it is no wonder the book got several awards (Glass Key award 2003, CWA Gold Dagger 2005, Grand Prix des lectrices de Elle 2007). However, having the two stories exposed in parallel, the one of the suffering family in the 1940s and the uncovering of the grave in the early 2000s, reduces the plot in the current era to a spectator’s game, the reader being aware of much more than the policemen conducting the inquiry, and suspecting in particular that the body slowly unearthed by the archaeologists can only be one of two members of this doomed family… I must say I preferred Arctic Chill, especially because of the vision it gave of the contemporary Icelandic society, but this novel Grafarþögn also contains insights about an older, more rural and just as cruel, Iceland that WWII was going to change so radically.

Statistical analyses using R

Posted in Books, R, Statistics on April 22, 2011 by xi'an

Another book I received from the Short Book Reviews section of the International Statistical Review is Everitt’s and Hothorn’s Handbook of statistical analyses using R. Here is a [blog-ified] version of my book review.

This book is the second (blue) edition of a successful (violet) handbook that can benefit a wide audience interested in using R for data analysis. (After I wrote the review, I saw this appropriate analysis of the first edition.) It covers most non-Bayesian statistical methods, with forays into exploratory data analysis with tools like principal components, clustering and bagging/boosting. As reflected in the long list of chapters, the coverage is quite extensive and only misses specialised statistical domains like time-series (apart from longitudinal data), econometrics (except for generalised linear models), and signal processing. Besides the absence of a Bayesian perspective (only mentioned in connection with BIC and the mclust package, while the Bayesian formalism would be a natural tool for analysing mixed models), I miss some material on simulation, the only entry found in the book being bootstrap (pages 153-154).

Given its title and emphasis on analyses, the book is logically associated with an R package HSAUR2 [if there is an intended pun, I missed it!] and works according to a fixed pattern: each chapter (1) starts with a description of a few datasets, (2) summarises the main statistical issues in one or two pages, and then (3) engages in an R analysis. As the complexity increases with the chapter number, the authors rely more and more on specialised packages that need to be downloaded by the reader. I have no objection to this pedagogical choice, especially when considering that the packages are mostly recent. I would however have liked a bit more detail about those packages, or at least about their main functions, as the reader is left to experiment solely from the lines of code provided in the handbook. (In contrast, a few passages are a bit “geeky” and require a deeper understanding of R objects than casual readers master. Also, using layout instead of par(mfrow=… is not that obvious.) My only criticism of the book at this level is the puzzling insistence on including all the datasets used therein in the form of tables. I frankly fail to see the point in spending so many pages on those tables given that they are all available from the HSAUR2 package. A page of further explanation, of background or of statistical theory would have been much more beneficial to any reader, in my opinion! The same criticism applies to the few exercises found at the end of each chapter. (The most glaring use of a table occurs in the graphical display chapter, of course! The authors rely on a dataset about the 50 North American States and list the data instead of illustrating the use of a map….)

In conclusion, I find the book by Everitt and Hothorn quite pleasant and bound to fit its purpose. The layout and presentation are nice (with a single noticeable mishap on page 332 caused by Darwin’s tree of life). It should appeal to all readers as it contains a wealth of information about the use of R for statistical analysis. Seasoned R users included: when reading the first chapters, I found myself scribbling small light-bulbs in the margin to point out features of R I was not aware of. (In particular, the authors mention the option type="n" for plot that R-bloggers signalled as the most useful option for plotting.) In addition, the book is quite handy for a crash introduction to statistics for (well-enough motivated) non-statisticians. (This post has also appeared on Statistical Forum on April 20.)

Lack of confidence [revised]

Posted in R, Statistics, University life on April 22, 2011 by xi'an

Following the comments on our earlier submission to PNAS, we have written (and re-arXived) a revised version where we try to spell out (better) the distinction between ABC point (and confidence) estimation and ABC model choice, namely that the problem stands at another level for Bayesian model choice (using posterior probabilities). When doing point estimation with insufficient summary statistics, the information content is poorer but, unless one uses very degraded summary statistics, inference still converges. We completely agree with the reviewers that the resulting posterior distribution differs from the true posterior in this case but, at least, gathering more observations brings more information about the parameter (and convergence when the number of observations goes to infinity). For model choice, this is not guaranteed if we use summary statistics that are not inter-model sufficient, as shown by the Poisson and normal examples. Furthermore, except for very specific cases such as Gibbs random fields, it is almost always impossible to derive inter-model sufficient statistics, beyond the raw sample. This is why we consider there is a fundamental difference between point estimation and model choice.
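As a toy illustration of the point-estimation side only, here is a minimal ABC rejection sketch. Everything in it is an assumption made for the example and not the setting of the paper: a normal model with unit variance, a flat prior on (-5, 5), a tolerance of 0.1, and the sample median as summary statistic (which is not sufficient for the normal mean). The accepted values do not follow the true posterior, but they still concentrate around the value that generated the data.

```python
import random
import statistics

def abc_rejection(observed, n_draws=5000, eps=0.1, seed=1):
    """ABC rejection sampler for a normal mean, using the sample median as summary.
    The prior range, tolerance and number of draws are hypothetical choices."""
    rng = random.Random(seed)
    n = len(observed)
    s_obs = statistics.median(observed)  # insufficient summary of the observed data
    accepted = []
    for _ in range(n_draws):
        mu = rng.uniform(-5.0, 5.0)  # draw from the (assumed) flat prior
        simulated = [rng.gauss(mu, 1.0) for _ in range(n)]
        # Keep mu when the simulated summary lands close to the observed one.
        if abs(statistics.median(simulated) - s_obs) < eps:
            accepted.append(mu)
    return accepted

# Pseudo-observed data generated from a known mean, for checking purposes only.
data_rng = random.Random(0)
true_mu = 1.0
observed = [data_rng.gauss(true_mu, 1.0) for _ in range(100)]

accepted = abc_rejection(observed)
abc_mean = statistics.mean(accepted)
print(len(accepted), abc_mean)
```

This sketch says nothing about model choice, where the same kind of summary gives no such guarantee.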

Following the request from a referee, we also ran a more extensive simulation experiment for comparing two scenarios with 3 populations, 100 diploid individuals per population, and 50 loci/markers. However, the results are somewhat less conclusive, in the sense that, since we use 50 loci, the data is much more informative about the model and therefore both the importance sampling and the ABC approximations of the posterior probability are close to one, hence both validate the true model. Because both approximations are very close to one, it is difficult to assess the worth of the ABC approximation per se, i.e. in numerical terms. (The fact that the statistical conclusion is the same for both approaches is of course satisfying from an inferential perspective, but is an altogether separate issue from our argument about the possible lack of convergence of the ABC Bayes factor approximation to the true Bayes factor.) Furthermore, this experiment may be beyond the manageable/reasonable in the sense that the importance sampling approximation cannot be taken for granted, nor can it be checked empirically. Indeed, with 50 markers and 100 individuals, the product likelihood suffers from an enormous variability that 100,000 particles and 100 trees per locus have trouble addressing (despite a huge computing cost of more than 12 days on a powerful cluster).

Incidentally, I had a problem with natbib, when using the pnas style:

!Package natbib Error: Bibliography not compatible with author-year citations.
Press <return> to continue in numerical citation style.
See the natbib package documentation for explanation.

but it vanishes with the options


which is an easy fix.

Thomas Bayes, 250 years later

Posted in R, Statistics, University life on April 21, 2011 by xi'an

A link on R-bloggers signaled a series of blogs and videos by IBM Netezza about Thomas Bayes and the consequences of his theorem. Which made me realise this was indeed the 250th anniversary of his death, and that maybe we (as a collective, incl. ISBA) should have done something on April 17th (or is it April 7th?)… Before the Revolution Analytics announcement, only this post in Dutch by Tom Heskes appeared to celebrate the event. I am not sure one should draw major consequences from this but, if any, this means that the Bayesian community has sufficiently grown in strength and maturity to stop focusing on a single point of its long history. And maybe celebrations will be more widespread for the 250th anniversary of the publication of An essay towards solving a problem in the doctrine of chances on Dec. 23, 1763.