an attempt at bloggin, nothing more…

**T**oday was the second session of our Reading Classics Seminar for the academic year 2014-2015. I have not reported on this seminar so far because it has had starting problems, namely hardly any students present at the first classes and therefore several restarts until we reached a small group of interested students. Actually, this is the final year for my TSI Master at Paris-Dauphine, as it will become integrated within the new MASH Master next year. The latter started this year and drew away half of our potential applicants, presumably because of its wider spectrum, spanning machine-learning, optimisation, programming and a tiny bit of statistics… If we manage to salvage [within the new Master] our speciality of offering the only Bayesian Statistics training in France, this will not be a complete disaster!

Anyway, the first seminar was about the great 1939 Biometrika paper by Pitman about the best invariant estimator appearing magically as a Bayes estimator! Alas, the student did not grasp the invariance part and hence focussed on less relevant technical parts, which was not a great experience (and therefore led me to abstain from posting the slides here). The second paper was *not* on my list but was proposed by another student only yesterday, when he realised he was to present today! This paper, entitled “The Counter-intuitive Non-informative Prior for the Bernoulli Family”, was published in the Journal of Statistics Education in 2004 by Zhu and Lu. I had not heard of the paper (or of the journal) previously and I do not think it is worth advertising any further as it gives a very poor entry to non-informative priors in the simplest of settings, namely for Bernoulli B(p) observations. Indeed, the stance of the paper is to define a non-informative prior as one returning the MLE of p as its posterior expectation (missing altogether the facts that such a definition is not parameterisation-invariant and that, given the modal nature of the MLE, a posterior mode would be much more appropriate, leading to the uniform prior on p as a solution) and to conclude that the corresponding prior is made of two Dirac masses at 0 and 1! Which again misses several key points, like properly defining convergence in a space of probability distributions and using an improper prior *differently* from a proper prior. Esp. since in the next section, the authors switch to Haldane’s prior being the Be(0,0) distribution..! A prior that cannot be used since the posterior is not defined when all the observations are identical. Certainly *not* a paper to make it to *the* list! *(My student simply pasted pages from this paper as his slides and so I see again no point in reposting them here.)*
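To make the Beta-Bernoulli arithmetic behind these remarks concrete, here is a minimal Python sketch of my own (not taken from the paper). It assumes the standard conjugate update, where a Be(a,b) prior and x successes out of n trials give a Be(a+x, b+n-x) posterior. The function names are purely illustrative.

```python
def beta_posterior(a, b, successes, n):
    """Parameters of the Beta posterior under a Be(a, b) prior (conjugate update)."""
    return a + successes, b + n - successes


def posterior_mean(a, b, successes, n):
    """Posterior expectation of p; only defined when the posterior is proper."""
    post_a, post_b = beta_posterior(a, b, successes, n)
    if post_a <= 0 or post_b <= 0:
        raise ValueError("improper posterior: both Beta parameters must be positive")
    return post_a / (post_a + post_b)


def posterior_mode(a, b, successes, n):
    """Posterior mode of p, using the Beta mode formula (alpha-1)/(alpha+beta-2)."""
    post_a, post_b = beta_posterior(a, b, successes, n)
    if post_a < 1 or post_b < 1 or post_a + post_b <= 2:
        raise ValueError("mode formula not applicable for these parameters")
    return (post_a - 1) / (post_a + post_b - 2)


x, n = 3, 10                                   # illustrative data, MLE is x/n = 0.3
haldane_mean = posterior_mean(0, 0, x, n)      # Haldane Be(0,0): mean equals the MLE
uniform_mode = posterior_mode(1, 1, x, n)      # uniform Be(1,1): mode equals the MLE
uniform_mean = posterior_mean(1, 1, x, n)      # uniform Be(1,1): mean is (x+1)/(n+2)
```

Calling `posterior_mean(0, 0, 0, 10)` raises the `ValueError`, which is the Haldane problem mentioned above: with all observations identical, the Be(0,0) posterior is not a probability distribution.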

**T**he editors of a new blog entitled Marauders of the Lost Sciences (Learn from the giants) sent me an email to signal the start of this blog with a short excerpt from a giant in maths or stats posted every day:

There is a new blog I wanted to tell you about which excerpts one interesting or classic paper or book a day from the mathematical sciences. We plan on daily posting across the range of mathematical fields and at any level, but about 20-30% of the posts in queue are from statistics. The goal is to entice people to read the great works of old. The first post today was from an old paper by Fisher applying Group Theory to the design of experiments.

Interesting concept, which will hopefully generate comments to put the quoted passage into context. Somewhat connected to my Reading Statistical Classics posts. Which ~~incidentally if sadly will not take place this year since only two students registered.~~ should take place in the end since more students registered! (I am unsure about the references behind the title of that blog, besides Spielberg’s Raiders of the Lost Ark and Norman’s Marauders of Gor… I just hope Statistics does not qualify as a lost science!)

**T**his week, I decided not to report on the paper read at the Reading Classics student seminar, as it did not work out well enough. The paper was the “Regression models and life-tables” published in 1972 by David Cox… A classic if any! Indeed, I do not think posting a severe criticism of the presentation or the presentation itself would be of much use to anyone. It is rather sad as (a) the student clearly put some effort into the presentation, including a reproduction of an R execution, and (b) this was an entry on semi-parametrics, Kaplan-Meier, truncated longitudinal data, and more, that could have benefited the class immensely. Alas, the talk did not take any distance from the paper, did not exploit the following discussion, and exceeded by far the allocated time, without delivering a comprehensible message. It is a complex paper with concise explanations, granted, but there were ways to find easier introductions to its contents in the more recent literature… It is possible that a second student takes over and presents her analysis of the paper next January. Unless she got so scared by this presentation that she will switch to another paper… *[Season’s wishes to Classics Readers!]*

**T**his week, thanks to a lack of clear instructions (from me) to my students in the Reading Classics student seminar, four students showed up with a presentation! Since I had planned for two teaching blocks, three of them managed to fit within the three hours, while the last one kindly agreed to wait till next week to present a paper by David Cox…

**T**he first paper discussed therein was A new look at the statistical model identification, written in 1974 by Hirotugu Akaike. And presenting the AIC criterion. My student Rozan asked to give the presentation in French as he struggled with English, but it was still a challenge for him and he ended up being too close to the paper to provide a proper perspective on why AIC is written the way it is and why it is (potentially) relevant for model selection. And why it is not such a definitive answer to the model selection problem. This is not the simplest paper in the list, to be sure, but some intuition could have been built from the linear model, rather than producing the case of an ARMA(p,q) model without much explanation. (I actually wonder why the penalty for this model is (p+q)/T, rather than (p+q+1)/T for the additional variance parameter.) Or simulations run on the performance of AIC versus other xIC’s…
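As an illustration of the kind of linear-model intuition I had in mind, here is a minimal Python sketch. It is my own toy example, not Akaike's, and it assumes a Gaussian linear model with the usual definition AIC = -2 × maximised log-likelihood + 2k, where k counts the mean parameters plus the variance. The data are made up and deterministic.

```python
import math


def fit_constant(ys):
    """Residuals of the intercept-only model."""
    mean = sum(ys) / len(ys)
    return [y - mean for y in ys]


def fit_line(xs, ys):
    """Residuals of the least-squares line y = intercept + slope * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


def gaussian_aic(residuals, n_mean_params):
    """AIC of a Gaussian linear model, with the variance counted as a parameter."""
    n = len(residuals)
    sigma2_hat = sum(r * r for r in residuals) / n          # MLE of the variance
    max_loglik = -0.5 * n * (math.log(2 * math.pi * sigma2_hat) + 1)
    k = n_mean_params + 1                                    # +1 for the variance
    return -2 * max_loglik + 2 * k


# Toy data (an assumption for illustration): a linear trend plus a small
# alternating perturbation.
xs = list(range(20))
ys = [1.0 + 0.5 * x + 0.3 * (-1) ** x for x in xs]

aic_constant = gaussian_aic(fit_constant(ys), n_mean_params=1)
aic_line = gaussian_aic(fit_line(xs, ys), n_mean_params=2)
```

Here `aic_line` comes out smaller than `aic_constant`, so the line is preferred. Note that when every candidate model carries the same single variance parameter, counting it shifts all AIC values by the same constant and leaves the ranking unchanged.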

**T**he second paper was another classic, the original GLM paper by John Nelder and his coauthor Wedderburn, published in 1972 in Series A. A slightly easier paper, in that the notion of a generalised linear model is presented therein, with mathematical properties linking the (conditional) mean of the observation with the parameters and several examples that could be discussed. Plus having the book as a backup. My student Ysé did a reasonable job in presenting the concepts, but she would have benefited from the extra week to properly include the computations she ran in R around the *glm()* function… (The definition of the deviance was somehow deficient, although this led to a small discussion during the class as to how the analysis of deviance was extending the then flourishing analysis of variance.) In the generic definition of the generalised linear models, I was also reminded of the generality of the nuisance parameter modelling, which made the part of interest appear as an exponential shift on the original (nuisance) density.
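For readers wondering about the link between deviance and analysis of variance, here is a small Python sketch. It uses the standard textbook definition of the (unscaled) deviance as twice the gap in log-likelihood between the saturated model and the fitted one, which is not necessarily the definition given in class. The counts are invented for illustration, and the fitted means are simply passed in rather than produced by a GLM fit.

```python
import math


def gaussian_deviance(ys, mus):
    """Unscaled Gaussian deviance: this is just the residual sum of squares."""
    return sum((y - mu) ** 2 for y, mu in zip(ys, mus))


def poisson_deviance(ys, mus):
    """Poisson deviance 2 * sum[y log(y/mu) - (y - mu)], with 0 log 0 taken as 0."""
    total = 0.0
    for y, mu in zip(ys, mus):
        term = y * math.log(y / mu) if y > 0 else 0.0
        total += term - (y - mu)
    return 2.0 * total


# Invented counts; the intercept-only ("null") fit uses the sample mean,
# which is the MLE of the common mean in both models.
counts = [2, 3, 6, 7, 8, 9, 10, 12, 15]
null_mean = sum(counts) / len(counts)
null_fit = [null_mean] * len(counts)

rss_null = gaussian_deviance(counts, null_fit)          # ANOVA-style total sum of squares
poisson_null = poisson_deviance(counts, null_fit)       # its Poisson counterpart
poisson_saturated = poisson_deviance(counts, counts)    # saturated model: zero deviance
```

In the Gaussian case the deviance reduces to the residual sum of squares, so differences of deviances between nested models are the sums of squares of a classical ANOVA table, which is the sense in which the analysis of deviance extends the analysis of variance.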

**T**he third paper, presented by Bong, was yet another classic, namely the FDR paper, Controlling the false discovery rate, of Benjamini and Hochberg in Series B (which was recently promoted to the should-have-been-a-Read-Paper category by the RSS Research Committee and discussed at the Annual RSS Conference in Edinburgh four years ago, as well as published in Series B). This 2010 discussion would actually have been a good start to discuss the paper in class, but Bong was not aware of it and mentioned earlier papers extending the 1995 classic. She gave a decent presentation of the problem and of the solution of Benjamini and Hochberg but I wonder how much of the novelty of the concept the class grasped. (I presume everyone was getting tired by then as I was the only one asking questions.) The slides somewhat made it look too much like a simulation experiment… (Unsurprisingly, the presentation did not include any Bayesian perspective on the approach, even though such perspectives are quite natural and emerged very quickly once the paper was published. I remember for instance the Valencia 7 meeting in Tenerife where Larry Wasserman discussed the Bayesian-frequentist agreement in multiple testing.)
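For the record, the solution of Benjamini and Hochberg is a step-up rule that fits in a few lines. The Python sketch below is my own rendering of the standard procedure, assuming independent tests: sort the m p-values, find the largest rank k such that the k-th smallest p-value is at most kq/m, and reject the k hypotheses with the smallest p-values. The function name and the p-values are illustrative only.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans, True where the hypothesis is rejected at FDR level q."""
    m = len(p_values)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        # Step-up: keep the LARGEST rank whose p-value sits below the line rank*q/m.
        if p_values[idx] <= rank * q / m:
            k = rank
    rejected = [False] * m
    for idx in order[:k]:
        rejected[idx] = True
    return rejected


# Made-up p-values: none is below the Bonferroni threshold 0.05/4 = 0.0125,
# yet the step-up rule rejects all four because the largest one is below 4*0.05/4.
step_up_example = benjamini_hochberg([0.02, 0.03, 0.035, 0.04], q=0.05)

# A second made-up set where only the two smallest p-values are rejected.
mixed_example = benjamini_hochberg(
    [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216], q=0.05
)
```

The first example shows why this is not a disguised Bonferroni correction: a hypothesis can be rejected even though its own p-value exceeds its own threshold, as long as a larger p-value passes further up the line.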

**T**his week at the Reading Classics student seminar, Thomas Ounas presented a paper, *Statistical inference on massive datasets*, written by Li, Lin, and Li, a paper outside The List. (This paper was recently published in *Applied Stochastic Models in Business and Industry*, 29, 399-409.) I accepted this unorthodox proposal as (a) it was unusual, i.e., this was the very first time a student made this request, and (b) the topic of large datasets and their statistical processing definitely was interesting even though the authors of the paper were unknown to me. The presentation by Thomas was very power-pointish *(or power[-point]ful!)*, with plenty of dazzling transition effects… Even including (a) a Python program replicating the method and (b) a nice little video on internet data transfer protocols. And on a Linux machine! Hence the experiment was worth the try! Even though the paper is a rather unlikely candidate for the list of classics… (And the rendering in static power point not so impressive. Hence a video version available as well…)

**T**he solution adopted by the authors of the paper is one of breaking a massive dataset into blocks so that each fits into the computer’s memory and of computing a separate estimate for each block. Those estimates are then averaged (and standard-deviationed) without a clear assessment of the impact of this multi-tiered handling of the data. Thomas then built a program to illustrate this approach, with mean and variance and quantiles and densities as quantities of interest. Definitely original! The proposal itself sounds rather basic from a statistical viewpoint: for instance, evaluating the loss in information due to using this blocking procedure requires repeated sampling, which is unrealistic. Or using solely the inter-variance estimates, which seems to be missing the intra-variability. Hence to be overly optimistic. Further, strictly speaking, the method does not asymptotically apply to biased estimators, hence neither to Bayes estimators (nor to density estimators). Convergence results are thus somehow formal, in that the asymptotics cannot apply to a finite memory computer. In practice, the difficulty of the splitting technique is rather in breaking the data into blocks since Big Data is rarely made of iid observations. Think of Amazon data, for instance. A question actually asked by the class. The method of Li et al. should also include some bootstrapping connection. E.g., to Michael’s bag of little bootstraps.
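The blocking idea is simple enough to sketch in a few lines of Python. This is my own minimal reading of the description above, neither the authors' algorithm nor Thomas's program: split the data into blocks of a fixed size, apply an estimator to each block, then report the average and the standard deviation of the block estimates. The helper names and the toy data are assumptions for illustration.

```python
import statistics


def split_into_blocks(data, block_size):
    """Cut the data into consecutive blocks of at most block_size items."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def blockwise_estimate(data, block_size, estimator):
    """Average the per-block estimates and report their between-block spread."""
    blocks = split_into_blocks(data, block_size)
    estimates = [estimator(block) for block in blocks]
    combined = statistics.fmean(estimates)
    # Spread BETWEEN blocks only; the variability within each block is ignored.
    spread = statistics.stdev(estimates) if len(estimates) > 1 else 0.0
    return combined, spread


# Toy data: a deterministic scrambling of integers, standing in for a dataset
# too large to be processed in one go.
data = [float((i * 37) % 101) for i in range(100)]

combined_mean, spread_mean = blockwise_estimate(data, 10, statistics.fmean)
full_mean = statistics.fmean(data)
```

With equal-size blocks the averaged block means recover the full-sample mean exactly, which is the easy case. For an estimator whose bias depends on the block size, such as the variance estimator dividing by the block size, averaging over more and more blocks does not remove that bias as long as the block size stays fixed, in line with the reservation expressed above.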