This week, I decided not to report on the paper read at the Reading Classics student seminar, as it did not work out well enough. The paper was “Regression models and life-tables”, published in 1972 by David Cox… A classic if any! Indeed, I do not think posting a severe criticism of the presentation or the presentation itself would be of much use to anyone. It is rather sad as (a) the student clearly put some effort into the presentation, including a reproduction of an R execution, and (b) this was an entry on semi-parametrics, Kaplan-Meier, truncated longitudinal data, and more, that could have benefited the class immensely. Alas, the talk did not take any distance from the paper, did not exploit the discussion following the paper, and exceeded by far the allocated time, without delivering a comprehensible message. It is a complex paper with concise explanations, granted, but there were ways to find easier introductions to its contents in the more recent literature… It is possible that a second student takes over and presents her analysis of the paper next January. Unless she got so scared by this presentation that she will switch to another paper… [Season’s wishes to Classics Readers!]
This week, thanks to a lack of clear instructions (from me) to my students in the Reading Classics student seminar, four students showed up with a presentation! Since I had planned for two teaching blocks, three of them managed to fit within the three hours, while the last one kindly agreed to wait till next week to present a paper by David Cox…
The first paper discussed therein was A new look at the statistical model identification, written in 1974 by Hirotugu Akaike and presenting the AIC criterion. My student Rozan asked to give the presentation in French as he struggled with English, but it was still a challenge for him and he ended up being too close to the paper to provide a proper perspective on why AIC is written the way it is and why it is (potentially) relevant for model selection. And why it is not such a definitive answer to the model selection problem. This is not the simplest paper in the list, to be sure, but some intuition could have been built from the linear model, rather than producing the case of an ARMA(p,q) model without much explanation. (I actually wonder why the penalty for this model is (p+q)/T, rather than (p+q+1)/T for the additional variance parameter.) Or from simulations run on the performance of AIC versus other xIC’s…
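As a rough illustration of the intuition the linear model can give, here is a minimal sketch, assuming Gaussian errors and polynomial regressions of increasing degree. In that setting, AIC reduces (up to a constant shared by all models) to n log(RSS/n) + 2k. The function names, the simulated data, and the choice to leave the variance parameter out of k are all assumptions made for this example, not anything taken from Akaike's paper.

```python
import numpy as np

def gaussian_aic(rss, n, k):
    """AIC of a Gaussian linear model, up to a constant common to all models.

    rss: residual sum of squares, n: sample size, k: number of regression
    coefficients (the variance parameter is deliberately not counted here).
    """
    return n * np.log(rss / n) + 2 * k

def polynomial_aic(x, y, degree):
    """Fit a polynomial of the given degree by least squares and return its AIC."""
    X = np.vander(x, degree + 1)                 # columns x^degree, ..., x, 1
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ beta) ** 2))
    return gaussian_aic(rss, len(y), degree + 1)

# Hypothetical simulated data: the true regression function is linear.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 200)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=x.size)

aics = {d: polynomial_aic(x, y, d) for d in range(6)}
for d, a in aics.items():
    print(f"degree {d}: AIC = {a:.1f}")
```

Counting the variance as one more parameter would add the same 2 to every AIC above, so the ranking of the models would not change. This is only one possible reading of why a penalty may be written without the extra +1.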
The second paper was another classic, the original GLM paper by John Nelder and his coauthor Wedderburn, published in 1972 in Series A. A slightly easier paper, in that the notion of a generalised linear model is presented therein, with mathematical properties linking the (conditional) mean of the observation with the parameters and several examples that could be discussed. Plus having the book as a backup. My student Ysé did a reasonable job in presenting the concepts, but she would have benefited from this extra week to properly include the computations she ran in R around the glm() function… (The definition of the deviance was somewhat deficient, although this led to a small discussion during the class as to how the analysis of deviance was extending the then flourishing analysis of variance.) In the generic definition of the generalised linear models, I was also reminded of the generality of the nuisance parameter modelling, which made the part of interest appear as an exponential shift on the original (nuisance) density.
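Since the deviance was the sticking point, here is a minimal sketch of what it measures, assuming a Poisson model with log link fitted by iteratively reweighted least squares. This is written in Python with NumPy rather than with R's glm(), and the function names and simulated data are hypothetical. The drop in deviance between two nested models plays the role that the drop in residual sum of squares plays in the analysis of variance.

```python
import numpy as np

def fit_poisson_glm(X, y, n_iter=25):
    """Poisson GLM with log link, fitted by iteratively reweighted least squares.

    Assumes the first column of X is an intercept (used for the starting value).
    """
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean())
    for _ in range(n_iter):
        eta = X @ beta
        mu = np.exp(eta)
        z = eta + (y - mu) / mu          # working response
        XtW = X.T * mu                   # working weights are mu for this model
        beta = np.linalg.solve(XtW @ X, XtW @ z)
    return beta

def poisson_deviance(y, mu):
    """Twice the log-likelihood gap between the saturated and the fitted model."""
    with np.errstate(divide="ignore", invalid="ignore"):
        term = np.where(y > 0, y * np.log(y / mu), 0.0)   # y log(y/mu) is 0 at y = 0
    return 2.0 * np.sum(term - (y - mu))

# Hypothetical simulated counts with one covariate.
rng = np.random.default_rng(1)
n = 500
x = rng.normal(size=n)
y = rng.poisson(np.exp(0.3 + 0.5 * x)).astype(float)

X_null = np.ones((n, 1))
X_full = np.column_stack([np.ones(n), x])
dev_null = poisson_deviance(y, np.exp(X_null @ fit_poisson_glm(X_null, y)))
dev_full = poisson_deviance(y, np.exp(X_full @ fit_poisson_glm(X_full, y)))
print(f"null deviance {dev_null:.1f}, residual deviance {dev_full:.1f}")
print(f"drop in deviance for adding x: {dev_null - dev_full:.1f}")
```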
The third paper, presented by Bong, was yet another classic, namely the FDR paper, Controlling the false discovery rate, of Benjamini and Hochberg in Series B (which was recently promoted to the should-have-been-a-Read-Paper category by the RSS Research Committee and discussed at the Annual RSS Conference in Edinburgh four years ago, as well as published in Series B). This 2010 discussion would actually have been a good start to discuss the paper in class, but Bong was not aware of it and mentioned earlier papers extending the 1995 classic. She gave a decent presentation of the problem and of the solution of Benjamini and Hochberg but I wonder how much of the novelty of the concept the class grasped. (I presume everyone was getting tired by then as I was the only one asking questions.) The slides somewhat made it look too much like a simulation experiment… (Unsurprisingly, the presentation did not include any Bayesian perspective on the approach, even though such perspectives are quite natural and emerged very quickly once the paper was published. I remember for instance the Valencia 7 meeting in Tenerife where Larry Wasserman discussed the Bayesian-frequentist agreement in multiple testing.)
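For readers who want the solution of Benjamini and Hochberg in concrete form, here is a minimal sketch of the step-up rule: sort the m p-values, find the largest rank k such that the k-th smallest p-value is at most kq/m, and reject the k hypotheses with the smallest p-values. The function name and the p-values in the usage lines are made up for illustration.

```python
import numpy as np

def benjamini_hochberg(pvalues, q=0.05):
    """Return a boolean array flagging the hypotheses rejected at FDR level q."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)                           # ranks p-values from smallest to largest
    thresholds = q * np.arange(1, m + 1) / m        # k q / m for k = 1, ..., m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])            # largest (0-based) rank meeting its threshold
        reject[order[: k + 1]] = True               # step-up: reject everything up to that rank
    return reject

# Hypothetical p-values: the second and third smallest miss their own thresholds,
# but are still rejected because the largest one meets its threshold.
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.045], q=0.05))
```

The step-up feature is what distinguishes the rule from a plain sequence of individual comparisons: a p-value may exceed its own threshold and still be rejected if a larger one passes.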
This week at the Reading Classics student seminar, Thomas Ounas presented a paper, Statistical inference on massive datasets, written by Li, Lin, and Li, a paper from outside The List. (This paper was recently published in Applied Stochastic Models in Business and Industry, 29, 399-409.) I accepted this unorthodox proposal as (a) it was unusual, i.e., this was the very first time a student made this request, and (b) the topic of large datasets and their statistical processing definitely was interesting even though the authors of the paper were unknown to me. The presentation by Thomas was very power-pointish (or power[-point]ful!), with plenty of dazzling transition effects… Even including (a) a Python software replicating the method and (b) a nice little video on internet data transfer protocols. And on a Linux machine! Hence the experiment was worth the try! Even though the paper is a rather unlikely candidate for the list of classics… (And the rendering in static power point not so impressive. Hence a video version available as well…)
The solution adopted by the authors of the paper is one of breaking a massive dataset into blocks so that each fits into the computer’s memory and of computing a separate estimate for each block. Those estimates are then averaged (and standard-deviationed) without a clear assessment of the impact of this multi-tiered handling of the data. Thomas then built a software to illustrate this approach, with mean and variance and quantiles and densities as quantities of interest. Definitely original! The proposal itself sounds rather basic from a statistical viewpoint: for instance, evaluating the loss in information due to using this blocking procedure requires repeated sampling, which is unrealistic. Or using solely the inter-variance estimates, which seems to miss the intra-variability and hence to be overly optimistic. Further, strictly speaking, the method does not asymptotically apply to biased estimators, hence neither to Bayes estimators (nor to density estimators). Convergence results are thus somewhat formal, in that the asymptotics cannot apply to a finite memory computer. In practice, the difficulty of the splitting technique is rather in breaking the data into blocks since Big Data is rarely made of iid observations. Think of Amazon data, for instance. A question actually asked by the class. The method of Li et al. should also include some bootstrapping connection. E.g., to Michael Jordan’s bag of little bootstraps.
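The blocking idea and the objection about biased estimators can be seen in a few lines. The following is a minimal sketch, assuming iid data and equal-size blocks; it is neither the code of Li et al. nor Thomas's software, and all names and data are hypothetical. Averaging block means recovers the global mean exactly, while averaging a biased per-block estimator (here the variance with divisor equal to the block size) keeps a bias that depends on the block size, however many blocks there are.

```python
import numpy as np

def blockwise_estimate(data, n_blocks, estimator):
    """Average an estimator over blocks; also return the between-block standard error."""
    blocks = np.array_split(data, n_blocks)
    estimates = np.array([estimator(b) for b in blocks])
    combined = estimates.mean()
    spread = estimates.std(ddof=1) / np.sqrt(n_blocks)   # uses between-block variability only
    return combined, spread

# Hypothetical iid sample standing in for a dataset too large for memory.
rng = np.random.default_rng(2)
data = rng.exponential(scale=1.0, size=100_000)

mean_blocked, mean_se = blockwise_estimate(data, 100, np.mean)
# Biased per-block estimator: variance with divisor n_b, on 1000 blocks of size 100.
var_blocked, _ = blockwise_estimate(data, 1000, lambda b: b.var())

print(f"blocked mean {mean_blocked:.4f} vs full-data mean {data.mean():.4f}")
print(f"blocked variance {var_blocked:.4f} vs full-data variance {data.var():.4f}")
```

With equal block sizes, the full-data variance equals the average within-block variance plus the variance of the block means, so the blocked version can never exceed the full-data value.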
Here is an email I received a few days ago, similar to many other emails I/we receive on a regular basis:
I am working on Markov Chain Monte Carlo methods as part of my Masters project. I have to estimate mean, variance from a Gaussian mixture using metropolis method. I came across your paper ‘Bayesian Modelling and Inference on Mixtures of Distributions’. I am unable to understand how to obtain the new sample for mean, variance etc… I am using uniform distribution as proposal distribution. Should it be random numbers for the proposal distribution.
I have been working and trying to understand this for a long time. I would be grateful for any help.
While I felt sorry for the Master student, I consider it the responsibility of his/her advisor to give her/him the proper directions for understanding the paper. (Given the contents of the email, it sounds as if the student would require proper training in both Bayesian statistics [uniform priors on unbounded parameters?] and simulation [the question about random numbers does not make sense]…) This is what I replied to the student, hopefully in a positive tone.
Another read today and not from JRSS B for once, namely, Efron’s (an)other look at the Jackknife, i.e. the 1979 bootstrap classic published in the Annals of Statistics. My Master students in the Reading Classics Seminar course thus listened today to Marco Brandi’s presentation, whose (Beamer) slides are here:
In my opinion this was an easier paper to discuss, more because of its visible impact than because of the paper itself, where the comparison with the jackknife procedure does not sound so relevant nowadays. It is again mostly algorithmic and requires some background on how it impacted the field. Even though Marco also went through Don Rubin’s Bayesian bootstrap and Michael Jordan’s bag of little bootstraps, he struggled to get away from the technicality towards the intuition and the relevance of the method. The Bayesian bootstrap extension was quite interesting in that we discussed at length the connections with Dirichlet priors and the lack of parameters, which sounded quite antagonistic to the Bayesian principles. However, at the end of the day, I feel that this foundational paper was not explored in proportion to its depth and that it would be worth another visit.
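To put the intuition ahead of the technicalities, here is a minimal sketch of the two resampling schemes mentioned above, applied to the sample mean. Efron's bootstrap redraws n observations with replacement, while Rubin's Bayesian bootstrap keeps the observations fixed and draws Dirichlet(1,…,1) weights on them. Function names, sample, and number of replicates are hypothetical choices for this example.

```python
import numpy as np

def bootstrap_se(sample, statistic, n_boot=2000, rng=None):
    """Standard error of a statistic estimated by resampling with replacement."""
    rng = np.random.default_rng() if rng is None else rng
    n = sample.size
    replicates = np.empty(n_boot)
    for b in range(n_boot):
        resample = sample[rng.integers(0, n, size=n)]
        replicates[b] = statistic(resample)
    return replicates.std(ddof=1)

def bayesian_bootstrap_mean_sd(sample, n_draws=2000, rng=None):
    """Spread of the weighted mean under Dirichlet(1, ..., 1) weights on the data."""
    rng = np.random.default_rng() if rng is None else rng
    weights = rng.dirichlet(np.ones(sample.size), size=n_draws)   # shape (n_draws, n)
    return (weights @ sample).std(ddof=1)

# Hypothetical sample.
rng = np.random.default_rng(3)
sample = rng.gamma(shape=2.0, scale=1.5, size=200)

se_boot = bootstrap_se(sample, np.mean, rng=rng)
sd_bayes = bayesian_bootstrap_mean_sd(sample, rng=rng)
se_formula = sample.std(ddof=1) / np.sqrt(sample.size)
print(f"bootstrap {se_boot:.4f}, Bayesian bootstrap {sd_bayes:.4f}, s/sqrt(n) {se_formula:.4f}")
```

For the mean, both schemes give a spread close to the textbook s/√n. The bootstrap becomes relevant for statistics such as the median, where no such formula is available: passing np.median instead of np.mean to bootstrap_se is enough.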
Following in the reading classics series, my Master students in the Reading Classics Seminar course listened today to Kaniav Kamary’s analysis of Dennis Lindley’s and Adrian Smith’s 1972 linear Bayes paper Bayes Estimates for the Linear Model in JRSS Series B. Here are her (Beamer) slides
At a first (mathematical) level this is an easier paper in the list, because it relies on linear algebra and normal conditioning. Of course, this is not the reason why Bayes Estimates for the Linear Model is in the list, nor how it impacted the field. It is indeed one of the first expositions on hierarchical Bayes programming, with some bits of empirical Bayes shortcuts when computation got a wee bit in the way. (Remember, this is 1972, when shrinkage estimation and its empirical Bayes motivations are in full blast… and, despite Hastings’ 1970 Biometrika paper, MCMC is yet to be imagined, except maybe by Julian Besag!) So, at secondary and tertiary levels, it is again hard to discuss, esp. with Kaniav’s low fluency in English. For instance, a major concept in the paper is exchangeability, not such a surprise given Adrian Smith’s translation of de Finetti into English. But this is a hard concept if only looking at the algebra within the paper, as a motivation for exchangeability and partial exchangeability (and hierarchical models) comes from applied fields like animal breeding (as in Sørensen and Gianola’s book). Otherwise, piling normal priors on top of normal priors is lost on the students. An objection from a 2012 reader is also that the assumption of exchangeability on the parameters of a regression model does not really make sense when the regressors are not normalised (this is linked to yesterday’s nefarious post!): I much prefer the presentation we make of the linear model in Chapter 3 of our Bayesian Core, based on Arnold Zellner’s g-prior. An interesting question from one student was whether or not this paper still had any relevance, other than historical. I was a bit at a loss on how to answer as, again, at a first level, the algebra was somehow natural and, at a statistical level, less informative priors could be used. However, the idea of grouping parameters together in partial exchangeability clusters remained quite appealing and bound to provide gains in precision…
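To make "piling normal priors on top of normal priors" less abstract, here is a minimal sketch of the simplest exchangeable case, under assumptions that are much narrower than the paper's general linear model: one observation y_i ~ N(θ_i, σ²) per parameter, θ_i ~ N(μ, τ²), a flat prior on μ, and both variances known. In that case the posterior mean of θ_i is a weighted average of y_i and the overall mean, with weight σ²/(σ²+τ²) on the overall mean. The function name and the numbers are hypothetical.

```python
import numpy as np

def exchangeable_posterior_means(y, sigma2, tau2):
    """Posterior means of exchangeable normal means, flat prior on the common mean.

    Model assumed for this sketch: y_i ~ N(theta_i, sigma2), theta_i ~ N(mu, tau2),
    flat prior on mu, sigma2 and tau2 known.
    """
    y = np.asarray(y, dtype=float)
    shrink = sigma2 / (sigma2 + tau2)        # weight put on the overall mean
    return shrink * y.mean() + (1.0 - shrink) * y

# Hypothetical observations: each one is pulled halfway towards the overall mean.
y = np.array([-2.0, 0.0, 1.0, 5.0])
print(exchangeable_posterior_means(y, sigma2=1.0, tau2=1.0))
```

The gain in precision comes from this pooling: the more similar the θ_i are believed to be (small τ²), the more each estimate borrows from the others.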
Following last week’s read of Hartigan and Wong’s 1979 K-Means Clustering Algorithm, my Master students in the Reading Classics Seminar course listened today to Agnė Ulčinaitė covering Rob Tibshirani’s original LASSO paper Regression shrinkage and selection via the lasso in JRSS Series B. Here are her (Beamer) slides
Again not the easiest paper in the list, again mostly algorithmic and requiring some background on how it impacted the field. Even though Agnė also went through the Elements of Statistical Learning by Hastie, Friedman and Tibshirani, it was hard to get away from the paper to analyse more widely the importance of the paper, the connection with the Bayesian (linear) literature of the 70’s, its algorithmic and inferential aspects, like the computational cost, and the recent extensions like Bayesian LASSO. Or the issue of handling n<p models. Remember that one of the S’s in LASSO stands for shrinkage: it was quite pleasant to hear again about ridge estimators and Stein’s unbiased estimator of the risk, as those were themes of my Ph.D. thesis… (I hope the students do not get discouraged by the complexity of those papers: there were fewer questions and fewer students this time. Next week, the compass will move to the Bayesian pole with a talk on Lindley and Smith’s 1972 linear Bayes paper by one of my PhD students.)
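On the algorithmic side, here is a minimal sketch of how a LASSO estimate can be computed by cyclic coordinate descent with soft-thresholding. This is a later and now common approach, not the algorithm of Tibshirani's original paper. The sketch assumes centred data (no intercept) and the objective (1/2n)‖y − Xb‖² + λ‖b‖₁; function names and simulated data are hypothetical.

```python
import numpy as np

def soft_threshold(z, lam):
    """Shrink z towards zero by lam, setting it to zero when |z| <= lam."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_coordinate_descent(X, y, lam, n_sweeps=200):
    """Minimise (1/2n)||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    col_scale = (X ** 2).sum(axis=0) / n
    residual = np.array(y, dtype=float)              # equals y - X @ beta since beta = 0
    for _ in range(n_sweeps):
        for j in range(p):
            residual += X[:, j] * beta[j]            # take coordinate j out of the fit
            rho = X[:, j] @ residual / n
            beta[j] = soft_threshold(rho, lam) / col_scale[j]
            residual -= X[:, j] * beta[j]            # put the updated coordinate back
    return beta

# Hypothetical sparse regression: only two of the five coefficients are non-zero.
rng = np.random.default_rng(4)
X = rng.normal(size=(100, 5))
beta_true = np.array([3.0, 0.0, 0.0, -2.0, 0.0])
y = X @ beta_true + rng.normal(scale=0.5, size=100)

lam_max = np.max(np.abs(X.T @ y)) / 100              # smallest lam giving an all-zero fit
for lam in (0.0, 0.1, 0.5, 1.01 * lam_max):
    print(lam, np.round(lasso_coordinate_descent(X, y, lam), 3))
```

The soft-thresholding step is where both S's of the acronym show up: coefficients are shrunk towards zero, and those whose partial correlation with the residual falls below λ are set exactly to zero, which is the selection part.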