Archive for data analysis
In preparation for the round table at the start of the MCMSkv conference, this afternoon, Anto sent us a paper written by David Donoho for the Tukey Centennial workshop, held in Princeton last September, entitled 50 years of Data Science, and which attracted a whole round of comments, judging from the Google search results. So much so that I decided not to read any of them before parsing through the paper, while almost certainly reproducing here, with my two cents, some of the previous comments.
“John Tukey’s definition of `Big Data’ was `anything that won’t fit on one device’.”
The complaint that data science is essentially statistics that does not dare to spell out statistics as if it were a ten letter word (p.5) is not new, if appropriate. In this paper, David Donoho evacuates the memes that supposedly separate data science from statistics, like “big data” (although I doubt non-statisticians would accept the quick rejection that easily, wondering at the ability of statisticians to develop big models), skills like parallel programming (which ineluctably leads to more rudimentary algorithms and inferential techniques), jobs requiring such a vast array of skills and experience that no graduate student sounds properly trained for it…
“A call to action, from a statistician who feels `the train is leaving the station’.” (p.12)
One point of the paper is to see John Tukey’s 1962 “The Future of Data Analysis” as prophetical of the “Big Data” and “Data Science” crises. Which makes a lot of sense when considering the four driving forces advanced by Tukey (p.11):
- formal statistics
- advanced computing and graphical devices
- the ability to face ever-growing data flows
- its adoption by an ever-wider range of fields
“Science about data science will grow dramatically in significance.”
David Donoho then moves on to incorporate Leo Breiman’s 2001 Two Cultures paper. Which separates machine learning and prediction from statistics and inference, leading to the “big chasm”! And he sees the combination of prediction with the “common task framework” as the “secret sauce” of machine learning, because of the possibility of objective comparison of methods on a testing dataset. Which does not seem to me to be the explanation for the current (real or perceived) disaffection for statistics and correlated attraction for more computer-related solutions. A code that wins a Kaggle challenge clearly has some efficient characteristics, but this tells me nothing of the abilities of the methodology behind that code. If any. Self-learning how to play chess within 72 hours is great, but is the principle behind it able to handle go at the same level? Plus, I remain worried about the (screaming) absence of model (or models) in predictive approaches. Or at least skeptical. For the same reason it does not help in producing a generic approach to problems. Nor an approximation to the underlying mechanism. I thus see nothing but a black box in many “predictive models”, which tells me nothing about the uncertainty, imprecision or reproducibility of such tools. “Tool evaluation” cannot be reduced to a final score on a testing benchmark. The paper concludes with the prediction that the validation of scientific methodology will solely be empirical (p.37). This leaves little ground if any for probability and uncertainty quantification, as reflected by their absence in the paper.
[Verbatim from the Alan Turing Institute webpage]
Alan Turing Fellowships
This is a unique opportunity for early career researchers to join The Alan Turing Institute. The Alan Turing Institute is the UK’s new national data science institute, established to bring together world-leading expertise to provide leadership in the emerging field of data science. The Institute has been founded by the universities of Cambridge, Edinburgh, Oxford, UCL and Warwick and EPSRC.
Fellowships are available for 3 years with the potential for an additional 2 years of support following interim review. Fellows will pursue research based at the Institute hub in the British Library, London. Fellowships will be awarded to individual candidates and fellows will be employed by a joint venture partner university (Cambridge, Edinburgh, Oxford, UCL or Warwick).
Key requirements: Successful candidates are expected to have i) a PhD in a data science (or adjacent) subject (or to have submitted their doctorate before taking up the post), ii) an excellent publication record and/or demonstrated excellent research potential such as via preprints, iii) a novel and challenging research agenda that will advance the strategic objectives of the Institute, and iv) leadership potential. Fellowships are open to all qualified applicants regardless of background.
Alan Turing Fellowship applications can be made in all data science research areas. The Institute’s research roadmap is available here. In addition to this open call, there are two specific fellowship programmes:
Fellowships addressing data-centric engineering
The Lloyd’s Register Foundation (LRF) / Alan Turing Institute programme to support data-centric engineering is a 5-year, £10M global programme, delivered through a partnership between LRF and the Alan Turing Institute. This programme will secure high technical standards (for example the next-generation algorithms and analytics) to enhance the safety of life and property around the major infrastructure upon which modern society relies. For further information on data-centric engineering, see LRF’s Foresight Review of Big Data. Applications for Fellowships under this call, which address the aims of the LRF/Turing programme, may also be considered for funding under the data-centric engineering programme. Fellowships awarded under this programme may vary from the conditions given above; for more details contact firstname.lastname@example.org.
Fellowships addressing data analytics and high-performance computing
Intel and the Alan Turing Institute will be supporting additional Fellowships in data analytics and high-performance computing. Applications for Fellowships under this call may also be considered for funding under the joint Intel-Alan Turing Institute programme. Fellowships awarded under this joint programme may vary from the conditions given above; for more details contact email@example.com.
Diversity and equality are promoted in all aspects of the recruitment and career management of our researchers. In keeping with the principles of the Institute, we especially encourage applications from female researchers.
I have to admit the rather embarrassing fact that Machine Learning, A probabilistic perspective by Kevin P. Murphy is the first machine learning book I really read in detail…! It is a massive book with close to 1,100 pages and I thus hesitated to carry it around, until I packed it in my bag for Warwick. (And in the train to Argentan.) It is also massive in its contents as it covers most (all?) of what I call statistics (but visibly corresponds to machine learning as well!). With a Bayesian bent most of the time (which is the secret meaning of probabilistic in the title).
“…we define machine learning as a set of methods that can automatically detect patterns in data, and then use the uncovered patterns to predict future data, or to perform other kinds of decision making under uncertainty (such as planning how to collect more data!).” (p.1)
Apart from the Introduction—which I find rather confusing for not dwelling on the nature of errors and randomness and on the reason for using probabilistic models (since they are all wrong) and charming for including a picture of the author’s family as an illustration of face recognition algorithms—, I cannot say I found the book more lacking in foundations or in the breadth of methods and concepts it covers than a “standard” statistics book. In short, this is a perfectly acceptable statistics book! Furthermore, it has a very relevant and comprehensive selection of references (sometimes favouring “machine learning” references over “statistics” references!). Even the vocabulary seems pretty standard to me. All this makes me wonder why we at all distinguish between the two domains, following Larry Wasserman’s views (for once!) that the difference is mostly in the eye of the beholder, i.e. in which department one teaches… Which was already my perspective before I read the book but it comforted me even further. And the author agrees as well (“The probabilistic approach to machine learning is closely related to the field of statistics, but differs slightly in terms of its emphasis and terminology”, p.1). Let us all unite!
[..part 2 of the book review to appear tomorrow…]
This book by David Cox and Christl Donnelly, Principles of Applied Statistics, is an extensive coverage of all the necessary steps and precautions one must go through when contemplating applied (i.e. real!) statistics. As the authors write in the very first sentence of the book, “applied statistics is more than data analysis” (p.i); the title could indeed have been “Principled Data Analysis”! Principles of Applied Statistics reminded me of how much we (at least I) take “the model” and “the data” for granted when doing statistical analyses, by going through all the pre-data and post-data steps that lead to the “idealized” (p.188) data analysis. The contents of the book are intentionally simple, with hardly any mathematical aspect, but with a clinical attention to exhaustivity and clarity. For instance, even though I would have enjoyed more stress on probabilistic models as the basis for statistical inference, they only appear in the fourth chapter (out of ten) with errors-in-variables models. The painstakingly careful coverage of the myriad of tiny but essential steps involved in a statistical analysis and the highlight of the numerous corresponding pitfalls was certainly illuminating to me. Just as the book refrains from mathematical digressions (“our emphasis is on the subject-matter, not on the statistical techniques as such”, p.12), it falls short of engaging in detailed and complex data stories. Instead, it uses little grey boxes to convey the pertinent aspects of a given data analysis, referring to a paper for the full story. (I acknowledge this may be frustrating at times, as one would like to read more…) The book reads very nicely and smoothly, and I must acknowledge I read most of it in trains, métros, and planes over the past week. (This remark is not intended as a criticism against a lack of depth or interest, by all means [and medians]!)
“A general principle, sounding superficial but difficult to implement, is that analyses should be as simple as possible, but not simpler.” (p.9)
To get into more detail, Principles of Applied Statistics covers (most of!) the purposes of statistical analyses (Chap. 1), design with some special emphasis (Chap. 2-3), which is not surprising given the record of the authors (and “not a moribund art form”, p.51), measurement (Chap. 4), including the special case of latent variables and their role in model formulation, preliminary analysis (Chap. 5) by which the authors mean data screening and graphical pre-analysis, [at last!] models (Chap. 6-7), separated into model formulation [debating the nature of probability] and model choice, the latter being somewhat separated from the standard meaning of the term (done in §8.4.5 and §8.4.6), formal [mathematical] inference (Chap. 8), covering in particular testing and multiple testing, interpretation (Chap. 9), i.e. post-processing, and a final epilogue (Chap. 10). The readership of the book is rather broad, from practitioners to students, although both categories do require a good dose of maturity, to teachers, to scientists designing experiments with a statistical mind. It may be deemed too philosophical by some, too allusive by others, but I think it constitutes a magnificent testimony to the depth and to the spectrum of our field.
“Of course, all choices are to some extent provisional.” (p.130)
As a personal aside, I appreciated the illustration through capture-recapture models (p.36) with a remark on the impact of toe-clipping on frogs, as it reminded me of a similar way of marking lizards when my (then) student Jérôme Dupuis was working on a corresponding capture-recapture dataset in the 90’s. On the other hand, while John Snow’s story [of using maps to explain the cause of cholera] is alluring, and his map makes for a great cover, I am less convinced it is particularly relevant within this book.
“The word Bayesian, however, became more widely used, sometimes representing a regression to the older usage of flat prior distributions supposedly representing initial ignorance, sometimes meaning models in which the parameters of interest are regarded as random variables and occasionally meaning little more than that the laws of probability are somewhere invoked.” (p.144)
My main quibble with the book goes, most unsurprisingly!, with the treatment of Bayesian analysis found in Principles of Applied Statistics (pp.143-144). Indeed, on the one hand, the method is mostly criticised over those two pages. On the other hand, it is the only method presented with this level of detail, including historical background, which seems a bit superfluous for a treatise on applied statistics. The drawbacks mentioned are (p.144)
- the weight of prior information or modelling as “evidence”;
- the impact of “indifference or ignorance or reference priors”;
- whether or not empirical Bayes modelling has been used to construct the prior;
- whether or not the Bayesian approach is anything more than a “computationally convenient way of obtaining confidence intervals”.
The empirical Bayes perspective is the original one found in Robbins (1956) and seems to find grace in the authors’ eyes (“the most satisfactory formulation”, p.156). Contrary to MCMC methods, “a black box in that typically it is unclear which features of the data are driving the conclusions” (p.149)…
“If an issue can be addressed nonparametrically then it will often be better to tackle it parametrically; however, if it cannot be resolved nonparametrically then it is usually dangerous to resolve it parametrically.” (p.96)
Apart from a more philosophical paragraph on the distinction between machine learning and statistical analysis in the final chapter, with the drawback of using neural nets and such as black-box methods (p.185), there is relatively little coverage of non-parametric models, the choice of “parametric formulations” (p.96) being openly made. I can somehow understand this perspective for simpler settings, namely that nonparametric models offer little explanation of the production of the data. However, in more complex models, nonparametric components often are a convenient way to evacuate burdensome nuisance parameters… Again, technical aspects are not the focus of Principles of Applied Statistics so this also explains why it does not dwell intently on nonparametric models.
“A test of meaningfulness of a possible model for a data-generating process is whether it can be used directly to simulate data.” (p.104)
The above remark is quite interesting, especially when accounting for David Cox’s current appreciation of ABC techniques. The impossibility of generating from a posited model, as with some found in econometrics, precludes using ABC, but this does not necessarily mean the model should be excluded as unrealistic…
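To make the link between simulating from a model and running ABC concrete, here is a minimal sketch of an ABC rejection sampler. It assumes a toy Normal model with unit variance, a uniform prior on the mean over (-5, 5), and the sample mean as summary statistic; the names `simulate`, `abc_rejection` and the tolerance `eps` are purely illustrative and come neither from the book nor from any ABC library.

```python
import random
import statistics


def simulate(theta, n, rng):
    """Generate n observations from the posited model, here N(theta, 1)."""
    return [rng.gauss(theta, 1.0) for _ in range(n)]


def abc_rejection(observed, n_draws=20000, eps=0.05, seed=0):
    """Keep the prior draws whose simulated summary is within eps of the observed one."""
    rng = random.Random(seed)
    s_obs = statistics.fmean(observed)
    accepted = []
    for _ in range(n_draws):
        theta = rng.uniform(-5.0, 5.0)  # draw from the (assumed) uniform prior
        s_sim = statistics.fmean(simulate(theta, len(observed), rng))
        if abs(s_sim - s_obs) <= eps:  # only simulation is needed, never the likelihood
            accepted.append(theta)
    return accepted


# Toy data generated from the same model with mean 1.5 (hypothetical example)
data = simulate(1.5, 50, random.Random(1))
sample = abc_rejection(data)
print(len(sample), statistics.fmean(sample))
```

The only model-specific ingredient is `simulate`: if no such function can be written for the posited model, the sampler above cannot even be started, which is the sense in which the lack of a generative mechanism precludes ABC.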
“The overriding general principle is that there should be a seamless flow between statistical and subject-matter considerations.” (p.188)
As mentioned earlier, the last chapter brings a philosophical conclusion on what is (applied) statistics. It stresses the need for a careful and principled use of black-box methods so that they preserve a general framework and lead to explicit interpretations.
Gelman et al. just published a paper in the Annals of Applied Statistics on the selection of a prior on the parameters of a logistic regression. The idea is to scale the prior in terms of the impact of a “typical” change in a covariate onto the probability function, which is reasonable as long as there is enough independence between those covariates. The covariates are primarily rescaled to all have the same expected range, which to me amounts to a kind of empirical Bayes estimation of the scales in an unnormalised problem. The parameters are then associated with independent Cauchy (or t) priors, whose scale s is chosen as 2.5 in order to make the ±5 logistic range the extremal value. The perspective is well-motivated within the paper, and supported in addition by the availability of an R function called bayesglm (in the arm package).
This being said, I would have liked to see a comparison of bayesglm with the generalised g-prior perspective we develop in Bayesian Core rather than with the flat prior, which is not the correct Jeffreys’ prior and which anyway does not always lead to a proper posterior. In fact, the independent prior seems too rudimentary in the case of many (inevitably correlated) covariates, with the scale of 2.5 being then too large even when brought back to a reasonable change in the covariate. On the other hand, starting with a g-like-prior on the parameters and using a non-informative prior on the factor g allows for both a natural data-based scaling and an accounting of the dependence between the covariates. This non-informative prior on g then amounts to a generalised t prior on the parameter, once g is integrated. Anyone interested in the comparison can use the functions provided here on the webpage of Bayesian Core. (The paper already includes a comparison with Jeffreys’ prior implemented as brglm and the BBR algorithm of Genkin et al. (2007).) In the revision of Bayesian Core, we will most likely draw this comparison.
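As an illustration of what the independent Cauchy prior with scale 2.5 does in practice, here is a minimal sketch in Python. It is not the bayesglm implementation: it assumes a single numeric covariate, no intercept, and a rescaling to mean zero and standard deviation 0.5 (my reading of the paper's convention, to be checked against the paper), and the function names are hypothetical. On perfectly separated data, the log-likelihood (i.e. the flat-prior log-posterior) keeps increasing with the slope, while the Cauchy-penalised version reaches a maximum at a finite value.

```python
import math


def rescale(column):
    """Centre a numeric covariate and scale it to standard deviation 0.5 (assumed convention)."""
    n = len(column)
    mean = sum(column) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in column) / n)
    return [0.5 * (v - mean) / sd for v in column]


def log_cauchy(beta, scale):
    """Log-density of a Cauchy distribution centred at zero with the given scale."""
    return -math.log(math.pi * scale * (1.0 + (beta / scale) ** 2))


def log_likelihood(beta, x, y):
    """Logistic log-likelihood for a single slope and no intercept."""
    total = 0.0
    for xi, yi in zip(x, y):
        eta = beta * xi
        # numerically stable evaluation of log(1 + exp(eta))
        log1pexp = max(eta, 0.0) + math.log1p(math.exp(-abs(eta)))
        total += yi * eta - log1pexp
    return total


def log_posterior(beta, x, y, scale=2.5):
    """Unnormalised log-posterior under the Cauchy(0, scale) prior."""
    return log_likelihood(beta, x, y) + log_cauchy(beta, scale)


# Perfectly separated toy data (hypothetical): all failures below all successes
x = rescale([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = [0, 0, 0, 1, 1, 1]
for beta in (1.0, 4.0, 10.0, 100.0, 1000.0):
    print(beta, log_likelihood(beta, x, y), log_posterior(beta, x, y))
```

The same skeleton extends to several covariates by summing one `log_cauchy` term per coefficient, which is exactly where the independence between the prior components, and hence the neglect of correlation between covariates, enters.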
In order to make advances in the processing of their datasets and experiments, and in the understanding of the fundamental parameters driving the general relativity model, cosmologists are launching a competition called the great’08 challenge through the Pascal European network. Details about the challenge are available in the arXiv:0802.1214 document, the model being clearly defined from a statistical point of view as a combination of lensing shear (the phenomenon of interest) and of various (=three) convolution noises that make the analysis so challenging, and the data being a collection of images of galaxies. The fundamental problem is to identify a 2d-linear distortion applied to all images within a certain region of the space, up (or down) to a precision of 0.003, the distortion being identified by an isotropic assumption over the undistorted images. The solution must be efficient too in that it is to be tested on 27 million galaxies! A standard MCMC mixture analysis on each galaxy is thus unlikely to converge before the challenge is over, next April. I think the challenge is worth considering by statistical teams, even though this represents a considerable involvement over the next six months…
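To see why an assumption on the undistorted images can identify the distortion, here is a toy sketch, assuming that the assumption means undistorted galaxies have no preferred orientation (isotropy), that each galaxy is summarised by a two-component ellipticity, and that a weak shear simply adds a common shift to those components. The three convolution noises of the actual challenge are ignored, and all names and numerical values are illustrative rather than taken from the challenge document.

```python
import math
import random


def simulate_galaxies(n, shear, rng, sigma_e=0.3):
    """Ellipticities with uniformly random orientation, shifted by a common shear."""
    g1, g2 = shear
    observed = []
    for _ in range(n):
        magnitude = abs(rng.gauss(0.0, sigma_e))
        angle = rng.uniform(0.0, math.pi)  # isotropy: no preferred orientation
        e1 = magnitude * math.cos(2.0 * angle)  # the two ellipticity components
        e2 = magnitude * math.sin(2.0 * angle)
        observed.append((e1 + g1, e2 + g2))  # additive weak-shear approximation (assumed)
    return observed


def estimate_shear(observed):
    """Average the observed ellipticities; the intrinsic parts cancel under isotropy."""
    n = len(observed)
    return (sum(e[0] for e in observed) / n, sum(e[1] for e in observed) / n)


true_shear = (0.02, -0.01)  # hypothetical values
galaxies = simulate_galaxies(100000, true_shear, random.Random(42))
estimate = estimate_shear(galaxies)
print(estimate)
```

Since the intrinsic ellipticities average to zero under isotropy, the error of this naive estimator shrinks like the inverse square root of the number of galaxies, which gives a feel for why a tight precision target calls for pooling many images rather than analysing each galaxy at length.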