Archive for data analysis

principal components [xkcd repost]

Posted in Kids with tags , , , , , , , on January 13, 2017 by xi'an

Nature snapshot [Volume 539 Number 7627]

Posted in Books, Statistics, University life with tags , , , , , , , , , , on November 15, 2016 by xi'an

A number of entries of interest [to me] in that Nature issue: from the Capuchin monkeys that break stones in a way that resembles early hominins biface tools, to the persistent association between some sounds and some meanings across numerous languages, to the use of infected mosquitoes in South America to fight Zika, to the call for more maths in psychiatry by the NIMH director, where since prevision is mentioned I presumed stats is included, to the potentially earthshaking green power revolution in Africa, to the reconstruction of the first HIV strains in North America, along with the deconstruction of the “Patient 0” myth, helped by Bayesian phylogenetic analyses, to a cover of the Open Syllabus Project, with Monte Carlo Statistical Methods arriving first [in the Monte Carlo list]….

“Observations should not converge on one model but aim to find anomalies that carry clues about the nature of dark matter, dark energy or initial conditions of the Universe. Further observations should be motivated by testing unconventional interpretations of those anomalies (such as exotic forms of dark matter or modified theories of gravity). Vast data sets may contain evidence for unusual behaviour that was unanticipated when the projects were conceived.” Avi Loeb

One editorial particularly drew my attention, Good data are not enough, by the astronomer Avi Loeb. as illustrated  by the quote above, Loeb objects to data being interpreted and even to data being collected towards the assessment of the standard model. While I agree that this model contains a lot of fudge factors like dark matter and dark energy, which apparently constitutes most of the available matter, the discussion is quite curious, in that interpreting data according to alternative theories sounds impossible and certainly beyond the reach of most PhD students [as Loeb criticises the analysis of some data in a recent thesis he evaluated].

“modern cosmology is augmented by unsubstantiated, mathematically sophisticated ideas — of the multiverse, anthropic reasoning and string theory.

The author argues to always allow for alternative interpretations of the data, which sounds fine at a primary level but again calls for the conception of such alternative models. When discrepancies are found between the standard model and the data, they can be due to errors in the measurement itself, in the measurement model, or in the theoretical model. However, they may be impossible to analyse outside the model, in the neutral way called and wished by Loeb. Designing neutral experiments sounds even less meaningful. Which is why I am fairly taken aback by the call to “a research frontier [that] should maintain at least two ways of interpreting data so that new experiments will aim to select the correct one”! Why two and not more?! And which ones?! I am not aware of fully developed alternative theories and cannot see how experiments designed under one model could produce indications about a new and incomplete model.

“Such simple, off-the-shelf remedies could help us to avoid the scientific fate of the otherwise admirable Mayan civilization.”

Hence I am bemused by the whole exercise, which deepest arguments seem to be a paper written by the author last year and an interdisciplinary centre on black holes also launched recently by the same author.


Posted in pictures, Statistics, Travel, University life with tags , , , , , , , , , , , , on April 28, 2016 by xi'an

years (and years) of data science

Posted in Books, Statistics, Travel, University life with tags , , , , , , , , , , , , , , on January 4, 2016 by xi'an

In preparation for the round table at the start of the MCMSkv conference, this afternoon, Anto sent us a paper written by David Donoho for the Tukey Centennial workshop, held in Princeton last September. Entitled 50 years of Data Science. And which attracted a whole round of comments, judging from the Google search results. So much that I decided not to read any of them before parsing through the paper. But almost certainly reproducing here with my two cents some of the previous comments.

“John Tukey’s definition of `Big Data’ was `anything that won’t fit on one device’.”

The complaint that data science is essentially statistics that does not dare to spell out statistics as if it were a ten letter word (p.5) is not new, if appropriate. In this paper, David Donoho evacuates the memes that supposedly separate data science from statistics, like “big data” (although I doubt non-statisticians would accept the quick rejection that easily, wondering at the ability of statisticians to develop big models), skills like parallel programming (which ineluctably leads to more rudimentary algorithms and inferential techniques), jobs requiring such a vast array of skills and experience that no graduate student sounds properly trained for it…

“A call to action, from a statistician who fells `the train is leaving the station’.” (p.12)

One point of the paper is to see 1962 John Tukey’s “The Future of Data Analysis” as prophetical of the “Big Data” and “Data Science” crises. Which makes a lot of sense when considering the four driving forces advanced by Tukey (p.11):

  1. formal statistics
  2. advanced computing and graphical devices
  3. the ability to face ever-growing data flows
  4. its adoption by an ever-wider range of fields

“Science about data science will grow dramatically in significance.”

David Donoho then moves on to incorporate   Leo Breiman’s 2001 Two Cultures paper. Which separates machine learning and prediction from statistics and inference, leading to the “big chasm”! And he sees the combination of prediction with “common task framework” as the “secret sauce” of machine learning, because of the possibility of objective comparison of methods on a testing dataset. Which does not seem to me as the explanation for the current (real or perceived) disaffection for statistics and correlated attraction for more computer-related solutions. A code that wins a Kaggle challenge clearly has some efficient characteristics, but this tells me nothing of the abilities of the methodology behind that code. If any. Self-learning how to play chess within 72 hours is great, but is the principle behind able to handle go at the same level?  Plus, I remain worried about the (screaming) absence of model (or models) in predictive approaches. Or at least skeptical. For the same reason it does not help in producing a generic approach to problems. Nor an approximation to the underlying mechanism. I thus see nothing but a black box in many “predictive models”, which tells me nothing about the uncertainty, imprecision or reproducibility of such tools. “Tool evaluation” cannot be reduced to a final score on a testing benchmark. The paper concludes with the prediction that the validation of scientific methodology will solely be empirical (p.37). This leaves little ground if any for probability and uncertainty quantification, as reflected their absence in the paper.

fellowship openings at the Alan Turing Institute

Posted in pictures, Statistics, University life with tags , , , , , , , , , , , , on November 17, 2015 by xi'an

[Verbatim from the  Alan Turing Institute webpage]Alan Turing Fellowships

This is a unique opportunity for early career researchers to join The Alan Turing Institute. The Alan Turing Institute is the UK’s new national data science institute, established to bring together world-leading expertise to provide leadership in the emerging field of data science. The Institute has been founded by the universities of Cambridge, Edinburgh, Oxford, UCL and Warwick and EPSRC.

Fellowships are available for 3 years with the potential for an additional 2 years of support following interim review. Fellows will pursue research based at the Institute hub in the British Library, London. Fellowships will be awarded to individual candidates and fellows will be employed by a joint venture partner university (Cambridge, Edinburgh, Oxford, UCL or Warwick).

Key requirements: Successful candidates are expected to have i) a PhD in a data science (or adjacent) subject (or to have submitted their doctorate before taking up the post), ii) an excellent publication record and/or demonstrated excellent research potential such as via preprints, iii) a novel and challenging research agenda that will advance the strategic objectives of the Institute, and iv) leadership potential. Fellowships are open to all qualified applicants regardless of background.

Alan Turing Fellowship applications can be made in all data science research areas. The Institute’s research roadmap is available here. In addition to this open call, there are two specific fellowship programmes:

Fellowships addressing data-centric engineering

The Lloyd’s Register Foundation (LRF) / Alan Turing Institute programme to support data-centric engineering is a 5-year, £10M global programme, delivered through a partnership between LRF and the Alan Turing Institute. This programme will secure high technical standards (for example the next-generation algorithms and analytics) to enhance the safety of life and property around the major infrastructure upon which modern society relies. For further information on data-centric engineering, see LRF’s Foresight Review of Big Data. Applications for Fellowships under this call, which address the aims of the LRF/Turing programme, may also be considered for funding under the data-centric engineering programme. Fellowships awarded under this programme may vary from the conditions given above; for more details contact

Fellowships addressing data analytics and high-performance computing

Intel and the Alan Turing Institute will be supporting additional Fellowships in data analytics and high-performance computing. Applications for Fellowships under this call may also be considered for funding under the joint Intel-Alan Turing Institute programme. Fellowships awarded under this joint programme may vary from the conditions given above; for more details contact

Download full information on the Turing fellowships here

Diversity and equality are promoted in all aspects of the recruitment and career management of our researchers. In keeping with the principles of the Institute, we especially encourage applications from female researchers