Archive for John Tukey

poems that solve puzzles [book review]

Posted in Books, Kids, University life with tags , , , , , , , , , , , , , , , , , , on January 7, 2021 by xi'an

Upon request, I received this book from Oxford University Press for review. Poems that Solve Puzzles is a nice title and its cover is quite to my linking (for once!). The author is Chris Bleakley, Head of the School of Computer Science at UCD.

“This book is for people that know algorithms are important, but have no idea what they are.”

These is the first sentence of the book and hence I am clearly falling outside the intended audience. When I asked OUP for a review copy, I was more thinking in terms of Robert Sedgewick’s Algorithms, whose first edition still sits on my shelves and which I read from first to last page when it appeared [and was part of my wife’s booklist]. This was (and is) indeed a fantastic book to learn how to build and optimise algorithms and I gain a lot from it (despite remaining a poor programmer!).

Back to poems, this one reads much more like an history of computer science for newbies than a deep entry into the “science of algorithms”, with imho too little on the algorithms themselves and their connections with computer languages and too much emphasis on the pomp and circumstances of computer science (like so-and-so got the ACM A.M. Turing Award in 19… and  retired in 19…). Beside the antique algorithms for finding primes, approximating π, and computing the (fast) Fourier transform (incl. John Tukey), the story moves quickly to the difference engine of Charles Babbage and Ada Lovelace, then to Turing’s machine, and artificial intelligence with the first checkers codes, which already included some learning aspects. Some sections on the ENIAC, John von Neumann and Stan Ulam, with the invention of Monte Carlo methods (but no word on MCMC). A bit of complexity theory (P versus NP) and then Internet, Amazon, Google, Facebook, Netflix… Finishing with neural networks (then and now), the unavoidable AlphaGo, and the incoming cryptocurrencies and quantum computers. All this makes for pleasant (if unsurprising) reading and could possibly captivate a young reader for whom computers are more than a gaming console or a more senior reader who so far stayed wary and away of computers. But I would have enjoyed much more a low-tech discussion on the construction, validation and optimisation of algorithms, namely a much soft(ware) version, as it would have made it much more distinct from the existing offer on the history of computer science.

[Disclaimer about potential self-plagiarism: this post or an edited version of it will eventually appear in my Books Review section in CHANCE.]

years (and years) of data science

Posted in Books, Statistics, Travel, University life with tags , , , , , , , , , , , , , , on January 4, 2016 by xi'an

In preparation for the round table at the start of the MCMSkv conference, this afternoon, Anto sent us a paper written by David Donoho for the Tukey Centennial workshop, held in Princeton last September. Entitled 50 years of Data Science. And which attracted a whole round of comments, judging from the Google search results. So much that I decided not to read any of them before parsing through the paper. But almost certainly reproducing here with my two cents some of the previous comments.

“John Tukey’s definition of `Big Data’ was `anything that won’t fit on one device’.”

The complaint that data science is essentially statistics that does not dare to spell out statistics as if it were a ten letter word (p.5) is not new, if appropriate. In this paper, David Donoho evacuates the memes that supposedly separate data science from statistics, like “big data” (although I doubt non-statisticians would accept the quick rejection that easily, wondering at the ability of statisticians to develop big models), skills like parallel programming (which ineluctably leads to more rudimentary algorithms and inferential techniques), jobs requiring such a vast array of skills and experience that no graduate student sounds properly trained for it…

“A call to action, from a statistician who fells `the train is leaving the station’.” (p.12)

One point of the paper is to see 1962 John Tukey’s “The Future of Data Analysis” as prophetical of the “Big Data” and “Data Science” crises. Which makes a lot of sense when considering the four driving forces advanced by Tukey (p.11):

  1. formal statistics
  2. advanced computing and graphical devices
  3. the ability to face ever-growing data flows
  4. its adoption by an ever-wider range of fields

“Science about data science will grow dramatically in significance.”

David Donoho then moves on to incorporate   Leo Breiman’s 2001 Two Cultures paper. Which separates machine learning and prediction from statistics and inference, leading to the “big chasm”! And he sees the combination of prediction with “common task framework” as the “secret sauce” of machine learning, because of the possibility of objective comparison of methods on a testing dataset. Which does not seem to me as the explanation for the current (real or perceived) disaffection for statistics and correlated attraction for more computer-related solutions. A code that wins a Kaggle challenge clearly has some efficient characteristics, but this tells me nothing of the abilities of the methodology behind that code. If any. Self-learning how to play chess within 72 hours is great, but is the principle behind able to handle go at the same level?  Plus, I remain worried about the (screaming) absence of model (or models) in predictive approaches. Or at least skeptical. For the same reason it does not help in producing a generic approach to problems. Nor an approximation to the underlying mechanism. I thus see nothing but a black box in many “predictive models”, which tells me nothing about the uncertainty, imprecision or reproducibility of such tools. “Tool evaluation” cannot be reduced to a final score on a testing benchmark. The paper concludes with the prediction that the validation of scientific methodology will solely be empirical (p.37). This leaves little ground if any for probability and uncertainty quantification, as reflected their absence in the paper.

reading classics (#11)

Posted in Books, Statistics, University life with tags , , , , , , , on March 21, 2013 by xi'an

Today was my last Reading Seminar class and the concluding paper chosen by the student was Tukey’s “The future of data analysis“, a 1962 Annals of Math. Stat. paper. Unfortunately, reading this paper required much more maturity and background than the student could afford, which is the reason why this last presentation is not posted on this page… Given the global and a-theoretical perspective of the paper, it was quite difficult to interpret without further delving into Tukey’s work and without a proper knowledge of what was Data Analysis in the 1960’s. (The love affair of French statisticians with data analysis was then at its apex, but it has very much receded since then!) Being myself unfamiliar with this paper, and judging mostly from the sentences pasted by the student in his slides, I cannot tell how much of the paper is truly visionary and how much is cheap talk: focussing on trimmed and winsorized means does not sound like offering a very wide scope for data analysis… I liked the quote “It’s easier to carry a slide rule than a desk computer, to say nothing of a large computer”! (As well as the quote from Azimov “The sound of panting“…. (Still, I am unsure I will keep the paper within the list next year!)

Overall, despite a rather disappointing lower tail of the distribution of the talks, I am very happy with the way the seminar proceeded this year and the efforts produced by the students to assimilate the papers, the necessary presentation skills including building a background in LaTeX and Beamer for most students. I thus think almost all students will pass this course and do hope those skills will be profitable for their future studies…

A misleading title…

Posted in Books, R, Statistics, University life with tags , , , , , , , , , , on September 5, 2011 by xi'an

When I received this book, Handbook of fitting statistical distributions with R, by Z. Karian and E.J. Dudewicz,  from/for the Short Book Reviews section of the International Statistical Review, I was obviously impressed by its size (around 1700 pages and 3 kilos…). From briefly glancing at the table of contents, and the list of standard distributions appearing as subsections of the first chapters, I thought that the authors were covering different estimation/fitting techniques for most of the standard distributions. After taking a closer look at the book, I think the cover is misleading in several aspects: this is not a handbook (a.k.a. a reference book), it does not cover standard statistical distributions, the R input is marginal, and the authors only wrote part of the book, since about half of the chapters are written by other authors…

Continue reading

A survey of [the 60’s] Monte Carlo methods

Posted in Books, R, Statistics, University life with tags , , , , , , , on May 17, 2011 by xi'an

“The only good Monte Carlos are the dead Monte Carlos” (Trotter and Tukey, quoted by Halton)

When I presented my [partial] history of MCM methods in Bristol two months ago, at the Julian Besag memorial, Christophe Andrieu mentioned a 1970 SIAM survey by John Halton on A retrospective and prospective survey of the Monte Carlo method. This is a huge paper (62 pages, 251 references) and it brings a useful light on the advances in the 60’s (the paper was written in 1968). From the reference list, it seems John Halton was planning two books on the Monte Carlo method, but a search on google did not show anything. I also discovered in this list that there was a 1954 RSS symposium (Read Paper?) on Monte Carlo methods. Note that there were at least two books on Monte Carlo published at the time, Hammersley and Handscomb’s 1964 Monte Carlo Methods and Scheider’s 1966 Monte Carlo Method. (Hammerlsey appears as a major figure in this survey.) There is a lot of material in this long review and most of the standard methods are listed: control variate, importance sampling, self-normalised simportance sampling, stratified sampling, antithetic variables, simulation by inversion, rejection or demarginalisation. Variance reduction is presented as the motivation for the alternative methods. Very little is said about the use of Monte Carlo methods in statistics (“many of  [the applications] are primitive and artless“)  I was first very surprised to find sequential Monte Carlomentioned as well, but it later appeared this was Monte Carlo methods for sequential problems, in the spirit of Abraham Wald. While the now forgotten EZH method is mentioned as a promising new method (p.11), the survey also contains an introduction to the conditional Monte Carlo method of Trotter and Tukey (1956) [from whom the above and rather puzzling quote is taken] that could relate to the averaging techniques of Kong, McCullagh, Meng, Nicolae and Tan as found in their 2003 Read Paper….

“The search for randomness is evidently futile” (Halton)

A large part of the review is taken by the study of uniform random generators and by the distinction between random, pseudo-random and quasi-random versions. Halton insists very much on the lack of justification in using non-random generators, even though they work well. He even goes further as to warn about bias because even the truly random generators are discrete. The book covers the pseudo-random generators, starting with the original version of von Neumann, Metropolis, and Ulam, continuing with Lehmer’s well-known congruencial generator, and the Fibonacci generalisation. For testing those generators by statistical tests (again with little theoretical ground), Marsaglia is mentioned.  The paper also covers in great detail the quasi-random sequences, covering low discrepancy requirements, van der Corput’s, Halton’s and Hammersley’s sequences. Halton considers quasi-Monte Carlo as “a branch of numerical analysis”.

The paper concludes with a list of 24 future developments I will cover in another post tomorrow…