Archive for John Tukey

Monte Carlo swindles

Posted in Statistics on April 2, 2023 by xi'an

While reading Boos and Hughes-Oliver's 1998 American Statistician paper on the applications of Basu's theorem, I came across the notion of Monte Carlo swindles, where a reduced variance can be achieved without a corresponding increase in the Monte Carlo budget. For instance, approximating the variance of the median statistic M for a Normal location family can be sped up by considering that

\text{var}(M)=\text{var}(M-\bar X)+\text{var}(\bar X)

by Basu’s theorem. However, when reading the originating 1973 paper by Gross (although the notion is presumably due to Tukey), the argument boils down to Rao-Blackwellisation (without the Rao-Blackwell theorem being mentioned). The related 1985 American Statistician paper by Johnstone and Velleman exploits a latent variable representation. It also makes the connection with the control variate approach, noticing the appeal of using the score function as a (standard) control and (unusual) swindle, since its expectation is zero. I am surprised at uncovering this notion only now… Possibly because the method only applies in special settings.

A side remark from the same 1998 paper, namely that the enticing decomposition

\mathbb E[(X/Y)^k] = \mathbb E[X^k] \big/ \mathbb E[Y^k]

when X/Y and Y are independent, should be kept out of reach of my undergraduates at all costs, as they would quickly get rid of the assumption!!!
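
For the record, the identity follows in one line by writing X^k = (X/Y)^k Y^k and using the assumed independence of X/Y and Y (granted the moments exist and the denominator is non-zero):

\mathbb E[X^k] = \mathbb E[(X/Y)^k\,Y^k] = \mathbb E[(X/Y)^k]\,\mathbb E[Y^k]

which is exactly the independence assumption my undergraduates would be so keen to drop.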

ten computer codes that transformed science

Posted in Books, Linux, R, Statistics, University life on April 23, 2021 by xi'an

In a “Feature” article of 21 January 2021, Nature goes over a poll on “software tools that have had a big impact on the world of science”. Among those,

the Fortran compiler (1957), which is one of the first symbolic languages, developed by IBM. This is the first computer language I learned (in 1982) and one of the two (with SAS) I ever coded on punch cards for the massive computers of INSEE. I quickly and enthusiastically switched to Pascal (and the Apple IIe) the year after and, despite an attempt at moving to C, I alas kept the Pascal programming style in my subsequent C codes (until I gave up in the early 2000s!). I eventually moved to R full time, even though I had been using S-Plus since a Unix version was produced. Interestingly, a later survey of Nature readers put R at the top of the list of what should have been included, incidentally also including Monte Carlo algorithms in the list (and I did not vote in that poll!),

the fast Fourier transform (1965), co-introduced by John Tukey, but which I never ever used (or at least not knowingly!),

arXiv (1991), which was started as an emailed preprint list by Paul Ginsparg at Los Alamos and had acquired its current name by 1998, and where I only started publishing (or arXiving) in 2007, perhaps because it then sounded difficult to submit a preprint there, perhaps because having a worldwide preprint server sounded more like bother (esp. since we then had to publish our preprints on the local servers) than revolution, perhaps because of a vague worry of being overtaken by others… Anyway, I now see arXiv as the primary outlet for publishing papers, with the possible added features of arXiv-backed journals and Peer Community validations,

the IPython Notebook (2011), by Fernando Pérez, which started as 259 lines of Python code and turned into Jupyter in 2014. I know nothing about this, but I can relate to the relevance of the project when thinking about R Markdown, which I find more and more to be a great way to work on collaborative projects, to teach, and to produce reproducible research. (I do remember once writing a paper in Sweave, but not which one…!)

poems that solve puzzles [book review]

Posted in Books, Kids, University life on January 7, 2021 by xi'an

Upon request, I received this book from Oxford University Press for review. Poems that Solve Puzzles is a nice title and its cover is quite to my liking (for once!). The author is Chris Bleakley, Head of the School of Computer Science at UCD.

“This book is for people that know algorithms are important, but have no idea what they are.”

This is the first sentence of the book and hence I am clearly falling outside the intended audience. When I asked OUP for a review copy, I was more thinking in terms of Robert Sedgewick's Algorithms, whose first edition still sits on my shelves and which I read from first to last page when it appeared [and was part of my wife's booklist]. This was (and is) indeed a fantastic book to learn how to build and optimise algorithms, and I gained a lot from it (despite remaining a poor programmer!).

Back to poems, this one reads much more like a history of computer science for newbies than a deep entry into the "science of algorithms", with imho too little on the algorithms themselves and their connections with computer languages, and too much emphasis on the pomp and circumstance of computer science (like so-and-so got the ACM A.M. Turing Award in 19… and retired in 19…). Besides the antique algorithms for finding primes, approximating π, and computing the (fast) Fourier transform (incl. John Tukey), the story moves quickly to the difference engine of Charles Babbage and Ada Lovelace, then to Turing's machine, and to artificial intelligence with the first checkers codes, which already included some learning aspects. There are some sections on the ENIAC, John von Neumann and Stan Ulam, with the invention of Monte Carlo methods (but no word on MCMC). A bit of complexity theory (P versus NP), and then the Internet, Amazon, Google, Facebook, Netflix… Finishing with neural networks (then and now), the unavoidable AlphaGo, and the incoming cryptocurrencies and quantum computers. All this makes for pleasant (if unsurprising) reading and could possibly captivate a young reader for whom computers are more than a gaming console, or a more senior reader who has so far stayed wary of and away from computers. But I would have enjoyed much more a low-tech discussion on the construction, validation and optimisation of algorithms, namely a much soft(ware) version, as it would have made the book much more distinct from the existing offerings on the history of computer science.

[Disclaimer about potential self-plagiarism: this post or an edited version of it will eventually appear in my Books Review section in CHANCE.]

years (and years) of data science

Posted in Books, Statistics, Travel, University life on January 4, 2016 by xi'an

In preparation for the round table at the start of the MCMSkv conference, Anto sent us this afternoon a paper written by David Donoho for the Tukey Centennial workshop held in Princeton last September, entitled 50 Years of Data Science, which attracted a whole round of comments, judging from the Google search results. So much so that I decided not to read any of them before parsing through the paper, while almost certainly reproducing here, with my two cents, some of the previous comments.

“John Tukey’s definition of `Big Data’ was `anything that won’t fit on one device’.”

The complaint that data science is essentially statistics that does not dare to spell out statistics, as if it were a ten-letter word (p.5), is not new, if appropriate. In this paper, David Donoho dismisses the memes that supposedly separate data science from statistics, like "big data" (although I doubt non-statisticians would accept the quick rejection that easily, wondering at the ability of statisticians to develop big models), skills like parallel programming (which ineluctably leads to more rudimentary algorithms and inferential techniques), or jobs requiring such a vast array of skills and experience that no graduate student seems properly trained for them…

“A call to action, from a statistician who feels `the train is leaving the station’.” (p.12)

One point of the paper is to see John Tukey's 1962 "The Future of Data Analysis" as prophetic of the "Big Data" and "Data Science" crises. Which makes a lot of sense when considering the four driving forces advanced by Tukey (p.11):

  1. formal statistics
  2. advanced computing and graphical devices
  3. the ability to face ever-growing data flows
  4. its adoption by an ever-wider range of fields

“Science about data science will grow dramatically in significance.”

David Donoho then moves on to incorporate Leo Breiman's 2001 Two Cultures paper, which separates machine learning and prediction from statistics and inference, leading to the "big chasm"! And he sees the combination of prediction with the "common task framework" as the "secret sauce" of machine learning, because of the possibility of objectively comparing methods on a testing dataset. Which does not seem to me to be the explanation for the current (real or perceived) disaffection for statistics and the correlated attraction for more computer-related solutions. A code that wins a Kaggle challenge clearly has some efficient characteristics, but this tells me nothing of the abilities of the methodology behind that code. If any. Self-learning how to play chess within 72 hours is great, but is the principle behind it able to handle go at the same level? Plus, I remain worried about the (screaming) absence of model (or models) in predictive approaches. Or at least skeptical. For the same reason, it does not help in producing a generic approach to problems. Nor an approximation to the underlying mechanism. I thus see nothing but a black box in many "predictive models", which tells me nothing about the uncertainty, imprecision or reproducibility of such tools. "Tool evaluation" cannot be reduced to a final score on a testing benchmark. The paper concludes with the prediction that the validation of scientific methodology will solely be empirical (p.37). This leaves little ground, if any, for probability and uncertainty quantification, as reflected by their absence from the paper.

reading classics (#11)

Posted in Books, Statistics, University life on March 21, 2013 by xi'an

Today was my last Reading Seminar class and the concluding paper chosen by the student was Tukey's "The future of data analysis", a 1962 Annals of Mathematical Statistics paper. Unfortunately, reading this paper required much more maturity and background than the student could afford, which is the reason why this last presentation is not posted on this page… Given the global and a-theoretical perspective of the paper, it was quite difficult to interpret without further delving into Tukey's work and without a proper knowledge of what Data Analysis was in the 1960s. (The love affair of French statisticians with data analysis was then at its apex, but it has very much receded since!) Being myself unfamiliar with this paper, and judging mostly from the sentences pasted by the student in his slides, I cannot tell how much of the paper is truly visionary and how much is cheap talk: focussing on trimmed and winsorized means does not sound like offering a very wide scope for data analysis… I liked the quote "It's easier to carry a slide rule than a desk computer, to say nothing of a large computer"! (As well as the quote from Asimov, "The sound of panting"…) Still, I am unsure I will keep the paper on the list next year!

Overall, despite a rather disappointing lower tail of the distribution of the talks, I am very happy with the way the seminar proceeded this year, and with the efforts produced by the students to assimilate the papers and acquire the necessary presentation skills, including building a background in LaTeX and Beamer for most of them. I thus think almost all students will pass this course and do hope those skills will prove profitable for their future studies…