## the cartoon introduction to statistics

Posted in Books, Kids, Statistics, University life with tags , , , , , on May 16, 2013 by xi'an

A few weeks ago, I received a copy of The Cartoon Introduction to Statistics by Grady Klein and Alan Dabney, sent by their publisher, Farrar, Straus and Giroux of New York City. (I had never heard of this publisher before, but I must admit the aggregation of those three names sounds great!) As this was an unpublished version of the book, to appear in July 2013, I first assumed my copy was a draft version, with black-and-white drawings using limited-precision graphics. However, when checking the already published Cartoon Introduction to Economics, I realised this was the style of Grady Klein (as reflected below).

Thus, I have to assume this is how The Cartoon Introduction to Statistics will look when published in July… I am quite perplexed by the whole project. First, I do not see how a newcomer to the field can learn better from a cartoon with an average of four sentences per page than from a regular introductory textbook. Cartoons introduce an element of fun into the explanation, with jokes and (irrelevant) side stories, but they are also distracting, as readers are not always in a position to know what matters and what does not. Second, as the drawings are done in a rough style, I find this increases the potential for confusion. For instance, the above cover reproduces an example linking the histogram of a sample of averages with the normal distribution. If a reader has never heard of histograms, I do not see how he or she could gather how they are constructed in practice. The width of the bags is related to the number of persons in each bag (50 random Americans) in the story, while it should be related to the inverse of the square root of this number in the theory. Similarly, I find the explanation about confidence intervals lacking: when trying to reassure readers about the fact that any given random sample from a population might be misleading, the authors state that “in the long run most cans [of worms] have averages in the clump under the hump [of the normal pdf]”. This is not reassuring at all: whether based on 10 or on 10⁵ normal observations, the corresponding 95% confidence intervals on the mean both have 95% chances to contain the true mean. The long-run aspect refers to the repeated use of those intervals. (I am not even mentioning the classical fallacy of stating that “we are 99.7% confident that the population average is somewhere between -1.73 and -0.27”…)
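The coverage claim above is easy to check by simulation. The sketch below is my own illustration, not from the book; the sample sizes, the number of replications, and the 1.96 normal quantile (with known variance, for simplicity) are all assumptions. It draws repeated normal samples of very different sizes and verifies that the 95% confidence interval for the mean covers the true mean about 95% of the time in both cases; only the width of the interval changes with n.

```python
import numpy as np

rng = np.random.default_rng(0)

def coverage(n, trials=2000, mu=0.0, sigma=1.0):
    """Fraction of 95% confidence intervals (known sigma) that contain mu."""
    samples = rng.normal(mu, sigma, size=(trials, n))
    means = samples.mean(axis=1)
    half = 1.96 * sigma / np.sqrt(n)   # half-width shrinks as 1/sqrt(n)
    return float(np.mean(np.abs(means - mu) <= half))

print(coverage(10), coverage(10_000))  # both close to 0.95
```

The point is exactly the one made above: the coverage guarantee is a long-run property of the interval procedure, not a property of any single realised interval.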

In conclusion, I remember buying an illustrated entry to Marx’ Das Kapital when I started economics in graduate school (as a minor). This gave me a very quick idea of the purpose of the book. However, I read through the whole book to understand (or try to understand) Marx’ analysis of the economy. And the introduction did not help much in this regard. In the present setting, we are dealing with statistics, not economics, not philosophy. Having read a cartoon about the average length of worms within a can of worms is not going to help much in understanding the Central Limit Theorem and the subsequent derivation of confidence intervals. The validation of statistical methods is done through mathematics, which provides a formal language cartoons cannot reproduce.

## five years in Edinburgh

Posted in Kids, Mountains, Travel, University life with tags , , , , , , , on April 12, 2013 by xi'an

Got an email with this tantalizing offer of five-year fellowship positions in mathematics at the University of Edinburgh:

Chancellor's Fellowship (five positions) [tenure-track posts at Lecturer or Reader
level]

Applications are invited for up to five Chancellor's Fellowship posts in
Mathematics. Each Fellowship provides a research-focused reduced-teaching position
for up to 5 years, followed immediately by a standard open-ended (ie "tenured")
position.

Applicants should have research interests in any area of:

Applied and Computational Mathematics
Financial Mathematics
Mathematical Physics
Operational Research
Pure Mathematics
Statistics

One of the positions will be specifically dedicated to algebra (Representation
Theory).

Applicants will have a research record of the highest calibre, exhibiting the
potential to become an international leader. We welcome candidates whose interests
may also reach out to other disciplines.

Appointment will normally be made on the Lecturer scale, £37,382 - £44,607.
Dependent on experience, and in exceptional circumstances, appointment may be to
Senior Lecturer/Reader level for which the salary scale is £47,314 - £53,233.

Interviews will be held during May 2013. Applications containing a detailed CV and
an outline of a proposed research programme should be made online.

## L’Aquila: earthquake, verdict, and statistics

Posted in Statistics, University life with tags , , , , , , , , , on October 25, 2012 by xi'an

Yesterday I read this blog entry by Peter Coles, a Professor of Theoretical Astrophysics at Cardiff and soon in Brighton, about the L’Aquila earthquake verdict, condemning six Italian scientists to severe jail sentences. While most of the blogs around reacted against this verdict as an anti-scientific decision and as a 21st-century remake of Giordano Bruno‘s murder by the Roman Inquisition, Peter Coles argues on the contrary that the scientists were not scientific enough in that instance, and should have used statistics and probabilistic reasoning. While I did not look into the details of the L’Aquila earthquake judgement and thus have no idea whether or not the scientists were guilty of not signalling the potential for disaster, were an earthquake to occur, I cannot but repost one of Coles’ most relevant paragraphs:

I thought I’d take this opportunity to repeat the reasons I think statistics and statistical reasoning are so important. Of course they are important in science. In fact, I think they lie at the very core of the scientific method, although I am still surprised how few practising scientists are comfortable even with statistical language. A more important problem is the popular impression that science is about facts and absolute truths. It isn’t. It’s a process. In order to advance, it has to question itself.

Statistical reasoning also applies outside science to many facets of everyday life, including business, commerce, transport, the media, and politics. It is a feature of everyday life that science and technology are deeply embedded in every aspect of what we do each day. Science has given us greater levels of comfort, better health care, and a plethora of labour-saving devices. It has also given us unprecedented ability to destroy the environment and each other, whether through accident or design. Probability even plays a role in personal relationships, though mostly at a subconscious level.

A bit further down, Peter Coles also bemoans the shortcuts and oversimplification of scientific journalism, which reminded me of the time Jean-Michel Marin had to deal with radio journalists about an “impossible” lottery coincidence:

Years ago I used to listen to radio interviews with scientists on the Today programme on BBC Radio 4. I even did such an interview once. It is a deeply frustrating experience. The scientist usually starts by explaining what the discovery is about in the way a scientist should, with careful statements of what is assumed, how the data is interpreted, and what other possible interpretations might be and the likely sources of error. The interviewer then loses patience and asks for a yes or no answer. The scientist tries to continue, but is badgered. Either the interview ends as a row, or the scientist ends up stating a grossly oversimplified version of the story.

## estimating a constant (not really)

Posted in Books, Statistics, University life with tags , , , , , , , , , , , , , on October 12, 2012 by xi'an

Larry Wasserman wrote a blog entry on the normalizing constant paradox, where he repeats that he does not understand my earlier point… Let me try to recap that point here, along with the various comments I made on StackExchange (while keeping in mind all this is for intellectual fun!)

The entry is somewhat paradoxical in that Larry acknowledges (in that post) that the analysis in his book, All of Statistics, is wrong. The fact that “g(x)/c is a valid density only for one value of c” (and hence cannot lead to a notion of likelihood on c) is the very reason why I stated that there can be no statistical inference nor prior distribution about c: a sample from f does not bring statistical information about c, and there can be no statistical estimate of c based on this sample. (In case you did not notice, I insist upon statistical!)

To me this problem is completely different from a statistical problem, at least in the modern sense: if I need to approximate the constant c (as I do in fact when computing Bayes factors), I can produce an arbitrarily long sample from a certain importance distribution and derive a converging (and sometimes unbiased) approximation of c. Once again, this is Monte Carlo integration, a numerical technique based on the Law of Large Numbers and the stabilisation of frequencies. (Call it a frequentist method if you wish. I completely agree that MCMC methods are inherently frequentist in that sense, and see no problem with this because they are not statistical methods. Of course, this may be the core of the disagreement with Larry and others, namely that they call the Law of Large Numbers statistics, and I do not. This lack of separation between the two notions also shows up in a recent general-public talk on Poincaré’s mistakes by Cédric Villani! All this may just mean I am irremediably Bayesian, seeing anything motivated by frequencies as non-statistical!) But that process does not mean that c can take a range of values that would index a family of densities compatible with a given sample. In this Monte Carlo integration approach, the distribution of the sample is completely under control (modulo the errors induced by pseudo-random generation). This approach is therefore outside the realm of Bayesian analysis “that puts distributions on fixed but unknown constants”, because those unknown constants parameterise the distribution of an observed sample. Ergo, c is not a parameter of the sample, and the sample Larry argues about (“we have data sampled from a distribution”) contains no information whatsoever about c that is not already in the function g. (It is not “data” in this respect, but a stochastic sequence that can be used for approximation purposes.) Which gets me back to my first argument, namely that c is known (and at the same time difficult or impossible to compute)!
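As a concrete illustration of the Monte Carlo integration viewpoint described above (my own sketch, not Larry's nor anyone else's method: the unnormalised density g(x) = exp(-x²/2) and the Cauchy importance distribution are assumptions, chosen so that the true value c = √(2π) is known), one can approximate c by averaging the importance weights g(x)/q(x) over draws from q:

```python
import numpy as np

rng = np.random.default_rng(1)

def estimate_c(n=1_000_000):
    """Importance-sampling approximation of c = integral of g(x) dx."""
    x = rng.standard_cauchy(n)            # draws from the importance density q
    q = 1.0 / (np.pi * (1.0 + x**2))      # standard Cauchy density
    g = np.exp(-0.5 * x**2)               # unnormalised target density
    return float(np.mean(g / q))          # unbiased, converging estimate of c

print(estimate_c())  # close to sqrt(2*pi) ≈ 2.5066
```

The heavy-tailed Cauchy proposal keeps the weights g/q bounded, so the estimator has finite variance. This is precisely the point of the paragraph above: the estimate improves at the Monte Carlo rate 1/√n, with the distribution of the simulated sample completely under control, and c is never a parameter of that sample.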

Let me also answer here the comments “why is this any different from estimating the speed of light c?” and “why can’t you do this with the 100th digit of π?” made on the earlier post and on StackExchange. Estimating the speed of light means for me (who repeatedly flunked Physics exams after leaving high school!) that we have a physical experiment that measures the speed of light (as in the original one by Rœmer at the Observatoire de Paris, which I visited last week) and that the statistical analysis infers about c by using those measurements and the impact of the imprecision of the measuring instruments (as we do when analysing astronomical data). If, now, there exists a physical formula of the kind

$c=\int_\Xi \psi(\xi) \varphi(\xi) \text{d}\xi$

where φ is a probability density, I can imagine stochastic approximations of c based on this formula, but I do not consider it a statistical problem any longer. The case is thus clearer for the 100th digit of π: it is also a fixed number, that I can approximate by a stochastic experiment but on which I cannot attach a statistical tag. (It is 9, by the way.) Throwing darts at random as I did during my Oz tour is not a statistical procedure, but simple Monte Carlo à la Buffon…
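A minimal version of that dart-throwing experiment (my own sketch; the hit-or-miss setup with a quarter circle inside the unit square is an assumption, standing in for whatever board was actually used on the Oz tour): the fraction of uniform darts landing inside the quarter disc converges to π/4 by the Law of Large Numbers, a stochastic approximation of a fixed number rather than a statistical inference about π.

```python
import numpy as np

rng = np.random.default_rng(2)

def darts_pi(n=1_000_000):
    """Hit-or-miss Monte Carlo estimate of pi."""
    x = rng.random(n)                      # dart coordinates, uniform on [0,1]^2
    y = rng.random(n)
    inside = (x**2 + y**2 <= 1.0).mean()   # fraction inside the quarter disc
    return float(4.0 * inside)

print(darts_pi())  # close to 3.1416
```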

Overall, I still do not see this as a paradox for our field (and certainly not as a critique of Bayesian analysis), because there is no reason a statistical technique should be able to address any and every numerical problem. (Once again, Persi Diaconis would almost certainly differ, as he defended a Bayesian perspective on numerical analysis in the early days of MCMC…) There may be a “Bayesian” solution to this particular problem (and that would be nice) and there may be none (and that would be OK too!), but I am not even convinced I would call this solution “Bayesian”! (Again, let us remember this is mostly for intellectual fun!)

## Gaia

Posted in Statistics, University life with tags , , , , , , , , on September 19, 2012 by xi'an

Today, I attended a meeting at the Paris observatory about the upcoming launch of the Gaia satellite and the associated data (mega-)challenges. To borrow from the webpage, the goal is “to create the largest and most precise three dimensional chart of our Galaxy by providing unprecedented positional and radial velocity measurements for about one billion stars in our Galaxy and throughout the Local Group.” The amount of data that will be produced by this satellite is staggering: Gaia will take pictures of roughly one gigapixel that will be processed both on-board and on Earth, transmitting over five years a petabyte of data that needs to be processed fairly efficiently to be at all useful! The European consortium operating this satellite has planned for specific tasks dedicated to data handling and processing, which is a fabulous opportunity for would-be astrostatisticians! (Unsurprisingly, at least half of the tasks are statistics related, either at the noise-reduction stage or at the estimation stage.) Another amazing feature of the project is that it will result in open data, the outcome of the observations being available to everyone for analysis… I am clearly looking forward to the next meeting to understand better the structure of the data and the challenges simulation methods could help to solve!