Last Thursday night, after a friendly dinner closing the ICMS workshop, I was rushing back to Pollock Halls to catch some sleep before a very early flight. When crossing North Bridge, above Waverley station, I spotted in the crowd the well-known face of a fellow statistician from Cambridge University, on an academic visit to the University of Edinburgh that was completely unrelated to the workshop. Then, today, on my way back from submitting a visa request at the Indian embassy in Paris, I took the RER train for one stop between Gare du Nord and Chatelet. When I stood up from my seat and looked behind me, a senior (and most famous) mathematician was sitting right there, in deep conversation with a colleague about algorithms… Just two of “those” coincidences. (Edinburgh may be propitious for coincidences: at the last ICMS workshop I attended, I ended up in the same Indian restaurant as Marc Suchard, who was also on an academic visit to the University of Edinburgh that was completely unrelated to the workshop!)
Archive for India
Over the last Sunday breakfast I went through Naked Statistics: Stripping the Dread from the Data. The first two pages managed to put me in a prejudiced mood for the rest of the book. To wit: the author starts with some math bashing (like, no one ever bothers to tell us about the uses of high school calculus!), either because he really feels like this or because it pays with the intended audience (like, we are on the same side, pal!); he then shows how he outsmarted his high school math teacher by spotting that the exam could not possibly have been designed for his class, and then another math teacher by just… re-inventing the steps leading to Zeno’s paradox (said Zeno of Elea not appearing in the credits of the book, to be sure); and he sums it up with an NRA argument: “statistics is like a high-caliber weapon: helpful when used correctly” (p.xiv). Add to that a highly ethnocentric perspective that makes the book hardly readable for anyone outside the US, due to its absolute focus on all things American (exaggerating just a wee bit: who are LeBron James, Kim Kardashian, and Dan Rather?! what is Netflix?! why’s this Donald Rumsfeld guy quoted throughout the book?! how do they play baseball?! What do NBA, NHL, and SAT stand for?! &tc.). This is best illustrated by the facts that it took Charles Wheelan three months to realise that a (golf) laser measuring instrument he had received could be in another unit than feet, namely meters!, and that he considers paying 100 rupees for a chai (मसाला चाय) in India a cheap price when this amount roughly corresponds to the average daily salary there… Top the whole thing with the fact that the author has already written a Naked Economics and seemingly found gold. (I am desperate for the incoming Naked Paleopathology tome in the series!) And there you get me stuck with such a highly negative a priori about Naked Statistics that I could not shake it off for the rest of the book.
“This book will not make you a statistical expert (…) This book is not a textbook.” (p.xv)
With this warning in mind about my bias, let’s get on with what’s in this book. The above tells us what isn’t. To quote further from the author, the book “has been designed to introduce the statistical concepts with the most relevance to everyday life” (p.xv). Naked Statistics goes over the basic notions of statistics (mean, standard deviation, correlation, linear regression, testing, design, polling), gives a sprinkle of probability background (counting models and the central limit theorem, which Wheelan considers as part of statistics), and spends the remaining chapters warning the reader(s) about the possible misuses of models and statistical tools if implemented in the wrong situations or with the wrong type of data. (There are a few graphs, but they are not particularly inspiring.) All this is done with the minimum amount of maths formulae, mostly hidden in footnotes and appendices. (But then why add an extra formula for σ when one is given just before for σ²?!) Sometimes, the minimum is not enough, as demonstrated by the “formula for calculating the correlation coefficient” (p.61), which takes a whole page of text to get around this absurdity of not using maths symbols like Σ and concludes with the lame “I’ll wave my hands and let the computer do the work” (p.61)! Somewhat surprisingly, given the low-key nature of the book, it includes a final appendix on statistical software. From Excel, to SAS, Stata, and …R! While I am pleased at this inclusion, it sounds very much orthogonal to the purpose and the intended audience of Naked Statistics. I cannot fathom anyone reading the book and then immediately embarking upon writing R code without stopping by a statistics textbook or formal training. (Incidentally, the author reproduces the usual confusion between free and open source, p.259.)
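For reference, the usual textbook expression of the sample correlation coefficient fits on one line once Σ is allowed. The notation below (paired observations x_i and y_i, sample size n, sample means x̄ and ȳ) is the standard one and is an assumption on my part, not taken from the book:

```latex
r \;=\; \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
             {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}},
\qquad
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,\quad
\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i .
```

Likewise, no separate formula is needed for σ once σ² is given, since σ is simply the square root of σ².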
On the morning I returned from Varanasi and the ISBA meeting there, I had to give my R final exam (along with three of my colleagues in Paris-Dauphine). This year, the R course was completely in English, exam included, which means I can post it here as it may attract more interest than the French examens of past years…
I just completed grading my 32 copies, all from exam A, which takes a while as I have to check (and sometimes recover) the R code, and often to correct the obvious mistakes to see whether the deeper understanding of the concepts is there. This year’s student cohort is surprisingly homogeneous: I did not spot any of the horrors I may have mentioned in previous posts.
I must alas acknowledge a grievous typo in the version of Exam B that was used on the day of the final: cutting-and-pasting from A to B, I forgot to change the parameters in Exercise 2, asking the students to simulate a Gamma(0,1), which is not a proper distribution since both Gamma parameters must be positive. It is only after half an hour that a bright student pointed out the impossibility… We had tested the exams prior to printing them but this somehow escaped the four of us!
Now, as I was entering my grades into the global spreadsheet, I noticed a perfect… lack of correlation between those and the grades at the midterm exam. I wonder what that means: I could be grading at random, the levels in November and in January could be uncorrelated, some students could have cheated in November and others in January, students’ names or file names got mixed up, …? A rather surprising outcome!
On day #2, besides my talk on “empirical Bayes” (ABCel) computation (mostly recycled from Varanasi, photos included), Christophe Andrieu gave a talk on exact approximations, using unbiased estimators of the likelihood and characterising estimators guaranteeing geometric convergence (bounded weights, essentially, which is a condition popping up again and again in the Monte Carlo literature). Then Art Owen (father of empirical likelihood among other things!) spoke about QMC for MCMC, a topic that has always intrigued me.
Indeed, while I see the point of using QMC for specific integration problems, I am more uncertain about its relevance for statistics as a simulation device. Having points uniformly distributed over the unit hypercube in a much more efficient way than a random sample does not help much when only a tiny region of the unit hypercube, namely the one where the likelihood concentrates, matters. (In other words, we are rarely interested in the uniform distribution over the unit hypercube: we instead want to simulate from a highly irregular and definitely concentrated distribution.) I have the same reservation about the applicability of stratified sampling: the strata have to be constructed in relation to the target distribution. The method Art advocates uses a CUD (completely uniformly distributed) sequence as the underlying (deterministic) pseudo-uniform sequence. Highly interesting and I want to read the paper in greater detail, but the fact that most simulation steps use a random number of uniforms seems detrimental to the performance of the method in general.
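A minimal sketch of the reservation above, under assumptions of my own: a two-dimensional Halton sequence stands in for the quasi-random scheme (it is not the CUD construction from the talk), and a hypothetical small box stands in for the region where the likelihood concentrates. However evenly the points cover the unit square, the share landing in the box stays close to the box volume, for quasi-random and pseudo-random points alike.

```python
import random


def van_der_corput(i, base):
    """Return the i-th term (i >= 1) of the van der Corput sequence in `base`."""
    x, denom = 0.0, 1.0
    while i > 0:
        i, digit = divmod(i, base)
        denom *= base
        x += digit / denom
    return x


def halton(n, bases=(2, 3)):
    """First n points of the Halton sequence, one coprime base per coordinate."""
    return [tuple(van_der_corput(i, b) for b in bases) for i in range(1, n + 1)]


def fraction_in_box(points, low, high):
    """Share of points whose coordinates all fall in [low, high)."""
    inside = sum(all(low <= c < high for c in p) for p in points)
    return inside / len(points)


n = 10_000
rng = random.Random(0)  # fixed seed so the pseudo-random sample is reproducible
mc_points = [(rng.random(), rng.random()) for _ in range(n)]
qmc_points = halton(n)

# Hypothetical "region that matters": a box of volume 0.05 ** 2 = 0.0025.
low, high = 0.50, 0.55
mc_frac = fraction_in_box(mc_points, low, high)
qmc_frac = fraction_in_box(qmc_points, low, high)

print(f"box volume           : {(high - low) ** 2:.4f}")
print(f"pseudo-random in box : {mc_frac:.4f}")
print(f"Halton in box        : {qmc_frac:.4f}")
```

In both cases only a few dozen of the 10,000 points are of any use for the concentrated target, which is why the uniform points would need to be transformed in relation to that target before the better coverage pays off.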
After a lunch break at a terrific BBQ place, with a stop at Lake Alice to watch the alligator(s) I had missed during my morning run, I was able this time to attend till the end Xiao-Li Meng’s talk, where he presented new improvements on bridge sampling based on location-scale (or warping) transforms of the original two samples to make them share mean and variance. Hani Doss concluded the meeting with a talk on the computation of Bayes factors when using (non-parametric) Dirichlet mixture priors, whose resolution does not require simulations for each value of the scale parameter of the Dirichlet prior, thanks to a Radon-Nikodym derivative representation. (Which nicely connected with Art’s talk, in which he mentioned that most simulation methods are actually based on Riemann integration rather than Lebesgue integration. Hani’s representation is not, with nested sampling being another example.)
We ended the day with a(nother) barbecue outside, under the stars, in the peace and quiet of a local wood, with wine and laughs, just like George would have concluded the workshop. This was a fitting ending to a meeting dedicated to his memory…
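As a reminder of the identity behind bridge sampling, written in the standard notation of the bridge sampling literature rather than that of the talk (so the symbols are assumptions on my part): q_1 and q_2 are the unnormalised densities, c_1 and c_2 their normalising constants, p_i = q_i / c_i, and α is any bridge function for which the integrals below exist and are non-zero.

```latex
\frac{c_1}{c_2}
\;=\;
\frac{\mathbb{E}_{p_2}\!\left[\, q_1(\omega)\,\alpha(\omega) \,\right]}
     {\mathbb{E}_{p_1}\!\left[\, q_2(\omega)\,\alpha(\omega) \,\right]},
\qquad
\text{since}\quad
\mathbb{E}_{p_2}[q_1\alpha] = \frac{1}{c_2}\int q_1 q_2\,\alpha \,\mathrm{d}\omega
\quad\text{and}\quad
\mathbb{E}_{p_1}[q_2\alpha] = \frac{1}{c_1}\int q_1 q_2\,\alpha \,\mathrm{d}\omega .
```

Both expectations involve the product q_1 q_2, so the estimator is only as good as the overlap between the two densities. A location-scale transform that makes the two samples share mean and variance increases that overlap.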
As the cold wave in Varanasi caught me by surprise, I asked the conference organisers for a place to buy a down jacket and they kindly drove me to a nice store called Woodland within the city. I purchased a cheap down-like jacket there (as demonstrated by the newspaper excerpt!) that solved my problem. And I thus discovered a brand that looked surprisingly similar to Timberland, slowly coming to realise this was the whole point: change Timber into Wood, slightly modify the tree in the logo, and you get a local brand that recycles Timberland designs and products to its own profit… (This seems to be a common occurrence in India, judging from this New York Times article.) Anyway, it is rather entertaining to visit the Woodland website, as they mimic major outdoor brand websites like Patagonia or Petzl, but do not offer any material one could seriously consider taking hiking and even less climbing! (Besides the jacket, which managed to keep me warm for the rest of the meeting, I also bought a cheap pair of sneakers, which quickly proved to be a mistake, as the fit is only approximate and the material of poor quality.)