Archive for Statistics

[super]power to X-edit

Posted in Kids, Statistics, University life on February 22, 2012 by xi'an

Having reached 2000 reputation credits on Cross Validated, I now have the privilege to edit others’ posts:

We believe in the power of community editing. That means once you’ve generated enough reputation, we trust you to edit anything in the system without it going through the peer review system. Not just your posts—anyone’s posts!

Which I already did in the past under superusers’ supervision. Hopefully, this promotion will not induce me to spend even more time on the forum… or increase my stress at reading, all too often, questions written by people obviously too lazy to open a stat manual. (I know, I know, no one forces me to read them!)

epidemiology in Le Monde

Posted in Books, Statistics, University life on February 19, 2012 by xi'an

Quite an interesting weekend Le Monde issue: a quarter (2 pages!) of the science folder is devoted to epidemiology… In the statistical sense. (The subtitle is actually Strengths and limitations of Statistics.) The paper does not delve into technical statistical issues but points out the logical divergence between a case-by-case study and an epidemiological study. The impression that the higher the conditioning (i.e. the more covariates), the better the explanation is a statistical fallacy that some of the opponents interviewed in the paper do not grasp. (Which reminded me of Keynes seemingly going the same way.) The short paragraph written on causality and Hill’s criteria is vague enough to concur with the overall remark that causality can never be proved or disproved… The four examples illustrating the strengths and limitations are tobacco vs. lung cancer, a clear case except for R.A. Fisher!, mobile phones vs. brain tumors, a not yet conclusive setting, hepatitis B vaccine vs. sclerosis, lacking data (the pre-2006 records were destroyed for legal reasons), and leukemia vs. nuclear plants, with a significant [?!] correlation between the number of cases and the distance to a nuclear plant. (The paper was inspired by a report recently published by the French Académie de Médecine on epidemiology in France.) The science folder also includes a review of a recent Science paper by Wilhite and Fong on the coercive strategies used by some journals/editors to increase their impact factor, e.g., “you cite Leukemia [once in 42 references]. Consequently, we kindly ask you to add references of articles published in Leukemia to your present article”.

Principles of Applied Statistics

Posted in Books, Statistics, University life on February 13, 2012 by xi'an

This book by David Cox and Christl Donnelly, Principles of Applied Statistics, offers extensive coverage of all the necessary steps and precautions one must go through when contemplating applied (i.e. real!) statistics. As the authors write in the very first sentence of the book, “applied statistics is more than data analysis” (p.i); the title could have been “Principled Data Analysis”! Indeed, Principles of Applied Statistics reminded me of how much we (at least I) take “the model” and “the data” for granted when doing statistical analyses, by going through all the pre-data and post-data steps that lead to the “idealized” (p.188) data analysis. The contents of the book are intentionally simple, with hardly any mathematical aspect, but with a clinical attention to exhaustiveness and clarity. For instance, even though I would have enjoyed more stress on probabilistic models as the basis for statistical inference, they only appear in the fourth chapter (out of ten) with errors-in-variables models. The painstakingly careful coverage of the myriad of tiny but essential steps involved in a statistical analysis and the highlighting of the numerous corresponding pitfalls was certainly illuminating to me. Just as the book refrains from mathematical digressions (“our emphasis is on the subject-matter, not on the statistical techniques as such”, p.12), it stops short of engaging in detailed and complex data stories. Instead, it uses little grey boxes to convey the pertinent aspects of a given data analysis, referring to a paper for the full story. (I acknowledge this may be frustrating at times, as one would like to read more…) The book reads very nicely and smoothly, and I must acknowledge I read most of it in trains, métros, and planes over the past week. (This remark is not intended as a criticism of a lack of depth or interest, by all means [and medians]!)

“A general principle, sounding superficial but difficult to implement, is that analyses should be as simple as possible, but not simpler.” (p.9)

To get into more detail, Principles of Applied Statistics covers (most of!) the purposes of statistical analyses (Chap. 1), design with some special emphasis (Chap. 2-3), which is not surprising given the record of the authors (and “not a moribund art form”, p.51), measurement (Chap. 4), including the special case of latent variables and their role in model formulation, preliminary analysis (Chap. 5) by which the authors mean data screening and graphical pre-analysis, [at last!] models (Chap. 6-7), separated into model formulation [debating the nature of probability] and model choice, the latter being somewhat removed from the standard meaning of the term (done in §8.4.5 and §8.4.6), formal [mathematical] inference (Chap. 8), covering in particular testing and multiple testing, interpretation (Chap. 9), i.e. post-processing, and a final epilogue (Chap. 10). The readership of the book is rather broad, from practitioners to students, although both categories do require a good dose of maturity, to teachers, to scientists designing experiments with a statistical mind. It may be deemed too philosophical by some, too allusive by others, but I think it constitutes a magnificent testimony to the depth and to the spectrum of our field.

“Of course, all choices are to some extent provisional.” (p.130)

As a personal aside, I appreciated the illustration through capture-recapture models (p.36) with a remark on the impact of toe-clipping on frogs, as it reminded me of a similar way of marking lizards when my (then) student Jérôme Dupuis was working on a corresponding capture-recapture dataset in the 90s. Conversely, while John Snow’s story [of using maps to explain the cause of cholera] is alluring, and his map makes for a great cover, I am less convinced it is particularly relevant within this book.

“The word Bayesian, however, became more widely used, sometimes representing a regression to the older usage of flat prior distributions supposedly representing initial ignorance, sometimes meaning models in which the parameters of interest are regarded as random variables and occasionally meaning little more than that the laws of probability are somewhere invoked.” (p.144)

My main quibble with the book is, most unsurprisingly!, with the treatment of Bayesian analysis found in Principles of Applied Statistics (pp.143-144). Indeed, on the one hand, the method is mostly criticised over those two pages. On the other hand, it is the only method presented with this level of detail, including historical background, which seems a bit superfluous for a treatise on applied statistics. The drawbacks mentioned are (p.144)

  • the weight of prior information or modelling as “evidence”;
  • the impact of “indifference or ignorance or reference priors”;
  • whether or not empirical Bayes modelling has been used to construct the prior;
  • whether or not the Bayesian approach is anything more than a “computationally convenient way of obtaining confidence intervals”.

The empirical Bayes perspective is the original one found in Robbins (1956) and seems to find grace in the authors’ eyes (“the most satisfactory formulation”, p.156). Contrary to MCMC methods, “a black box in that typically it is unclear which features of the data are driving the conclusions” (p.149)…

“If an issue can be addressed nonparametrically then it will often be better to tackle it parametrically; however, if it cannot be resolved nonparametrically then it is usually dangerous to resolve it parametrically.” (p.96)

Apart from a more philosophical paragraph on the distinction between machine learning and statistical analysis in the final chapter, with the drawback of using neural nets and such as black-box methods (p.185), there is relatively little coverage of non-parametric models, the choice of “parametric formulations” (p.96) being openly made. I can somehow understand this perspective for simpler settings, namely that nonparametric models offer little explanation of the production of the data. However, in more complex models, nonparametric components often are a convenient way to evacuate burdensome nuisance parameters… Again, technical aspects are not the focus of Principles of Applied Statistics so this also explains why it does not dwell intently on nonparametric models.

“A test of meaningfulness of a possible model for a data-generating process is whether it can be used directly to simulate data.” (p.104)

The above remark is quite interesting, especially when accounting for David Cox’s current appreciation of ABC techniques. The impossibility of generating from a posited model, as with some found in econometrics, precludes using ABC, but this does not necessarily mean the model should be excluded as unrealistic…
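To make the link between "being able to simulate from the model" and ABC concrete, here is a minimal sketch of an ABC rejection sampler. Everything specific in it is an assumption chosen for illustration and not taken from the book: the posited model is Normal(θ, 1), the prior on θ is uniform on (-5, 5), the summary statistic is the sample mean, and the tolerance `eps` is arbitrary. The only thing the sampler ever asks of the model is the `simulate` function, which is exactly what is missing when one cannot generate from the model.

```python
import random

def simulate(theta, n, rng):
    """Generate n observations from the posited model, here Normal(theta, 1) (an assumption)."""
    return [rng.gauss(theta, 1.0) for _ in range(n)]

def mean(xs):
    """Sample mean, used as the (assumed) summary statistic."""
    return sum(xs) / len(xs)

def abc_rejection(observed, n_draws, eps, rng):
    """Keep the prior draws whose simulated summary falls within eps of the observed one."""
    target = mean(observed)
    accepted = []
    for _ in range(n_draws):
        theta = rng.uniform(-5.0, 5.0)                # draw from the (assumed) uniform prior
        pseudo = simulate(theta, len(observed), rng)  # the model is only used as a simulator
        if abs(mean(pseudo) - target) < eps:
            accepted.append(theta)
    return accepted

rng = random.Random(0)
observed = simulate(2.0, 30, rng)  # pretend these are the data, generated with theta = 2
sample = abc_rejection(observed, n_draws=5000, eps=0.1, rng=rng)
```

The accepted values in `sample` form an approximate posterior sample for θ; shrinking `eps` sharpens the approximation at the cost of a lower acceptance rate.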

“The overriding general principle is that there should be a seamless flow between statistical and subject-matter considerations.” (p.188)

As mentioned earlier, the last chapter brings a philosophical conclusion on what (applied) statistics is. It stresses the need for a careful and principled use of black-box methods so that they preserve a general framework and lead to explicit interpretations.

teachin’ (math?) stat…

Posted in Statistics, Travel, University life on January 24, 2012 by xi'an

Arthur Charpentier (from the awesome Freakonometrics) pointed out to me these two blog posts about teaching statistics. One by Meg Dillon about the joy of teaching statistics in France, of all places!, and entitled Statistics à la Mode. And another one by Douglas Andrews commenting on the first one, entitled the Big Mistake: teaching stat as though it was math… (It appeared on an ASA community blog/forum I did not know about.)

“…there is almost invariably a peculiar pair of caveats presented as from on high: Never accept the alternative hypothesis, and never say the probability is 0.95 that the mean lies in a 95% confidence interval for the mean.” Meg Dillon, After Math

Both blogs managed to bemuse me (this is not going to be a very coherent post, I am afraid!): the first one because it has this condescending tone of pure mathematicians about statistics or at least statistics courses (i.e. “anyone can teach statistics!” mixed with “I hate teaching statistics!”) that I meet too often, esp. this side of the pond. Plus it seemed to miss the fundamental distinction between probability and statistics (check the above quote). And it did not say why the contents of the French course were much nicer than those of the equivalent designed by Meg Dillon at her university (except for the fact that she could use measure theory from the start). Maybe the French idiosyncrasy the author basks in is the fact that statistics is not recognised as a field in French universities (there is no stat department for instance) but is instead a subfield of mathematics…
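The second caveat in the quote is a statement about the procedure rather than about a single realised interval: over repeated samples, about 95% of the intervals constructed this way cover the fixed mean. A minimal simulation sketch of this frequentist coverage follows, under assumptions picked purely for illustration (normal data, known standard deviation `sigma`, the usual 1.96 normal quantile, and arbitrary values for the true mean and sample size).

```python
import random

def coverage(true_mean, sigma, n, n_repeats, rng):
    """Fraction of repeated samples whose 95% interval contains the fixed true mean."""
    half_width = 1.96 * sigma / n ** 0.5  # known-sigma normal interval (an assumption)
    hits = 0
    for _ in range(n_repeats):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        xbar = sum(sample) / n
        # The interval is random, the mean is not: count how often the interval covers it.
        if xbar - half_width <= true_mean <= xbar + half_width:
            hits += 1
    return hits / n_repeats

rng = random.Random(1)
rate = coverage(true_mean=3.0, sigma=2.0, n=25, n_repeats=4000, rng=rng)
```

The value of `rate` should sit close to 0.95, which is the sense in which the interval is a "95%" one; it says nothing about the probability that the mean lies in any particular interval already computed.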

“…stat is a different intellectual discipline. She longs for a so-called stat course based on sigma-algebras and probability spaces. Well, that has been tried many times over many years, and it fails miserably at helping students understand the important stat concepts.” Douglas Andrews, ASA Blog Viewer

The second post makes sense in stressing that stat is not math. (Or rather, as it should have been stated, it is not only math.) And that (non-statistician) mathematicians should get some preliminary training or exposure to real data when teaching statistics courses. I can certainly remember a few of my (French) stat teachers who had never approached data in their whole life! However, the comment that “foundation of stat is in empirical science and in learning from observed data, not in math” seems to go overboard. It echoes in negative the complaint from the math teacher that intro statistics courses were “a hodgepodge of recipes” with no mathematical backbone. My feeling there is that, while we certainly do not need measure theory for the earliest statistics courses (Riemann integration is good enough for my second and third year students), we have to anchor statistical techniques into a mathematical bed to avoid them looking like a bag of tricks. I remember being puzzled, after my first (mathematical) statistics course, by the lack of direction and/or the multiplicity, when compared with a standard math course. I was missing the decision-theoretic part that was to come later! Had I been exposed to a non-mathematical intro stat course, I do not think I would have persevered in this field! (And I would have moved to differential geometry instead…)

WSC 2[0]11

Posted in pictures, Statistics, Travel, University life on December 12, 2011 by xi'an

I have now registered for the WSC 2011 conference and I am looking forward to the first day of talks tomorrow. Especially since, judging from the abstracts of the talks, it sounds as if many participants have a different understanding of the word simulation than I have. (I had the same impression this summer when taking part in a half-day of talks in Lancaster.) I am however slightly worried about whether I have prepared my (advanced) tutorial for the right crowd, being unable to judge the background of the audience. Some of the talks are highly technical, others seem much more elementary… (I spent the whole night and morning, except for a fairly long and great run in the hills at sunrise, collating and adapting my slides from my graduate course and from different talks. The outcome is on slideshare.)