Archive for testing

truth or truthiness [book review]

Posted in Books, Kids, pictures, Statistics, University life on March 21, 2017 by xi'an

This 2016 book by Howard Wainer has been sitting (!) on my desk for quite a while and it took a long visit to Warwick to find a free spot to quickly read it and write my impressions. The subtitle is, as shown on the picture, “Distinguishing fact from fiction by learning to think like a data scientist”. With all due respect to the book, which illustrates quite pleasantly the dangers of (pseudo-)data mis- or over- (or even under-)interpretation, and to the author, who has repeatedly emphasised those points in his books and opinion columns, including those in CHANCE, I do not think the book teaches how to think like a data scientist, in that an arbitrary neophyte reader would not manage to handle a realistic data-centric situation without deeper training. But this collection of essays, some of which first appeared as opinion columns, makes for a nice read nonetheless.

I presume that in this post-truth and alternative facts [dark] era, the notion of truthiness is familiar to most readers! It is often based on a misunderstanding or a misappropriation of data leading to dubious and unfounded conclusions. The book runs through dozens of examples (some of them quite short and mostly appealing to common sense) to show how this happens and to some extent how it can be countered, if not avoided, since people will always try to bend the data, willingly or not, towards their conclusions.

There are several parts and several themes in Truth or Truthiness, with different degrees of depth and novelty. The more involved part is in my opinion the one about causality, with illustrations in educational testing, psychology, and medical trials. (The illustration about fracking and the resulting impact on Oklahoma earthquakes should not be in the book, except that there exist officials publicly denying the facts. The same remark applies to the testing cheat controversy, which would be laughable had not someone ended up the victim!) The section on graphical representation and data communication is less exciting, presumably because it comes after Tufte’s books and message. I also feel the 1854 cholera map of John Snow is somewhat over-exploited, since he only drew the map after the epidemic declined. The final chapter Don’t Try this at Home is quite anecdotal and at the same time this may be the whole point, namely that in mundane questions thinking like a data scientist is feasible and leads to sometimes surprising conclusions!

“In the past a theory could get by on its beauty; in the modern world, a successful theory has to work for a living.” (p.40)

The book reads quite nicely, as a whole and as a collection of pieces, from which class and talk illustrations can be borrowed. I like the “learned” tone of it, with plenty of citations and witticisms, some in Latin, Yiddish and even French. (Even though the latter is somewhat inaccurate! Si ça avait pu se produire, ça avait dû se produire [if it could have happened, it must have happened, p.152] would have sounded more vernacular in my Gallic opinion!) I thus enjoyed unreservedly Truth or Truthiness, for its rich style and critical message, all the more needed in the current times, and far from comparing it with a bag of potato chips as Andrew Gelman did, I would like to stress its classical tone, in the sense of being immersed in a broad and deep culture that seems to be receding fast.

testing R code [book review]

Posted in R, Statistics, Travel on March 1, 2017 by xi'an

When I saw this title among the CRC Press novelties, I immediately ordered it as I thought it fairly exciting. Now that I have gone through the book, the excitement has died. Maybe faster than need be, as I read it while stuck in a soulless Schiphol airport and missing the only ice-climbing opportunity of the year!

Testing R Code was written by Richard Cotton and is quite short: once you take out the appendices and the answers to the exercises, it is about 130 pages long, with a significant proportion of code and output. And it is about some functions developed by Hadley Wickham from RStudio, for testing the coherence of R code in terms of inputs more than outputs. The packages covered are assertive and testthat, intended for run-time and development-time testing respectively, that is, for checking that the inputs and outputs are what the author of the code intends them to be. The other chapters contain advice and heuristics about writing maintainable, testable code, and about incorporating a testing feature in an R package.
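To fix ideas, here is a minimal sketch (mine, not taken from the book) of the two kinds of checks: run-time assertions with assertive and a development-time unit test with testthat. The function geo_mean() is a made-up example.

library(testthat)
library(assertive)

geo_mean <- function(x) {
  # run-time checks: fail early, with informative messages, on bad inputs
  assert_is_numeric(x)
  assert_all_are_positive(x)
  exp(mean(log(x)))
}

# development-time checks: a unit test collecting expectations about outputs
test_that("geo_mean behaves as intended", {
  expect_equal(geo_mean(c(1, 4)), 2)    # known value
  expect_error(geo_mean(c(-1, 2)))      # negative input should raise an error
})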

While I am definitely a poorly qualified reader for this type of R book, my disappointment stems from my expectation of a book about debugging R code, which is possibly due to a misunderstanding of the term testing. This is an unrealistic expectation, for sure, as testing whether a code produces what it is supposed to requires some advanced knowledge of what the output should be, at least in some representative situations. Which means using an interface like RStudio is essential in spotting unsavoury behaviours of some variables, if not foolproof in any case.

Harold Jeffreys’ default Bayes factor [for psychologists]

Posted in Books, Statistics, University life on January 16, 2015 by xi'an

“One of Jeffreys’ goals was to create default Bayes factors by using prior distributions that obeyed a series of general desiderata.”

The paper Harold Jeffreys’s default Bayes factor hypothesis tests: explanation, extension, and application in Psychology by Alexander Ly, Josine Verhagen, and Eric-Jan Wagenmakers is both a survey and a reinterpretation cum explanation of Harold Jeffreys‘ views on testing. At about the same time, I received a copy from Alexander and a copy from the journal it had been submitted to! This work starts with a short historical entry on Jeffreys’ work and career, which includes four of his principles, quoted verbatim from the paper:

  1. “scientific progress depends primarily on induction”;
  2. “in order to formalize induction one requires a logic of partial belief” [enters the Bayesian paradigm];
  3. “scientific hypotheses can be assigned prior plausibility in accordance with their complexity” [a.k.a., Occam’s razor];
  4. “classical “Fisherian” p-values are inadequate for the purpose of hypothesis testing”.

“The choice of π(σ) is therefore irrelevant for the Bayes factor as long as we use the same weighting function in both models”

A very relevant point made by the authors is that Jeffreys only considered embedded or nested hypotheses, a fact that allows for having common parameters between models and hence some form of reference prior. Even though (a) I dislike the notion of “common” parameters and (b) I do not think it is entirely legit (I was going to write proper!) from a mathematical viewpoint to use the same (improper) prior on both sides, as discussed in our Statistical Science paper. And in our most recent alternative proposal. The most delicate issue however is to derive a reference prior on the parameter of interest, which is fixed under the null and unknown under the alternative, hence preventing the use of improper priors. Jeffreys tried to calibrate the corresponding prior by imposing asymptotic consistency under the alternative and exact indeterminacy under “completely uninformative” data. Unfortunately, the latter is not a well-defined notion. In the normal example, the authors recall and follow the proposal of Jeffreys to use an improper prior π(σ)∝1/σ on the nuisance parameter and invoke the quote above in his defence. I find this argument quite weak because suddenly the prior on σ becomes a weighting function... A notion foreign to the Bayesian cosmology. If we use an improper prior for π(σ), the marginal likelihood on the data is no longer a probability density and I do not buy the argument that one should use the same measure with the same constant both on σ alone [for the nested hypothesis] and on the σ part of (μ,σ) [for the nesting hypothesis]. We are considering two spaces with different dimensions and hence orthogonal measures. This quote thus sounds more like wishful thinking than like a justification. Similarly, the assumption of independence between δ=μ/σ and σ does not make sense for σ-finite measures. Note that the authors later point out that (a) the posterior on σ varies between models despite using the same data [which shows that the parameter σ is far from common to both models!] and (b) the [testing] Cauchy prior on δ is only useful for the testing part and should be replaced with another [estimation] prior when the model has been selected. Which may end up as a backfiring argument about this default choice.
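For concreteness, the default Bayes factor in the normal example can be sketched (my transcription, not a formula lifted from the paper) as

B_{10}(x_1,\ldots,x_n) = \frac{\int_0^\infty \int_{-\infty}^{+\infty} \prod_{i=1}^n \sigma^{-1}\varphi\left(\frac{x_i-\delta\sigma}{\sigma}\right)\, \pi_C(\delta;\gamma)\,\text{d}\delta\,\frac{\text{d}\sigma}{\sigma}}{\int_0^\infty \prod_{i=1}^n \sigma^{-1}\varphi\left(\frac{x_i}{\sigma}\right)\,\frac{\text{d}\sigma}{\sigma}}

with φ the standard normal density and π_C(δ;γ) the Cauchy prior on δ=μ/σ with scale γ, the same improper measure dσ/σ appearing in numerator and denominator, which is precisely the point of contention.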

“Each updated weighting function should be interpreted as a posterior in estimating σ within their own context, the model.”

The re-derivation of Jeffreys’ conclusion that a Cauchy prior should be used on δ=μ/σ makes it clear that this choice only proceeds from an imperative of fat tails in the prior, without solving the calibration of the Cauchy scale. (Given the now-available modern computing tools, it would be nice to see the impact of this scale γ on the numerical value of the Bayes factor.) And maybe it also proceeds from a “hidden agenda” to achieve a Bayes factor that solely depends on the t statistic. Although this does not sound like a compelling reason to me, since the t statistic is not sufficient in this setting.
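As a quick numerical sketch of this impact (mine, assuming the BayesFactor package and its ttestBF() function, which implements a default Bayes factor of this type):

library(BayesFactor)
set.seed(1)
x <- rnorm(30, mean = 0.3)                # simulated sample with a moderate effect
for (gam in c(0.2, 0.5, 1/sqrt(2), 1, 2)) {
  bf <- ttestBF(x, rscale = gam)          # rscale is the Cauchy prior scale on delta
  cat("gamma =", round(gam, 3), " BF10 =", round(extractBF(bf)$bf, 3), "\n")
}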

In a differently interesting way, the authors mention the Savage-Dickey ratio (p.16) as a way to represent the Bayes factor for nested models, without necessarily perceiving the mathematical difficulty with this ratio that we pointed out a few years ago. For instance, in the psychology example processed in the paper, the test is between δ=0 and δ≥0; however, if I set π(δ=0)=0 under the alternative prior, which should not matter [from a measure-theoretic perspective where the density is uniquely defined almost everywhere], the Savage-Dickey representation of the Bayes factor returns zero, instead of 9.18!
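For the record, the Savage-Dickey representation writes the Bayes factor in favour of the null as the ratio of the posterior to the prior density of δ at the null value, both computed under the alternative,

B_{01}(x) = \frac{\pi_1(\delta=0\mid x)}{\pi_1(\delta=0)}

which is why modifying the prior density π_1 on the measure-zero set {δ=0} changes the value of this representation without changing the alternative model itself.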

“In general, the fact that different priors result in different Bayes factors should not come as a surprise.”

The second example detailed in the paper is the test for a zero Gaussian correlation. This is a sort of “ideal case” in that the parameter of interest is between -1 and 1, hence makes the choice of a uniform U(-1,1) easy or easier to argue. Furthermore, the setting is also “ideal” in that the Bayes factor simplifies down into a marginal over the sample correlation only, under the usual Jeffreys priors on means and variances. So we have a second case where the frequentist statistic behind the frequentist test[ing procedure] is also the single (and insufficient) part of the data used in the Bayesian test[ing procedure]. Once again, we are in a setting where Bayesian and frequentist answers are in one-to-one correspondence (at least for a fixed sample size). And where the Bayes factor allows for a closed form through hypergeometric functions. Even in the one-sided case. (This is a result obtained by the authors, not by Jeffreys who, as the proper physicist he was, obtained approximations that are remarkably accurate!)
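Out of curiosity, here is a quick numerical version (mine, not from the paper), assuming the BayesFactor package and its correlationBF() function, which implements a Bayes factor of this flavour against a zero correlation (with a stretched-beta rather than uniform prior by default):

library(BayesFactor)
set.seed(7)
n <- 50
x <- rnorm(n)
y <- 0.3 * x + rnorm(n)       # simulated pairs with a moderate true correlation
cor(x, y)                     # the sample correlation, per the paper the only data summary used
correlationBF(y, x)           # Bayes factor against rho = 0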

“The fact that the Bayes factor is independent of the intention with which the data have been collected is of considerable practical importance.”

The authors have a side argument in this section in favour of the Bayes factor against the p-value, namely that the “Bayes factor does not depend on the sampling plan” (p.29), but I find this fairly weak (or tongue in cheek) as the Bayes factor does depend on the sampling distribution imposed on top of the data. It appears that the argument is mostly used to defend sequential testing.
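To make the quoted claim concrete, here is a toy check (my own, not from the paper): with 9 successes and 3 failures, the Bayes factor for p=1/2 against a uniform prior is identical whether the design was binomial (number of trials fixed) or negative binomial (number of successes fixed), since the two likelihoods only differ by a multiplicative constant.

lik_binom  <- function(p) dbinom(9, size = 12, prob = p)    # n = 12 trials fixed in advance
lik_nbinom <- function(p) dnbinom(3, size = 9, prob = p)    # sampling stopped at the 9th success
BF01_binom  <- lik_binom(0.5)  / integrate(lik_binom, 0, 1)$value
BF01_nbinom <- lik_nbinom(0.5) / integrate(lik_nbinom, 0, 1)$value
c(BF01_binom, BF01_nbinom)    # the two values coincide: the stopping rule drops out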

“The Bayes factor (…) balances the tension between parsimony and goodness of fit, (…) against overfitting the data.”

In fine, I liked very much this re-reading of Jeffreys’ approach to testing, maybe all the more because I now think we should get away from it! I am not certain it will help in convincing psychologists to adopt Bayes factors for assessing their experiments as it may instead frighten them away. And it does not bring an answer to the vexing issue of the relevance of point null hypotheses. But it constitutes a lucid and innovative account of the major advance represented by Jeffreys’ formalisation of Bayesian testing.

More/less incriminating digits from the Iranian election

Posted in Statistics on June 21, 2009 by xi'an

Following my previous post where I commented on Roukema’s use of Benford’s Law on the first digits of the counts, I saw on Andrew Gelman’s blog a pointer to a paper in the Washington Post, where the arguments are based instead on the last digit. Those should be uniform, rather than distributed according to Benford’s Law. There is no doubt about the uniformity of the last digit, but the claim for “extreme unlikeliness” of the frequencies of those digits made in the paper is not so convincing. Indeed, when I uniformly sampled 116 digits in {0,..,9}, my very first attempt produced a highest frequency of 20.5% and a lowest of 5.9%. If I run a small Monte Carlo experiment with the following R program,

fre=0
for (t in 1:10^4){
   # relative frequencies of the ten digits in a uniform sample of 116 digits
   h=table(factor(sample(0:9,116,replace=TRUE),levels=0:9))/116
   # count the samples whose most frequent digit exceeds 16% and least frequent falls below 5%
   fre=fre+(max(h)>.16)*(min(h)<.05)
   }

the percentage of cases when this happens is 15%, so this is not “extremely unlikely” (unless I made a terrible blunder in the above!!!)… Even moving the constraint to

(max(h)>.169)*(min(h)<.041)

does not correspond to a very unlikely event, since the probability is then 0.0525.

The second argument looks at the proportion of last and second-to-last digits that are adjacent, i.e. with a difference of ±1 or ±9. Out of the 116 Iranian results, 62% are made of non-adjacent digits. If I sample two vectors of 116 digits in {0,..,9} and if I consider this occurrence, I do see an unlikely event. Running the Monte Carlo experiment

repa=numeric(10^5)
for (t in 1:10^5){
    # squared difference between two independent uniform digits, for 116 pairs
    dife=(sample(0:9,116,replace=TRUE)-sample(0:9,116,replace=TRUE))^2
    # adjacent digits differ by 1, or by 9 when 0 and 9 are taken as adjacent
    repa[t]=sum(dife==1)+sum(dife==81)
    }
repa=repa/116
shows that the distribution of repa is centered at .20 (as it should be, since for a given second-to-last digit there are two adjacent last digits), not .30 as indicated in the paper, and that the probability of having a frequency of .38 or more of adjacent digits is estimated as zero by this Monte Carlo experiment. (Note that I took 0 and 9 to be adjacent and that removing this occurrence would further lower the probability.)

Random generators

Posted in Statistics on June 9, 2009 by xi'an

A post on Revolutions discussed physical random generators versus standard computer-based generators. It is interesting in that it reveals how “common sense” can get things wrong in this area. First, there is an almost atavistic misgiving that a sequence produced by a function

x_{n+1} = F(x_n) \text{ or } x_{n+1} = F(x_n,y_n), y_{n+1}=G(x_n,y_n)

(the second sequence corresponding to several seeds used in parallel, as is possible with R’s RNGkind) cannot be a “true” random generator when F and G are standard functions. The criticism truly applies to the object, namely the sequence produced, which can be exactly reconstructed if F and G are known, but not to its uses in Monte Carlo simulation. When F (or G) is unknown, the null hypothesis

x_{n+1} | x_n \sim \mathcal{U}(0,1)

cannot be defaulted [rejected] by a statistical test, no matter how long the sequence is, no matter which aspect of uniformity is being tested. (Note that the post mentions randomness on many occasions without making precise that this is about uniformity.) The difficulty with pseudo-random generators is rather that they are used for continuous distributions while being represented by a finite number of digits, i.e. with a certain precision. So a statistical test is bound to fail if you look far enough in the digits without adapting the random generator to a higher precision. The second criticism of standard random generators is related to this point, namely that exploring the values of a finite set with a deterministic function is necessarily periodic. This is correct, but using several seeds in parallel multiplies the size of the finite set to such an extent that it becomes irrelevant for all but purely abstract purposes! The third misconception is that “truly” random is better (“more random”!) than pseudo-random and that, if physical generators are available on a computer, this is much better than R’s runif. This is a wrong perception in that “truly” random generators produce streams of numbers that are indeed unpredictable and “random”, but whose law cannot be exactly asserted. The example provided in the post of the dicing machine is typical: it produces a sequence of millions of dice values a day (R’s sample does the same in 0.012 seconds) but there is no reason the dice are exactly well-balanced and the outcome perfectly uniform over 1,2,3,4,5,6…
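As a small illustration (my own code, not from the Revolutions post), here is how one would run standard uniformity tests on the output of R's default Mersenne-Twister generator:

set.seed(42)
u <- runif(10^6)           # one million values from R's default generator
ks.test(u, "punif")        # Kolmogorov-Smirnov test of the U(0,1) null
chisq.test(table(cut(u, breaks = seq(0, 1, by = 0.1))))   # chi-square test over ten equal bins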

Bayesian p-values (2)

Posted in Statistics on May 8, 2009 by xi'an

“What the use of P implies, therefore, is that a hypothesis that may be true may be rejected because it has not predicted observable results that have not occurred.” H. Jeffreys, Theory of Probability

Looking a bit further into the literature about Bayesian p-values, I read the Statistical Science paper of Bayarri and Castellanos on Bayesian Checking of the Second Levels of Hierarchical Models, which considers extensions of the surprise measures of Bayarri and Berger (2000, JASA, see preprint) in the setting of a normal hierarchical model. (Here is a link to a very preliminary version of the paper.) While I quite appreciate the advances contained in those papers, I am still rather reluctant to put forward those measures of surprise as decision tools and, given the propensity of users to hijack tools towards their own use, I fear they would end up being used exactly as p-values, namely interpreted as probabilities of the null hypothesis.

Here are some comments on the paper that I made during my trip back from Montpellier yesterday evening. The most natural tool for building a Bayesian p-value seems to me to be the probability of a tail event under the predictive

h(y|x) = \int f(y|\theta)\, \pi(\theta|x)\, \text{d}\theta

and using P^h(t(Y)\ge t(x) |x) to define the “surprise” means that

  1. the tail event is evaluated under a distribution that is “most” favourable to x, since it is based on the posterior distribution of \theta given x. [This point somehow relates to the “using the data twice” argument that I do not really understand in this setting: conditional on x, this is a proper Bayesian way of evaluating the probability of the event \{t(Y)\ge t(x)\}. (What one does with this evaluation is another issue!)]

  2. following Andrew Gelman’s discussion, there is no accounting for the fact that “all models are wrong” and that we are working from within a model trying to judge the adequacy of this model, in a Munchausen-like way of pulling oneself up to the Moon. Again, there is a fair danger in using P^h(t(Y)\ge t(x) |x) as the posterior probability of the model being “true”…

  3. I think the approach does not account for the (decisional) uses of the numerical evaluation, hence is lacking calibration: Is a value of 10⁻² small?! Is a value of 10⁻³ very small?! Are those absolute or relative values?! And if the value is used to decide for or against a model, what are the consequences of this decision?

  4. the choice of the summary statistic t(x) is quite relevant for the value of the surprise measure and there is no intrinsic choice. For instance, I first thought using the marginal likelihood m(x) would be a relevant choice, but alas this is not invariant under a change of variables.

  5. another [connected] point that is often neglected in model comparison and model evaluation is that sufficient statistics are only sufficient within a given model, but not for comparing nor evaluating this model. For instance, when comparing a Poisson model with a negative binomial model, the sum of the observations is sufficient in both cases but not in the comparison!
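As an illustration of the above tail probability (my own sketch, in a toy conjugate normal model rather than the hierarchical setting of the paper), the surprise measure can be approximated by simulation, here with the arbitrary choice t(x)=max(x):

set.seed(123)
x <- rnorm(20, mean = 1)                    # observed sample, N(mu, 1) model
n <- length(x); tau2 <- 10^2                # prior mu ~ N(0, tau2)
post_var  <- 1/(n + 1/tau2)                 # posterior variance of mu
post_mean <- post_var * n * mean(x)         # posterior mean of mu
t_obs <- max(x)                             # observed value of the summary t(x)
M <- 10^4
mu_sim <- rnorm(M, post_mean, sqrt(post_var))            # theta ~ pi(theta | x)
t_rep  <- sapply(mu_sim, function(m) max(rnorm(n, m)))   # t(Y), with Y ~ f(y | theta)
mean(t_rep >= t_obs)                        # Monte Carlo estimate of P^h(t(Y) >= t(x) | x)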

Bayesian p-values

Posted in Books, Statistics on May 3, 2009 by xi'an

“The posterior probability of the null hypothesis is a point probability and thus not appropriate to use.”

This morning, during my “lazy Sunday” breakfast, due to the extended May first weekend, I was already done with the newspapers of the week and I thus moved to the pile of unread Statistics journals. One intriguing paper was Bayesian p-values by Lin et al. in the latest issue of JRSS Series C. The very concept of a Bayesian p-value is challenging and the very idea would make many (most?) Bayesians cringe! In the most standard perspective, a p-value is a frequentist concept that is incredibly easy to misuse as “the probability that the null hypothesis is true” but that is also intrinsically misguided. There are a lot of papers and books dealing with this issue, including Berger and Wolpert’s (1988) Likelihood Principle (available for free!), and I do not want to repeat here the arguments about the lack of validity of this pervasive quantification of statistical tests. In the case of this paper, the Bayesian p-values were simply defined as tail posterior probabilities and thus were not p-values per se… The goal of the paper was not to define a “new” methodology but to evaluate the impact of the Dirichlet prior parameters on those posterior probabilities. I still do not understand the above quote since, while there is some degree of controversy about using Bayes factors with improper priors, point null hypotheses may still be properly defined…

There is however an interesting larger issue related to this question, namely Bayesian model evaluation: in a standard Bayesian modelling, both sides of a hypothesis are part of the modelling and the Bayes factor evaluates the likelihood of one side versus the other, bypassing the prior probabilities of each side (and hence missing the decision-theoretic goal of maximising a utility function). There is nonetheless a debate about whether or not this should always be the case and about the relevance of deciding about the “truth” of a hypothesis by simply looking at how far in the tails the observation is, as shown for instance by the recent paper of Templeton, whose main criticism was on this point. Following this track may lead to different kinds of p-values, as for instance in Verdinelli and Wasserman’s paper (1998) and in our unpublished tech report with Judith Rousseau… Templeton argues that seeing a value that is too far in the tails (whether in a frequentist or a Bayesian sense) is enough to conclude on the falsity of a theory (calling, guess who?!, on Popper himself!). But, even in a scientific perspective, the rejection of a hypothesis must be followed by some kind of action, whose impact should be included within the test itself.

P.S. As a coincidence, the paper Allocation of Resources by Metcalf et al. in the same issue of Series C also contains some reflections about Bayesian p-values, using a predictive posterior probability of divergence as the central quantity

\mathbb{P}\left( D_\theta(Y^\text{rep},y^\text{obs}) > 0 | y^\text{obs} \right)

where obs and rep denote the observed and replicated data and where D is the discrepancy function, equal to zero when the replicated data equals the observed data. The result they obtain on their dataset is a perfect symmetry with a Bayesian p-value of 0.5. They conclude (rightly) that “despite being potentially useful as an informal diagnostic, there is little formal (decision theoretic) justification for the use of Bayesian p-values” (p.169).