Archive for testing

More/less incriminating digits from the Iranian election

Posted in Statistics with tags , , , , , , on June 21, 2009 by xi'an

Following my previous post where I commented on Roukema’s use of Benford’s Law on the first digits of the counts, I saw on Andrew Gelman’s blog a pointer to a paper in the Washington Post, where the arguments are based instead on the last digit. Those should be uniform, rather than distributed from Benford’s Law, There is no doubt about the uniformity of the last digit, but the claim for “extreme unlikeliness” of the frequencies of those digits made in the paper is not so convincing. Indeed, when I uniformly sampled 116 digits in {0,..,9}, my very first attempt produced the highest frequency to be 20.5% and the lowest to be 5.9%. If I run a small Monte Carlo experiment with the following R program,

for (t in 1:10^4){

the percentage of cases when this happens is 15%, so this is not “extremely unlikely” (unless I made a terrible blunder in the above!!!)… Even moving the constraint to


does not produce a very unlikely probability, since it is then 0.0525.

The second argument looks at the proportion of last and second-to-last digits that are adjacent, i.e. with a difference of ±1 or ±9. Out of the 116 Iranian results, 62% are made of non-adjacent digits. If I sample two vectors of 116 digits in {0,..,9} and if I consider this occurrence, I do see an unlikely event. Running the Monte Carlo experiment

for (t in 1:10^5){

shows that the distribution of repa is centered at .20—as it should, since for a given second-to-last digit, there are two adjacent last digits—, not .30 as indicated in the paper, and that the probability of having a frequency of .38 or more of adjacent digit is estimated as zero by this Monte Carlo experiment. (Note that I took 0 and 9 to be adjacent and that removing this occurrence would further lower the probability.)

Random generators

Posted in Statistics with tags , , , on June 9, 2009 by xi'an

A post on Revolutions discussed physical random generators against standard computer-based generators. It is interesting in that it reveals how “common sense” can get things wrong in this area. First, there is an almost atavistic misgiving that a sequence produced by a function

x_{n+1} = F(x_n) \text{ or } x_{n+1} = F(x_n,y_n), y_{n+1}=G(x_n,y_n)

—the second sequence corresponding to several seeds used in parallel, as possible with R RNGkind—cannot be a “true” random generator when F and G are standard functions. The criticism truly applies to the object, namely the sequence produced, that can be exactly reconstructed if F and G are known, but not to its uses in Monte Carlo simulation. When F (or G) is unknown, the null hypothesis

x_{n+1} | x_n \sim \mathcal{U}(0,1)

cannot be defaulted [rejected] by a statistical test, no matter how long the sequence is, no matter which aspect of uniformity is being tested. (Note that the post mentions randomness in many occurrences without precising this is about uniformity.) The difficulty with pseudo-random generators is rather that they are used for continuous distributions while being represented by a finite number of digits, i.e. with a certain precision. So a statistical test is bound to fail if you look far enough in the digits without adapting the random generator for an higher precision. The second criticism about standard random generators is related with this point, namely that exploring the sequence of values in a finite set with a deterministic function is necessarily periodic. This is correct, but using several seeds in parallel multiplies the size of the finite set to such an extent that it becomes irrelevant for all but purely abstract purposes! The third misconception is that “truly” random is better (“more random“!) than pseudo-random and that, if physical generators are available on a computer, this is much better than R runif. This is a wrong perception in that “truly” random generators are producing flows of numbers that are indeed unpredictable and “random”, but that the law of the random phenomenon cannot be exactly asserted. The example provided in the post of the dicing machine is typical: it produces a sequence of millions of dice values a day—R sample does the same in 0.012 seconds—but there is no reason the dice are exactly well-balanced and the outcome is perfectly uniform over 1,2,3,4,5,6…

Bayesian p-values (2)

Posted in Statistics with tags , , on May 8, 2009 by xi'an

“What the use of P implies, therefore, is that a hypothesis that may be true may be rejected because it has not predicted observable results that have not occurred.” H. Jeffreys, Theory of Probability

Looking a bit further into the literature about Bayesian p-values, I read the Statistical Science paper of Bayarri and Castellanos on Bayesian Checking of the Second Levels of Hierarchical Models, which considers extensions of the surprise measures of Bayarri and Berger (2000, JASA, see preprint) in the setting of a normal hierarchical model. (Here is a link to a very preliminary version of the paper.) While I quite appreciate the advances contained in those papers, I am still rather reticent to put forward those measures of surprise as decision tools and, given the propensity of users to hijack tools towards their own use, I fear they would end up being used exactly as p-values, namely interpreted as probabilities of the null hypothesis.

Here are some comments on the paper that I made during my trip back from Montpellier yesterday evening. The most natural tool in building a Bayesian p-value seems to me to use the probability of a tail event under the predictive

\mathbf{h(y|x) = \int f(y|\theta) \pi(\theta|x) \text{d} \theta}

and using P^h(t(Y)\ge t(x) |x) to define the “surprise” means that

  1. the tail event is evaluated under a distribution that is “most” favourable to x, since it is based on the posterior distribution of \theta given x. [This point somehow relates to the “using the data twice” argument that I do not really understand in this setting: conditional on x, this is a proper Bayesian way of evaluating the probability of the event \{t(Y)\ge t(x)\}. (What one does with this evaluation is another issue!)]

  2. following Andrew Gelman’s discussion, there is no accounting for the fact that “all models are wrong” and that we are working from within a model trying to judge of the adequacy of this model, in a Munchausen’s way of pull oneself up to the Moon. Again, there is a fair danger in using P^h(t(Y)\ge t(x) |x) as the posterior probability of the model being “true”…

  3. I think the approach does not account for the (decisional) uses of the numerical evaluation, hence is lacking calibration: Is a value of 10¯² small?! Is a value of 10¯³ very small?! Are those absolute or relative values?! And if the value is used to decide for or against a model, what are the consequences of this decision?

  4. the choice of the summary statistic t(x) is quite relevant for the value of the surprise measure and there is no intrinsic choice. For instance, I first thought using the marginal likelihood m(x) would be a relevant choice, but alas this is not invariant under a change of variables.

  5. another [connected] point that is often neglected in model comparison and model evaluation is that sufficient statistics are only sufficient within a given model, but not for comparing nor evaluating this model. For instance, when comparing a Poisson model with a negative binomial model, the sum of the observation is sufficient in both cases but not in the comparison!

Bayesian p-values

Posted in Books, Statistics with tags , , on May 3, 2009 by xi'an

“The posterior probability of the null hypothesis is a point probability and thus not appropriate to use.”

This morning, during my “lazy Sunday” breakfast, due to the extended May first weekend, I was already done with the newspapers of the week and I thus moved to the pile of unread Statistics journals. One intriguing paper was Bayesian p-values by Lin et al. in the latest issue of JRSS Series C. The very concept of a Bayesian p-value is challenging and the very idea would make many (most?) Bayesian cringe! In the most standard perspective, a p-value is a frequentist concept that is incredibly easy to misuse as “the probability that the null hypothesis is true” but also that is intrinsically misguided. There are a lot of papers and books dealing with this issue, including Berger and Wolpert’s (1988) Likelihood Principle (available for free!), and I do not want to repeat here the arguments about the lack of validity of this pervasive quantification of statistical tests. In the case of , the Bayesian p-values were simply defined as tail posterior probabilities and thus were not p-values per se… The goal of the paper was not to define a “new” methodology but to evaluate the impact of the Dirichlet prior parameters on those posterior probabilities. I still do not understand the above quote since, while there is some degree of controversy about using Bayes factors with improper priors, point null hypotheses may still be properly defined…

There is however an interesting larger issue related to this issue. Namely, Bayesian model evaluation: in a standard Bayesian modelling, both sides of an hypothesis are part of the modelling and the Bayes factor evaluates the likelihood of one side versus the other, bypassing the prior probabilities of each side (and hence missing the decision theoretic goal of maximising an utility function). There is however a debate whether or not this should always be the case and about the relevance of deciding about the “truth” of an hypothesis by simply looking at how far in the tails the observation is, as shown for instance by the recent paper of Templeton whose main criticism was on this point. Following this track may lead to different kinds of p-values, as for instance in Verdinelli and Wasserman’s paper (1998) and in our unpublished tech report with Judith Rousseau… Templeton argues that seeing a value that is too far in the tails (whether in a frequentist or a Bayesian sense) is enough to conclude about the falsity of a theory (calling, guess who?!, on Popper himself!). But, even in a scientific perspective, the rejection of an hypothesis must be followed by some kind of action, whose impact should be included within the test itself.

Ps-As a coincidence, the paper Allocation of Resources by Metcalf et al. in the same issue of Series C also contains some reflections about Bayesian p-values, using a predictive posterior probability of divergence as the central quantity

\mathbb{P}\left( D_\theta(Y^\text{rep},y^\text{obs}) > 0 | y^\text{obs} \right)

where obs and rep denote the observed and replicated data and where D is the discrepancy function, equal to zero when the replicated data equals the observed data. The result they obtain on their dataset is a perfect symmetry with a Bayesian p-value of 0.5. They conclude (rightly) that “despite being potentially useful as an informal diagnostic, there is little formal (decision theoretic) justification for the use of Bayesian p-values” (p.169).

Nested clade versus Bayes?

Posted in Statistics with tags , , , , on March 1, 2009 by xi'an

During a (great) visit to London over the weekend, I was shown a fairly curious paper written by Alan Templeton that basically rejects the use of ABC for hypothesis testing, in favour of Templeton’s own approach, called “nested clade phylogeographical analysis” (NCPA). Since I read the paper in the Eurostar this afternoon, I could not get further information about this approach but most of the criticisms against ABC contained in the paper can be understood on their own.

First, despite its title, this paper is mostly oriented against Bayesian hypothesis testing, confusing the simulation method (ABC) with the underlying Bayesian inferential method of using Bayes factors. Sentences like “the ABC method posits two or more altermative hypotheses and tests their relative fits to some observed statistics”, “no novelty or discovery is allowed in ABC” and “in ABC there is no null hypothesis” illustrate this confusion. From this perspective, the criticisms in the paper are those usually found against the standard Bayesian (i.e. Jeffreys’) approach to testing of hypotheses, like the impossibility of testing the null hypothesis per se, the relativity of Bayes factors, the dependence on prior scenarios, the ignorance of the sampling distribution (or more exactly the lack of integrating over the distribution of the sample), the bias towards over-parameterised models, the impossibility of determining the false-positive rate, all items that have been addressed by the literature in many discussions and will not be repeated here. (There is also the intervention of Karl Popper’s falsifiability that always makes me wary of the statistics papers or books calling for it.) The paper concludes by stating that “the ‘posterior probabilities’ that emerge from ABC are not co-measurable”, which means that they are not to be measured on the same scale but this is bypassing the very nature of Bayesian model choice which compares the posterior probabilities of models by turning the marginal likelihoods into probabilities (see again Jeffreys).

Some criticisms are however correctly directed at ABC as an approximation method, but I also find difficulties with most of them. First, Templeton considers ABC (but in fact the Bayes Factor) to be a goodness-of-fit criterion because it depends on a distance|| s'-s||between a simulated statistics and the observed statistic. Again, this is confusing Bayesian inference (which relies on a marginal likelihood) and simulation technology (which approximates the events'=sbased on this distance). In connection, a second criticism is that the missing “dimensionality of the models” invalidates the inference based on the ABC method, again missing the point that ratios of marginal likelihoods are directly comparable and that they are not chi-squared goodness-of-fit statistics. (Templeton introduces the notion of co-measurability to make the criticism sound more rigorous but this is a concept I have never heard used in Statistics and anyway it does not apply here.) A third attack is more puzzling in that it mixes both simulation and inference and observables and parameters: Fig. 3 in the paper plots three “posterior distributions” (densities) corresponding to three models under comparison but uses a sufficient statistic s to index the first axis. The argument then goes as follows: since ABC only considers statistics s’ such that|| s'-s||is small, it is missing the big picture (and is not Bayesian inference either)! This does not make sense, especially when considering that ABC is not longer A(pproximative) when this distance is equal to zero. It repeatedly confuses the simulation of the auxiliary sufficient statistics (in the space of the observables) and the Bayesian inference (that is on principle unrelated with the simulation method!). The fourth argument against ABC is that there is no convergence result in Beaumont et al. (2002), especially about the choice of\deltaand the paper calls to Billingsley (1986) himself for support. This is [again] rather off-the-point since the convergence of the method is a Monte Carlo type of convergence that has nothing to do with “the impact of the sample size”. When\deltagoes to zero, the method always converges. If one wants to consider things a bit deeper, for a given Monte Carlo sample size, Beaumont et al.’s (2002) uses a non-parametric conditional expectation which also converges as the Monte Carlo sample size goes to infinity. Convergence is thus not addressed in the original papers because it is rather obvious.

I thus found this reading quite entertaining, both because of the repeated occurrence of severe criticisms of Bayesian hypothesis testing and because it happened to occur in a specific field, phylogeny, as it occurred before in other fields like astronomy. The additional appeal of the paper was the confusion between inference and simulation-based approximations, also found in earlier criticisms of MCMC for instance. (Note that there is a strong debate in this community about NCPA itself, as in Petit (2007) and Beaumont and Panchal (2008), but that this is not the purpose of this post!) This means to me that (a) we need to do a better job of explaining the fundamentals of Bayesian hypothesis testing and (b) users of Statistics rarely have the luxury of getting to the original concepts but most often derive them from previous applications in their own field, with the inherent dangers…