## Archive for statistical tests

## How many subjects? [not a book review]

Posted in Books, pictures, Statistics with tags Brett Kavanaugh, Christine Blasey, power, statistical significance, statistical tests, tests, textbook on September 24, 2018 by xi'an

## MDL multiple hypothesis testing

Posted in Books, pictures, Statistics, Travel, University life with tags Australia, Bayesian tests of hypotheses, EM algorithm, minimal description length principle, mixtures of distributions, Monash University, Robert Menzies, seminar, statistical tests, Victoria on September 1, 2016 by xi'an

“This formulation reveals an interesting connection between multiple hypothesis testing and mixture modelling with the class labels corresponding to the accepted hypotheses in each test.”

**A**fter my seminar at Monash University last Friday, David Dowe pointed out to me the recent work by Enes Makalic and Daniel Schmidt on minimum description length (MDL) methods for multiple testing as somewhat related to our testing by mixture paper. Work which appeared in the proceedings of the *4th Workshop on Information Theoretic Methods in Science and Engineering (WITMSE-11)*, which took place in Helsinki, Finland, in 2011. Minimal encoding length approaches lead to choosing the model that enjoys the smallest coding length. Connected with, e.g., Rissanen’s approach. The extension in this paper consists in considering K hypotheses at once on a collection of m datasets (the *multiple* then bears on the datasets rather than on the hypotheses). And in associating a hypothesis index with each dataset. When the objective function is the sum of (generalised) penalised likelihoods [as in BIC], it leads to selecting the “minimal length” model for each dataset. But the authors introduce weights or probabilities for each of the K hypotheses, which indeed then amounts to a mixture-like representation on the exponentiated codelengths. This estimation by optimal coding was first proposed by Chris Wallace in his book. This approach eliminates the model parameters at an earlier stage, e.g. by maximum likelihood estimation, to return a quantity that only depends on the model index and the data. *In fine*, the purpose of the method differs from ours in that the former aims at identifying an appropriate hypothesis for each group of observations, rather than ranking those hypotheses for the entire dataset by considering the posterior distribution of the weights in the latter. The mixture has somehow more of a substance in the first case, where separating the datasets into groups is part of the inference.
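To make the role of the weights concrete, here is a minimal Python sketch. It is not the authors' algorithm: the codelength table, the weights and the function names are all hypothetical, and the codelengths are assumed to be already free of model parameters. Without weights, each dataset simply takes its shortest code; with weights, coding the label k costs an extra -log(w_k), and the per-dataset codelength under the mixture is the negative log of the weighted sum of exponentiated codelengths.

```python
import math

def assign_hypotheses(codelengths, weights=None):
    """Return, for each dataset, the index of the hypothesis with the shortest code.

    codelengths[i][k]: hypothetical codelength (in nats) of dataset i under
    hypothesis k, model parameters already eliminated.
    weights[k]: optional probability of hypothesis k; stating the label k
    then costs an additional -log(weights[k]) nats.
    """
    n_hyp = len(codelengths[0])
    if weights is None:
        label_cost = [0.0] * n_hyp
    else:
        label_cost = [-math.log(w) for w in weights]
    labels = []
    for row in codelengths:
        totals = [row[k] + label_cost[k] for k in range(n_hyp)]
        labels.append(totals.index(min(totals)))
    return labels

def mixture_codelength(row, weights):
    """Codelength of one dataset under the mixture sum_k w_k * exp(-L_k)."""
    return -math.log(sum(w * math.exp(-l) for w, l in zip(weights, row)))

# Illustrative numbers only: 3 datasets, 2 hypotheses.
codelengths = [[10.0, 10.5], [12.0, 9.0], [8.0, 8.2]]
weights = [0.2, 0.8]
print(assign_hypotheses(codelengths))           # per-dataset minimal length
print(assign_hypotheses(codelengths, weights))  # label cost included
print(mixture_codelength(codelengths[0], weights))
```

With these made-up numbers the weights are enough to move the first and third datasets to the second hypothesis, which is the sense in which the grouping of the datasets becomes part of the inference.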

## contemporary issues in hypothesis testing

Posted in pictures, Statistics, Travel, University life with tags Andrew Gelman, Bayes factors, Bayesian foundations, Bayesian statistics, Coventry, CRiSM, England, Fall, hypothesis testing, Jim Berger, Joris Mulder, statistical tests, University of Warwick, workshop on May 3, 2016 by xi'an

**N**ext Fall, on 15-16 September, I will take part in a CRiSM workshop on hypothesis testing. In our department in Warwick. The registration is now open [until Sept 2] with a moderate registration fee of £40 and a call for posters. Jim Berger and Joris Mulder will both deliver a plenary talk there, while Andrew Gelman will alas give a remote talk from New York. (A terrific poster by the way!)

## statistical significance as explained by The Economist

Posted in Books, Statistics, University life with tags False positive, p-values, refereeing, statistical significance, statistical tests, testing of hypotheses, The Economist on November 7, 2013 by xi'an

**T**here is a long article in The Economist of this week (also making the front cover), which discusses how and why many published research papers have unreproducible and most often “wrong” results. Nothing immensely new there, esp. if you read Andrew’s blog on a regular basis, but the (anonymous) writer(s) take(s) pains to explain how this relates to statistics and in particular statistical testing of hypotheses. The above is an illustration from this introduction to statistical tests (and their interpretation).

“First, the statistics, which if perhaps off-putting are quite crucial.”

It is not the first time I spot a statistics-backed article in this journal and so assume it has either journalists with a statistics background or links with (UK?) statisticians. The description of why statistical tests can err is fairly (Type I – Type II) classical. Incidentally, it reports a finding of Ioannidis that when reporting a positive at level 0.05, the expectation of a false positive rate of one out of 20 is “highly optimistic”. An evaluation opposed to, e.g., Berger and Sellke (1987) who reported a too-early rejection in a large number of cases. More interestingly, the paper stresses that this classical approach ignores “the unlikeliness of the hypothesis being tested”, which I interpret as the prior probability of the hypothesis under test.
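The effect of that prior probability on the share of false positives among significant results follows from Bayes' rule, as in the rough Python sketch below. The prior probabilities, the power of 0.8 and the level 0.05 are illustrative assumptions, not figures taken from the article.

```python
def false_discovery_share(prior_true, power, alpha):
    """Share of significant results that are false positives.

    prior_true: assumed prior probability that the tested effect is real.
    power: probability of a significant result when the effect is real.
    alpha: probability of a significant result when there is no effect.
    """
    true_pos = prior_true * power
    false_pos = (1.0 - prior_true) * alpha
    return false_pos / (true_pos + false_pos)

# Illustrative values: the rarer real effects are a priori, the larger
# the share of false positives among the "discoveries".
for prior_true in (0.5, 0.1, 0.01):
    share = false_discovery_share(prior_true, power=0.8, alpha=0.05)
    print(f"P(effect is real) = {prior_true:.2f}: {share:.1%} of positives are false")
```

Under these assumptions the share only stays near the nominal one in twenty when about half of the tested effects are real, and it grows quickly as real effects become rarer.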

“Statisticians have ways to deal with such problems. But most scientists are not statisticians.”

The paper also reports on the lack of power in most studies, a report that I find a bit bizarre and even meaningless in its ability to compute an overall power, all across studies and researchers and even fields. Even in a single study, the alternative to “no effect” is composite, hence has a power that depends on the unknown value of the parameter. Seeking a single value for the power requires some prior distribution on the alternative.
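A small Python sketch of this point, assuming a two-sided z-test of a zero mean with unit variance and a Gaussian prior on the effect size (both are illustrative choices, as are the function names): the power is a function of the unknown effect, and a single number only appears once that function is averaged against a prior.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def power_two_sided_z(delta, n, alpha=0.05):
    """Power of a two-sided z-test of H0: mean = 0 (unit variance, n observations)
    when the true mean is delta."""
    z_crit = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    shift = delta * n ** 0.5
    return STD_NORMAL.cdf(-z_crit - shift) + 1.0 - STD_NORMAL.cdf(z_crit - shift)

def average_power(prior, n, alpha=0.05, steps=2000, width=8.0):
    """Average the power over a NormalDist prior on delta (midpoint rule)."""
    lo = prior.mean - width * prior.stdev
    step = 2.0 * width * prior.stdev / steps
    total = 0.0
    for i in range(steps):
        delta = lo + (i + 0.5) * step
        total += power_two_sided_z(delta, n, alpha) * prior.pdf(delta) * step
    return total

# The power changes with the (unknown) effect size ...
for delta in (0.1, 0.3, 0.5):
    print(delta, round(power_two_sided_z(delta, n=25), 3))
# ... and one summary value requires a prior on delta (assumed N(0.3, 0.2^2) here).
print("prior-averaged power:", round(average_power(NormalDist(0.3, 0.2), n=25), 3))
```

Changing the assumed prior changes the averaged value, which is why an overall power across studies and fields is hard to interpret.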

“Peer review’s multiple failings would matter less if science’s self-correction mechanism—replication—was in working order.”

The next part of the paper covers the failings of peer review, which I discussed in the ISBA Bulletin, but it seems to me too easy to blame the referee for failing to spot statistical or experimental errors, when lacking access to the data or to the full experimental methodology and when under pressure to return (for free) a report within a short time window. The best that can be expected is that a referee detects the implausibility of a claim or an obvious methodological or statistical mistake. These are not math papers! And, as pointed out repeatedly, not all referees are statistically numerate….

“Budding scientists must be taught technical skills, including statistics.”

The last part discusses possible solutions to achieve reproducibility and hence higher confidence in experimental results. Paying for independent replication is the proposed solution but it can obviously only apply to a small margin of all published results. And having control bodies testing at random labs and teams following a major publication seems rather unrealistic, if only for filling the teams of such bodies with able controllers… An interesting if pessimistic debate, *in fine*. And fit for the International Year of Statistics.

## accurate ABC: comments by Oliver Ratman [guest post]

Posted in R, Statistics, University life with tags ABC, ABC in Rome, Approximate Bayesian computation, distribution-free tests, Gatsby, London, non-parametric test, Roma, statistical tests on May 31, 2013 by xi'an

*Here are comments by Olli following my post:*

I think we found a general means to obtain accurate ABC in the sense of matching the posterior mean or MAP exactly, and then minimising the KL distance between the true posterior and its ABC approximation subject to this condition. The construction works on an auxiliary probability space, much like indirect inference. Now, we construct this probability space empirically; this is where our approach first differs from indirect inference and this is where we need the “summary values” (>1 data points on a summary level; see Figure 1 for clarification). Without replication, we cannot model the distribution of summary values but doing so is essential to construct this space. Now, let’s focus on the auxiliary space. We can fiddle with the tolerances (on a population level) and m so that on this space, the ABC approximation has the aforesaid properties. All the heavy technical work is in this part. Intuitively, as m increases, the power increases for sufficiently regular tests (see Figure 2) and consequently, for calibrated tolerances, the ABC approximation on the auxiliary space gets tighter. This offsets the broadening effect of the tolerances, so having non-identical lower and upper tolerances is fine and does not hurt the approximation. Now, we need to transport the close-to-exact ABC approximation on the auxiliary space back to the original space. We need some assumptions here, and given our time series example, it seems these are not unreasonable. We can reconstruct the link between the auxiliary space and the original parameter space as we accept/reject. This helps us understand (with the videos!) the behaviour of the transformation and to judge if its properties satisfy the assumptions of Theorems 2-4. While we offer some tools to understand the behaviour of the link function, yes, we think more work could be done here to improve on our first attempt at accurate ABC.

Now some more specific comments:

“The paper also insists over and over on sufficiency, which I fear is a lost cause.” To clarify, all we say is that on the simple auxiliary space, sufficient summaries are easily found. For example, if the summary values are normally distributed, the sample mean and the sample variance are sufficient statistics. Of course, this is not the original parameter space and we only transform the sufficiency problem into a change of variable problem. This is why we think that inspecting and understanding the link function is important.

“Another worry is that the … test(s) rel(y) on an elaborate calibration”. We provide some code here for everyone to try out. In our examples, this did not slow down ABC considerably. We generally suppose that the distribution of the summary values is simple, like Gaussian, Exponential, Gamma, ChiSquare, Lognormal. In these cases, the ABC approximation takes on an easy-enough-to-calibrate-fast functional form on the auxiliary space.

“This Theorem 3 sounds fantastic but makes me uneasy: unbiasedness is a sparse property that is rarely found in statistical problems. … Witness the use of “essentially unbiased” in Fig. 4.” What Theorem 3 says is that if unbiasedness can be achieved on the simple auxiliary space, then there are regularity conditions under which these properties can be transported back to the original parameter space. We hope to illustrate these conditions with our examples, and to show that they hold in quite general cases such as the time series application. The thing in Figure 4 is that the sample autocorrelation is not an unbiased estimator of the population autocorrelation. So unbiasedness does not quite hold on the auxiliary space and the conditions of Theorem 3 are not satisfied. Nevertheless, we found this bias to be rather negligible in our example and the bigger concern was the effect of the link function.

*And here are Olli’s slides:*

## accurate ABC?

Posted in Statistics, Travel, University life with tags ABC, Approximate Bayesian computation, distribution-free tests, Gatsby, London, non-parametric test, statistical tests on May 29, 2013 by xi'an

**A**s posted in the previous entry, Olli Ratman, Anton Camacho, Adam Meijer, and Gé Donker arXived their paper on accurate ABC. A paper which *[not whose!]* avatars I was privy to in the past six months! While I acknowledge the cleverness of the reformulation of the core ABC accept/reject step as a statistical test, and while we discussed the core ideas with Olli and Anton when I visited Gatsby, the paper still eludes me in some respects… Here is why. *(Obviously, you should read this rich & challenging paper first for the comments to make any sense! And even then they may make little sense…)*

**T**he central idea of this accurate ABC *[aABC? A²BC?]* is that, if the distribution of the summary statistics is known and if replicas of those summary statistics are available for the true data (and less problematically for the generated data), then a classical statistical test can be turned into a natural distance measure for each statistic and even “natural” bounds can be found on that distance, to the point of recovering most properties of the original posterior distribution… A first worry is this notion that the statistical distribution of a collection of summary statistics is available in closed form: this sounds unrealistic even though it may not constitute a major contention issue. Indeed, replacing a tailored test with a distribution-free test of identical location parameter could not hurt that much. [Just the power. If that matters… *See below.*] The paper also insists over and over on *sufficiency*, which I fear is a lost cause. In my current understanding of ABC, the loss of some amount of information contained in the data should be acknowledged and given a write-off as a Big Data casualty. (See, e.g., Lemma 1.)
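For readers less familiar with the accept/reject step being reformulated, here is a toy Python sketch. It is plain rejection ABC on replicated Gaussian summary values, where a proposed parameter is kept when the difference in sample means falls within a tolerance; it is not the calibrated procedure of the paper, and the model, prior, tolerance and function names are all assumptions made for illustration.

```python
import random
from statistics import fmean

def abc_rejection(observed, prior_draw, simulate, m, tolerance, n_draws, rng):
    """Keep the prior draws whose simulated summary values have a sample mean
    within `tolerance` of the observed sample mean."""
    obs_mean = fmean(observed)
    accepted = []
    for _ in range(n_draws):
        theta = prior_draw(rng)
        sim_mean = fmean(simulate(theta, m, rng))
        # Test-like decision: accept when the location difference is small.
        if abs(sim_mean - obs_mean) <= tolerance:
            accepted.append(theta)
    return accepted

rng = random.Random(1)
# Assumed setting: 50 observed summary values from N(1, 1), uniform prior on [-5, 5].
observed = [rng.gauss(1.0, 1.0) for _ in range(50)]
sample = abc_rejection(
    observed,
    prior_draw=lambda r: r.uniform(-5.0, 5.0),
    simulate=lambda theta, m, r: [r.gauss(theta, 1.0) for _ in range(m)],
    m=50,
    tolerance=0.1,
    n_draws=20000,
    rng=rng,
)
print(len(sample), fmean(sample), fmean(observed))
```

The paper's contribution lies in how the tolerance and m are calibrated from the power of the underlying test, which this sketch leaves as free inputs.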

**A**nother worry is that the rephrasing of the acceptance distance as the maximal difference for a particular test relies on an elaborate calibration, incl. α, c^{+}, τ^{+}, &tc. (I am not particularly convinced by the calibration in terms of the power of the test being maximised at the point null value. Power?! *See below, once again.*) When cumulating tests and aiming at a nominal α level, the orthogonality of the test statistics in Theorem 1(iii) is puzzling and I think unrealistic.

**T**he notion of *accuracy* that is central to the paper and its title corresponds to the power of every test being maximal at the true value of the parameter. And somehow to the ABC approximation being maximised at the true parameter, even though I am lost by then [i.e. around eqn (18)] about the meaning of ρ^{*}… The major result in the paper is however that, under the collection of assumptions made therein, the ABC MLE and MAP versions are equal to their exact counterparts. And that these versions are also unbiased. This Theorem 3 sounds fantastic but makes me uneasy: unbiasedness is a sparse property that is rarely found in statistical problems. Change the parameterisation and you lose unbiasedness. And even the possibility to find an unbiased estimator. Since this difficulty does not appear in the paper, I would conclude that either the assumptions are quite constraining or the result holds in a weaker sense… (Witness the use of “essentially unbiased” in Fig. 4.)
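The remark on reparameterisation can be checked numerically. The Python sketch below uses an illustrative Gaussian example that is not taken from the paper: the sample mean is unbiased for μ, yet its exponential overshoots exp(μ), since the expectation of exp(x̄) is exp(μ + σ²/(2n)) when x̄ is N(μ, σ²/n).

```python
import math
import random
from statistics import fmean

def expected_exp_of_mean(mu, sigma, n):
    """Exact E[exp(xbar)] when xbar ~ N(mu, sigma^2 / n) (mean of a lognormal)."""
    return math.exp(mu + sigma ** 2 / (2.0 * n))

# Assumed toy setting: samples of size 5 from N(0, 1), 20000 replications.
mu, sigma, n = 0.0, 1.0, 5
rng = random.Random(0)
xbars = [fmean([rng.gauss(mu, sigma) for _ in range(n)]) for _ in range(20000)]
mc_mean = fmean(xbars)                              # estimates mu
mc_exp_mean = fmean([math.exp(x) for x in xbars])   # estimates E[exp(xbar)]

print("Monte Carlo mean of xbar      :", mc_mean)
print("Monte Carlo mean of exp(xbar) :", mc_exp_mean)
print("exact E[exp(xbar)]            :", expected_exp_of_mean(mu, sigma, n))
print("target exp(mu)                :", math.exp(mu))
```

The gap between the last two lines is the bias introduced by the change of parameterisation alone.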

**T**his may be a wee rude comment (even for a Frenchman) but I also felt the paper could be better written in that notations pop in unannounced. For instance, on page 2, x [the data] becomes x^{1:n} becomes s_{k}^{1:n}. This seems to imply that the summary statistics are observed repeatedly over the true sample. Unless n=1, this does not seem realistic. (I do not understand everything in Example 1, in particular the complaint that the ABC solutions were *biased* for finite values of n. That sounds like an odd criticism of Bayesian estimators. Now, it seems the paper is very intent on achieving unbiasedness! So maybe it should be called the aAnsBC algorithm for “not-so-Bayes”!) I am also puzzled by the distinction between summary values and summary statistics. This sounds like insisting on having a large enough iid dataset. Or, on page 5, the discussion that the summary parameters are replaced by estimates seems out of context because this adds an additional layer of notation to the existing summary “stuff”… With the additional difficulty that Lemma 1 assumes reparameterisation of the model in terms of those summary parameters. I also object to the point null hypotheses being written in terms of a point estimate, i.e. of a quantity depending on the data x: it sounds like confusing the test [procedure] with the test [problem]. Another example: I read several times Lemma 5 about the calibration of the number of ABC simulations m but cannot fathom what this m is calibrated against. It seems only a certain value of m achieves the accurate correspondence with the genuine posterior, which sounds counter-intuitive. Last counter-example: some pictures seemed to be missing in the Appendix, but as it happened, it is only my tablet being unable to process them! S2 is actually a movie about link functions, really cool!

**I**n conclusion, this is indeed a rich and challenging paper. I am certain I will get a better understanding by listening to Olli’s talk in Roma. And discussing with him.

## Randomness through computation

Posted in Books, Statistics, University life with tags A Search for Certainty, Alan Sokal, Alan Turing, Andrei Kolmogorov, Apple II, Benford's Law, computation, Hector Zenil, Kurt Gödel, Martin-Löf, Ockham's razor, pseudo-random generator, randomness, statistical tests, Wolfram Research on June 22, 2011 by xi'an

**A** few months ago, I received a puzzling advertisement for this book, **Randomness through Computation**, and I eventually ordered it, despite getting a rather negative impression from reading the chapter written by Tommaso Toffoli… The book as a whole is definitely perplexing (even when correcting for this initial bias) and I would not recommend it to readers interested in simulation, in computational statistics or even in the philosophy of randomness. My overall feeling is indeed that, while there are genuinely informative and innovative chapters in this book, some chapters read more like newspeak than scientific material (mixing the Second Law of Thermodynamics, Gödel’s incompleteness theorem, quantum physics, and NP completeness within the same sentence) and do not provide a useful entry on the issue of randomness. Hence, the book is not contributing in a significant manner to my understanding of the notion.

*(This post also appeared on the Statistics Forum.)*