## Shravan’s comments on “Valen in Le Monde” [guest post]

Posted in Books, Statistics, University life on November 22, 2013 by xi'an

[Those are comments sent yesterday by Shravan Vasishth in connection with my post. Since they are rather lengthy, I made them into a post. Shravan is also the author of The Foundations of Statistics and we got in touch through my review of the book. I may address some of his points later, but, for now, I find the perspective of a psycholinguist quite interesting to hear.]

Christian, is the problem for you that the p-value, however low, is only going to tell you the probability of your data (roughly speaking) assuming the null is true? It is not going to tell you anything about the probability of the alternative hypothesis, which is the real hypothesis of interest.

However, limiting the discussion to (Bayesian) hierarchical models (linear mixed models), which are the type of model people often fit in repeated measures studies in psychology (or at least in psycholinguistics), as long as the problem is about figuring out P(θ>0) or P(θ<0), the decision (to act as if θ>0) is going to be the same regardless of whether one uses p-values or a fully Bayesian approach. This is because the likelihood is going to dominate in the Bayesian model.

Andrew has objected to this line of reasoning by saying that a decision like “θ>0” is not a reasonable one to make in the first place. That is true in some cases, where the result of one experiment never replicates because of study effects or whatever. But there are a lot of effects which are robust and replicable, and where it makes sense to ask these types of questions.

One central issue for me is: in situations like these, using a low p-value to make such a decision is going to yield pretty similar outcomes compared to doing inference using the posterior distribution. The machinery needed to do a fully Bayesian analysis is very intimidating; you need to know a lot, and you need to do a lot more coding and checking than when you fit an lmer type of model.

It took me 1.5 to 2 years of hard work (=evenings spent not reading novels) to get to the point that I knew roughly what I was doing when fitting Bayesian models. I don’t blame anyone for not wanting to put their life on hold to get to such a point. I find the Bayesian method attractive because it actually answers the question I really asked, namely, is θ>0 or θ<0? This is really great, I don’t have to beat around the bush any more! (there; I just used an exclamation mark). But for the researcher unwilling (or more likely: unable) to invest the time into the maths and probability theory and the world of BUGS, the distance between a heuristic like a low p-value and the more sensible Bayesian approach is not that large.
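[A footnote of mine: the agreement Shravan describes is easiest to see in the simplest possible setting, a normal mean with known variance under a flat prior, where the posterior probability P(θ>0|x) is exactly one minus the one-sided p-value. Here is a minimal R sketch; the numbers are simulated purely for illustration, not taken from any of his experiments.]

```r
## Normal data with known sigma: under a flat prior on theta, the posterior is
## N(xbar, sigma^2/n), so P(theta > 0 | data) = 1 - (one-sided p-value).
## The data below are simulated purely for illustration.
set.seed(101)
n     <- 40
sigma <- 1
theta <- 0.3                       # "true" effect used to simulate the data
x     <- rnorm(n, mean = theta, sd = sigma)
xbar  <- mean(x)

z           <- xbar / (sigma / sqrt(n))
p_one_sided <- 1 - pnorm(z)        # frequentist p-value for H0: theta <= 0

## posterior probability P(theta > 0 | x) under the flat prior
post_prob <- 1 - pnorm(0, mean = xbar, sd = sigma / sqrt(n))

c(p_value = p_one_sided, posterior_prob = post_prob)   # post_prob = 1 - p_value
```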

## uniformly most powerful Bayesian tests???

Posted in Books, Statistics, University life on September 30, 2013 by xi'an

“The difficulty in constructing a Bayesian hypothesis test arises from the requirement to specify an alternative hypothesis.”

Valen Johnson published (and arXived) a paper in the Annals of Statistics on uniformly most powerful Bayesian tests. This is in line with Valen's earlier writings on the topic and is good-quality mathematical statistics, but I cannot really buy the arguments contained in the paper as being compatible with (my view of) Bayesian tests. A “uniformly most powerful Bayesian test” (acronymed as UMPBT) is defined as

“UMPBTs provide a new form of default, nonsubjective Bayesian tests in which the alternative hypothesis is determined so as to maximize the probability that a Bayes factor exceeds a specified threshold”

which means selecting the prior under the alternative so that the frequentist probability of the Bayes factor exceeding the threshold is maximal for all values of the parameter. This does not sound very Bayesian to me, indeed: it averages over all possible values of the observation x and compares probabilities across all values of the parameter θ rather than integrating against a prior or a posterior; it selects the prior under the alternative with the sole purpose of favouring the alternative, meaning its further use once the null is rejected is not considered at all; and it caters to non-Bayesian theories, i.e. it tries to sell Bayesian tools as supplementing p-values and argues the method is objective because the solution satisfies a frequentist coverage property. (At best, this maximisation of the rejection probability reminds me of minimaxity, except there is no clear and generic notion of minimaxity in hypothesis testing.)
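To fix ideas on the above definition, here is my own toy rendering in R of what I take to be the simplest instance, a normal mean with known variance tested against a point alternative (an illustration of the definition, not code from the paper): the event {BF>γ} is then equivalent to {x̄>c(θ₁)}, so maximising the probability of exceeding the threshold, uniformly in θ, amounts to minimising c(θ₁) over θ₁.

```r
## Toy rendering of the UMPBT definition for X_1,...,X_n ~ N(theta, sigma^2),
## H0: theta = 0 against a point alternative theta1 > 0, with BF threshold gamma.
## Here BF > gamma  <=>  xbar > c(theta1), with
##   c(theta1) = sigma^2 * log(gamma) / (n * theta1) + theta1 / 2,
## so the alternative maximising P(BF > gamma) for every theta minimises c(theta1).
n     <- 25
sigma <- 1
gamma <- 10                                    # Bayes factor threshold

c_fun <- function(theta1) sigma^2 * log(gamma) / (n * theta1) + theta1 / 2

## numerical minimisation of the acceptance threshold over theta1
optimize(c_fun, interval = c(1e-6, 10))$minimum

## closed-form candidate obtained by differentiating c(theta1)
sigma * sqrt(2 * log(gamma) / n)
```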

## Olli à/in/im Paris

Posted in Statistics, Travel, University life on May 27, 2013 by xi'an

Warning: here is an old post from last October that I can at last publish, since Olli just arXived the paper on which this talk was based (more to come, before or after Olli’s talk in Roma!).

Oliver Ratmann came to give a seminar today at our Big’MC seminar series. It was an extension of the talk I attended last month in Bristol:

10:45 Oliver Ratmann (Duke University and Imperial College) – “Approximate Bayesian Computation based on summaries with frequency properties”

Approximate Bayesian Computation (ABC) has quickly become a valuable tool in many applied fields, but the statistical properties obtained by choosing a particular summary, distance function and error threshold are poorly understood. In an effort to better understand the effect of these ABC tuning parameters, we consider summaries that are associated with empirical distribution functions. These frequency properties of summaries suggest what kind of distance functions are appropriate, and the validity of the choice of summaries can be assessed on the fly during Monte Carlo simulations. Among valid choices, uniformly most powerful distances can be shown to optimize the ABC acceptance probability. Considering the binding function between the ABC model and the frequency model of the summaries, we can characterize the asymptotic consistency of the ABC maximum-likelihood estimate in general situations. We provide examples from phylogenetics and dynamical systems to demonstrate that empirical distribution functions of summaries can often be obtained without expensive re-simulations, so that the above theoretical results are applicable in a broad set of applications. In part, this work will be illustrated on fitting phylodynamic models that capture the evolution and ecology of interpandemic influenza A (H3N2) to incidence time series and the phylogeny of H3N2's immunodominant haemagglutinin gene.

I however benefited enormously from hearing the talk again and also from discussing the fundamentals of his approach before and after the talk (in the nearest Aussie pub!). Olli’s approach is (once again!) rather iconoclastic in that he presents ABC as a testing procedure, using frequentist tests and concepts to build an optimal acceptance condition. Since he manipulates several error terms simultaneously (as before), he needs to address the issue of multiple testing but, thanks to a switch between acceptance and rejection, null and alternative, the individual α-level tests get turned into a global α-level test.
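For readers unfamiliar with the setting, here is a naive toy stand-in in R for the flavour of “ABC as a testing procedure”, where a proposed parameter is accepted whenever an α-level two-sample test fails to reject the agreement between observed and simulated data. This is my own crude illustration and certainly not Olli's construction (which relies on the frequency properties of the summaries).

```r
## Naive toy stand-in for "ABC as a testing procedure": accept theta' whenever an
## alpha-level Kolmogorov-Smirnov test does not reject equality of the observed
## and simulated samples. (Crude illustration only, not Ratmann et al.'s method.)
set.seed(42)
n_obs <- 200
y_obs <- rnorm(n_obs, mean = 1.5, sd = 1)      # pretend data, unknown location

alpha  <- 0.05
n_sim  <- 5000
theta  <- runif(n_sim, -5, 5)                  # flat prior on the location
accept <- logical(n_sim)

for (i in seq_len(n_sim)) {
  y_sim     <- rnorm(n_obs, mean = theta[i], sd = 1)
  accept[i] <- ks.test(y_obs, y_sim)$p.value > alpha
}

## crude "ABC posterior" on the location parameter
summary(theta[accept])
hist(theta[accept], breaks = 30, main = "toy test-based ABC posterior")
```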

## testing via credible sets

Posted in Statistics, University life on October 8, 2012 by xi'an

Måns Thulin released today an arXiv document on some decision-theoretic justifications for [running] Bayesian hypothesis testing through credible sets. His main point is that using the unnatural prior setting mass on a point-null hypothesis can be avoided by rejecting the null when the point-null value of the parameter does not belong to the credible interval, and that this decision procedure can be validated through the use of special loss functions. While I stress to my students that point-null hypotheses are very unnatural and should be avoided at all cost, and also that constructing a confidence interval is not the same as designing a test (the former assesses the precision of the estimation, while the latter opposes two different and even incompatible models), let us consider Måns’ arguments for their own sake.

The idea of the paper is that there exist loss functions for testing point-null hypotheses that lead to HPD, symmetric and one-sided intervals as acceptance regions, depending on the loss function. This was already found in Pereira & Stern (1999). The issue with these loss functions is that they involve the corresponding credible sets in their definition, hence are somehow tautological. For instance, when considering the HPD set and taking T(x) as the largest HPD set not containing the point-null value of the parameter, the corresponding loss function is

$L(\theta,\varphi,x) = \begin{cases}a\mathbb{I}_{T(x)^c}(\theta) &\text{when }\varphi=0\\ b+c\mathbb{I}_{T(x)}(\theta) &\text{when }\varphi=1\end{cases}$

parameterised by a, b, and c, and still depending on the HPD region T(x).
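As an aside, the basic decision rule under discussion, namely rejecting the point null whenever θ0 falls outside the credible set, is easy enough to code. Here is a quick R sketch with a conjugate Beta posterior and a crude sample-based HPD interval, purely as an illustration of the rule rather than of Måns’ loss functions (the data are invented).

```r
## Reject H0: theta = theta0 when theta0 falls outside a 95% credible set,
## illustrated with a Beta posterior for a binomial proportion (invented data).
set.seed(7)
theta0 <- 0.5
x <- 34; n <- 50                         # 34 successes out of 50 trials
a <- 1 + x; b <- 1 + n - x               # Beta(1,1) prior -> Beta(a,b) posterior

## equal-tailed 95% credible interval
et <- qbeta(c(0.025, 0.975), a, b)

## crude HPD interval from posterior draws: shortest interval holding 95% of them
draws <- sort(rbeta(1e5, a, b))
k     <- floor(0.95 * length(draws))
width <- draws[(k + 1):length(draws)] - draws[1:(length(draws) - k)]
j     <- which.min(width)
hpd   <- c(draws[j], draws[j + k])

list(equal_tailed = et, hpd = hpd,
     reject_equal_tailed = theta0 < et[1] | theta0 > et[2],
     reject_hpd          = theta0 < hpd[1] | theta0 > hpd[2])
```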

Måns then introduces new loss functions that do not depend on x and still lead to either the symmetric or the one-sided credible intervals as acceptance regions. However, one test actually has two different alternatives (Theorem 2), which makes it essentially a composition of two one-sided tests, while the other test reduces to a one-sided test (Theorem 3), so even at this face-value level, I do not find the result that convincing. (For the one-sided test, George Casella and Roger Berger (1986) established links between Bayesian posterior probabilities and frequentist p-values.) Both Theorem 3 and the last result of the paper (Theorem 4) use a generic, set-free and observation-free loss function (related to eqn. (5.2.1) in my book, as quoted by the paper!) but (and this is a big but) they only hold for prior distributions setting (prior) mass on both the null and the alternative. Otherwise, the solution is to always reject the hypothesis with zero prior probability… This is actually an interesting argument in the why-are-credible-sets-unsuitable-for-testing debate, as it cannot bypass the introduction of a prior mass on Θ0!

Overall, I further consider that a decision-theoretic approach to testing should encompass future steps rather than focus on the reply to the (admittedly dumb) question “is θ zero?”. Therefore, it must have both plan A and plan B at the ready, which means preparing (and using!) prior distributions under both hypotheses. Even on point-null hypotheses.

Now, after I wrote the above, I came upon a Stack Exchange page initiated by Måns last July. This is presumably not the first time a paper stems from Stack Exchange, but this is a fairly interesting outcome: thanks to the debate on his question, Måns managed to get a coherent manuscript written. Great! (In a sense, this reminded me of the polymath experiments of Terry Tao, Timothy Gowers and others. Meaning that maybe most contributors could have become coauthors to the paper!)

## Large-scale Inference

Posted in Books, R, Statistics, University life on February 24, 2012 by xi'an

Large-scale Inference by Brad Efron is the first IMS Monograph in this new series, coordinated by David Cox and published by Cambridge University Press. Since I read this book immediately after Cox and Donnelly's Principles of Applied Statistics, I was thinking of drawing a parallel between the two books. However, while neither of them can be classified as a textbook [even though Efron's has exercises], they differ very much in their intended audience and their purpose. As I wrote in the review of Principles of Applied Statistics, that book has an encompassing scope with the goal of covering all the methodological steps required by a statistical study. In Large-scale Inference, Efron focuses on empirical Bayes methodology for large-scale inference, by which he mostly means multiple testing (rather than, say, data mining). As a result, the book is centred on mathematical statistics and is more technical. (Which does not make it less of an exciting read!) The book was recently reviewed by Jordi Prats for Significance. Like the previous reviewer, and unsurprisingly, I found the book nicely written, with a wealth of R (colour!) graphs (the R programs and dataset are available on Brad Efron’s home page).

“I have perhaps abused the ‘mono’ in monograph by featuring methods from my own work of the past decade.” (p.xi)

Sadly, I cannot remember if I read my first Efron paper via his 1977 introduction to the Stein phenomenon with Carl Morris in Pour la Science (the French translation of Scientific American) or through his 1983 Pour la Science paper with Persi Diaconis on computer intensive methods. (I would bet on the latter though.) In any case, I certainly read a lot of Efron's papers on the Stein phenomenon during my thesis and it was thus with great pleasure that I saw he introduced empirical Bayes notions through the Stein phenomenon (Chapter 1). It actually took me a while but I eventually (by page 90) realised that empirical Bayes was a proper subtitle to Large-Scale Inference, in that the large samples were giving some weight to the validation of empirical Bayes analyses, in the sense of reducing the importance of a genuine Bayesian modelling (even though I do not see why this genuine Bayesian modelling could not be implemented in the cases covered in the book).

“Large N isn’t infinity and empirical Bayes isn’t Bayes.” (p.90)

The core of Large-scale Inference is multiple testing and the empirical Bayes justification/construction of Fdr’s (false discovery rates). Efron wrote more than a dozen papers on this topic, covered in the book and building on the groundbreaking and highly cited Series B 1995 paper by Benjamini and Hochberg. (In retrospect, it should have been a Read Paper and so was made a “retrospective read paper” by the Research Section of the RSS.) Fdr’s are essentially posterior probabilities and therefore open to empirical Bayes approximations when priors are not selected. Before reaching the concept of Fdr’s in Chapter 4, Efron goes over earlier procedures for removing multiple testing biases. As shown by a section title (“Is FDR Control “Hypothesis Testing”?”, p.58), one major point in the book is that an Fdr is more of an estimation procedure than a significance-testing object. (This is not a surprise from a Bayesian perspective since the posterior probability is an estimate as well.)
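(As a reminder of the frequentist baseline Efron builds upon, the Benjamini and Hochberg step-up procedure is a one-liner in base R; the p-values below are simulated only for the sake of illustration.)

```r
## Benjamini-Hochberg FDR control in base R on a simulated batch of tests:
## 900 true nulls (uniform p-values) and 100 alternatives (small p-values).
set.seed(1)
p <- c(runif(900), runif(100, 0, 0.01))
q <- p.adjust(p, method = "BH")          # BH-adjusted p-values
sum(q < 0.10)                            # number of discoveries at FDR level 10%
```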

“Scientific applications of single-test theory most often suppose, or hope for rejection of the null hypothesis (…) Large-scale studies are usually carried out with the expectation that most of the N cases will accept the null hypothesis.” (p.89)

On the innovations proposed by Efron and described in Large-scale Inference, I particularly enjoyed the notions of local Fdr’s in Chapter 5 (essentially plug-in posterior probabilities that a given observation stems from the null component of the mixture) and of the (Bayesian) improvement brought by empirical null estimation in Chapter 6 (“not something one estimates in classical hypothesis testing”, p.97), as well as the explanation for the inaccuracy of the bootstrap (which “stems from a simpler cause”, p.139), but found less crystal-clear the empirical evaluation of the accuracy of Fdr estimates (Chapter 7, “independence is only a dream”, p.113), maybe in relation with my early-career inability to explain Morris’s (1983) correction for empirical Bayes confidence intervals (pp. 12-13). I also discovered the notion of enrichment in Chapter 9, with permutation tests resembling some low-key bootstrap, and multiclass models in Chapter 10, which appear as if they could benefit from a hierarchical Bayes perspective. The last chapter happily concludes with one of my preferred stories, namely the missing species problem (on which I hope to work this very Spring).
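Coming back to the local Fdr’s of Chapter 5, here is a small self-contained R sketch of the two-groups model behind them, with the mixture components taken as known (whereas Efron estimates the marginal density and, possibly, an empirical null from the z-values themselves); the dashed line marks the conventional fdr ≤ 0.2 reporting threshold.

```r
## Two-groups model: z ~ pi0 * N(0,1) + (1 - pi0) * N(mu1, 1).
## The local fdr is fdr(z) = pi0 * f0(z) / f(z), i.e. the (plug-in) posterior
## probability that an observation scoring z comes from the null component.
## Mixture parameters are taken as known here; Efron estimates them from the data.
pi0 <- 0.90
mu1 <- 2.5

fdr_local <- function(z) {
  f0 <- dnorm(z)                                           # theoretical null density
  f  <- pi0 * dnorm(z) + (1 - pi0) * dnorm(z, mean = mu1)  # marginal density
  pi0 * f0 / f
}

z <- seq(-4, 6, by = 0.1)
plot(z, fdr_local(z), type = "l", ylab = "local fdr(z)",
     main = "local fdr under a known two-groups model")
abline(h = 0.2, lty = 2)                                   # conventional 0.2 threshold
```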