## testing via credible sets

Posted in Statistics, University life with tags , , , , , , , , , , , on October 8, 2012 by xi'an

Måns Thulin released today an arXiv document on some decision-theoretic justifications for [running] Bayesian hypothesis testing through credible sets. His main point is that using the unnatural prior setting mass on a point-null hypothesis can be avoided by rejecting the null when the point-null value of the parameter does not belong to the credible interval and that this decision procedure can be validated through the use of special loss functions. While I stress to my students that point-null hypotheses are very unnatural and should be avoided at all cost, and also that constructing a confidence interval is not the same as designing a test—the former assess the precision in the estimation, while the later opposes two different and even incompatible models—, let us consider Måns’ arguments for their own sake.

The idea of the paper is that there exist loss functions for testing point-null hypotheses that lead to HPD, symmetric and one-sided intervals as acceptance regions, depending on the loss func. This was already found in Pereira & Stern (1999). The issue with these loss functions is that they involve the corresponding credible sets in their definition, hence are somehow tautological. For instance, when considering the HPD set and T(x) as the largest HPD set not containing the point-null value of the parameter, the corresponding loss function is

$L(\theta,\varphi,x) = \begin{cases}a\mathbb{I}_{T(x)^c}(\theta) &\text{when }\varphi=0\\ b+c\mathbb{I}_{T(x)}(\theta) &\text{when }\varphi=1\end{cases}$

parameterised by a,b,c. And depending on the HPD region.

Måns then introduces new loss functions that do not depend on x and still lead to either the symmetric or the one-sided credible intervals.as acceptance regions. However, one test actually has two different alternatives (Theorem 2), which makes it essentially a composition of two one-sided tests, while the other test returns the result to a one-sided test (Theorem 3), so even at this face-value level, I do not find the result that convincing. (For the one-sided test, George Casella and Roger Berger (1986) established links between Bayesian posterior probabilities and frequentist p-values.) Both Theorem 3 and the last result of the paper (Theorem 4) use a generic and set-free observation-free loss function (related to eqn. (5.2.1) in my book!, as quoted by the paper) but (and this is a big but) they only hold for prior distributions setting (prior) mass on both the null and the alternative. Otherwise, the solution is to always reject the hypothesis with the zero probability… This is actually an interesting argument on the why-are-credible-sets-unsuitable-for-testing debate, as it cannot bypass the introduction of a prior mass on Θ0!

Overall, I furthermore consider that a decision-theoretic approach to testing should encompass future steps rather than focussing on the reply to the (admittedly dumb) question is θ zero? Therefore, it must have both plan A and plan B at the ready, which means preparing (and using!) prior distributions under both hypotheses. Even on point-null hypotheses.

Now, after I wrote the above, I came upon a Stack Exchange page initiated by Måns last July. This is presumably not the first time a paper stems from Stack Exchange, but this is a fairly interesting outcome: thanks to the debate on his question, Måns managed to get a coherent manuscript written. Great! (In a sense, this reminded me of the polymath experiments of Terry Tao, Timothy Gower and others. Meaning that maybe most contributors could have become coauthors to the paper!)

## Large-scale Inference

Posted in Books, R, Statistics, University life with tags , , , , , , , , , , , , , , , on February 24, 2012 by xi'an

Large-scale Inference by Brad Efron is the first IMS Monograph in this new series, coordinated by David Cox and published by Cambridge University Press. Since I read this book immediately after Cox’ and Donnelly’s Principles of Applied Statistics, I was thinking of drawing a parallel between the two books. However, while none of them can be classified as textbooks [even though Efron's has exercises], they differ very much in their intended audience and their purpose. As I wrote in the review of Principles of Applied Statistics, the book has an encompassing scope with the goal of covering all the methodological steps  required by a statistical study. In Large-scale Inference, Efron focus on empirical Bayes methodology for large-scale inference, by which he mostly means multiple testing (rather than, say, data mining). As a result, the book is centred on mathematical statistics and is more technical. (Which does not mean it less of an exciting read!) The book was recently reviewed by Jordi Prats for Significance. Akin to the previous reviewer, and unsurprisingly, I found the book nicely written, with a wealth of R (colour!) graphs (the R programs and dataset are available on Brad Efron’s home page).

I have perhaps abused the “mono” in monograph by featuring methods from my own work of the past decade.” (p.xi)

Sadly, I cannot remember if I read my first Efron’s paper via his 1977 introduction to the Stein phenomenon with Carl Morris in Pour la Science (the French translation of Scientific American) or through his 1983 Pour la Science paper with Persi Diaconis on computer intensive methods. (I would bet on the later though.) In any case, I certainly read a lot of the Efron’s papers on the Stein phenomenon during my thesis and it was thus with great pleasure that I saw he introduced empirical Bayes notions through the Stein phenomenon (Chapter 1). It actually took me a while but I eventually (by page 90) realised that empirical Bayes was a proper subtitle to Large-Scale Inference in that the large samples were giving some weight to the validation of empirical Bayes analyses. In the sense of reducing the importance of a genuine Bayesian modelling (even though I do not see why this genuine Bayesian modelling could not be implemented in the cases covered in the book).

Large N isn’t infinity and empirical Bayes isn’t Bayes.” (p.90)

The core of Large-scale Inference is multiple testing and the empirical Bayes justification/construction of Fdr’s (false discovery rates). Efron wrote more than a dozen papers on this topic, covered in the book and building on the groundbreaking and highly cited Series B 1995 paper by Benjamini and Hochberg. (In retrospect, it should have been a Read Paper and so was made a “retrospective read paper” by the Research Section of the RSS.) Frd are essentially posterior probabilities and therefore open to empirical Bayes approximations when priors are not selected. Before reaching the concept of Fdr’s in Chapter 4, Efron goes over earlier procedures for removing multiple testing biases. As shown by a section title (“Is FDR Control “Hypothesis Testing”?”, p.58), one major point in the book is that an Fdr is more of an estimation procedure than a significance-testing object. (This is not a surprise from a Bayesian perspective since the posterior probability is an estimate as well.)

Scientific applications of single-test theory most often suppose, or hope for rejection of the null hypothesis (…) Large-scale studies are usually carried out with the expectation that most of the N cases will accept the null hypothesis.” (p.89)

On the innovations proposed by Efron and described in Large-scale Inference, I particularly enjoyed the notions of local Fdrs in Chapter 5 (essentially pluggin posterior probabilities that a given observation stems from the null component of the mixture) and of the (Bayesian) improvement brought by empirical null estimation in Chapter 6 (“not something one estimates in classical hypothesis testing”, p.97) and the explanation for the inaccuracy of the bootstrap (which “stems from a simpler cause”, p.139), but found less crystal-clear the empirical evaluation of the accuracy of Fdr estimates (Chapter 7, ‘independence is only a dream”, p.113), maybe in relation with my early career inability to explain Morris’s (1983) correction for empirical Bayes confidence intervals (pp. 12-13). I also discovered the notion of enrichment in Chapter 9, with permutation tests resembling some low-key bootstrap, and multiclass models in Chapter 10, which appear as if they could benefit from a hierarchical Bayes perspective. The last chapter happily concludes with one of my preferred stories, namely the missing species problem (on which I hope to work this very Spring).

## on using the data twice…

Posted in Books, Statistics, University life with tags , , , , , , , , on January 13, 2012 by xi'an

As I was writing my next column for CHANCE, I decided I will include a methodology box about “using the data twice”. Here is the draft. (The second part is reproduced verbatim from an earlier post on Error and Inference.)

Several aspects of the books covered in this CHANCE review [i.e., Bayesian ideas and data analysis, and Bayesian modeling using WinBUGS] face the problem of “using the data twice”. What does that mean? Nothing really precise, actually. The accusation of “using the data twice” found in the Bayesian literature can be thrown at most procedures exploiting the Bayesian machinery without actually being Bayesian, i.e.~which cannot be derived from the posterior distribution. For instance, the integrated likelihood approach in Murray Aitkin’s Statistical Inference avoids the difficulties related with improper priors πi by first using the data x to construct (proper) posteriors πii|x) and then secondly using the data in a Bayes factor

$\int_{\Theta_1}f_1(x|\theta_1) \pi_1(\theta_1|x)\,\text{d}\theta_1\bigg/ \int_{\Theta_2}f_2(x|\theta_2)\pi_2(\theta_2|x)\,\text{d}\theta_2$

as if the posteriors were priors. This obviously solves the improperty difficulty (see. e.g., The Bayesian Choice), but it creates a statistical procedure outside the Bayesian domain, hence requiring a separate validation since the usual properties of Bayesian procedures do not apply. Similarly, the whole empirical Bayes approach falls under this category, even though some empirical Bayes procedures are asymptotically convergent. The pseudo-marginal likelihood of Geisser and Eddy (1979), used in  Bayesian ideas and data analysis, is defined by

$\hat m(x) = \prod_{i=1}^n f_i(x_i|x_{-i})$

through the marginal posterior likelihoods. While it also allows for improper priors, it does use the same data in each term of the product and, again, it is not a Bayesian procedure.

Once again, from first principles, a Bayesian approach should use the data only once, namely when constructing the posterior distribution on every unknown component of the model(s).  Based on this all-encompassing posterior, all inferential aspects should be the consequences of a sequence of decision-theoretic steps in order to select optimal procedures. This is the ideal setting while, in practice,  relying on a sequence of posterior distributions is often necessary, each posterior being a consequence of earlier decisions, which makes it the result of a multiple (improper) use of the data… For instance, the process of Bayesian variable selection is on principle clean from the sin of “using the data twice”: one simply computes the posterior probability of each of the variable subsets and this is over. However, in a case involving many (many) variables, there are two difficulties: one is about building the prior distributions for all possible models, a task that needs to be automatised to some extent; another is about exploring the set of potential models. First, ressorting to projection priors as in the intrinsic solution of Pèrez and Berger (2002, Biometrika, a much valuable article!), while unavoidable and a “least worst” solution, means switching priors/posteriors based on earlier acceptances/rejections, i.e. on the data. Second, the path of models truly explored by a computational algorithm [which will be a minuscule subset of the set of all models] will depend on the models rejected so far, either when relying on a stepwise exploration or when using a random walk MCMC algorithm. Although this is not crystal clear (there is actually plenty of room for supporting the opposite view!), it could be argued that the data is thus used several times in this process…

## Error and Inference [arXived]

Posted in Books, Statistics, University life with tags , , , , , , , on November 29, 2011 by xi'an

Following my never-ending series of posts on the book Error and Inference, (edited) by Deborah Mayo and Ari Spanos (and kindly sent to me by Deborah), I decided to edit those posts into a (slightly) more coherent document, now posted on arXiv. And to submit it as a book review to Siam Review, even though I had not high expectations it fits the purpose of the journal: the review was rejected between the submission to arXiv and the publication of this post!