## robust Bayesian FDR control with Bayes factors [a reply]

*(Following my earlier discussion of his paper, Xiaoquan Wen sent me this detailed reply.)*

I think it is appropriate to start my response to your comments with a little background on my research interests and on the project itself: I consider myself an applied statistician, not a theorist, and I am interested in developing theoretically sound and computationally efficient methods to solve practical problems. The FDR project originated from a practical application in genomics involving hypothesis testing. The details of this particular application can be found in this published paper, and the simulations in the manuscript are also designed for a similar context. In this application, the null model is trivially defined; however, there exist finitely many alternative scenarios for each test. We proposed a Bayesian solution that handles this complex setting quite nicely: in brief, we chose to model each possible alternative scenario parametrically, and, by taking advantage of Bayesian model averaging, the Bayes factor naturally ended up as our test statistic. We had no problem demonstrating that the resulting Bayes factor is much more powerful than the existing approaches, even accounting for the prior (mis-)modeling for Bayes factors. However, in this genomics application, there are potentially tens of thousands of tests that need to be performed simultaneously, and FDR control becomes both necessary and challenging.

We actually started with a full Bayesian approach following the framework proposed by Newton et al (2004), treating multiple testing as a mixture of null and (many components of) alternative models, and intrinsically making inference with respect to the proportion of true nulls (I will call this quantity pi0 from this point on). The difficulty lies in the inference of pi0: although, for controlling FDR, we were only interested in identifying two classes (null and non-null) from the mixture, the mixture itself can have many more components induced by different alternative models. Through simulations, we observed that the inference of pi0, e.g. its credible interval, is highly sensitive to its prior specification in the mixture model: e.g. the uniform prior led to severe underestimation of pi0 and consequently to an inflated FDR. (It should be noted that, in the simulations, the assumed alternative models and the data-generating alternative models are reasonably similar, and we also explicitly model the many components of the alternatives.) Although applying a more sophisticated Bayesian inference framework (e.g. a Dirichlet process type of prior) might resolve this issue, we did not pursue this direction, largely because of concerns about the computational burden on this large-scale data problem. Instead, we turned to the famous Bayes/Non-Bayes compromise, i.e., treating the Bayes factor as a test statistic, applying permutations to find its p-value, and then applying a p-value based FDR control procedure. We were actually satisfied with this approach to a certain degree and published our application paper using this solution. That held until recently: when we started analyzing a scaled-up data set, we realized that the permutation p-value scheme hit a computational bottleneck and became impractical. This motivated me to re-think a computationally efficient solution to control FDR using Bayes factors.
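The Bayes/Non-Bayes compromise described above can be sketched in a few lines. This is only an illustration, not the code used in the application paper: the function names and the array layout (one row of permuted Bayes factors per test) are assumptions, the Bayes factors are taken as alternative over null, and the Benjamini-Hochberg step-up rule stands in for whichever p-value based FDR procedure is preferred.

```python
import numpy as np

def permutation_pvalues(bf_obs, bf_perm):
    """Empirical p-values when the Bayes factor is used as a test statistic.

    bf_obs  : shape (m,), observed Bayes factor for each test
    bf_perm : shape (m, B), Bayes factors recomputed on B permuted data sets
    (hypothetical layout chosen for this sketch)
    """
    bf_obs = np.asarray(bf_obs, dtype=float)
    bf_perm = np.asarray(bf_perm, dtype=float)
    n_perm = bf_perm.shape[1]
    # Count permuted BFs at least as large as the observed one;
    # the +1 in numerator and denominator keeps every p-value above zero.
    exceed = (bf_perm >= bf_obs[:, None]).sum(axis=1)
    return (exceed + 1) / (n_perm + 1)

def benjamini_hochberg(pvals, alpha=0.05):
    """Boolean rejection mask from the Benjamini-Hochberg step-up rule."""
    pvals = np.asarray(pvals, dtype=float)
    m = pvals.size
    order = np.argsort(pvals)
    # Compare the k-th smallest p-value with alpha * k / m.
    below = pvals[order] <= alpha * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()  # largest rank that passes
        reject[order[:k + 1]] = True
    return reject
```

The smallest p-value this scheme can return is 1/(B+1), so resolving the small p-values needed across tens of thousands of tests requires a large number of permutations per test, which is where the computational cost comes from.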

By telling this story, I’d hope that you would agree with me on the following points:

1. We care very much about the alternative hypothesis and that is why we choose Bayes factors (which require explicit parametric modeling of the alternatives) in the first place. The first quote from the manuscript in your comments,

“Although the Bayesian FDR control is conceptually straightforward, its practical performance is susceptible to alternative model misspecifications. In comparison, the p-value based frequentist FDR control procedures demand only adequate behavior of p-values under the null models and generally ensure targeted FDR control levels, regardless of the distribution of p-values under the assumed alternatives.”

is a critical one, but my emphasis here is that the “Bayesian FDR control procedure” is susceptible to “alternative model misspecification”, not so much the Bayes factor itself. I think it boils down to whether one thinks that controlling a pre-defined FDR level is a meaningful thing. For us, to compare our Bayes factor solution with the existing frequentist approaches in the application mentioned above, we had to value this standard. Finally, from a mathematical point of view, this can be achieved with Bayes factors in principle, as illustrated by the Bayes/Non-Bayes compromise.

2. I think the mixture model formulation of the multiple hypothesis testing problem is quite common in both frequentist and Bayesian practice. But my observation is that the main difficulty seems to be making accurate inference of pi0 when the alternative models are highly heterogeneous. In simulation 1 of the manuscript, we made a comparison with the localFDR procedure, where the alternatives are inferred by a spline Poisson regression density estimation method with default parameters, and in some cases we found pi0 severely underestimated (and the FDR inflated) if care is not taken. I believe that sophisticated Bayesian/parametric solutions might resolve this issue, but most of the ones I know do not scale up computationally to the practical problem that we cared about.

With these points explained, I am now ready to present my interpretation of the proposed Bayesian FDR procedure: it attempts to guard a pre-defined FDR level rigorously in a computationally efficient way, even when the alternatives are highly heterogeneous and/or an “accurate” parametric specification is difficult to find (so that accurate inference of pi0 is computationally non-trivial). The proposed procedure has at least two major advantages over the Bayes/Non-Bayes compromise solution:

1. Computational efficiency: the EBF procedure described in the paper does not need any permutations; the QBF procedure may need permutations to estimate the median of the Bayes factor distribution under the null, but far fewer than the number required to estimate p-values accurately.

2. The procedure actually controls the Bayesian FDR, not the frequentist FDR, which might be a theoretical advantage and which also sets it apart from the Bayes/Non-Bayes compromise.

Furthermore, I’d like to interpret the proposed procedure as a computational approximation (albeit a conservative one) to the true Bayesian FDR control procedure described in Newton et al (2004), Mueller et al (2004), and Mueller et al (2006), and therefore as purely Bayesian, philosophically speaking. To see this, it is critical to interpret hat-pi0 (equations 3.1 to 3.3 on page 9 in the manuscript) as a probability upper bound of the posterior distribution of pi0, regardless of its prior specification. The “regardless” part comes from an argument of Bayesian asymptotics when the number of simultaneous tests is large (also explained on page 9), which is not an unreasonable assumption given the applications we face every day in genomics. This should also provide intuition for the way we estimate hat-pi0 by simply applying the law of large numbers (LLN). I agree with your assessment that the procedure looks frequentist-y, but at the same time I don’t believe that the LLN is a frequentist patent, and this part of the procedure should be seen within the big picture of the overall scheme. I also acknowledge that this procedure is conservative, but at a level similar to Storey’s procedure and generally less so than the Benjamini-Hochberg procedure.
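To make the role of hat-pi0 concrete, here is a minimal sketch of one way such an upper bound can feed a Bayesian FDR rule. It does not reproduce equations 3.1 to 3.3 of the manuscript, which are not given here. It assumes Bayes factors defined as alternative over null (so that each null Bayes factor has expectation 1), takes hat-pi0 as the largest fraction of the smallest Bayes factors whose sample mean stays below 1, and then rejects in the style of Newton et al (2004) by keeping the average posterior null probability of the rejected set at or below the target level. All function names are hypothetical.

```python
import numpy as np

def pi0_upper_bound(bf):
    """Conservative pi0 estimate from sample means of the sorted Bayes factors.

    Assumes BF = alternative / null, so E[BF | null] = 1. By the LLN the mean
    of the null BFs is close to 1 when the number of tests is large, and the
    mean of the smallest BFs can only be lower than that.
    """
    bf_sorted = np.sort(np.asarray(bf, dtype=float))
    m = bf_sorted.size
    # Mean of the d smallest BFs, for d = 1..m (non-decreasing in d).
    running_mean = np.cumsum(bf_sorted) / np.arange(1, m + 1)
    d0 = np.count_nonzero(running_mean < 1.0)
    # Guard used only in this sketch: never return exactly zero.
    return max(d0, 1) / m

def bayesian_fdr_reject(bf, pi0, alpha=0.05):
    """Reject the largest set whose mean posterior null probability <= alpha."""
    bf = np.asarray(bf, dtype=float)
    # Posterior probability of the null under a two-group mixture.
    post_null = pi0 / (pi0 + (1.0 - pi0) * bf)
    order = np.argsort(post_null)
    # Expected false discovery proportion when rejecting the k best tests.
    running_fdr = np.cumsum(post_null[order]) / np.arange(1, bf.size + 1)
    n_reject = np.count_nonzero(running_fdr <= alpha)
    reject = np.zeros(bf.size, dtype=bool)
    reject[order[:n_reject]] = True
    return reject

# Toy usage: four weak Bayes factors and two strong ones.
bf = np.array([0.5, 0.5, 0.5, 0.5, 100.0, 100.0])
pi0_hat = pi0_upper_bound(bf)
rejected = bayesian_fdr_reject(bf, pi0_hat, alpha=0.05)
```

Because an overestimate of pi0 raises every posterior null probability, plugging in an upper bound can only shrink the rejection set, which is the sense in which such a procedure errs on the conservative side.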

I have to say that I am a bit surprised that you selected the second quote as a representative summary of the manuscript; it really is not. It only aimed to provide a little intuition that the sample mean of the BFs carries information about pi0, nothing further. If I could pick a summarizing quote from the manuscript, I’d choose the following one from the discussion section:

“We have introduced a Bayesian FDR control procedure with Bayes factors that is robust to misspecifications of alternative models. This feature should provide peace of mind for practitioners who are attempting parametric Bayesian models in multiple hypothesis testing. Nevertheless, within our framework, the model specification still dictates the overall performance, e.g., a badly designed alternative model would have very little power and would therefore be useless. Our central message throughout this paper has been that various FDR control procedures have little practical difference if the same or similar test statistics are applied; however, our proposed procedure encourages well-designed parametric modeling approaches to obtain more powerful test statistics.”

Finally, on your final comments that the distribution of Bayes factors can be useful to calibrate approximations, I completely agree! As a matter of fact, in a paper currently in press at *Annals of Applied Statistics*, Matthew Stephens and I have utilized the exact idea to calibrate the approximate Bayes factors computed from the Laplace approximation for finite sample size situations, and it worked amazingly well (see appendix D)!
