## on using the data twice…

As I was writing my next column for CHANCE, I decided to include a methodology box about “using the data twice”. Here is the draft. (The second part is reproduced verbatim from an earlier post on Error and Inference.)

Several aspects of the books covered in this CHANCE review [i.e., Bayesian ideas and data analysis, and Bayesian modeling using WinBUGS] face the problem of “using the data twice”. What does that mean? Nothing really precise, actually. The accusation of “using the data twice” found in the Bayesian literature can be thrown at most procedures exploiting the Bayesian machinery without actually being Bayesian, i.e., procedures that cannot be derived from the posterior distribution. For instance, the integrated likelihood approach in Murray Aitkin’s Statistical Inference avoids the difficulties related to improper priors $\pi_i$ by first using the data $x$ to construct (proper) posteriors $\pi_i(\theta_i|x)$ and then using the data a second time in a Bayes factor

$\int_{\Theta_1}f_1(x|\theta_1) \pi_1(\theta_1|x)\,\text{d}\theta_1\bigg/ \int_{\Theta_2}f_2(x|\theta_2)\pi_2(\theta_2|x)\,\text{d}\theta_2$

as if the posteriors were priors. This obviously solves the impropriety difficulty (see, e.g., The Bayesian Choice), but it creates a statistical procedure outside the Bayesian domain, hence requiring a separate validation since the usual properties of Bayesian procedures do not apply. Similarly, the whole empirical Bayes approach falls under this category, even though some empirical Bayes procedures are asymptotically convergent. The pseudo-marginal likelihood of Geisser and Eddy (1979), used in Bayesian ideas and data analysis, is defined by

$\hat m(x) = \prod_{i=1}^n f_i(x_i|x_{-i})$

through the marginal posterior likelihoods. While it also allows for improper priors, it does use the same data in each term of the product and, again, it is not a Bayesian procedure.
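To see how these quantities differ in practice, here is a minimal sketch in a toy setting that is not taken from the books: observations $x_i\sim N(\theta,\sigma^2)$ with $\sigma^2$ known and a proper conjugate prior $\theta\sim N(\mu_0,\tau_0^2)$, chosen so that the genuine marginal likelihood exists in closed form and can serve as a benchmark. All function names and numerical values are assumptions made for illustration.

```python
import math
import random


def normal_logpdf(y, mean, var):
    """Log density of N(mean, var) evaluated at y."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (y - mean) ** 2 / var)


def posterior(x, mu0, tau0_sq, sigma_sq):
    """Conjugate update: posterior mean and variance of theta given the list x."""
    n = len(x)
    var = 1.0 / (1.0 / tau0_sq + n / sigma_sq)
    mean = var * (mu0 / tau0_sq + sum(x) / sigma_sq)
    return mean, var


def log_integrated_likelihood(x, mean, var, sigma_sq):
    """Log of the integral of f(x|theta) against a N(mean, var) density on theta."""
    n = len(x)
    xbar = sum(x) / n
    ss = sum((xi - xbar) ** 2 for xi in x)
    return (
        -0.5 * n * math.log(2.0 * math.pi * sigma_sq)
        - 0.5 * ss / sigma_sq
        + 0.5 * math.log(2.0 * math.pi * sigma_sq / n)
        + normal_logpdf(xbar, mean, sigma_sq / n + var)
    )


def log_pseudo_marginal(x, mu0, tau0_sq, sigma_sq):
    """Log of prod_i f(x_i | x_{-i}), each term a leave-one-out predictive density."""
    total = 0.0
    for i in range(len(x)):
        rest = x[:i] + x[i + 1:]
        m, v = posterior(rest, mu0, tau0_sq, sigma_sq)
        total += normal_logpdf(x[i], m, v + sigma_sq)
    return total


random.seed(0)
mu0, tau0_sq, sigma_sq = 0.0, 4.0, 1.0  # illustrative values only
x = [random.gauss(1.0, 1.0) for _ in range(20)]

# Genuine marginal: the likelihood is integrated against the prior.
log_m = log_integrated_likelihood(x, mu0, tau0_sq, sigma_sq)

# Posterior-as-prior version: the likelihood is integrated against the posterior.
post_mean, post_var = posterior(x, mu0, tau0_sq, sigma_sq)
log_aitkin = log_integrated_likelihood(x, post_mean, post_var, sigma_sq)

# Pseudo-marginal: product of leave-one-out predictives.
log_pseudo = log_pseudo_marginal(x, mu0, tau0_sq, sigma_sq)

print(f"log marginal            : {log_m:.3f}")
print(f"log posterior-integrated: {log_aitkin:.3f}")
print(f"log pseudo-marginal     : {log_pseudo:.3f}")
```

In this proper-prior setting the posterior-integrated likelihood can never fall below the genuine marginal, since $\int f\,\pi(\theta|x)\,\text{d}\theta = \int f^2\pi\,\text{d}\theta\big/m(x) \ge m(x)$ by the Cauchy-Schwarz inequality, which is one way of seeing the second use of the data at work.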

Once again, from first principles, a Bayesian approach should use the data only once, namely when constructing the posterior distribution on every unknown component of the model(s). Based on this all-encompassing posterior, all inferential aspects should be the consequences of a sequence of decision-theoretic steps in order to select optimal procedures. This is the ideal setting while, in practice, relying on a sequence of posterior distributions is often necessary, each posterior being a consequence of earlier decisions, which makes it the result of a multiple (improper) use of the data… For instance, the process of Bayesian variable selection is in principle clean from the sin of “using the data twice”: one simply computes the posterior probability of each of the variable subsets and that is the end of it. However, in a case involving many (many) variables, there are two difficulties: one is about building the prior distributions for all possible models, a task that needs to be automatised to some extent; another is about exploring the set of potential models. First, resorting to projection priors as in the intrinsic solution of Pérez and Berger (2002, Biometrika, a most valuable article!), while unavoidable and a “least worst” solution, means switching priors/posteriors based on earlier acceptances/rejections, i.e. on the data. Second, the path of models truly explored by a computational algorithm [which will be a minuscule subset of the set of all models] will depend on the models rejected so far, either when relying on a stepwise exploration or when using a random walk MCMC algorithm. Although this is not crystal clear (there is actually plenty of room for supporting the opposite view!), it could be argued that the data is thus used several times in this process…
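To make the “clean” version of variable selection concrete, here is a sketch in notation that is assumed for illustration and does not appear in the draft: candidate models $M_k$ with prior weights $p_k$, parameters $\theta_k$, priors $\pi_k$ and likelihoods $f_k$. The data enters once, through the marginals, and the posterior probabilities of the models follow:

$m_k(x) = \int_{\Theta_k} f_k(x|\theta_k)\,\pi_k(\theta_k)\,\text{d}\theta_k\,,\qquad \mathbb{P}(M_k|x) = p_k\,m_k(x)\bigg/\sum_j p_j\,m_j(x)$

With $p$ candidate variables the sum runs over $2^p$ subsets, which is where the two practical difficulties above come in: each $\pi_k$ has to be specified, and the sum can only be explored partially.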

### 6 Responses to “on using the data twice…”

1. Matthieu Authier Says:

Hello,

thanks very much for the answer. I was asking about the horseshoe because a part of the Biometrika paper deals with thresholding, and conditional on the threshold dropping some variables. I was feeling that using the horseshoe prior and reporting posteriors from this model is enough. That is, we don’t need to re-run a model with only the predictors that were selected before, since that would be using the data twice (but given my experience so far with the publication process in the field of ecology, this is probably what will be asked by reviewers).
Hence my question about the horseshoe. I feel that it could fall also in your category (e) in practice but really is fine on its own (d).
Or in Tolkienian prose, the horseshoe may be then some sort of “One model to find them all and in the prior shrink them”…
Sorry for bugging you with probably naïve questions and mille mercis d’avoir pris le temps de répondre.
Cheers
Matthieu

• Thanks for the clarification! Even though I had handled a previous version of this paper, I did not remember the re-run part (and I have not re-checked the paper so far). Now, this is entering a grey area albeit far from the Mordor of “using the data twice” (to keep with the Tolkienian theme!). In a Bayesian analysis with different models, the global posterior distribution is the weighted sum of all the posteriors for the different models. Once a model is selected based on this global posterior, we do not rerun the chosen model since it already is part of the global model. I would thus say it’s mostly kosher.

2. Rather than make it an “all or nothing” principle, one needs an epistemological criterion to distinguish pejorative from unpejorative double counting. That is what I hope the “severity principle” achieves, at least for someone who cares to evaluate procedures according to how well or poorly specific errors are controlled/ruled out. When double counting does warrant an adjustment, we appeal essentially to the sampling distribution. But I’m not sure what the Bayesian justification for caring is. Unless one foists it on the prior (and I’d still want to know the grounds), the evidential relationship between data and hypothesis, for likelihoodists, should have nothing to do with looking at or using up the data; isn’t that part of the Bayesian magic?

• Thanks! Using a Bayesian approach where all unknowns are endowed with prior probabilities (for models) or prior distributions (for parameters), and where the errors are evaluated under a loss function, the data is only used once in the (global) posterior, which is then the only entity needed to conduct the decision process. A genuine Bayesian approach therefore cannot “use the data twice”. Only approximations to Bayesian procedures open the Pandora’s box of re-using the data, some of them in obviously detrimental fashions… Correcting for this can only be done in an un-Bayesian way, using asymptotics and indeed the sampling distribution.

3. Authier Matthieu Says:

Hello,
as an ecologist, I am very often confronted with the issue of model selection. Most of the time, this is done by some automated procedure that computes the AIC of all possible variable combinations. I suspect then that what we’re doing is more variable selection than model selection. Anyway, the process is tedious and looks a bit like overkill: most of the models considered have not been thought about (after all, R does this on its own without thinking) and some are just not plausible.
I recently read the Biometrika paper of Carvalho, Polson and Scott about the horseshoe estimator for sparse signals. I would be very curious to know your thoughts about this procedure. I have used the horseshoe in a logistic regression about clutch size in a seabird. It was very useful in shrinking some spurious signals (evaluated from posterior predictive checks… I guess this is using the data twice…).
My more focused question is then: do you think the horseshoe prior can solve this “using the data twice” problem?
Yours sincerely
Matthieu

• Matthieu: thanks for the comments. There are several issues:
(a) “using the data twice” is not related to Bayesian model selection per se in that we can run proper Bayesian model selection w/o using the data twice. It is also possible to “use the data twice” in non-Bayesian settings.
(a’) by the way variable selection is model choice [for me].
(b) Lasso is a penalised likelihood technique for variable comparison / nested model comparison. It has a vague Bayesian flavour that was made precise by Park and Casella (2008).
(c) AIC, BIC, &c., Lasso, are in principle free from the sin of “using the data twice”, being all of the likelihood ratio type.
(c’) this is less clear for DIC!
(d) Carvalho, Polson and Scott have reanalysed the Bayesian lasso by picking another prior on the penalty coefficients. This is strictly Bayesian, hence does not use the data twice.
(e) in practice, people use mixes of Bayesian and non-Bayesian techniques, like plug-in hyperparameters, which end up using the data more than once…

Merci et à bientôt!
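As a toy illustration of point (e) above, here is a minimal sketch of plug-in hyperparameters, under assumptions that are not part of the discussion: a normal means model $x_i\sim N(\theta_i,\sigma^2)$ with $\theta_i\sim N(0,\tau^2)$, and a simple moment estimate of $\tau^2$. The function name and the simulated values are hypothetical.

```python
import random


def empirical_bayes_shrinkage(x, sigma_sq=1.0):
    """Plug-in (empirical Bayes) posterior means in a normal means model.

    First use of the data: marginally x_i ~ N(0, sigma_sq + tau_sq), so the
    average of x_i**2 estimates sigma_sq + tau_sq and yields tau_sq_hat.
    Second use of the data: the same x_i are shrunk as if tau_sq_hat were
    a prior variance fixed in advance.
    """
    n = len(x)
    marginal_var = sum(xi ** 2 for xi in x) / n
    tau_sq_hat = max(marginal_var - sigma_sq, 0.0)  # truncated at zero
    shrink = tau_sq_hat / (tau_sq_hat + sigma_sq)
    return tau_sq_hat, [shrink * xi for xi in x]


random.seed(1)
# Simulated data with true tau_sq = 4 and sigma_sq = 1 (illustrative values only).
thetas = [random.gauss(0.0, 2.0) for _ in range(50)]
x = [random.gauss(t, 1.0) for t in thetas]

tau_sq_hat, estimates = empirical_bayes_shrinkage(x)
print(f"estimated tau^2: {tau_sq_hat:.3f}")
print(f"first observation {x[0]:.3f} shrunk to {estimates[0]:.3f}")
```

A fully Bayesian treatment would instead put a prior on $\tau^2$ and use the data once, in the joint posterior of $(\theta_1,\ldots,\theta_n,\tau^2)$.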