stopping rule impact

Here is a question from my friend Shravan Vasishth about the consequences of using a stopping rule:

Psycholinguists and psychologists often adopt the following type of data-gathering procedure: The experimenter gathers n data points, then checks for significance (p<0.05 or not). If it’s not significant, he gets more data (n more data points). Since time and money are limited, he might decide to stop anyway at sample size, say, some multiple of n.  One can play with different scenarios here. A typical n might be 10 or 15.

This approach would give us a distribution of t-values and p-values under repeated sampling. Theoretically, under the standard assumptions of frequentist methods, we expect the Type I error rate to be 0.05. This is the case in standard analyses (I also track the t-statistic, in order to compare it with my stopping rule code below).

Here’s a simulation showing what happens. I wanted to ask you whether this simulation makes sense. I assume here that the experimenter gathers 10 data points, then checks for significance (p<0.05 or not). If it’s not significant, he gets more data (10 more data points). Since time and money are limited, he might decide to stop anyway at sample size 60. This gives us p-values under repeated sampling. Theoretically, under the standard assumptions of frequentist methods, we expect the Type I error rate to be 0.05. This is the case in standard analyses:
## Standard: fixed sample size
pvals<-NULL
tstat_standard<-NULL
n<-10      ## sample size
nsim<-1000 ## number of simulations
stddev<-1  ## standard deviation
mn<-0      ## mean (the null hypothesis is true)

for(i in 1:nsim){
  samp<-rnorm(n,mean=mn,sd=stddev)
  fit<-t.test(samp)  ## run the test once and reuse the result
  pvals[i]<-fit$p.value
  tstat_standard[i]<-fit$statistic
}

## Type I error rate: about 5% as theory says:
table(pvals<0.05)[2]/nsim
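As a quick sanity check (a sketch added here, not part of the original script), the simulated t-statistics can be compared with the Student t density with n-1 degrees of freedom:

## the fixed-n t-statistics follow a t distribution with n-1 df
hist(tstat_standard,breaks=50,freq=FALSE,
     main="standard design",xlab="t-statistic")
curve(dt(x,df=n-1),add=TRUE,lwd=2)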

But the situation quickly deteriorates as soon as we adopt the strategy I outlined above:

pvals<-NULL
tstat<-NULL
## how many subjects can I run at most?
upper_bound<-n*6

for(i in 1:nsim){
  ## at the outset we have no significant result:
  significant<-FALSE
  ## the null hypothesis is true,
  ## so any rejection is a mistake.
  ## take an initial sample of size n:
  x<-rnorm(n,mean=mn,sd=stddev)
  while(!significant & length(x)<upper_bound){
    if(t.test(x)$p.value>0.05){
      ## not significant: get n more data points
      x<-append(x,rnorm(n,mean=mn,sd=stddev))
    } else {
      ## otherwise stop:
      significant<-TRUE
    }
  }
  ## record the final test, whether significant or not:
  fit<-t.test(x)
  pvals[i]<-fit$p.value
  tstat[i]<-fit$statistic
}
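The Type I error rate under this stopping rule can be computed exactly as before (a line added for completeness; with these settings it comes out around 0.15 rather than the nominal 0.05, see also the comments below):

## empirical Type I error rate under the stopping rule:
table(pvals<0.05)[2]/nsim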

Now let’s compare the distribution of the t-statistic in the standard case with its distribution under the above stopping rule. We get fatter tails with the stopping rule, as shown by the histogram below.
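One way to draw that comparison is the following sketch, which only uses the vectors tstat_standard and tstat produced above (the transparent colours are merely a display choice):

## overlay the two t-statistic distributions: the stopping rule
## produces visibly fatter tails than the fixed-n design
hist(tstat,breaks=50,freq=FALSE,col=rgb(1,0,0,0.5),
     xlab="t-statistic",main="standard (grey) vs stopping rule (red)")
hist(tstat_standard,breaks=50,freq=FALSE,col=rgb(0.5,0.5,0.5,0.5),add=TRUE)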

Is this a correct way to think about the stopping rule problem?

[Figure: histogram of the t-statistic under the stopping rule, with visibly fatter tails than in the standard case]

To which I replied the following:

By adopting a stopping rule on a random iid sequence, you favour values in the sequence that agree with your stopping condition, hence modify the distribution of the outcome. To take an extreme example, if you draw N(0,1) variates until the empirical average is between -2 and 2, the average thus produced cannot remain N(0,1/n) but instead has a different distribution.
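A quick simulation of this extreme example (a sketch, with an arbitrary number of replications) makes the distortion visible: most runs stop after the first draw, so the stopped average is essentially a normal truncated to (-2,2), while the runs that continue produce averages biased away from zero:

## draw N(0,1) variates one at a time until the running average
## falls inside (-2,2), then record the average and the stopping size
nrep<-10^4
avg<-nstop<-numeric(nrep)
for(r in 1:nrep){
  x<-rnorm(1)
  while(mean(x)<=-2 || mean(x)>=2) x<-c(x,rnorm(1))
  avg[r]<-mean(x)
  nstop[r]<-length(x)
}
table(nstop)  ## how many draws were needed before stopping
hist(avg,breaks=50,main="averages produced by the stopping rule",xlab="average")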

The p-value of the t test you build from your experiment is no longer distributed as a uniform variate because of the stopping rule: writing t(x1,…,x10j) for the p-value of the t test based on the first 10j observations, the sample (x1,…,x10m), with random size 10m [resulting from increases of the sample size by adding 10 more observations at a time], is distributed from

\prod_{i=1}^{10m} \phi(x_i) \times \prod_{j=1}^{m-1} \mathbb{I}_{t(x_1,\ldots,x_{10j})>.05} \times \mathbb{I}_{t(x_1,\ldots,x_{10m})<.05}

if 10m<60 [assuming the maximal acceptable sample size is 60] and from

\prod_{i=1}^{60} \phi(x_i) \times \prod_{j=1}^{5} \mathbb{I}_{t(x_1,\ldots,x_{10j})>.05}

otherwise. The histogram at the top of this post shows the empirical distribution of the average of those observations, which is clearly far from a normal distribution.
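To see both the random sample size 10m and the resulting averages, one can rerun Shravan’s loop while recording them (a sketch reusing n, nsim, mn, stddev and upper_bound as defined above):

## rerun the stopping-rule simulation, now recording the final
## sample size (10m) and the empirical average at stopping time
sizes<-means<-numeric(nsim)
for(i in 1:nsim){
  significant<-FALSE
  x<-rnorm(n,mean=mn,sd=stddev)
  while(!significant & length(x)<upper_bound){
    if(t.test(x)$p.value>0.05){
      x<-append(x,rnorm(n,mean=mn,sd=stddev))
    } else significant<-TRUE
  }
  sizes[i]<-length(x)
  means[i]<-mean(x)
}
table(sizes)  ## distribution of the stopping size (60 means forced stop)
hist(means,breaks=50,main="averages under the stopping rule",xlab="average")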

4 Responses to “stopping rule impact”

  1. Much ink has been spilled on this issue of “optional stopping.” An alternative is to sample until a desired degree of precision is reached; e.g., this.

  2. Dag Hjermann Says:

    Very nice blog note. Also, running
    hist(pvals)
    after each of the two parts of the R script is pretty convincing. Actually the probability of making a Type I error increases to ca. 0.15.

  3. […] article was first published on Xi'an's Og » R, and kindly contributed to […]

  4. This kind of idea goes back to about the 1950s under the name “sequential analysis”. There, you either (a) stop and reject the null, (b) stop and fail to reject the null, or (c) keep sampling. The procedure is designed so that it terminates with probability 1. The rules tell you what to do so that you get (approximately) your chosen probabilities of Type I and Type II errors. (I think you also get an “expected sample size”, which helps with study planning.) But your intuition is sound, since the rejection region is different for a sequential test than for a regular one.
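For illustration, here is a minimal sketch of such a sequential procedure, Wald’s sequential probability ratio test for a normal mean with known variance; the alternative mean, error rates and cap on the sample size are arbitrary choices, not taken from the comment:

## Wald's SPRT for H0: mu = 0 versus H1: mu = mu1, with known sd = 1;
## data are generated under H0, so rejections are Type I errors
sprt<-function(mu1=0.5,alpha=0.05,beta=0.2,nmax=500){
  upper<-log((1-beta)/alpha)  ## cross it: stop and reject the null
  lower<-log(beta/(1-alpha))  ## cross it: stop and fail to reject
  llr<-0
  x<-numeric(0)
  while(llr>lower && llr<upper && length(x)<nmax){
    xn<-rnorm(1,mean=0,sd=1)
    x<-c(x,xn)
    llr<-llr+dnorm(xn,mean=mu1,sd=1,log=TRUE)-dnorm(xn,mean=0,sd=1,log=TRUE)
  }
  c(reject=(llr>=upper),n=length(x))
}
res<-replicate(10^3,sprt())
mean(res["reject",])  ## rejection rate, close to the chosen alpha
mean(res["n",])       ## average sample size (the stopping time is random)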
