Peeking Not Considered Harmful
You may have received the impression that it is bad to “peek” at experiment results before your experiment finishes. Sometimes, people call it “snooping”, which certainly sounds bad. I bring good news: done correctly, peeking is not only “okay”, it is good and optimal! Peeking, if done correctly, can greatly reduce the time it takes to experiment!
First, what exactly is peeking? A peek is when we look at the results of an experiment and make a decision. Of course, “just looking” changes nothing and doesn’t matter, but combining looking with a decision (to continue running the experiment or to stop and declare it a winner/loser) fundamentally changes the statistical problem and we have to account for that.
Warnings about peeking are sometimes overdone, and you might get the sense that it’s the most “scientific” to do zero peeking and only make decisions at the end of the experiment. And, sure, zero peeking does achieve the minimum worst-case experiment sample size for a given power level. Alternatively, you may have heard from various A/B testing vendors that they use “always valid p-values” so it’s fine to peek as much as you want.
The truth is that the optimal amount of peeking is not once (at the end of the experiment), and it is not constantly, even if we have correct p-values. The optimal amount of peeking is a compromise between two competing goals: (1) stopping the experiment early when we already know we have a winner/loser and (2) limiting the worst-case duration (maximum sample size) of the experiment at a given power level. These two goals are directly opposed, but from the perspective of minimizing expected experiment duration, neither dominates and both enter into the decision problem.
The mild hot take in this blog is that while we should account for peeking when forming confidence intervals and p-values, peeking is not bad — in fact, it is good! You should probably be peeking more if you currently only peek once or twice an experiment! But we shouldn’t be peeking every day. It’s a balance.
Why Do We Need To Adjust For Peeking?
Suppose we have a coin and you want to test whether the coin is fair — that is, whether the probability of getting heads or tails is the same.
In one type of test, you flip the coin 100 times and look for whether there is any evidence that the coin is unfair after recording all the flips. This test is just like an ordinary experiment without any peeking until the end of the experiment.
Now, suppose that you design the test slightly differently. You’d prefer to flip the coin as few times as possible. Repetitive coin flipping is boring-with-a-capital-B. You decide that after 50 flips, you’ll check if you’ve got enough evidence to conclude the coin is unfair and if so, you’ll stop right there and save yourself 50 flips. If not, you’ll just have to power through to the end and then make the decision after 100 flips.
Let’s suppose that in reality the coin is fair and that we are doing a standard hypothesis test. The probability of finding the coin unfair (incorrectly) after the first 50 flips is A and that the probability of finding it unfair after the full 100 flips is also A because we are running the same-sized hypothesis test both times. In hypothesis testing, A is called the “significance level”, which a lot of people set at 5%.
The probability that we end up finding the coin unfair in the second design is:
Pr(Unfair after 50 or after 100) = Pr(Unfair at 50) + Pr(Unfair at 100) − Pr(Unfair at both 50 and 100)
Clearly, Pr(Unfair at both 50 and 100) < Pr(Unfair at 50) = A, because the joint event is rarer than either individual event. Substituting into the expression above gives: Pr(Unfair after 50 or after 100) > A.
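We can check this inequality numerically with a quick Monte Carlo sketch. This is a toy under a normal approximation to the binomial test statistics: the statistic at flip 100 shares its first 50 flips with the statistic at flip 50, so the two have correlation sqrt(50/100).

```python
import numpy as np

rng = np.random.default_rng(1)
nsim = 400_000

# z-statistics at flip 50 and flip 100 under a fair coin (normal approximation);
# z100 reuses the first 50 flips, so corr(z50, z100) = sqrt(50/100)
z50 = rng.normal(size=nsim)
z100 = (np.sqrt(50) * z50 + np.sqrt(50) * rng.normal(size=nsim)) / np.sqrt(100)

# reject at either look using the unadjusted 5% critical value
reject_either = (np.abs(z50) > 1.96) | (np.abs(z100) > 1.96)
print(reject_either.mean())  # noticeably above A = 0.05
```

With these settings the combined rejection rate comes out around 8%, even though each individual test has size 5%.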
We can see this directly with simulations. We simulate 1000 flips with a peek at the 500th flip. In the simulation, the probability of (incorrectly) declaring the coin unfair is almost twice as high when we peek as when we always flip the coin 1000 times: an 8.8% rejection rate against a nominal 5%. So this is a major problem. If we ignored it, we'd make errors far too frequently.
Thankfully, it is possible to revise how we do the test so that we don't get too many rejections. The key idea is to make it more difficult to find the coin unfair at any given check, so that Pr(Unfair after 50 or Unfair after 100) = A. The right critical value in this scenario is ~2.18 instead of the standard 1.96. Using this critical value, we find the coin unfair 4.8% of the time, in line with the nominal 5%, i.e. the test has the right rejection rate.
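The ~2.18 critical value can be recovered numerically: under the null, it is (approximately) the 95th percentile of max(|Z_50|, |Z_100|). A minimal Monte Carlo sketch, again using a normal approximation to the binomial test statistics:

```python
import numpy as np

rng = np.random.default_rng(2)
nsim = 400_000

# z-statistics at the peek (flip 50) and the end (flip 100) under a fair coin,
# constructed so that corr(z50, z100) = sqrt(50/100)
z50 = rng.normal(size=nsim)
z100 = (np.sqrt(50) * z50 + np.sqrt(50) * rng.normal(size=nsim)) / np.sqrt(100)

# critical value c such that Pr(|z50| > c or |z100| > c) = 5%
c = np.quantile(np.maximum(np.abs(z50), np.abs(z100)), 0.95)
print(round(c, 2))  # close to 2.18
```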
A really bad thing to do is to peek at experiment results constantly without adjusting for how often we peek. In this same example, if we checked after every flip whether we had enough evidence to find the coin unfair, we would find it unfair a whopping 76% of the time, despite the coin being fair. Sticking to our pre-specified peeking schedule is very important. This is not just a "theoretical" issue. It is a big deal, and we will make bad decisions if we do not respect our peeking schedule and adjust the p-values accordingly.
The benefit of peeking is that we can reduce the number of coin flips we have to make if the coin is unfair. If the coin is unfair enough that we get a T-statistic in excess of 2.18, then we can stop early and save some coin flips (experiment time). These savings can add up as better features get into production faster and the team can move on to developing and testing new product ideas.
(R snippet for running the sim)
rm(list=ls(all=TRUE))
set.seed(235314543)
n <- 1000       # total coin flips
peek <- 500     # flip at which we peek
fair <- 0.5     # true probability of heads (the coin is fair)
nsim <- 100000  # number of simulated experiments
unfair <- vector(length=nsim)      # reject using the unadjusted critical value (1.96)
adj_unfair <- vector(length=nsim)  # reject using the peeking-adjusted critical value (2.18)
cont_unfair <- vector(length=nsim) # reject when peeking after every single flip
for (i in 1:nsim)
{
  y <- rbinom(n, prob=fair, size=1)
  # t-statistic for H0: Pr(heads) = 0.5, computed after each flip
  tstats <- sapply(1:n, function(j) sqrt(j)*abs(mean(y[1:j])-fair)/sd(y[1:j]))
  tstat_peek <- tstats[peek]
  tstat_total <- tstats[n]
  reject_peek <- tstat_peek > 1.96
  reject_total <- tstat_total > 1.96
  adj_reject_peek <- tstat_peek > 2.18
  adj_reject_total <- tstat_total > 2.18
  unfair[i] <- (reject_peek | reject_total)
  adj_unfair[i] <- (adj_reject_peek | adj_reject_total)
  cont_unfair[i] <- max(tstats, na.rm=TRUE) > 1.96
}
# rejection rates: ~8.8% (one peek, unadjusted), ~4.8% (adjusted), ~76% (peek every flip)
mean(unfair); mean(adj_unfair); mean(cont_unfair)
Selecting Peeking Periods
Okay, so how many peeking periods should we choose? Should we avoid peeking altogether and just be patient — isn’t that the most “scientific” thing to do?
Peeking does reduce power at a fixed sample size. In fact, a sample sizing tool that can handle peeking will show that fewer peeking periods imply a smaller total (worst-case) sample size, but this can be misleading. The more peeking periods we have, the worse our worst-case experiment duration is; but because we have more chances to stop, the expected (average) experiment duration can decrease. In general, the fastest experiment designs have more than one peeking period but fewer than daily peeks.
Suppose we have beliefs about likely effect sizes. Then, we can do “optimal” selection of the number of equidistant peeks in the following way:
Suppose we have a function P(effect, peeking periods, sample size) that produces, for any given effect size and peeking configuration, the probability of rejection at a given sample size (this is the output of a sample sizing/power analysis tool).
Let N(peeking periods) be the total sample size required to hit the target power with the given number of peeking periods.
Suppose the peeking periods are evenly spaced, so that the sample size at the k'th peek is:
n_k = (k / peeking periods) × N(peeking periods), for k = 1, …, peeking periods.
Let R(period, peeking periods, effect, sample size) be the probability of rejection in the given period and no rejection in any prior period (derived from the P function).
We can find the optimal peeking configuration by solving the following problem:
min over peeking periods of E_effect[ Σ_k R(k, peeking periods, effect, N(peeking periods)) × n_k + (1 − Σ_k R(k, peeking periods, effect, N(peeking periods))) × N(peeking periods) ]
where the expectation is taken over our prior beliefs about likely effect sizes,
i.e. choose peeking periods to minimize the expected experiment duration by having some prior beliefs about likely effect sizes.
While N(peeking periods) increases with the number of peeking periods, N(peeking periods) / peeking periods usually decreases so the probability of getting to stop earlier influences the expected experiment duration.
The above problem has an interior solution! Expected duration is neither strictly decreasing nor strictly increasing in the number of peeking periods.
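To make the expected-duration calculation concrete, here is a rough Monte Carlo sketch. It is a toy under a normal approximation, and for simplicity it holds the total sample size N fixed across designs rather than re-sizing N(peeking periods) to hold power constant; the function names and parameters are illustrative, not a real sample sizing tool.

```python
import numpy as np

rng = np.random.default_rng(0)

def z_at_looks(K, N, effect, nsim):
    # Simulate z-statistics at K equally spaced looks (normal approximation).
    # effect is the per-observation mean shift; effect = 0 gives the null.
    looks = np.arange(1, K + 1) * N // K
    z = np.empty((nsim, K))
    total = np.zeros(nsim)
    prev = 0
    for k, n_k in enumerate(looks):
        m = n_k - prev
        total += rng.normal(effect * m, np.sqrt(m), size=nsim)
        z[:, k] = total / np.sqrt(n_k)
        prev = n_k
    return z, looks

def crit_value(K, N, alpha=0.05, nsim=200_000):
    # critical value c with Pr(any |z_k| > c) = alpha under the null
    z, _ = z_at_looks(K, N, 0.0, nsim)
    return np.quantile(np.abs(z).max(axis=1), 1 - alpha)

def expected_duration(K, N, effect, c, nsim=200_000):
    # E[sample size at stopping], stopping at the first look with |z_k| > c
    z, looks = z_at_looks(K, N, effect, nsim)
    crossed = np.abs(z) > c
    stop = np.where(crossed.any(axis=1), looks[np.argmax(crossed, axis=1)], N)
    return stop.mean()

N, effect = 1000, 0.08
for K in (1, 2, 5, 10):
    c = crit_value(K, N)
    print(K, round(c, 2), round(expected_duration(K, N, effect, c), 1))
```

Holding N fixed, more peeks shrink the expected duration at the cost of larger critical values; once N(peeking periods) is allowed to grow to maintain power, the interior optimum described above emerges.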
Conclusion
The bottom line is that for the fastest experiments, peeking should be a part of the experiment design. I’ve generally seen the above optimization problem yield about 3–5 peeking periods for a four-week experiment. So more than 1, but not daily!
Caveats
All of this only applies if we can set the critical values to account for how much peeking we will do.
If we are using some third-party tool that automatically does "always valid p-values" and there's no way for us to change that, then there's no reason not to peek all the time; in fact, we should, to minimize experiment duration given that constraint. The advantage of not peeking every day is that the critical values are smaller than in the always-valid case, and if we can't adjust them, then that advantage goes away.
If the third-party tool can’t correct for a pre-specified number of peeks and only uses standard critical values, then we can’t peek given that constraint.
Thanks for reading!
Zach
Connect at: https://linkedin.com/in/zlflynn/ .
If you want my help with any Experimentation, Analytics, etc. problem, click here.