Sequential testing with general loss function

9/20/2023

Sequential analysis of experimental data from A/B tests has been quite prominent in recent years due to the myriad Bayesian solutions offered by big industry players. However, this type of sequential analysis is not sequential testing proper, as these solutions have generally abandoned the idea of testing and therefore error control, replacing it with what seems like an ersatz decision-making machine (see Bayesian vs Frequentist Inference for more on this).

Frequentist sequential testing, on the other hand, is becoming more popular by the day, and the reasons are twofold. On the one hand, CRO experts, product managers, growth experts, and analysts are becoming more aware of the adverse impact of the misuse of significance tests and confidence intervals that we call peeking. In short, peeking with intent to stop breaks the validity of both risk estimates and effect size estimates and largely defeats the very purpose of A/B testing. On the other hand, being able to analyze test data as it gathers and to act on it swiftly has many benefits, as long as one can maintain the desired error control throughout the process. See the sequential testing entry of our glossary and the articles linked there for more details on both the benefits and the drawbacks of sequential hypothesis testing.

Sequential testing is based on a concept called error spending, which is what allows us to make statistical evaluations of the data continuously or at certain intervals (predefined or not) while retaining the error guarantees you would expect from a frequentist method. In this article I will try to explain error spending in an accessible manner and without going into the mathematical details. These are covered to a greater extent in chapter 10 of my 2019 book “Statistical Methods in Online A/B Testing” and the cited literature. To do so I will make ample use of simulations and visual aids. Axis ranges are deliberately kept the same across comparable graphs to aid in direct visual comparisons (exception: the right y-axis of graph #2 is very slightly off in order to keep the left y-axis the same).

Let us say we plan to run a test and evaluate its data at 12 monitoring points equally spaced in time. You can think of it as a test running for 12 weeks, for example. We have chosen a 0.05 p-value threshold.

Furthermore, we will use a simple rule to decide when to stop (stopping rule): if at a given monitoring stage the p-value for the difference between B (treatment) and A (control) is lower than the chosen significance threshold, we stop the test and declare the variant a winner. This rule is equivalent to stopping the experiment when a one-sided 95% confidence interval for the difference in proportions excludes zero. Plotted on a graph this stopping rule looks like so: Using the above rule is equivalent to unaccounted peeking with intent to stop.

The graph shows the number of tests declaring a winner at each stage (columns) as well as the error probability spent on each stage (line). There are 498 winners in the very first stage, which is what one would expect from a procedure with up to 5% error probability (498/10000 = 4.98%). However, this is just one possible outcome! If the error rate is to be maintained across the whole set of possible outcomes from the test (all outcomes on all possible exit stages), we would have to see zero tests declare a winner for stages two through to and including twelve. But that is not the case.

There is a declining number of positive outcomes. The reason the number is not steady is that which tests reach later stages partially depends on the outcomes of tests at previous stages. The tests showing the greatest discrepancies gradually end as the test reaches later stages. This has a double effect: it reduces the total number of tests reaching each subsequent stage, and it makes it less and less probable that the remaining tests would show differences large enough to reach the critical value we have set. That is why, despite the critical value being flat, the result is a convex error spending line.
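The peeking scenario described above (12 equally spaced looks, stopping the first time the one-sided p-value drops below 0.05) can be approximated with a short simulation. The sketch below is not the code behind the article's graphs. It is a minimal illustration under stated assumptions: there is no true difference between A and B, every stage adds the same amount of data, and the test statistic is treated as normally distributed, so a proportion test is not simulated directly. The function name `simulate_peeking` and its parameters are hypothetical.

```python
import math
import random
from itertools import accumulate
from statistics import NormalDist

N_TESTS = 10_000   # number of simulated A/A tests (assumption: no true effect)
N_LOOKS = 12       # equally spaced monitoring stages
ALPHA = 0.05       # nominal one-sided significance threshold


def simulate_peeking(n_tests=N_TESTS, n_looks=N_LOOKS, alpha=ALPHA, seed=1):
    """Count how many null tests stop ("declare a winner") at each look.

    Assumes equal information per stage and a normal approximation:
    the z-statistic at look k is the sum of k standard normal
    increments divided by sqrt(k).
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha)  # flat critical value at every look
    stops = [0] * n_looks
    for _ in range(n_tests):
        score = 0.0  # running sum of per-stage increments
        for k in range(1, n_looks + 1):
            score += rng.gauss(0.0, 1.0)
            z = score / math.sqrt(k)  # statistic on all data up to look k
            if z > z_crit:            # one-sided p-value below alpha
                stops[k - 1] += 1
                break                 # the test ends here and cannot stop later
    return stops


stops = simulate_peeking()
spent = [s / N_TESTS for s in stops]   # error probability spent at each stage
cumulative = list(accumulate(spent))   # total error spent up to each stage

for stage, (s, c) in enumerate(zip(spent, cumulative), start=1):
    print(f"stage {stage:2d}: spent {s:.4f}, cumulative {c:.4f}")
```

Under these assumptions the first stage spends close to the nominal 5%, each later stage spends a smaller but non-zero amount, and the cumulative error ends up well above 5%. Exact counts depend on the seed and will not match the 498 reported for the first stage in the article.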
