Taming the A/B Testing Rollercoaster with Bayesian Analysis

Ethos Life
Jul 19, 2022

Dave Decker, Director of Data Science

Like many startups, Ethos faces a trade-off in product A/B testing between making decisions quickly and making them accurately. Furthermore, we are trying to drive convenience rather than engagement; ideally, a customer comes to our app, finds an appropriate life insurance policy, and then enjoys their life with the peace of mind that comes from protecting their family financially. Driving convenience, however, means that collecting ideal sample sizes for classical A/B testing could take months, delaying decisions and backing up our roadmap. Alternatively, peeking at results early with classical null hypothesis testing introduces more false positives if not done carefully. Product managers may also become concerned with early poor performance that may simply be the result of sampling variance at low sample sizes.

Issues with classical null hypothesis testing

Even if we ignore philosophical debates about the merits of different approaches to statistics, the classical null hypothesis testing framework presents challenges in our environment.

  1. As mentioned above, classical null hypothesis testing requires large sample sizes and test periods when run properly.
  2. p-values are only valid for a given sampling intention. That is, once a sample size has been predetermined, peeking at results raises the false positive rate above the stated alpha level.
  3. Power is a curve that depends on an unknown effect size. Post-hoc power analysis using an observed effect size is a common but incorrect practice: due to both the previous point and the last point below, the observed effect size is likely to be exaggerated, and a power calculation based on it will be exaggerated as well.
  4. The rejection region of a test statistic favors the incumbent and hinders the realization of small improvements. To avoid “false positives” when rejecting the null hypothesis, the classical paradigm requires the observed effect size to clear a hurdle. In other words, even if a new treatment arm empirically outperforms an incumbent, the null hypothesis test will encourage the tester to stay with the incumbent unless the empirical results reach the rejection region of the test statistic. From a decision theory point of view, this is problematic not only because it can prevent us from realizing small gains that do not pass significance tests, but also because, if there truly is no difference between experiment arms, there is no downside to choosing the treatment arm anyway.
  5. It’s true that adjustments can be made in the calculations to allow for peeking repeatedly at results, but the “false positive” null hypothesis testing framework still favors the incumbent.
  6. If we also care about estimating the lift of our A/B test winner, then filtering on a null hypothesis test biases our estimated effect size.

This last point is often ignored in both industry and academia, but it is true: if you require statistical significance to choose a winner that improves results over an incumbent, then the estimated improvement is biased, even if the method to estimate that difference was originally unbiased. This is true even if the researcher uses a two-tailed hypothesis test because there is no lift to estimate when we stay with an incumbent. I’ll illustrate this point with simulations.

Estimators have various properties we care about. Three common ones are bias, variance, and consistency. Bias refers to how far off our estimate will be from the true parameter value on average. Variance describes how far a given sample’s estimate tends to fall from the estimator’s expected value, namely the average squared distance from that mean. A property often used to judge the trade-off between bias and variance is the mean squared error. An estimator is consistent if it approaches the true value as sample sizes grow.

If we are interested in estimating the difference in conversion rates between treatment and control experiences, then the difference in observed conversion rates in our test is an unbiased estimator of this difference, and it is consistent as well. Further, it is the unbiased estimator with the lowest variance. Sounds pretty great, but we run into trouble as soon as we start filtering on null hypothesis tests.

For instance, consider the following plot of simulated conversion rate experiments in which the true underlying difference is known to be 0.001 in favor of the treatment arm, with a baseline conversion rate of 0.01 for the control. Sample sizes of 163K per arm were chosen to align with a desired alpha of 0.05 and power of 0.8.

Figure 1: null hypothesis rejection region with 163K samples

80% of these simulated results are “significant” (in blue), which is in line with our power calculation. Because we never deploy the 20% of insignificant results, we never use them to estimate the lifts of winners. Therefore, the estimated lift among the remaining significant results is biased upward; in this case, the selection inflates our estimate of the lift by a relative 13%. The situation gets even worse when we use lower sample sizes, even when that is not the result of early peeking; for instance, one might simply settle for a higher alpha or lower power. The next plot shows the same scenario but with one tenth the number of samples per simulated experiment arm.

Figure 2: null hypothesis rejection region with 16K samples

Now the lift conditional on significance overestimates the real effect size by a factor of 2.76! The situation is exacerbated by repeated evaluations over time, since that raises the probability of a significant result arising from sampling variance, even when there is no true underlying effect.
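For readers who want to poke at this themselves, here is a rough sketch of the kind of simulation behind these plots. The helper name, seed, and z-test shorthand are mine rather than our internal tooling, so treat it as an illustration of the selection effect, not a replication of the exact figures.

```python
import numpy as np
from scipy import stats

def simulate(n_per_arm, p_control=0.010, true_lift=0.001, n_sims=10_000, seed=0):
    rng = np.random.default_rng(seed)
    control = rng.binomial(n_per_arm, p_control, size=n_sims)
    treatment = rng.binomial(n_per_arm, p_control + true_lift, size=n_sims)

    # Two-proportion z-test for each simulated experiment
    p1, p2 = control / n_per_arm, treatment / n_per_arm
    pooled = (control + treatment) / (2 * n_per_arm)
    se = np.sqrt(pooled * (1 - pooled) * 2 / n_per_arm)
    winner = (p2 - p1) / se > stats.norm.ppf(0.975)  # treatment declared a significant winner

    print(f"n={n_per_arm}: {winner.mean():.0%} significant, "
          f"mean lift among winners = {(p2 - p1)[winner].mean():.5f} "
          f"(true lift = {true_lift})")

simulate(163_000)  # roughly 80% significant; winners' estimated lift is modestly inflated
simulate(16_300)   # far fewer significant results; winners' lift is inflated much more
```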

A Bayesian Approach

When I first noted these issues, I thought Bayesian methods might offer a compelling solution that could allow for incremental checking of results and better estimates of expected lifts. For the uninitiated, Bayesian methods allow researchers to combine prior information about a model’s parameters with the observed data in a logical way to obtain a posterior probability distribution over the parameters.

However, the first approach I tried with multilevel modeling did not constrain estimated lifts enough in the simulations I ran; sampling variance could still lead to implausible effect size estimates at lower sample sizes. While further researching the subject, I found a paper by Gronau, Raj, and Wagenmakers that framed the problem in a way that could better leverage prior information. They specify a model in section 3.1 that allows the researcher to specify prior knowledge over the difference between experimental arms. I first adopted this framework with a different choice of prior:
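Sketched in my notation, with the log odds difference ψ split symmetrically around the grand mean β (a paraphrase of their specification rather than a copy of it):

y₁ ~ Binomial(n₁, p₁),  y₂ ~ Binomial(n₂, p₂)

p̃₁ = logit(p₁) = β − ψ/2,  p̃₂ = logit(p₂) = β + ψ/2

β ~ Logistic(0, 1),  ψ ~ Normal(0, σ)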

In other words, this models each of the two experiment outcomes (y₁ and y₂) with a binomial distribution and places their conversion probabilities on a log odds scale as p̃₁ and p̃₂. It then uses a ψ parameter to represent the difference between the log odds of the two arms, centered around the β parameter. We can place a Logistic(0, 1) prior over β to indicate that we are equally open to any baseline conversion rate; this is equivalent to a uniform distribution on the probability scale, which lets us use this specification at any point in the funnel. We can place a stronger prior on the difference between the arms through the σ parameter of the Normal distribution for ψ; the smaller we make σ, the less open we are to large lifts in our experiments. We declare σ as prior knowledge, provide the data y₁, y₂, n₁, and n₂, and the model updates the estimates of our parameters. At the end, we have probability distributions quantifying our uncertainty about p₁ and p₂, which we can interrogate with any question we’d like: What’s the probability that arm 1 has a higher conversion rate than arm 2? What’s the probability that arm 1 is no more than a relative 5% worse than arm 2? And so on.
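To make this concrete, here is a minimal numpyro sketch of the two-arm model as described above. The function and variable names, the σ value, and the counts are illustrative choices of mine, not our production code.

```python
import jax
from jax import random
import numpyro
import numpyro.distributions as dist
from numpyro.infer import MCMC, NUTS

def two_arm_model(n1, n2, y1=None, y2=None, sigma=0.2):
    # Baseline log odds; Logistic(0, 1) is uniform on the probability scale
    beta = numpyro.sample("beta", dist.Logistic(0.0, 1.0))
    # Log odds difference between arms; a small sigma encodes skepticism of large lifts
    psi = numpyro.sample("psi", dist.Normal(0.0, sigma))
    p1 = numpyro.deterministic("p1", jax.nn.sigmoid(beta - psi / 2.0))
    p2 = numpyro.deterministic("p2", jax.nn.sigmoid(beta + psi / 2.0))
    numpyro.sample("y1", dist.Binomial(total_count=n1, probs=p1), obs=y1)
    numpyro.sample("y2", dist.Binomial(total_count=n2, probs=p2), obs=y2)

# Made-up counts for illustration
mcmc = MCMC(NUTS(two_arm_model), num_warmup=1000, num_samples=2000)
mcmc.run(random.PRNGKey(0), n1=20_000, n2=20_000, y1=205, y2=231)
samples = mcmc.get_samples()

# The kinds of questions we can then ask of the posterior:
print("P(arm 2 beats arm 1):", (samples["p2"] > samples["p1"]).mean())
print("P(arm 2 is no more than a relative 5% worse than arm 1):",
      (samples["p2"] > 0.95 * samples["p1"]).mean())
```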

A quick note on choosing priors

One common pushback against Bayesian inference is skepticism about prior distributions. Where do they come from? How do we know they are correct? The good news is that priors don’t have to be perfect to be useful, but you do need to be able to justify them. One way to build intuition for why priors are useful in A/B testing is to think through a simple scenario:

Let’s say, for instance, that the conversion rate at a point in the upper funnel has held steady at 1% for several years, and you run an experiment that slightly changes the value prop message. Now suppose an analyst comes back with the results and claims, “Great news: the treatment arm has a 99% conversion rate!” Would you jump up and down for joy, or demand that they go back and check the data? Of course you wouldn’t believe this 99X improvement; it’s implausible. Bayesian inference gives us a way to down-weight implausible effect sizes while still considering the full hypothesis space.

The Normal distribution prior over ψ lets us stay open to small effect sizes while being very skeptical of larger ones, but there are other possibilities. For instance, a horseshoe or spike-and-slab prior would put a lot of weight on effect sizes at or very close to 0 while remaining more open to larger effect sizes (the occasional home run). There is no free lunch, however: such priors make us more skeptical of the moderate effect sizes that a normal prior favors. Which prior is best for a model depends on its context. Ideally, one would have a large historical catalog of large-sample experiments for their app, but of course startups are rarely in that position, so you may need to rely on outside research when choosing a prior.
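One quick way to sanity-check a candidate prior is to translate its scale into the relative lifts it considers plausible at your baseline. A small sketch, with σ and the baseline chosen purely for illustration:

```python
import numpy as np

# What does a Normal(0, sigma) prior on the log odds difference psi imply
# about relative lifts at roughly +/- 2 sigma, for a given baseline rate?
baseline, sigma = 0.01, 0.2   # illustrative numbers, not our production settings
base_logit = np.log(baseline / (1 - baseline))

for shift in (-2 * sigma, 2 * sigma):
    p = 1 / (1 + np.exp(-(base_logit + shift)))
    print(f"psi = {shift:+.2f} -> relative lift of {p / baseline - 1:+.1%}")
```

If the implied range looks far too wide or too narrow compared with what you have seen historically, adjust σ accordingly.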

Extending to the multi-arm case

The paper from Gronau et al., however, did not address experiments with an arbitrary number of experimental arms. To adjust the framework from the two-armed case to the k-armed case, we proceed much as a t-test is extended to a one-way ANOVA with k means, as explained in section 7.2 of Rouder et al., 2012. The idea is to constrain the differences from the overall mean (on the log odds scale) to each of our experiment arms to sum to 0; this compensates for the over-parameterization we introduce by including both an overall mean and a distance to each arm. We can do this by projecting our design matrix into a space with one fewer dimension using a contrast matrix, that is, an orthonormal set of contrasts that identifies k−1 difference parameters in our model. After fitting the model under this parameterization, we can project the results back to our original parameter space, where the estimates, and every tuple of posterior samples, will sum to 0. We’ll let α represent the k-length vector of offsets of the experimental arms from an overall mean β and define α∗ = Qₖᵀα, where α∗ is the (k−1)-length representation of the effects. We can construct Qₖ from the centering matrix

Ωₖ = Iₖ − Jₖ/k

where Iₖ is the k×k identity matrix and Jₖ is the k×k matrix of 1s. We then apply an eigendecomposition to Ωₖ and collect the k−1 eigenvectors corresponding to the non-zero eigenvalues into a k×(k−1) matrix Qₖ, such that

Ωₖ = QₖQₖᵀ and QₖᵀQₖ = Iₖ₋₁.

See the Rouder et al. paper for more details (note: I’ve modified their notation to avoid conflicts across models). Our model for k arms keeps the binomial likelihood and builds each arm’s log odds from the overall mean β plus the projected effects XQₖα∗, where X is the design matrix of one-hot encoded treatment arms. The projected effects α∗ get a multivariate normal prior whose mean is a 0 vector of length k−1 and whose covariance Σ is a diagonal matrix with a common variance on the diagonal; this common variance is again where we encode our prior belief in small effect sizes. This is what we’ve coded up, and the excellent numpyro package provides us with fast MCMC for inference.
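Here is a minimal numpyro sketch of that k-arm setup, using the sum-to-zero projection described above. The helper names, σ value, and counts are illustrative stand-ins rather than our production code.

```python
import jax
import jax.numpy as jnp
from jax import random
import numpyro
import numpyro.distributions as dist
from numpyro.infer import MCMC, NUTS

def sum_to_zero_contrasts(k):
    # Eigendecomposition of the centering matrix I_k - J_k / k. eigh returns
    # eigenvalues in ascending order, so dropping the first column removes the
    # lone zero eigenvalue and leaves the k-1 contrasts that make up Q_k.
    omega = jnp.eye(k) - jnp.ones((k, k)) / k
    _, eigvecs = jnp.linalg.eigh(omega)
    return eigvecs[:, 1:]                                     # k x (k-1)

def k_arm_model(n, y=None, sigma=0.2):
    k = n.shape[0]
    Q = sum_to_zero_contrasts(k)
    beta = numpyro.sample("beta", dist.Logistic(0.0, 1.0))    # overall mean on the log odds scale
    # k-1 projected effects with a common prior variance on the diagonal of Sigma
    alpha_star = numpyro.sample(
        "alpha_star",
        dist.MultivariateNormal(jnp.zeros(k - 1), covariance_matrix=sigma**2 * jnp.eye(k - 1)),
    )
    alpha = numpyro.deterministic("alpha", Q @ alpha_star)    # per-arm offsets; they sum to zero
    p = numpyro.deterministic("p", jax.nn.sigmoid(beta + alpha))
    numpyro.sample("y", dist.Binomial(total_count=n, probs=p), obs=y)

# Made-up three-arm example
n = jnp.array([10_000, 10_000, 10_000])
y = jnp.array([95, 104, 112])
mcmc = MCMC(NUTS(k_arm_model), num_warmup=1000, num_samples=2000)
mcmc.run(random.PRNGKey(0), n=n, y=y)
p = mcmc.get_samples()["p"]
print("P(arm 3 beats arm 1):", (p[:, 2] > p[:, 0]).mean())
```

Because this sketch aggregates counts per arm, the one-hot design matrix X collapses to the identity and is omitted; with row-level data you would multiply X by Qₖα∗ instead.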

Benefits

That may seem like a lot of extra calculations just for A/B testing, but it brings real benefits to our testing and decision making. If we use a prior that reasonably represents the concentration of small effect sizes we expect to see in practice, then we can update our beliefs on the effect sizes as often as we would like without much concern for misleading results. The posterior distribution we obtain after each observation is an accurate reflection of our uncertainty in the differences in treatments, given our prior belief of small effect sizes, model, and data. Our belief about the difference starts centered around 0 with uncertainty covering the plausible effect size range, and then slowly drifts away from 0 as the uncertainty narrows with additional samples. This means we accept a little bit of bias in our estimates toward 0 in exchange for a large reduction in variance. That bias may sound undesirable, but it’s the variance that tends to fool us into poor decision making when we don’t have time or patience for large sample sizes. The estimator is consistent, so it approaches the true underlying effect size as sample sizes grow large. With this estimation procedure, we can safely check results and make incremental decisions about whether to continue tests or call winners, leveraging quantified risks that allow us to weigh the relevant trade-offs.
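As one illustration of what those quantified risks can look like, here is a toy decision rule based on expected relative loss. The posterior draws are faked, and the rule and tolerance are stand-ins rather than our actual stopping criteria.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for MCMC output: posterior draws of per-arm conversion rates, shape (draws, k)
p_samples = rng.normal(loc=[0.0100, 0.0104, 0.0101], scale=0.0004, size=(4000, 3))

# Expected relative loss of shipping each arm versus the (unknown) best arm
best = p_samples.max(axis=1, keepdims=True)
expected_loss = ((best - p_samples) / best).mean(axis=0)

tolerance = 0.005   # accept at most ~0.5% expected relative loss; a judgment call
print("expected relative loss per arm:", np.round(expected_loss, 4))
print("arms we could call today:", np.where(expected_loss < tolerance)[0])
```

If no arm clears the tolerance yet, the natural move is to keep the test running and re-evaluate as more data arrives.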

I demonstrated above that null hypothesis testing biases estimates of lift sizes. It turns out that any filtering on empirical results will bias estimates upward to some degree; if you simply require a treatment arm to have a higher observed conversion rate than the control, not even requiring statistical significance, the estimate will also be biased upward at least a little bit (this is because sampling variance will cause some observed sample differences to be negative, even when the effect is positive). Biasing our estimates toward 0 helps to counteract the upward screening bias.

To demonstrate the calming effect this has on our experimentation, I’ll share the result of a simulated experiment over time. In the framework presented above, I used conversion rates because they are simple to specify and understand, but in our experiments we often care about differences in revenue per user, and the framework can be extended to cover such cases. The plot below shows a simulation of an experiment where we apply both the classical frequentist and the Bayesian approach every 1,000 observations to estimate the difference in revenue per landing. The baseline in the simulations is about $12/landing from a 0.5% conversion rate, with a true difference of about $2 in favor of the treatment arm.

Figure 3: a simulated experiment

This illustrates the volatility of comparing averages (the frequentist approach), which dips down quite far even after 6,000 observations. A product manager might be very tempted to halt a test if they thought the treatment variant was costing them $4 for every visit to the landing page (from a baseline of $12), even if the result wasn’t “significant” yet. The Bayesian method, by contrast, sticks close to 0 early on and gradually works its way up toward the true difference as it observes more samples, maintaining a reasonable range of uncertainty throughout (30K samples is still quite small given such a low conversion rate and the revenue variance used in the simulation).

Finally, we can view what the difference looks like over many such simulations of 30,000 observations each:

Figure 4

Here we can see that the Bayesian estimates (before filtering on winners) are in fact a bit biased toward 0 compared to the true effect size, but the mean squared error is about half that of the frequentist method. If we filter on declared winners, the bias of the remaining frequentist estimates moves above 0 (as shown above), and the initially negative Bayesian bias moves toward 0. This favorable trade-off will remain as long as the chosen prior distribution is a reasonable proxy of our population of test results.

I’d like to thank EJ Wagenmakers for a helpful email discussion on this topic and the pointer to the Rouder paper.

Dave Decker, Director of Data Science

Dave Decker joined Ethos in October, 2021 as Director of Data Science. When Dave isn’t calculating conditional probabilities at Ethos, he enjoys swimming and learning new things. Interested in joining Dave’s team? Learn more about our career opportunities here.
