The dark side of A/B testing

As product designers and marketers, we love the clarity that comes from A/B testing. We have an idea; we implement it; then, we let the customers decide whether it’s good or not by running a controlled experiment. We split the users randomly into groups and watch one treatment overtake the other by a statistically significant margin.

Thoroughly impressed by our own analytical rigor, we then scale up the winning treatment and move on. You guessed it: I’m about to poke holes in one of the most sacred practices in tech, A/B testing.

Let's start by acknowledging the good. If you don't do any A/B testing today, you're clearly behind the curve: you're making HIPPO-driven decisions (those driven by a Highly Paid Person's Opinion). The problem is, even if you very much respect your HIPPO, s/he is human and as such, extremely biased in judgment.

If the only thing you go by is your HIPPO's intuition, you're blindly exploring the edge of the Grand Canyon. You can get lucky and not fall. Or, like the folks at MSN.com in the early 2000s, you can launch a massive, entirely HIPPO-driven redesign of the site -- and watch all business metrics tank immediately afterwards, not knowing what caused the fall and not being able to roll back quickly.

But you're better than this. You do A/B testing and you challenge your own assumptions -- as well as the assumptions of your HIPPO. Wait, you're not out of the woods either. Let's explore a few pitfalls.

1. Temporal effects: What worked today may not work a month from now

Each A/B test, by definition, has a duration. After a certain amount of time passes (which you, of course, determined by calculating the sample size needed for a statistically significant result), you make a call. Option B is better than option A. You then scale up option B and move on with your life, onto the next test.

But what if user behavior was different during your test period? What if the novelty of option B is what made it successful, and after a few months pass, this option becomes ineffective? A concrete example from Grubhub: We make restaurant recommendations; there's a baseline algorithm A that we show on our site. When we roll out a challenger algorithm B, one reason the challenger can win is that it simply exposes new options. But will this lift be sustained? What if users just try those new restaurants recommended by algorithm B and then stop paying attention to recommendations in the module, just like they did for algorithm A?

There's a flip side to this. In Facebook's case, with the Newsfeed, any meaningful modification causes core metrics to tank -- simply because the customer base is so used to the old version. So you'd reject EVERY test if you were to end it after a week, even though Facebook's user base produces more than enough of a sample to reach significance in that time. This, of course, would be a mistake because you haven't waited long enough for user reactions to stabilize.

You might ask, "Can't I just run all my A/B tests forever?" That is, what if after a month, you scale up the winning option B to 95 percent, keep the other option at 5 percent, and keep monitoring the metrics? This way, you're capturing most of the business benefit of the winning option but you can still react if the pesky temporal effect bites you. Yes, you can do that; you can even do an advanced version of this approach, a multi-armed bandit, in which your A/B testing system homes in on the best option automatically, continuously increasing the exposure of the winning variant.
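To make the bandit idea concrete, here is a minimal simulation sketch of Thompson sampling, one common multi-armed bandit strategy (not necessarily what any particular A/B testing system uses). The function name, the `true_rates` values and the user count are all hypothetical, chosen only for illustration; in a real system the true conversion rates are unknown and only the observed conversions are available.

```python
import random

def thompson_sampling(true_rates, n_users=10_000, seed=0):
    """Simulate a Thompson-sampling bandit over variants.

    true_rates: hypothetical true conversion rate of each variant
    (assumed here only so that user behavior can be simulated).
    Returns how many users were exposed to each variant.
    """
    rng = random.Random(seed)
    k = len(true_rates)
    wins = [0] * k    # conversions observed per variant
    losses = [0] * k  # non-conversions observed per variant

    for _ in range(n_users):
        # Draw a plausible conversion rate for each variant from its
        # Beta posterior, then show the variant with the highest draw.
        draws = [rng.betavariate(wins[i] + 1, losses[i] + 1) for i in range(k)]
        arm = draws.index(max(draws))

        # Simulate whether this user converts.
        if rng.random() < true_rates[arm]:
            wins[arm] += 1
        else:
            losses[arm] += 1

    return [wins[i] + losses[i] for i in range(k)]

# Hypothetical example: variant B converts better than variant A.
exposure = thompson_sampling([0.05, 0.10])
print(dict(zip(["A", "B"], exposure)))
```

As evidence accumulates, the better variant wins more of the random draws and so receives more of the traffic, while the weaker variant is still shown occasionally. That residual exposure is what lets the system react if the winner's advantage fades over time.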

However, there's one significant issue with this method: It pollutes your codebase. Having to fork the logic makes the code hard to maintain. It also makes it very difficult to experience your product in the same way a customer would. The proliferation of user experience forks creates nooks and crannies that you never test or stumble upon, and that is where bugs hide. Thus, don't do this for a long time, and don't do it with every test.

One other possible defense is to rerun your tests occasionally. Confirm that winners are still winners, especially the most salient ones from many moons ago.

2. Interaction effects: Great separately, terrible together

Imagine you're working in a large organization that has multiple work streams for outbound customer communications -- that is, emails and push notifications. One of your teams is working on a brand new "abandoned cart" push notification. Another is working on a new email with product recommendations for customers. Both of these ideas are coded up and are being A/B tested at the same time. Each of them wins, so you scale both. Then BOOM, the moment you scale both, your business metrics tank.

What?!? How can that be? You tested each of the options! Well, this is happening because you're over-messaging your customers. Neither idea crossed that threshold on its own, but together, they do. And the annoyance (why are they pinging me so much?!) outweighs the positive effect.

You've just experienced another rub of A/B testing. There's a built-in assumption in the overall framework -- that tests are independent and don't affect each other. As the example above shows, this assumption can be false, sometimes in ways that are far less obvious than over-messaging.

To make sure this doesn't happen, have someone be the responsible party for the full set of A/B tests that are going on. This person will be able to call out potential interaction effects. If you see one, just sequence the relevant tests instead of parallelizing them.
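If two tests did run in parallel with independent random assignment, some users will have landed in both treatment groups, which allows a rough after-the-fact check for an interaction. The sketch below compares the combined lift with the sum of the individual lifts. The function name and all conversion rates are hypothetical, made up purely for illustration, and a real analysis would also need to account for statistical noise in each cell.

```python
def interaction_effect(control, push_only, email_only, both):
    """Return the lift of the combined treatment beyond the sum of the
    two individual lifts. Inputs are conversion rates for the four
    groups of users. A value near zero is consistent with independent
    tests; a clearly negative value suggests the treatments hurt each
    other when combined.
    """
    push_lift = push_only - control
    email_lift = email_only - control
    combined_lift = both - control
    return combined_lift - (push_lift + email_lift)

# Hypothetical numbers: each message helps alone, but together they hurt.
effect = interaction_effect(
    control=0.100,     # neither message
    push_only=0.104,   # abandoned-cart push only
    email_only=0.105,  # recommendation email only
    both=0.095,        # both messages
)
print(f"interaction effect: {effect:+.3f}")
```

With these made-up numbers, each treatment shows a positive lift on its own, yet users who received both convert below the control rate, which is the over-messaging pattern described above.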

3. The pesky confidence level: The more tests you run, the higher the chance of error

If your organization culturally promotes the idea of experimentation, one "wrong" way it can manifest is by folks running a whole bunch of tiny tests. You know those: Increase the font size by one point, swap the order of the two modules, change a couple of words in the product description. Besides the fact that these changes will most likely not allow your organization to become the visionary of your industry (heh), there's a poorly-understood statistics issue biting you here, too.

Every time you judge an A/B test and claim option B to be better than option A, you're running a statistical calculation, typically a significance test such as a t-test. Inside that calculation, there's the concept of a confidence level: the level of certainty that you are comfortable with. Set it at 90 percent, and whenever there is no real difference between the options, there's still a 10 percent chance that your A/B testing framework will declare a winner -- it'll say that option B is better than option A, while in reality, that's not the case.

Now, what happens if you run 20 tiny tests, each with a 10 percent probability of a false positive? Assuming the tests are independent and none of the changes has a real effect, your chance of finding at least one winner by mistake is 1 minus 90 percent to the power of 20. That is, 88 percent. That's right, your A/B testing framework will very likely show you at least one "fake" winner, and on average two, from your set of 20 tests, possibly providing a feedback loop to the experimenting team that there's indeed gold there.
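The arithmetic behind those figures can be sketched in a few lines. The function names are illustrative, and the calculation assumes that the tests are independent and that none of the tested changes has a real effect.

```python
def chance_of_false_winner(n_tests, false_positive_rate):
    """Probability that at least one of n_tests independent tests
    declares a winner purely by chance."""
    return 1 - (1 - false_positive_rate) ** n_tests

def expected_false_winners(n_tests, false_positive_rate):
    """Average number of chance 'winners' across n_tests tests."""
    return n_tests * false_positive_rate

# 20 tests at a 90 percent confidence level (10 percent false-positive rate)
print(round(chance_of_false_winner(20, 0.10), 3))  # 0.878
print(expected_false_winners(20, 0.10))            # 2.0
```

The same functions show how quickly the risk drops when the list of tests is pruned: with 5 tests instead of 20, the chance of at least one false winner falls to about 41 percent.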

How do you avoid this issue? Have someone look at the list of tests. Disallow a zillion tiny modifications. Be extra-cautious if you're testing version eight of a concept that just keeps failing.

The tactical issues I've outlined here are all too easy to run into when you adopt A/B testing as a philosophy for your marketing and product teams. These aren't trivial faux pas that only amateurs succumb to; they're surprisingly common. Make sure to inoculate your team against them.

Alex Weinstein is SVP of Growth at Grubhub and author of the Technology + Entrepreneurship blog http://www.alexweinstein.net, where he explores data-driven decision making in the face of uncertainty. Prior to Grubhub, he led growth and marketing technologies efforts at eBay.
