By now, anyone in product or marketing knows what A/B testing is. What we don’t know, or at least won’t admit, is that too many A/B tests yield nothing.

Too often they measure meaningless variants, produce inconclusive results, and go nowhere. Of course, some A/B tests yield real, meaningful, actionable results. Those are the ones you hear about. We’ve all seen the articles. Company X increases conversions 38% with this simple trick. Hell, I’ve written some of them.

But those success stories have hidden the grey underbelly of testing and experimentation.

AppSumo revealed that only 1 out of 8 of its tests produces actionable results. Kaiser Fung estimates that 80 to 90 percent of the A/B tests he’s run yield statistically insignificant results.

Yet many new testers walk into A/B testing thinking it’ll be quick and easy to get results. After running a handful of simple tests, they think they’ll find the right color for this button or the right tweak to that subject line, and conversions will, poof, increase by 38% like magic.

Then they start running tests on their apps or sites, and reality suddenly sets in. Tests are inconclusive. They yield “statistically insignificant” results and no valuable insights about the product or users. What’s happening? Where’s that 38% bump and subsequent pat on the back?

Don’t get frustrated. If you’re going to be running A/B tests, you’re going to have some tests that fail to produce meaningful results you can learn from. But if you run good tests, you’ll have fewer failures and more successes. By running thoughtful A/B tests, you’ll get more statistically significant results and real learnings to improve your product.

Seven out of eight A/B tests are inconclusive, and we don’t talk about them enough

Imagine you’re tossing two coins, 20 times each. Coin A lands on heads 12 times. Coin B lands on heads 9 times. You wouldn’t rush out proclaiming you’ve found a coin that is 33% more successful at landing on heads, right? From your understanding of coins, you know the difference is simply chance. It’s not statistically significant.

Now if you tossed each coin another 180 times, and Coin A landed on heads 120 times and Coin B landed on heads 90 times, clearly something significant is happening. But, again, we know that isn’t what would happen. After 200 tosses, there might still be a small difference in how many times each landed on heads, but it would be chance. Any difference is just noise.
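You can check both coin scenarios with a standard two-proportion z-test. This is a minimal sketch, not part of the original article, using only the Python standard library; it shows why 12 vs. 9 heads in 20 tosses is noise, while the same 60% vs. 45% split over 200 tosses each would be significant.

```python
import math

def two_proportion_pvalue(successes_a, n_a, successes_b, n_b):
    """Two-sided p-value for the difference between two proportions,
    using a pooled z-test and the standard normal CDF."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Phi(z) = 0.5 * (1 + erf(z / sqrt(2))); two-sided tail probability.
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# 12 heads vs. 9 heads in 20 tosses each: well above the 0.05 threshold.
print(two_proportion_pvalue(12, 20, 9, 20))

# The same 60% vs. 45% split over 200 tosses each: comfortably significant.
print(two_proportion_pvalue(120, 200, 90, 200))
```

The first comparison yields a p-value around 0.3, the second around 0.003: identical proportions, but only the larger sample can separate signal from noise.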

That might seem like a silly experiment. Of course, two coins aren’t going to perform noticeably different. But, honestly, this is precisely why so many A/B tests yield inconclusive results. We waste our time testing variants without any real meaningful differences and, unsurprisingly, we end up with a bunch of tests with statistically insignificant results.

And if anyone is to blame, it’s that stupid button example’s fault

The button color experiment is the “Hello, World!” of A/B testing. It’s a simple example that does an excellent job of explaining the concept. And so, without fail, any time A/B testing is explained for the first time, someone reaches for the button color example, where one variant of a page has a green purchase button and one has a red button. You run the test and see which color button has a higher conversion rate.

The truth is, some companies have conducted the button experiment and actually received meaningful results to improve their product. If you want your user to interact with something, there is certainly value to making it stand out. That said, as most who have run the experiment have discovered, while button color is an excellent way to describe A/B testing, it’s rarely a meaningful way to improve your product.

I ran my own meaningless test about two months ago

Mixpanel rarely sends out emails to our master list. We usually only email out our new articles to users who have subscribed to our blog. But it had been some time since a large send, so we got the okay to email the latest in our Grow & Tell series to a large chunk of our users. It seemed like the perfect opportunity to run a really quick A/B test.

The email had a subject line of “Why 15 million users weren’t good enough for this mobile trivia app.” But I’d heard that starting out an email with your company name can improve the open rate, so I made a variant with the subject line, “Mixpanel — Why 15 million users weren’t good enough for this mobile trivia app.” Easy, right? And if it performed better, we could put what we learned to use, starting every subject with our name, increasing open rates on all of our emails, and hopefully increasing results.

The email went out to hundreds of thousands of users, split between the two versions. And then I waited impatiently for my success to come rolling in.

When the results did come in, they could not have been less statistically significant. The subject line without “Mixpanel” had a 22.75% open rate. The subject line with “Mixpanel” had a 22.73% open rate. A difference of 0.02 percentage points.

Hundreds of thousands of email sends later, the difference in my test was 20 opens. For all intents and purposes, I was flipping coins.
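The same two-proportion z-test makes the point numerically. This sketch assumes an illustrative split of 100,000 sends per arm (the article doesn’t publish the exact counts), which reproduces the reported 22.75% vs. 22.73% open rates and the 20-open gap.

```python
import math

def two_proportion_pvalue(opens_a, n_a, opens_b, n_b):
    """Two-sided p-value for the difference between two open rates
    (pooled two-proportion z-test)."""
    pooled = (opens_a + opens_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (opens_a / n_a - opens_b / n_b) / se
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# Hypothetical counts: 100,000 sends per arm, 22,750 vs. 22,730 opens.
print(two_proportion_pvalue(22750, 100000, 22730, 100000))
```

Even with six-figure sample sizes, the p-value comes out above 0.9: a 0.02-point gap is indistinguishable from chance.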

Even with such a large sample size, there just wasn’t enough contrast in my test to yield significant results. I learned nothing, except to take my tests more seriously.

So what could I have done to get more significant results?

Well, first, I could have tested a completely different subject line altogether — like the less scintillating but more semantic article title of “Why QuizUp turned the fastest-growing game in history into a social platform.” That contrast would have had a much greater chance of producing statistically significant results.

But even then, what would I have learned besides the fact that one did better than the other? What actions would I have taken from it? Perhaps if I tested it a few more times I could reach the larger conclusion of whether our readers prefer scintillating subject lines or semantic ones.

My test was meaningless because it wasn’t constructed well and it wasn’t part of a larger strategy asking meaty questions about what matters to our readers. It was quick and simple, but it didn’t go anywhere. A/B testing is never as easy as it seems. If you want results, it takes work.

One camp of A/B testers, the “optimize your way to success” camp, puts in the time to thoughtfully and strategically test many little things in hopes of finding an array of small improvements: different pictures, slightly different designs, changes in the text of calls to action. The other camp builds out features of the product and tests drastically different experiences, like reworking the process of user onboarding.

You can find valuable lessons and improve your product with A/B testing, but it takes some hard work

I’m not the only one mulling on this. Recently, I spoke with Hari Ananth, cofounder of Jobr, about some not-so-meaningless A/B tests the company conducted to improve user acquisition.

“We wanted to improve our onboarding flow to get more users in the app and swiping,” Hari told me.

Jobr is an app that allows job seekers to swipe, Tinder-style, through curated job opportunities.

“We identified two crucial steps in our funnel and built a sufficiently wide list of variants for each experiment to ensure proper coverage. After sending enough traffic through each variant, we were able to incorporate the optimized flow and boost conversions by 225%.”

Jobr essentially rebuilt its onboarding process, informed by data on where users dropped out of the previous process.

Cozi’s A/B tests were more in the “optimize your way to success” camp. In an Office Hours talk last summer, Cozi product owner Tara Pugh recalled the company’s own process of removing friction from its user onboarding.

After testing hypothesis after hypothesis, Tara and the team at Cozi were able to incorporate bits of learning into the flow. Some were small aesthetic tweaks, like switching to a lighter background. Others were larger changes that required fewer steps from the user and removed friction from the process, like prepopulating forms and eliminating check boxes.

No single change resulted in a major increase in conversions. But combined, the improvements raised the signup completion rate from 55% to 76%.

Run tests that produce meaningful results

It wasn’t luck that these experiments escaped the common failure of A/B testing to deliver meaningful results. The experiments were constructed to test meaningful aspects of the product, aspects that had a strong impact on how users would behave. And, of course, the experiments collected sample sizes large enough to produce statistically significant results.

So if you’re sick of bullshit results and you want to produce that 38% lift in conversions to get that pat on the back, then put in the work. Take the time to construct meaningful A/B tests and you’ll get meaningful results.

Justin Megahan is Content Marketing Manager at Mixpanel.