Stop testing unless you have the traffic - Travis Street

Experimentation is one of those product practices that sounds rigorous but gets cargo-culted more than almost anything else in SaaS.

Teams run A/B tests because running A/B tests feels like doing product properly.

They set up the test, split the traffic, wait a week, look at the numbers and make a decision. It looks like science. Most of the time it’s wasted effort.

Statistical significance is not optional. It’s the entire point.

To detect a meaningful difference between two variants you need enough users moving through the experiment to rule out random variation as the explanation for any difference you see. That number is higher than most teams think and it scales with how small the effect you’re trying to detect is.

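If you’re looking for a 20% relative improvement in conversion, you need far less traffic than if you’re looking for a 5% improvement. The required sample size grows roughly with the inverse square of the effect size, so an effect a quarter of the size needs on the order of sixteen times the traffic.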

But most meaningful product changes don’t move metrics by 20%. Real-world effect sizes in mature products are often 2-5%. At that level, the traffic requirement to reach significance is substantial.

Run the numbers before you run the test. There are simple calculators that tell you exactly how many users you need per variant, at what baseline conversion rate, to detect an effect of a given size at a given confidence level. Most teams don’t use them. They run the experiment for a week, get an inconclusive result and either call it too early or quietly ignore the data.
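As a rough illustration of what those calculators compute, here is a minimal Python sketch using the standard normal-approximation formula for comparing two conversion rates. The function name, the 10% baseline and the 80% power default are assumptions chosen for the example, not figures from any particular tool.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_lift, alpha=0.05, power=0.80):
    """Approximate users needed per variant for a two-sided two-proportion z-test.

    baseline      -- current conversion rate, e.g. 0.10 for 10%
    relative_lift -- smallest relative improvement worth detecting, e.g. 0.05
    alpha         -- false positive rate (0.05 corresponds to 95% confidence)
    power         -- chance of detecting the effect if it is real (assumed 80%)
    """
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


# Illustrative inputs only: a 10% baseline conversion rate.
for lift in (0.20, 0.05):
    n = sample_size_per_variant(baseline=0.10, relative_lift=lift)
    print(f"{lift:.0%} relative lift: about {n:,} users per variant")
```

Under these assumptions, a 20% relative lift needs a few thousand users per variant, while a 5% lift needs tens of thousands.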

Here’s a rough guide: if you have fewer than a few thousand active users moving through the specific flow you’re testing, you almost certainly don’t have enough traffic to run a meaningful experiment on anything with a modest effect size. You’re not doing science. You’re generating noise and dressing it up as data.
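To see why a few thousand users is so limiting, you can turn the calculation around and ask what the smallest reliably detectable lift is for the traffic you have. The sketch below is an approximation that assumes both variants share the baseline variance; the function name, the 10% baseline and the 80% power default are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist


def min_detectable_relative_lift(baseline, users_per_variant, alpha=0.05, power=0.80):
    """Approximate smallest relative lift a two-sided test can reliably detect."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    # Simplification: both variants are assumed to have the baseline variance.
    std_error = sqrt(2 * baseline * (1 - baseline) / users_per_variant)
    absolute_lift = (z_alpha + z_beta) * std_error
    return absolute_lift / baseline


# Illustrative inputs only: a 10% baseline conversion rate.
for users in (1_000, 10_000, 100_000):
    lift = min_detectable_relative_lift(baseline=0.10, users_per_variant=users)
    print(f"{users:>7,} users per variant: smallest reliable lift is about {lift:.0%}")
```

With 1,000 users per variant at a 10% baseline, this approximation puts the smallest reliably detectable lift somewhere between 35% and 40%, far above the 2-5% effects typical of mature products.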

Underpowered experiments fail in two ways, both bad.

The first is false positives. Random variation produces apparent winners. You ship the variant, nothing improves, and you’ve now made a product decision based on statistical noise. The product gets worse or stays the same and nobody connects it back to the underpowered test from six weeks ago.

The second is false negatives. A change that would actually improve the product looks flat because you didn’t have enough traffic to detect the real effect. You kill a good idea because your experiment wasn’t set up to see it.
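The false-negative risk can be put into rough numbers. The sketch below estimates the power of a two-sided two-proportion z-test, meaning the chance it flags a real effect, using a normal approximation. The function name and the example inputs (10% baseline, 5% relative lift) are assumptions for illustration.

```python
from math import sqrt
from statistics import NormalDist


def approximate_power(baseline, relative_lift, users_per_variant, alpha=0.05):
    """Approximate chance that a two-sided z-test flags a real difference."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    # Standard error of the difference between the two observed rates.
    std_error = sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / users_per_variant)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    shift = abs(p2 - p1) / std_error
    normal = NormalDist()
    # Probability the test statistic lands beyond either critical value.
    return normal.cdf(shift - z_alpha) + normal.cdf(-shift - z_alpha)


# Illustrative inputs only: a real 5% relative lift on a 10% baseline.
for users in (1_000, 60_000):
    power = approximate_power(0.10, 0.05, users)
    print(f"{users:>6,} users per variant: power is about {power:.0%}")
```

Under these assumptions, a test with 1,000 users per variant flags a real 5% lift less than one time in ten, barely more often than the 5% rate at which it flags a change that does nothing at all.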

Both outcomes are worse than not running the experiment at all.

At least without the experiment you know you’re operating on judgment. With an underpowered experiment you think you have evidence when you don’t.

Failing to define what success and failure look like before you start is the other half of the problem, and it’s just as common.

Teams run experiments without defining upfront what a successful result looks like, what a failed result looks like, and critically, what they will do in each case.

Without that definition, experiments become Rorschach tests. Everyone sees what they want to see in the data. The variant that was supposed to improve activation also moved session length and changed a downstream metric and now there’s a meeting about what it all means and three weeks later nothing has been decided.

Before you run any experiment, write down:

The single metric this experiment is designed to move. One metric. Not three, not a dashboard. One.
The minimum effect size that would constitute a meaningful improvement worth shipping. Be specific. “Conversion improves by at least 8%” not “conversion goes up.”
The sample size required to detect that effect at 95% confidence and a stated power (80% is a common choice). Calculate it, write it down, and don’t start reading results until you hit it.
What you will do if the result is positive. Ship it? Roll it out to 100%? Run a follow-up?
What you will do if the result is negative. Kill the variant? Iterate? Accept the current state?
What you will do if the result is flat or inconclusive. This is the most important one because it’s the most common outcome and the one teams are least prepared for.
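One way to make that list binding is to write it down as data before the experiment starts. The sketch below is a hypothetical structure, not a prescribed tool: the class, field names and decision rule are assumptions chosen to mirror the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentPlan:
    """Decisions written down before the experiment starts (hypothetical fields)."""
    metric: str                # the single metric the experiment should move
    min_relative_lift: float   # smallest improvement worth shipping, e.g. 0.08
    required_per_variant: int  # sample size calculated in advance
    if_positive: str
    if_negative: str
    if_inconclusive: str


def next_step(plan, users_per_variant, observed_relative_lift, significant):
    """Return the pre-agreed action for the current state of the experiment."""
    if users_per_variant < plan.required_per_variant:
        return "keep collecting: sample size not reached, do not read results yet"
    if significant and observed_relative_lift >= plan.min_relative_lift:
        return plan.if_positive
    if significant and observed_relative_lift < 0:
        return plan.if_negative
    # Not significant, or a positive result smaller than the agreed threshold.
    return plan.if_inconclusive


# Example values are made up for illustration.
plan = ExperimentPlan(
    metric="signup conversion",
    min_relative_lift=0.08,
    required_per_variant=25_000,
    if_positive="roll out to 100%",
    if_negative="kill the variant",
    if_inconclusive="keep the current flow and run user interviews",
)
print(next_step(plan, users_per_variant=30_000, observed_relative_lift=0.01, significant=False))
```

Because the plan is frozen and the actions are chosen upfront, an inconclusive result leads to a decision made in advance instead of a meeting about what the data means.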

Flat doesn’t mean the change is neutral. It usually means your experiment wasn’t set up to tell you anything. Knowing that in advance changes how you respond to it.

The simple truth is most early and mid-stage SaaS teams don’t have enough traffic to run an experiment that can reach statistical significance. They don’t have enough users, they don’t have enough traffic through specific flows, and they don’t have enough volume to run clean experiments on anything subtle.

If this is you, don’t run experiments. Do something else.

Qualitative research scales down where quantitative doesn’t. Five user interviews will tell you more about why your onboarding flow is failing than an underpowered A/B test.

Session recordings, support tickets, direct conversations with churned users. These are not inferior substitutes for experimentation. For teams without the traffic to experiment properly, they’re the right tools.

Ship and measure directionally. Make the change, watch the metrics over a meaningful time period, apply judgment. This is less rigorous than a controlled experiment but it’s more honest than a fake one. You know you’re making a judgment call. That’s better than believing you’ve run a scientific test when you haven’t.