Free A/B Test Calculator

Calculate the sample size for your A/B test and determine statistical significance of your results.

A/B Test Calculator

Plan your test or analyze your results

Required Sample Size

Visitors per variation: 8,155
Total visitors needed: 16,310

You need 8,155 visitors per variation (16,310 total across 2 variations) to detect a 20% relative change from a 5% baseline with 95% confidence and 80% power.

What Is A/B Test Sample Size?

A/B test sample size is the number of visitors each variation in your experiment needs before you can draw reliable conclusions. Think of it as the minimum amount of evidence required to confidently say “Variant B is better than Variant A” — or that there is no meaningful difference.

Running an A/B test without calculating sample size upfront is like flipping a coin 5 times and concluding it is biased. With too few observations, random noise drowns out real effects. You might declare a winner when the difference is just luck, or miss a genuine improvement because your test lacked the power to detect it.

Underpowered tests are one of the most common and costly mistakes in experimentation. They waste engineering time building variants, waste traffic that could have been monetized, and worst of all, they lead to bad decisions. Properly calculating sample size before launching a test ensures your time and traffic are well spent.

How to Calculate Sample Size for A/B Testing

The sample size formula for a two-proportion z-test balances four factors: your baseline conversion rate, the minimum effect you want to detect, and your tolerance for two types of errors — false positives and false negatives.

The Formula

n = (Z_{1-α/2} + Z_{1-β})² × [p1(1-p1) + p2(1-p2)] / δ²

where:

n = sample size per group
Z_{1-α/2} = z-score for desired significance (1.96 for 95%)
Z_{1-β} = z-score for desired power (0.84 for 80%)
p1, p2 = baseline and expected conversion rates
δ = absolute difference to detect (p2 - p1)

Worked Example

Suppose your checkout page converts at 5% (p1 = 0.05) and you want to detect a 20% relative improvement — meaning a change from 5% to 6% (p2 = 0.06, δ = 0.01). With 95% significance and 80% power:

Baseline rate: 5%
MDE (relative): 20%
Significance: 95%
Power: 80%

Plugging these values into the formula gives approximately 8,155 visitors per variation (16,310 total). At 1,000 daily visitors, the test would need to run for at least 17 days.
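The formula above can be sketched in a few lines of Python (a minimal illustration using only the standard library; the function and parameter names are our own, not the calculator's):

```python
import math
from statistics import NormalDist

def sample_size_per_variation(baseline, relative_mde, alpha=0.05, power=0.80):
    """Visitors needed per variation for a two-tailed two-proportion z-test."""
    p1 = baseline
    p2 = baseline * (1 + relative_mde)   # expected rate under the variant
    delta = p2 - p1                      # absolute difference to detect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for 95% significance
    z_beta = NormalDist().inv_cdf(power)           # 0.84 for 80% power
    n = (z_alpha + z_beta) ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / delta ** 2
    return math.ceil(n)

# Worked example: 5% baseline, 20% relative MDE, 95% significance, 80% power
print(sample_size_per_variation(0.05, 0.20))  # → 8155
```

Raising any of the reliability knobs (significance or power) or shrinking the MDE pushes the result up; the worked-example inputs reproduce the 8,155 figure shown by the calculator.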

Understanding the Key Inputs

Baseline Conversion Rate

Your current conversion rate is the foundation of every sample size calculation. Get this number from your analytics platform — not from gut feeling. Use data from the exact page, funnel step, or action you plan to test, filtered to the same audience and time period. Seasonal swings, marketing campaigns, or recent product changes can all shift your baseline, so use the most recent representative period (typically the last 30 days).

Minimum Detectable Effect (MDE)

The MDE is the smallest improvement you want your test to reliably detect. It can be expressed as a relative change (e.g., a 20% improvement from baseline) or an absolute change (e.g., a 1 percentage point increase). Smaller MDEs require disproportionately larger sample sizes: sample size scales with 1/δ², so halving the MDE roughly quadruples the required traffic. A practical approach: ask "what is the smallest improvement that would justify the effort of implementing this change?" That is your MDE.

Statistical Significance (Alpha)

The significance level (α) controls your false positive rate — the probability of declaring a winner when there is actually no difference. At 95% significance (α = 0.05), there is a 5% chance of a false positive. Higher significance levels (like 99%) give more certainty but require larger sample sizes. For most business A/B tests, 95% is the standard.

Statistical Power (Beta)

Power (1 - β) is the probability of detecting a real effect when one exists. At 80% power, you have a 20% chance of missing a real improvement (a false negative). Increasing power to 90% reduces the miss rate to 10% but increases the required sample size by roughly a third. For most tests, 80% power offers the best trade-off between reliability and practicality.
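This trade-off is easy to verify from the z-scores alone, since required sample size scales with (Z_{1-α/2} + Z_{1-β})² when all other inputs are held fixed (a quick sketch; the helper name is ours):

```python
from statistics import NormalDist

z_alpha = NormalDist().inv_cdf(0.975)  # two-tailed z-score at 95% significance

def z_sum_sq(power):
    """The (z_alpha + z_beta)^2 term that sample size is proportional to."""
    return (z_alpha + NormalDist().inv_cdf(power)) ** 2

# Relative increase in sample size when moving from 80% to 90% power
increase = z_sum_sq(0.90) / z_sum_sq(0.80) - 1
print(f"{increase:.0%}")  # → 34%
```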

One-Tailed vs Two-Tailed Tests

A two-tailed test checks whether the variant is different from the control in either direction (better or worse). A one-tailed test only checks one direction. While one-tailed tests have more power for a given sample size, they cannot detect effects in the unexpected direction. If your variant could potentially hurt conversion (and most can), a two-tailed test is the safer choice. Our calculator defaults to two-tailed.

How Long Should You Run an A/B Test?

Test duration depends on two factors: your required sample size and your daily traffic. Divide total sample size by daily visitors to get the minimum number of days. However, there are important rules beyond simple math:

1. Run for at least 7 days

Even if you reach your sample size in 3 days, keep the test running to capture a full week of traffic. User behavior varies by day of the week — Tuesday shoppers behave differently than Saturday shoppers.

2. Account for business cycles

If your business has monthly patterns (paydays, billing cycles), consider running tests for 2-4 full weeks to capture the complete cycle.

3. Never stop early for significance

If you check results daily and stop when p < 0.05, your actual false positive rate can be 20-30% instead of 5%. Commit to a sample size before you start.

4. Avoid testing during anomalies

Major sales events, product launches, or marketing campaigns create unusual traffic. Tests running during these periods may not reflect normal behavior.
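The duration math, including the one-week floor from rule 1, can be sketched as (an illustrative helper, not part of the calculator):

```python
import math

def min_test_days(total_sample_size, daily_visitors, min_days=7):
    """Minimum run length: sample-size math with a one-week floor."""
    return max(min_days, math.ceil(total_sample_size / daily_visitors))

print(min_test_days(16_310, 1_000))   # → 17
print(min_test_days(16_310, 10_000))  # → 7 (the one-week floor applies)
```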

How to Calculate Statistical Significance

After your test reaches the required sample size, you need to determine whether the observed difference between control and variant is real or due to chance. This is where the post-test significance calculator comes in.

The standard approach is a two-proportion z-test. It works by comparing the observed conversion rates while accounting for the natural variability (sampling error) in your data. The test calculates a z-statistic and corresponding p-value that tells you the probability of seeing your results if there were truly no difference.

The Z-Test Steps

1. Calculate the pooled proportion: combine all conversions and visitors to get the overall rate: p̂ = (x1 + x2) / (n1 + n2)
2. Calculate the standard error: SE = √[p̂(1-p̂)(1/n1 + 1/n2)]
3. Calculate the z-score: Z = (p2 - p1) / SE
4. Calculate the p-value: convert the z-score to a probability using the normal distribution. For a two-tailed test: p = 2 × (1 - Φ(|Z|))

Common A/B Testing Mistakes

Stopping tests early (peeking)

Checking results daily and stopping when you see significance dramatically inflates false positives. With daily peeking, a test designed for 5% error can have a 20-30% false positive rate. Always pre-commit to a sample size.

Running underpowered tests

Testing with too few visitors means you are unlikely to detect real improvements. An underpowered test does not prove "no effect" — it proves nothing. Use our sample size calculator before launching any test.

Testing too many variations

Each additional variation multiplies your required sample and increases the chance of false positives through multiple comparisons. Unless you have very high traffic, limit tests to 2-3 variations.

Ignoring business cycles

A test that runs Monday to Thursday misses weekend behavior. Tests during a flash sale capture abnormal behavior. Run tests for at least 1-2 full business cycles.

Not documenting hypotheses

Without a clear hypothesis before the test, it is easy to rationalize any outcome as expected. Write down what you expect to happen and why before launching.

Changing test parameters mid-flight

Adjusting traffic allocation, modifying variants, or changing success metrics during a test invalidates your statistical framework. If you need changes, start a new test.

Frequently Asked Questions

How do you calculate sample size for an A/B test?

To calculate sample size, you need four inputs: your baseline conversion rate, the minimum detectable effect (MDE) you want to detect, your desired statistical significance level (typically 95%), and statistical power (typically 80%). The formula combines these using z-scores from the normal distribution. Our calculator above does this automatically — just enter your values and get an instant result.

What is a good minimum detectable effect (MDE)?

A "good" MDE depends on your business context. For most A/B tests, a relative MDE of 5-20% is practical. Smaller MDEs require much larger sample sizes. The key is to choose the smallest effect that would be meaningful for your business — if a 2% improvement in conversion rate would significantly impact revenue, use 2% as your MDE.

How long should I run my A/B test?

Run your test until you reach the required sample size, with a minimum of 7 days to capture weekly traffic patterns (weekday vs. weekend behavior). Never stop a test early just because results look significant — this leads to false positives. Our calculator estimates duration based on your daily traffic.

What is the difference between statistical significance and statistical power?

Statistical significance (1 - alpha) is the probability of avoiding a false positive — concluding there is a difference when there is not. Statistical power (1 - beta) is the probability of detecting a real difference when one exists. A 95% significance level with 80% power means you have a 5% chance of a false positive and a 20% chance of missing a real effect.

Can I stop a test early if the results look significant?

No, stopping tests early based on interim significance is called "peeking" and dramatically inflates your false positive rate. At 95% significance, you might see a p-value below 0.05 multiple times during the test by pure chance. Always run the test to the pre-calculated sample size or use a sequential testing framework designed for continuous monitoring.

What is a p-value?

A p-value is the probability of observing a difference as extreme as (or more extreme than) your test results, assuming there is actually no difference between control and variant. A p-value of 0.03 means there is a 3% chance of seeing such a difference by random chance alone. If your p-value is below your significance threshold (e.g., 0.05), the result is considered statistically significant.

Should I use a one-tailed or two-tailed test?

Use a two-tailed test in almost all cases. A two-tailed test checks whether the variant is significantly different (better or worse) from the control. A one-tailed test only checks one direction. While one-tailed tests require smaller sample sizes, they can miss negative effects. Our calculator uses two-tailed tests by default for maximum reliability.

Why do A/B tests need such large sample sizes?

Large sample sizes are needed to distinguish real effects from random noise. The smaller the effect you want to detect, the more data you need. A 1% absolute change in conversion rate requires roughly 25x more visitors than a 5% change. This is why choosing a practical MDE is critical — testing for unrealistically small effects wastes time and traffic.
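That 25x figure falls straight out of the δ² term in the denominator of the sample size formula (a quick check; the conversion-variance term shifts it slightly in practice):

```python
# Required sample size scales with 1/delta^2. Detecting a 1-point absolute
# change instead of a 5-point change multiplies the required sample by:
big_delta, small_delta = 5.0, 1.0  # absolute MDEs, in percentage points
print((big_delta / small_delta) ** 2)  # → 25.0
```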

What happens if my test is underpowered?

An underpowered test (too few visitors) means you are unlikely to detect a real improvement even if one exists. You might conclude "no significant difference" when the variant is actually better. This wastes the effort of building and running the test. Worse, you might make incorrect decisions based on noisy, unreliable data.

Can I test more than two variations at once?

Each additional variation multiplies the required total sample size. A test with 2 variations (control + 1 variant) needs 2 × N visitors. With 4 variations, you need 4 × N. More variations also increase the risk of false positives through multiple comparisons. Unless you have very high traffic, stick to 2-3 variations per test.

You Know Which Variant Won. Now Understand Why.

Userloom shows you exactly what users did differently in each variant — with session replays, heatmaps, and user journey analytics. Go beyond p-values and understand the behavior behind the numbers.

Try Userloom Free