A/B Testing Setup: Statistical Design, Implementation, and Analysis | OpsBlu Docs


How to design, implement, and analyze A/B tests with statistical rigor. Covers sample size calculation, randomization, experiment tracking,...

What Makes a Valid A/B Test

An A/B test (split test) compares two versions of a page or feature by randomly assigning users to each version and measuring which performs better on a predefined metric. The key word is randomly — without proper randomization, your results are meaningless.

A valid test requires:

  1. A clear hypothesis ("Changing the CTA from 'Learn More' to 'Start Free Trial' will increase signups by 10%").
  2. A single primary metric (conversion rate, revenue per visitor, engagement rate).
  3. Sufficient sample size calculated before the test starts.
  4. Random, even traffic assignment between variants.
  5. A predetermined test duration — no peeking and stopping early.
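The five requirements above can be captured in a pre-registration record written down before any traffic is assigned. The structure below is a hypothetical sketch, not a required format; all field names and values are illustrative:

```python
# Hypothetical pre-registration record, filled in BEFORE the test launches.
experiment = {
    "id": "pricing-cta-test",
    "hypothesis": "Changing the CTA from 'Learn More' to 'Start Free Trial' "
                  "will increase signups by 10%",
    "primary_metric": "signup_conversion_rate",
    "baseline_rate": 0.05,
    "minimum_detectable_effect": 0.10,   # 10% relative lift
    "sample_size_per_variant": 31000,    # calculated before launch
    "traffic_split": {"control": 0.5, "variant": 0.5},
    "planned_duration_days": 62,         # committed up front; no early stopping
}

def is_ready_to_launch(exp):
    """Refuse to launch until every pre-registration field is filled in."""
    required = ["hypothesis", "primary_metric", "sample_size_per_variant",
                "traffic_split", "planned_duration_days"]
    return all(exp.get(k) for k in required)

print(is_ready_to_launch(experiment))  # True
```

Writing this down up front is what makes "no peeking" enforceable later: the stopping rule exists before the data does.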

Sample Size Calculation

Calculate the sample size before starting. Running until you "see a winner" inflates false positive rates (this is called the peeking problem).

Formula inputs:

  • Baseline conversion rate: Your current conversion rate (e.g., 3.2%).
  • Minimum detectable effect (MDE): The smallest improvement you care about (e.g., 10% relative lift → 3.2% to 3.52%).
  • Statistical power: Probability of detecting a real effect (standard: 80%).
  • Significance level: False positive threshold (standard: 5%, α = 0.05).

Quick reference for a 5% baseline conversion rate (two-sided α = 0.05, 80% power, fixed-horizon test):

Minimum Detectable Effect      Sample Size Per Variant   Total (Both Variants)
20% relative (5% → 6%)         ~8,200                    ~16,400
10% relative (5% → 5.5%)       ~31,000                   ~62,000
5% relative (5% → 5.25%)       ~122,000                  ~244,000

Lower baseline rates and smaller effects require dramatically more traffic. If your site gets 1,000 visits/day split evenly, each variant collects ~500 visitors/day, so detecting a 10% relative lift on a 5% conversion rate (~31,000 visitors per variant) takes roughly two months.

Tools: Evan Miller's calculator, Optimizely's calculator, or calculate in Python:

import math

from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.05
target = 0.055  # 10% relative lift: 5% → 5.5%
effect = proportion_effectsize(target, baseline)  # Cohen's h for the two rates
power_analysis = NormalIndPower()
sample_size = power_analysis.solve_power(
    effect_size=effect,
    power=0.8,
    alpha=0.05,
    ratio=1.0,
)
# Round up: a fractional visitor cannot be observed
print(f"Sample size per variant: {math.ceil(sample_size)}")  # roughly 31,000

Implementation Architecture

Client-Side (JavaScript)

// Simple hash-based assignment (deterministic per user)
function hashCode(str) {
  // djb2-style string hash; any stable 32-bit hash works here
  let hash = 0;
  for (let i = 0; i < str.length; i++) {
    hash = (hash * 31 + str.charCodeAt(i)) | 0;
  }
  return Math.abs(hash);
}

function getVariant(userId, experimentId) {
  const hash = hashCode(userId + experimentId);
  return hash % 2 === 0 ? 'control' : 'variant';
}

// Assign and track
const variant = getVariant(getUserId(), 'pricing-cta-test');

if (variant === 'variant') {
  document.querySelector('.cta').textContent = 'Start Free Trial';
}

// Log exposure to analytics
gtag('event', 'experiment_exposure', {
  experiment_id: 'pricing-cta-test',
  variant: variant
});
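The same deterministic bucketing works server-side. A minimal Python sketch, using hashlib for a stable hash (the built-in hash() is salted per process, so it cannot be used for assignment); the function name is illustrative:

```python
import hashlib

def get_variant(user_id: str, experiment_id: str) -> str:
    """Deterministically assign a user to a variant.

    Hashing user_id together with experiment_id gives a stable bucket:
    the same user always sees the same variant, and different experiments
    get independent assignments for that user.
    """
    digest = hashlib.md5(f"{user_id}:{experiment_id}".encode()).hexdigest()
    return "control" if int(digest, 16) % 2 == 0 else "variant"

# Same user, same experiment -> same assignment on every request
assert get_variant("user-123", "pricing-cta-test") == \
       get_variant("user-123", "pricing-cta-test")
print(get_variant("user-123", "pricing-cta-test"))
```

MD5 is fine here because the hash is used only for bucketing, not security; any stable hash with a roughly uniform output distribution would do.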

Using a Testing Platform

Google Optimize was sunset in July 2023. Use one of these alternatives:

Platform       Type                         Best For
VWO            Visual editor + code         Marketing teams needing a WYSIWYG editor
LaunchDarkly   Feature flags                Engineering teams with existing flag infrastructure
Statsig        Product experimentation      Product teams needing warehouse-native analysis
Optimizely     Enterprise experimentation   Large-scale programs with multiple teams
PostHog        Open-source                  Teams wanting self-hosted control

Data Layer Integration

Track experiment exposure and conversion through your data layer for analytics:

// Push experiment assignment to data layer
dataLayer.push({
  event: 'experiment_viewed',
  experiment_id: 'pricing-cta-test',
  experiment_variant: 'variant-b',
  experiment_name: 'Pricing CTA - Start Free Trial'
});

// Push conversion event
dataLayer.push({
  event: 'signup_completed',
  experiment_id: 'pricing-cta-test',
  experiment_variant: 'variant-b'
});

In GA4, create a custom dimension for experiment_variant and use it to segment conversion reports. In Mixpanel or Amplitude, include experiment properties on the relevant events.
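As a sketch of what the downstream analysis consumes, the exposure and conversion events above can be joined per user to produce a conversion rate per variant. The event rows here are made up; field names mirror the data-layer example:

```python
# Illustrative event rows mirroring the data-layer pushes above.
events = [
    {"event": "experiment_viewed", "user_id": "u1", "experiment_variant": "control"},
    {"event": "experiment_viewed", "user_id": "u2", "experiment_variant": "variant-b"},
    {"event": "experiment_viewed", "user_id": "u3", "experiment_variant": "variant-b"},
    {"event": "signup_completed",  "user_id": "u2", "experiment_variant": "variant-b"},
]

# First exposure wins: each user is counted once, in the variant they first saw.
exposed = {}
for e in events:
    if e["event"] == "experiment_viewed":
        exposed.setdefault(e["user_id"], e["experiment_variant"])

converted = {e["user_id"] for e in events if e["event"] == "signup_completed"}

rates = {}
for variant_name in set(exposed.values()):
    users = [u for u, v in exposed.items() if v == variant_name]
    rates[variant_name] = sum(u in converted for u in users) / len(users)

print(rates)  # e.g. {'control': 0.0, 'variant-b': 0.5}
```

In GA4 or a warehouse, the same join happens via the custom dimension; deduplicating exposures per user matters because repeat page views would otherwise inflate the denominator.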

Analyzing Results

Statistical Significance Test

After reaching your target sample size, test for significance:

from scipy.stats import chi2_contingency
import numpy as np

# Control: 500 conversions out of 10,000 visitors (5.0%)
# Variant: 550 conversions out of 10,000 visitors (5.5%)
observed = np.array([[500, 9500], [550, 9450]])

chi2, p_value, dof, expected = chi2_contingency(observed)

print(f"p-value: {p_value:.4f}")
print(f"Significant at 95%: {p_value < 0.05}")

# Confidence interval for the difference
from statsmodels.stats.proportion import confint_proportions_2indep
ci_low, ci_high = confint_proportions_2indep(
    550, 10000, 500, 10000, method='newcomb'
)
print(f"95% CI for difference: [{ci_low:.4f}, {ci_high:.4f}]")

Interpreting results:

  • p < 0.05: The difference is statistically significant at 95% confidence. You can deploy the variant.
  • p ≥ 0.05: The difference could be due to chance. Either the test needs more traffic, or the effect is smaller than your MDE.
  • Check confidence intervals: A significant result with a CI of [0.001, 0.01] means the true effect is likely 0.1% to 1% — practically small even if statistically significant.
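One way to operationalize the practical-significance check: compare the confidence interval for the absolute difference against the absolute effect implied by your pre-registered MDE. The thresholds below are illustrative:

```python
baseline = 0.05                 # control conversion rate
mde_relative = 0.10             # smallest relative lift worth shipping
ci_low, ci_high = 0.001, 0.010  # 95% CI for the absolute difference

# Convert the relative MDE to an absolute threshold: 10% of 5% = 0.005
practical_threshold = baseline * mde_relative

# If the whole CI sits below the threshold, the effect is statistically
# significant but too small to matter; if the CI straddles the threshold,
# the test cannot settle practical significance either way.
if ci_high < practical_threshold:
    verdict = "statistically significant but practically negligible"
elif ci_low >= practical_threshold:
    verdict = "both statistically and practically significant"
else:
    verdict = "significant, but practical importance is uncertain"

print(verdict)
```

With these numbers the CI straddles the threshold, so the honest conclusion is "significant, but practical importance is uncertain" rather than "ship it".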

Guardrail Metrics

Beyond your primary metric, monitor guardrails to catch negative side effects:

  • Revenue per session — a variant might increase signups but decrease revenue if it attracts lower-quality leads.
  • Bounce rate — ensures the change doesn't drive people away.
  • Page load time — new elements or scripts could hurt performance.
  • Error rate — JavaScript errors in the variant code.

If a guardrail metric degrades significantly, stop the test regardless of the primary metric result.

Common Pitfalls

Peeking and Early Stopping

Checking results daily and stopping when you see significance dramatically inflates false positives. With daily checks over a 30-day test, you have a ~30% false positive rate instead of 5%. Solution: Calculate sample size upfront and don't look until you reach it. Or use sequential testing methods (SPRT, always-valid p-values) designed for continuous monitoring.
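The inflation from peeking can be demonstrated with a quick simulation under the null hypothesis (both variants identical, so every "significant" result is a false positive). This is a minimal sketch, assuming a two-proportion z-test is run at the end of each day; the exact rate depends on the parameters:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
p = 0.05        # true conversion rate for BOTH variants: no real effect
daily = 500     # visitors per variant per day
days = 30
n_sims = 2000

false_positives = 0
for _ in range(n_sims):
    # Cumulative conversions per variant at the end of each day
    a = rng.binomial(daily, p, size=days).cumsum()
    b = rng.binomial(daily, p, size=days).cumsum()
    n = daily * np.arange(1, days + 1)

    # Two-proportion z-test performed once per day (the "peek")
    pooled = (a + b) / (2 * n)
    se = np.sqrt(pooled * (1 - pooled) * 2 / n)
    z = (a - b) / (n * se)
    p_values = 2 * norm.sf(np.abs(z))

    # Peeking: declare a winner on the FIRST day p < 0.05
    if (p_values < 0.05).any():
        false_positives += 1

print(f"False positive rate with daily peeking: {false_positives / n_sims:.1%}")
```

A single test at the fixed horizon would be wrong ~5% of the time; taking the first significant daily peek pushes that well above 20%.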

Multiple Testing

Running 5 tests simultaneously with α = 0.05 gives a ~23% chance of at least one false positive. Solution: Apply Bonferroni correction (α / n) or control false discovery rate with Benjamini-Hochberg.
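Both corrections are available in statsmodels. A sketch with made-up p-values from five simultaneous tests:

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical p-values from five simultaneous tests
p_values = [0.04, 0.01, 0.03, 0.20, 0.004]

# Bonferroni: each p-value must beat alpha / n = 0.01 (conservative)
reject_bonf, p_bonf, _, _ = multipletests(p_values, alpha=0.05,
                                          method="bonferroni")

# Benjamini-Hochberg: controls the false discovery rate (less conservative)
reject_bh, p_bh, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")

print("Bonferroni rejects:", list(reject_bonf))
print("BH rejects:        ", list(reject_bh))
```

Note the trade-off: Bonferroni keeps only the two strongest results, while Benjamini-Hochberg also keeps the borderline ones at the cost of a controlled fraction of false discoveries.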

Selection Bias

If you assign variants by alternating visitors (first → control, second → variant), bot traffic or time-of-day effects bias results. Solution: Use hash-based randomization keyed on a persistent user ID.

Survivor Bias

Only counting users who complete the full experiment ignores users who abandoned because of the variant. Solution: Analyze on intent-to-treat basis — include all users assigned to a variant, whether they completed the flow or not.
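A sketch of how the two analyses diverge, with made-up assignment and completion data:

```python
# Every user ASSIGNED to a variant, whether or not they finished the flow
assigned = {
    "u1": "control", "u2": "control", "u3": "control", "u4": "control",
    "u5": "variant", "u6": "variant", "u7": "variant", "u8": "variant",
}
converted = {"u1", "u5"}
# Users who abandoned mid-flow (e.g. bounced off the variant's longer form)
abandoned = {"u6", "u7"}

def rate(users):
    return sum(u in converted for u in users) / len(users)

for v in ("control", "variant"):
    all_users = [u for u, a in assigned.items() if a == v]
    finished = [u for u in all_users if u not in abandoned]
    print(f"{v}: intent-to-treat {rate(all_users):.0%}, "
          f"completers-only {rate(finished):.0%}")
```

Intent-to-treat shows both variants converting identically (25%), while the completers-only view inflates the variant to 50% precisely because it drove two users away before they could count against it.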

Underpowered Tests

Running a test with 1,000 visitors when you need 50,000 means you can only detect very large effects (30%+ lifts). Small but meaningful improvements (5-10%) will appear as "no significant difference." Solution: Calculate required sample size before starting and commit to running the full duration.
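Inverting the earlier power calculation shows concretely what an underpowered test can detect: fix the available traffic and scan candidate lifts for the smallest one whose required sample size fits. The brute-force scan below is a sketch; the traffic figure is illustrative:

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.05
available_per_variant = 500   # e.g. 1,000 total visitors split 50/50
analysis = NormalIndPower()

# Scan relative lifts; mark which are detectable with this much traffic
for relative_lift in [0.05, 0.10, 0.20, 0.50, 1.00]:
    target = baseline * (1 + relative_lift)
    effect = proportion_effectsize(target, baseline)
    needed = analysis.solve_power(effect_size=effect, power=0.8, alpha=0.05)
    verdict = "detectable" if needed <= available_per_variant else "underpowered"
    print(f"{relative_lift:.0%} lift: need ~{needed:,.0f}/variant -> {verdict}")
```

With 500 visitors per variant, only a lift on the order of 100% (5% → 10%) clears the bar; everything smaller, including the 5-10% improvements that usually matter, is invisible.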

Next Steps