A/B Testing Analysis
Design and analyze an A/B test comparing a new checkout flow against the existing one, applying statistical hypothesis testing to determine whether the change produces a significant improvement in conversion rates.
Project Overview
You are a data analyst on the product team at an online retailer. The UX team has redesigned the checkout flow to reduce friction: fewer form fields, a progress indicator, and one-click payment options. Before rolling the new design out to all users, you need to run a controlled A/B test in which the control group (A) sees the old checkout and the treatment group (B) sees the new one. Your job is to extract the test data with SQL, perform rigorous statistical analysis in Python, and build an Excel calculator that stakeholders can reuse for future tests.
Dataset Description
The dataset represents a two-week A/B test with user-level event tracking across the checkout funnel.
| Table | Key Columns | Rows |
|---|---|---|
| experiments | experiment_id, experiment_name, start_date, end_date, traffic_split, status | 5 |
| users | user_id, variant (A/B), signup_date, device_type, country, segment | 20,000 |
| events | event_id, user_id, event_type, event_timestamp, page_url, session_id | 185,000 |
| conversions | conversion_id, user_id, order_id, conversion_date, revenue, items_count | 3,400 |
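Before analyzing results, it is worth confirming that the observed variant split matches the experiment's configured traffic_split; a sample ratio mismatch (SRM) usually signals a broken randomization or logging bug and invalidates the test. A minimal sketch using a chi-square goodness-of-fit test (the helper name `srm_check` and the counts below are illustrative, not from the dataset):

```python
from scipy import stats

def srm_check(n_a: int, n_b: int, expected_split: float = 0.5) -> float:
    """Chi-square goodness-of-fit test for sample ratio mismatch.

    Returns the p-value; a very small value (e.g. < 0.001) suggests the
    observed split deviates from the configured traffic allocation.
    """
    total = n_a + n_b
    expected = [total * expected_split, total * (1 - expected_split)]
    _, p_value = stats.chisquare([n_a, n_b], f_exp=expected)
    return p_value

# Illustrative counts, not from the dataset
p = srm_check(10_050, 9_950)
print(f"SRM p-value: {p:.4f}")  # well above 0.001, so the split looks healthy
```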
SQL Analysis
1. Conversion Rates by Variant
Calculate the primary metric: what percentage of users in each variant completed a purchase.
SELECT
u.variant,
COUNT(DISTINCT u.user_id) AS total_users,
COUNT(DISTINCT c.user_id) AS converted_users,
ROUND(
COUNT(DISTINCT c.user_id) * 100.0
/ NULLIF(COUNT(DISTINCT u.user_id), 0), 2
) AS conversion_rate_pct
FROM users u
LEFT JOIN conversions c ON c.user_id = u.user_id
GROUP BY u.variant;
2. User Segmentation Within Test
Break down conversion rates by device type and country to check for interaction effects.
-- By device type
SELECT
u.variant,
u.device_type,
COUNT(DISTINCT u.user_id) AS users,
COUNT(DISTINCT c.user_id) AS conversions,
ROUND(
COUNT(DISTINCT c.user_id) * 100.0
/ NULLIF(COUNT(DISTINCT u.user_id), 0), 2
) AS conv_rate_pct
FROM users u
LEFT JOIN conversions c ON c.user_id = u.user_id
GROUP BY u.variant, u.device_type
ORDER BY u.variant, u.device_type;
-- By country (top 5 countries)
WITH top_countries AS (
SELECT country, COUNT(*) AS cnt
FROM users
GROUP BY country
ORDER BY cnt DESC
LIMIT 5
)
SELECT
u.variant,
u.country,
COUNT(DISTINCT u.user_id) AS users,
COUNT(DISTINCT c.user_id) AS conversions,
ROUND(
COUNT(DISTINCT c.user_id) * 100.0
/ NULLIF(COUNT(DISTINCT u.user_id), 0), 2
) AS conv_rate_pct
FROM users u
JOIN top_countries tc ON tc.country = u.country
LEFT JOIN conversions c ON c.user_id = u.user_id
GROUP BY u.variant, u.country
ORDER BY u.variant, conv_rate_pct DESC;
3. Funnel Step Analysis
Track drop-off at each checkout step to see where the new flow improves retention.
WITH funnel AS (
    SELECT
        u.variant,
        e.event_type,
        COUNT(DISTINCT e.user_id) AS users_at_step
    FROM users u
    JOIN events e ON e.user_id = u.user_id
    WHERE e.event_type IN ('cart_view', 'checkout_start', 'payment_info', 'order_confirm')
    GROUP BY u.variant, e.event_type
),
ordered AS (
    SELECT
        variant,
        event_type,
        users_at_step,
        CASE event_type
            WHEN 'cart_view' THEN 1
            WHEN 'checkout_start' THEN 2
            WHEN 'payment_info' THEN 3
            WHEN 'order_confirm' THEN 4
        END AS step_order
    FROM funnel
)
SELECT
    variant,
    event_type,
    users_at_step,
    LAG(users_at_step) OVER (PARTITION BY variant ORDER BY step_order) AS prev_step_users,
    ROUND(
        users_at_step * 100.0
        / NULLIF(LAG(users_at_step) OVER (PARTITION BY variant ORDER BY step_order), 0), 1
    ) AS step_conversion_pct
FROM ordered
ORDER BY variant, step_order;
Python Analysis
1. Chi-Square Test for Conversion Rates
import pandas as pd
import numpy as np
from scipy import stats
users_df = pd.read_csv('users.csv')
conv_df = pd.read_csv('conversions.csv')
# Mark converted users
users_df['converted'] = users_df['user_id'].isin(conv_df['user_id']).astype(int)
# Build contingency table
ct = pd.crosstab(users_df['variant'], users_df['converted'], margins=False)
print("Contingency Table:")
print(ct)
print()
# Chi-square test
chi2, p_value, dof, expected = stats.chi2_contingency(ct)
print(f"Chi-square statistic: {chi2:.4f}")
print(f"P-value: {p_value:.6f}")
print(f"Degrees of freedom: {dof}")
print(f"Significant at 0.05: {'Yes' if p_value < 0.05 else 'No'}")
2. Z-Test for Proportions
from statsmodels.stats.proportion import proportions_ztest
# Group data
group_a = users_df[users_df['variant'] == 'A']
group_b = users_df[users_df['variant'] == 'B']
n_a, conv_a = len(group_a), group_a['converted'].sum()
n_b, conv_b = len(group_b), group_b['converted'].sum()
rate_a = conv_a / n_a
rate_b = conv_b / n_b
print(f"Control (A): {conv_a}/{n_a} = {rate_a:.4f}")
print(f"Treatment (B): {conv_b}/{n_b} = {rate_b:.4f}")
print(f"Lift: {(rate_b - rate_a) / rate_a * 100:.2f}%\n")
# Two-proportion z-test
z_stat, p_val = proportions_ztest(
count=[conv_b, conv_a],
nobs=[n_b, n_a],
alternative='larger' # one-sided: B > A
)
print(f"Z-statistic: {z_stat:.4f}")
print(f"P-value (one-sided): {p_val:.6f}")
3. Confidence Intervals
from statsmodels.stats.proportion import proportion_confint
# 95% confidence intervals for each variant
ci_a = proportion_confint(conv_a, n_a, alpha=0.05, method='wilson')
ci_b = proportion_confint(conv_b, n_b, alpha=0.05, method='wilson')
print(f"Control (A): {rate_a:.4f} 95% CI [{ci_a[0]:.4f}, {ci_a[1]:.4f}]")
print(f"Treatment (B): {rate_b:.4f} 95% CI [{ci_b[0]:.4f}, {ci_b[1]:.4f}]")
# Confidence interval for the difference
import matplotlib.pyplot as plt
diff = rate_b - rate_a
se_diff = np.sqrt(rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b)
ci_diff = (diff - 1.96 * se_diff, diff + 1.96 * se_diff)
print(f"\nDifference: {diff:.4f} 95% CI [{ci_diff[0]:.4f}, {ci_diff[1]:.4f}]")
# Plot confidence intervals
fig, ax = plt.subplots(figsize=(8, 4))
labels = ['Control (A)', 'Treatment (B)']
rates = [rate_a, rate_b]
cis = [ci_a, ci_b]
colors = ['#e74c3c', '#2ecc71']
for i, (label, rate, ci, color) in enumerate(zip(labels, rates, cis, colors)):
ax.errorbar(rate, i, xerr=[[rate - ci[0]], [ci[1] - rate]],
fmt='o', color=color, markersize=10, capsize=8, linewidth=2)
ax.set_yticks([0, 1])
ax.set_yticklabels(labels)
ax.set_xlabel('Conversion Rate')
ax.set_title('Conversion Rate with 95% Confidence Intervals')
ax.axvline(x=rate_a, color='gray', linestyle='--', alpha=0.5)
plt.tight_layout()
plt.savefig('ab_confidence_intervals.png', dpi=150)
plt.show()
4. Power Analysis and Sample Size
from statsmodels.stats.power import NormalIndPower
power_analysis = NormalIndPower()
# Calculate required sample size per group
# Detect a 2 percentage point lift from 15% baseline
baseline = 0.15
mde = 0.02 # minimum detectable effect
effect_size = mde / np.sqrt(baseline * (1 - baseline))  # standardized effect (normal approximation; statsmodels' proportion_effectsize gives Cohen's h as an alternative)
required_n = power_analysis.solve_power(
effect_size=effect_size,
alpha=0.05,
power=0.80,
alternative='larger'
)
print(f"Baseline conversion rate: {baseline:.0%}")
print(f"Minimum detectable effect: {mde:.0%} absolute")
print(f"Required sample size/group: {int(np.ceil(required_n)):,}")
print(f"Total required participants: {int(np.ceil(required_n)) * 2:,}")
# Power curve
sample_sizes = np.arange(500, 20001, 500)
powers = [
    power_analysis.solve_power(effect_size=effect_size, nobs1=n,
                               alpha=0.05, alternative='larger')
    for n in sample_sizes
]
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(sample_sizes, powers, color='#3498db', linewidth=2)
ax.axhline(y=0.80, color='red', linestyle='--', label='80% Power Threshold')
ax.axvline(x=required_n, color='green', linestyle='--', alpha=0.7, label=f'Required n = {int(np.ceil(required_n)):,}')
ax.set_xlabel('Sample Size per Group')
ax.set_ylabel('Statistical Power')
ax.set_title('Power Curve: Sample Size vs. Statistical Power')
ax.legend()
plt.tight_layout()
plt.savefig('power_curve.png', dpi=150)
plt.show()
Excel Test Calculator
1. Input Section — Create labeled input cells: Visitors A, Conversions A, Visitors B, Conversions B. Use data validation to ensure positive integers. Calculate conversion rates automatically: =Conversions_A / Visitors_A.
2. Z-Test Calculation — Compute the pooled proportion: =(Conv_A + Conv_B) / (Vis_A + Vis_B). Calculate the standard error: =SQRT(p_pool * (1 - p_pool) * (1/Vis_A + 1/Vis_B)). Compute the z-statistic: =(Rate_B - Rate_A) / SE. Derive the one-sided p-value: =1 - NORM.S.DIST(z_stat, TRUE) (for a two-sided test, use =2 * (1 - NORM.S.DIST(ABS(z_stat), TRUE))).
3. Significance Table — Build a results table showing: Metric, Control, Treatment, Difference, Confidence Interval, P-Value, and Verdict (Significant/Not Significant). Use conditional formatting to highlight green for significant and red for not significant.
4. Sample Size Calculator — Add a tab where users enter baseline rate, MDE, significance level, and desired power. Use the formula: n = (Z_alpha + Z_beta)^2 * (p1*(1-p1) + p2*(1-p2)) / (p2-p1)^2 with cell references for each parameter.
5. Visual Comparison Chart — Create a grouped bar chart showing conversion rates for A vs B with error bars representing 95% confidence intervals. Add a secondary line chart tracking daily conversion rates for both variants over the test duration.
6. Decision Framework — Add a summary cell with an IF formula: =IF(p_value < 0.05, IF(Rate_B > Rate_A, "Ship variant B", "Revert to A"), "Test inconclusive - extend duration"). This gives stakeholders a clear, automated recommendation.
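The spreadsheet arithmetic in steps 2 and 6 can be cross-checked with a short standard-library Python sketch; the helper names and the input counts below are hypothetical, and NORM.S.DIST(z, TRUE) corresponds to the standard normal CDF:

```python
import math

def norm_s_dist(z: float) -> float:
    """Standard normal CDF, equivalent to Excel's NORM.S.DIST(z, TRUE)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def ab_z_test(vis_a: int, conv_a: int, vis_b: int, conv_b: int):
    """Two-proportion z-test plus the decision rule from step 6."""
    rate_a, rate_b = conv_a / vis_a, conv_b / vis_b
    p_pool = (conv_a + conv_b) / (vis_a + vis_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / vis_a + 1 / vis_b))
    z = (rate_b - rate_a) / se
    p_value = 1 - norm_s_dist(z)  # one-sided: B > A
    if p_value < 0.05:
        verdict = "Ship variant B" if rate_b > rate_a else "Revert to A"
    else:
        verdict = "Test inconclusive - extend duration"
    return z, p_value, verdict

# Illustrative inputs, not test results
z, p, verdict = ab_z_test(10_000, 1_500, 10_000, 1_700)
print(f"z = {z:.4f}, one-sided p = {p:.6f}, verdict: {verdict}")
```

Entering the same four counts in the workbook should reproduce the z-statistic and p-value to rounding error, which is a quick way to validate the cell formulas.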
Key Insights
Statistical Significance
A p-value below 0.05 means that, if there were truly no difference between the variants, a result at least this extreme would occur less than 5% of the time. However, statistical significance alone does not indicate practical importance; always evaluate the effect size alongside the p-value.
Sample Size Matters
Running a test with too few participants risks a Type II error (failing to detect a real difference). Power analysis before launching ensures the test can reliably detect the minimum effect size that matters to the business.
Segment Effects
Overall test results can mask segment-level differences. The new checkout may perform well on desktop but poorly on mobile. Always check for interaction effects before making a blanket rollout decision.
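A lightweight way to screen for this is to repeat the two-proportion comparison within each segment; a formal interaction test (e.g. logistic regression with a variant × segment term) is the more rigorous follow-up. A sketch using statsmodels, with hypothetical per-device counts rather than real test data:

```python
from statsmodels.stats.proportion import proportions_ztest

def segment_ztests(segments):
    """One-sided (B > A) two-proportion z-test within each segment.

    segments maps segment name -> (visitors_A, conversions_A,
    visitors_B, conversions_B). Returns segment name ->
    (relative lift in %, one-sided p-value).
    """
    results = {}
    for name, (n_a, c_a, n_b, c_b) in segments.items():
        _, p = proportions_ztest(count=[c_b, c_a], nobs=[n_b, n_a],
                                 alternative='larger')
        lift = (c_b / n_b - c_a / n_a) / (c_a / n_a) * 100
        results[name] = (lift, p)
    return results

# Hypothetical per-device counts, not from the dataset
results = segment_ztests({
    "desktop": (6_000, 1_020, 6_000, 1_150),
    "mobile": (4_000, 480, 4_000, 470),
})
for name, (lift, p) in results.items():
    print(f"{name:>8}: lift {lift:+.1f}%, one-sided p = {p:.4f}")
```

In a pattern like this one (a clear desktop win, a flat or negative mobile result), a blanket rollout would be premature even if the pooled test is significant.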
Multiple Testing
Checking multiple metrics increases the chance of a false positive. Apply Bonferroni correction or use a pre-registered primary metric to maintain the integrity of your statistical conclusions.
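A minimal sketch of the Bonferroni correction: with m metrics, each p-value is compared against alpha / m, or equivalently each p-value is multiplied by m and capped at 1 (the helper name and the p-values below are illustrative):

```python
def bonferroni(p_values, alpha=0.05):
    """Return (Bonferroni-adjusted p-values, reject flags)."""
    m = len(p_values)
    adjusted = [min(1.0, p * m) for p in p_values]
    reject = [p < alpha / m for p in p_values]
    return adjusted, reject

# Four metrics checked in the same test; only 0.012 beats alpha/m = 0.0125
p_vals = [0.012, 0.030, 0.048, 0.210]
adjusted, reject = bonferroni(p_vals)
print("adjusted:", [round(p, 3) for p in adjusted])
print("reject:", reject)
```

Note that 0.030 and 0.048 would each look significant at the nominal 0.05 level but do not survive the corrected threshold, which is exactly the false-positive risk the correction guards against.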
Knowledge Check
What does a p-value of 0.03 mean in an A/B test?
Why is a one-sided z-test used with alternative='larger' in this analysis?
What is statistical power in the context of A/B testing?
The Wilson method is used for confidence intervals instead of the normal approximation because:
What is the purpose of the LAG window function in the funnel analysis SQL query?