A/B Testing Analysis
Design and analyze an A/B test comparing a new checkout flow against the existing one, applying statistical hypothesis testing to determine whether the change produces a significant improvement in conversion rates.
Project Overview
You are a data analyst on the product team at an online retailer. The UX team has redesigned the checkout flow to reduce friction: fewer form fields, a progress indicator, and one-click payment options. Before rolling the new design out to all users, you need to run a controlled A/B test in which the control group (A) sees the old checkout and the treatment group (B) sees the new one. Your job is to extract the test data with SQL, perform rigorous statistical analysis in Python, and build an Excel calculator that stakeholders can reuse for future tests.
Dataset Description
The dataset represents a two-week A/B test with user-level event tracking across the checkout funnel.
| Table | Key Columns | Rows |
|---|---|---|
| experiments | experiment_id, experiment_name, start_date, end_date, traffic_split, status | 5 |
| users | user_id, variant (A/B), signup_date, device_type, country, segment | 20,000 |
| events | event_id, user_id, event_type, event_timestamp, page_url, session_id | 185,000 |
| conversions | conversion_id, user_id, order_id, conversion_date, revenue, items_count | 3,400 |
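Before analyzing results, it is worth confirming that the observed variant split matches the experiment's configured traffic_split; a sample ratio mismatch (SRM) usually signals a broken randomization or logging bug and invalidates the test. A minimal sketch using a chi-square goodness-of-fit test (the helper name `srm_check` and the counts below are illustrative, not from the dataset):

```python
from scipy import stats

def srm_check(n_a: int, n_b: int, expected_split: float = 0.5) -> float:
    """Chi-square goodness-of-fit test for sample ratio mismatch.

    Returns the p-value; a very small value (e.g. < 0.001) suggests the
    observed split deviates from the configured traffic allocation.
    """
    total = n_a + n_b
    expected = [total * expected_split, total * (1 - expected_split)]
    _, p_value = stats.chisquare([n_a, n_b], f_exp=expected)
    return p_value

# Illustrative counts, not from the dataset
p = srm_check(10_050, 9_950)
print(f"SRM p-value: {p:.4f}")  # well above 0.001, so the split looks healthy
```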
SQL Analysis
1. Conversion Rates by Variant
Calculate the primary metric: what percentage of users in each variant completed a purchase.
SELECT
u.variant,
COUNT(DISTINCT u.user_id) AS total_users,
COUNT(DISTINCT c.user_id) AS converted_users,
ROUND(
COUNT(DISTINCT c.user_id) * 100.0
/ NULLIF(COUNT(DISTINCT u.user_id), 0), 2
) AS conversion_rate_pct
FROM users u
LEFT JOIN conversions c ON c.user_id = u.user_id
GROUP BY u.variant;
2. User Segmentation Within Test
Break down conversion rates by device type and country to check for interaction effects.
-- By device type
SELECT
u.variant,
u.device_type,
COUNT(DISTINCT u.user_id) AS users,
COUNT(DISTINCT c.user_id) AS conversions,
ROUND(
COUNT(DISTINCT c.user_id) * 100.0
/ NULLIF(COUNT(DISTINCT u.user_id), 0), 2
) AS conv_rate_pct
FROM users u
LEFT JOIN conversions c ON c.user_id = u.user_id
GROUP BY u.variant, u.device_type
ORDER BY u.variant, u.device_type;
-- By country (top 5 countries)
WITH top_countries AS (
SELECT country, COUNT(*) AS cnt
FROM users
GROUP BY country
ORDER BY cnt DESC
LIMIT 5
)
SELECT
u.variant,
u.country,
COUNT(DISTINCT u.user_id) AS users,
COUNT(DISTINCT c.user_id) AS conversions,
ROUND(
COUNT(DISTINCT c.user_id) * 100.0
/ NULLIF(COUNT(DISTINCT u.user_id), 0), 2
) AS conv_rate_pct
FROM users u
JOIN top_countries tc ON tc.country = u.country
LEFT JOIN conversions c ON c.user_id = u.user_id
GROUP BY u.variant, u.country
ORDER BY u.variant, conv_rate_pct DESC;
3. Funnel Step Analysis
Track drop-off at each checkout step to see where the new flow improves retention.
WITH funnel AS (
    SELECT
        u.variant,
        e.event_type,
        COUNT(DISTINCT e.user_id) AS users_at_step
    FROM users u
    JOIN events e ON e.user_id = u.user_id
    WHERE e.event_type IN ('cart_view', 'checkout_start', 'payment_info', 'order_confirm')
    GROUP BY u.variant, e.event_type
),
ordered AS (
    SELECT
        variant,
        event_type,
        users_at_step,
        CASE event_type
            WHEN 'cart_view' THEN 1
            WHEN 'checkout_start' THEN 2
            WHEN 'payment_info' THEN 3
            WHEN 'order_confirm' THEN 4
        END AS step_order
    FROM funnel
)
SELECT
    variant,
    event_type,
    users_at_step,
    LAG(users_at_step) OVER (PARTITION BY variant ORDER BY step_order) AS prev_step_users,
    ROUND(
        users_at_step * 100.0
        / NULLIF(LAG(users_at_step) OVER (PARTITION BY variant ORDER BY step_order), 0), 1
    ) AS step_conversion_pct
FROM ordered
ORDER BY variant, step_order;
Python Analysis
1. Chi-Square Test for Conversion Rates
import pandas as pd
import numpy as np
from scipy import stats
users_df = pd.read_csv('users.csv')
conv_df = pd.read_csv('conversions.csv')
# Mark converted users
users_df['converted'] = users_df['user_id'].isin(conv_df['user_id']).astype(int)
# Build contingency table
ct = pd.crosstab(users_df['variant'], users_df['converted'], margins=False)
print("Contingency Table:")
print(ct)
print()
# Chi-square test
chi2, p_value, dof, expected = stats.chi2_contingency(ct)
print(f"Chi-square statistic: {chi2:.4f}")
print(f"P-value: {p_value:.6f}")
print(f"Degrees of freedom: {dof}")
print(f"Significant at 0.05: {'Yes' if p_value < 0.05 else 'No'}")
2. Z-Test for Proportions
from statsmodels.stats.proportion import proportions_ztest
# Group data
group_a = users_df[users_df['variant'] == 'A']
group_b = users_df[users_df['variant'] == 'B']
n_a, conv_a = len(group_a), group_a['converted'].sum()
n_b, conv_b = len(group_b), group_b['converted'].sum()
rate_a = conv_a / n_a
rate_b = conv_b / n_b
print(f"Control (A): {conv_a}/{n_a} = {rate_a:.4f}")
print(f"Treatment (B): {conv_b}/{n_b} = {rate_b:.4f}")
print(f"Lift: {(rate_b - rate_a) / rate_a * 100:.2f}%\n")
# Two-proportion z-test
z_stat, p_val = proportions_ztest(
count=[conv_b, conv_a],
nobs=[n_b, n_a],
alternative='larger' # one-sided: B > A
)
print(f"Z-statistic: {z_stat:.4f}")
print(f"P-value (one-sided): {p_val:.6f}")
3. Confidence Intervals
from statsmodels.stats.proportion import proportion_confint
# 95% confidence intervals for each variant
ci_a = proportion_confint(conv_a, n_a, alpha=0.05, method='wilson')
ci_b = proportion_confint(conv_b, n_b, alpha=0.05, method='wilson')
print(f"Control (A): {rate_a:.4f} 95% CI [{ci_a[0]:.4f}, {ci_a[1]:.4f}]")
print(f"Treatment (B): {rate_b:.4f} 95% CI [{ci_b[0]:.4f}, {ci_b[1]:.4f}]")
# Confidence interval for the difference
import matplotlib.pyplot as plt
diff = rate_b - rate_a
se_diff = np.sqrt(rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b)
ci_diff = (diff - 1.96 * se_diff, diff + 1.96 * se_diff)
print(f"\nDifference: {diff:.4f} 95% CI [{ci_diff[0]:.4f}, {ci_diff[1]:.4f}]")
# Plot confidence intervals
fig, ax = plt.subplots(figsize=(8, 4))
labels = ['Control (A)', 'Treatment (B)']
rates = [rate_a, rate_b]
cis = [ci_a, ci_b]
colors = ['#e74c3c', '#2ecc71']
for i, (label, rate, ci, color) in enumerate(zip(labels, rates, cis, colors)):
ax.errorbar(rate, i, xerr=[[rate - ci[0]], [ci[1] - rate]],
fmt='o', color=color, markersize=10, capsize=8, linewidth=2)
ax.set_yticks([0, 1])
ax.set_yticklabels(labels)
ax.set_xlabel('Conversion Rate')
ax.set_title('Conversion Rate with 95% Confidence Intervals')
ax.axvline(x=rate_a, color='gray', linestyle='--', alpha=0.5)
plt.tight_layout()
plt.savefig('ab_confidence_intervals.png', dpi=150)
plt.show()
4. Power Analysis and Sample Size
from statsmodels.stats.power import NormalIndPower
power_analysis = NormalIndPower()
# Calculate required sample size per group
# Detect a 2 percentage point lift from 15% baseline
baseline = 0.15
mde = 0.02 # minimum detectable effect
effect_size = mde / np.sqrt(baseline * (1 - baseline))  # standardized effect (normal approximation; statsmodels' proportion_effectsize gives Cohen's h as an alternative)
required_n = power_analysis.solve_power(
effect_size=effect_size,
alpha=0.05,
power=0.80,
alternative='larger'
)
print(f"Baseline conversion rate: {baseline:.0%}")
print(f"Minimum detectable effect: {mde:.0%} absolute")
print(f"Required sample size/group: {int(np.ceil(required_n)):,}")
print(f"Total required participants: {int(np.ceil(required_n)) * 2:,}")
# Power curve
sample_sizes = np.arange(500, 20001, 500)
powers = [
    power_analysis.solve_power(effect_size=effect_size, nobs1=n,
                               alpha=0.05, alternative='larger')
    for n in sample_sizes
]
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(sample_sizes, powers, color='#3498db', linewidth=2)
ax.axhline(y=0.80, color='red', linestyle='--', label='80% Power Threshold')
ax.axvline(x=required_n, color='green', linestyle='--', alpha=0.7, label=f'Required n = {int(np.ceil(required_n)):,}')
ax.set_xlabel('Sample Size per Group')
ax.set_ylabel('Statistical Power')
ax.set_title('Power Curve: Sample Size vs. Statistical Power')
ax.legend()
plt.tight_layout()
plt.savefig('power_curve.png', dpi=150)
plt.show()
Excel Test Calculator
1. Input Section — Create labeled input cells: Visitors A, Conversions A, Visitors B, Conversions B. Use data validation to ensure positive integers. Calculate conversion rates automatically: =Conversions_A / Visitors_A.
2. Z-Test Calculation — Compute the pooled proportion: =(Conv_A + Conv_B) / (Vis_A + Vis_B). Calculate the standard error: =SQRT(p_pool * (1 - p_pool) * (1/Vis_A + 1/Vis_B)). Compute the z-statistic: =(Rate_B - Rate_A) / SE. Derive the one-sided p-value: =1 - NORM.S.DIST(z_stat, TRUE) (for a two-sided test, use =2 * (1 - NORM.S.DIST(ABS(z_stat), TRUE))).
3. Significance Table — Build a results table showing: Metric, Control, Treatment, Difference, Confidence Interval, P-Value, and Verdict (Significant/Not Significant). Use conditional formatting to highlight green for significant and red for not significant.
4. Sample Size Calculator — Add a tab where users enter baseline rate, MDE, significance level, and desired power. Use the formula: n = (Z_alpha + Z_beta)^2 * (p1*(1-p1) + p2*(1-p2)) / (p2-p1)^2 with cell references for each parameter.
5. Visual Comparison Chart — Create a grouped bar chart showing conversion rates for A vs B with error bars representing 95% confidence intervals. Add a secondary line chart tracking daily conversion rates for both variants over the test duration.
6. Decision Framework — Add a summary cell with an IF formula: =IF(p_value < 0.05, IF(Rate_B > Rate_A, "Ship variant B", "Revert to A"), "Test inconclusive - extend duration"). This gives stakeholders a clear, automated recommendation.
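The spreadsheet arithmetic in steps 2 and 6 can be cross-checked with a short standard-library Python sketch; the helper names and the input counts below are hypothetical, and NORM.S.DIST(z, TRUE) corresponds to the standard normal CDF:

```python
import math

def norm_s_dist(z: float) -> float:
    """Standard normal CDF, equivalent to Excel's NORM.S.DIST(z, TRUE)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def ab_z_test(vis_a: int, conv_a: int, vis_b: int, conv_b: int):
    """Two-proportion z-test plus the decision rule from step 6."""
    rate_a, rate_b = conv_a / vis_a, conv_b / vis_b
    p_pool = (conv_a + conv_b) / (vis_a + vis_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / vis_a + 1 / vis_b))
    z = (rate_b - rate_a) / se
    p_value = 1 - norm_s_dist(z)  # one-sided: B > A
    if p_value < 0.05:
        verdict = "Ship variant B" if rate_b > rate_a else "Revert to A"
    else:
        verdict = "Test inconclusive - extend duration"
    return z, p_value, verdict

# Illustrative inputs, not test results
z, p, verdict = ab_z_test(10_000, 1_500, 10_000, 1_700)
print(f"z = {z:.4f}, one-sided p = {p:.6f}, verdict: {verdict}")
```

Entering the same four counts in the workbook should reproduce the z-statistic and p-value to rounding error, which is a quick way to validate the cell formulas.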
Key Insights
Statistical Significance
A p-value below 0.05 means that, if there were truly no difference between the variants, a result at least this extreme would occur less than 5% of the time. However, statistical significance alone does not indicate practical importance; always evaluate the effect size alongside the p-value.
Sample Size Matters
Running a test with too few participants risks a Type II error (failing to detect a real difference). Power analysis before launching ensures the test can reliably detect the minimum effect size that matters to the business.
Segment Effects
Overall test results can mask segment-level differences. The new checkout may perform well on desktop but poorly on mobile. Always check for interaction effects before making a blanket rollout decision.
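A lightweight way to screen for this is to repeat the two-proportion comparison within each segment; a formal interaction test (e.g. logistic regression with a variant × segment term) is the more rigorous follow-up. A sketch using statsmodels, with hypothetical per-device counts rather than real test data:

```python
from statsmodels.stats.proportion import proportions_ztest

def segment_ztests(segments):
    """One-sided (B > A) two-proportion z-test within each segment.

    segments maps segment name -> (visitors_A, conversions_A,
    visitors_B, conversions_B). Returns segment name ->
    (relative lift in %, one-sided p-value).
    """
    results = {}
    for name, (n_a, c_a, n_b, c_b) in segments.items():
        _, p = proportions_ztest(count=[c_b, c_a], nobs=[n_b, n_a],
                                 alternative='larger')
        lift = (c_b / n_b - c_a / n_a) / (c_a / n_a) * 100
        results[name] = (lift, p)
    return results

# Hypothetical per-device counts, not from the dataset
results = segment_ztests({
    "desktop": (6_000, 1_020, 6_000, 1_150),
    "mobile": (4_000, 480, 4_000, 470),
})
for name, (lift, p) in results.items():
    print(f"{name:>8}: lift {lift:+.1f}%, one-sided p = {p:.4f}")
```

In a pattern like this one (a clear desktop win, a flat or negative mobile result), a blanket rollout would be premature even if the pooled test is significant.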
Multiple Testing
Checking multiple metrics increases the chance of a false positive. Apply Bonferroni correction or use a pre-registered primary metric to maintain the integrity of your statistical conclusions.
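A minimal sketch of the Bonferroni correction: with m metrics, each p-value is compared against alpha / m, or equivalently each p-value is multiplied by m and capped at 1 (the helper name and the p-values below are illustrative):

```python
def bonferroni(p_values, alpha=0.05):
    """Return (Bonferroni-adjusted p-values, reject flags)."""
    m = len(p_values)
    adjusted = [min(1.0, p * m) for p in p_values]
    reject = [p < alpha / m for p in p_values]
    return adjusted, reject

# Four metrics checked in the same test; only 0.012 beats alpha/m = 0.0125
p_vals = [0.012, 0.030, 0.048, 0.210]
adjusted, reject = bonferroni(p_vals)
print("adjusted:", [round(p, 3) for p in adjusted])
print("reject:", reject)
```

Note that 0.030 and 0.048 would each look significant at the nominal 0.05 level but do not survive the corrected threshold, which is exactly the false-positive risk the correction guards against.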
Knowledge Check
What does a p-value of 0.03 mean in an A/B test?
Why is a one-sided z-test used with alternative='larger' in this analysis?
What is statistical power in the context of A/B testing?
The Wilson method is used for confidence intervals instead of the normal approximation because:
What is the purpose of the LAG window function in the funnel analysis SQL query?