Project 3

Product Recommendation Engine

Build a market basket analysis system using the Apriori algorithm to discover product associations and generate "Frequently Bought Together" recommendations.

Project Overview

Difficulty: Advanced

Tools: SQL, Python (pandas, mlxtend, matplotlib, seaborn)

Estimated Time: 8-10 hours

Note: This is a SQL + Python only project. The analysis requires algorithmic processing that is beyond Excel's native capabilities.

You are a data analyst at MegaMart Online, a large e-commerce platform. The product team wants to implement a "Frequently Bought Together" feature on product pages and a "You Might Also Like" section in the shopping cart.

Using Market Basket Analysis and the Apriori algorithm, you will analyze historical purchase data to discover which products are commonly purchased together, then generate actionable association rules.
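Before formalizing this with mlxtend later in the project, the core idea can be sketched in a few lines of plain Python: count how often each pair of products co-occurs across baskets, then keep only the pairs that clear a minimum count. The baskets below are hypothetical toy data for illustration only.

```python
from itertools import combinations
from collections import Counter

# Toy baskets (hypothetical data, for illustration only)
baskets = [
    {"earbuds", "phone case", "charger"},
    {"earbuds", "phone case"},
    {"yoga mat", "water bottle"},
    {"earbuds", "charger"},
]

# Count every 2-item combination across baskets
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Keep pairs meeting a minimum support count (here: 2 of 4 baskets = 50%)
min_count = 2
frequent_pairs = {p: c for p, c in pair_counts.items() if c >= min_count}
print(frequent_pairs)
```

The Apriori algorithm generalizes this level-wise: frequent pairs seed candidate triplets, and any itemset containing an infrequent subset is pruned without being counted.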

Dataset Description

The data is stored in a transactional format with each row representing a single item within an order.

Table: order_baskets

Column | Data Type | Description
order_id | INT | Unique order identifier
customer_id | INT | Customer who placed the order
order_date | DATE | Date the order was placed
product_id | INT (FK) | References the products table
product_name | VARCHAR(100) | Name of the product
category | VARCHAR(50) | Product category
quantity | INT | Units purchased
unit_price | DECIMAL(10,2) | Price per unit

Table: products

Column | Data Type | Description
product_id | INT (PK) | Unique product identifier
product_name | VARCHAR(100) | Product display name
category | VARCHAR(50) | Electronics, Kitchen, Sports, etc.
subcategory | VARCHAR(50) | More specific classification
price | DECIMAL(10,2) | Current retail price

SQL Analysis

These SQL queries explore co-purchase patterns at the database level before feeding the data into the Apriori algorithm in Python.

Query 1: Product Co-Purchase Frequency

Finds all pairs of products that appear in the same order and counts how often each pair occurs. This is the foundation of market basket analysis.

SELECT
    a.product_name  AS product_a,
    b.product_name  AS product_b,
    COUNT(DISTINCT a.order_id) AS co_purchase_count
FROM order_baskets a
JOIN order_baskets b
    ON a.order_id = b.order_id
    AND a.product_id < b.product_id
GROUP BY a.product_name, b.product_name
HAVING COUNT(DISTINCT a.order_id) >= 5
ORDER BY co_purchase_count DESC
LIMIT 20;

Query 2: Product Pair Analysis with Support

Calculates the support metric for each product pair: the percentage of all orders that contain both products.

WITH total_orders AS (
    SELECT COUNT(DISTINCT order_id) AS order_count
    FROM order_baskets
),
product_pairs AS (
    SELECT
        a.product_name  AS product_a,
        b.product_name  AS product_b,
        COUNT(DISTINCT a.order_id) AS pair_count
    FROM order_baskets a
    JOIN order_baskets b
        ON a.order_id = b.order_id
        AND a.product_id < b.product_id
    GROUP BY a.product_name, b.product_name
)
SELECT
    pp.product_a,
    pp.product_b,
    pp.pair_count,
    ROUND(pp.pair_count * 100.0 / t.order_count, 4) AS support_pct
FROM product_pairs pp
CROSS JOIN total_orders t
WHERE pp.pair_count >= 5
ORDER BY support_pct DESC
LIMIT 20;

Query 3: Category-Level Associations

Identifies which product categories are most commonly purchased together, providing a higher-level view of buying patterns.

SELECT
    a.category   AS category_a,
    b.category   AS category_b,
    COUNT(DISTINCT a.order_id) AS co_occurrence,
    ROUND(
        COUNT(DISTINCT a.order_id) * 100.0 /
        (SELECT COUNT(DISTINCT order_id) FROM order_baskets),
        2
    ) AS support_pct
FROM order_baskets a
JOIN order_baskets b
    ON a.order_id = b.order_id
    AND a.category < b.category
GROUP BY a.category, b.category
ORDER BY co_occurrence DESC;

Query 4: Basket Size Distribution

Understanding basket sizes helps you set appropriate minimum support thresholds for the Apriori algorithm.

SELECT
    items_per_order,
    COUNT(*)       AS order_count,
    ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (), 1) AS pct
FROM (
    SELECT
        order_id,
        COUNT(DISTINCT product_id) AS items_per_order
    FROM order_baskets
    GROUP BY order_id
) basket_sizes
GROUP BY items_per_order
ORDER BY items_per_order;

Python Analysis

Use the mlxtend library to apply the Apriori algorithm and generate association rules from the transaction data.

Data Preparation

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder

# Load transaction data
baskets = pd.read_csv('order_baskets.csv', parse_dates=['order_date'])

print(f"Total rows: {len(baskets):,}")
print(f"Unique orders: {baskets['order_id'].nunique():,}")
print(f"Unique products: {baskets['product_name'].nunique():,}")
print(f"Date range: {baskets['order_date'].min().date()} to {baskets['order_date'].max().date()}")

# Create transaction list: each order becomes a list of product names
transactions = baskets.groupby('order_id')['product_name'].apply(list).tolist()

# Filter to orders with 2+ items (single-item orders have no co-purchases)
transactions = [t for t in transactions if len(t) >= 2]
print(f"\nMulti-item transactions: {len(transactions):,}")

Apriori Algorithm

# One-hot encode the transactions
te = TransactionEncoder()
te_array = te.fit(transactions).transform(transactions)
basket_df = pd.DataFrame(te_array, columns=te.columns_)

print(f"Transaction matrix shape: {basket_df.shape}")
print(f"(rows=transactions, columns=unique products)")

# Apply Apriori to find frequent itemsets
# min_support=0.01 means the itemset must appear in at least 1% of transactions
frequent_items = apriori(
    basket_df,
    min_support=0.01,
    use_colnames=True,
    max_len=3  # itemsets up to 3 items (singles, pairs, and triplets)
)

frequent_items['length'] = frequent_items['itemsets'].apply(len)
print(f"\nFrequent itemsets found: {len(frequent_items)}")
print(f"  Pairs: {(frequent_items['length'] == 2).sum()}")
print(f"  Triplets: {(frequent_items['length'] == 3).sum()}")

# Show top frequent pairs
top_pairs = (
    frequent_items[frequent_items['length'] == 2]
    .sort_values('support', ascending=False)
    .head(15)
)
print("\nTop 15 Frequent Product Pairs:")
for _, row in top_pairs.iterrows():
    items = ', '.join(row['itemsets'])
    print(f"  {items} (support: {row['support']:.4f})")

Association Rules

# Generate association rules from frequent itemsets
rules = association_rules(
    frequent_items,
    metric="lift",
    min_threshold=1.0,
    num_itemsets=len(basket_df)  # number of transactions (required by mlxtend >= 0.23.3)
)

# Clean up the rules for readability
rules['antecedents'] = rules['antecedents'].apply(lambda x: ', '.join(list(x)))
rules['consequents'] = rules['consequents'].apply(lambda x: ', '.join(list(x)))

# Sort by lift (strongest associations first)
rules_sorted = rules.sort_values('lift', ascending=False)

print(f"Total association rules generated: {len(rules_sorted)}")
print("\nTop 15 Rules by Lift:")
print(rules_sorted[['antecedents', 'consequents', 'support',
                     'confidence', 'lift']].head(15).to_string(index=False))

# Filter high-confidence rules for production recommendations
high_conf = rules_sorted[
    (rules_sorted['confidence'] >= 0.3) &
    (rules_sorted['lift'] >= 1.5)
].copy()

print(f"\nHigh-quality rules (confidence >= 0.3, lift >= 1.5): {len(high_conf)}")
high_conf.to_csv('recommendation_rules.csv', index=False)

Visualizations

fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Plot 1: Support vs Confidence scatter
scatter = axes[0].scatter(
    rules_sorted['support'],
    rules_sorted['confidence'],
    c=rules_sorted['lift'],
    cmap='YlOrRd',
    alpha=0.6,
    s=40
)
axes[0].set_xlabel('Support')
axes[0].set_ylabel('Confidence')
axes[0].set_title('Support vs Confidence (colored by Lift)')
plt.colorbar(scatter, ax=axes[0], label='Lift')

# Plot 2: Top 10 rules by lift (horizontal bar)
top_rules = rules_sorted.head(10)
labels = [f"{r['antecedents']} -> {r['consequents']}" for _, r in top_rules.iterrows()]
axes[1].barh(range(len(labels)), top_rules['lift'].values, color='#e07a2f')
axes[1].set_yticks(range(len(labels)))
axes[1].set_yticklabels(labels, fontsize=8)
axes[1].set_xlabel('Lift')
axes[1].set_title('Top 10 Association Rules by Lift')
axes[1].invert_yaxis()

# Plot 3: Category co-occurrence heatmap
cat_pairs = baskets[['order_id', 'category']].merge(
    baskets[['order_id', 'category']],
    on='order_id',
    suffixes=('_a', '_b')
)
cat_pairs = cat_pairs[cat_pairs['category_a'] < cat_pairs['category_b']]
cat_matrix = cat_pairs.pivot_table(
    index='category_a',
    columns='category_b',
    values='order_id',
    aggfunc='nunique'
).fillna(0)

sns.heatmap(cat_matrix, annot=True, fmt='.0f', cmap='YlOrRd',
            linewidths=0.5, ax=axes[2])
axes[2].set_title('Category Co-Occurrence Counts')

plt.suptitle('Product Recommendation Analysis', fontsize=16, fontweight='bold', y=1.02)
plt.tight_layout()
plt.savefig('recommendation_analysis.png', dpi=150, bbox_inches='tight')
plt.show()

Building the Recommendation Function

def get_recommendations(product_name, rules_df, top_n=5):
    """
    Given a product name, return the top N recommended products
    based on association rules sorted by confidence * lift.
    """
    matching = rules_df[rules_df['antecedents'].str.contains(product_name, case=False, regex=False)]
    matching = matching.copy()
    matching['score'] = matching['confidence'] * matching['lift']
    matching = matching.sort_values('score', ascending=False)

    recommendations = []
    for _, rule in matching.head(top_n).iterrows():
        recommendations.append({
            'recommended_product': rule['consequents'],
            'confidence': round(rule['confidence'], 3),
            'lift': round(rule['lift'], 2),
            'score': round(rule['score'], 3)
        })

    return pd.DataFrame(recommendations)

# Example usage
print("Recommendations for 'Wireless Bluetooth Earbuds':")
print(get_recommendations('Wireless Bluetooth Earbuds', rules_sorted))

print("\nRecommendations for 'Yoga Mat':")
print(get_recommendations('Yoga Mat', rules_sorted))

Excel Instructions

Note: Market basket analysis and the Apriori algorithm require iterative computation over potentially millions of itemset combinations. This is not feasible in Excel. The SQL queries extract the raw co-purchase data, and Python handles the algorithmic processing. The exported recommendation_rules.csv can be opened in Excel for filtering and presentation purposes, but the analysis itself should be performed in Python.

Key Insights & Recommendations

Present these findings to the product and engineering teams to implement the recommendation system.

Cross-Category Bundles

The strongest associations often cross categories (e.g., phone case + screen protector, yoga mat + resistance bands). Create cross-category bundle discounts on product pages to increase average order value.

Confidence-Based Ranking

Use confidence as the primary ranking metric for "Frequently Bought Together" widgets. A rule with 45% confidence means that 45% of orders containing Product A also include Product B, making it a strong recommendation.

Lift Threshold for Quality

Only deploy rules with lift > 1.5. A lift of 1.0 means products are purchased together at random rates. Rules below this threshold are noise and will degrade recommendation quality.
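The metrics behind these thresholds reduce to simple ratios over order counts. A minimal worked example, with hypothetical counts chosen to match the 45%-confidence case above:

```python
# Hypothetical counts for a rule A -> B (illustrative numbers only)
n_orders = 1000   # total orders
n_a = 200         # orders containing product A
n_b = 100         # orders containing product B
n_ab = 90         # orders containing both A and B

support = n_ab / n_orders             # 0.09: both products in 9% of orders
confidence = n_ab / n_a               # 0.45: 45% of A-orders also contain B
lift = confidence / (n_b / n_orders)  # 4.5: co-purchase is 4.5x the chance rate

print(support, confidence, lift)
```

Lift divides the rule's confidence by B's baseline purchase rate, which is why a value of 1.0 marks statistical independence.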

Seasonal Variation

Run the analysis quarterly. Product associations change with seasons (e.g., sunscreen + sunglasses in summer vs. gloves + scarves in winter). Stale rules lead to irrelevant recommendations.
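One way to set up the quarterly re-run is to slice the basket table by calendar quarter before building the transaction lists, then repeat the Apriori step on each slice. A sketch, assuming the order_baskets columns used in this project:

```python
import pandas as pd

def quarterly_transactions(baskets: pd.DataFrame) -> dict:
    """Split basket rows by calendar quarter and return a transaction
    list per quarter, ready for TransactionEncoder + apriori."""
    baskets = baskets.copy()
    baskets["quarter"] = baskets["order_date"].dt.to_period("Q")
    return {
        str(q): grp.groupby("order_id")["product_name"].apply(list).tolist()
        for q, grp in baskets.groupby("quarter")
    }
```

Each value in the returned dict has the same shape as the `transactions` list built earlier, so the encoding and mining steps can be reused per quarter unchanged.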

Cart-Page vs. Product-Page

Use high-confidence pairs on product detail pages ("Frequently Bought Together") and use triplet rules on the cart page ("Complete Your Order"). This two-tier approach maximizes cross-sell opportunities.
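Because the script stringifies antecedents with ", " separators, the two tiers can be separated by counting items in the antecedent string. A sketch with hypothetical rules:

```python
import pandas as pd

# Hypothetical rules in the stringified form produced earlier
rules = pd.DataFrame({
    "antecedents": ["Yoga Mat", "Yoga Mat, Resistance Bands", "Phone Case"],
    "consequents": ["Resistance Bands", "Foam Roller", "Screen Protector"],
    "confidence": [0.45, 0.38, 0.52],
})

# Number of items in each antecedent (separator count + 1)
n_items = rules["antecedents"].str.count(",") + 1

product_page_rules = rules[n_items == 1]  # single-item antecedents
cart_page_rules = rules[n_items >= 2]     # multi-item antecedents
```

Single-item antecedents match a product detail page (the shopper is viewing one product), while multi-item antecedents match a cart that already contains both trigger products.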

Revenue Impact Estimate

Industry benchmarks show "Frequently Bought Together" features increase average order value by 10-15%. With the top 50 rules covering 60% of product pages, combine these figures with MegaMart's order volume and average order value to produce a concrete incremental revenue estimate.
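A back-of-envelope version of that estimate multiplies order volume, average order value, the AOV uplift, and rule coverage. Every input below is a hypothetical placeholder, not a figure from the dataset:

```python
# All inputs are hypothetical placeholders for illustration
monthly_orders = 50_000
avg_order_value = 80.00
aov_uplift = 0.10      # low end of the 10-15% benchmark
page_coverage = 0.60   # share of product pages served by the top rules

incremental_monthly_revenue = (
    monthly_orders * avg_order_value * aov_uplift * page_coverage
)
print(f"${incremental_monthly_revenue:,.0f}")
```

Swapping in the platform's real traffic and AOV numbers turns the "significant" claim into a defensible figure for the product team.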

Knowledge Check Quiz

Test your understanding of market basket analysis and association rule mining.

Question 1

In association rule mining, what does "support" measure?

Question 2

A rule has confidence of 0.6 and lift of 2.5. What does a lift of 2.5 mean?

Question 3

Why does the SQL co-purchase query use a.product_id < b.product_id in the JOIN condition?

Question 4

What does the TransactionEncoder in mlxtend do?

Question 5

Why should you set a minimum support threshold when running the Apriori algorithm?
