Product Recommendation Engine
Build a market basket analysis system using the Apriori algorithm to discover product associations and generate "Frequently Bought Together" recommendations.
Project Overview
Difficulty: Advanced
Tools: SQL, Python (pandas, mlxtend, matplotlib, seaborn)
Estimated Time: 8-10 hours
Note: This project uses SQL and Python only; the analysis requires algorithmic processing beyond Excel's native capabilities.
You are a data analyst at MegaMart Online, a large e-commerce platform. The product team wants to implement a "Frequently Bought Together" feature on product pages and a "You Might Also Like" section in the shopping cart.
Using Market Basket Analysis and the Apriori algorithm, you will analyze historical purchase data to discover which products are commonly purchased together, then generate actionable association rules. Three metrics drive the analysis:
- Support: How frequently does a product combination appear in all transactions?
- Confidence: If a customer buys Product A, what is the probability they also buy Product B?
- Lift: Does buying Product A actually increase the likelihood of buying Product B beyond random chance?
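These definitions are easy to verify by hand. A quick sketch on a hypothetical five-basket dataset (toy numbers for intuition, not MegaMart data):

```python
# Hand-computing the three metrics on a toy set of 5 baskets.
baskets = [
    {"phone case", "screen protector"},
    {"phone case", "screen protector", "charger"},
    {"phone case", "charger"},
    {"yoga mat"},
    {"screen protector"},
]
n = len(baskets)

def support(*items):
    """Fraction of all baskets that contain every given item."""
    return sum(all(i in b for i in items) for b in baskets) / n

# Support: both items appear in 2 of 5 baskets
sup_pair = support("phone case", "screen protector")   # 0.4

# Confidence: P(screen protector | phone case)
conf = sup_pair / support("phone case")                # 0.4 / 0.6 ~ 0.667

# Lift: confidence relative to the consequent's base rate
lift = conf / support("screen protector")              # ~0.667 / 0.6 ~ 1.11

print(f"support={sup_pair:.2f}, confidence={conf:.3f}, lift={lift:.3f}")
```

A lift just above 1.0, as here, means the pairing is only slightly better than chance.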
Dataset Description
The data is stored in a transactional format with each row representing a single item within an order.
Table: order_baskets
| Column | Data Type | Description |
|---|---|---|
| order_id | INT | Unique order identifier |
| customer_id | INT | Customer who placed the order |
| order_date | DATE | Date the order was placed |
| product_id | INT (FK) | References products table |
| product_name | VARCHAR(100) | Name of the product |
| category | VARCHAR(50) | Product category |
| quantity | INT | Units purchased |
| unit_price | DECIMAL(10,2) | Price per unit |
Table: products
| Column | Data Type | Description |
|---|---|---|
| product_id | INT (PK) | Unique product identifier |
| product_name | VARCHAR(100) | Product display name |
| category | VARCHAR(50) | Electronics, Kitchen, Sports, etc. |
| subcategory | VARCHAR(50) | More specific classification |
| price | DECIMAL(10,2) | Current retail price |
SQL Analysis
These SQL queries explore co-purchase patterns at the database level before feeding the data into the Apriori algorithm in Python.
Query 1: Product Co-Purchase Frequency
Finds all pairs of products that appear in the same order and counts how often each pair occurs. This is the foundation of market basket analysis.
SELECT
a.product_name AS product_a,
b.product_name AS product_b,
COUNT(DISTINCT a.order_id) AS co_purchase_count
FROM order_baskets a
JOIN order_baskets b
ON a.order_id = b.order_id
AND a.product_id < b.product_id
GROUP BY a.product_name, b.product_name
HAVING COUNT(DISTINCT a.order_id) >= 5
ORDER BY co_purchase_count DESC
LIMIT 20;
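The same pair counting can be sketched in pandas with a self-merge, which is handy for sanity-checking the SQL output. The miniature frame below is a hypothetical stand-in for order_baskets:

```python
import pandas as pd

# Tiny illustrative frame standing in for order_baskets
baskets = pd.DataFrame({
    "order_id":     [1, 1, 2, 2, 2, 3],
    "product_id":   [10, 20, 10, 20, 30, 10],
    "product_name": ["Phone Case", "Screen Protector",
                     "Phone Case", "Screen Protector", "Charger",
                     "Phone Case"],
})

# Self-merge on order_id; keep each unordered pair once (a.id < b.id),
# mirroring the SQL join condition
pairs = baskets.merge(baskets, on="order_id", suffixes=("_a", "_b"))
pairs = pairs[pairs["product_id_a"] < pairs["product_id_b"]]

co_counts = (
    pairs.groupby(["product_name_a", "product_name_b"])["order_id"]
         .nunique()
         .sort_values(ascending=False)
)
print(co_counts)
```

On this toy frame, Phone Case + Screen Protector co-occurs in two orders; the other pairs appear once each.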
Query 2: Product Pair Analysis with Support
Calculates the support metric for each product pair: the percentage of all orders that contain both products.
WITH total_orders AS (
SELECT COUNT(DISTINCT order_id) AS order_count
FROM order_baskets
),
product_pairs AS (
SELECT
a.product_name AS product_a,
b.product_name AS product_b,
COUNT(DISTINCT a.order_id) AS pair_count
FROM order_baskets a
JOIN order_baskets b
ON a.order_id = b.order_id
AND a.product_id < b.product_id
GROUP BY a.product_name, b.product_name
)
SELECT
pp.product_a,
pp.product_b,
pp.pair_count,
ROUND(pp.pair_count * 100.0 / t.order_count, 4) AS support_pct
FROM product_pairs pp
CROSS JOIN total_orders t
WHERE pp.pair_count >= 5
ORDER BY support_pct DESC
LIMIT 20;
Query 3: Category-Level Associations
Identifies which product categories are most commonly purchased together, providing a higher-level view of buying patterns.
SELECT
a.category AS category_a,
b.category AS category_b,
COUNT(DISTINCT a.order_id) AS co_occurrence,
ROUND(
COUNT(DISTINCT a.order_id) * 100.0 /
(SELECT COUNT(DISTINCT order_id) FROM order_baskets),
2
) AS support_pct
FROM order_baskets a
JOIN order_baskets b
ON a.order_id = b.order_id
AND a.category < b.category
GROUP BY a.category, b.category
ORDER BY co_occurrence DESC;
Query 4: Basket Size Distribution
Understanding basket sizes helps you set appropriate minimum support thresholds for the Apriori algorithm.
SELECT
items_per_order,
COUNT(*) AS order_count,
ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (), 1) AS pct
FROM (
SELECT
order_id,
COUNT(DISTINCT product_id) AS items_per_order
FROM order_baskets
GROUP BY order_id
) basket_sizes
GROUP BY items_per_order
ORDER BY items_per_order;
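The same distribution can be computed in pandas before committing to a threshold. The tiny orders frame below is a hypothetical stand-in for order_baskets:

```python
import pandas as pd

# Hypothetical stand-in for order_baskets
orders = pd.DataFrame({
    "order_id":   [1, 1, 1, 2, 2, 3, 4, 4, 4, 4],
    "product_id": [10, 20, 30, 10, 20, 10, 10, 20, 30, 40],
})

# Distinct items per order, then the share of orders at each basket size
items_per_order = orders.groupby("order_id")["product_id"].nunique()
size_dist = items_per_order.value_counts(normalize=True).sort_index()
print(size_dist)

# Rough heuristic: when most baskets are small, itemset supports are low,
# so start min_support low (e.g. ~0.01) and raise it if Apriori returns
# an unmanageable number of itemsets.
```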
Python Analysis
Use the mlxtend library to apply the Apriori algorithm and generate association rules from the transaction data.
Data Preparation
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder
# Load transaction data
baskets = pd.read_csv('order_baskets.csv', parse_dates=['order_date'])
print(f"Total rows: {len(baskets):,}")
print(f"Unique orders: {baskets['order_id'].nunique():,}")
print(f"Unique products: {baskets['product_name'].nunique():,}")
print(f"Date range: {baskets['order_date'].min().date()} to {baskets['order_date'].max().date()}")
# Create transaction list: each order becomes a list of product names
transactions = baskets.groupby('order_id')['product_name'].apply(list).tolist()
# Filter to orders with 2+ items (single-item orders have no co-purchases).
# Note: this shrinks the denominator, so Python support values will run
# slightly higher than the SQL support_pct computed over all orders.
transactions = [t for t in transactions if len(t) >= 2]
print(f"\nMulti-item transactions: {len(transactions):,}")
Apriori Algorithm
# One-hot encode the transactions
te = TransactionEncoder()
te_array = te.fit(transactions).transform(transactions)
basket_df = pd.DataFrame(te_array, columns=te.columns_)
print(f"Transaction matrix shape: {basket_df.shape}")
print(f"(rows=transactions, columns=unique products)")
# Apply Apriori to find frequent itemsets
# min_support=0.01 means the itemset must appear in at least 1% of transactions
frequent_items = apriori(
basket_df,
min_support=0.01,
use_colnames=True,
    max_len=3  # itemsets of up to 3 products (singletons, pairs, triplets)
)
frequent_items['length'] = frequent_items['itemsets'].apply(len)
print(f"\nFrequent itemsets found: {len(frequent_items)}")
print(f" Pairs: {(frequent_items['length'] == 2).sum()}")
print(f" Triplets: {(frequent_items['length'] == 3).sum()}")
# Show top frequent pairs
top_pairs = (
frequent_items[frequent_items['length'] == 2]
.sort_values('support', ascending=False)
.head(15)
)
print("\nTop 15 Frequent Product Pairs:")
for _, row in top_pairs.iterrows():
items = ', '.join(row['itemsets'])
print(f" {items} (support: {row['support']:.4f})")
Association Rules
# Generate association rules from frequent itemsets
rules = association_rules(
frequent_items,
metric="lift",
min_threshold=1.0,
    num_itemsets=len(basket_df)  # number of transactions, not itemsets (required in mlxtend >= 0.23.2)
)
# Clean up the rules for readability
rules['antecedents'] = rules['antecedents'].apply(lambda x: ', '.join(list(x)))
rules['consequents'] = rules['consequents'].apply(lambda x: ', '.join(list(x)))
# Sort by lift (strongest associations first)
rules_sorted = rules.sort_values('lift', ascending=False)
print(f"Total association rules generated: {len(rules_sorted)}")
print("\nTop 15 Rules by Lift:")
print(rules_sorted[['antecedents', 'consequents', 'support',
'confidence', 'lift']].head(15).to_string(index=False))
# Filter high-confidence rules for production recommendations
high_conf = rules_sorted[
(rules_sorted['confidence'] >= 0.3) &
(rules_sorted['lift'] >= 1.5)
].copy()
print(f"\nHigh-quality rules (confidence >= 0.3, lift >= 1.5): {len(high_conf)}")
high_conf.to_csv('recommendation_rules.csv', index=False)
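One optional cleanup worth knowing about: association_rules emits both directions of a pair (A → B and B → A) whenever both clear the thresholds, so a widget can end up showing mirrored rules. A sketch of keeping only the higher-confidence direction, assuming the string-cleaned antecedents/consequents format used above (the frame here is illustrative, not real output):

```python
import pandas as pd

# Hypothetical cleaned rules frame (antecedents/consequents as strings)
rules = pd.DataFrame({
    "antecedents": ["Phone Case", "Screen Protector", "Yoga Mat"],
    "consequents": ["Screen Protector", "Phone Case", "Resistance Bands"],
    "confidence":  [0.45, 0.60, 0.30],
    "lift":        [2.1, 2.1, 1.8],
})

# Build an order-independent key per pair, then keep the
# higher-confidence direction of each mirrored rule
pair_key = rules.apply(
    lambda r: frozenset([r["antecedents"], r["consequents"]]), axis=1
)
deduped = (
    rules.assign(pair=pair_key)
         .sort_values("confidence", ascending=False)
         .drop_duplicates("pair")
         .drop(columns="pair")
)
print(deduped)
```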
Visualizations
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
# Plot 1: Support vs Confidence scatter
scatter = axes[0].scatter(
rules_sorted['support'],
rules_sorted['confidence'],
c=rules_sorted['lift'],
cmap='YlOrRd',
alpha=0.6,
s=40
)
axes[0].set_xlabel('Support')
axes[0].set_ylabel('Confidence')
axes[0].set_title('Support vs Confidence (colored by Lift)')
plt.colorbar(scatter, ax=axes[0], label='Lift')
# Plot 2: Top 10 rules by lift (horizontal bar)
top_rules = rules_sorted.head(10)
labels = [f"{r['antecedents']} -> {r['consequents']}" for _, r in top_rules.iterrows()]
axes[1].barh(range(len(labels)), top_rules['lift'].values, color='#e07a2f')
axes[1].set_yticks(range(len(labels)))
axes[1].set_yticklabels(labels, fontsize=8)
axes[1].set_xlabel('Lift')
axes[1].set_title('Top 10 Association Rules by Lift')
axes[1].invert_yaxis()
# Plot 3: Category co-occurrence heatmap
# (dedupe to one row per order/category before the self-merge
#  to limit the row blow-up on large orders)
cat_pairs = (
    baskets[['order_id', 'category']].drop_duplicates()
    .merge(baskets[['order_id', 'category']].drop_duplicates(),
           on='order_id', suffixes=('_a', '_b'))
)
cat_pairs = cat_pairs[cat_pairs['category_a'] < cat_pairs['category_b']]
cat_matrix = cat_pairs.pivot_table(
index='category_a',
columns='category_b',
values='order_id',
aggfunc='nunique'
).fillna(0)
sns.heatmap(cat_matrix, annot=True, fmt='.0f', cmap='YlOrRd',
linewidths=0.5, ax=axes[2])
axes[2].set_title('Category Co-Occurrence Counts')
plt.suptitle('Product Recommendation Analysis', fontsize=16, fontweight='bold', y=1.02)
plt.tight_layout()
plt.savefig('recommendation_analysis.png', dpi=150, bbox_inches='tight')
plt.show()
Building the Recommendation Function
def get_recommendations(product_name, rules_df, top_n=5):
"""
Given a product name, return the top N recommended products
based on association rules sorted by confidence * lift.
"""
    matching = rules_df[rules_df['antecedents'].str.contains(product_name, case=False, regex=False)]
matching = matching.copy()
matching['score'] = matching['confidence'] * matching['lift']
matching = matching.sort_values('score', ascending=False)
recommendations = []
for _, rule in matching.head(top_n).iterrows():
recommendations.append({
'recommended_product': rule['consequents'],
'confidence': round(rule['confidence'], 3),
'lift': round(rule['lift'], 2),
'score': round(rule['score'], 3)
})
return pd.DataFrame(recommendations)
# Example usage
print("Recommendations for 'Wireless Bluetooth Earbuds':")
print(get_recommendations('Wireless Bluetooth Earbuds', rules_sorted))
print("\nRecommendations for 'Yoga Mat':")
print(get_recommendations('Yoga Mat', rules_sorted))
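For serving recommendations in production, scanning the rules DataFrame on every request is wasteful. A common pattern, sketched here with a hypothetical rules frame, is to precompute a top-N lookup table once and serve from it:

```python
import pandas as pd
from collections import defaultdict

# Hypothetical cleaned rules frame (placeholder values)
rules = pd.DataFrame({
    "antecedents": ["Phone Case", "Phone Case", "Yoga Mat"],
    "consequents": ["Screen Protector", "Charger", "Resistance Bands"],
    "confidence":  [0.45, 0.25, 0.30],
    "lift":        [2.1, 1.6, 1.8],
})

# Precompute top-N recommendations per product, ranked by confidence * lift
TOP_N = 5
rules = rules.assign(score=rules["confidence"] * rules["lift"])
recs = defaultdict(list)
for ant, grp in rules.groupby("antecedents"):
    top = grp.sort_values("score", ascending=False).head(TOP_N)
    recs[ant] = top["consequents"].tolist()

print(recs["Phone Case"])  # ['Screen Protector', 'Charger']
```

The dict can be rebuilt whenever the rules are refreshed and serves each product-page request with a single O(1) lookup.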
Excel Instructions
Note: Market basket analysis and the Apriori algorithm require iterative computation over potentially millions of itemset combinations. This is not feasible in Excel. The SQL queries extract the raw co-purchase data, and Python handles the algorithmic processing. The exported recommendation_rules.csv can be opened in Excel for filtering and presentation purposes, but the analysis itself should be performed in Python.
Key Insights & Recommendations
Present these findings to the product and engineering teams to implement the recommendation system.
Cross-Category Bundles
The strongest associations often cross categories (e.g., phone case + screen protector, yoga mat + resistance bands). Create cross-category bundle discounts on product pages to increase average order value.
Confidence-Based Ranking
Use confidence as the primary ranking metric for "Frequently Bought Together" widgets. A rule with 45% confidence means nearly half of buyers of Product A also buy Product B, making it a strong recommendation.
Lift Threshold for Quality
Only deploy rules with lift > 1.5. A lift of 1.0 means products are purchased together at random rates. Rules below this threshold are noise and will degrade recommendation quality.
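The arithmetic behind that threshold is worth keeping in mind: lift is confidence divided by the consequent's base-rate support. A toy check:

```python
# lift = confidence / support(consequent) = P(B|A) / P(B):
# how much more likely B is, given A, than at random.
support_b = 0.20     # 20% of all orders contain Product B
confidence = 0.50    # 50% of orders with Product A also contain B
lift = confidence / support_b
print(lift)  # 2.5 -> buyers of A are 2.5x as likely as average to buy B
```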
Seasonal Variation
Run the analysis quarterly. Product associations change with seasons (e.g., sunscreen + sunglasses in summer vs. gloves + scarves in winter). Stale rules lead to irrelevant recommendations.
Cart-Page vs. Product-Page
Use high-confidence pairs on product detail pages ("Frequently Bought Together") and use triplet rules on the cart page ("Complete Your Order"). This two-tier approach maximizes cross-sell opportunities.
Revenue Impact Estimate
Industry benchmarks show "Frequently Bought Together" features increase average order value by 10-15%. With the top 50 rules covering roughly 60% of product pages, even the low end of that range implies meaningful incremental revenue; validate the actual uplift with an A/B test before committing to a forecast.
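A back-of-envelope sketch of that impact. Every input below is a placeholder to be replaced with MegaMart's measured numbers, not real data:

```python
# All inputs are hypothetical placeholders, not measured values.
monthly_orders = 100_000
avg_order_value = 62.00
aov_uplift = 0.10   # low end of the 10-15% industry benchmark
coverage = 0.60     # share of product pages served by the top rules

incremental_monthly_revenue = (
    monthly_orders * coverage * avg_order_value * aov_uplift
)
print(f"${incremental_monthly_revenue:,.0f} per month")
```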
Knowledge Check Quiz
Test your understanding of market basket analysis and association rule mining.
In association rule mining, what does "support" measure?
A rule has confidence of 0.6 and lift of 2.5. What does a lift of 2.5 mean?
Why does the SQL co-purchase query use a.product_id < b.product_id in the JOIN condition?
What does the TransactionEncoder in mlxtend do?
Why should you set a minimum support threshold when running the Apriori algorithm?