Customer Segmentation
Segment e-commerce customers using RFM analysis and K-Means clustering to enable targeted marketing campaigns and personalized experiences.
Project Overview
Difficulty: Intermediate
Tools: SQL, Python (pandas, sklearn, matplotlib, seaborn), Excel
Estimated Time: 6-8 hours
You are a data analyst at UrbanCart, a mid-size e-commerce company selling fashion, accessories, and lifestyle products. The marketing team wants to move away from one-size-fits-all email campaigns and instead target customers based on their purchasing behavior.
Your goal is to build an RFM (Recency, Frequency, Monetary) segmentation model that groups customers into actionable segments:
- Recency: How recently did the customer make a purchase?
- Frequency: How often do they purchase?
- Monetary: How much do they spend in total?
You will then apply K-Means clustering to discover natural groupings in the data and recommend marketing strategies for each segment.
Dataset Description
This project uses two tables containing transactional and customer data.
Table: customers
| Column | Data Type | Description |
|---|---|---|
| customer_id | INT (PK) | Unique customer identifier |
| customer_name | VARCHAR(100) | Full name |
| VARCHAR(150) | Email address | |
| signup_date | DATE | Account creation date |
| city | VARCHAR(50) | Customer city |
| state | VARCHAR(2) | US state abbreviation |
Table: transactions
| Column | Data Type | Description |
|---|---|---|
| transaction_id | INT (PK) | Unique transaction identifier |
| customer_id | INT (FK) | References customers table |
| transaction_date | DATE | Date of purchase |
| amount | DECIMAL(10,2) | Transaction total in USD |
| product_category | VARCHAR(50) | Fashion, Accessories, Lifestyle |
| payment_method | VARCHAR(20) | credit_card, debit_card, paypal |
SQL Analysis
Start by calculating the RFM metrics directly in SQL. These queries produce the raw data you will feed into the Python clustering step.
Query 1: RFM Metrics per Customer
This is the core query. It calculates recency (days since last purchase), frequency (total orders), and monetary value (total spend) for every customer.
SELECT
c.customer_id,
c.customer_name,
c.email,
DATEDIFF(CURDATE(), MAX(t.transaction_date)) AS recency_days,
COUNT(t.transaction_id) AS frequency,
ROUND(SUM(t.amount), 2) AS monetary_value,
ROUND(AVG(t.amount), 2) AS avg_order_value,
MIN(t.transaction_date) AS first_purchase,
MAX(t.transaction_date) AS last_purchase
FROM customers c
JOIN transactions t ON c.customer_id = t.customer_id
GROUP BY c.customer_id, c.customer_name, c.email
ORDER BY monetary_value DESC;
Query 2: RFM Score Assignment
Assigns 1-5 scores for each RFM dimension using NTILE window function. A score of 5 is best (most recent, most frequent, highest spend).
WITH rfm_raw AS (
SELECT
c.customer_id,
c.customer_name,
DATEDIFF(CURDATE(), MAX(t.transaction_date)) AS recency_days,
COUNT(t.transaction_id) AS frequency,
ROUND(SUM(t.amount), 2) AS monetary
FROM customers c
JOIN transactions t ON c.customer_id = t.customer_id
GROUP BY c.customer_id, c.customer_name
)
SELECT
customer_id,
customer_name,
recency_days,
frequency,
monetary,
NTILE(5) OVER (ORDER BY recency_days DESC) AS r_score,
NTILE(5) OVER (ORDER BY frequency ASC) AS f_score,
NTILE(5) OVER (ORDER BY monetary ASC) AS m_score
FROM rfm_raw
ORDER BY (NTILE(5) OVER (ORDER BY recency_days DESC) +
NTILE(5) OVER (ORDER BY frequency ASC) +
NTILE(5) OVER (ORDER BY monetary ASC)) DESC;
Query 3: Customer Purchase Patterns
Analyzes purchasing behavior by category and payment method to enrich the segmentation with behavioral data.
SELECT
c.customer_id,
c.customer_name,
t.product_category,
COUNT(*) AS purchases_in_category,
ROUND(SUM(t.amount), 2) AS category_spend,
t.payment_method,
COUNT(DISTINCT DATE_FORMAT(t.transaction_date, '%Y-%m')) AS active_months
FROM customers c
JOIN transactions t ON c.customer_id = t.customer_id
GROUP BY c.customer_id, c.customer_name, t.product_category, t.payment_method
ORDER BY c.customer_id, category_spend DESC;
Query 4: Segment Distribution Preview
Combines the RFM scores into a composite segment label and counts how many customers fall into each group.
WITH rfm_scores AS (
SELECT
customer_id,
NTILE(5) OVER (ORDER BY DATEDIFF(CURDATE(), MAX(transaction_date)) DESC) AS r,
NTILE(5) OVER (ORDER BY COUNT(transaction_id) ASC) AS f,
NTILE(5) OVER (ORDER BY SUM(amount) ASC) AS m
FROM transactions
GROUP BY customer_id
),
segments AS (
SELECT *,
CASE
WHEN r >= 4 AND f >= 4 AND m >= 4 THEN 'Champions'
WHEN r >= 3 AND f >= 3 AND m >= 3 THEN 'Loyal Customers'
WHEN r >= 4 AND f <= 2 THEN 'New Customers'
WHEN r <= 2 AND f >= 3 THEN 'At Risk'
WHEN r <= 2 AND f <= 2 THEN 'Lost'
ELSE 'Potential Loyalists'
END AS segment
FROM rfm_scores
)
SELECT
segment,
COUNT(*) AS customer_count,
ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (), 1) AS pct
FROM segments
GROUP BY segment
ORDER BY customer_count DESC;
Python Analysis
Use Python to perform K-Means clustering on the RFM data and create visualizations of the customer segments.
Data Preparation and RFM Calculation
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from datetime import datetime
# Load data
transactions = pd.read_csv('transactions.csv', parse_dates=['transaction_date'])
customers = pd.read_csv('customers.csv')
# Calculate RFM metrics
snapshot_date = transactions['transaction_date'].max() + pd.Timedelta(days=1)
rfm = transactions.groupby('customer_id').agg(
recency=('transaction_date', lambda x: (snapshot_date - x.max()).days),
frequency=('transaction_id', 'count'),
monetary=('amount', 'sum')
).reset_index()
# Merge with customer info
rfm = rfm.merge(customers[['customer_id', 'customer_name']], on='customer_id')
print(f"Total customers: {len(rfm):,}")
print(f"\nRFM Summary Statistics:")
print(rfm[['recency', 'frequency', 'monetary']].describe().round(2))
K-Means Clustering
# Scale the features (K-Means is distance-based, so scaling matters)
scaler = StandardScaler()
rfm_scaled = scaler.fit_transform(rfm[['recency', 'frequency', 'monetary']])
# Find optimal number of clusters using Elbow Method
inertias = []
K_range = range(2, 9)
for k in K_range:
km = KMeans(n_clusters=k, random_state=42, n_init=10)
km.fit(rfm_scaled)
inertias.append(km.inertia_)
fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(K_range, inertias, marker='o', color='#e07a2f', linewidth=2)
ax.set_title('Elbow Method - Optimal Number of Clusters', fontsize=14, fontweight='bold')
ax.set_xlabel('Number of Clusters (K)')
ax.set_ylabel('Inertia (Within-Cluster Sum of Squares)')
plt.tight_layout()
plt.savefig('elbow_method.png', dpi=150)
plt.show()
# Fit with optimal K (typically 4-5 for RFM)
optimal_k = 4
km_final = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
rfm['cluster'] = km_final.fit_predict(rfm_scaled)
# Analyze cluster centers
cluster_summary = rfm.groupby('cluster').agg(
avg_recency=('recency', 'mean'),
avg_frequency=('frequency', 'mean'),
avg_monetary=('monetary', 'mean'),
count=('customer_id', 'count')
).round(1)
print("\nCluster Summary:")
print(cluster_summary)
Segment Labeling and Visualization
# Assign meaningful labels based on cluster characteristics
# (Adjust these based on your actual cluster_summary output)
segment_labels = {
0: 'Champions',
1: 'At Risk',
2: 'New Customers',
3: 'Loyal Customers'
}
rfm['segment'] = rfm['cluster'].map(segment_labels)
# Scatter plot: Frequency vs Monetary colored by segment
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
colors = ['#e07a2f', '#4a90d9', '#50c878', '#e05050']
# Plot 1: Frequency vs Monetary
for i, seg in enumerate(segment_labels.values()):
mask = rfm['segment'] == seg
axes[0].scatter(rfm.loc[mask, 'frequency'], rfm.loc[mask, 'monetary'],
label=seg, alpha=0.6, color=colors[i], s=40)
axes[0].set_xlabel('Frequency (# Orders)')
axes[0].set_ylabel('Monetary ($)')
axes[0].set_title('Frequency vs Monetary')
axes[0].legend(fontsize=9)
# Plot 2: Recency vs Frequency
for i, seg in enumerate(segment_labels.values()):
mask = rfm['segment'] == seg
axes[1].scatter(rfm.loc[mask, 'recency'], rfm.loc[mask, 'frequency'],
label=seg, alpha=0.6, color=colors[i], s=40)
axes[1].set_xlabel('Recency (Days)')
axes[1].set_ylabel('Frequency (# Orders)')
axes[1].set_title('Recency vs Frequency')
axes[1].legend(fontsize=9)
# Plot 3: Segment Distribution (Pie)
seg_counts = rfm['segment'].value_counts()
axes[2].pie(seg_counts, labels=seg_counts.index, autopct='%1.1f%%',
colors=colors, startangle=90)
axes[2].set_title('Segment Distribution')
plt.suptitle('Customer Segmentation Analysis', fontsize=16, fontweight='bold', y=1.02)
plt.tight_layout()
plt.savefig('customer_segments.png', dpi=150, bbox_inches='tight')
plt.show()
Segment Profile Report
# Generate a detailed profile for each segment
segment_profile = rfm.groupby('segment').agg(
customers=('customer_id', 'count'),
avg_recency=('recency', 'mean'),
avg_frequency=('frequency', 'mean'),
avg_monetary=('monetary', 'mean'),
median_monetary=('monetary', 'median'),
total_revenue=('monetary', 'sum')
).round(2)
segment_profile['pct_customers'] = (segment_profile['customers'] / len(rfm) * 100).round(1)
segment_profile['pct_revenue'] = (segment_profile['total_revenue'] / rfm['monetary'].sum() * 100).round(1)
print("Segment Profiles:")
print(segment_profile.to_string())
# Export for Excel analysis
rfm.to_csv('rfm_segmented.csv', index=False)
print("\nExported rfm_segmented.csv for Excel analysis.")
Excel Dashboard Instructions
Import the Python-generated rfm_segmented.csv into Excel and build a segmentation analysis workbook.
Step-by-Step Guide
Key Insights & Recommendations
Present these findings to the marketing team with clear, actionable next steps for each segment.
Champions (High R, F, M)
These are your best customers: recent, frequent, and high-spending. Reward them with loyalty programs, exclusive early access, and referral bonuses. They typically represent 15-20% of customers but 40%+ of revenue.
Loyal Customers (High F, M)
Frequent buyers with solid spend. Upsell premium products, offer tiered rewards, and ask for reviews/testimonials. These customers are one step from becoming Champions with the right engagement.
New Customers (High R, Low F)
Recent first-time or second-time buyers. The onboarding experience is critical. Send welcome series emails, provide first-purchase discounts on their second order, and introduce product recommendations based on their first purchase category.
At Risk (Low R, High F)
Previously loyal customers who have not purchased recently. Launch win-back campaigns with personalized offers based on their purchase history. Time-limited discounts create urgency to return.
Lost Customers (Low R, F, M)
Inactive customers with minimal purchase history. Survey them to understand why they left. Cost-effective reactivation attempts (email only, no paid ads) are appropriate since ROI on this segment is low.
Revenue Concentration
Typically the top two segments (Champions + Loyal) account for 60-70% of total revenue from only 30% of customers. This insight justifies investing in retention over acquisition for these groups.
Knowledge Check Quiz
Test your understanding of RFM analysis and customer segmentation concepts.
What does the "R" in RFM analysis measure?
Why must you scale features before running K-Means clustering on RFM data?
What does the NTILE(5) window function do in the RFM scoring query?
Which marketing strategy is most appropriate for the "At Risk" customer segment?
What does the "Elbow Method" help you determine in K-Means clustering?