Customer Segmentation - Data Analytics Projects

Project Overview

Difficulty: Intermediate

Tools: SQL, Python (pandas, sklearn, matplotlib, seaborn), Excel

Estimated Time: 6-8 hours

You are a data analyst at UrbanCart, a mid-size e-commerce company selling fashion, accessories, and lifestyle products. The marketing team wants to move away from one-size-fits-all email campaigns and instead target customers based on their purchasing behavior.

Your goal is to build an RFM (Recency, Frequency, Monetary) segmentation model that groups customers into actionable segments:

Recency: How recently did the customer make a purchase?
Frequency: How often do they purchase?
Monetary: How much do they spend in total?

You will then apply K-Means clustering to discover natural groupings in the data and recommend marketing strategies for each segment.

Dataset Description

This project uses two tables containing transactional and customer data.

Table: customers

Column	Data Type	Description
customer_id	INT (PK)	Unique customer identifier
customer_name	VARCHAR(100)	Full name
email	VARCHAR(150)	Email address
signup_date	DATE	Account creation date
city	VARCHAR(50)	Customer city
state	VARCHAR(2)	US state abbreviation

Table: transactions

Column	Data Type	Description
transaction_id	INT (PK)	Unique transaction identifier
customer_id	INT (FK)	References customers table
transaction_date	DATE	Date of purchase
amount	DECIMAL(10,2)	Transaction total in USD
product_category	VARCHAR(50)	Fashion, Accessories, Lifestyle
payment_method	VARCHAR(20)	credit_card, debit_card, paypal

SQL Analysis

Start by calculating the RFM metrics directly in SQL. These queries produce the raw data you will feed into the Python clustering step.

Query 1: RFM Metrics per Customer

This is the core query. It calculates recency (days since last purchase), frequency (total orders), and monetary value (total spend) for every customer.

SELECT
    c.customer_id,
    c.customer_name,
    c.email,
    DATEDIFF(CURDATE(), MAX(t.transaction_date)) AS recency_days,
    COUNT(t.transaction_id)                       AS frequency,
    ROUND(SUM(t.amount), 2)                       AS monetary_value,
    ROUND(AVG(t.amount), 2)                       AS avg_order_value,
    MIN(t.transaction_date)                       AS first_purchase,
    MAX(t.transaction_date)                       AS last_purchase
FROM customers c
JOIN transactions t ON c.customer_id = t.customer_id
GROUP BY c.customer_id, c.customer_name, c.email
ORDER BY monetary_value DESC;

Query 2: RFM Score Assignment

Assigns 1-5 scores for each RFM dimension using NTILE window function. A score of 5 is best (most recent, most frequent, highest spend).

WITH rfm_raw AS (
    SELECT
        c.customer_id,
        c.customer_name,
        DATEDIFF(CURDATE(), MAX(t.transaction_date)) AS recency_days,
        COUNT(t.transaction_id)                       AS frequency,
        ROUND(SUM(t.amount), 2)                       AS monetary
    FROM customers c
    JOIN transactions t ON c.customer_id = t.customer_id
    GROUP BY c.customer_id, c.customer_name
)
SELECT
    customer_id,
    customer_name,
    recency_days,
    frequency,
    monetary,
    NTILE(5) OVER (ORDER BY recency_days DESC)  AS r_score,
    NTILE(5) OVER (ORDER BY frequency ASC)       AS f_score,
    NTILE(5) OVER (ORDER BY monetary ASC)        AS m_score
FROM rfm_raw
ORDER BY (NTILE(5) OVER (ORDER BY recency_days DESC) +
          NTILE(5) OVER (ORDER BY frequency ASC) +
          NTILE(5) OVER (ORDER BY monetary ASC)) DESC;

Query 3: Customer Purchase Patterns

Analyzes purchasing behavior by category and payment method to enrich the segmentation with behavioral data.

SELECT
    c.customer_id,
    c.customer_name,
    t.product_category,
    COUNT(*)                  AS purchases_in_category,
    ROUND(SUM(t.amount), 2)  AS category_spend,
    t.payment_method,
    COUNT(DISTINCT DATE_FORMAT(t.transaction_date, '%Y-%m')) AS active_months
FROM customers c
JOIN transactions t ON c.customer_id = t.customer_id
GROUP BY c.customer_id, c.customer_name, t.product_category, t.payment_method
ORDER BY c.customer_id, category_spend DESC;

Query 4: Segment Distribution Preview

Combines the RFM scores into a composite segment label and counts how many customers fall into each group.

WITH rfm_scores AS (
    SELECT
        customer_id,
        NTILE(5) OVER (ORDER BY DATEDIFF(CURDATE(), MAX(transaction_date)) DESC) AS r,
        NTILE(5) OVER (ORDER BY COUNT(transaction_id) ASC)                        AS f,
        NTILE(5) OVER (ORDER BY SUM(amount) ASC)                                  AS m
    FROM transactions
    GROUP BY customer_id
),
segments AS (
    SELECT *,
        CASE
            WHEN r >= 4 AND f >= 4 AND m >= 4 THEN 'Champions'
            WHEN r >= 3 AND f >= 3 AND m >= 3 THEN 'Loyal Customers'
            WHEN r >= 4 AND f <= 2             THEN 'New Customers'
            WHEN r <= 2 AND f >= 3             THEN 'At Risk'
            WHEN r <= 2 AND f <= 2             THEN 'Lost'
            ELSE 'Potential Loyalists'
        END AS segment
    FROM rfm_scores
)
SELECT
    segment,
    COUNT(*)         AS customer_count,
    ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (), 1) AS pct
FROM segments
GROUP BY segment
ORDER BY customer_count DESC;

Python Analysis

Use Python to perform K-Means clustering on the RFM data and create visualizations of the customer segments.

Data Preparation and RFM Calculation

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from datetime import datetime

# Load data
transactions = pd.read_csv('transactions.csv', parse_dates=['transaction_date'])
customers = pd.read_csv('customers.csv')

# Calculate RFM metrics
snapshot_date = transactions['transaction_date'].max() + pd.Timedelta(days=1)

rfm = transactions.groupby('customer_id').agg(
    recency=('transaction_date', lambda x: (snapshot_date - x.max()).days),
    frequency=('transaction_id', 'count'),
    monetary=('amount', 'sum')
).reset_index()

# Merge with customer info
rfm = rfm.merge(customers[['customer_id', 'customer_name']], on='customer_id')

print(f"Total customers: {len(rfm):,}")
print(f"\nRFM Summary Statistics:")
print(rfm[['recency', 'frequency', 'monetary']].describe().round(2))

K-Means Clustering

# Scale the features (K-Means is distance-based, so scaling matters)
scaler = StandardScaler()
rfm_scaled = scaler.fit_transform(rfm[['recency', 'frequency', 'monetary']])

# Find optimal number of clusters using Elbow Method
inertias = []
K_range = range(2, 9)
for k in K_range:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(rfm_scaled)
    inertias.append(km.inertia_)

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(K_range, inertias, marker='o', color='#e07a2f', linewidth=2)
ax.set_title('Elbow Method - Optimal Number of Clusters', fontsize=14, fontweight='bold')
ax.set_xlabel('Number of Clusters (K)')
ax.set_ylabel('Inertia (Within-Cluster Sum of Squares)')
plt.tight_layout()
plt.savefig('elbow_method.png', dpi=150)
plt.show()

# Fit with optimal K (typically 4-5 for RFM)
optimal_k = 4
km_final = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
rfm['cluster'] = km_final.fit_predict(rfm_scaled)

# Analyze cluster centers
cluster_summary = rfm.groupby('cluster').agg(
    avg_recency=('recency', 'mean'),
    avg_frequency=('frequency', 'mean'),
    avg_monetary=('monetary', 'mean'),
    count=('customer_id', 'count')
).round(1)

print("\nCluster Summary:")
print(cluster_summary)

Segment Labeling and Visualization

# Assign meaningful labels based on cluster characteristics
# (Adjust these based on your actual cluster_summary output)
segment_labels = {
    0: 'Champions',
    1: 'At Risk',
    2: 'New Customers',
    3: 'Loyal Customers'
}
rfm['segment'] = rfm['cluster'].map(segment_labels)

# Scatter plot: Frequency vs Monetary colored by segment
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
colors = ['#e07a2f', '#4a90d9', '#50c878', '#e05050']

# Plot 1: Frequency vs Monetary
for i, seg in enumerate(segment_labels.values()):
    mask = rfm['segment'] == seg
    axes[0].scatter(rfm.loc[mask, 'frequency'], rfm.loc[mask, 'monetary'],
                    label=seg, alpha=0.6, color=colors[i], s=40)
axes[0].set_xlabel('Frequency (# Orders)')
axes[0].set_ylabel('Monetary ($)')
axes[0].set_title('Frequency vs Monetary')
axes[0].legend(fontsize=9)

# Plot 2: Recency vs Frequency
for i, seg in enumerate(segment_labels.values()):
    mask = rfm['segment'] == seg
    axes[1].scatter(rfm.loc[mask, 'recency'], rfm.loc[mask, 'frequency'],
                    label=seg, alpha=0.6, color=colors[i], s=40)
axes[1].set_xlabel('Recency (Days)')
axes[1].set_ylabel('Frequency (# Orders)')
axes[1].set_title('Recency vs Frequency')
axes[1].legend(fontsize=9)

# Plot 3: Segment Distribution (Pie)
seg_counts = rfm['segment'].value_counts()
axes[2].pie(seg_counts, labels=seg_counts.index, autopct='%1.1f%%',
            colors=colors, startangle=90)
axes[2].set_title('Segment Distribution')

plt.suptitle('Customer Segmentation Analysis', fontsize=16, fontweight='bold', y=1.02)
plt.tight_layout()
plt.savefig('customer_segments.png', dpi=150, bbox_inches='tight')
plt.show()

Segment Profile Report

# Generate a detailed profile for each segment
segment_profile = rfm.groupby('segment').agg(
    customers=('customer_id', 'count'),
    avg_recency=('recency', 'mean'),
    avg_frequency=('frequency', 'mean'),
    avg_monetary=('monetary', 'mean'),
    median_monetary=('monetary', 'median'),
    total_revenue=('monetary', 'sum')
).round(2)

segment_profile['pct_customers'] = (segment_profile['customers'] / len(rfm) * 100).round(1)
segment_profile['pct_revenue'] = (segment_profile['total_revenue'] / rfm['monetary'].sum() * 100).round(1)

print("Segment Profiles:")
print(segment_profile.to_string())

# Export for Excel analysis
rfm.to_csv('rfm_segmented.csv', index=False)
print("\nExported rfm_segmented.csv for Excel analysis.")

Excel Dashboard Instructions

Import the Python-generated rfm_segmented.csv into Excel and build a segmentation analysis workbook.

Step-by-Step Guide

Import the CSV: Open Excel, go to Data > From Text/CSV, and import rfm_segmented.csv. Verify that recency, frequency, and monetary columns are formatted as numbers.

Build RFM Scoring Matrix: Add three new columns: R_Score, F_Score, M_Score. Use =IF(recency<30,5,IF(recency<60,4,IF(recency<90,3,IF(recency<180,2,1)))) for R_Score. Apply similar tiered logic for F_Score and M_Score based on percentile thresholds.

Create Composite Score: Add a column RFM_Total = R_Score + F_Score + M_Score. This gives each customer a score from 3 (worst) to 15 (best).

Segment Pivot Table: Insert a PivotTable with segment in Rows. Add customer_id (Count), monetary (Sum and Average), recency (Average), and frequency (Average) to Values. This gives you the segment summary.

RFM Distribution Chart: Create a PivotTable of RFM_Total scores (Rows) vs Count of customers (Values). Insert a Column Chart to visualize the distribution of total RFM scores.

Segment Comparison Chart: From the segment pivot, insert a Clustered Bar Chart comparing average monetary value across segments. Add a secondary axis with customer count per segment.

Conditional Formatting: Apply color scales to the R_Score, F_Score, and M_Score columns (green=5, red=1). Apply Data Bars to the monetary column so high spenders are immediately visible.

Customer Lookup Sheet: Create a lookup sheet with a dropdown (Data Validation > List) for segment names. Use FILTER or VLOOKUP formulas to display all customers in the selected segment, showing their name, email, RFM scores, and last purchase date.

Dashboard Assembly: Create a Dashboard sheet with a segment summary table at top, the distribution chart on the left, the comparison chart on the right, and a segment-specific action plan below each chart.

Action Plan Table: Below the dashboard, add a table with columns: Segment, Customer Count, Recommended Action, Campaign Type, Expected ROI. Fill in marketing recommendations for each segment (e.g., Champions get exclusive early access; At Risk gets win-back discount emails).

Key Insights & Recommendations

Present these findings to the marketing team with clear, actionable next steps for each segment.

Champions (High R, F, M)

These are your best customers: recent, frequent, and high-spending. Reward them with loyalty programs, exclusive early access, and referral bonuses. They typically represent 15-20% of customers but 40%+ of revenue.

Loyal Customers (High F, M)

Frequent buyers with solid spend. Upsell premium products, offer tiered rewards, and ask for reviews/testimonials. These customers are one step from becoming Champions with the right engagement.

New Customers (High R, Low F)

Recent first-time or second-time buyers. The onboarding experience is critical. Send welcome series emails, provide first-purchase discounts on their second order, and introduce product recommendations based on their first purchase category.

At Risk (Low R, High F)

Previously loyal customers who have not purchased recently. Launch win-back campaigns with personalized offers based on their purchase history. Time-limited discounts create urgency to return.

Lost Customers (Low R, F, M)

Inactive customers with minimal purchase history. Survey them to understand why they left. Cost-effective reactivation attempts (email only, no paid ads) are appropriate since ROI on this segment is low.

Revenue Concentration

Typically the top two segments (Champions + Loyal) account for 60-70% of total revenue from only 30% of customers. This insight justifies investing in retention over acquisition for these groups.

Knowledge Check Quiz

Test your understanding of RFM analysis and customer segmentation concepts.