Bank Customer Churn Analysis
A retail bank is losing customers at an alarming rate and needs to understand why they leave and predict who is likely to churn next. You will analyze customer behavior patterns with SQL, build a churn prediction model with Random Forest in Python, and create an Excel dashboard that quantifies the business impact of retention efforts.
Project Overview
Business Scenario
Summit Bank has seen its customer attrition rate climb from 8% to 14% over the past two years. Acquiring a new customer costs 5-7x more than retaining an existing one, so even a modest improvement in retention has significant financial impact. The Head of Retail Banking wants to identify the key drivers of churn, predict which active customers are most at risk, and determine whether targeted retention offers (fee waivers, rate upgrades, personal outreach) are cost-effective.
- Goal: Predict customer churn, identify root causes, and calculate the ROI of retention interventions
- Tools: SQL (churn pattern analysis), Python (Random Forest prediction model), Excel (churn dashboard)
- Stakeholder: Head of Retail Banking and Customer Retention Team
- Deliverables: Churn prediction model, risk-scored customer list, segment analysis, retention cost-benefit report
Dataset
Database Schema
The churn analysis database links customer demographics, account details, transaction activity, complaint records, and confirmed churn labels.
customers
| Column | Type | Description |
|---|---|---|
| customer_id | INT (PK) | Unique customer identifier |
| age | INT | Customer age |
| gender | VARCHAR(1) | M or F |
| tenure_months | INT | Months as a customer |
| segment | VARCHAR(20) | Mass, Affluent, Premier |
| city | VARCHAR(50) | Customer city |
| join_date | DATE | Date customer opened first account |
accounts
| Column | Type | Description |
|---|---|---|
| account_id | INT (PK) | Account identifier |
| customer_id | INT (FK) | Customer who owns the account |
| account_type | VARCHAR(20) | Checking, Savings, Credit Card, Mortgage |
| balance | DECIMAL(14,2) | Current account balance |
| monthly_fee | DECIMAL(6,2) | Monthly account maintenance fee |
| is_active | TINYINT | 1 = active, 0 = closed |
transactions
| Column | Type | Description |
|---|---|---|
| txn_id | INT (PK) | Transaction identifier |
| account_id | INT (FK) | Account used |
| txn_date | DATE | Transaction date |
| txn_type | VARCHAR(20) | Deposit, Withdrawal, Transfer, Payment |
| amount | DECIMAL(10,2) | Transaction amount |
complaints
| Column | Type | Description |
|---|---|---|
| complaint_id | INT (PK) | Complaint identifier |
| customer_id | INT (FK) | Customer who filed complaint |
| complaint_date | DATE | Date filed |
| category | VARCHAR(30) | Fees, Service, App/Website, Wait Times, Other |
| resolution_days | INT | Days to resolve (NULL if unresolved) |
churn_labels
| Column | Type | Description |
|---|---|---|
| customer_id | INT (PK/FK) | Customer identifier |
| churned | TINYINT | 1 = churned, 0 = retained |
| churn_date | DATE | Date all accounts were closed (NULL if retained) |
| exit_reason | VARCHAR(50) | Self-reported reason for leaving |
SQL Analysis
Query 1: Churn Rate by Customer Segment
Calculate churn rates across customer segments and demographics to identify which groups are leaving at the highest rates.
SELECT
c.segment,
COUNT(*) AS total_customers,
SUM(cl.churned) AS churned_count,
ROUND(SUM(cl.churned) * 100.0 / COUNT(*), 2) AS churn_rate_pct,
ROUND(AVG(c.tenure_months), 1) AS avg_tenure,
ROUND(AVG(c.age), 1) AS avg_age
FROM customers c
JOIN churn_labels cl ON c.customer_id = cl.customer_id
GROUP BY c.segment
ORDER BY churn_rate_pct DESC;
Query 2: Activity Patterns of Churners vs Retained
Compare the transaction behavior of customers who churned against those who stayed. Declining activity often signals imminent churn.
WITH customer_activity AS (
SELECT
a.customer_id,
cl.churned,
COUNT(t.txn_id) AS total_txns,
COUNT(DISTINCT DATE_FORMAT(t.txn_date, '%Y-%m')) AS active_months,
ROUND(AVG(t.amount), 2) AS avg_txn_amount,
DATEDIFF(
COALESCE(cl.churn_date, CURDATE()),
MAX(t.txn_date)
) AS days_since_last_txn,
SUM(a.balance) AS total_balance,
COUNT(DISTINCT a.account_id) AS num_accounts
FROM customers cust
JOIN accounts a ON cust.customer_id = a.customer_id
LEFT JOIN transactions t ON a.account_id = t.account_id
JOIN churn_labels cl ON cust.customer_id = cl.customer_id
GROUP BY a.customer_id, cl.churned
)
SELECT
CASE WHEN churned = 1 THEN 'Churned' ELSE 'Retained' END AS status,
COUNT(*) AS customers,
ROUND(AVG(total_txns), 1) AS avg_transactions,
ROUND(AVG(active_months), 1) AS avg_active_months,
ROUND(AVG(avg_txn_amount), 2) AS avg_txn_size,
ROUND(AVG(days_since_last_txn), 0) AS avg_days_inactive,
ROUND(AVG(total_balance), 2) AS avg_balance,
ROUND(AVG(num_accounts), 1) AS avg_accounts
FROM customer_activity
GROUP BY churned;
Query 3: Complaint Correlation with Churn
Determine whether filing complaints is associated with higher churn rates, and identify which complaint categories are most predictive.
SELECT
COALESCE(comp.category, 'No Complaints') AS complaint_category,
COUNT(DISTINCT c.customer_id) AS customers,
SUM(cl.churned) AS churned,
ROUND(SUM(cl.churned) * 100.0 / COUNT(DISTINCT c.customer_id), 2) AS churn_rate_pct,
ROUND(AVG(comp.resolution_days), 1) AS avg_resolution_days
FROM customers c
JOIN churn_labels cl ON c.customer_id = cl.customer_id
LEFT JOIN complaints comp ON c.customer_id = comp.customer_id
GROUP BY COALESCE(comp.category, 'No Complaints')
ORDER BY churn_rate_pct DESC;
Python Analysis
Data Preparation
Merge all tables into a single customer-level feature matrix. Each row represents one customer with all their aggregated behavioral metrics.
import pandas as pd
import numpy as np
customers = pd.read_csv('customers.csv')
accounts = pd.read_csv('accounts.csv')
transactions = pd.read_csv('transactions.csv', parse_dates=['txn_date'])
complaints = pd.read_csv('complaints.csv')
churn = pd.read_csv('churn_labels.csv')
# Aggregate account-level features per customer
acct_features = accounts.groupby('customer_id').agg(
num_accounts=('account_id', 'count'),
total_balance=('balance', 'sum'),
avg_balance=('balance', 'mean'),
total_monthly_fees=('monthly_fee', 'sum'),
has_credit_card=('account_type', lambda x: int('Credit Card' in x.values)),
has_mortgage=('account_type', lambda x: int('Mortgage' in x.values))
).reset_index()
# Aggregate transaction-level features per customer
txn_features = transactions.merge(accounts[['account_id', 'customer_id']], on='account_id')
txn_agg = txn_features.groupby('customer_id').agg(
total_txns=('txn_id', 'count'),
avg_txn_amount=('amount', 'mean'),
total_txn_amount=('amount', 'sum'),
last_txn_date=('txn_date', 'max'),
txn_months=('txn_date', lambda x: x.dt.to_period('M').nunique())
).reset_index()
# Complaint features
comp_features = complaints.groupby('customer_id').agg(
num_complaints=('complaint_id', 'count'),
avg_resolution_days=('resolution_days', 'mean'),
has_unresolved=('resolution_days', lambda x: int(x.isna().any()))
).reset_index()
# Merge everything
df = customers.merge(churn, on='customer_id')
df = df.merge(acct_features, on='customer_id', how='left')
df = df.merge(txn_agg, on='customer_id', how='left')
df = df.merge(comp_features, on='customer_id', how='left')
# Fill missing values (customers with no complaints or transactions)
df['num_complaints'] = df['num_complaints'].fillna(0)
df['avg_resolution_days'] = df['avg_resolution_days'].fillna(0)
df['has_unresolved'] = df['has_unresolved'].fillna(0)
df['total_txns'] = df['total_txns'].fillna(0)
# Encode segment
df['segment_encoded'] = df['segment'].map({'Mass': 0, 'Affluent': 1, 'Premier': 2})
df['gender_encoded'] = df['gender'].map({'M': 0, 'F': 1})
print(f"Dataset: {len(df)} customers, {df['churned'].sum()} churned ({df['churned'].mean()*100:.1f}%)")
Random Forest Churn Prediction
Random Forest is well-suited for churn prediction because it handles mixed feature types, captures non-linear relationships, and provides built-in feature importance rankings.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_auc_score
features = ['age', 'tenure_months', 'segment_encoded', 'gender_encoded',
'num_accounts', 'total_balance', 'total_monthly_fees',
'has_credit_card', 'has_mortgage', 'total_txns',
'avg_txn_amount', 'txn_months', 'num_complaints',
'avg_resolution_days', 'has_unresolved']
X = df[features].fillna(0)
y = df['churned']
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.25, random_state=42, stratify=y
)
rf = RandomForestClassifier(
n_estimators=200,
max_depth=10,
min_samples_leaf=20,
class_weight='balanced',
random_state=42
)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
y_prob = rf.predict_proba(X_test)[:, 1]
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=['Retained', 'Churned']))
print(f"ROC AUC: {roc_auc_score(y_test, y_prob):.3f}")
Feature Importance Analysis
Identify which factors most strongly predict churn. This tells the retention team where to focus their efforts.
import matplotlib.pyplot as plt
importances = pd.DataFrame({
'feature': features,
'importance': rf.feature_importances_
}).sort_values('importance', ascending=True)
plt.figure(figsize=(10, 8))
plt.barh(importances['feature'], importances['importance'], color='steelblue')
plt.xlabel('Feature Importance (Gini)')
plt.title('Churn Prediction - Feature Importance')
plt.tight_layout()
plt.savefig('feature_importance_churn.png', dpi=150)
plt.show()
print("\nTop 5 churn drivers:")
for _, row in importances.tail(5).iterrows():
print(f" {row['feature']}: {row['importance']:.4f}")
Retention Value Analysis
Score all active customers with their churn probability and calculate the expected revenue at risk. This prioritizes retention outreach by financial impact.
# Score all customers (not just test set)
df['churn_probability'] = rf.predict_proba(X)[:, 1]
# Estimate annual revenue per customer (monthly fees + estimated transaction revenue)
df['est_annual_revenue'] = (df['total_monthly_fees'] * 12) + (df['total_txn_amount'] * 0.02)
# Revenue at risk = probability of churn x annual revenue
df['revenue_at_risk'] = df['churn_probability'] * df['est_annual_revenue']
# High-risk customers (churn probability > 60%, still active)
active_customers = df[df['churned'] == 0].copy()
high_risk = active_customers[active_customers['churn_probability'] > 0.6]
high_risk = high_risk.sort_values('revenue_at_risk', ascending=False)
print(f"Active customers: {len(active_customers)}")
print(f"High-risk customers (>60% churn prob): {len(high_risk)}")
print(f"Total revenue at risk: ${high_risk['revenue_at_risk'].sum():,.2f}")
print(f"\nTop 10 highest-value at-risk customers:")
print(high_risk[['customer_id', 'segment', 'churn_probability',
'est_annual_revenue', 'revenue_at_risk']].head(10).to_string(index=False))
Excel Analysis
=(revenue_retained - total_cost) / total_cost.Key Insights
Inactivity is the Strongest Signal
Customers who reduce transaction frequency by more than 50% in the 3 months before churning show a clear disengagement pattern. Monitoring transaction velocity changes provides an early warning system, often 60-90 days before the customer actually closes their accounts.
Product Stickiness
Customers with 3+ products (checking + savings + credit card) churn at less than half the rate of single-product customers. Cross-selling is not just a revenue strategy; it is the strongest retention lever the bank has.
Complaints are Opportunities
Customers who complain and get fast resolution (<3 days) actually churn at lower rates than customers who never complain. The problem is not complaints themselves but unresolved complaints, especially those taking more than 7 days.
Retention ROI is Substantial
If a $50 fee waiver retains even 20% of high-risk customers, the revenue preserved far exceeds the cost. The cost-benefit analysis typically shows 5-10x ROI on targeted retention programs, making it one of the highest-impact investments the bank can make.
Quiz
Why is Random Forest a good choice for churn prediction?
What does "revenue at risk" represent in the retention analysis?
According to the SQL analysis, what behavioral pattern most distinguishes churners from retained customers?
Why is stratify=y used in the train_test_split?
Why do customers with 3+ bank products churn at lower rates than single-product customers?