Project 15

Bank Customer Churn Analysis

A retail bank is losing customers at an alarming rate and needs to understand why they leave and predict who is likely to churn next. You will analyze customer behavior patterns with SQL, build a churn prediction model with Random Forest in Python, and create an Excel dashboard that quantifies the business impact of retention efforts.

Project Overview

Business Scenario

Summit Bank has seen its customer attrition rate climb from 8% to 14% over the past two years. Acquiring a new customer costs 5-7x more than retaining an existing one, so even a modest improvement in retention has significant financial impact. The Head of Retail Banking wants to identify the key drivers of churn, predict which active customers are most at risk, and determine whether targeted retention offers (fee waivers, rate upgrades, personal outreach) are cost-effective.

  • Goal: Predict customer churn, identify root causes, and calculate the ROI of retention interventions
  • Tools: SQL (churn pattern analysis), Python (Random Forest prediction model), Excel (churn dashboard)
  • Stakeholder: Head of Retail Banking and Customer Retention Team
  • Deliverables: Churn prediction model, risk-scored customer list, segment analysis, retention cost-benefit report

Dataset

Database Schema

The churn analysis database links customer demographics, account details, transaction activity, complaint records, and confirmed churn labels.

customers

ColumnTypeDescription
customer_idINT (PK)Unique customer identifier
ageINTCustomer age
genderVARCHAR(1)M or F
tenure_monthsINTMonths as a customer
segmentVARCHAR(20)Mass, Affluent, Premier
cityVARCHAR(50)Customer city
join_dateDATEDate customer opened first account

accounts

ColumnTypeDescription
account_idINT (PK)Account identifier
customer_idINT (FK)Customer who owns the account
account_typeVARCHAR(20)Checking, Savings, Credit Card, Mortgage
balanceDECIMAL(14,2)Current account balance
monthly_feeDECIMAL(6,2)Monthly account maintenance fee
is_activeTINYINT1 = active, 0 = closed

transactions

ColumnTypeDescription
txn_idINT (PK)Transaction identifier
account_idINT (FK)Account used
txn_dateDATETransaction date
txn_typeVARCHAR(20)Deposit, Withdrawal, Transfer, Payment
amountDECIMAL(10,2)Transaction amount

complaints

ColumnTypeDescription
complaint_idINT (PK)Complaint identifier
customer_idINT (FK)Customer who filed complaint
complaint_dateDATEDate filed
categoryVARCHAR(30)Fees, Service, App/Website, Wait Times, Other
resolution_daysINTDays to resolve (NULL if unresolved)

churn_labels

ColumnTypeDescription
customer_idINT (PK/FK)Customer identifier
churnedTINYINT1 = churned, 0 = retained
churn_dateDATEDate all accounts were closed (NULL if retained)
exit_reasonVARCHAR(50)Self-reported reason for leaving

SQL Analysis

Query 1: Churn Rate by Customer Segment

Calculate churn rates across customer segments and demographics to identify which groups are leaving at the highest rates.

SELECT
    c.segment,
    COUNT(*) AS total_customers,
    SUM(cl.churned) AS churned_count,
    ROUND(SUM(cl.churned) * 100.0 / COUNT(*), 2) AS churn_rate_pct,
    ROUND(AVG(c.tenure_months), 1) AS avg_tenure,
    ROUND(AVG(c.age), 1) AS avg_age
FROM customers c
JOIN churn_labels cl ON c.customer_id = cl.customer_id
GROUP BY c.segment
ORDER BY churn_rate_pct DESC;

Query 2: Activity Patterns of Churners vs Retained

Compare the transaction behavior of customers who churned against those who stayed. Declining activity often signals imminent churn.

WITH customer_activity AS (
    SELECT
        a.customer_id,
        cl.churned,
        COUNT(t.txn_id) AS total_txns,
        COUNT(DISTINCT DATE_FORMAT(t.txn_date, '%Y-%m')) AS active_months,
        ROUND(AVG(t.amount), 2) AS avg_txn_amount,
        DATEDIFF(
            COALESCE(cl.churn_date, CURDATE()),
            MAX(t.txn_date)
        ) AS days_since_last_txn,
        SUM(a.balance) AS total_balance,
        COUNT(DISTINCT a.account_id) AS num_accounts
    FROM customers cust
    JOIN accounts a ON cust.customer_id = a.customer_id
    LEFT JOIN transactions t ON a.account_id = t.account_id
    JOIN churn_labels cl ON cust.customer_id = cl.customer_id
    GROUP BY a.customer_id, cl.churned
)
SELECT
    CASE WHEN churned = 1 THEN 'Churned' ELSE 'Retained' END AS status,
    COUNT(*) AS customers,
    ROUND(AVG(total_txns), 1) AS avg_transactions,
    ROUND(AVG(active_months), 1) AS avg_active_months,
    ROUND(AVG(avg_txn_amount), 2) AS avg_txn_size,
    ROUND(AVG(days_since_last_txn), 0) AS avg_days_inactive,
    ROUND(AVG(total_balance), 2) AS avg_balance,
    ROUND(AVG(num_accounts), 1) AS avg_accounts
FROM customer_activity
GROUP BY churned;

Query 3: Complaint Correlation with Churn

Determine whether filing complaints is associated with higher churn rates, and identify which complaint categories are most predictive.

SELECT
    COALESCE(comp.category, 'No Complaints') AS complaint_category,
    COUNT(DISTINCT c.customer_id) AS customers,
    SUM(cl.churned) AS churned,
    ROUND(SUM(cl.churned) * 100.0 / COUNT(DISTINCT c.customer_id), 2) AS churn_rate_pct,
    ROUND(AVG(comp.resolution_days), 1) AS avg_resolution_days
FROM customers c
JOIN churn_labels cl ON c.customer_id = cl.customer_id
LEFT JOIN complaints comp ON c.customer_id = comp.customer_id
GROUP BY COALESCE(comp.category, 'No Complaints')
ORDER BY churn_rate_pct DESC;

Python Analysis

Data Preparation

Merge all tables into a single customer-level feature matrix. Each row represents one customer with all their aggregated behavioral metrics.

import pandas as pd
import numpy as np

customers = pd.read_csv('customers.csv')
accounts = pd.read_csv('accounts.csv')
transactions = pd.read_csv('transactions.csv', parse_dates=['txn_date'])
complaints = pd.read_csv('complaints.csv')
churn = pd.read_csv('churn_labels.csv')

# Aggregate account-level features per customer
acct_features = accounts.groupby('customer_id').agg(
    num_accounts=('account_id', 'count'),
    total_balance=('balance', 'sum'),
    avg_balance=('balance', 'mean'),
    total_monthly_fees=('monthly_fee', 'sum'),
    has_credit_card=('account_type', lambda x: int('Credit Card' in x.values)),
    has_mortgage=('account_type', lambda x: int('Mortgage' in x.values))
).reset_index()

# Aggregate transaction-level features per customer
txn_features = transactions.merge(accounts[['account_id', 'customer_id']], on='account_id')
txn_agg = txn_features.groupby('customer_id').agg(
    total_txns=('txn_id', 'count'),
    avg_txn_amount=('amount', 'mean'),
    total_txn_amount=('amount', 'sum'),
    last_txn_date=('txn_date', 'max'),
    txn_months=('txn_date', lambda x: x.dt.to_period('M').nunique())
).reset_index()

# Complaint features
comp_features = complaints.groupby('customer_id').agg(
    num_complaints=('complaint_id', 'count'),
    avg_resolution_days=('resolution_days', 'mean'),
    has_unresolved=('resolution_days', lambda x: int(x.isna().any()))
).reset_index()

# Merge everything
df = customers.merge(churn, on='customer_id')
df = df.merge(acct_features, on='customer_id', how='left')
df = df.merge(txn_agg, on='customer_id', how='left')
df = df.merge(comp_features, on='customer_id', how='left')

# Fill missing values (customers with no complaints or transactions)
df['num_complaints'] = df['num_complaints'].fillna(0)
df['avg_resolution_days'] = df['avg_resolution_days'].fillna(0)
df['has_unresolved'] = df['has_unresolved'].fillna(0)
df['total_txns'] = df['total_txns'].fillna(0)

# Encode segment
df['segment_encoded'] = df['segment'].map({'Mass': 0, 'Affluent': 1, 'Premier': 2})
df['gender_encoded'] = df['gender'].map({'M': 0, 'F': 1})

print(f"Dataset: {len(df)} customers, {df['churned'].sum()} churned ({df['churned'].mean()*100:.1f}%)")

Random Forest Churn Prediction

Random Forest is well-suited for churn prediction because it handles mixed feature types, captures non-linear relationships, and provides built-in feature importance rankings.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_auc_score

features = ['age', 'tenure_months', 'segment_encoded', 'gender_encoded',
            'num_accounts', 'total_balance', 'total_monthly_fees',
            'has_credit_card', 'has_mortgage', 'total_txns',
            'avg_txn_amount', 'txn_months', 'num_complaints',
            'avg_resolution_days', 'has_unresolved']

X = df[features].fillna(0)
y = df['churned']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

rf = RandomForestClassifier(
    n_estimators=200,
    max_depth=10,
    min_samples_leaf=20,
    class_weight='balanced',
    random_state=42
)
rf.fit(X_train, y_train)

y_pred = rf.predict(X_test)
y_prob = rf.predict_proba(X_test)[:, 1]

print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=['Retained', 'Churned']))
print(f"ROC AUC: {roc_auc_score(y_test, y_prob):.3f}")

Feature Importance Analysis

Identify which factors most strongly predict churn. This tells the retention team where to focus their efforts.

import matplotlib.pyplot as plt

importances = pd.DataFrame({
    'feature': features,
    'importance': rf.feature_importances_
}).sort_values('importance', ascending=True)

plt.figure(figsize=(10, 8))
plt.barh(importances['feature'], importances['importance'], color='steelblue')
plt.xlabel('Feature Importance (Gini)')
plt.title('Churn Prediction - Feature Importance')
plt.tight_layout()
plt.savefig('feature_importance_churn.png', dpi=150)
plt.show()

print("\nTop 5 churn drivers:")
for _, row in importances.tail(5).iterrows():
    print(f"  {row['feature']}: {row['importance']:.4f}")

Retention Value Analysis

Score all active customers with their churn probability and calculate the expected revenue at risk. This prioritizes retention outreach by financial impact.

# Score all customers (not just test set)
df['churn_probability'] = rf.predict_proba(X)[:, 1]

# Estimate annual revenue per customer (monthly fees + estimated transaction revenue)
df['est_annual_revenue'] = (df['total_monthly_fees'] * 12) + (df['total_txn_amount'] * 0.02)

# Revenue at risk = probability of churn x annual revenue
df['revenue_at_risk'] = df['churn_probability'] * df['est_annual_revenue']

# High-risk customers (churn probability > 60%, still active)
active_customers = df[df['churned'] == 0].copy()
high_risk = active_customers[active_customers['churn_probability'] > 0.6]
high_risk = high_risk.sort_values('revenue_at_risk', ascending=False)

print(f"Active customers: {len(active_customers)}")
print(f"High-risk customers (>60% churn prob): {len(high_risk)}")
print(f"Total revenue at risk: ${high_risk['revenue_at_risk'].sum():,.2f}")
print(f"\nTop 10 highest-value at-risk customers:")
print(high_risk[['customer_id', 'segment', 'churn_probability',
                 'est_annual_revenue', 'revenue_at_risk']].head(10).to_string(index=False))

Excel Analysis

  • Churn Dashboard: Create a summary dashboard showing overall churn rate (large KPI card), churn rate by segment (bar chart), churn trend by month (line chart), and top exit reasons (horizontal bar chart). Use slicers to filter by segment and city.
  • Segment Comparison Table: Build a comparison table with rows for each segment (Mass, Affluent, Premier) and columns for: customer count, churn rate, average tenure, average balance, average complaints, and average transaction activity. Use conditional formatting to highlight the worst-performing metrics per segment.
  • At-Risk Customer List: Import the Python-scored customer list showing customer_id, segment, churn probability, and revenue at risk. Sort by revenue at risk descending. Add a "Retention Action" dropdown column (Call, Email, Fee Waiver, Rate Upgrade, No Action) for the retention team to log their planned intervention.
  • Retention Cost-Benefit Analysis: Create a scenario table showing: (a) cost of retention offer per customer (e.g., $50 fee waiver), (b) number of high-risk customers targeted, (c) expected retention improvement (assume 20% of targeted customers are saved), (d) revenue retained = saved customers x avg annual revenue. Calculate ROI as =(revenue_retained - total_cost) / total_cost.
  • Churn Waterfall Chart: Build a waterfall chart showing the customer base at the start of the year, new acquisitions, churned customers, and end-of-year base. This clearly visualizes the net customer growth or decline and makes the cost of churn tangible to executives.
  • Complaint Impact Analysis: Use a pivot table to cross-reference complaint category with churn outcome. Calculate the churn rate lift for each category (churn rate with complaint minus baseline churn rate). Identify which complaint types are the strongest churn signals.
  • Key Insights

    Inactivity is the Strongest Signal

    Customers who reduce transaction frequency by more than 50% in the 3 months before churning show a clear disengagement pattern. Monitoring transaction velocity changes provides an early warning system, often 60-90 days before the customer actually closes their accounts.

    Product Stickiness

    Customers with 3+ products (checking + savings + credit card) churn at less than half the rate of single-product customers. Cross-selling is not just a revenue strategy; it is the strongest retention lever the bank has.

    Complaints are Opportunities

    Customers who complain and get fast resolution (<3 days) actually churn at lower rates than customers who never complain. The problem is not complaints themselves but unresolved complaints, especially those taking more than 7 days.

    Retention ROI is Substantial

    If a $50 fee waiver retains even 20% of high-risk customers, the revenue preserved far exceeds the cost. The cost-benefit analysis typically shows 5-10x ROI on targeted retention programs, making it one of the highest-impact investments the bank can make.

    Quiz

    Question 1

    Why is Random Forest a good choice for churn prediction?

    Question 2

    What does "revenue at risk" represent in the retention analysis?

    Question 3

    According to the SQL analysis, what behavioral pattern most distinguishes churners from retained customers?

    Question 4

    Why is stratify=y used in the train_test_split?

    Question 5

    Why do customers with 3+ bank products churn at lower rates than single-product customers?

    ← Previous Project Next Project →