Bank Customer Churn Analysis

Project Overview

Business Scenario

Summit Bank has seen its customer attrition rate climb from 8% to 14% over the past two years. Acquiring a new customer costs 5-7x more than retaining an existing one, so even a modest improvement in retention has significant financial impact. The Head of Retail Banking wants to identify the key drivers of churn, predict which active customers are most at risk, and determine whether targeted retention offers (fee waivers, rate upgrades, personal outreach) are cost-effective.

Goal: Predict customer churn, identify root causes, and calculate the ROI of retention interventions
Tools: SQL (churn pattern analysis), Python (Random Forest prediction model), Excel (churn dashboard)
Stakeholder: Head of Retail Banking and Customer Retention Team
Deliverables: Churn prediction model, risk-scored customer list, segment analysis, retention cost-benefit report

Dataset

Database Schema

The churn analysis database links customer demographics, account details, transaction activity, complaint records, and confirmed churn labels.

customers

Column	Type	Description
customer_id	INT (PK)	Unique customer identifier
age	INT	Customer age
gender	VARCHAR(1)	M or F
tenure_months	INT	Months as a customer
segment	VARCHAR(20)	Mass, Affluent, Premier
city	VARCHAR(50)	Customer city
join_date	DATE	Date customer opened first account

accounts

Column	Type	Description
account_id	INT (PK)	Account identifier
customer_id	INT (FK)	Customer who owns the account
account_type	VARCHAR(20)	Checking, Savings, Credit Card, Mortgage
balance	DECIMAL(14,2)	Current account balance
monthly_fee	DECIMAL(6,2)	Monthly account maintenance fee
is_active	TINYINT	1 = active, 0 = closed

transactions

Column	Type	Description
txn_id	INT (PK)	Transaction identifier
account_id	INT (FK)	Account used
txn_date	DATE	Transaction date
txn_type	VARCHAR(20)	Deposit, Withdrawal, Transfer, Payment
amount	DECIMAL(10,2)	Transaction amount

complaints

Column	Type	Description
complaint_id	INT (PK)	Complaint identifier
customer_id	INT (FK)	Customer who filed complaint
complaint_date	DATE	Date filed
category	VARCHAR(30)	Fees, Service, App/Website, Wait Times, Other
resolution_days	INT	Days to resolve (NULL if unresolved)

churn_labels

Column	Type	Description
customer_id	INT (PK/FK)	Customer identifier
churned	TINYINT	1 = churned, 0 = retained
churn_date	DATE	Date all accounts were closed (NULL if retained)
exit_reason	VARCHAR(50)	Self-reported reason for leaving

SQL Analysis

Query 1: Churn Rate by Customer Segment

Calculate churn rates across customer segments and demographics to identify which groups are leaving at the highest rates.

SELECT
    c.segment,
    COUNT(*) AS total_customers,
    SUM(cl.churned) AS churned_count,
    ROUND(SUM(cl.churned) * 100.0 / COUNT(*), 2) AS churn_rate_pct,
    ROUND(AVG(c.tenure_months), 1) AS avg_tenure,
    ROUND(AVG(c.age), 1) AS avg_age
FROM customers c
JOIN churn_labels cl ON c.customer_id = cl.customer_id
GROUP BY c.segment
ORDER BY churn_rate_pct DESC;

Query 2: Activity Patterns of Churners vs Retained

Compare the transaction behavior of customers who churned against those who stayed. Declining activity often signals imminent churn.

WITH customer_activity AS (
    SELECT
        a.customer_id,
        cl.churned,
        COUNT(t.txn_id) AS total_txns,
        COUNT(DISTINCT DATE_FORMAT(t.txn_date, '%Y-%m')) AS active_months,
        ROUND(AVG(t.amount), 2) AS avg_txn_amount,
        DATEDIFF(
            COALESCE(cl.churn_date, CURDATE()),
            MAX(t.txn_date)
        ) AS days_since_last_txn,
        SUM(a.balance) AS total_balance,
        COUNT(DISTINCT a.account_id) AS num_accounts
    FROM customers cust
    JOIN accounts a ON cust.customer_id = a.customer_id
    LEFT JOIN transactions t ON a.account_id = t.account_id
    JOIN churn_labels cl ON cust.customer_id = cl.customer_id
    GROUP BY a.customer_id, cl.churned
)
SELECT
    CASE WHEN churned = 1 THEN 'Churned' ELSE 'Retained' END AS status,
    COUNT(*) AS customers,
    ROUND(AVG(total_txns), 1) AS avg_transactions,
    ROUND(AVG(active_months), 1) AS avg_active_months,
    ROUND(AVG(avg_txn_amount), 2) AS avg_txn_size,
    ROUND(AVG(days_since_last_txn), 0) AS avg_days_inactive,
    ROUND(AVG(total_balance), 2) AS avg_balance,
    ROUND(AVG(num_accounts), 1) AS avg_accounts
FROM customer_activity
GROUP BY churned;

Query 3: Complaint Correlation with Churn

Determine whether filing complaints is associated with higher churn rates, and identify which complaint categories are most predictive.

SELECT
    COALESCE(comp.category, 'No Complaints') AS complaint_category,
    COUNT(DISTINCT c.customer_id) AS customers,
    SUM(cl.churned) AS churned,
    ROUND(SUM(cl.churned) * 100.0 / COUNT(DISTINCT c.customer_id), 2) AS churn_rate_pct,
    ROUND(AVG(comp.resolution_days), 1) AS avg_resolution_days
FROM customers c
JOIN churn_labels cl ON c.customer_id = cl.customer_id
LEFT JOIN complaints comp ON c.customer_id = comp.customer_id
GROUP BY COALESCE(comp.category, 'No Complaints')
ORDER BY churn_rate_pct DESC;

Python Analysis

Data Preparation

Merge all tables into a single customer-level feature matrix. Each row represents one customer with all their aggregated behavioral metrics.

import pandas as pd
import numpy as np

customers = pd.read_csv('customers.csv')
accounts = pd.read_csv('accounts.csv')
transactions = pd.read_csv('transactions.csv', parse_dates=['txn_date'])
complaints = pd.read_csv('complaints.csv')
churn = pd.read_csv('churn_labels.csv')

# Aggregate account-level features per customer
acct_features = accounts.groupby('customer_id').agg(
    num_accounts=('account_id', 'count'),
    total_balance=('balance', 'sum'),
    avg_balance=('balance', 'mean'),
    total_monthly_fees=('monthly_fee', 'sum'),
    has_credit_card=('account_type', lambda x: int('Credit Card' in x.values)),
    has_mortgage=('account_type', lambda x: int('Mortgage' in x.values))
).reset_index()

# Aggregate transaction-level features per customer
txn_features = transactions.merge(accounts[['account_id', 'customer_id']], on='account_id')
txn_agg = txn_features.groupby('customer_id').agg(
    total_txns=('txn_id', 'count'),
    avg_txn_amount=('amount', 'mean'),
    total_txn_amount=('amount', 'sum'),
    last_txn_date=('txn_date', 'max'),
    txn_months=('txn_date', lambda x: x.dt.to_period('M').nunique())
).reset_index()

# Complaint features
comp_features = complaints.groupby('customer_id').agg(
    num_complaints=('complaint_id', 'count'),
    avg_resolution_days=('resolution_days', 'mean'),
    has_unresolved=('resolution_days', lambda x: int(x.isna().any()))
).reset_index()

# Merge everything
df = customers.merge(churn, on='customer_id')
df = df.merge(acct_features, on='customer_id', how='left')
df = df.merge(txn_agg, on='customer_id', how='left')
df = df.merge(comp_features, on='customer_id', how='left')

# Fill missing values (customers with no complaints or transactions)
df['num_complaints'] = df['num_complaints'].fillna(0)
df['avg_resolution_days'] = df['avg_resolution_days'].fillna(0)
df['has_unresolved'] = df['has_unresolved'].fillna(0)
df['total_txns'] = df['total_txns'].fillna(0)

# Encode segment
df['segment_encoded'] = df['segment'].map({'Mass': 0, 'Affluent': 1, 'Premier': 2})
df['gender_encoded'] = df['gender'].map({'M': 0, 'F': 1})

print(f"Dataset: {len(df)} customers, {df['churned'].sum()} churned ({df['churned'].mean()*100:.1f}%)")

Random Forest Churn Prediction

Random Forest is well-suited for churn prediction because it handles mixed feature types, captures non-linear relationships, and provides built-in feature importance rankings.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, roc_auc_score

features = ['age', 'tenure_months', 'segment_encoded', 'gender_encoded',
            'num_accounts', 'total_balance', 'total_monthly_fees',
            'has_credit_card', 'has_mortgage', 'total_txns',
            'avg_txn_amount', 'txn_months', 'num_complaints',
            'avg_resolution_days', 'has_unresolved']

X = df[features].fillna(0)
y = df['churned']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

rf = RandomForestClassifier(
    n_estimators=200,
    max_depth=10,
    min_samples_leaf=20,
    class_weight='balanced',
    random_state=42
)
rf.fit(X_train, y_train)

y_pred = rf.predict(X_test)
y_prob = rf.predict_proba(X_test)[:, 1]

print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=['Retained', 'Churned']))
print(f"ROC AUC: {roc_auc_score(y_test, y_prob):.3f}")

Feature Importance Analysis

Identify which factors most strongly predict churn. This tells the retention team where to focus their efforts.

import matplotlib.pyplot as plt

importances = pd.DataFrame({
    'feature': features,
    'importance': rf.feature_importances_
}).sort_values('importance', ascending=True)

plt.figure(figsize=(10, 8))
plt.barh(importances['feature'], importances['importance'], color='steelblue')
plt.xlabel('Feature Importance (Gini)')
plt.title('Churn Prediction - Feature Importance')
plt.tight_layout()
plt.savefig('feature_importance_churn.png', dpi=150)
plt.show()

print("\nTop 5 churn drivers:")
for _, row in importances.tail(5).iterrows():
    print(f"  {row['feature']}: {row['importance']:.4f}")

Retention Value Analysis

Score all active customers with their churn probability and calculate the expected revenue at risk. This prioritizes retention outreach by financial impact.

# Score all customers (not just test set)
df['churn_probability'] = rf.predict_proba(X)[:, 1]

# Estimate annual revenue per customer (monthly fees + estimated transaction revenue)
df['est_annual_revenue'] = (df['total_monthly_fees'] * 12) + (df['total_txn_amount'] * 0.02)

# Revenue at risk = probability of churn x annual revenue
df['revenue_at_risk'] = df['churn_probability'] * df['est_annual_revenue']

# High-risk customers (churn probability > 60%, still active)
active_customers = df[df['churned'] == 0].copy()
high_risk = active_customers[active_customers['churn_probability'] > 0.6]
high_risk = high_risk.sort_values('revenue_at_risk', ascending=False)

print(f"Active customers: {len(active_customers)}")
print(f"High-risk customers (>60% churn prob): {len(high_risk)}")
print(f"Total revenue at risk: ${high_risk['revenue_at_risk'].sum():,.2f}")
print(f"\nTop 10 highest-value at-risk customers:")
print(high_risk[['customer_id', 'segment', 'churn_probability',
                 'est_annual_revenue', 'revenue_at_risk']].head(10).to_string(index=False))

Excel Analysis

Churn Dashboard: Create a summary dashboard showing overall churn rate (large KPI card), churn rate by segment (bar chart), churn trend by month (line chart), and top exit reasons (horizontal bar chart). Use slicers to filter by segment and city.

Segment Comparison Table: Build a comparison table with rows for each segment (Mass, Affluent, Premier) and columns for: customer count, churn rate, average tenure, average balance, average complaints, and average transaction activity. Use conditional formatting to highlight the worst-performing metrics per segment.

At-Risk Customer List: Import the Python-scored customer list showing customer_id, segment, churn probability, and revenue at risk. Sort by revenue at risk descending. Add a "Retention Action" dropdown column (Call, Email, Fee Waiver, Rate Upgrade, No Action) for the retention team to log their planned intervention.

Retention Cost-Benefit Analysis: Create a scenario table showing: (a) cost of retention offer per customer (e.g., $50 fee waiver), (b) number of high-risk customers targeted, (c) expected retention improvement (assume 20% of targeted customers are saved), (d) revenue retained = saved customers x avg annual revenue. Calculate ROI as =(revenue_retained - total_cost) / total_cost.

Churn Waterfall Chart: Build a waterfall chart showing the customer base at the start of the year, new acquisitions, churned customers, and end-of-year base. This clearly visualizes the net customer growth or decline and makes the cost of churn tangible to executives.

Complaint Impact Analysis: Use a pivot table to cross-reference complaint category with churn outcome. Calculate the churn rate lift for each category (churn rate with complaint minus baseline churn rate). Identify which complaint types are the strongest churn signals.

Key Insights

Inactivity is the Strongest Signal

Customers who reduce transaction frequency by more than 50% in the 3 months before churning show a clear disengagement pattern. Monitoring transaction velocity changes provides an early warning system, often 60-90 days before the customer actually closes their accounts.

Product Stickiness

Customers with 3+ products (checking + savings + credit card) churn at less than half the rate of single-product customers. Cross-selling is not just a revenue strategy; it is the strongest retention lever the bank has.

Complaints are Opportunities

Customers who complain and get fast resolution (<3 days) actually churn at lower rates than customers who never complain. The problem is not complaints themselves but unresolved complaints, especially those taking more than 7 days.

Retention ROI is Substantial

If a $50 fee waiver retains even 20% of high-risk customers, the revenue preserved far exceeds the cost. The cost-benefit analysis typically shows 5-10x ROI on targeted retention programs, making it one of the highest-impact investments the bank can make.