Credit Risk Assessment
A commercial bank needs to evaluate loan applications using historical data to predict which borrowers are likely to default. You will build a credit scoring model that segments applicants by risk level, identifies the strongest predictors of default, and creates an actionable scorecard for the lending team.
Project Overview
Business Scenario
First National Bank processes over 5,000 personal loan applications per month. Its current approval process relies heavily on manual review by credit analysts, leading to inconsistent decisions and a default rate of 12%. The Chief Risk Officer wants a data-driven credit scoring system that standardizes the evaluation process, reduces defaults to below 8%, and speeds up decision-making from 5 days to same-day approval for low-risk applicants.
- Goal: Build a predictive credit scoring model and risk segmentation framework
- Tools: SQL (risk segmentation), Python (logistic regression model), Excel (scorecard dashboard)
- Stakeholder: Chief Risk Officer and lending team
- Deliverables: Default rate analysis, predictive model with AUC above 0.75, credit scorecard, risk matrix
Dataset
Database Schema
The credit database contains loan application details, credit histories, employment records, and actual loan outcomes for model training.
loan_applications
| Column | Type | Description |
|---|---|---|
| application_id | INT (PK) | Unique application identifier |
| applicant_id | INT (FK) | Links to credit_history and employment |
| loan_amount | DECIMAL(12,2) | Requested loan amount |
| loan_purpose | VARCHAR(50) | Purpose (debt_consolidation, home, education, etc.) |
| interest_rate | DECIMAL(5,2) | Assigned interest rate (%) |
| term_months | INT | Loan term in months (36 or 60) |
| application_date | DATE | Date of application |
credit_history
| Column | Type | Description |
|---|---|---|
| applicant_id | INT (PK) | Applicant identifier |
| credit_score | INT | FICO credit score (300-850) |
| num_open_accounts | INT | Number of open credit accounts |
| total_debt | DECIMAL(12,2) | Total outstanding debt |
| credit_utilization | DECIMAL(5,2) | Credit utilization ratio (0-1) |
| num_late_payments | INT | Late payments in last 24 months |
| credit_history_years | DECIMAL(4,1) | Length of credit history in years |
employment
| Column | Type | Description |
|---|---|---|
| applicant_id | INT (PK) | Applicant identifier |
| annual_income | DECIMAL(12,2) | Verified annual income |
| employment_length | INT | Years at current employer |
| employment_type | VARCHAR(20) | Full-time, Part-time, Self-employed, Unemployed |
| debt_to_income | DECIMAL(5,2) | Monthly debt payments / monthly income |
loan_outcomes
| Column | Type | Description |
|---|---|---|
| application_id | INT (PK/FK) | Links to loan_applications |
| status | VARCHAR(20) | Fully Paid, Default, Current, Late |
| amount_paid | DECIMAL(12,2) | Total amount repaid to date |
| default_date | DATE | Date of default (NULL if not defaulted) |
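The Python section below reads a single credit_data.csv; a minimal pandas sketch of how the four tables combine into that flat file (tiny in-memory stand-ins here, since the real tables live in the database):

```python
import pandas as pd

# Tiny illustrative stand-ins for the four tables (values are made up)
apps = pd.DataFrame({'application_id': [1, 2], 'applicant_id': [10, 11],
                     'loan_amount': [8000.0, 20000.0]})
credit = pd.DataFrame({'applicant_id': [10, 11], 'credit_score': [710, 640]})
emp = pd.DataFrame({'applicant_id': [10, 11], 'annual_income': [52000.0, 61000.0]})
outcomes = pd.DataFrame({'application_id': [1, 2], 'status': ['Fully Paid', 'Default']})

# Mirror the SQL joins: applications -> credit_history -> employment -> loan_outcomes
df = (apps.merge(credit, on='applicant_id')
          .merge(emp, on='applicant_id')
          .merge(outcomes, on='application_id'))

df.to_csv('credit_data.csv', index=False)
print(df.shape)  # (2, 6)
```

In practice you would export the result of the equivalent SQL join rather than rebuild it in pandas; the sketch just makes the table relationships concrete.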
SQL Analysis
Query 1: Default Rates by Credit Score Range
Segment applicants into credit score buckets and calculate the default rate for each range. This is the foundation of any credit risk model.
SELECT
    CASE
        WHEN ch.credit_score >= 750 THEN 'Excellent (750+)'
        WHEN ch.credit_score >= 700 THEN 'Good (700-749)'
        WHEN ch.credit_score >= 650 THEN 'Fair (650-699)'
        WHEN ch.credit_score >= 600 THEN 'Poor (600-649)'
        ELSE 'Very Poor (<600)'
    END AS score_range,
    COUNT(*) AS total_applications,
    SUM(CASE WHEN lo.status = 'Default' THEN 1 ELSE 0 END) AS defaults,
    ROUND(SUM(CASE WHEN lo.status = 'Default' THEN 1 ELSE 0 END) * 100.0
        / COUNT(*), 2) AS default_rate_pct,
    ROUND(AVG(la.loan_amount), 2) AS avg_loan_amount
FROM loan_applications la
JOIN credit_history ch ON la.applicant_id = ch.applicant_id
JOIN loan_outcomes lo ON la.application_id = lo.application_id
GROUP BY score_range  -- MySQL/PostgreSQL accept the alias; repeat the CASE expression in engines that do not
ORDER BY default_rate_pct DESC;
Query 2: Loan Amount Distribution and Risk
Analyze how loan size correlates with default probability. Larger loans may carry higher risk depending on the borrower profile.
SELECT
    CASE
        WHEN la.loan_amount < 5000 THEN 'Under $5K'
        WHEN la.loan_amount < 15000 THEN '$5K - $15K'
        WHEN la.loan_amount < 30000 THEN '$15K - $30K'
        ELSE 'Over $30K'
    END AS loan_bucket,
    COUNT(*) AS num_loans,
    ROUND(AVG(la.loan_amount), 2) AS avg_amount,
    ROUND(AVG(la.interest_rate), 2) AS avg_rate,
    ROUND(SUM(CASE WHEN lo.status = 'Default' THEN 1 ELSE 0 END) * 100.0
        / COUNT(*), 2) AS default_rate_pct
FROM loan_applications la
JOIN loan_outcomes lo ON la.application_id = lo.application_id
GROUP BY loan_bucket
ORDER BY avg_amount;
Query 3: Multi-Factor Risk Segmentation
Combine credit score, debt-to-income ratio, and employment type to create granular risk segments. This identifies the riskiest applicant profiles.
SELECT
    CASE WHEN ch.credit_score >= 700 THEN 'High Score' ELSE 'Low Score' END AS score_tier,
    CASE WHEN e.debt_to_income <= 0.35 THEN 'Low DTI' ELSE 'High DTI' END AS dti_tier,
    e.employment_type,
    COUNT(*) AS applicants,
    ROUND(SUM(CASE WHEN lo.status = 'Default' THEN 1 ELSE 0 END) * 100.0
        / COUNT(*), 2) AS default_rate_pct,
    ROUND(AVG(la.loan_amount), 0) AS avg_loan
FROM loan_applications la
JOIN credit_history ch ON la.applicant_id = ch.applicant_id
JOIN employment e ON la.applicant_id = e.applicant_id
JOIN loan_outcomes lo ON la.application_id = lo.application_id
GROUP BY score_tier, dti_tier, e.employment_type
HAVING COUNT(*) >= 20
ORDER BY default_rate_pct DESC;
Python Analysis
Data Preparation and Feature Importance
Load the joined dataset, engineer features, and identify which variables most strongly predict default.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
# Load data (joined from SQL)
df = pd.read_csv('credit_data.csv')
# Create target variable: 1 = Default, 0 = Not Default
df['is_default'] = (df['status'] == 'Default').astype(int)
# Feature engineering
df['loan_to_income'] = df['loan_amount'] / df['annual_income']
df['monthly_payment'] = df['loan_amount'] * (df['interest_rate']/100/12) \
    / (1 - (1 + df['interest_rate']/100/12) ** -df['term_months'])
df['payment_to_income'] = df['monthly_payment'] / (df['annual_income'] / 12)
# Encode categorical variables (one encoder per column, so each mapping
# can be inverted later with inverse_transform)
le_employment = LabelEncoder()
df['employment_encoded'] = le_employment.fit_transform(df['employment_type'])
le_purpose = LabelEncoder()
df['purpose_encoded'] = le_purpose.fit_transform(df['loan_purpose'])
features = ['credit_score', 'credit_utilization', 'num_late_payments',
            'debt_to_income', 'loan_to_income', 'payment_to_income',
            'annual_income', 'employment_length', 'credit_history_years',
            'employment_encoded', 'purpose_encoded']
X = df[features]
y = df['is_default']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
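The amortized-payment formula used for monthly_payment above is easy to sanity-check against a known case; a standalone version (loan values chosen purely for illustration):

```python
def monthly_payment(principal, annual_rate_pct, term_months):
    """Standard amortized-loan payment: P * r / (1 - (1 + r)**-n)."""
    r = annual_rate_pct / 100 / 12  # monthly periodic rate
    return principal * r / (1 - (1 + r) ** -term_months)

# $10,000 at 12% APR over 36 months works out to about $332.14/month
print(round(monthly_payment(10_000, 12.0, 36), 2))  # 332.14
```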
Logistic Regression Credit Scoring Model
Train a logistic regression model, which is the industry standard for credit scoring because its coefficients are directly interpretable as risk weights.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
model = LogisticRegression(max_iter=1000, class_weight='balanced', random_state=42)
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)
y_prob = model.predict_proba(X_test_scaled)[:, 1]
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=['No Default', 'Default']))
# Feature importance (coefficient magnitudes)
importance = pd.DataFrame({
    'feature': features,
    'coefficient': model.coef_[0]
}).sort_values('coefficient', key=abs, ascending=False)
print("\nFeature Importance (by coefficient magnitude):")
print(importance.to_string(index=False))
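The deliverables call for a credit scorecard, and one common way to turn model probabilities into scorecard points is points-to-double-odds (PDO) scaling. The anchor values below (600 points at 50:1 good:bad odds, PDO of 20) are illustrative assumptions, not figures from the brief:

```python
import math

BASE_SCORE = 600   # assumed score assigned at BASE_ODDS
BASE_ODDS = 50     # assumed good:bad odds at the base score
PDO = 20           # assumed points needed to double the odds

FACTOR = PDO / math.log(2)
OFFSET = BASE_SCORE - FACTOR * math.log(BASE_ODDS)

def probability_to_score(p_default):
    """Map a model default probability to a scorecard score."""
    odds_good = (1 - p_default) / p_default
    return OFFSET + FACTOR * math.log(odds_good)

print(round(probability_to_score(1 / 51)))   # 600: the base score at 50:1 odds
print(round(probability_to_score(1 / 101)))  # 620: doubling the odds adds PDO points
```

In a full scorecard each feature's coefficient would be allocated its own share of the points, but the probability-to-score mapping above is the core of the translation.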
Confusion Matrix Visualization
Visualize model performance. In credit risk, false negatives (approving loans that default) are more costly than false positives (rejecting good borrowers).
import matplotlib.pyplot as plt
import seaborn as sns
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['No Default', 'Default'],
            yticklabels=['No Default', 'Default'])
plt.title('Credit Risk Model - Confusion Matrix')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.tight_layout()
plt.savefig('confusion_matrix.png', dpi=150)
plt.show()
ROC Curve
The ROC curve shows the tradeoff between catching defaults (True Positive Rate) and incorrectly flagging good borrowers (False Positive Rate). An AUC above 0.75 is considered acceptable for credit models.
from sklearn.metrics import roc_curve, auc
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC Curve (AUC = {roc_auc:.3f})')
plt.plot([0, 1], [0, 1], color='gray', lw=1, linestyle='--', label='Random Classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve - Credit Default Prediction')
plt.legend(loc='lower right')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('roc_curve.png', dpi=150)
plt.show()
# Find optimal threshold (maximizes Youden's J statistic)
j_scores = tpr - fpr
optimal_idx = np.argmax(j_scores)
optimal_threshold = thresholds[optimal_idx]
print(f"Optimal probability threshold: {optimal_threshold:.3f}")
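One way the lending team could operationalize the threshold is a three-tier decision rule: same-day approval below a low cutoff, manual review in the middle band, and decline above a high cutoff. The cutoffs below are placeholders the Chief Risk Officer would set, not model outputs:

```python
def decision_tier(p_default, approve_below=0.08, decline_above=0.30):
    """Route an application by predicted default probability.

    Cutoffs are illustrative; in practice they reflect the bank's risk
    appetite and the relative cost of each error type.
    """
    if p_default < approve_below:
        return 'Approve (same-day)'
    if p_default > decline_above:
        return 'Decline'
    return 'Manual review'

for p in (0.03, 0.15, 0.45):
    print(p, decision_tier(p))
```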
Excel Analysis
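The dashboard's precision and recall cells can be sanity-checked in Python before wiring up the Excel formulas; a small sketch with illustrative confusion-matrix counts (not results from the trained model):

```python
# Illustrative confusion-matrix counts -- replace with your model's values
TP, FP, FN, TN = 90, 60, 30, 820

precision = TP / (TP + FP)  # of predicted defaults, how many actually defaulted
recall = TP / (TP + FN)     # of actual defaults, how many the model caught

print(f"Precision: {precision:.2f}")  # 0.60
print(f"Recall:    {recall:.2f}")     # 0.75
```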
Build the scorecard dashboard in Excel. Count applicants in each risk segment with =COUNTIFS() formulas, then compute model performance from the confusion matrix: =TP/(TP+FP) for precision and =TP/(TP+FN) for recall. Display the results in a clean summary card format.
Key Insights
Credit Score is Not Everything
While credit score is the strongest single predictor, combining it with debt-to-income ratio and employment stability creates far more accurate risk segments. A borrower with a 720 score but 55% DTI may be riskier than one with a 680 score and 20% DTI.
Class Imbalance Matters
Only 12% of loans default, creating an imbalanced dataset. Using class_weight='balanced' in the model ensures the algorithm does not simply predict "no default" for everyone and claim 88% accuracy.
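The weights that class_weight='balanced' implies can be inspected directly; with a 12% default rate the minority class is up-weighted roughly 4x (synthetic labels below, matching that rate):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Synthetic labels with a 12% default rate: 880 good loans, 120 defaults
y = np.array([0] * 880 + [1] * 120)

# balanced weight = n_samples / (n_classes * class_count)
weights = compute_class_weight(class_weight='balanced',
                               classes=np.array([0, 1]), y=y)
print(weights.round(3))  # [0.568 4.167]
```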
Threshold is a Business Decision
The optimal probability threshold depends on the bank's risk appetite. A lower threshold catches more defaults but rejects more good borrowers. The ROC curve helps visualize this tradeoff so stakeholders can choose the right balance.
Interpretability is Required
Financial regulators require that credit decisions be explainable. Logistic regression is preferred over black-box models because each coefficient directly translates to a risk factor that can be communicated to applicants and auditors.
Quiz
Why is class_weight='balanced' used in the logistic regression model?
In credit risk modeling, which type of error is typically more costly for the bank?
What does AUC (Area Under the Curve) represent in the ROC analysis?
Why is logistic regression preferred over neural networks for credit scoring in regulated banking?
What does the debt-to-income (DTI) ratio measure?