Credit Risk Assessment
A commercial bank needs to evaluate loan applications using historical data to predict which borrowers are likely to default. You will build a credit scoring model that segments applicants by risk level, identifies the strongest predictors of default, and creates an actionable scorecard for the lending team.
Project Overview
Business Scenario
First National Bank processes over 5,000 personal loan applications per month. Its current approval process relies heavily on manual review by credit analysts, leading to inconsistent decisions and a default rate of 12%. The Chief Risk Officer wants a data-driven credit scoring system that standardizes the evaluation process, reduces defaults to below 8%, and speeds up decision-making from 5 days to same-day approval for low-risk applicants.
- Goal: Build a predictive credit scoring model and risk segmentation framework
- Tools: SQL (risk segmentation), Python (logistic regression model), Excel (scorecard dashboard)
- Stakeholder: Chief Risk Officer and lending team
- Deliverables: Default rate analysis, predictive model with AUC above 0.75, credit scorecard, risk matrix
Dataset
Database Schema
The credit database contains loan application details, credit histories, employment records, and actual loan outcomes for model training.
loan_applications
| Column | Type | Description |
|---|---|---|
| application_id | INT (PK) | Unique application identifier |
| applicant_id | INT (FK) | Links to credit_history and employment |
| loan_amount | DECIMAL(12,2) | Requested loan amount |
| loan_purpose | VARCHAR(50) | Purpose (debt_consolidation, home, education, etc.) |
| interest_rate | DECIMAL(5,2) | Assigned interest rate (%) |
| term_months | INT | Loan term in months (36 or 60) |
| application_date | DATE | Date of application |
credit_history
| Column | Type | Description |
|---|---|---|
| applicant_id | INT (PK) | Applicant identifier |
| credit_score | INT | FICO credit score (300-850) |
| num_open_accounts | INT | Number of open credit accounts |
| total_debt | DECIMAL(12,2) | Total outstanding debt |
| credit_utilization | DECIMAL(5,2) | Credit utilization ratio (0-1) |
| num_late_payments | INT | Late payments in last 24 months |
| credit_history_years | DECIMAL(4,1) | Length of credit history in years |
employment
| Column | Type | Description |
|---|---|---|
| applicant_id | INT (PK) | Applicant identifier |
| annual_income | DECIMAL(12,2) | Verified annual income |
| employment_length | INT | Years at current employer |
| employment_type | VARCHAR(20) | Full-time, Part-time, Self-employed, Unemployed |
| debt_to_income | DECIMAL(5,2) | Monthly debt payments / monthly income |
loan_outcomes
| Column | Type | Description |
|---|---|---|
| application_id | INT (PK/FK) | Links to loan_applications |
| status | VARCHAR(20) | Fully Paid, Default, Current, Late |
| amount_paid | DECIMAL(12,2) | Total amount repaid to date |
| default_date | DATE | Date of default (NULL if not defaulted) |
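The Python section below reads a single credit_data.csv; a minimal pandas sketch of how the four tables combine into that flat file (tiny in-memory stand-ins here, since the real tables live in the database):

```python
import pandas as pd

# Tiny illustrative stand-ins for the four tables (values are made up)
apps = pd.DataFrame({'application_id': [1, 2], 'applicant_id': [10, 11],
                     'loan_amount': [8000.0, 20000.0]})
credit = pd.DataFrame({'applicant_id': [10, 11], 'credit_score': [710, 640]})
emp = pd.DataFrame({'applicant_id': [10, 11], 'annual_income': [52000.0, 61000.0]})
outcomes = pd.DataFrame({'application_id': [1, 2], 'status': ['Fully Paid', 'Default']})

# Mirror the SQL joins: applications -> credit_history -> employment -> loan_outcomes
df = (apps.merge(credit, on='applicant_id')
          .merge(emp, on='applicant_id')
          .merge(outcomes, on='application_id'))

df.to_csv('credit_data.csv', index=False)
print(df.shape)  # (2, 6)
```

In practice you would export the result of the equivalent SQL join rather than rebuild it in pandas; the sketch just makes the table relationships concrete.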
SQL Analysis
Query 1: Default Rates by Credit Score Range
Segment applicants into credit score buckets and calculate the default rate for each range. This is the foundation of any credit risk model.
SELECT
    CASE
        WHEN ch.credit_score >= 750 THEN 'Excellent (750+)'
        WHEN ch.credit_score >= 700 THEN 'Good (700-749)'
        WHEN ch.credit_score >= 650 THEN 'Fair (650-699)'
        WHEN ch.credit_score >= 600 THEN 'Poor (600-649)'
        ELSE 'Very Poor (<600)'
    END AS score_range,
    COUNT(*) AS total_applications,
    SUM(CASE WHEN lo.status = 'Default' THEN 1 ELSE 0 END) AS defaults,
    ROUND(SUM(CASE WHEN lo.status = 'Default' THEN 1 ELSE 0 END) * 100.0
        / COUNT(*), 2) AS default_rate_pct,
    ROUND(AVG(la.loan_amount), 2) AS avg_loan_amount
FROM loan_applications la
JOIN credit_history ch ON la.applicant_id = ch.applicant_id
JOIN loan_outcomes lo ON la.application_id = lo.application_id
GROUP BY score_range  -- MySQL/PostgreSQL accept the alias; repeat the CASE expression in engines that do not
ORDER BY default_rate_pct DESC;
Query 2: Loan Amount Distribution and Risk
Analyze how loan size correlates with default probability. Larger loans may carry higher risk depending on the borrower profile.
SELECT
    CASE
        WHEN la.loan_amount < 5000 THEN 'Under $5K'
        WHEN la.loan_amount < 15000 THEN '$5K - $15K'
        WHEN la.loan_amount < 30000 THEN '$15K - $30K'
        ELSE 'Over $30K'
    END AS loan_bucket,
    COUNT(*) AS num_loans,
    ROUND(AVG(la.loan_amount), 2) AS avg_amount,
    ROUND(AVG(la.interest_rate), 2) AS avg_rate,
    ROUND(SUM(CASE WHEN lo.status = 'Default' THEN 1 ELSE 0 END) * 100.0
        / COUNT(*), 2) AS default_rate_pct
FROM loan_applications la
JOIN loan_outcomes lo ON la.application_id = lo.application_id
GROUP BY loan_bucket
ORDER BY avg_amount;
Query 3: Multi-Factor Risk Segmentation
Combine credit score, debt-to-income ratio, and employment type to create granular risk segments. This identifies the riskiest applicant profiles.
SELECT
    CASE WHEN ch.credit_score >= 700 THEN 'High Score' ELSE 'Low Score' END AS score_tier,
    CASE WHEN e.debt_to_income <= 0.35 THEN 'Low DTI' ELSE 'High DTI' END AS dti_tier,
    e.employment_type,
    COUNT(*) AS applicants,
    ROUND(SUM(CASE WHEN lo.status = 'Default' THEN 1 ELSE 0 END) * 100.0
        / COUNT(*), 2) AS default_rate_pct,
    ROUND(AVG(la.loan_amount), 0) AS avg_loan
FROM loan_applications la
JOIN credit_history ch ON la.applicant_id = ch.applicant_id
JOIN employment e ON la.applicant_id = e.applicant_id
JOIN loan_outcomes lo ON la.application_id = lo.application_id
GROUP BY score_tier, dti_tier, e.employment_type
HAVING COUNT(*) >= 20
ORDER BY default_rate_pct DESC;
Python Analysis
Data Preparation and Feature Importance
Load the joined dataset, engineer features, and identify which variables most strongly predict default.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
# Load data (joined from SQL)
df = pd.read_csv('credit_data.csv')
# Create target variable: 1 = Default, 0 = Not Default
df['is_default'] = (df['status'] == 'Default').astype(int)
# Feature engineering
df['loan_to_income'] = df['loan_amount'] / df['annual_income']
df['monthly_payment'] = df['loan_amount'] * (df['interest_rate']/100/12) \
    / (1 - (1 + df['interest_rate']/100/12) ** -df['term_months'])
df['payment_to_income'] = df['monthly_payment'] / (df['annual_income'] / 12)
# Encode categorical variables (one encoder per column, so each mapping
# can be inverted later with inverse_transform)
le_employment = LabelEncoder()
df['employment_encoded'] = le_employment.fit_transform(df['employment_type'])
le_purpose = LabelEncoder()
df['purpose_encoded'] = le_purpose.fit_transform(df['loan_purpose'])
features = ['credit_score', 'credit_utilization', 'num_late_payments',
            'debt_to_income', 'loan_to_income', 'payment_to_income',
            'annual_income', 'employment_length', 'credit_history_years',
            'employment_encoded', 'purpose_encoded']
X = df[features]
y = df['is_default']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
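The amortized-payment formula used for monthly_payment above is easy to sanity-check against a known case; a standalone version (loan values chosen purely for illustration):

```python
def monthly_payment(principal, annual_rate_pct, term_months):
    """Standard amortized-loan payment: P * r / (1 - (1 + r)**-n)."""
    r = annual_rate_pct / 100 / 12  # monthly periodic rate
    return principal * r / (1 - (1 + r) ** -term_months)

# $10,000 at 12% APR over 36 months works out to about $332.14/month
print(round(monthly_payment(10_000, 12.0, 36), 2))  # 332.14
```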
Logistic Regression Credit Scoring Model
Train a logistic regression model, which is the industry standard for credit scoring because its coefficients are directly interpretable as risk weights.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
model = LogisticRegression(max_iter=1000, class_weight='balanced', random_state=42)
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)
y_prob = model.predict_proba(X_test_scaled)[:, 1]
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=['No Default', 'Default']))
# Feature importance (coefficient magnitudes)
importance = pd.DataFrame({
    'feature': features,
    'coefficient': model.coef_[0]
}).sort_values('coefficient', key=abs, ascending=False)
print("\nFeature Importance (by coefficient magnitude):")
print(importance.to_string(index=False))
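The deliverables call for a credit scorecard, and one common way to turn model probabilities into scorecard points is points-to-double-odds (PDO) scaling. The anchor values below (600 points at 50:1 good:bad odds, PDO of 20) are illustrative assumptions, not figures from the brief:

```python
import math

BASE_SCORE = 600   # assumed score assigned at BASE_ODDS
BASE_ODDS = 50     # assumed good:bad odds at the base score
PDO = 20           # assumed points needed to double the odds

FACTOR = PDO / math.log(2)
OFFSET = BASE_SCORE - FACTOR * math.log(BASE_ODDS)

def probability_to_score(p_default):
    """Map a model default probability to a scorecard score."""
    odds_good = (1 - p_default) / p_default
    return OFFSET + FACTOR * math.log(odds_good)

print(round(probability_to_score(1 / 51)))   # 600: the base score at 50:1 odds
print(round(probability_to_score(1 / 101)))  # 620: doubling the odds adds PDO points
```

In a full scorecard each feature's coefficient would be allocated its own share of the points, but the probability-to-score mapping above is the core of the translation.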
Confusion Matrix Visualization
Visualize model performance. In credit risk, false negatives (approving loans that default) are more costly than false positives (rejecting good borrowers).
import matplotlib.pyplot as plt
import seaborn as sns
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['No Default', 'Default'],
            yticklabels=['No Default', 'Default'])
plt.title('Credit Risk Model - Confusion Matrix')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.tight_layout()
plt.savefig('confusion_matrix.png', dpi=150)
plt.show()
ROC Curve
The ROC curve shows the tradeoff between catching defaults (True Positive Rate) and incorrectly flagging good borrowers (False Positive Rate). An AUC above 0.75 is considered acceptable for credit models.
from sklearn.metrics import roc_curve, auc
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC Curve (AUC = {roc_auc:.3f})')
plt.plot([0, 1], [0, 1], color='gray', lw=1, linestyle='--', label='Random Classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve - Credit Default Prediction')
plt.legend(loc='lower right')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('roc_curve.png', dpi=150)
plt.show()
# Find optimal threshold (maximizes Youden's J statistic)
j_scores = tpr - fpr
optimal_idx = np.argmax(j_scores)
optimal_threshold = thresholds[optimal_idx]
print(f"Optimal probability threshold: {optimal_threshold:.3f}")
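One way the lending team could operationalize the threshold is a three-tier decision rule: same-day approval below a low cutoff, manual review in the middle band, and decline above a high cutoff. The cutoffs below are placeholders the Chief Risk Officer would set, not model outputs:

```python
def decision_tier(p_default, approve_below=0.08, decline_above=0.30):
    """Route an application by predicted default probability.

    Cutoffs are illustrative; in practice they reflect the bank's risk
    appetite and the relative cost of each error type.
    """
    if p_default < approve_below:
        return 'Approve (same-day)'
    if p_default > decline_above:
        return 'Decline'
    return 'Manual review'

for p in (0.03, 0.15, 0.45):
    print(p, decision_tier(p))
```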
Excel Analysis
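The dashboard's precision and recall cells can be sanity-checked in Python before wiring up the Excel formulas; a small sketch with illustrative confusion-matrix counts (not results from the trained model):

```python
# Illustrative confusion-matrix counts -- replace with your model's values
TP, FP, FN, TN = 90, 60, 30, 820

precision = TP / (TP + FP)  # of predicted defaults, how many actually defaulted
recall = TP / (TP + FN)     # of actual defaults, how many the model caught

print(f"Precision: {precision:.2f}")  # 0.60
print(f"Recall:    {recall:.2f}")     # 0.75
```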
Build the scorecard dashboard in Excel. Count applicants in each risk segment with =COUNTIFS() formulas, then compute model performance from the confusion matrix: =TP/(TP+FP) for precision and =TP/(TP+FN) for recall. Display the results in a clean summary card format.
Key Insights
Credit Score is Not Everything
While credit score is the strongest single predictor, combining it with debt-to-income ratio and employment stability creates far more accurate risk segments. A borrower with a 720 score but 55% DTI may be riskier than one with a 680 score and 20% DTI.
Class Imbalance Matters
Only 12% of loans default, creating an imbalanced dataset. Using class_weight='balanced' in the model ensures the algorithm does not simply predict "no default" for everyone and claim 88% accuracy.
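The weights that class_weight='balanced' implies can be inspected directly; with a 12% default rate the minority class is up-weighted roughly 4x (synthetic labels below, matching that rate):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Synthetic labels with a 12% default rate: 880 good loans, 120 defaults
y = np.array([0] * 880 + [1] * 120)

# balanced weight = n_samples / (n_classes * class_count)
weights = compute_class_weight(class_weight='balanced',
                               classes=np.array([0, 1]), y=y)
print(weights.round(3))  # [0.568 4.167]
```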
Threshold is a Business Decision
The optimal probability threshold depends on the bank's risk appetite. A lower threshold catches more defaults but rejects more good borrowers. The ROC curve helps visualize this tradeoff so stakeholders can choose the right balance.
Interpretability is Required
Financial regulators require that credit decisions be explainable. Logistic regression is preferred over black-box models because each coefficient directly translates to a risk factor that can be communicated to applicants and auditors.
Quiz
Why is class_weight='balanced' used in the logistic regression model?
In credit risk modeling, which type of error is typically more costly for the bank?
What does AUC (Area Under the Curve) represent in the ROC analysis?
Why is logistic regression preferred over neural networks for credit scoring in regulated banking?
What does the debt-to-income (DTI) ratio measure?