Project 12

Credit Risk Assessment

A commercial bank needs to evaluate loan applications using historical data to predict which borrowers are likely to default. You will build a credit scoring model that segments applicants by risk level, identifies the strongest predictors of default, and creates an actionable scorecard for the lending team.

Project Overview

Business Scenario

First National Bank processes over 5,000 personal loan applications per month. Their current approval process relies heavily on manual review by credit analysts, leading to inconsistent decisions and a default rate of 12%. The Chief Risk Officer wants a data-driven credit scoring system that standardizes the evaluation process, reduces defaults to below 8%, and speeds up decision-making from 5 days to same-day approval for low-risk applicants.

  • Goal: Build a predictive credit scoring model and risk segmentation framework
  • Tools: SQL (risk segmentation), Python (logistic regression model), Excel (scorecard dashboard)
  • Stakeholder: Chief Risk Officer and lending team
  • Deliverables: Default rate analysis, predictive model with 80%+ accuracy, credit scorecard, risk matrix

Dataset

Database Schema

The credit database contains loan application details, credit histories, employment records, and actual loan outcomes for model training.

loan_applications

Column            Type           Description
application_id    INT (PK)       Unique application identifier
applicant_id      INT (FK)       Links to credit_history and employment
loan_amount       DECIMAL(12,2)  Requested loan amount
loan_purpose      VARCHAR(50)    Purpose (debt_consolidation, home, education, etc.)
interest_rate     DECIMAL(5,2)   Assigned interest rate (%)
term_months       INT            Loan term in months (36 or 60)
application_date  DATE           Date of application

credit_history

Column                Type           Description
applicant_id          INT (PK)       Applicant identifier
credit_score          INT            FICO credit score (300-850)
num_open_accounts     INT            Number of open credit accounts
total_debt            DECIMAL(12,2)  Total outstanding debt
credit_utilization    DECIMAL(5,2)   Credit utilization ratio (0-1)
num_late_payments     INT            Late payments in last 24 months
credit_history_years  DECIMAL(4,1)   Length of credit history in years

employment

Column             Type           Description
applicant_id       INT (PK)       Applicant identifier
annual_income      DECIMAL(12,2)  Verified annual income
employment_length  INT            Years at current employer
employment_type    VARCHAR(20)    Full-time, Part-time, Self-employed, Unemployed
debt_to_income     DECIMAL(5,2)   Monthly debt payments / monthly income

loan_outcomes

Column          Type           Description
application_id  INT (PK/FK)    Links to loan_applications
status          VARCHAR(20)    Fully Paid, Default, Current, Late
amount_paid     DECIMAL(12,2)  Total amount repaid to date
default_date    DATE           Date of default (NULL if not defaulted)
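
If you prefer to assemble the modeling dataset in Python rather than SQL, the four tables can be joined with pandas. The sketch below is a minimal example and assumes each table has been exported to a CSV file named after the table; those file names are illustrative, not part of the provided dataset. The result is the credit_data.csv file used in the Python analysis section.

import pandas as pd

# Load each table (file names are assumed; adjust to your own exports)
applications = pd.read_csv('loan_applications.csv')
credit = pd.read_csv('credit_history.csv')
employment = pd.read_csv('employment.csv')
outcomes = pd.read_csv('loan_outcomes.csv')

# Join applications to credit history and employment on applicant_id,
# then to outcomes on application_id (mirrors the SQL joins below)
df = (applications
      .merge(credit, on='applicant_id', how='inner')
      .merge(employment, on='applicant_id', how='inner')
      .merge(outcomes, on='application_id', how='inner'))

# Save the joined dataset for the Python analysis section
df.to_csv('credit_data.csv', index=False)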

SQL Analysis

Query 1: Default Rates by Credit Score Range

Segment applicants into credit score buckets and calculate the default rate for each range. This is the foundation of any credit risk model.

SELECT
    CASE
        WHEN ch.credit_score >= 750 THEN 'Excellent (750+)'
        WHEN ch.credit_score >= 700 THEN 'Good (700-749)'
        WHEN ch.credit_score >= 650 THEN 'Fair (650-699)'
        WHEN ch.credit_score >= 600 THEN 'Poor (600-649)'
        ELSE 'Very Poor (<600)'
    END AS score_range,
    COUNT(*) AS total_applications,
    SUM(CASE WHEN lo.status = 'Default' THEN 1 ELSE 0 END) AS defaults,
    ROUND(SUM(CASE WHEN lo.status = 'Default' THEN 1 ELSE 0 END) * 100.0
        / COUNT(*), 2) AS default_rate_pct,
    ROUND(AVG(la.loan_amount), 2) AS avg_loan_amount
FROM loan_applications la
JOIN credit_history ch ON la.applicant_id = ch.applicant_id
JOIN loan_outcomes lo ON la.application_id = lo.application_id
GROUP BY score_range
ORDER BY default_rate_pct DESC;

Query 2: Loan Amount Distribution and Risk

Analyze how loan size correlates with default probability. Larger loans may carry higher risk depending on the borrower profile.

SELECT
    CASE
        WHEN la.loan_amount < 5000 THEN 'Under $5K'
        WHEN la.loan_amount < 15000 THEN '$5K - $15K'
        WHEN la.loan_amount < 30000 THEN '$15K - $30K'
        ELSE 'Over $30K'
    END AS loan_bucket,
    COUNT(*) AS num_loans,
    ROUND(AVG(la.loan_amount), 2) AS avg_amount,
    ROUND(AVG(la.interest_rate), 2) AS avg_rate,
    ROUND(SUM(CASE WHEN lo.status = 'Default' THEN 1 ELSE 0 END) * 100.0
        / COUNT(*), 2) AS default_rate_pct
FROM loan_applications la
JOIN loan_outcomes lo ON la.application_id = lo.application_id
GROUP BY loan_bucket
ORDER BY avg_amount;

Query 3: Multi-Factor Risk Segmentation

Combine credit score, debt-to-income ratio, and employment type to create granular risk segments. This identifies the riskiest applicant profiles.

SELECT
    CASE WHEN ch.credit_score >= 700 THEN 'High Score' ELSE 'Low Score' END AS score_tier,
    CASE WHEN e.debt_to_income <= 0.35 THEN 'Low DTI' ELSE 'High DTI' END AS dti_tier,
    e.employment_type,
    COUNT(*) AS applicants,
    ROUND(SUM(CASE WHEN lo.status = 'Default' THEN 1 ELSE 0 END) * 100.0
        / COUNT(*), 2) AS default_rate_pct,
    ROUND(AVG(la.loan_amount), 0) AS avg_loan
FROM loan_applications la
JOIN credit_history ch ON la.applicant_id = ch.applicant_id
JOIN employment e ON la.applicant_id = e.applicant_id
JOIN loan_outcomes lo ON la.application_id = lo.application_id
GROUP BY score_tier, dti_tier, e.employment_type
HAVING COUNT(*) >= 20
ORDER BY default_rate_pct DESC;

Python Analysis

Data Preparation and Feature Importance

Load the joined dataset, engineer features, and identify which variables most strongly predict default.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder

# Load data (joined from SQL)
df = pd.read_csv('credit_data.csv')

# Create target variable: 1 = Default, 0 = Not Default
df['is_default'] = (df['status'] == 'Default').astype(int)

# Feature engineering
df['loan_to_income'] = df['loan_amount'] / df['annual_income']
df['monthly_payment'] = df['loan_amount'] * (df['interest_rate']/100/12) \
    / (1 - (1 + df['interest_rate']/100/12) ** -df['term_months'])
df['payment_to_income'] = df['monthly_payment'] / (df['annual_income'] / 12)

# Encode categorical variables (use a separate encoder per column so each
# mapping can be inverted later if needed)
emp_encoder = LabelEncoder()
purpose_encoder = LabelEncoder()
df['employment_encoded'] = emp_encoder.fit_transform(df['employment_type'])
df['purpose_encoded'] = purpose_encoder.fit_transform(df['loan_purpose'])

features = ['credit_score', 'credit_utilization', 'num_late_payments',
            'debt_to_income', 'loan_to_income', 'payment_to_income',
            'annual_income', 'employment_length', 'credit_history_years',
            'employment_encoded', 'purpose_encoded']

X = df[features]
y = df['is_default']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

Logistic Regression Credit Scoring Model

Train a logistic regression model, which is the industry standard for credit scoring because its coefficients are directly interpretable as risk weights.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

model = LogisticRegression(max_iter=1000, class_weight='balanced', random_state=42)
model.fit(X_train_scaled, y_train)

y_pred = model.predict(X_test_scaled)
y_prob = model.predict_proba(X_test_scaled)[:, 1]

print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=['No Default', 'Default']))

# Feature importance (coefficient magnitudes)
importance = pd.DataFrame({
    'feature': features,
    'coefficient': model.coef_[0]
}).sort_values('coefficient', key=abs, ascending=False)

print("\nFeature Importance (by coefficient magnitude):")
print(importance.to_string(index=False))
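
Because the features were standardized, each coefficient is the change in the log-odds of default per one standard deviation increase in that feature. A small follow-on sketch, reusing the importance DataFrame defined above, converts the coefficients to odds ratios, which are often easier to explain to non-technical stakeholders.

# Convert standardized coefficients to odds ratios:
# exp(coef) = multiplicative change in the odds of default
# for a one standard deviation increase in the feature
importance['odds_ratio'] = np.exp(importance['coefficient'])

print("\nOdds ratios (per 1 std dev increase):")
print(importance[['feature', 'odds_ratio']].to_string(index=False))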

Confusion Matrix Visualization

Visualize model performance. In credit risk, false negatives (approving loans that default) are more costly than false positives (rejecting good borrowers).

import matplotlib.pyplot as plt
import seaborn as sns

cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['No Default', 'Default'],
            yticklabels=['No Default', 'Default'])
plt.title('Credit Risk Model - Confusion Matrix')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.tight_layout()
plt.savefig('confusion_matrix.png', dpi=150)
plt.show()

ROC Curve

The ROC curve shows the tradeoff between catching defaults (True Positive Rate) and incorrectly flagging good borrowers (False Positive Rate). An AUC above 0.75 is considered acceptable for credit models.

from sklearn.metrics import roc_curve, auc

fpr, tpr, thresholds = roc_curve(y_test, y_prob)
roc_auc = auc(fpr, tpr)

plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC Curve (AUC = {roc_auc:.3f})')
plt.plot([0, 1], [0, 1], color='gray', lw=1, linestyle='--', label='Random Classifier')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve - Credit Default Prediction')
plt.legend(loc='lower right')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('roc_curve.png', dpi=150)
plt.show()

# Find optimal threshold (maximizes Youden's J statistic)
j_scores = tpr - fpr
optimal_idx = np.argmax(j_scores)
optimal_threshold = thresholds[optimal_idx]
print(f"Optimal probability threshold: {optimal_threshold:.3f}")
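
The Youden threshold is only a starting point; the final cutoff is a business choice. The short sketch below, reusing y_prob and y_test from above, shows how approval volume and the default rate among approved loans change as the cutoff moves. The candidate cutoffs are illustrative, and the output feeds directly into the Excel threshold analysis described later.

# Sweep candidate probability cutoffs: approve when the predicted
# default probability is below the cutoff
for cutoff in [0.3, 0.4, 0.5, 0.6]:
    approved = y_prob < cutoff
    approval_rate = approved.mean() * 100
    # Observed default rate among approved loans in the test set
    default_rate = y_test[approved].mean() * 100 if approved.any() else 0.0
    print(f"Cutoff {cutoff:.2f}: approve {approval_rate:.1f}% of applicants, "
          f"default rate among approved = {default_rate:.1f}%")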

Excel Analysis

  • Credit Scorecard Sheet: Create a points-based scorecard table. List each feature (credit score range, DTI range, employment type, etc.) with assigned points derived from the logistic regression coefficients. For example, credit score 750+ = +40 points, credit score 600-649 = -20 points. Total points determine the risk tier. A sketch for converting coefficients to points follows this list.
  • Risk Matrix: Build a two-dimensional matrix with credit score ranges on rows and DTI buckets on columns. Each cell shows the default rate from the SQL analysis. Use conditional formatting with a Red-Yellow-Green color scale so high-risk cells immediately stand out.
  • Loan Approval Dashboard: Create a summary view showing: total applications processed, approval rate, average loan amount, default rate by month (line chart), and loan purpose breakdown (bar chart). Add slicers for filtering by date range and loan purpose.
  • Threshold Analysis: Use a data table to show how different approval thresholds affect the default rate and approval volume. Column A lists thresholds (e.g., accept if score > 600, 620, 640...). Columns B and C show the resulting default rate and number of approved loans using =COUNTIFS() formulas.
  • Loss Projection: For each risk tier, multiply the expected default rate by the average loan amount and number of loans to estimate expected losses. Sum these for total portfolio loss projection. Add a what-if table to show impact of tightening or loosening approval criteria.
  • Model Performance Summary: Import the Python confusion matrix values. Calculate accuracy, precision, recall, and F1-score using formulas: =TP/(TP+FP) for precision, =TP/(TP+FN) for recall. Display in a clean summary card format.
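
As a bridge from the Python model to the Excel scorecard, the sketch below shows one common way to turn standardized coefficients into points: scale each coefficient so that a fixed number of points corresponds to a doubling of the odds. The scaling constants (base score of 600, 20 points to double the odds) are illustrative assumptions, not values from the project.

# Turn standardized coefficients into scorecard points.
# Scaling assumptions (illustrative): 20 points of score corresponds to a
# doubling of the odds of repayment; base score of 600.
points_to_double_odds = 20
factor = points_to_double_odds / np.log(2)

# Negative coefficients reduce default risk, so they earn positive points
scorecard = importance.copy()
scorecard['points_per_std_dev'] = (-scorecard['coefficient'] * factor).round(0)
print(scorecard[['feature', 'points_per_std_dev']].to_string(index=False))

# Score applicants directly: higher predicted default risk -> lower score
base_score = 600
log_odds = model.decision_function(X_test_scaled)
applicant_scores = base_score - factor * log_odds
print(f"Example applicant scores: {applicant_scores[:5].round(0)}")
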
Key Insights

Credit Score is Not Everything

While credit score is the strongest single predictor, combining it with debt-to-income ratio and employment stability creates far more accurate risk segments. A borrower with a 720 score but 55% DTI may be riskier than one with a 680 score and 20% DTI.

Class Imbalance Matters

Only 12% of loans default, creating an imbalanced dataset. Using class_weight='balanced' in the model ensures the algorithm does not simply predict "no default" for everyone and claim 88% accuracy.

Threshold is a Business Decision

The optimal probability threshold depends on the bank's risk appetite. A lower threshold catches more defaults but rejects more good borrowers. The ROC curve helps visualize this tradeoff so stakeholders can choose the right balance.

Interpretability is Required

Financial regulators require that credit decisions be explainable. Logistic regression is preferred over black-box models because each coefficient directly translates to a risk factor that can be communicated to applicants and auditors.

Quiz

Question 1

Why is class_weight='balanced' used in the logistic regression model?

Question 2

In credit risk modeling, which type of error is typically more costly for the bank?

Question 3

What does AUC (Area Under the Curve) represent in the ROC analysis?

Question 4

Why is logistic regression preferred over neural networks for credit scoring in regulated banking?

Question 5

What does the debt-to-income (DTI) ratio measure?
