Project 9

Patient Readmission Predictor

Build a predictive model to identify patients at high risk of 30-day hospital readmission and enable targeted intervention programs.

Project Overview

Scenario

You are a data analyst at Valley Health System, a network of three hospitals. The quality improvement team has identified that the 30-day readmission rate is 18.4%, well above the national benchmark of 15.5%. Under the CMS Hospital Readmissions Reduction Program (HRRP), the system faces financial penalties of up to $1.8 million annually for excessive readmissions in heart failure, pneumonia, and hip/knee replacement.

Your task is to analyze historical readmission data, identify the strongest risk factors, and build a logistic regression model that can flag high-risk patients before discharge so that care coordinators can provide targeted follow-up.

Objectives

  • Calculate readmission rates by diagnosis, department, and patient demographics
  • Identify the top clinical and social risk factors for readmission
  • Engineer predictive features from admission, diagnosis, and medication data
  • Train and evaluate a logistic regression classifier with ROC analysis
  • Determine the optimal probability threshold for clinical intervention

Dataset Description

The electronic health record (EHR) system provides five tables covering patient history, admissions, diagnoses, medications, and readmission outcomes.

patients

Column Type Description
patient_idINT (PK)Unique patient identifier
ageINTPatient age at most recent admission
genderVARCHAR(10)M or F
insurance_typeVARCHAR(30)Private, Medicare, Medicaid, Self-pay
num_prior_admissionsINTLifetime hospital admissions before current
lives_aloneBOOLEANWhether patient lives alone

admissions

Column Type Description
admission_idINT (PK)Unique admission identifier
patient_idINT (FK)References patients table
admission_dateDATETIMEDate and time of admission
discharge_dateDATETIMEDate and time of discharge
departmentVARCHAR(50)Admitting department
admission_typeVARCHAR(20)Emergency, Elective, Urgent
discharge_dispositionVARCHAR(30)Home, SNF, Rehab, AMA, Expired

diagnoses

Column Type Description
diagnosis_idINT (PK)Unique diagnosis record
admission_idINT (FK)References admissions table
icd_codeVARCHAR(10)ICD-10 diagnosis code
diagnosis_nameVARCHAR(100)Human-readable diagnosis
is_primaryBOOLEANWhether this is the primary diagnosis

medications

Column Type Description
med_idINT (PK)Unique medication record
admission_idINT (FK)References admissions table
medication_nameVARCHAR(100)Name of prescribed medication
medication_classVARCHAR(50)Drug class (e.g., Diuretic, Beta-blocker)
num_medicationsINTTotal medications at discharge

readmissions

Column Type Description
readmission_idINT (PK)Unique readmission record
original_admission_idINT (FK)The initial admission that preceded readmission
readmission_admission_idINT (FK)The readmission event
days_to_readmissionINTDays between discharge and readmission
was_preventableBOOLEANClinical review: was readmission preventable?

SQL Analysis

1. Readmission Rate by Primary Diagnosis

Identify which conditions have the highest readmission rates to focus intervention programs.

SELECT
    dg.diagnosis_name,
    COUNT(DISTINCT a.admission_id) AS total_discharges,
    COUNT(DISTINCT r.readmission_id) AS readmissions,
    ROUND(
        COUNT(DISTINCT r.readmission_id) * 100.0
        / COUNT(DISTINCT a.admission_id), 1
    ) AS readmission_rate_pct
FROM admissions a
JOIN diagnoses dg
    ON a.admission_id = dg.admission_id AND dg.is_primary = TRUE
LEFT JOIN readmissions r
    ON a.admission_id = r.original_admission_id
WHERE a.discharge_date IS NOT NULL
    AND a.discharge_disposition != 'Expired'
GROUP BY dg.diagnosis_name
HAVING COUNT(DISTINCT a.admission_id) >= 20
ORDER BY readmission_rate_pct DESC
LIMIT 15;

2. Risk Factor Analysis

Compare key patient characteristics between readmitted and non-readmitted groups to identify risk factors.

SELECT
    'Age 65+' AS risk_factor,
    ROUND(AVG(CASE WHEN r.readmission_id IS NOT NULL AND p.age >= 65 THEN 1.0 ELSE 0 END) * 100, 1) AS readmit_pct,
    ROUND(AVG(CASE WHEN r.readmission_id IS NULL AND p.age >= 65 THEN 1.0 ELSE 0 END) * 100, 1) AS no_readmit_pct
FROM admissions a
JOIN patients p ON a.patient_id = p.patient_id
LEFT JOIN readmissions r ON a.admission_id = r.original_admission_id
WHERE a.discharge_date IS NOT NULL

UNION ALL

SELECT
    'Lives Alone',
    ROUND(AVG(CASE WHEN r.readmission_id IS NOT NULL AND p.lives_alone = TRUE THEN 1.0 ELSE 0 END) * 100, 1),
    ROUND(AVG(CASE WHEN r.readmission_id IS NULL AND p.lives_alone = TRUE THEN 1.0 ELSE 0 END) * 100, 1)
FROM admissions a
JOIN patients p ON a.patient_id = p.patient_id
LEFT JOIN readmissions r ON a.admission_id = r.original_admission_id
WHERE a.discharge_date IS NOT NULL

UNION ALL

SELECT
    'Emergency Admission',
    ROUND(AVG(CASE WHEN r.readmission_id IS NOT NULL AND a.admission_type = 'Emergency' THEN 1.0 ELSE 0 END) * 100, 1),
    ROUND(AVG(CASE WHEN r.readmission_id IS NULL AND a.admission_type = 'Emergency' THEN 1.0 ELSE 0 END) * 100, 1)
FROM admissions a
JOIN patients p ON a.patient_id = p.patient_id
LEFT JOIN readmissions r ON a.admission_id = r.original_admission_id
WHERE a.discharge_date IS NOT NULL

UNION ALL

SELECT
    '5+ Prior Admissions',
    ROUND(AVG(CASE WHEN r.readmission_id IS NOT NULL AND p.num_prior_admissions >= 5 THEN 1.0 ELSE 0 END) * 100, 1),
    ROUND(AVG(CASE WHEN r.readmission_id IS NULL AND p.num_prior_admissions >= 5 THEN 1.0 ELSE 0 END) * 100, 1)
FROM admissions a
JOIN patients p ON a.patient_id = p.patient_id
LEFT JOIN readmissions r ON a.admission_id = r.original_admission_id
WHERE a.discharge_date IS NOT NULL;

3. Frequent Readmitters

Identify patients with multiple readmissions who may benefit from intensive case management.

SELECT
    p.patient_id,
    p.age,
    p.insurance_type,
    p.num_prior_admissions,
    COUNT(r.readmission_id) AS total_readmissions,
    GROUP_CONCAT(DISTINCT dg.diagnosis_name ORDER BY dg.diagnosis_name SEPARATOR ', ') AS diagnoses,
    ROUND(AVG(r.days_to_readmission), 1) AS avg_days_to_readmit
FROM patients p
JOIN admissions a ON p.patient_id = a.patient_id
JOIN readmissions r ON a.admission_id = r.original_admission_id
JOIN diagnoses dg ON a.admission_id = dg.admission_id AND dg.is_primary = TRUE
GROUP BY p.patient_id, p.age, p.insurance_type, p.num_prior_admissions
HAVING COUNT(r.readmission_id) >= 3
ORDER BY total_readmissions DESC;

4. Medication Burden and Readmission

Examine whether patients on more medications at discharge have higher readmission rates (polypharmacy risk).

SELECT
    CASE
        WHEN m.num_medications BETWEEN 0 AND 3 THEN '0-3 meds'
        WHEN m.num_medications BETWEEN 4 AND 7 THEN '4-7 meds'
        WHEN m.num_medications BETWEEN 8 AND 12 THEN '8-12 meds'
        ELSE '13+ meds'
    END AS medication_bucket,
    COUNT(DISTINCT a.admission_id) AS total_discharges,
    COUNT(DISTINCT r.readmission_id) AS readmissions,
    ROUND(
        COUNT(DISTINCT r.readmission_id) * 100.0
        / COUNT(DISTINCT a.admission_id), 1
    ) AS readmission_rate_pct
FROM admissions a
JOIN medications m ON a.admission_id = m.admission_id
LEFT JOIN readmissions r ON a.admission_id = r.original_admission_id
WHERE a.discharge_date IS NOT NULL
GROUP BY medication_bucket
ORDER BY medication_bucket;

Python Analysis

1. Feature Engineering

Create a modeling dataset by combining patient, admission, diagnosis, and medication data into predictive features.

import pandas as pd
import numpy as np

# Load all tables
patients = pd.read_csv('patients.csv')
admissions = pd.read_csv('admissions.csv', parse_dates=['admission_date', 'discharge_date'])
diagnoses = pd.read_csv('diagnoses.csv')
medications = pd.read_csv('medications.csv')
readmissions = pd.read_csv('readmissions.csv')

# Calculate length of stay
admissions['length_of_stay'] = (
    admissions['discharge_date'] - admissions['admission_date']
).dt.days

# Count diagnoses per admission
diag_counts = (
    diagnoses.groupby('admission_id')
    .size().reset_index(name='num_diagnoses')
)

# Get max medications per admission
med_counts = (
    medications.groupby('admission_id')['num_medications']
    .max().reset_index()
)

# Create target variable: was this admission followed by a readmission?
readmit_set = set(readmissions['original_admission_id'])
admissions['readmitted'] = admissions['admission_id'].isin(readmit_set).astype(int)

# Build feature matrix
df = (
    admissions
    .merge(patients, on='patient_id')
    .merge(diag_counts, on='admission_id', how='left')
    .merge(med_counts, on='admission_id', how='left')
)

# Fill missing values
df['num_diagnoses'] = df['num_diagnoses'].fillna(1)
df['num_medications'] = df['num_medications'].fillna(0)

# Select features
feature_cols = [
    'age', 'num_prior_admissions', 'lives_alone',
    'length_of_stay', 'num_diagnoses', 'num_medications'
]

# One-hot encode categorical features
df_model = pd.get_dummies(df[feature_cols + ['admission_type',
    'discharge_disposition', 'department', 'readmitted']],
    columns=['admission_type', 'discharge_disposition', 'department'],
    drop_first=True
)

print(f"Dataset shape: {df_model.shape}")
print(f"Readmission rate: {df_model['readmitted'].mean():.1%}")
print(f"\nFeature columns: {list(df_model.columns[:10])} ...")

2. Logistic Regression and ROC Curve

Train a logistic regression model and evaluate its discriminative ability using AUC-ROC.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score, classification_report
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# Separate features and target
X = df_model.drop('readmitted', axis=1)
y = df_model['readmitted']

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train logistic regression
model = LogisticRegression(max_iter=1000, class_weight='balanced', random_state=42)
model.fit(X_train_scaled, y_train)

# Predict probabilities
y_prob = model.predict_proba(X_test_scaled)[:, 1]

# ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
auc_score = roc_auc_score(y_test, y_prob)

fig, ax = plt.subplots(figsize=(8, 8))
ax.plot(fpr, tpr, color='darkorange', linewidth=2,
        label=f'Logistic Regression (AUC = {auc_score:.3f})')
ax.plot([0, 1], [0, 1], color='gray', linestyle='--', linewidth=1,
        label='Random Classifier (AUC = 0.500)')
ax.fill_between(fpr, tpr, alpha=0.1, color='darkorange')

ax.set_xlabel('False Positive Rate', fontsize=13)
ax.set_ylabel('True Positive Rate (Sensitivity)', fontsize=13)
ax.set_title('ROC Curve: 30-Day Readmission Prediction',
             fontsize=15, fontweight='bold')
ax.legend(fontsize=12, loc='lower right')
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('roc_curve.png', dpi=150)
plt.show()

print(f"\nAUC-ROC Score: {auc_score:.3f}")

3. Feature Importance

Examine which features are the strongest predictors of readmission based on model coefficients.

# Extract coefficients as odds ratios
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'coefficient': model.coef_[0],
    'odds_ratio': np.exp(model.coef_[0])
}).sort_values('coefficient', key=abs, ascending=False)

# Plot top 15 features
top_features = feature_importance.head(15)

fig, ax = plt.subplots(figsize=(10, 8))
colors = ['#e74c3c' if c > 0 else '#2ecc71' for c in top_features['coefficient']]
bars = ax.barh(range(len(top_features)), top_features['coefficient'],
               color=colors, edgecolor='none')

ax.set_yticks(range(len(top_features)))
ax.set_yticklabels(top_features['feature'], fontsize=11)
ax.set_xlabel('Coefficient (Log-Odds)', fontsize=12)
ax.set_title('Top 15 Readmission Risk Factors',
             fontsize=15, fontweight='bold')
ax.axvline(x=0, color='black', linewidth=0.5)

# Add odds ratio labels
for i, (coef, odds) in enumerate(
    zip(top_features['coefficient'], top_features['odds_ratio'])):
    ax.text(coef + (0.02 if coef >= 0 else -0.02), i,
            f'OR: {odds:.2f}', va='center', fontsize=9,
            ha='left' if coef >= 0 else 'right')

ax.invert_yaxis()
plt.tight_layout()
plt.savefig('feature_importance.png', dpi=150)
plt.show()

print("\nTop Risk Factors (Odds Ratios > 1 increase risk):")
print(feature_importance[['feature', 'odds_ratio']].head(10).to_string(index=False))

4. Optimal Threshold Selection

Find the probability cutoff that balances sensitivity and specificity for clinical use.

from sklearn.metrics import precision_recall_curve, f1_score

# Calculate precision-recall at different thresholds
precision, recall, pr_thresholds = precision_recall_curve(y_test, y_prob)

# Calculate F1 score at each threshold
f1_scores = 2 * (precision[:-1] * recall[:-1]) / (precision[:-1] + recall[:-1] + 1e-8)
optimal_idx = np.argmax(f1_scores)
optimal_threshold = pr_thresholds[optimal_idx]

# Also find Youden's J statistic on ROC curve
j_scores = tpr - fpr
youden_idx = np.argmax(j_scores)
youden_threshold = thresholds[youden_idx]

print(f"Optimal threshold (max F1):       {optimal_threshold:.3f}")
print(f"Optimal threshold (Youden's J):   {youden_threshold:.3f}")

# Apply recommended threshold
y_pred_custom = (y_prob >= optimal_threshold).astype(int)
print(f"\nClassification Report (threshold = {optimal_threshold:.3f}):")
print(classification_report(y_test, y_pred_custom,
                            target_names=['No Readmit', 'Readmit']))

Key Insights

Top Risk Factors

Number of prior admissions (OR: 1.35 per admission), emergency admission type (OR: 2.1), and living alone (OR: 1.8) are the strongest predictors of readmission.

Polypharmacy

Patients discharged with 13+ medications have a 28% readmission rate vs. 11% for those with 0-3 medications. Medication reconciliation at discharge is critical.

Heart Failure Burden

Heart failure has the highest readmission rate at 24.7%, followed by COPD at 21.3%. Together they account for over 40% of all readmissions.

Model Performance

The logistic regression achieves an AUC of 0.74, correctly identifying 68% of eventual readmissions while flagging only 30% of non-readmissions (false positives).

Discharge Destination

Patients discharged Against Medical Advice (AMA) have a 35% readmission rate. Those discharged to SNF/Rehab actually have lower rates (12%) due to supervised recovery.

Intervention Opportunity

Using the model to flag the top 25% highest-risk patients would capture 58% of all readmissions, enabling focused follow-up calls and home visits.

Quiz

Question 1

What does an odds ratio of 2.1 for emergency admissions mean?

Question 2

Why is class_weight='balanced' used in the logistic regression model?

Question 3

What does AUC-ROC measure in this context?

Question 4

In the SQL risk factor analysis, what does the UNION ALL operator do?

Question 5

Why is StandardScaler applied before training the logistic regression?

← Previous Project Next Project →