Patient Readmission Predictor
Build a predictive model to identify patients at high risk of 30-day hospital readmission and enable targeted intervention programs.
Project Overview
Scenario
You are a data analyst at Valley Health System, a network of three hospitals. The quality improvement team has identified that the 30-day readmission rate is 18.4%, well above the national benchmark of 15.5%. Under the CMS Hospital Readmissions Reduction Program (HRRP), the system faces financial penalties of up to $1.8 million annually for excessive readmissions in heart failure, pneumonia, and hip/knee replacement.
Your task is to analyze historical readmission data, identify the strongest risk factors, and build a logistic regression model that can flag high-risk patients before discharge so that care coordinators can provide targeted follow-up.
Objectives
- Calculate readmission rates by diagnosis, department, and patient demographics
- Identify the top clinical and social risk factors for readmission
- Engineer predictive features from admission, diagnosis, and medication data
- Train and evaluate a logistic regression classifier with ROC analysis
- Determine the optimal probability threshold for clinical intervention
Dataset Description
The electronic health record (EHR) system provides five tables covering patient history, admissions, diagnoses, medications, and readmission outcomes.
patients
| Column | Type | Description |
|---|---|---|
| patient_id | INT (PK) | Unique patient identifier |
| age | INT | Patient age at most recent admission |
| gender | VARCHAR(10) | M or F |
| insurance_type | VARCHAR(30) | Private, Medicare, Medicaid, Self-pay |
| num_prior_admissions | INT | Lifetime hospital admissions before current |
| lives_alone | BOOLEAN | Whether patient lives alone |
admissions
| Column | Type | Description |
|---|---|---|
| admission_id | INT (PK) | Unique admission identifier |
| patient_id | INT (FK) | References patients table |
| admission_date | DATETIME | Date and time of admission |
| discharge_date | DATETIME | Date and time of discharge |
| department | VARCHAR(50) | Admitting department |
| admission_type | VARCHAR(20) | Emergency, Elective, Urgent |
| discharge_disposition | VARCHAR(30) | Home, SNF, Rehab, AMA, Expired |
diagnoses
| Column | Type | Description |
|---|---|---|
| diagnosis_id | INT (PK) | Unique diagnosis record |
| admission_id | INT (FK) | References admissions table |
| icd_code | VARCHAR(10) | ICD-10 diagnosis code |
| diagnosis_name | VARCHAR(100) | Human-readable diagnosis |
| is_primary | BOOLEAN | Whether this is the primary diagnosis |
medications
| Column | Type | Description |
|---|---|---|
| med_id | INT (PK) | Unique medication record |
| admission_id | INT (FK) | References admissions table |
| medication_name | VARCHAR(100) | Name of prescribed medication |
| medication_class | VARCHAR(50) | Drug class (e.g., Diuretic, Beta-blocker) |
| num_medications | INT | Total medications at discharge |
readmissions
| Column | Type | Description |
|---|---|---|
| readmission_id | INT (PK) | Unique readmission record |
| original_admission_id | INT (FK) | The initial admission that preceded readmission |
| readmission_admission_id | INT (FK) | The readmission event |
| days_to_readmission | INT | Days between discharge and readmission |
| was_preventable | BOOLEAN | Clinical review: was readmission preventable? |
SQL Analysis
1. Readmission Rate by Primary Diagnosis
Identify which conditions have the highest readmission rates to focus intervention programs.
SELECT
dg.diagnosis_name,
COUNT(DISTINCT a.admission_id) AS total_discharges,
COUNT(DISTINCT r.readmission_id) AS readmissions,
ROUND(
COUNT(DISTINCT r.readmission_id) * 100.0
/ COUNT(DISTINCT a.admission_id), 1
) AS readmission_rate_pct
FROM admissions a
JOIN diagnoses dg
ON a.admission_id = dg.admission_id AND dg.is_primary = TRUE
LEFT JOIN readmissions r
ON a.admission_id = r.original_admission_id
WHERE a.discharge_date IS NOT NULL
AND a.discharge_disposition != 'Expired'
GROUP BY dg.diagnosis_name
HAVING COUNT(DISTINCT a.admission_id) >= 20
ORDER BY readmission_rate_pct DESC
LIMIT 15;
2. Risk Factor Analysis
Compare key patient characteristics between readmitted and non-readmitted groups to identify risk factors.
SELECT
'Age 65+' AS risk_factor,
ROUND(AVG(CASE WHEN r.readmission_id IS NOT NULL AND p.age >= 65 THEN 1.0 ELSE 0 END) * 100, 1) AS readmit_pct,
ROUND(AVG(CASE WHEN r.readmission_id IS NULL AND p.age >= 65 THEN 1.0 ELSE 0 END) * 100, 1) AS no_readmit_pct
FROM admissions a
JOIN patients p ON a.patient_id = p.patient_id
LEFT JOIN readmissions r ON a.admission_id = r.original_admission_id
WHERE a.discharge_date IS NOT NULL
UNION ALL
SELECT
'Lives Alone',
ROUND(AVG(CASE WHEN r.readmission_id IS NOT NULL AND p.lives_alone = TRUE THEN 1.0 ELSE 0 END) * 100, 1),
ROUND(AVG(CASE WHEN r.readmission_id IS NULL AND p.lives_alone = TRUE THEN 1.0 ELSE 0 END) * 100, 1)
FROM admissions a
JOIN patients p ON a.patient_id = p.patient_id
LEFT JOIN readmissions r ON a.admission_id = r.original_admission_id
WHERE a.discharge_date IS NOT NULL
UNION ALL
SELECT
'Emergency Admission',
ROUND(AVG(CASE WHEN r.readmission_id IS NOT NULL AND a.admission_type = 'Emergency' THEN 1.0 ELSE 0 END) * 100, 1),
ROUND(AVG(CASE WHEN r.readmission_id IS NULL AND a.admission_type = 'Emergency' THEN 1.0 ELSE 0 END) * 100, 1)
FROM admissions a
JOIN patients p ON a.patient_id = p.patient_id
LEFT JOIN readmissions r ON a.admission_id = r.original_admission_id
WHERE a.discharge_date IS NOT NULL
UNION ALL
SELECT
'5+ Prior Admissions',
ROUND(AVG(CASE WHEN r.readmission_id IS NOT NULL AND p.num_prior_admissions >= 5 THEN 1.0 ELSE 0 END) * 100, 1),
ROUND(AVG(CASE WHEN r.readmission_id IS NULL AND p.num_prior_admissions >= 5 THEN 1.0 ELSE 0 END) * 100, 1)
FROM admissions a
JOIN patients p ON a.patient_id = p.patient_id
LEFT JOIN readmissions r ON a.admission_id = r.original_admission_id
WHERE a.discharge_date IS NOT NULL;
3. Frequent Readmitters
Identify patients with multiple readmissions who may benefit from intensive case management.
SELECT
p.patient_id,
p.age,
p.insurance_type,
p.num_prior_admissions,
COUNT(r.readmission_id) AS total_readmissions,
GROUP_CONCAT(DISTINCT dg.diagnosis_name ORDER BY dg.diagnosis_name SEPARATOR ', ') AS diagnoses,
ROUND(AVG(r.days_to_readmission), 1) AS avg_days_to_readmit
FROM patients p
JOIN admissions a ON p.patient_id = a.patient_id
JOIN readmissions r ON a.admission_id = r.original_admission_id
JOIN diagnoses dg ON a.admission_id = dg.admission_id AND dg.is_primary = TRUE
GROUP BY p.patient_id, p.age, p.insurance_type, p.num_prior_admissions
HAVING COUNT(r.readmission_id) >= 3
ORDER BY total_readmissions DESC;
4. Medication Burden and Readmission
Examine whether patients on more medications at discharge have higher readmission rates (polypharmacy risk).
SELECT
CASE
WHEN m.num_medications BETWEEN 0 AND 3 THEN '0-3 meds'
WHEN m.num_medications BETWEEN 4 AND 7 THEN '4-7 meds'
WHEN m.num_medications BETWEEN 8 AND 12 THEN '8-12 meds'
ELSE '13+ meds'
END AS medication_bucket,
COUNT(DISTINCT a.admission_id) AS total_discharges,
COUNT(DISTINCT r.readmission_id) AS readmissions,
ROUND(
COUNT(DISTINCT r.readmission_id) * 100.0
/ COUNT(DISTINCT a.admission_id), 1
) AS readmission_rate_pct
FROM admissions a
JOIN medications m ON a.admission_id = m.admission_id
LEFT JOIN readmissions r ON a.admission_id = r.original_admission_id
WHERE a.discharge_date IS NOT NULL
GROUP BY medication_bucket
ORDER BY medication_bucket;
Python Analysis
1. Feature Engineering
Create a modeling dataset by combining patient, admission, diagnosis, and medication data into predictive features.
import pandas as pd
import numpy as np
# Load all tables
patients = pd.read_csv('patients.csv')
admissions = pd.read_csv('admissions.csv', parse_dates=['admission_date', 'discharge_date'])
diagnoses = pd.read_csv('diagnoses.csv')
medications = pd.read_csv('medications.csv')
readmissions = pd.read_csv('readmissions.csv')
# Calculate length of stay
admissions['length_of_stay'] = (
admissions['discharge_date'] - admissions['admission_date']
).dt.days
# Count diagnoses per admission
diag_counts = (
diagnoses.groupby('admission_id')
.size().reset_index(name='num_diagnoses')
)
# Get max medications per admission
med_counts = (
medications.groupby('admission_id')['num_medications']
.max().reset_index()
)
# Create target variable: was this admission followed by a readmission?
readmit_set = set(readmissions['original_admission_id'])
admissions['readmitted'] = admissions['admission_id'].isin(readmit_set).astype(int)
# Build feature matrix
df = (
admissions
.merge(patients, on='patient_id')
.merge(diag_counts, on='admission_id', how='left')
.merge(med_counts, on='admission_id', how='left')
)
# Fill missing values
df['num_diagnoses'] = df['num_diagnoses'].fillna(1)
df['num_medications'] = df['num_medications'].fillna(0)
# Select features
feature_cols = [
'age', 'num_prior_admissions', 'lives_alone',
'length_of_stay', 'num_diagnoses', 'num_medications'
]
# One-hot encode categorical features
df_model = pd.get_dummies(df[feature_cols + ['admission_type',
'discharge_disposition', 'department', 'readmitted']],
columns=['admission_type', 'discharge_disposition', 'department'],
drop_first=True
)
print(f"Dataset shape: {df_model.shape}")
print(f"Readmission rate: {df_model['readmitted'].mean():.1%}")
print(f"\nFeature columns: {list(df_model.columns[:10])} ...")
2. Logistic Regression and ROC Curve
Train a logistic regression model and evaluate its discriminative ability using AUC-ROC.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score, classification_report
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
# Separate features and target
X = df_model.drop('readmitted', axis=1)
y = df_model['readmitted']
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.25, random_state=42, stratify=y
)
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Train logistic regression
model = LogisticRegression(max_iter=1000, class_weight='balanced', random_state=42)
model.fit(X_train_scaled, y_train)
# Predict probabilities
y_prob = model.predict_proba(X_test_scaled)[:, 1]
# ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
auc_score = roc_auc_score(y_test, y_prob)
fig, ax = plt.subplots(figsize=(8, 8))
ax.plot(fpr, tpr, color='darkorange', linewidth=2,
label=f'Logistic Regression (AUC = {auc_score:.3f})')
ax.plot([0, 1], [0, 1], color='gray', linestyle='--', linewidth=1,
label='Random Classifier (AUC = 0.500)')
ax.fill_between(fpr, tpr, alpha=0.1, color='darkorange')
ax.set_xlabel('False Positive Rate', fontsize=13)
ax.set_ylabel('True Positive Rate (Sensitivity)', fontsize=13)
ax.set_title('ROC Curve: 30-Day Readmission Prediction',
fontsize=15, fontweight='bold')
ax.legend(fontsize=12, loc='lower right')
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('roc_curve.png', dpi=150)
plt.show()
print(f"\nAUC-ROC Score: {auc_score:.3f}")
3. Feature Importance
Examine which features are the strongest predictors of readmission based on model coefficients.
# Extract coefficients as odds ratios
feature_importance = pd.DataFrame({
'feature': X.columns,
'coefficient': model.coef_[0],
'odds_ratio': np.exp(model.coef_[0])
}).sort_values('coefficient', key=abs, ascending=False)
# Plot top 15 features
top_features = feature_importance.head(15)
fig, ax = plt.subplots(figsize=(10, 8))
colors = ['#e74c3c' if c > 0 else '#2ecc71' for c in top_features['coefficient']]
bars = ax.barh(range(len(top_features)), top_features['coefficient'],
color=colors, edgecolor='none')
ax.set_yticks(range(len(top_features)))
ax.set_yticklabels(top_features['feature'], fontsize=11)
ax.set_xlabel('Coefficient (Log-Odds)', fontsize=12)
ax.set_title('Top 15 Readmission Risk Factors',
fontsize=15, fontweight='bold')
ax.axvline(x=0, color='black', linewidth=0.5)
# Add odds ratio labels
for i, (coef, odds) in enumerate(
zip(top_features['coefficient'], top_features['odds_ratio'])):
ax.text(coef + (0.02 if coef >= 0 else -0.02), i,
f'OR: {odds:.2f}', va='center', fontsize=9,
ha='left' if coef >= 0 else 'right')
ax.invert_yaxis()
plt.tight_layout()
plt.savefig('feature_importance.png', dpi=150)
plt.show()
print("\nTop Risk Factors (Odds Ratios > 1 increase risk):")
print(feature_importance[['feature', 'odds_ratio']].head(10).to_string(index=False))
4. Optimal Threshold Selection
Find the probability cutoff that balances sensitivity and specificity for clinical use.
from sklearn.metrics import precision_recall_curve, f1_score
# Calculate precision-recall at different thresholds
precision, recall, pr_thresholds = precision_recall_curve(y_test, y_prob)
# Calculate F1 score at each threshold
f1_scores = 2 * (precision[:-1] * recall[:-1]) / (precision[:-1] + recall[:-1] + 1e-8)
optimal_idx = np.argmax(f1_scores)
optimal_threshold = pr_thresholds[optimal_idx]
# Also find Youden's J statistic on ROC curve
j_scores = tpr - fpr
youden_idx = np.argmax(j_scores)
youden_threshold = thresholds[youden_idx]
print(f"Optimal threshold (max F1): {optimal_threshold:.3f}")
print(f"Optimal threshold (Youden's J): {youden_threshold:.3f}")
# Apply recommended threshold
y_pred_custom = (y_prob >= optimal_threshold).astype(int)
print(f"\nClassification Report (threshold = {optimal_threshold:.3f}):")
print(classification_report(y_test, y_pred_custom,
target_names=['No Readmit', 'Readmit']))
Key Insights
Top Risk Factors
Number of prior admissions (OR: 1.35 per admission), emergency admission type (OR: 2.1), and living alone (OR: 1.8) are the strongest predictors of readmission.
Polypharmacy
Patients discharged with 13+ medications have a 28% readmission rate vs. 11% for those with 0-3 medications. Medication reconciliation at discharge is critical.
Heart Failure Burden
Heart failure has the highest readmission rate at 24.7%, followed by COPD at 21.3%. Together they account for over 40% of all readmissions.
Model Performance
The logistic regression achieves an AUC of 0.74, correctly identifying 68% of eventual readmissions while flagging only 30% of non-readmissions (false positives).
Discharge Destination
Patients discharged Against Medical Advice (AMA) have a 35% readmission rate. Those discharged to SNF/Rehab actually have lower rates (12%) due to supervised recovery.
Intervention Opportunity
Using the model to flag the top 25% highest-risk patients would capture 58% of all readmissions, enabling focused follow-up calls and home visits.
Quiz
What does an odds ratio of 2.1 for emergency admissions mean?
Why is class_weight='balanced' used in the logistic regression model?
What does AUC-ROC measure in this context?
In the SQL risk factor analysis, what does the UNION ALL operator do?
Why is StandardScaler applied before training the logistic regression?