Healthcare Cost Analysis
Analyze hospital cost drivers, compare department spending, and build predictive models to help a hospital CFO reduce expenses.
Project Overview
Scenario
You are a data analyst reporting to the Chief Financial Officer (CFO) of Riverside Medical Center, a 350-bed hospital with an annual operating budget of $280 million. Over the past two years, costs have risen 12% while reimbursement rates have remained flat. The CFO needs a clear picture of where money is being spent, which cost categories are growing fastest, and where savings can be found without compromising patient outcomes.
Your analysis will cover procedure-level costs, department budgets, insurance reimbursement patterns, and predictive modeling for patient-level cost estimation.
Objectives
- Calculate average cost per patient by department and procedure type
- Compare insurance reimbursement vs. out-of-pocket costs
- Identify the highest-cost procedures and their volume trends
- Detect cost outliers that may indicate billing errors or inefficiencies
- Build a regression model to predict total patient cost based on clinical factors
Dataset Description
The hospital financial system captures costs across five interconnected tables covering patient demographics, procedures, billing, insurance, and department budgets.
patients
| Column | Type | Description |
|---|---|---|
| patient_id | INT (PK) | Unique patient identifier |
| age | INT | Patient age |
| gender | VARCHAR(10) | M or F |
| insurance_type | VARCHAR(30) | Private, Medicare, Medicaid, Self-pay |
| zip_code | VARCHAR(10) | Patient residential zip code |
| num_comorbidities | INT | Number of chronic conditions |
procedures
| Column | Type | Description |
|---|---|---|
| procedure_id | INT (PK) | Unique procedure identifier |
| patient_id | INT (FK) | References patients table |
| department_id | INT (FK) | Department that performed procedure |
| procedure_name | VARCHAR(100) | Name of the procedure |
| procedure_date | DATE | Date the procedure was performed |
| procedure_cost | DECIMAL(10,2) | Total cost of the procedure |
billing
| Column | Type | Description |
|---|---|---|
| bill_id | INT (PK) | Unique billing record |
| patient_id | INT (FK) | References patients table |
| total_charge | DECIMAL(10,2) | Total amount billed |
| insurance_paid | DECIMAL(10,2) | Amount covered by insurance |
| patient_paid | DECIMAL(10,2) | Out-of-pocket patient payment |
| billing_date | DATE | Date the bill was generated |
insurance_claims
| Column | Type | Description |
|---|---|---|
| claim_id | INT (PK) | Unique claim identifier |
| bill_id | INT (FK) | References billing table |
| insurance_provider | VARCHAR(50) | Name of insurance company |
| claim_amount | DECIMAL(10,2) | Amount claimed from insurer |
| approved_amount | DECIMAL(10,2) | Amount approved by insurer |
| claim_status | VARCHAR(15) | Approved, Denied, Pending, Partial |
| submission_date | DATE | Date claim was submitted |
departments
| Column | Type | Description |
|---|---|---|
| department_id | INT (PK) | Unique department identifier |
| department_name | VARCHAR(50) | Department name |
| annual_budget | DECIMAL(12,2) | Allocated annual budget |
| staff_count | INT | Number of staff in department |
SQL Analysis
1. Average Cost per Patient by Department
Identify which departments generate the highest per-patient costs.
SELECT
d.department_name,
COUNT(DISTINCT p.patient_id) AS unique_patients,
COUNT(pr.procedure_id) AS total_procedures,
ROUND(SUM(pr.procedure_cost), 2) AS total_cost,
ROUND(SUM(pr.procedure_cost) / COUNT(DISTINCT p.patient_id), 2) AS avg_cost_per_patient,
ROUND(AVG(pr.procedure_cost), 2) AS avg_cost_per_procedure
FROM procedures pr
JOIN patients p ON pr.patient_id = p.patient_id
JOIN departments d ON pr.department_id = d.department_id
GROUP BY d.department_name
ORDER BY avg_cost_per_patient DESC;
2. Insurance vs. Out-of-Pocket Breakdown
Compare how much insurance covers versus what patients pay out of pocket, segmented by insurance type.
SELECT
p.insurance_type,
COUNT(b.bill_id) AS total_bills,
ROUND(AVG(b.total_charge), 2) AS avg_total_charge,
ROUND(AVG(b.insurance_paid), 2) AS avg_insurance_paid,
ROUND(AVG(b.patient_paid), 2) AS avg_patient_paid,
ROUND(
AVG(b.insurance_paid) * 100.0 / AVG(b.total_charge), 1
) AS insurance_coverage_pct
FROM billing b
JOIN patients p ON b.patient_id = p.patient_id
GROUP BY p.insurance_type
ORDER BY avg_total_charge DESC;
3. Top 10 Most Expensive Procedures
Rank procedures by total cost to the hospital to find the biggest spending categories.
SELECT
procedure_name,
COUNT(*) AS times_performed,
ROUND(AVG(procedure_cost), 2) AS avg_cost,
ROUND(MIN(procedure_cost), 2) AS min_cost,
ROUND(MAX(procedure_cost), 2) AS max_cost,
ROUND(SUM(procedure_cost), 2) AS total_cost
FROM procedures
GROUP BY procedure_name
ORDER BY total_cost DESC
LIMIT 10;
4. Monthly Cost Trends
Track how total costs change month over month to detect seasonal patterns and growth rates.
SELECT
DATE_FORMAT(procedure_date, '%Y-%m') AS month,
COUNT(*) AS procedure_count,
ROUND(SUM(procedure_cost), 2) AS total_cost,
ROUND(SUM(procedure_cost) / COUNT(*), 2) AS avg_cost_per_procedure,
ROUND(
(SUM(procedure_cost) - LAG(SUM(procedure_cost)) OVER (ORDER BY DATE_FORMAT(procedure_date, '%Y-%m')))
* 100.0 / LAG(SUM(procedure_cost)) OVER (ORDER BY DATE_FORMAT(procedure_date, '%Y-%m')),
1
) AS month_over_month_pct_change
FROM procedures
GROUP BY DATE_FORMAT(procedure_date, '%Y-%m')
ORDER BY month;
Python Analysis
1. Cost Distribution Analysis
Visualize how procedure costs are distributed to understand the spread and identify skewness.
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# Load data
procedures = pd.read_csv('procedures.csv', parse_dates=['procedure_date'])
patients = pd.read_csv('patients.csv')
departments = pd.read_csv('departments.csv')
# Merge for analysis
df = procedures.merge(patients, on='patient_id').merge(departments, on='department_id')
# Cost distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Histogram
axes[0].hist(df['procedure_cost'], bins=50, color='steelblue',
edgecolor='white', alpha=0.8)
axes[0].axvline(df['procedure_cost'].median(), color='red',
linestyle='--', linewidth=2, label=f"Median: ${df['procedure_cost'].median():,.0f}")
axes[0].axvline(df['procedure_cost'].mean(), color='orange',
linestyle='--', linewidth=2, label=f"Mean: ${df['procedure_cost'].mean():,.0f}")
axes[0].set_title('Procedure Cost Distribution', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Cost ($)')
axes[0].set_ylabel('Frequency')
axes[0].legend()
# Box plot by department
dept_costs = df.groupby('department_name')['procedure_cost'].apply(list)
axes[1].boxplot(dept_costs.values, labels=dept_costs.index,
vert=True, patch_artist=True,
boxprops=dict(facecolor='lightblue'))
axes[1].set_title('Cost Distribution by Department', fontsize=14, fontweight='bold')
axes[1].set_ylabel('Cost ($)')
axes[1].tick_params(axis='x', rotation=45)
plt.tight_layout()
plt.savefig('cost_distribution.png', dpi=150)
plt.show()
2. Linear Regression for Cost Prediction
Build a model to predict total patient cost based on age, number of comorbidities, and department.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
import numpy as np
# Aggregate cost per patient
patient_costs = (
df.groupby('patient_id')
.agg(
total_cost=('procedure_cost', 'sum'),
num_procedures=('procedure_id', 'count'),
age=('age', 'first'),
num_comorbidities=('num_comorbidities', 'first'),
insurance_type=('insurance_type', 'first'),
primary_dept=('department_name', lambda x: x.mode()[0])
)
.reset_index()
)
# Features
features = ['age', 'num_comorbidities', 'num_procedures']
X_numeric = patient_costs[features]
# One-hot encode categorical features
encoder = OneHotEncoder(sparse_output=False, drop='first')
X_cat = encoder.fit_transform(patient_costs[['insurance_type', 'primary_dept']])
cat_cols = encoder.get_feature_names_out(['insurance_type', 'primary_dept'])
X = np.hstack([X_numeric.values, X_cat])
y = patient_costs['total_cost'].values
# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Train model
model = LinearRegression()
model.fit(X_train, y_train)
# Evaluate
y_pred = model.predict(X_test)
print(f"R-squared: {r2_score(y_test, y_pred):.3f}")
print(f"Mean Absolute Error: ${mean_absolute_error(y_test, y_pred):,.2f}")
# Feature importance
all_features = list(features) + list(cat_cols)
coef_df = pd.DataFrame({
'feature': all_features,
'coefficient': model.coef_
}).sort_values('coefficient', key=abs, ascending=False)
print("\nTop 10 Cost Predictors:")
print(coef_df.head(10).to_string(index=False))
3. Outlier Detection with Z-Scores
Flag unusually high-cost procedures that may indicate billing errors or rare complex cases.
from scipy import stats
# Calculate z-scores within each procedure type
df['cost_zscore'] = df.groupby('procedure_name')['procedure_cost'].transform(
lambda x: stats.zscore(x, nan_policy='omit')
)
# Flag outliers (z-score > 3)
outliers = df[df['cost_zscore'].abs() > 3].sort_values('cost_zscore', ascending=False)
print(f"Total outliers detected: {len(outliers)} out of {len(df)} procedures")
print(f"Outlier rate: {len(outliers)/len(df)*100:.1f}%\n")
# Summary of outliers
outlier_summary = (
outliers.groupby('procedure_name')
.agg(
count=('procedure_id', 'count'),
avg_cost=('procedure_cost', 'mean'),
max_zscore=('cost_zscore', 'max')
)
.sort_values('count', ascending=False)
)
print("Procedures with Most Outliers:")
print(outlier_summary.head(10).to_string())
# Scatter plot of outliers
fig, ax = plt.subplots(figsize=(12, 6))
normal = df[df['cost_zscore'].abs() <= 3]
ax.scatter(normal.index, normal['procedure_cost'], alpha=0.2,
s=5, color='steelblue', label='Normal')
ax.scatter(outliers.index, outliers['procedure_cost'], alpha=0.7,
s=20, color='red', label='Outlier (|z| > 3)')
ax.set_title('Procedure Cost Outlier Detection', fontsize=14, fontweight='bold')
ax.set_ylabel('Cost ($)')
ax.set_xlabel('Procedure Index')
ax.legend()
plt.tight_layout()
plt.savefig('cost_outliers.png', dpi=150)
plt.show()
Excel Dashboard
Build a financial dashboard the CFO can use for monthly cost reviews.
Step-by-Step Instructions
- Import Data: Go to Data > Get Data > From Text/CSV. Import procedures.csv, billing.csv, departments.csv, and patients.csv into separate sheets.
- Create a Cost Summary PivotTable: Insert a PivotTable from procedures data. Place department_name in Rows, procedure_date (grouped by Month) in Columns, and SUM of procedure_cost in Values. Format values as Currency.
- Budget vs. Actual Chart: In a new sheet, create a table with columns: Department, Annual Budget (from departments table), Actual Spend (summed from procedures). Add a Clustered Bar Chart comparing the two side by side. Apply conditional formatting: red if Actual > Budget.
- Insurance Coverage Analysis: Create a PivotTable from billing data grouped by insurance_type. Show AVG of total_charge, insurance_paid, and patient_paid. Insert a Stacked Bar Chart showing the insurance vs. patient split.
- Top Procedures Table: Use
=SUMIFS()to aggregate costs by procedure_name. Sort descending. Add Data Bars conditional formatting to visualize relative costs. - Monthly Trend Line: Create a PivotChart (Line) showing total monthly cost with a linear trendline. Right-click the trendline, check "Display equation on chart" to show the growth rate.
- KPI Summary Cards: At the top of the dashboard, create formatted cells for:
- Total YTD Spend:
=SUM(procedures[procedure_cost]) - Avg Cost per Patient:
=SUM(procedures[procedure_cost])/COUNTA(UNIQUE(procedures[patient_id])) - Claim Denial Rate:
=COUNTIFS(claims[claim_status],"Denied")/COUNTA(claims[claim_id])
- Total YTD Spend:
- Add Slicers: Insert slicers for Department, Insurance Type, and Date Range. Connect all slicers to every PivotTable for interactive filtering.
Key Insights
Cost Concentration
The top 5 procedures account for 38% of total hospital costs. Cardiac catheterization and joint replacements are the biggest drivers at $15,000-$45,000 per procedure.
Insurance Gaps
Medicaid reimburses only 62% of billed charges on average, compared to 84% for private insurance. Self-pay patients have the highest unpaid balances.
Claim Denials
12% of insurance claims are denied on first submission. Coding errors account for 45% of denials. A coding audit could recover an estimated $2.1M annually.
Comorbidity Impact
Each additional comorbidity increases average patient cost by $3,200. Patients with 3+ comorbidities cost 2.8x more than those with none.
Budget Overruns
Orthopedics and Cardiology are 15% and 11% over budget respectively. Radiology is 8% under budget due to new equipment reducing per-scan costs.
Billing Outliers
2.3% of procedures are cost outliers (z-score > 3). Investigation reveals half are legitimate complex cases; the other half contain coding or supply charge errors.
Quiz
Which SQL function is used to calculate month-over-month percentage change in the cost trends query?
In the outlier detection code, what z-score threshold is used to flag unusual procedure costs?
What does the R-squared metric tell us about the cost prediction model?
Why is OneHotEncoder used before training the linear regression model?
What does a high claim denial rate most commonly indicate?