Project 8

Healthcare Cost Analysis

Analyze hospital cost drivers, compare department spending, and build predictive models to help a hospital CFO reduce expenses.

Project Overview

Scenario

You are a data analyst reporting to the Chief Financial Officer (CFO) of Riverside Medical Center, a 350-bed hospital with an annual operating budget of $280 million. Over the past two years, costs have risen 12% while reimbursement rates have remained flat. The CFO needs a clear picture of where money is being spent, which cost categories are growing fastest, and where savings can be found without compromising patient outcomes.

Your analysis will cover procedure-level costs, department budgets, insurance reimbursement patterns, and predictive modeling for patient-level cost estimation.

Objectives

  • Calculate average cost per patient by department and procedure type
  • Compare insurance reimbursement vs. out-of-pocket costs
  • Identify the highest-cost procedures and their volume trends
  • Detect cost outliers that may indicate billing errors or inefficiencies
  • Build a regression model to predict total patient cost based on clinical factors

Dataset Description

The hospital financial system captures costs across five interconnected tables covering patient demographics, procedures, billing, insurance, and department budgets.

patients

Column Type Description
patient_idINT (PK)Unique patient identifier
ageINTPatient age
genderVARCHAR(10)M or F
insurance_typeVARCHAR(30)Private, Medicare, Medicaid, Self-pay
zip_codeVARCHAR(10)Patient residential zip code
num_comorbiditiesINTNumber of chronic conditions

procedures

Column Type Description
procedure_idINT (PK)Unique procedure identifier
patient_idINT (FK)References patients table
department_idINT (FK)Department that performed procedure
procedure_nameVARCHAR(100)Name of the procedure
procedure_dateDATEDate the procedure was performed
procedure_costDECIMAL(10,2)Total cost of the procedure

billing

Column Type Description
bill_idINT (PK)Unique billing record
patient_idINT (FK)References patients table
total_chargeDECIMAL(10,2)Total amount billed
insurance_paidDECIMAL(10,2)Amount covered by insurance
patient_paidDECIMAL(10,2)Out-of-pocket patient payment
billing_dateDATEDate the bill was generated

insurance_claims

Column Type Description
claim_idINT (PK)Unique claim identifier
bill_idINT (FK)References billing table
insurance_providerVARCHAR(50)Name of insurance company
claim_amountDECIMAL(10,2)Amount claimed from insurer
approved_amountDECIMAL(10,2)Amount approved by insurer
claim_statusVARCHAR(15)Approved, Denied, Pending, Partial
submission_dateDATEDate claim was submitted

departments

Column Type Description
department_idINT (PK)Unique department identifier
department_nameVARCHAR(50)Department name
annual_budgetDECIMAL(12,2)Allocated annual budget
staff_countINTNumber of staff in department

SQL Analysis

1. Average Cost per Patient by Department

Identify which departments generate the highest per-patient costs.

SELECT
    d.department_name,
    COUNT(DISTINCT p.patient_id) AS unique_patients,
    COUNT(pr.procedure_id) AS total_procedures,
    ROUND(SUM(pr.procedure_cost), 2) AS total_cost,
    ROUND(SUM(pr.procedure_cost) / COUNT(DISTINCT p.patient_id), 2) AS avg_cost_per_patient,
    ROUND(AVG(pr.procedure_cost), 2) AS avg_cost_per_procedure
FROM procedures pr
JOIN patients p ON pr.patient_id = p.patient_id
JOIN departments d ON pr.department_id = d.department_id
GROUP BY d.department_name
ORDER BY avg_cost_per_patient DESC;

2. Insurance vs. Out-of-Pocket Breakdown

Compare how much insurance covers versus what patients pay out of pocket, segmented by insurance type.

SELECT
    p.insurance_type,
    COUNT(b.bill_id) AS total_bills,
    ROUND(AVG(b.total_charge), 2) AS avg_total_charge,
    ROUND(AVG(b.insurance_paid), 2) AS avg_insurance_paid,
    ROUND(AVG(b.patient_paid), 2) AS avg_patient_paid,
    ROUND(
        AVG(b.insurance_paid) * 100.0 / AVG(b.total_charge), 1
    ) AS insurance_coverage_pct
FROM billing b
JOIN patients p ON b.patient_id = p.patient_id
GROUP BY p.insurance_type
ORDER BY avg_total_charge DESC;

3. Top 10 Most Expensive Procedures

Rank procedures by total cost to the hospital to find the biggest spending categories.

SELECT
    procedure_name,
    COUNT(*) AS times_performed,
    ROUND(AVG(procedure_cost), 2) AS avg_cost,
    ROUND(MIN(procedure_cost), 2) AS min_cost,
    ROUND(MAX(procedure_cost), 2) AS max_cost,
    ROUND(SUM(procedure_cost), 2) AS total_cost
FROM procedures
GROUP BY procedure_name
ORDER BY total_cost DESC
LIMIT 10;

4. Monthly Cost Trends

Track how total costs change month over month to detect seasonal patterns and growth rates.

SELECT
    DATE_FORMAT(procedure_date, '%Y-%m') AS month,
    COUNT(*) AS procedure_count,
    ROUND(SUM(procedure_cost), 2) AS total_cost,
    ROUND(SUM(procedure_cost) / COUNT(*), 2) AS avg_cost_per_procedure,
    ROUND(
        (SUM(procedure_cost) - LAG(SUM(procedure_cost)) OVER (ORDER BY DATE_FORMAT(procedure_date, '%Y-%m')))
        * 100.0 / LAG(SUM(procedure_cost)) OVER (ORDER BY DATE_FORMAT(procedure_date, '%Y-%m')),
        1
    ) AS month_over_month_pct_change
FROM procedures
GROUP BY DATE_FORMAT(procedure_date, '%Y-%m')
ORDER BY month;

Python Analysis

1. Cost Distribution Analysis

Visualize how procedure costs are distributed to understand the spread and identify skewness.

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Load data
procedures = pd.read_csv('procedures.csv', parse_dates=['procedure_date'])
patients = pd.read_csv('patients.csv')
departments = pd.read_csv('departments.csv')

# Merge for analysis
df = procedures.merge(patients, on='patient_id').merge(departments, on='department_id')

# Cost distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Histogram
axes[0].hist(df['procedure_cost'], bins=50, color='steelblue',
             edgecolor='white', alpha=0.8)
axes[0].axvline(df['procedure_cost'].median(), color='red',
                linestyle='--', linewidth=2, label=f"Median: ${df['procedure_cost'].median():,.0f}")
axes[0].axvline(df['procedure_cost'].mean(), color='orange',
                linestyle='--', linewidth=2, label=f"Mean: ${df['procedure_cost'].mean():,.0f}")
axes[0].set_title('Procedure Cost Distribution', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Cost ($)')
axes[0].set_ylabel('Frequency')
axes[0].legend()

# Box plot by department
dept_costs = df.groupby('department_name')['procedure_cost'].apply(list)
axes[1].boxplot(dept_costs.values, labels=dept_costs.index,
                vert=True, patch_artist=True,
                boxprops=dict(facecolor='lightblue'))
axes[1].set_title('Cost Distribution by Department', fontsize=14, fontweight='bold')
axes[1].set_ylabel('Cost ($)')
axes[1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.savefig('cost_distribution.png', dpi=150)
plt.show()

2. Linear Regression for Cost Prediction

Build a model to predict total patient cost based on age, number of comorbidities, and department.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
import numpy as np

# Aggregate cost per patient
patient_costs = (
    df.groupby('patient_id')
    .agg(
        total_cost=('procedure_cost', 'sum'),
        num_procedures=('procedure_id', 'count'),
        age=('age', 'first'),
        num_comorbidities=('num_comorbidities', 'first'),
        insurance_type=('insurance_type', 'first'),
        primary_dept=('department_name', lambda x: x.mode()[0])
    )
    .reset_index()
)

# Features
features = ['age', 'num_comorbidities', 'num_procedures']
X_numeric = patient_costs[features]

# One-hot encode categorical features
encoder = OneHotEncoder(sparse_output=False, drop='first')
X_cat = encoder.fit_transform(patient_costs[['insurance_type', 'primary_dept']])
cat_cols = encoder.get_feature_names_out(['insurance_type', 'primary_dept'])

X = np.hstack([X_numeric.values, X_cat])
y = patient_costs['total_cost'].values

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train model
model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
print(f"R-squared: {r2_score(y_test, y_pred):.3f}")
print(f"Mean Absolute Error: ${mean_absolute_error(y_test, y_pred):,.2f}")

# Feature importance
all_features = list(features) + list(cat_cols)
coef_df = pd.DataFrame({
    'feature': all_features,
    'coefficient': model.coef_
}).sort_values('coefficient', key=abs, ascending=False)
print("\nTop 10 Cost Predictors:")
print(coef_df.head(10).to_string(index=False))

3. Outlier Detection with Z-Scores

Flag unusually high-cost procedures that may indicate billing errors or rare complex cases.

from scipy import stats

# Calculate z-scores within each procedure type
df['cost_zscore'] = df.groupby('procedure_name')['procedure_cost'].transform(
    lambda x: stats.zscore(x, nan_policy='omit')
)

# Flag outliers (z-score > 3)
outliers = df[df['cost_zscore'].abs() > 3].sort_values('cost_zscore', ascending=False)

print(f"Total outliers detected: {len(outliers)} out of {len(df)} procedures")
print(f"Outlier rate: {len(outliers)/len(df)*100:.1f}%\n")

# Summary of outliers
outlier_summary = (
    outliers.groupby('procedure_name')
    .agg(
        count=('procedure_id', 'count'),
        avg_cost=('procedure_cost', 'mean'),
        max_zscore=('cost_zscore', 'max')
    )
    .sort_values('count', ascending=False)
)
print("Procedures with Most Outliers:")
print(outlier_summary.head(10).to_string())

# Scatter plot of outliers
fig, ax = plt.subplots(figsize=(12, 6))
normal = df[df['cost_zscore'].abs() <= 3]
ax.scatter(normal.index, normal['procedure_cost'], alpha=0.2,
           s=5, color='steelblue', label='Normal')
ax.scatter(outliers.index, outliers['procedure_cost'], alpha=0.7,
           s=20, color='red', label='Outlier (|z| > 3)')
ax.set_title('Procedure Cost Outlier Detection', fontsize=14, fontweight='bold')
ax.set_ylabel('Cost ($)')
ax.set_xlabel('Procedure Index')
ax.legend()
plt.tight_layout()
plt.savefig('cost_outliers.png', dpi=150)
plt.show()

Excel Dashboard

Build a financial dashboard the CFO can use for monthly cost reviews.

Step-by-Step Instructions

  1. Import Data: Go to Data > Get Data > From Text/CSV. Import procedures.csv, billing.csv, departments.csv, and patients.csv into separate sheets.
  2. Create a Cost Summary PivotTable: Insert a PivotTable from procedures data. Place department_name in Rows, procedure_date (grouped by Month) in Columns, and SUM of procedure_cost in Values. Format values as Currency.
  3. Budget vs. Actual Chart: In a new sheet, create a table with columns: Department, Annual Budget (from departments table), Actual Spend (summed from procedures). Add a Clustered Bar Chart comparing the two side by side. Apply conditional formatting: red if Actual > Budget.
  4. Insurance Coverage Analysis: Create a PivotTable from billing data grouped by insurance_type. Show AVG of total_charge, insurance_paid, and patient_paid. Insert a Stacked Bar Chart showing the insurance vs. patient split.
  5. Top Procedures Table: Use =SUMIFS() to aggregate costs by procedure_name. Sort descending. Add Data Bars conditional formatting to visualize relative costs.
  6. Monthly Trend Line: Create a PivotChart (Line) showing total monthly cost with a linear trendline. Right-click the trendline, check "Display equation on chart" to show the growth rate.
  7. KPI Summary Cards: At the top of the dashboard, create formatted cells for:
    • Total YTD Spend: =SUM(procedures[procedure_cost])
    • Avg Cost per Patient: =SUM(procedures[procedure_cost])/COUNTA(UNIQUE(procedures[patient_id]))
    • Claim Denial Rate: =COUNTIFS(claims[claim_status],"Denied")/COUNTA(claims[claim_id])
  8. Add Slicers: Insert slicers for Department, Insurance Type, and Date Range. Connect all slicers to every PivotTable for interactive filtering.

Key Insights

Cost Concentration

The top 5 procedures account for 38% of total hospital costs. Cardiac catheterization and joint replacements are the biggest drivers at $15,000-$45,000 per procedure.

Insurance Gaps

Medicaid reimburses only 62% of billed charges on average, compared to 84% for private insurance. Self-pay patients have the highest unpaid balances.

Claim Denials

12% of insurance claims are denied on first submission. Coding errors account for 45% of denials. A coding audit could recover an estimated $2.1M annually.

Comorbidity Impact

Each additional comorbidity increases average patient cost by $3,200. Patients with 3+ comorbidities cost 2.8x more than those with none.

Budget Overruns

Orthopedics and Cardiology are 15% and 11% over budget respectively. Radiology is 8% under budget due to new equipment reducing per-scan costs.

Billing Outliers

2.3% of procedures are cost outliers (z-score > 3). Investigation reveals half are legitimate complex cases; the other half contain coding or supply charge errors.

Quiz

Question 1

Which SQL function is used to calculate month-over-month percentage change in the cost trends query?

Question 2

In the outlier detection code, what z-score threshold is used to flag unusual procedure costs?

Question 3

What does the R-squared metric tell us about the cost prediction model?

Question 4

Why is OneHotEncoder used before training the linear regression model?

Question 5

What does a high claim denial rate most commonly indicate?

← Previous Project Next Project →