Project 7

Disease Outbreak Tracker

Track and visualize the spread of infectious disease across regions using epidemiological data and geospatial analysis.

Project Overview

Scenario

You are a data analyst at the State Department of Public Health. A respiratory illness has been spreading across 12 counties in the state over the past 6 months. Health officials need a comprehensive tracking system to monitor case growth, identify hotspots, and evaluate testing effectiveness.

Your task is to build analytical queries and visualizations that enable public health officials to make timely, data-driven decisions about resource allocation and intervention strategies.

Objectives

  • Track daily and cumulative case counts by region
  • Calculate test positivity rates to assess testing adequacy
  • Analyze demographic breakdowns to identify vulnerable populations
  • Create geospatial visualizations to map outbreak hotspots
  • Model epidemic curves and estimate basic reproduction numbers

Dataset Description

The surveillance system collects data from testing sites, hospitals, and county health departments across four interconnected tables.

cases

Column Type Description
case_idINT (PK)Unique case identifier
patient_idINT (FK)References demographics table
region_idINT (FK)County/region where case was reported
report_dateDATEDate the case was reported
symptom_onset_dateDATEDate symptoms first appeared
severityVARCHAR(15)Mild, Moderate, Severe, Critical
outcomeVARCHAR(15)Recovered, Hospitalized, Deceased, Active

regions

Column Type Description
region_idINT (PK)Unique region identifier
region_nameVARCHAR(50)County or district name
populationINTTotal population of region
latitudeDECIMAL(9,6)Region centroid latitude
longitudeDECIMAL(9,6)Region centroid longitude
urban_ruralVARCHAR(10)Urban, Suburban, or Rural

demographics

Column Type Description
patient_idINT (PK)Unique patient identifier
ageINTPatient age at time of report
genderVARCHAR(10)M, F, or Other
race_ethnicityVARCHAR(30)Self-reported race/ethnicity
has_comorbidityBOOLEANWhether patient has pre-existing conditions

test_results

Column Type Description
test_idINT (PK)Unique test identifier
patient_idINT (FK)References demographics table
region_idINT (FK)Testing site region
test_dateDATEDate the test was administered
test_typeVARCHAR(20)PCR, Rapid Antigen, Antibody
resultVARCHAR(10)Positive, Negative, Inconclusive

SQL Analysis

1. Daily and Cumulative Case Counts by Region

Track how cases accumulate over time in each county using a window function for the running total.

SELECT
    r.region_name,
    c.report_date,
    COUNT(*) AS daily_cases,
    SUM(COUNT(*)) OVER (
        PARTITION BY r.region_name
        ORDER BY c.report_date
        ROWS UNBOUNDED PRECEDING
    ) AS cumulative_cases
FROM cases c
JOIN regions r ON c.region_id = r.region_id
GROUP BY r.region_name, c.report_date
ORDER BY r.region_name, c.report_date;

2. Test Positivity Rate by Region and Week

A positivity rate above 5% suggests insufficient testing. This query calculates the weekly rate for each region.

SELECT
    r.region_name,
    YEAR(t.test_date) AS test_year,
    WEEK(t.test_date) AS test_week,
    COUNT(*) AS total_tests,
    SUM(CASE WHEN t.result = 'Positive' THEN 1 ELSE 0 END) AS positive_tests,
    ROUND(
        SUM(CASE WHEN t.result = 'Positive' THEN 1 ELSE 0 END) * 100.0
        / COUNT(*), 2
    ) AS positivity_rate_pct
FROM test_results t
JOIN regions r ON t.region_id = r.region_id
WHERE t.result != 'Inconclusive'
GROUP BY r.region_name, YEAR(t.test_date), WEEK(t.test_date)
ORDER BY r.region_name, test_year, test_week;

3. Demographic Breakdown of Severity

Identify which age groups and demographics are most affected by severe illness.

SELECT
    CASE
        WHEN d.age BETWEEN 0 AND 17 THEN '0-17'
        WHEN d.age BETWEEN 18 AND 34 THEN '18-34'
        WHEN d.age BETWEEN 35 AND 49 THEN '35-49'
        WHEN d.age BETWEEN 50 AND 64 THEN '50-64'
        ELSE '65+'
    END AS age_group,
    c.severity,
    COUNT(*) AS case_count,
    ROUND(
        COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (
            PARTITION BY CASE
                WHEN d.age BETWEEN 0 AND 17 THEN '0-17'
                WHEN d.age BETWEEN 18 AND 34 THEN '18-34'
                WHEN d.age BETWEEN 35 AND 49 THEN '35-49'
                WHEN d.age BETWEEN 50 AND 64 THEN '50-64'
                ELSE '65+'
            END
        ), 1
    ) AS pct_within_age_group
FROM cases c
JOIN demographics d ON c.patient_id = d.patient_id
GROUP BY age_group, c.severity
ORDER BY age_group, c.severity;

4. Incidence Rate per 100,000 Population

Normalize case counts by population to enable fair comparison between regions of different sizes.

SELECT
    r.region_name,
    r.population,
    COUNT(c.case_id) AS total_cases,
    ROUND(
        COUNT(c.case_id) * 100000.0 / r.population, 1
    ) AS incidence_per_100k,
    SUM(CASE WHEN c.outcome = 'Deceased' THEN 1 ELSE 0 END) AS total_deaths,
    ROUND(
        SUM(CASE WHEN c.outcome = 'Deceased' THEN 1 ELSE 0 END) * 100000.0
        / r.population, 1
    ) AS mortality_per_100k
FROM regions r
LEFT JOIN cases c ON r.region_id = c.region_id
GROUP BY r.region_name, r.population
ORDER BY incidence_per_100k DESC;

Python Analysis

1. Geospatial Visualization of Case Density

Map case hotspots across the state using bubble markers sized by incidence rate.

import pandas as pd
import folium

# Load data
cases = pd.read_csv('cases.csv', parse_dates=['report_date'])
regions = pd.read_csv('regions.csv')

# Calculate incidence per 100k by region
case_counts = cases.groupby('region_id').size().reset_index(name='total_cases')
region_data = regions.merge(case_counts, on='region_id')
region_data['incidence_per_100k'] = (
    region_data['total_cases'] * 100000 / region_data['population']
)

# Create map centered on state
center_lat = region_data['latitude'].mean()
center_lon = region_data['longitude'].mean()
m = folium.Map(location=[center_lat, center_lon], zoom_start=8,
               tiles='CartoDB positron')

# Add bubble markers
for _, row in region_data.iterrows():
    color = ('red' if row['incidence_per_100k'] > 500
             else 'orange' if row['incidence_per_100k'] > 200
             else 'green')
    folium.CircleMarker(
        location=[row['latitude'], row['longitude']],
        radius=row['incidence_per_100k'] / 50,
        popup=f"{row['region_name']}<br>"
              f"Cases: {row['total_cases']}<br>"
              f"Rate: {row['incidence_per_100k']:.1f}/100k",
        color=color,
        fill=True,
        fill_opacity=0.6
    ).add_to(m)

m.save('outbreak_map.html')
m

2. Epidemic Curve (Epi Curve)

Plot the classic epidemiological curve showing new cases by symptom onset date, stacked by severity.

import matplotlib.pyplot as plt
import matplotlib.dates as mdates

# Epi curve by symptom onset date and severity
epi_data = (
    cases
    .groupby(['symptom_onset_date', 'severity'])
    .size()
    .unstack(fill_value=0)
)

severity_order = ['Mild', 'Moderate', 'Severe', 'Critical']
epi_data = epi_data.reindex(columns=severity_order, fill_value=0)

colors = {'Mild': '#2196F3', 'Moderate': '#FF9800',
          'Severe': '#F44336', 'Critical': '#9C27B0'}

fig, ax = plt.subplots(figsize=(14, 6))
epi_data.plot(kind='bar', stacked=True, ax=ax, width=1.0,
              color=[colors[s] for s in severity_order],
              edgecolor='none')

ax.set_title('Epidemic Curve by Symptom Onset Date',
             fontsize=16, fontweight='bold')
ax.set_xlabel('Symptom Onset Date')
ax.set_ylabel('Number of Cases')

# Reduce x-tick density
tick_positions = range(0, len(epi_data), 7)
ax.set_xticks(tick_positions)
ax.set_xticklabels(
    [epi_data.index[i].strftime('%b %d') for i in tick_positions],
    rotation=45
)
ax.legend(title='Severity')
plt.tight_layout()
plt.savefig('epi_curve.png', dpi=150)
plt.show()

3. SIR Model Simulation

Implement a basic Susceptible-Infected-Recovered (SIR) model to project outbreak trajectory.

import numpy as np
from scipy.integrate import odeint

# SIR model parameters
N = 500000          # Total population
I0 = 50             # Initial infected
R0 = 0              # Initial recovered
S0 = N - I0 - R0    # Initial susceptible
beta = 0.3          # Transmission rate
gamma = 0.1         # Recovery rate (1/duration of infection)
days = 180

def sir_model(y, t, N, beta, gamma):
    S, I, R = y
    dSdt = -beta * S * I / N
    dIdt = beta * S * I / N - gamma * I
    dRdt = gamma * I
    return dSdt, dIdt, dRdt

t = np.linspace(0, days, days)
y0 = S0, I0, R0
solution = odeint(sir_model, y0, t, args=(N, beta, gamma))
S, I, R = solution.T

fig, ax = plt.subplots(figsize=(12, 6))
ax.plot(t, S / 1000, 'b-', linewidth=2, label='Susceptible')
ax.plot(t, I / 1000, 'r-', linewidth=2, label='Infected')
ax.plot(t, R / 1000, 'g-', linewidth=2, label='Recovered')

ax.set_title(f'SIR Model Projection (R0 = {beta/gamma:.1f})',
             fontsize=16, fontweight='bold')
ax.set_xlabel('Days Since First Case')
ax.set_ylabel('Population (thousands)')
ax.legend(fontsize=12)
ax.grid(True, alpha=0.3)

# Annotate peak
peak_day = int(t[np.argmax(I)])
peak_val = np.max(I) / 1000
ax.annotate(f'Peak: Day {peak_day}\n{peak_val:.0f}k infected',
            xy=(peak_day, peak_val), fontsize=10,
            arrowprops=dict(arrowstyle='->', color='red'),
            xytext=(peak_day + 20, peak_val + 20))

plt.tight_layout()
plt.savefig('sir_model.png', dpi=150)
plt.show()

Key Insights

Urban Hotspots

Urban counties show 3x higher incidence rates than rural areas, but rural regions have 2x higher case fatality rates due to limited hospital access.

Testing Gaps

Three rural counties maintain positivity rates above 15%, suggesting significant under-testing. Deploying mobile testing units is recommended.

Age Vulnerability

Adults 65+ make up 18% of cases but 62% of hospitalizations and 78% of deaths. Targeted vaccination and early treatment for this group is critical.

Reporting Lag

The median delay between symptom onset and case report is 5 days. Faster reporting infrastructure would improve real-time situational awareness.

SIR Projection

The model estimates a basic reproduction number (R0) of 3.0, predicting peak infections around Day 60 without intervention. Social measures could flatten this curve.

Comorbidity Risk

Patients with pre-existing conditions are 4.5x more likely to develop severe illness. Pre-screening comorbid patients can prioritize limited ICU resources.

Quiz

Question 1

What test positivity rate threshold generally indicates insufficient testing in a region?

Question 2

In the SIR model, what does the "R" compartment represent?

Question 3

Why do we calculate incidence per 100,000 population instead of using raw case counts?

Question 4

Which SQL function is used in the cumulative case count query to create a running total?

Question 5

What Python library is used in this project to create interactive map visualizations?

← Previous Project Next Project →