Project 21

Employee Attrition Analysis

Investigate high turnover rates, identify at-risk employees, and build predictive models to help an HR director develop targeted retention strategies.

Project Overview

Scenario: You are a data analyst working with the HR director at a mid-size technology company (2,500 employees). Over the past two years, voluntary turnover has climbed from 12% to 19%, well above the industry average of 13%. Leadership wants to understand which departments, roles, and employee profiles are most affected, and whether attrition can be predicted before employees resign.

Objective: Analyze historical employee data, exit survey responses, and salary records to identify the primary drivers of attrition. Build a logistic regression model to flag at-risk employees and recommend data-backed retention interventions.

Tools: SQL (PostgreSQL), Python (pandas, matplotlib, scikit-learn, lifelines), Microsoft Excel

Dataset Description

Table: employees

ColumnTypeDescription
employee_idINT (PK)Unique employee identifier
first_nameVARCHAR(50)Employee first name
last_nameVARCHAR(50)Employee last name
department_idINT (FK)Reference to departments table
role_titleVARCHAR(100)Job title
hire_dateDATEDate employee was hired
termination_dateDATEDate employee left (NULL if active)
is_activeBOOLEANCurrent employment status
ageINTEmployee age at hire
manager_idINT (FK)Direct manager employee_id

Table: departments

ColumnTypeDescription
department_idINT (PK)Unique department identifier
department_nameVARCHAR(100)Department name
locationVARCHAR(100)Office location

Table: performance_reviews

ColumnTypeDescription
review_idINT (PK)Unique review identifier
employee_idINT (FK)Reference to employees
review_dateDATEDate of review
ratingDECIMAL(2,1)Performance rating (1.0-5.0)
reviewer_idINT (FK)Manager who conducted review

Table: salary_history

ColumnTypeDescription
salary_idINT (PK)Unique record identifier
employee_idINT (FK)Reference to employees
effective_dateDATEDate salary became effective
annual_salaryDECIMAL(10,2)Annual salary amount

Table: exit_surveys

ColumnTypeDescription
survey_idINT (PK)Unique survey identifier
employee_idINT (FK)Reference to employees
exit_dateDATEDate of departure
reason_categoryVARCHAR(50)Primary reason for leaving
satisfaction_scoreINTOverall satisfaction (1-10)
would_recommendBOOLEANWould recommend company

SQL Analysis

Query 1: Attrition Rate by Department and Year

Calculate the annual attrition rate for each department to identify which teams are losing the most people and whether the trend is worsening.

SELECT
    d.department_name,
    EXTRACT(YEAR FROM e.termination_date) AS exit_year,
    COUNT(e.employee_id) AS employees_left,
    COUNT(e.employee_id) * 100.0 / (
        SELECT COUNT(*)
        FROM employees e2
        WHERE e2.department_id = d.department_id
          AND e2.hire_date <= DATE_TRUNC('year', e.termination_date) + INTERVAL '1 year' - INTERVAL '1 day'
          AND (e2.termination_date IS NULL
               OR e2.termination_date >= DATE_TRUNC('year', e.termination_date))
    ) AS attrition_rate_pct
FROM employees e
JOIN departments d ON e.department_id = d.department_id
WHERE e.termination_date IS NOT NULL
GROUP BY d.department_name, EXTRACT(YEAR FROM e.termination_date), d.department_id
ORDER BY exit_year DESC, attrition_rate_pct DESC;

This query uses a correlated subquery to compute the headcount denominator at each year-end, giving a true attrition percentage rather than a raw count.

Query 2: Attrition by Tenure Band

Group employees who left by their tenure length to see whether most attrition happens in the first year, mid-career, or among long-tenured staff.

SELECT
    CASE
        WHEN tenure_months < 6  THEN '0-6 months'
        WHEN tenure_months < 12 THEN '6-12 months'
        WHEN tenure_months < 24 THEN '1-2 years'
        WHEN tenure_months < 48 THEN '2-4 years'
        ELSE '4+ years'
    END AS tenure_band,
    COUNT(*) AS departures,
    ROUND(AVG(latest_salary), 0) AS avg_salary_at_exit,
    ROUND(AVG(last_rating), 1) AS avg_last_rating
FROM (
    SELECT
        e.employee_id,
        EXTRACT(EPOCH FROM (e.termination_date - e.hire_date)) / (30*86400) AS tenure_months,
        sh.annual_salary AS latest_salary,
        pr.rating AS last_rating
    FROM employees e
    JOIN LATERAL (
        SELECT annual_salary FROM salary_history
        WHERE employee_id = e.employee_id
        ORDER BY effective_date DESC LIMIT 1
    ) sh ON TRUE
    LEFT JOIN LATERAL (
        SELECT rating FROM performance_reviews
        WHERE employee_id = e.employee_id
        ORDER BY review_date DESC LIMIT 1
    ) pr ON TRUE
    WHERE e.termination_date IS NOT NULL
) sub
GROUP BY tenure_band
ORDER BY MIN(tenure_months);

Query 3: Salary Growth vs. Attrition Correlation

Compare the average annual salary increase for employees who stayed versus those who left, testing whether stagnant pay drives attrition.

WITH salary_growth AS (
    SELECT
        sh.employee_id,
        e.is_active,
        (MAX(sh.annual_salary) - MIN(sh.annual_salary)) * 100.0
            / NULLIF(MIN(sh.annual_salary), 0) AS total_growth_pct,
        COUNT(sh.salary_id) AS salary_changes
    FROM salary_history sh
    JOIN employees e ON sh.employee_id = e.employee_id
    GROUP BY sh.employee_id, e.is_active
    HAVING COUNT(sh.salary_id) > 1
)
SELECT
    CASE WHEN is_active THEN 'Active' ELSE 'Departed' END AS status,
    COUNT(*) AS employees,
    ROUND(AVG(total_growth_pct), 1) AS avg_growth_pct,
    ROUND(AVG(salary_changes), 1) AS avg_salary_adjustments
FROM salary_growth
GROUP BY is_active;

Query 4: Exit Survey Theme Analysis

Aggregate exit survey responses to identify the most cited reasons for leaving and their associated satisfaction scores.

SELECT
    es.reason_category,
    COUNT(*) AS responses,
    ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (), 1) AS pct_of_exits,
    ROUND(AVG(es.satisfaction_score), 1) AS avg_satisfaction,
    ROUND(AVG(CASE WHEN es.would_recommend THEN 1 ELSE 0 END) * 100, 1) AS pct_would_recommend
FROM exit_surveys es
GROUP BY es.reason_category
ORDER BY responses DESC;

Python Analysis

Kaplan-Meier Survival Analysis

Use survival analysis to estimate the probability of an employee remaining at the company over time, segmented by department.

import pandas as pd
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter

# Load data
employees = pd.read_csv('employees.csv', parse_dates=['hire_date', 'termination_date'])

# Calculate tenure in months
today = pd.Timestamp('2026-01-01')
employees['tenure_months'] = (
    (employees['termination_date'].fillna(today) - employees['hire_date']).dt.days / 30.44
).round(1)
employees['event_observed'] = employees['termination_date'].notna().astype(int)

# Fit Kaplan-Meier by department
departments = pd.read_csv('departments.csv')
df = employees.merge(departments, on='department_id')

fig, ax = plt.subplots(figsize=(10, 6))
kmf = KaplanMeierFitter()

for dept in df['department_name'].unique():
    mask = df['department_name'] == dept
    kmf.fit(
        durations=df.loc[mask, 'tenure_months'],
        event_observed=df.loc[mask, 'event_observed'],
        label=dept
    )
    kmf.plot_survival_function(ax=ax)

ax.set_title('Employee Survival Curves by Department', fontsize=14)
ax.set_xlabel('Tenure (months)')
ax.set_ylabel('Probability of Remaining Employed')
ax.legend(loc='lower left', fontsize=9)
plt.tight_layout()
plt.savefig('survival_curves.png', dpi=150)
plt.show()

Logistic Regression for Attrition Prediction

Build a predictive model that flags employees at risk of leaving based on salary growth, performance ratings, tenure, and department.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import classification_report, roc_auc_score
import matplotlib.pyplot as plt

# Prepare feature matrix
features = pd.read_csv('employee_features.csv')
# Columns: employee_id, department_name, tenure_months, latest_salary,
#           salary_growth_pct, last_rating, num_reviews, age, attrited (0/1)

le = LabelEncoder()
features['dept_encoded'] = le.fit_transform(features['department_name'])

X = features[['tenure_months', 'latest_salary', 'salary_growth_pct',
               'last_rating', 'num_reviews', 'age', 'dept_encoded']]
y = features['attrited']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = LogisticRegression(max_iter=1000, class_weight='balanced')
model.fit(X_train_scaled, y_train)

y_pred = model.predict(X_test_scaled)
y_prob = model.predict_proba(X_test_scaled)[:, 1]

print(classification_report(y_test, y_pred))
print(f'ROC-AUC: {roc_auc_score(y_test, y_prob):.3f}')

# Feature importance plot
coefs = pd.Series(model.coef_[0], index=X.columns).sort_values()
fig, ax = plt.subplots(figsize=(8, 5))
coefs.plot(kind='barh', ax=ax, color=['#e74c3c' if v > 0 else '#2ecc71' for v in coefs])
ax.set_title('Attrition Risk Factors (Logistic Regression Coefficients)', fontsize=13)
ax.set_xlabel('Coefficient (positive = increases attrition risk)')
ax.axvline(0, color='gray', linewidth=0.8, linestyle='--')
plt.tight_layout()
plt.savefig('feature_importance.png', dpi=150)
plt.show()

Excel Dashboard

Build an interactive attrition dashboard in Excel to present findings to the HR leadership team.

  1. Import Data: Load the SQL query results into separate sheets: "Attrition by Dept", "Tenure Bands", "Salary Growth", and "Exit Surveys". Format each as an Excel Table (Ctrl+T).
  2. Attrition Trend Chart: Create a clustered column chart from the department attrition data. Place departments on the X-axis, attrition rate on the Y-axis, and group by year. Add a horizontal reference line at 13% (industry average) using a secondary series.
  3. Department Heatmap: Build a matrix with departments as rows and months as columns. Use Conditional Formatting (Color Scales: green-yellow-red) to highlight months with the highest departure counts. This reveals seasonal patterns in attrition.
  4. Tenure Distribution Chart: Create a bar chart showing departures by tenure band. Add a secondary axis line for average last performance rating to see if low performers leave at specific tenure points.
  5. Exit Survey Treemap: Use an Excel Treemap chart (Insert > Treemap) to visualize exit reason categories sized by response count. Color-code by average satisfaction score to reveal which reasons correlate with the lowest satisfaction.
  6. Retention Cost Calculator: Create an input section where HR can enter average recruitment cost, onboarding cost, and lost productivity weeks. Use formulas: =recruitment_cost + onboarding_cost + (weekly_salary * lost_weeks) to estimate cost-per-departure. Multiply by projected departures for annual cost.
  7. Dashboard Tab: Create a summary dashboard tab with slicers for Department, Year, and Tenure Band. Link PivotCharts to the slicers so leadership can interactively filter. Add KPI cards at the top showing: Overall Attrition Rate, Avg Tenure at Exit, Top Exit Reason, and Estimated Annual Cost.
  8. At-Risk List: Import the Python model's predicted probabilities and create a table sorted by risk score. Apply Conditional Formatting (Data Bars) to the risk column and add an Icon Set (traffic lights) for quick visual scanning.

Key Insights

Early Tenure Risk

42% of departures occur within the first 12 months, indicating onboarding and early engagement gaps. New hire mentorship programs could reduce this significantly.

Salary Stagnation

Departed employees averaged only 4.2% total salary growth compared to 11.8% for active employees. Proactive mid-year adjustments for high performers could improve retention.

Department Hotspots

Customer Support (28% attrition) and Sales (24% attrition) far exceed other departments. Exit surveys point to workload and limited career progression as primary causes.

Predictive Power

The logistic regression model achieves an ROC-AUC of 0.82, with salary growth percentage and last performance rating as the strongest predictors of attrition risk.

Knowledge Check

Question 1

In the Kaplan-Meier survival analysis, what does it mean when the survival curve drops steeply in the first 6 months?

Question 2

Why is class_weight='balanced' used in the logistic regression model?

Question 3

In the SQL salary growth query, what does NULLIF(MIN(sh.annual_salary), 0) prevent?

Question 4

Which Excel feature is most appropriate for allowing leadership to interactively filter the attrition dashboard by department?

Question 5

A positive coefficient in the logistic regression feature importance plot means:

← Previous Project Next Project →