Employee Attrition Analysis
Investigate high turnover rates, identify at-risk employees, and build predictive models to help an HR director develop targeted retention strategies.
Project Overview
Scenario: You are a data analyst working with the HR director at a mid-size technology company (2,500 employees). Over the past two years, voluntary turnover has climbed from 12% to 19%, well above the industry average of 13%. Leadership wants to understand which departments, roles, and employee profiles are most affected, and whether attrition can be predicted before employees resign.
Objective: Analyze historical employee data, exit survey responses, and salary records to identify the primary drivers of attrition. Build a logistic regression model to flag at-risk employees and recommend data-backed retention interventions.
Tools: SQL (PostgreSQL), Python (pandas, matplotlib, scikit-learn, lifelines), Microsoft Excel
Dataset Description
Table: employees
| Column | Type | Description |
|---|---|---|
| employee_id | INT (PK) | Unique employee identifier |
| first_name | VARCHAR(50) | Employee first name |
| last_name | VARCHAR(50) | Employee last name |
| department_id | INT (FK) | Reference to departments table |
| role_title | VARCHAR(100) | Job title |
| hire_date | DATE | Date employee was hired |
| termination_date | DATE | Date employee left (NULL if active) |
| is_active | BOOLEAN | Current employment status |
| age | INT | Employee age at hire |
| manager_id | INT (FK) | Direct manager employee_id |
Table: departments
| Column | Type | Description |
|---|---|---|
| department_id | INT (PK) | Unique department identifier |
| department_name | VARCHAR(100) | Department name |
| location | VARCHAR(100) | Office location |
Table: performance_reviews
| Column | Type | Description |
|---|---|---|
| review_id | INT (PK) | Unique review identifier |
| employee_id | INT (FK) | Reference to employees |
| review_date | DATE | Date of review |
| rating | DECIMAL(2,1) | Performance rating (1.0-5.0) |
| reviewer_id | INT (FK) | Manager who conducted review |
Table: salary_history
| Column | Type | Description |
|---|---|---|
| salary_id | INT (PK) | Unique record identifier |
| employee_id | INT (FK) | Reference to employees |
| effective_date | DATE | Date salary became effective |
| annual_salary | DECIMAL(10,2) | Annual salary amount |
Table: exit_surveys
| Column | Type | Description |
|---|---|---|
| survey_id | INT (PK) | Unique survey identifier |
| employee_id | INT (FK) | Reference to employees |
| exit_date | DATE | Date of departure |
| reason_category | VARCHAR(50) | Primary reason for leaving |
| satisfaction_score | INT | Overall satisfaction (1-10) |
| would_recommend | BOOLEAN | Would recommend company |
SQL Analysis
Query 1: Attrition Rate by Department and Year
Calculate the annual attrition rate for each department to identify which teams are losing the most people and whether the trend is worsening.
SELECT
d.department_name,
EXTRACT(YEAR FROM e.termination_date) AS exit_year,
COUNT(e.employee_id) AS employees_left,
COUNT(e.employee_id) * 100.0 / (
SELECT COUNT(*)
FROM employees e2
WHERE e2.department_id = d.department_id
AND e2.hire_date <= DATE_TRUNC('year', e.termination_date) + INTERVAL '1 year' - INTERVAL '1 day'
AND (e2.termination_date IS NULL
OR e2.termination_date >= DATE_TRUNC('year', e.termination_date))
) AS attrition_rate_pct
FROM employees e
JOIN departments d ON e.department_id = d.department_id
WHERE e.termination_date IS NOT NULL
GROUP BY d.department_name, EXTRACT(YEAR FROM e.termination_date), d.department_id
ORDER BY exit_year DESC, attrition_rate_pct DESC;
This query uses a correlated subquery to compute the headcount denominator at each year-end, giving a true attrition percentage rather than a raw count.
Query 2: Attrition by Tenure Band
Group employees who left by their tenure length to see whether most attrition happens in the first year, mid-career, or among long-tenured staff.
SELECT
CASE
WHEN tenure_months < 6 THEN '0-6 months'
WHEN tenure_months < 12 THEN '6-12 months'
WHEN tenure_months < 24 THEN '1-2 years'
WHEN tenure_months < 48 THEN '2-4 years'
ELSE '4+ years'
END AS tenure_band,
COUNT(*) AS departures,
ROUND(AVG(latest_salary), 0) AS avg_salary_at_exit,
ROUND(AVG(last_rating), 1) AS avg_last_rating
FROM (
SELECT
e.employee_id,
EXTRACT(EPOCH FROM (e.termination_date - e.hire_date)) / (30*86400) AS tenure_months,
sh.annual_salary AS latest_salary,
pr.rating AS last_rating
FROM employees e
JOIN LATERAL (
SELECT annual_salary FROM salary_history
WHERE employee_id = e.employee_id
ORDER BY effective_date DESC LIMIT 1
) sh ON TRUE
LEFT JOIN LATERAL (
SELECT rating FROM performance_reviews
WHERE employee_id = e.employee_id
ORDER BY review_date DESC LIMIT 1
) pr ON TRUE
WHERE e.termination_date IS NOT NULL
) sub
GROUP BY tenure_band
ORDER BY MIN(tenure_months);
Query 3: Salary Growth vs. Attrition Correlation
Compare the average annual salary increase for employees who stayed versus those who left, testing whether stagnant pay drives attrition.
WITH salary_growth AS (
SELECT
sh.employee_id,
e.is_active,
(MAX(sh.annual_salary) - MIN(sh.annual_salary)) * 100.0
/ NULLIF(MIN(sh.annual_salary), 0) AS total_growth_pct,
COUNT(sh.salary_id) AS salary_changes
FROM salary_history sh
JOIN employees e ON sh.employee_id = e.employee_id
GROUP BY sh.employee_id, e.is_active
HAVING COUNT(sh.salary_id) > 1
)
SELECT
CASE WHEN is_active THEN 'Active' ELSE 'Departed' END AS status,
COUNT(*) AS employees,
ROUND(AVG(total_growth_pct), 1) AS avg_growth_pct,
ROUND(AVG(salary_changes), 1) AS avg_salary_adjustments
FROM salary_growth
GROUP BY is_active;
Query 4: Exit Survey Theme Analysis
Aggregate exit survey responses to identify the most cited reasons for leaving and their associated satisfaction scores.
SELECT
es.reason_category,
COUNT(*) AS responses,
ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (), 1) AS pct_of_exits,
ROUND(AVG(es.satisfaction_score), 1) AS avg_satisfaction,
ROUND(AVG(CASE WHEN es.would_recommend THEN 1 ELSE 0 END) * 100, 1) AS pct_would_recommend
FROM exit_surveys es
GROUP BY es.reason_category
ORDER BY responses DESC;
Python Analysis
Kaplan-Meier Survival Analysis
Use survival analysis to estimate the probability of an employee remaining at the company over time, segmented by department.
import pandas as pd
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter
# Load data
employees = pd.read_csv('employees.csv', parse_dates=['hire_date', 'termination_date'])
# Calculate tenure in months
today = pd.Timestamp('2026-01-01')
employees['tenure_months'] = (
(employees['termination_date'].fillna(today) - employees['hire_date']).dt.days / 30.44
).round(1)
employees['event_observed'] = employees['termination_date'].notna().astype(int)
# Fit Kaplan-Meier by department
departments = pd.read_csv('departments.csv')
df = employees.merge(departments, on='department_id')
fig, ax = plt.subplots(figsize=(10, 6))
kmf = KaplanMeierFitter()
for dept in df['department_name'].unique():
mask = df['department_name'] == dept
kmf.fit(
durations=df.loc[mask, 'tenure_months'],
event_observed=df.loc[mask, 'event_observed'],
label=dept
)
kmf.plot_survival_function(ax=ax)
ax.set_title('Employee Survival Curves by Department', fontsize=14)
ax.set_xlabel('Tenure (months)')
ax.set_ylabel('Probability of Remaining Employed')
ax.legend(loc='lower left', fontsize=9)
plt.tight_layout()
plt.savefig('survival_curves.png', dpi=150)
plt.show()
Logistic Regression for Attrition Prediction
Build a predictive model that flags employees at risk of leaving based on salary growth, performance ratings, tenure, and department.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import classification_report, roc_auc_score
import matplotlib.pyplot as plt
# Prepare feature matrix
features = pd.read_csv('employee_features.csv')
# Columns: employee_id, department_name, tenure_months, latest_salary,
# salary_growth_pct, last_rating, num_reviews, age, attrited (0/1)
le = LabelEncoder()
features['dept_encoded'] = le.fit_transform(features['department_name'])
X = features[['tenure_months', 'latest_salary', 'salary_growth_pct',
'last_rating', 'num_reviews', 'age', 'dept_encoded']]
y = features['attrited']
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.25, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
model = LogisticRegression(max_iter=1000, class_weight='balanced')
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)
y_prob = model.predict_proba(X_test_scaled)[:, 1]
print(classification_report(y_test, y_pred))
print(f'ROC-AUC: {roc_auc_score(y_test, y_prob):.3f}')
# Feature importance plot
coefs = pd.Series(model.coef_[0], index=X.columns).sort_values()
fig, ax = plt.subplots(figsize=(8, 5))
coefs.plot(kind='barh', ax=ax, color=['#e74c3c' if v > 0 else '#2ecc71' for v in coefs])
ax.set_title('Attrition Risk Factors (Logistic Regression Coefficients)', fontsize=13)
ax.set_xlabel('Coefficient (positive = increases attrition risk)')
ax.axvline(0, color='gray', linewidth=0.8, linestyle='--')
plt.tight_layout()
plt.savefig('feature_importance.png', dpi=150)
plt.show()
Excel Dashboard
Build an interactive attrition dashboard in Excel to present findings to the HR leadership team.
- Import Data: Load the SQL query results into separate sheets: "Attrition by Dept", "Tenure Bands", "Salary Growth", and "Exit Surveys". Format each as an Excel Table (Ctrl+T).
- Attrition Trend Chart: Create a clustered column chart from the department attrition data. Place departments on the X-axis, attrition rate on the Y-axis, and group by year. Add a horizontal reference line at 13% (industry average) using a secondary series.
- Department Heatmap: Build a matrix with departments as rows and months as columns. Use Conditional Formatting (Color Scales: green-yellow-red) to highlight months with the highest departure counts. This reveals seasonal patterns in attrition.
- Tenure Distribution Chart: Create a bar chart showing departures by tenure band. Add a secondary axis line for average last performance rating to see if low performers leave at specific tenure points.
- Exit Survey Treemap: Use an Excel Treemap chart (Insert > Treemap) to visualize exit reason categories sized by response count. Color-code by average satisfaction score to reveal which reasons correlate with the lowest satisfaction.
- Retention Cost Calculator: Create an input section where HR can enter average recruitment cost, onboarding cost, and lost productivity weeks. Use formulas:
=recruitment_cost + onboarding_cost + (weekly_salary * lost_weeks)to estimate cost-per-departure. Multiply by projected departures for annual cost. - Dashboard Tab: Create a summary dashboard tab with slicers for Department, Year, and Tenure Band. Link PivotCharts to the slicers so leadership can interactively filter. Add KPI cards at the top showing: Overall Attrition Rate, Avg Tenure at Exit, Top Exit Reason, and Estimated Annual Cost.
- At-Risk List: Import the Python model's predicted probabilities and create a table sorted by risk score. Apply Conditional Formatting (Data Bars) to the risk column and add an Icon Set (traffic lights) for quick visual scanning.
Key Insights
Early Tenure Risk
42% of departures occur within the first 12 months, indicating onboarding and early engagement gaps. New hire mentorship programs could reduce this significantly.
Salary Stagnation
Departed employees averaged only 4.2% total salary growth compared to 11.8% for active employees. Proactive mid-year adjustments for high performers could improve retention.
Department Hotspots
Customer Support (28% attrition) and Sales (24% attrition) far exceed other departments. Exit surveys point to workload and limited career progression as primary causes.
Predictive Power
The logistic regression model achieves an ROC-AUC of 0.82, with salary growth percentage and last performance rating as the strongest predictors of attrition risk.
Knowledge Check
In the Kaplan-Meier survival analysis, what does it mean when the survival curve drops steeply in the first 6 months?
Why is class_weight='balanced' used in the logistic regression model?
In the SQL salary growth query, what does NULLIF(MIN(sh.annual_salary), 0) prevent?
Which Excel feature is most appropriate for allowing leadership to interactively filter the attrition dashboard by department?
A positive coefficient in the logistic regression feature importance plot means: