Fraud Detection Analysis
A major bank is experiencing a rise in fraudulent credit card transactions and needs a system to detect suspicious activity in real time. You will use SQL to identify anomalous patterns, then build a Python-based anomaly detection model using Isolation Forest to flag potentially fraudulent transactions before they cause financial losses.
Project Overview
Business Scenario
Pacific Coast Bank processes 2 million credit card transactions daily. Their existing rule-based fraud detection system catches only 60% of fraudulent transactions, while generating too many false alerts that frustrate legitimate customers. The fraud team wants a machine learning approach that can detect subtle patterns humans miss, reduce false positives, and adapt to evolving fraud tactics. Fraud currently costs the bank $4.2 million annually in chargebacks and investigation expenses.
- Goal: Build an anomaly detection system that identifies fraudulent transactions with high precision
- Tools: SQL (pattern detection and rule-based flags), Python (Isolation Forest model, feature engineering)
- Stakeholder: Fraud Prevention Team and Risk Management
- Deliverables: Anomaly detection model, transaction risk scores, high-risk merchant list, velocity check alerts
Dataset
Database Schema
The fraud detection database captures transaction details, account information, merchant data, and confirmed fraud labels for model training.
transactions
| Column | Type | Description |
|---|---|---|
| txn_id | INT (PK) | Unique transaction identifier |
| account_id | INT (FK) | Customer account |
| merchant_id | INT (FK) | Merchant where transaction occurred |
| txn_datetime | DATETIME | Transaction timestamp |
| amount | DECIMAL(10,2) | Transaction amount |
| txn_type | VARCHAR(20) | Purchase, ATM Withdrawal, Online, Transfer |
| location_city | VARCHAR(50) | City of transaction |
| location_country | VARCHAR(3) | Country code (ISO 3166) |
accounts
| Column | Type | Description |
|---|---|---|
| account_id | INT (PK) | Account identifier |
| customer_name | VARCHAR(100) | Account holder name |
| account_type | VARCHAR(20) | Credit, Debit, Prepaid |
| credit_limit | DECIMAL(10,2) | Credit limit (NULL for debit) |
| home_country | VARCHAR(3) | Account holder's home country |
| account_open_date | DATE | Date account was opened |
merchants
| Column | Type | Description |
|---|---|---|
| merchant_id | INT (PK) | Merchant identifier |
| merchant_name | VARCHAR(100) | Business name |
| mcc_code | VARCHAR(4) | Merchant Category Code |
| category | VARCHAR(50) | Business category (Retail, Restaurant, Gas Station, etc.) |
| risk_score | INT | Internal risk score (1-10) |
fraud_labels
| Column | Type | Description |
|---|---|---|
| txn_id | INT (PK/FK) | Transaction identifier |
| is_fraud | TINYINT | 1 = confirmed fraud, 0 = legitimate |
| fraud_type | VARCHAR(30) | Stolen card, Account takeover, Card-not-present, etc. |
| reported_date | DATE | Date fraud was reported |
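For readers who want to experiment locally before connecting to a production database, the schema can be mirrored in SQLite from Python. This is a convenience sketch only: SQLite's type affinities approximate the MySQL types above, and the database file name is illustrative.
import sqlite3

# Minimal local mirror of the fraud schema. SQLite affinities (TEXT/REAL/INTEGER)
# stand in for the MySQL types documented above; not production DDL.
conn = sqlite3.connect('fraud_demo.db')
conn.executescript("""
CREATE TABLE IF NOT EXISTS accounts (
    account_id        INTEGER PRIMARY KEY,
    customer_name     TEXT,
    account_type      TEXT,
    credit_limit      REAL,            -- NULL for debit accounts
    home_country      TEXT,
    account_open_date TEXT
);
CREATE TABLE IF NOT EXISTS merchants (
    merchant_id   INTEGER PRIMARY KEY,
    merchant_name TEXT,
    mcc_code      TEXT,
    category      TEXT,
    risk_score    INTEGER              -- internal 1-10 risk score
);
CREATE TABLE IF NOT EXISTS transactions (
    txn_id           INTEGER PRIMARY KEY,
    account_id       INTEGER REFERENCES accounts(account_id),
    merchant_id      INTEGER REFERENCES merchants(merchant_id),
    txn_datetime     TEXT,
    amount           REAL,
    txn_type         TEXT,
    location_city    TEXT,
    location_country TEXT
);
CREATE TABLE IF NOT EXISTS fraud_labels (
    txn_id        INTEGER PRIMARY KEY REFERENCES transactions(txn_id),
    is_fraud      INTEGER,             -- 1 = confirmed fraud, 0 = legitimate
    fraud_type    TEXT,
    reported_date TEXT
);
""")
conn.commit()
conn.close()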
SQL Analysis
Query 1: Transaction Amount Anomalies
Identify transactions that are significantly larger than a customer's typical spending pattern. Transactions exceeding 3 standard deviations from the account mean are flagged.
WITH account_stats AS (
SELECT
account_id,
AVG(amount) AS avg_amount,
STDDEV(amount) AS std_amount,
COUNT(*) AS txn_count
FROM transactions
WHERE txn_datetime >= DATE_SUB(CURDATE(), INTERVAL 90 DAY)
GROUP BY account_id
HAVING COUNT(*) >= 10 AND STDDEV(amount) > 0  -- exclude flat spenders to avoid divide-by-zero
)
SELECT
t.txn_id,
t.account_id,
t.amount,
ROUND(a.avg_amount, 2) AS account_avg,
ROUND((t.amount - a.avg_amount) / a.std_amount, 2) AS z_score,
t.txn_datetime,
t.location_country,
COALESCE(f.is_fraud, 0) AS confirmed_fraud
FROM transactions t
JOIN account_stats a ON t.account_id = a.account_id
LEFT JOIN fraud_labels f ON t.txn_id = f.txn_id
WHERE (t.amount - a.avg_amount) / a.std_amount > 3
ORDER BY z_score DESC
LIMIT 100;
Query 2: High-Risk Merchants
Find merchants with unusually high fraud rates. These merchants may have compromised payment terminals or be complicit in fraud schemes.
SELECT
m.merchant_id,
m.merchant_name,
m.category,
COUNT(t.txn_id) AS total_transactions,
COALESCE(SUM(f.is_fraud), 0) AS fraud_count,
ROUND(COALESCE(SUM(f.is_fraud), 0) * 100.0 / COUNT(t.txn_id), 2) AS fraud_rate_pct,
ROUND(SUM(CASE WHEN f.is_fraud = 1 THEN t.amount ELSE 0 END), 2) AS fraud_loss_total
FROM merchants m
JOIN transactions t ON m.merchant_id = t.merchant_id
LEFT JOIN fraud_labels f ON t.txn_id = f.txn_id
GROUP BY m.merchant_id, m.merchant_name, m.category
HAVING COUNT(t.txn_id) >= 50
ORDER BY fraud_rate_pct DESC
LIMIT 20;
Query 3: Velocity Checks (Rapid Successive Transactions)
Detect accounts with multiple transactions within a short time window, a classic indicator of stolen card usage where fraudsters rapidly make purchases before the card is blocked.
WITH txn_gaps AS (
SELECT
t.txn_id,
t.account_id,
t.amount,
t.txn_datetime,
t.location_city,
LAG(t.txn_datetime) OVER (
PARTITION BY t.account_id ORDER BY t.txn_datetime
) AS prev_txn_time,
LAG(t.location_city) OVER (
PARTITION BY t.account_id ORDER BY t.txn_datetime
) AS prev_city,
TIMESTAMPDIFF(MINUTE, LAG(t.txn_datetime) OVER (
PARTITION BY t.account_id ORDER BY t.txn_datetime
), t.txn_datetime) AS minutes_since_last
FROM transactions t
WHERE t.txn_datetime >= DATE_SUB(CURDATE(), INTERVAL 30 DAY)
)
SELECT
txn_id,
account_id,
amount,
txn_datetime,
minutes_since_last,
location_city,
prev_city,
CASE
WHEN minutes_since_last < 2 AND location_city != prev_city
THEN 'HIGH RISK: Impossible travel'
WHEN minutes_since_last < 5
THEN 'MEDIUM RISK: Rapid succession'
ELSE 'LOW RISK'
END AS velocity_flag
FROM txn_gaps
WHERE minutes_since_last < 10
ORDER BY minutes_since_last;
Python Analysis
Feature Engineering
Transform raw transaction data into features that capture behavioral patterns. These engineered features are far more predictive than raw amounts alone.
import pandas as pd
import numpy as np
txn = pd.read_csv('transactions.csv', parse_dates=['txn_datetime'])
fraud = pd.read_csv('fraud_labels.csv')
merchants = pd.read_csv('merchants.csv')
accounts = pd.read_csv('accounts.csv')
# Merge datasets
df = txn.merge(fraud, on='txn_id', how='left')
df = df.merge(merchants[['merchant_id', 'category', 'risk_score']], on='merchant_id')
df = df.merge(accounts[['account_id', 'home_country', 'credit_limit']], on='account_id')
df['is_fraud'] = df['is_fraud'].fillna(0).astype(int)
# Feature: Transaction amount deviation from account average
acct_stats = df.groupby('account_id')['amount'].agg(['mean', 'std']).reset_index()
acct_stats.columns = ['account_id', 'acct_avg_amount', 'acct_std_amount']
df = df.merge(acct_stats, on='account_id')
df['amount_zscore'] = (df['amount'] - df['acct_avg_amount']) / df['acct_std_amount'].replace(0, 1)
# Feature: Transaction velocity (count of transactions in the trailing 1 hour)
df = df.sort_values(['account_id', 'txn_datetime'])
df['txn_hour'] = df['txn_datetime'].dt.hour
# Time-based rolling count per account; requires the datetimes to be sorted within each group
df['txn_count_1hr'] = (
    df.set_index('txn_datetime')
      .groupby('account_id')['amount']
      .transform(lambda s: s.rolling('1h').count())
      .to_numpy()
)
# Simpler velocity: transactions per account per day
daily_counts = df.groupby(['account_id', df['txn_datetime'].dt.date]).size().reset_index(name='daily_txn_count')
daily_counts.columns = ['account_id', 'txn_date', 'daily_txn_count']
df['txn_date'] = df['txn_datetime'].dt.date
df = df.merge(daily_counts, on=['account_id', 'txn_date'], how='left')
# Feature: Foreign transaction flag
df['is_foreign'] = (df['location_country'] != df['home_country']).astype(int)
# Feature: Credit utilization of single transaction
df['txn_pct_of_limit'] = df['amount'] / df['credit_limit'].replace(0, np.nan)
# Feature: Weekend / late-night flag
df['is_weekend'] = df['txn_datetime'].dt.dayofweek.isin([5, 6]).astype(int)
df['is_night'] = df['txn_hour'].between(0, 5).astype(int)
print(f"Dataset: {len(df)} transactions, {df['is_fraud'].sum()} fraud ({df['is_fraud'].mean()*100:.2f}%)")
print(f"Features engineered: amount_zscore, daily_txn_count, is_foreign, txn_pct_of_limit, is_weekend, is_night")
Isolation Forest Anomaly Detection
Isolation Forest builds an ensemble of trees that recursively partition the data with random splits. Anomalies (fraudulent transactions) are isolated in fewer splits because they lie far from the dense regions of normal behavior. This is ideal for fraud detection because it does not require labeled data for training.
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler
feature_cols = ['amount', 'amount_zscore', 'txn_count_1hr', 'daily_txn_count',
                'is_foreign', 'txn_pct_of_limit', 'is_weekend', 'is_night', 'risk_score']
X = df[feature_cols].fillna(0)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Train Isolation Forest
# contamination = expected fraud rate (~0.5% of transactions)
iso_forest = IsolationForest(
n_estimators=200,
contamination=0.005,
max_samples='auto',
random_state=42
)
iso_forest.fit(X_scaled)
# Predict: -1 = anomaly (potential fraud), 1 = normal
df['anomaly_pred'] = iso_forest.predict(X_scaled)
df['anomaly_score'] = iso_forest.decision_function(X_scaled)
# Map decision_function output (roughly -0.5 to 0.5; lower = more anomalous)
# to a 0-100 risk score (higher = more suspicious)
df['risk_score_model'] = ((1 - df['anomaly_score']) * 50).clip(0, 100).round(1)
flagged = df[df['anomaly_pred'] == -1]
print(f"Flagged {len(flagged)} transactions as suspicious ({len(flagged)/len(df)*100:.2f}%)")
print(f"Of those, {flagged['is_fraud'].sum()} are confirmed fraud "
f"({flagged['is_fraud'].sum()/flagged['is_fraud'].count()*100:.1f}% precision)")
Precision-Recall Evaluation
In fraud detection, precision (what fraction of flagged transactions are actually fraud) and recall (what fraction of all fraud is caught) are more meaningful than accuracy because of extreme class imbalance.
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve, average_precision_score
# Use the anomaly score as a continuous predictor
# Negate because lower scores = more anomalous
y_true = df['is_fraud']
y_scores = -df['anomaly_score']
precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
avg_precision = average_precision_score(y_true, y_scores)
plt.figure(figsize=(10, 6))
plt.plot(recall, precision, color='darkorange', lw=2,
label=f'Isolation Forest (AP = {avg_precision:.3f})')
plt.axhline(y=y_true.mean(), color='gray', linestyle='--', label='Baseline (random)')
plt.xlabel('Recall (Fraction of Fraud Caught)')
plt.ylabel('Precision (Fraction of Flags That Are Fraud)')
plt.title('Precision-Recall Curve - Fraud Detection')
plt.legend(loc='upper right')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('precision_recall_fraud.png', dpi=150)
plt.show()
# Find threshold for 80% recall
idx_80 = np.argmin(np.abs(recall - 0.80))
print(f"At 80% recall: precision = {precision[idx_80]:.2%}")
print(f"At 80% recall: threshold = {thresholds[min(idx_80, len(thresholds)-1)]:.3f}")
Excel Component
Note on Excel Usage
Fraud detection is primarily a SQL + Python project due to the volume of transactions and the need for statistical modeling. However, Excel can supplement the analysis in the following ways:
- Summary Reporting: Export the top 100 flagged transactions to Excel for manual review by the fraud team. Include columns for transaction details, risk score, and a "Reviewed" checkbox column for the analyst to mark (a pandas export sketch follows this list).
- Merchant Risk Table: The high-risk merchant query results can be formatted in Excel with conditional formatting to highlight merchants exceeding a 5% fraud rate threshold.
- Weekly Fraud Dashboard: Create a pivot table summarizing fraud counts and amounts by week, transaction type, and merchant category for management reporting.
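A minimal sketch of that review export, assuming the flagged DataFrame from the Isolation Forest step is still in scope; the file and sheet names are illustrative, and .xlsx output requires openpyxl:
# Export the top 100 highest-risk flagged transactions for manual review
review_cols = ['txn_id', 'account_id', 'txn_datetime', 'amount',
               'location_city', 'location_country', 'risk_score_model']
review = (flagged.sort_values('risk_score_model', ascending=False)
                 .head(100)[review_cols]
                 .assign(Reviewed=''))  # empty column for analysts to check off
review.to_excel('fraud_review_queue.xlsx', sheet_name='Flagged Txns', index=False)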
Key Insights
Unsupervised Advantage
Isolation Forest does not require labeled fraud data for training, making it effective against new fraud patterns that have never been seen before. It detects anomalies based purely on how different a transaction is from normal behavior.
Feature Engineering is Critical
Raw transaction amount alone catches only obvious fraud. Engineered features like velocity (transactions per hour), z-score deviation, and impossible travel detection dramatically improve detection rates.
Precision vs Recall Tradeoff
Catching 95% of fraud (high recall) may require flagging 10x more legitimate transactions (low precision). The business must decide: is it worse to miss fraud or to block legitimate customers? Most banks target 80% recall with acceptable precision.
Layered Defense
The best fraud systems combine SQL rule-based checks (velocity, amount limits) with ML anomaly detection. Rules catch known patterns instantly; ML catches subtle, evolving patterns that rules would miss.
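A minimal sketch of how the two layers could be combined into a single triage decision, using the features built earlier; the thresholds and action labels are illustrative assumptions, not the bank's policy:
# Layered decision: hard rules fire first, then the model score.
# Thresholds are illustrative; a real system would tune them on labeled data.
def triage(txn):
    # Rule layer: velocity check (mirrors the SQL velocity query)
    if txn['txn_count_1hr'] >= 5:
        return 'BLOCK'
    # Model layer: Isolation Forest risk score
    if txn['risk_score_model'] >= 75:
        return 'BLOCK'
    if txn['risk_score_model'] >= 60 or txn['is_foreign'] == 1:
        return 'STEP-UP AUTH'  # e.g., one-time passcode challenge
    return 'APPROVE'

df['decision'] = df.apply(triage, axis=1)
print(df['decision'].value_counts())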
Quiz
How does Isolation Forest detect anomalies?
What does the "impossible travel" velocity check detect?
Why is precision-recall preferred over accuracy for evaluating fraud models?
What does the contamination parameter in Isolation Forest control?
Why is a z-score useful as a feature for fraud detection?