Fraud Detection Analysis
A major bank is experiencing a rise in fraudulent credit card transactions and needs a system to detect suspicious activity in real time. You will use SQL to identify anomalous patterns, then build a Python-based anomaly detection model using Isolation Forest to flag potentially fraudulent transactions before they cause financial losses.
Project Overview
Business Scenario
Pacific Coast Bank processes 2 million credit card transactions daily. Their existing rule-based fraud detection system catches only 60% of fraudulent transactions, while generating too many false alerts that frustrate legitimate customers. The fraud team wants a machine learning approach that can detect subtle patterns humans miss, reduce false positives, and adapt to evolving fraud tactics. Fraud currently costs the bank $4.2 million annually in chargebacks and investigation expenses.
- Goal: Build an anomaly detection system that identifies fraudulent transactions with high precision
- Tools: SQL (pattern detection and rule-based flags), Python (Isolation Forest model, feature engineering)
- Stakeholder: Fraud Prevention Team and Risk Management
- Deliverables: Anomaly detection model, transaction risk scores, high-risk merchant list, velocity check alerts
Dataset
Database Schema
The fraud detection database captures transaction details, account information, merchant data, and confirmed fraud labels for model training.
transactions
| Column | Type | Description |
|---|---|---|
| txn_id | INT (PK) | Unique transaction identifier |
| account_id | INT (FK) | Customer account |
| merchant_id | INT (FK) | Merchant where transaction occurred |
| txn_datetime | DATETIME | Transaction timestamp |
| amount | DECIMAL(10,2) | Transaction amount |
| txn_type | VARCHAR(20) | Purchase, ATM Withdrawal, Online, Transfer |
| location_city | VARCHAR(50) | City of transaction |
| location_country | VARCHAR(3) | Country code (ISO 3166) |
accounts
| Column | Type | Description |
|---|---|---|
| account_id | INT (PK) | Account identifier |
| customer_name | VARCHAR(100) | Account holder name |
| account_type | VARCHAR(20) | Credit, Debit, Prepaid |
| credit_limit | DECIMAL(10,2) | Credit limit (NULL for debit) |
| home_country | VARCHAR(3) | Account holder's home country |
| account_open_date | DATE | Date account was opened |
merchants
| Column | Type | Description |
|---|---|---|
| merchant_id | INT (PK) | Merchant identifier |
| merchant_name | VARCHAR(100) | Business name |
| mcc_code | VARCHAR(4) | Merchant Category Code |
| category | VARCHAR(50) | Business category (Retail, Restaurant, Gas Station, etc.) |
| risk_score | INT | Internal risk score (1-10) |
fraud_labels
| Column | Type | Description |
|---|---|---|
| txn_id | INT (PK/FK) | Transaction identifier |
| is_fraud | TINYINT | 1 = confirmed fraud, 0 = legitimate |
| fraud_type | VARCHAR(30) | Stolen card, Account takeover, Card-not-present, etc. |
| reported_date | DATE | Date fraud was reported |
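For readers who want to experiment locally before connecting to a production database, the schema can be mirrored in SQLite from Python. This is a convenience sketch only: SQLite's type affinities approximate the MySQL types above, and the database file name is illustrative.
import sqlite3

# Minimal local mirror of the fraud schema. SQLite affinities (TEXT/REAL/INTEGER)
# stand in for the MySQL types documented above; not production DDL.
conn = sqlite3.connect('fraud_demo.db')
conn.executescript("""
CREATE TABLE IF NOT EXISTS accounts (
    account_id        INTEGER PRIMARY KEY,
    customer_name     TEXT,
    account_type      TEXT,
    credit_limit      REAL,            -- NULL for debit accounts
    home_country      TEXT,
    account_open_date TEXT
);
CREATE TABLE IF NOT EXISTS merchants (
    merchant_id   INTEGER PRIMARY KEY,
    merchant_name TEXT,
    mcc_code      TEXT,
    category      TEXT,
    risk_score    INTEGER              -- internal 1-10 risk score
);
CREATE TABLE IF NOT EXISTS transactions (
    txn_id           INTEGER PRIMARY KEY,
    account_id       INTEGER REFERENCES accounts(account_id),
    merchant_id      INTEGER REFERENCES merchants(merchant_id),
    txn_datetime     TEXT,
    amount           REAL,
    txn_type         TEXT,
    location_city    TEXT,
    location_country TEXT
);
CREATE TABLE IF NOT EXISTS fraud_labels (
    txn_id        INTEGER PRIMARY KEY REFERENCES transactions(txn_id),
    is_fraud      INTEGER,             -- 1 = confirmed fraud, 0 = legitimate
    fraud_type    TEXT,
    reported_date TEXT
);
""")
conn.commit()
conn.close()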
SQL Analysis
Query 1: Transaction Amount Anomalies
Identify transactions that are significantly larger than a customer's typical spending pattern. Transactions exceeding 3 standard deviations from the account mean are flagged.
WITH account_stats AS (
SELECT
account_id,
AVG(amount) AS avg_amount,
STDDEV(amount) AS std_amount,
COUNT(*) AS txn_count
FROM transactions
WHERE txn_datetime >= DATE_SUB(CURDATE(), INTERVAL 90 DAY)
GROUP BY account_id
HAVING COUNT(*) >= 10 AND STDDEV(amount) > 0  -- exclude flat spenders to avoid divide-by-zero
)
SELECT
t.txn_id,
t.account_id,
t.amount,
ROUND(a.avg_amount, 2) AS account_avg,
ROUND((t.amount - a.avg_amount) / a.std_amount, 2) AS z_score,
t.txn_datetime,
t.location_country,
COALESCE(f.is_fraud, 0) AS confirmed_fraud
FROM transactions t
JOIN account_stats a ON t.account_id = a.account_id
LEFT JOIN fraud_labels f ON t.txn_id = f.txn_id
WHERE (t.amount - a.avg_amount) / a.std_amount > 3
ORDER BY z_score DESC
LIMIT 100;
Query 2: High-Risk Merchants
Find merchants with unusually high fraud rates. These merchants may have compromised payment terminals or be complicit in fraud schemes.
SELECT
m.merchant_id,
m.merchant_name,
m.category,
COUNT(t.txn_id) AS total_transactions,
COALESCE(SUM(f.is_fraud), 0) AS fraud_count,
ROUND(COALESCE(SUM(f.is_fraud), 0) * 100.0 / COUNT(t.txn_id), 2) AS fraud_rate_pct,
ROUND(SUM(CASE WHEN f.is_fraud = 1 THEN t.amount ELSE 0 END), 2) AS fraud_loss_total
FROM merchants m
JOIN transactions t ON m.merchant_id = t.merchant_id
LEFT JOIN fraud_labels f ON t.txn_id = f.txn_id
GROUP BY m.merchant_id, m.merchant_name, m.category
HAVING COUNT(t.txn_id) >= 50
ORDER BY fraud_rate_pct DESC
LIMIT 20;
Query 3: Velocity Checks (Rapid Successive Transactions)
Detect accounts with multiple transactions within a short time window, a classic indicator of stolen card usage where fraudsters rapidly make purchases before the card is blocked.
WITH txn_gaps AS (
SELECT
t.txn_id,
t.account_id,
t.amount,
t.txn_datetime,
t.location_city,
LAG(t.txn_datetime) OVER (
PARTITION BY t.account_id ORDER BY t.txn_datetime
) AS prev_txn_time,
LAG(t.location_city) OVER (
PARTITION BY t.account_id ORDER BY t.txn_datetime
) AS prev_city,
TIMESTAMPDIFF(MINUTE, LAG(t.txn_datetime) OVER (
PARTITION BY t.account_id ORDER BY t.txn_datetime
), t.txn_datetime) AS minutes_since_last
FROM transactions t
WHERE t.txn_datetime >= DATE_SUB(CURDATE(), INTERVAL 30 DAY)
)
SELECT
txn_id,
account_id,
amount,
txn_datetime,
minutes_since_last,
location_city,
prev_city,
CASE
WHEN minutes_since_last < 2 AND location_city != prev_city
THEN 'HIGH RISK: Impossible travel'
WHEN minutes_since_last < 5
THEN 'MEDIUM RISK: Rapid succession'
ELSE 'LOW RISK'
END AS velocity_flag
FROM txn_gaps
WHERE minutes_since_last < 10
ORDER BY minutes_since_last;
Python Analysis
Feature Engineering
Transform raw transaction data into features that capture behavioral patterns. These engineered features are far more predictive than raw amounts alone.
import pandas as pd
import numpy as np
txn = pd.read_csv('transactions.csv', parse_dates=['txn_datetime'])
fraud = pd.read_csv('fraud_labels.csv')
merchants = pd.read_csv('merchants.csv')
accounts = pd.read_csv('accounts.csv')
# Merge datasets
df = txn.merge(fraud, on='txn_id', how='left')
df = df.merge(merchants[['merchant_id', 'category', 'risk_score']], on='merchant_id')
df = df.merge(accounts[['account_id', 'home_country', 'credit_limit']], on='account_id')
df['is_fraud'] = df['is_fraud'].fillna(0).astype(int)
# Feature: Transaction amount deviation from account average
acct_stats = df.groupby('account_id')['amount'].agg(['mean', 'std']).reset_index()
acct_stats.columns = ['account_id', 'acct_avg_amount', 'acct_std_amount']
df = df.merge(acct_stats, on='account_id')
df['amount_zscore'] = (df['amount'] - df['acct_avg_amount']) / df['acct_std_amount'].replace(0, 1)
# Feature: Transaction velocity (count of transactions in the trailing 1 hour)
df = df.sort_values(['account_id', 'txn_datetime'])
df['txn_hour'] = df['txn_datetime'].dt.hour
# Time-based rolling count per account; requires the datetimes to be sorted within each group
df['txn_count_1hr'] = (
    df.set_index('txn_datetime')
      .groupby('account_id')['amount']
      .transform(lambda s: s.rolling('1h').count())
      .to_numpy()
)
# Simpler velocity: transactions per account per day
daily_counts = df.groupby(['account_id', df['txn_datetime'].dt.date]).size().reset_index(name='daily_txn_count')
daily_counts.columns = ['account_id', 'txn_date', 'daily_txn_count']
df['txn_date'] = df['txn_datetime'].dt.date
df = df.merge(daily_counts, on=['account_id', 'txn_date'], how='left')
# Feature: Foreign transaction flag
df['is_foreign'] = (df['location_country'] != df['home_country']).astype(int)
# Feature: Credit utilization of single transaction
df['txn_pct_of_limit'] = df['amount'] / df['credit_limit'].replace(0, np.nan)
# Feature: Weekend / late-night flag
df['is_weekend'] = df['txn_datetime'].dt.dayofweek.isin([5, 6]).astype(int)
df['is_night'] = df['txn_hour'].between(0, 5).astype(int)
print(f"Dataset: {len(df)} transactions, {df['is_fraud'].sum()} fraud ({df['is_fraud'].mean()*100:.2f}%)")
print(f"Features engineered: amount_zscore, daily_txn_count, is_foreign, txn_pct_of_limit, is_weekend, is_night")
Isolation Forest Anomaly Detection
Isolation Forest builds an ensemble of trees that recursively partition the data with random splits. Anomalies (fraudulent transactions) are isolated in fewer splits because they lie far from the dense regions of normal behavior. This is ideal for fraud detection because it does not require labeled data for training.
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler
feature_cols = ['amount', 'amount_zscore', 'txn_count_1hr', 'daily_txn_count',
                'is_foreign', 'txn_pct_of_limit', 'is_weekend', 'is_night', 'risk_score']
X = df[feature_cols].fillna(0)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Train Isolation Forest
# contamination = expected fraud rate (~0.5% of transactions)
iso_forest = IsolationForest(
n_estimators=200,
contamination=0.005,
max_samples='auto',
random_state=42
)
iso_forest.fit(X_scaled)
# Predict: -1 = anomaly (potential fraud), 1 = normal
df['anomaly_pred'] = iso_forest.predict(X_scaled)
df['anomaly_score'] = iso_forest.decision_function(X_scaled)
# Map decision_function output (roughly -0.5 to 0.5; lower = more anomalous)
# to a 0-100 risk score (higher = more suspicious)
df['risk_score_model'] = ((1 - df['anomaly_score']) * 50).clip(0, 100).round(1)
flagged = df[df['anomaly_pred'] == -1]
print(f"Flagged {len(flagged)} transactions as suspicious ({len(flagged)/len(df)*100:.2f}%)")
print(f"Of those, {flagged['is_fraud'].sum()} are confirmed fraud "
f"({flagged['is_fraud'].sum()/flagged['is_fraud'].count()*100:.1f}% precision)")
Precision-Recall Evaluation
In fraud detection, precision (what fraction of flagged transactions are actually fraud) and recall (what fraction of all fraud is caught) are more meaningful than accuracy because of extreme class imbalance.
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve, average_precision_score
# Use the anomaly score as a continuous predictor
# Negate because lower scores = more anomalous
y_true = df['is_fraud']
y_scores = -df['anomaly_score']
precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
avg_precision = average_precision_score(y_true, y_scores)
plt.figure(figsize=(10, 6))
plt.plot(recall, precision, color='darkorange', lw=2,
label=f'Isolation Forest (AP = {avg_precision:.3f})')
plt.axhline(y=y_true.mean(), color='gray', linestyle='--', label='Baseline (random)')
plt.xlabel('Recall (Fraction of Fraud Caught)')
plt.ylabel('Precision (Fraction of Flags That Are Fraud)')
plt.title('Precision-Recall Curve - Fraud Detection')
plt.legend(loc='upper right')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('precision_recall_fraud.png', dpi=150)
plt.show()
# Find threshold for 80% recall
idx_80 = np.argmin(np.abs(recall - 0.80))
print(f"At 80% recall: precision = {precision[idx_80]:.2%}")
print(f"At 80% recall: threshold = {thresholds[min(idx_80, len(thresholds)-1)]:.3f}")
Excel Component
Note on Excel Usage
Fraud detection is primarily a SQL + Python project due to the volume of transactions and the need for statistical modeling. However, Excel can supplement the analysis in the following ways:
- Summary Reporting: Export the top 100 flagged transactions to Excel for manual review by the fraud team. Include columns for transaction details, risk score, and a "Reviewed" checkbox column for the analyst to mark (a pandas export sketch follows this list).
- Merchant Risk Table: The high-risk merchant query results can be formatted in Excel with conditional formatting to highlight merchants exceeding a 5% fraud rate threshold.
- Weekly Fraud Dashboard: Create a pivot table summarizing fraud counts and amounts by week, transaction type, and merchant category for management reporting.
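A minimal sketch of that review export, assuming the flagged DataFrame from the Isolation Forest step is still in scope; the file and sheet names are illustrative, and .xlsx output requires openpyxl:
# Export the top 100 highest-risk flagged transactions for manual review
review_cols = ['txn_id', 'account_id', 'txn_datetime', 'amount',
               'location_city', 'location_country', 'risk_score_model']
review = (flagged.sort_values('risk_score_model', ascending=False)
                 .head(100)[review_cols]
                 .assign(Reviewed=''))  # empty column for analysts to check off
review.to_excel('fraud_review_queue.xlsx', sheet_name='Flagged Txns', index=False)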
Key Insights
Unsupervised Advantage
Isolation Forest does not require labeled fraud data for training, making it effective against new fraud patterns that have never been seen before. It detects anomalies based purely on how different a transaction is from normal behavior.
Feature Engineering is Critical
Raw transaction amount alone catches only obvious fraud. Engineered features like velocity (transactions per hour), z-score deviation, and impossible travel detection dramatically improve detection rates.
Precision vs Recall Tradeoff
Catching 95% of fraud (high recall) may require flagging 10x more legitimate transactions (low precision). The business must decide: is it worse to miss fraud or to block legitimate customers? Most banks target 80% recall with acceptable precision.
Layered Defense
The best fraud systems combine SQL rule-based checks (velocity, amount limits) with ML anomaly detection. Rules catch known patterns instantly; ML catches subtle, evolving patterns that rules would miss.
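A minimal sketch of how the two layers could be combined into a single triage decision, using the features built earlier; the thresholds and action labels are illustrative assumptions, not the bank's policy:
# Layered decision: hard rules fire first, then the model score.
# Thresholds are illustrative; a real system would tune them on labeled data.
def triage(txn):
    # Rule layer: velocity check (mirrors the SQL velocity query)
    if txn['txn_count_1hr'] >= 5:
        return 'BLOCK'
    # Model layer: Isolation Forest risk score
    if txn['risk_score_model'] >= 75:
        return 'BLOCK'
    if txn['risk_score_model'] >= 60 or txn['is_foreign'] == 1:
        return 'STEP-UP AUTH'  # e.g., one-time passcode challenge
    return 'APPROVE'

df['decision'] = df.apply(triage, axis=1)
print(df['decision'].value_counts())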
Quiz
How does Isolation Forest detect anomalies?
What does the "impossible travel" velocity check detect?
Why is precision-recall preferred over accuracy for evaluating fraud models?
What does the contamination parameter in Isolation Forest control?
Why is a z-score useful as a feature for fraud detection?