Project 13

Fraud Detection Analysis

A major bank is facing a rise in fraudulent credit card transactions and needs a system to detect suspicious activity in real time. You will use SQL to identify anomalous patterns, then build a Python-based anomaly detection model using Isolation Forest to flag potentially fraudulent transactions before they cause financial losses.

Project Overview

Business Scenario

Pacific Coast Bank processes 2 million credit card transactions daily. Their existing rule-based fraud detection system catches only 60% of fraudulent transactions, while generating too many false alerts that frustrate legitimate customers. The fraud team wants a machine learning approach that can detect subtle patterns humans miss, reduce false positives, and adapt to evolving fraud tactics. Fraud currently costs the bank $4.2 million annually in chargebacks and investigation expenses.

  • Goal: Build an anomaly detection system that identifies fraudulent transactions with high precision
  • Tools: SQL (pattern detection and rule-based flags), Python (Isolation Forest model, feature engineering)
  • Stakeholder: Fraud Prevention Team and Risk Management
  • Deliverables: Anomaly detection model, transaction risk scores, high-risk merchant list, velocity check alerts

Dataset

Database Schema

The fraud detection database captures transaction details, account information, merchant data, and confirmed fraud labels for model training.

transactions

| Column | Type | Description |
| --- | --- | --- |
| txn_id | INT (PK) | Unique transaction identifier |
| account_id | INT (FK) | Customer account |
| merchant_id | INT (FK) | Merchant where transaction occurred |
| txn_datetime | DATETIME | Transaction timestamp |
| amount | DECIMAL(10,2) | Transaction amount |
| txn_type | VARCHAR(20) | Purchase, ATM Withdrawal, Online, Transfer |
| location_city | VARCHAR(50) | City of transaction |
| location_country | VARCHAR(3) | Country code (ISO 3166) |

accounts

| Column | Type | Description |
| --- | --- | --- |
| account_id | INT (PK) | Account identifier |
| customer_name | VARCHAR(100) | Account holder name |
| account_type | VARCHAR(20) | Credit, Debit, Prepaid |
| credit_limit | DECIMAL(10,2) | Credit limit (NULL for debit) |
| home_country | VARCHAR(3) | Account holder's home country |
| account_open_date | DATE | Date account was opened |

merchants

| Column | Type | Description |
| --- | --- | --- |
| merchant_id | INT (PK) | Merchant identifier |
| merchant_name | VARCHAR(100) | Business name |
| mcc_code | VARCHAR(4) | Merchant Category Code |
| category | VARCHAR(50) | Business category (Retail, Restaurant, Gas Station, etc.) |
| risk_score | INT | Internal risk score (1-10) |

fraud_labels

| Column | Type | Description |
| --- | --- | --- |
| txn_id | INT (PK/FK) | Transaction identifier |
| is_fraud | TINYINT | 1 = confirmed fraud, 0 = legitimate |
| fraud_type | VARCHAR(30) | Stolen card, Account takeover, Card-not-present, etc. |
| reported_date | DATE | Date fraud was reported |

SQL Analysis

Query 1: Transaction Amount Anomalies

Identify transactions that are significantly larger than a customer's typical spending pattern. Transactions more than 3 standard deviations above the account's 90-day mean are flagged; accounts need at least 10 recent transactions so the statistics are stable.

WITH account_stats AS (
    SELECT
        account_id,
        AVG(amount) AS avg_amount,
        STDDEV(amount) AS std_amount,
        COUNT(*) AS txn_count
    FROM transactions
    WHERE txn_datetime >= DATE_SUB(CURDATE(), INTERVAL 90 DAY)
    GROUP BY account_id
    HAVING COUNT(*) >= 10
)
SELECT
    t.txn_id,
    t.account_id,
    t.amount,
    ROUND(a.avg_amount, 2) AS account_avg,
    ROUND((t.amount - a.avg_amount) / a.std_amount, 2) AS z_score,
    t.txn_datetime,
    t.location_country,
    COALESCE(f.is_fraud, 0) AS confirmed_fraud
FROM transactions t
JOIN account_stats a ON t.account_id = a.account_id
LEFT JOIN fraud_labels f ON t.txn_id = f.txn_id
WHERE a.std_amount > 0  -- guard against division by zero for flat spenders
  AND (t.amount - a.avg_amount) / a.std_amount > 3
ORDER BY z_score DESC
LIMIT 100;

Query 2: High-Risk Merchants

Find merchants with unusually high fraud rates. These merchants may have compromised payment terminals or be complicit in fraud schemes.

SELECT
    m.merchant_id,
    m.merchant_name,
    m.category,
    COUNT(t.txn_id) AS total_transactions,
    COALESCE(SUM(f.is_fraud), 0) AS fraud_count,
    ROUND(COALESCE(SUM(f.is_fraud), 0) * 100.0 / COUNT(t.txn_id), 2) AS fraud_rate_pct,
    ROUND(SUM(CASE WHEN f.is_fraud = 1 THEN t.amount ELSE 0 END), 2) AS fraud_loss_total
FROM merchants m
JOIN transactions t ON m.merchant_id = t.merchant_id
LEFT JOIN fraud_labels f ON t.txn_id = f.txn_id
GROUP BY m.merchant_id, m.merchant_name, m.category
HAVING COUNT(t.txn_id) >= 50
ORDER BY fraud_rate_pct DESC
LIMIT 20;

Query 3: Velocity Checks (Rapid Successive Transactions)

Detect accounts with multiple transactions within a short time window, a classic indicator of stolen card usage where fraudsters rapidly make purchases before the card is blocked.

WITH txn_gaps AS (
    SELECT
        t.txn_id,
        t.account_id,
        t.amount,
        t.txn_datetime,
        t.location_city,
        LAG(t.txn_datetime) OVER (
            PARTITION BY t.account_id ORDER BY t.txn_datetime
        ) AS prev_txn_time,
        LAG(t.location_city) OVER (
            PARTITION BY t.account_id ORDER BY t.txn_datetime
        ) AS prev_city,
        TIMESTAMPDIFF(MINUTE, LAG(t.txn_datetime) OVER (
            PARTITION BY t.account_id ORDER BY t.txn_datetime
        ), t.txn_datetime) AS minutes_since_last
    FROM transactions t
    WHERE t.txn_datetime >= DATE_SUB(CURDATE(), INTERVAL 30 DAY)
)
SELECT
    txn_id,
    account_id,
    amount,
    txn_datetime,
    minutes_since_last,
    location_city,
    prev_city,
    CASE
        WHEN minutes_since_last < 2 AND location_city != prev_city
            THEN 'HIGH RISK: Impossible travel'
        WHEN minutes_since_last < 5
            THEN 'MEDIUM RISK: Rapid succession'
        ELSE 'LOW RISK'
    END AS velocity_flag
FROM txn_gaps
WHERE minutes_since_last < 10
ORDER BY minutes_since_last;

Python Analysis

Feature Engineering

Transform raw transaction data into features that capture behavioral patterns. These engineered features are far more predictive than raw amounts alone.

import pandas as pd
import numpy as np

txn = pd.read_csv('transactions.csv', parse_dates=['txn_datetime'])
fraud = pd.read_csv('fraud_labels.csv')
merchants = pd.read_csv('merchants.csv')
accounts = pd.read_csv('accounts.csv')

# Merge datasets
df = txn.merge(fraud, on='txn_id', how='left')
df = df.merge(merchants[['merchant_id', 'category', 'risk_score']], on='merchant_id')
df = df.merge(accounts[['account_id', 'home_country', 'credit_limit']], on='account_id')
df['is_fraud'] = df['is_fraud'].fillna(0).astype(int)

# Feature: Transaction amount deviation from account average
acct_stats = df.groupby('account_id')['amount'].agg(['mean', 'std']).reset_index()
acct_stats.columns = ['account_id', 'acct_avg_amount', 'acct_std_amount']
df = df.merge(acct_stats, on='account_id')
df['amount_zscore'] = (df['amount'] - df['acct_avg_amount']) / df['acct_std_amount'].replace(0, 1)

# Feature: Transaction velocity (count in last 1 hour)
df = df.sort_values(['account_id', 'txn_datetime'])
df['txn_hour'] = df['txn_datetime'].dt.hour
df['txn_count_1hr'] = (
    df.set_index('txn_datetime')
      .groupby('account_id')['amount']
      .rolling('1h')          # trailing 1-hour time window, includes current txn
      .count()
      .values                 # row order matches df (sorted by account, then time)
)

# Simpler velocity: transactions per account per day
daily_counts = df.groupby(['account_id', df['txn_datetime'].dt.date]).size().reset_index(name='daily_txn_count')
daily_counts.columns = ['account_id', 'txn_date', 'daily_txn_count']
df['txn_date'] = df['txn_datetime'].dt.date
df = df.merge(daily_counts, on=['account_id', 'txn_date'], how='left')

# Feature: Foreign transaction flag
df['is_foreign'] = (df['location_country'] != df['home_country']).astype(int)

# Feature: Credit utilization of single transaction
df['txn_pct_of_limit'] = df['amount'] / df['credit_limit'].replace(0, np.nan)

# Feature: Weekend / late-night flag
df['is_weekend'] = df['txn_datetime'].dt.dayofweek.isin([5, 6]).astype(int)
df['is_night'] = df['txn_hour'].between(0, 5).astype(int)

print(f"Dataset: {len(df)} transactions, {df['is_fraud'].sum()} fraud ({df['is_fraud'].mean()*100:.2f}%)")
print(f"Features engineered: amount_zscore, daily_txn_count, is_foreign, txn_pct_of_limit, is_weekend, is_night")

Isolation Forest Anomaly Detection

Isolation Forest works by recursively partitioning the data with random splits on randomly chosen features. Anomalies (such as fraudulent transactions) are isolated in fewer splits because they lie far from the bulk of normal points, so their average path length across the trees is short. This is ideal for fraud detection because it does not require labeled data for training.

from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

feature_cols = ['amount', 'amount_zscore', 'daily_txn_count', 'is_foreign',
                'txn_pct_of_limit', 'is_weekend', 'is_night', 'risk_score']

X = df[feature_cols].fillna(0)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Train Isolation Forest
# contamination = expected fraud rate (~0.5% of transactions)
iso_forest = IsolationForest(
    n_estimators=200,
    contamination=0.005,
    max_samples='auto',
    random_state=42
)
iso_forest.fit(X_scaled)

# Predict: -1 = anomaly (potential fraud), 1 = normal
df['anomaly_pred'] = iso_forest.predict(X_scaled)
df['anomaly_score'] = iso_forest.decision_function(X_scaled)

# Convert the decision score to a 0-100 risk score (higher = more suspicious);
# a heuristic linear rescale, not a calibrated probability
df['risk_score_model'] = ((1 - df['anomaly_score']) * 50).clip(0, 100).round(1)

flagged = df[df['anomaly_pred'] == -1]
print(f"Flagged {len(flagged)} transactions as suspicious ({len(flagged)/len(df)*100:.2f}%)")
print(f"Of those, {flagged['is_fraud'].sum()} are confirmed fraud "
      f"({flagged['is_fraud'].mean()*100:.1f}% precision)")

Precision-Recall Evaluation

In fraud detection, precision (what fraction of flagged transactions are actually fraud) and recall (what fraction of all fraud is caught) are more meaningful than accuracy because of extreme class imbalance.

import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve, average_precision_score

# Use the anomaly score as a continuous predictor
# Negate because lower scores = more anomalous
y_true = df['is_fraud']
y_scores = -df['anomaly_score']

precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
avg_precision = average_precision_score(y_true, y_scores)

plt.figure(figsize=(10, 6))
plt.plot(recall, precision, color='darkorange', lw=2,
         label=f'Isolation Forest (AP = {avg_precision:.3f})')
plt.axhline(y=y_true.mean(), color='gray', linestyle='--', label='Baseline (random)')
plt.xlabel('Recall (Fraction of Fraud Caught)')
plt.ylabel('Precision (Fraction of Flags That Are Fraud)')
plt.title('Precision-Recall Curve - Fraud Detection')
plt.legend(loc='upper right')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('precision_recall_fraud.png', dpi=150)
plt.show()

# Find threshold for 80% recall
idx_80 = np.argmin(np.abs(recall - 0.80))
print(f"At 80% recall: precision = {precision[idx_80]:.2%}")
print(f"At 80% recall: threshold = {thresholds[min(idx_80, len(thresholds)-1)]:.3f}")

Excel Component

Note on Excel Usage

Fraud detection is primarily a SQL + Python project due to the volume of transactions and the need for statistical modeling. However, Excel can supplement the analysis in the following ways:

  • Summary Reporting: Export the top 100 flagged transactions to Excel for manual review by the fraud team. Include columns for transaction details, risk score, and a "Reviewed" checkbox column for the analyst to mark.
  • Merchant Risk Table: The high-risk merchant query results can be formatted in Excel with conditional formatting to highlight merchants exceeding a 5% fraud rate threshold.
  • Weekly Fraud Dashboard: Create a pivot table summarizing fraud counts and amounts by week, transaction type, and merchant category for management reporting.
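The review-queue export in the first bullet can be sketched in pandas. This is a hedged sketch: the `scored` frame and the output file name are assumptions standing in for the scored `df` from the modeling step.

```python
import pandas as pd
import numpy as np

# Stand-in for the scored transactions frame from the modeling step (assumed data)
rng = np.random.default_rng(0)
scored = pd.DataFrame({
    'txn_id': np.arange(1, 501),
    'amount': rng.uniform(5, 2000, 500).round(2),
    'risk_score_model': rng.uniform(0, 100, 500).round(1),
})

# Top 100 most suspicious transactions, plus a blank "Reviewed" column for analysts
review = (
    scored.sort_values('risk_score_model', ascending=False)
          .head(100)
          .assign(Reviewed='')
)

# CSV opens directly in Excel; with openpyxl installed,
# review.to_excel('fraud_review_queue.xlsx', index=False) writes a true workbook
review.to_csv('fraud_review_queue.csv', index=False)
```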

Key Insights

Unsupervised Advantage

Isolation Forest does not require labeled fraud data for training, making it effective against new fraud patterns that have never been seen before. It detects anomalies based purely on how different a transaction is from normal behavior.
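A minimal illustration of that property, on assumed toy data: the model is fit with no labels at all, yet still isolates the one extreme transaction.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# 200 "normal" amounts around $50, plus one extreme transaction; no labels supplied
rng = np.random.default_rng(42)
amounts = np.append(rng.normal(50, 10, 200), 5000.0).reshape(-1, 1)

model = IsolationForest(contamination=0.01, random_state=42).fit(amounts)
preds = model.predict(amounts)  # -1 = anomaly, 1 = normal

print(preds[-1])  # the $5,000 outlier is flagged as -1
```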

Feature Engineering is Critical

Raw transaction amount alone catches only obvious fraud. Engineered features like velocity (transactions per hour), z-score deviation, and impossible travel detection dramatically improve detection rates.

Precision vs Recall Tradeoff

Catching 95% of fraud (high recall) may require flagging 10x more legitimate transactions (low precision). The business must decide: is it worse to miss fraud or to block legitimate customers? Most banks target 80% recall with acceptable precision.
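The tradeoff can be made concrete with back-of-envelope arithmetic. All numbers below are assumptions for illustration: the scenario's 2M daily transactions, a ~0.5% fraud rate, and precision figures chosen to show the shape of the curve.

```python
# Assumed volumes: 2M transactions/day, ~0.5% fraudulent
daily_txns = 2_000_000
fraud = int(daily_txns * 0.005)      # 10,000 fraudulent transactions
legit = daily_txns - fraud

# Operating point A: 80% recall at (assumed) 20% precision
caught_a = int(fraud * 0.80)         # 8,000 fraud caught
flagged_a = int(caught_a / 0.20)     # 40,000 total flags
false_pos_a = flagged_a - caught_a   # 32,000 legitimate customers flagged

# Operating point B: 95% recall at (assumed) 5% precision
caught_b = int(fraud * 0.95)         # 9,500 fraud caught
flagged_b = int(caught_b / 0.05)     # 190,000 total flags
false_pos_b = flagged_b - caught_b   # 180,500 legitimate customers flagged

print(false_pos_b / false_pos_a)     # ~5.6x more false positives for +15 pts recall
```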

Layered Defense

The best fraud systems combine SQL rule-based checks (velocity, amount limits) with ML anomaly detection. Rules catch known patterns instantly; ML catches subtle, evolving patterns that rules would miss.
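One way to sketch that layering, with hypothetical flags and thresholds (the column names echo the SQL checks and model score above, but the routing logic is an assumption, not the bank's actual policy):

```python
import pandas as pd

# Toy scored transactions: rule flags from the SQL checks plus the model's 0-100 score
txns = pd.DataFrame({
    'txn_id': [1, 2, 3, 4],
    'velocity_flag': [True, False, False, False],   # SQL velocity check
    'amount_flag': [True, False, False, True],      # SQL z-score check
    'risk_score_model': [60.0, 12.0, 88.0, 30.0],   # Isolation Forest risk score
})

def route(row, ml_threshold=80.0):
    """Rules fire instantly; the model catches what the rules miss."""
    if row['velocity_flag'] and row['amount_flag']:
        return 'BLOCK'
    if row['velocity_flag'] or row['amount_flag'] or row['risk_score_model'] >= ml_threshold:
        return 'REVIEW'
    return 'APPROVE'

txns['decision'] = txns.apply(route, axis=1)
print(txns[['txn_id', 'decision']])
# txn 1 -> BLOCK (both rules), 2 -> APPROVE, 3 -> REVIEW (model only), 4 -> REVIEW (rule)
```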

Quiz

Question 1

How does Isolation Forest detect anomalies?

Question 2

What does the "impossible travel" velocity check detect?

Question 3

Why is precision-recall preferred over accuracy for evaluating fraud models?

Question 4

What does the contamination parameter in Isolation Forest control?

Question 5

Why is a z-score useful as a feature for fraud detection?
