1353  Machine Learning for Anomaly Detection

1353.1 Learning Objectives

By the end of this chapter, you will be able to:

  • Apply Isolation Forest: Detect anomalies in high-dimensional data without labeled examples
  • Build Autoencoders: Use neural networks to learn normal patterns and flag deviations
  • Implement LSTM Networks: Detect temporal sequence anomalies in streaming data
  • Choose Between ML Methods: Select appropriate algorithms based on data characteristics and deployment constraints

Tip: Minimum Viable Understanding: ML Anomaly Detection

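Core Concept: Machine learning methods learn what "normal" looks like from historical data and flag points that don't fit the learned patterns. No hand-written rules are required, although most methods still need a cutoff on the anomaly score (for example, Isolation Forest's contamination setting).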

Why It Matters: ML methods detect complex, multi-dimensional anomalies that statistical methods miss. A motor with normal temperature, normal vibration, and normal current individually might still be anomalous if the combination of these values is unusual.

Key Takeaway: Use Isolation Forest when you have multi-dimensional data and no labeled anomalies. Use autoencoders for high-dimensional sensor data. Use LSTM for sequential patterns over time.
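
The motor example above (every reading normal on its own, the combination unusual) can be made concrete with a small numerical sketch. It is an illustration only: the data is synthetic, the correlation between temperature and current is assumed, and the Mahalanobis distance is used here as a simple stand-in for "a method that looks at features jointly", not as one of this chapter's ML methods.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical normal operation: temperature and current both rise with load.
# The correlation strength is an assumption made for this illustration.
n = 2000
load = rng.normal(0.0, 1.0, n)
temperature = 75 + 2.0 * load + rng.normal(0, 0.5, n)    # deg C
current = 8.5 + 0.3 * load + rng.normal(0, 0.075, n)     # A
normal = np.column_stack([temperature, current])

mean = normal.mean(axis=0)
std = normal.std(axis=0)
cov_inv = np.linalg.inv(np.cov(normal, rowvar=False))

def per_feature_z(x):
    """Absolute z-score of each feature, checked independently."""
    return np.abs((x - mean) / std)

def mahalanobis(x):
    """Distance from the mean that accounts for correlation between features."""
    d = x - mean
    return float(np.sqrt(d @ cov_inv @ d))

typical = np.array([78.0, 8.95])   # hot and drawing more current: consistent
suspect = np.array([78.0, 8.05])   # hot but drawing less current: inconsistent

for name, x in [("typical", typical), ("suspect", suspect)]:
    print(name, per_feature_z(x).round(2), round(mahalanobis(x), 2))
```

Both points have per-feature z-scores well under a typical 3-sigma limit, yet only the suspect point lies far from the joint distribution. This is the kind of case a per-sensor threshold cannot catch.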

1353.2 Prerequisites

Before diving into this chapter, you should be familiar with:

  • Statistical Methods: Understanding when simpler methods fail and ML is needed
  • Anomaly Types: Understanding collective anomalies that require pattern-based detection

~20 min | Advanced | P10.C01.U04

1353.3 Introduction

When statistical methods reach their limits - complex patterns, high-dimensional data, or no known distribution - machine learning becomes essential.

1353.4 Unsupervised Learning

Key Advantage: No labeled anomaly data required. The model learns "normal" from unlabeled data and flags deviations.

1353.4.1 Isolation Forest

Core Concept: Anomalies are "easier to isolate" than normal points. Build random decision trees that split data - anomalies require fewer splits to isolate.

Why It Works:

  • Normal points are clustered, so many splits are needed to isolate one point
  • Anomalies sit apart from the clusters, so few splits are needed

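Algorithm:

  1. Randomly select a feature and a split value between that feature's minimum and maximum
  2. Partition data recursively
  3. Anomalies end up in shorter paths (fewer splits)
  4. Anomaly score is derived from the average path length across trees (shorter average path = more anomalous)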
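
The steps above can be sketched directly in NumPy to show why path length separates outliers from inliers. This is a simplified illustration, not scikit-learn's implementation: it re-splits a fresh random subsample for every query point instead of building reusable trees, and it reports the raw mean depth rather than the normalized score. The function names `isolation_depth` and `mean_depth` are made up for this sketch.

```python
import numpy as np

def isolation_depth(point, sample, rng, max_depth=50):
    """Count random axis-aligned splits until `point` is alone in its partition."""
    subset = np.vstack([sample, point])
    depth = 0
    while len(subset) > 1 and depth < max_depth:
        spans = subset.max(axis=0) - subset.min(axis=0)
        candidates = np.flatnonzero(spans > 0)
        if candidates.size == 0:
            break                      # remaining points are identical
        feature = rng.choice(candidates)
        lo, hi = subset[:, feature].min(), subset[:, feature].max()
        split = rng.uniform(lo, hi)
        # Keep only the side of the split that contains `point`.
        if point[feature] < split:
            subset = subset[subset[:, feature] < split]
        else:
            subset = subset[subset[:, feature] >= split]
        depth += 1
    return depth

def mean_depth(point, data, n_trees=100, sample_size=256, seed=0):
    """Average isolation depth of `point` over several random subsamples."""
    rng = np.random.default_rng(seed)
    depths = []
    for _ in range(n_trees):
        idx = rng.choice(len(data), size=min(sample_size, len(data)), replace=False)
        depths.append(isolation_depth(point, data[idx], rng))
    return float(np.mean(depths))

rng = np.random.default_rng(1)
data = rng.normal(0, 1, (1000, 2))      # synthetic "normal" cluster
inlier = np.array([0.0, 0.0])
outlier = np.array([6.0, 6.0])

print("inlier mean depth: ", mean_depth(inlier, data))
print("outlier mean depth:", mean_depth(outlier, data))
```

The outlier is typically cut off after a handful of splits, while a point in the middle of the cluster needs many more.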

Implementation:

from sklearn.ensemble import IsolationForest
import numpy as np

class IsolationForestDetector:
    def __init__(self, contamination=0.01, n_estimators=100):
        """
        contamination: expected proportion of anomalies (1% default)
        n_estimators: number of trees
        """
        self.model = IsolationForest(
            contamination=contamination,
            n_estimators=n_estimators,
            random_state=42
        )
        self.is_trained = False

    def train(self, normal_data):
        """
        Train on normal operating data
        normal_data: shape (n_samples, n_features)
        """
        self.model.fit(normal_data)
        self.is_trained = True

    def predict(self, data):
        """
        Predict anomalies in new data
        Returns: array of -1 (anomaly) or 1 (normal)
        """
        if not self.is_trained:
            raise ValueError("Model must be trained first")
        return self.model.predict(data)

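    def score(self, data):
        """
        Get anomaly scores relative to the learned threshold
        (negative = anomaly, positive = normal; more negative = more anomalous).
        Uses decision_function; score_samples would return only negative values.
        """
        if not self.is_trained:
            raise ValueError("Model must be trained first")
        return self.model.decision_function(data)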

# Example: Multi-sensor motor monitoring
# Features: [temperature, vibration, current, RPM]

# Normal operating data (1000 samples)
normal_data = np.array([
    [75 + np.random.normal(0, 2),      # Temperature ~75C
     1.2 + np.random.normal(0, 0.1),   # Vibration ~1.2 mm/s
     8.5 + np.random.normal(0, 0.3),   # Current ~8.5 A
     1450 + np.random.normal(0, 10)]   # RPM ~1450
    for _ in range(1000)
])

detector = IsolationForestDetector(contamination=0.01)
detector.train(normal_data)

# Test data including anomalies
test_data = np.array([
    [76, 1.3, 8.4, 1455],   # Normal
    [74, 1.2, 8.6, 1448],   # Normal
    [92, 3.8, 12.1, 1480],  # Anomaly: overheating + high vibration
    [75, 1.1, 8.5, 1200],   # Anomaly: RPM drop
])

predictions = detector.predict(test_data)
scores = detector.score(test_data)

for i, (pred, score) in enumerate(zip(predictions, scores)):
    status = "ANOMALY" if pred == -1 else "Normal"
    print(f"Sample {i}: {status} (score: {score:.3f})")

# Example output (illustrative; exact scores vary with the random training data):
# Sample 0: Normal (score: 0.215)
# Sample 1: Normal (score: 0.198)
# Sample 2: ANOMALY (score: -0.156)  <- Detected
# Sample 3: ANOMALY (score: -0.089)  <- Detected

Advantages:

  • No labeled anomalies needed for training
  • Handles high-dimensional data well
  • Computationally efficient (can run on edge gateways)

Limitations:

  • The "contamination" parameter must be set correctly
  • Struggles with evolving "normal" (concept drift)

1353.4.2 One-Class SVM

Core Concept: Learn a boundary that encloses normal data points in high-dimensional space. Points outside boundary are anomalies.

When to Use:

  • Small, well-defined "normal" region
  • You have clean training data (no anomalies in training set)
  • Moderate dimensionality (up to ~50 features)

Trade-Off: Often draws a tighter boundary than Isolation Forest on small, clean datasets, but kernel SVM training cost grows faster than linearly with the number of samples, so it is typically cloud-deployed.
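
A minimal sketch of the idea with scikit-learn's `OneClassSVM`, reusing the motor feature layout from the Isolation Forest example. The data is synthetic, and the `nu=0.01` and `gamma="scale"` settings are assumptions chosen for illustration rather than tuned values. Features are standardized first because the RBF kernel is sensitive to feature scale.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Synthetic normal data: [temperature, vibration, current, RPM]
normal_data = rng.normal(
    loc=[75, 1.2, 8.5, 1450],
    scale=[2, 0.1, 0.3, 10],
    size=(500, 4),
)

# Standardize so no single feature (e.g. RPM) dominates the kernel distance.
scaler = StandardScaler().fit(normal_data)

# nu is an upper bound on the fraction of training points treated as outliers.
model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.01)
model.fit(scaler.transform(normal_data))

test_data = np.array([
    [76, 1.3, 8.4, 1455],    # close to normal operation
    [92, 3.8, 12.1, 1480],   # overheating + high vibration
])

# predict() returns 1 for points inside the boundary and -1 for points outside.
labels = model.predict(scaler.transform(test_data))
print(labels)
```

As with Isolation Forest, `decision_function` gives a signed distance to the boundary when a graded score is more useful than a hard label.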

1353.4.3 K-Nearest Neighbors (KNN)

Core Concept: For each point, find K nearest neighbors. If average distance to neighbors is high, point is anomalous.

Algorithm:

For each test point:
  1. Find K nearest neighbors in training data
  2. Calculate average distance to K neighbors
  3. If distance > threshold, flag as anomaly
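
The three steps above translate almost line for line into NumPy. The sketch below uses brute-force distance computation, which also makes the speed and memory drawbacks visible: every test point is compared against every stored training point. The choice of k=5 and a 99th-percentile threshold are assumptions for illustration, and in practice features would be scaled first.

```python
import numpy as np

def knn_scores(train, test, k=5):
    """Average Euclidean distance from each test point to its k nearest training points."""
    # Pairwise distances, shape (n_test, n_train)
    dists = np.linalg.norm(test[:, None, :] - train[None, :, :], axis=2)
    nearest = np.sort(dists, axis=1)[:, :k]
    return nearest.mean(axis=1)

rng = np.random.default_rng(0)
train = rng.normal(0, 1, (500, 2))        # synthetic normal data

# Calibrate the threshold on the training set itself. Column 0 of the sorted
# distances is each point's distance to itself (zero), so it is skipped.
k = 5
train_dists = np.linalg.norm(train[:, None, :] - train[None, :, :], axis=2)
train_scores = np.sort(train_dists, axis=1)[:, 1:k + 1].mean(axis=1)
threshold = np.percentile(train_scores, 99)

test = np.array([
    [0.1, -0.2],   # inside the cluster
    [5.0, 5.0],    # far from all training points
])
flags = knn_scores(train, test, k=k) > threshold
print(flags)
```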

Pros/Cons:

| Aspect      | Rating | Notes                                             |
|-------------|--------|---------------------------------------------------|
| Accuracy    | High   | Very accurate for low-dimensional data            |
| Speed       | Low    | Slow for large datasets (must search all points)  |
| Memory      | High   | Must store entire training dataset                |
| Edge Deploy | No     | Too resource-intensive for constrained devices    |

Warning: Tradeoff: Statistical Methods vs ML-Based Anomaly Detection

Option A: Statistical methods (Z-score, IQR, ARIMA residuals)

Option B: Machine learning approaches (Isolation Forest, Autoencoders, LSTM)

Decision Factors: Statistical methods require minimal training data, are interpretable, run efficiently on edge devices, and work well for point anomalies with known distributions. ML methods excel at detecting complex multi-dimensional patterns, collective anomalies, and subtle deviations without explicit threshold tuning, but require representative training data and more computational resources. For edge deployment with <100KB RAM, use statistical methods. For cloud-based detection of complex industrial equipment patterns, use ML. Many production systems use statistical methods at the edge for fast response, with ML in the cloud for deeper analysis.

1353.5 Deep Learning

When anomalies have complex temporal or spatial patterns, deep learning becomes powerful.

1353.5.1 Autoencoders

Core Concept: Neural network that compresses data (encodes) then reconstructs it (decodes). Train on normal data - anomalies reconstruct poorly (high reconstruction error).

Architecture:

Input -> Encoder (compress) -> Bottleneck -> Decoder (reconstruct) -> Output
        [Normal data only]                                         |
                                                          Compare with Input
                                                          High error = Anomaly

Implementation:

import tensorflow as tf
from tensorflow import keras
import numpy as np

def build_autoencoder(input_dim, encoding_dim=8):
    """
    Build autoencoder for anomaly detection

    input_dim: number of features
    encoding_dim: compressed representation size
    """
    # Encoder
    input_layer = keras.layers.Input(shape=(input_dim,))
    encoded = keras.layers.Dense(32, activation='relu')(input_layer)
    encoded = keras.layers.Dense(16, activation='relu')(encoded)
    encoded = keras.layers.Dense(encoding_dim, activation='relu')(encoded)

    # Decoder
    decoded = keras.layers.Dense(16, activation='relu')(encoded)
    decoded = keras.layers.Dense(32, activation='relu')(decoded)
    output_layer = keras.layers.Dense(input_dim, activation='linear')(decoded)

    # Autoencoder model
    autoencoder = keras.Model(inputs=input_layer, outputs=output_layer)
    autoencoder.compile(optimizer='adam', loss='mse')

    return autoencoder

# Example: Sensor data with 10 features
input_dim = 10
model = build_autoencoder(input_dim, encoding_dim=4)

# Train on normal data (shape: samples x features)
normal_data = np.random.normal(0, 1, (10000, input_dim))
model.fit(normal_data, normal_data, epochs=50, batch_size=32,
          validation_split=0.1, verbose=0)

# Test on normal and anomalous data
test_normal = np.random.normal(0, 1, (100, input_dim))
test_anomaly = np.random.normal(5, 2, (10, input_dim))  # Different distribution

# Calculate reconstruction errors
normal_errors = np.mean(np.square(test_normal - model.predict(test_normal)), axis=1)
anomaly_errors = np.mean(np.square(test_anomaly - model.predict(test_anomaly)), axis=1)

# Set threshold at 95th percentile of normal errors
threshold = np.percentile(normal_errors, 95)

print(f"Normal reconstruction error: {np.mean(normal_errors):.4f}")
print(f"Anomaly reconstruction error: {np.mean(anomaly_errors):.4f}")
print(f"Threshold: {threshold:.4f}")
print(f"Anomalies detected: {np.sum(anomaly_errors > threshold)}/10")
# Typical output: Anomalies detected: 10/10
# (expected here, because the synthetic anomalies lie far from the training distribution)

When Autoencoders Excel:

  • High-dimensional sensor data (>20 features)
  • Complex, nonlinear relationships between sensors
  • Sufficient training data (>10,000 samples)

Deployment Considerations:

  • Training: Cloud-based (GPUs)
  • Inference: Can run on edge gateways (Raspberry Pi 4, NVIDIA Jetson)

1353.5.2 LSTM (Long Short-Term Memory) Networks

Core Concept: Recurrent neural network that learns temporal sequences. Predict next value(s) based on recent history - large prediction errors indicate anomalies.

Architecture for Anomaly Detection:

Sliding window of past N timesteps -> LSTM layers -> Predict next value
                                                   |
                                      Compare prediction vs actual
                                      High error = Anomaly

Ideal for:

  • Vibration patterns (bearing wear develops over time)
  • Network traffic (DDoS attacks show temporal patterns)
  • Power consumption (usage patterns span hours/days)
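
The data flow in the diagram above (windowing, forecasting, error thresholding) can be sketched without a deep learning framework. In the sketch below the forecaster is deliberately NOT an LSTM: `forecast_stub` is a hypothetical stand-in that extrapolates linearly from the last two values, so that the surrounding plumbing can be run and inspected on its own. The signal, window length, and the mean + 6 standard deviations threshold are all assumptions for illustration.

```python
import numpy as np

def make_windows(series, window):
    """Return (X, y): X[i] holds `window` consecutive values, y[i] is the value that follows."""
    X = np.lib.stride_tricks.sliding_window_view(series[:-1], window)
    y = series[window:]
    return X, y

def forecast_stub(X):
    """Stand-in for a trained LSTM's predict(): linear extrapolation of the last two values."""
    return 2 * X[:, -1] - X[:, -2]

rng = np.random.default_rng(0)
t = np.arange(1000)
signal = np.sin(2 * np.pi * t / 50) + rng.normal(0, 0.02, t.size)

train = signal[:500].copy()    # assumed anomaly-free history
test = signal[500:].copy()
spike_index = 300
test[spike_index] += 1.0       # injected anomaly

window = 20

# Calibrate the error threshold on normal data only.
X_train, y_train = make_windows(train, window)
calib_err = np.abs(y_train - forecast_stub(X_train))
threshold = calib_err.mean() + 6 * calib_err.std()

# Score new data: a large prediction error marks an anomaly.
X_test, y_test = make_windows(test, window)
test_err = np.abs(y_test - forecast_stub(X_test))
flagged = np.flatnonzero(test_err > threshold) + window   # map back to series indices

print("flagged indices:", flagged)
```

With a real model, `X` would typically be reshaped to (samples, window, 1) and `forecast_stub` replaced by the trained network's prediction call; the thresholding step stays the same. Note that a single spike also disturbs the next few forecasts, so neighbouring indices are usually flagged as well.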

1353.6 ML Method Comparison

| Method           | Training Data Needed | Computational Cost | Edge Deployable | Best For                         |
|------------------|----------------------|--------------------|-----------------|----------------------------------|
| Isolation Forest | 1,000-10,000 samples | Low                | Yes             | Multi-sensor, moderate features  |
| One-Class SVM    | 500-5,000 samples    | Medium             | Gateway only    | Well-defined normal region       |
| KNN              | 1,000+ samples       | High (inference)   | No              | Low dimensions, small datasets   |
| Autoencoder      | 10,000+ samples      | High (training)    | Gateway only    | High dimensions, nonlinear       |
| LSTM             | 50,000+ sequences    | Very High          | No (cloud)      | Temporal patterns, sequences     |

Warning: Tradeoff: Batch Model Training vs Online Learning

Option A (Batch Training): Train models offline on historical data (weeks to months of sensor readings), then deploy frozen models that remain static until manually retrained. Training takes hours on cloud GPUs, but inference is fast (10-100ms) with predictable performance.

Option B (Online Learning): Continuously update model parameters as new data arrives, adapting to concept drift in real time. Each new reading triggers an incremental model update (100-500ms), allowing detection thresholds to evolve with changing equipment behavior.

Decision Factors: Choose batch training when normal behavior is stable over the deployment lifetime (factory equipment with fixed operating conditions), when regulatory compliance requires reproducible model behavior (medical devices, financial systems), or when compute resources at the edge cannot support training workloads. Choose online learning when "normal" shifts frequently (seasonal HVAC patterns, user behavior in smart homes), when deploying to environments without historical training data (new equipment types), or when concept drift causes batch models to degrade within weeks.

Hybrid approach: A batch-trained base model with online threshold adaptation handles most industrial IoT scenarios. The model structure stays frozen, but decision boundaries adjust monthly based on confirmed anomaly feedback.
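
One way to picture the hybrid approach is a frozen model whose anomaly scores are compared against a threshold that is recomputed periodically from recent scores confirmed as normal. The sketch below is a minimal illustration under stated assumptions: the class name `AdaptiveThreshold` is hypothetical, higher scores are assumed to mean "more anomalous" (as with reconstruction error), and the 99th-percentile rule and history size are arbitrary choices.

```python
from collections import deque

import numpy as np

class AdaptiveThreshold:
    """Frozen model, moving threshold: track a high quantile of recent normal scores."""

    def __init__(self, initial_scores, quantile=0.99, max_history=5000):
        self.quantile = quantile
        self.history = deque(initial_scores, maxlen=max_history)
        self.threshold = float(np.quantile(list(self.history), quantile))

    def is_anomaly(self, score):
        """Compare a score from the frozen model against the current threshold."""
        return score > self.threshold

    def add_feedback(self, score, confirmed_anomaly):
        """Record operator feedback; only confirmed-normal scores shape the baseline."""
        if not confirmed_anomaly:
            self.history.append(score)

    def recalibrate(self):
        """Recompute the threshold, e.g. on a monthly schedule."""
        self.threshold = float(np.quantile(list(self.history), self.quantile))
        return self.threshold

rng = np.random.default_rng(0)
monitor = AdaptiveThreshold(rng.normal(1.0, 0.1, 1000))   # scores at deployment time
old_threshold = monitor.threshold

# Equipment behavior drifts: normal scores are now centered higher.
for s in rng.normal(1.3, 0.1, 5000):
    monitor.add_feedback(float(s), confirmed_anomaly=False)
new_threshold = monitor.recalibrate()

print(f"threshold before drift: {old_threshold:.3f}, after recalibration: {new_threshold:.3f}")
```

The model weights never change, which keeps behavior reproducible between recalibrations, while the threshold follows the slow shift in what counts as normal.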

1353.7 Worked Examples

Note: Worked Example: Isolation Forest for HVAC Anomaly Detection

Scenario: A commercial building management company deploys anomaly detection across 200 HVAC units to identify failing compressors before breakdown. Each unit reports temperature, humidity, power consumption, and compressor vibration every 30 seconds.

Given:

  • Training data: 6 months of normal operation (5.2 million samples from 200 units)
  • Feature vector per sample: [supply_temp, return_temp, humidity, power_kW, vibration_rms, outdoor_temp]
  • Known anomaly rate from historical maintenance records: ~0.8%
  • Deployment target: Raspberry Pi 4 gateway (4GB RAM) serving 50 units
  • Requirement: <2 second detection latency, <5% false positive rate

Steps:

  1. Data preparation:
    • Remove known maintenance periods and startup transients (first 10 minutes after power-on)
    • Normalize features: Z-score standardization using training set statistics
    • Clean dataset: 4.8 million samples after filtering
  2. Model training with Isolation Forest:
    • Parameters: n_estimators=100, contamination=0.008 (matching historical rate), max_samples=256
    • Training time: 8.2 minutes on 8-core server
    • Model size: 12.4 MB (serialized with joblib)
  3. Threshold calibration using validation set (500K samples, 1 month holdout):
    • Default threshold (contamination=0.008): 4.1% false positive rate (too high)
    • Adjusted contamination=0.005: 2.8% false positive rate, 94% recall
    • Adjusted contamination=0.003: 1.9% false positive rate, 89% recall
  4. Edge deployment optimization:
    • Quantize model features to float16: Model size 6.8 MB
    • Reduce n_estimators from 100 to 50: Model size 3.4 MB, recall drops 2 percentage points (89% to 87%)
    • Batch inference: Process 50 units in single call (50 samples x 6 features)
  5. Production performance metrics:
    • Inference time: 45ms for 50-sample batch on RPi4
    • Memory usage: 380 MB including model, buffers, and Python runtime
    • Detection latency: up to 30s waiting for the next sample + 45ms inference = ~30 seconds worst case from fault onset to alert; processing after a reading arrives takes 45ms, well inside the <2 second requirement if that requirement is read as processing latency
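
The threshold calibration step can be sketched as a sweep over the fraction of samples flagged, which is the role the contamination parameter plays. The code below uses synthetic scores (lower = more anomalous) and a hypothetical helper `sweep_flag_fractions`; it is a simplification in that the cutoff is taken on the same data it is evaluated on, whereas the worked example fits on training data and evaluates on a holdout. The numbers it prints are not the ones from the scenario.

```python
import numpy as np

def sweep_flag_fractions(scores, is_anomaly, fractions):
    """For each flag fraction, flag the lowest-scoring samples and report FPR and recall."""
    rows = []
    for frac in fractions:
        cutoff = np.quantile(scores, frac)      # lower score = more anomalous
        flagged = scores <= cutoff
        fpr = float(flagged[~is_anomaly].mean())     # share of normal samples flagged
        recall = float(flagged[is_anomaly].mean())   # share of true anomalies flagged
        rows.append((frac, fpr, recall))
    return rows

rng = np.random.default_rng(0)
normal_scores = rng.normal(0.10, 0.05, 100_000)   # synthetic validation scores
anomaly_scores = rng.normal(-0.05, 0.08, 800)
scores = np.concatenate([normal_scores, anomaly_scores])
is_anomaly = np.concatenate([np.zeros(100_000, dtype=bool), np.ones(800, dtype=bool)])

rows = sweep_flag_fractions(scores, is_anomaly, [0.003, 0.005, 0.008])
for frac, fpr, recall in rows:
    print(f"flag fraction {frac:.3f}: FPR {fpr:.2%}, recall {recall:.1%}")
```

Raising the flag fraction can only add alerts, so false positive rate and recall rise together; the calibration step is a matter of picking the point on that curve that the maintenance budget can absorb.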

Result: Isolation Forest with 50 trees achieves 87% recall and 2.3% false positive rate. Over 3 months of production operation, the system detected 23 of 26 actual failures (88.5% real-world recall) with 847 false alarms across 200 units (1.4 false alarms per unit per month). Estimated savings: $156,000 in prevented emergency repairs vs. $12,000 investigation cost for false alarms.
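
The headline figures in the Result can be reproduced from the raw counts. The short check below uses only numbers quoted in this example; the variable names are illustrative.

```python
# Counts and costs quoted in the worked example
failures_detected = 23
failures_total = 26
false_alarms = 847
units = 200
months = 3
savings_usd = 156_000
false_alarm_cost_usd = 12_000

recall = failures_detected / failures_total
alarms_per_unit_month = false_alarms / (units * months)
net_benefit_usd = savings_usd - false_alarm_cost_usd

print(f"real-world recall: {recall:.1%}")                                  # 88.5%
print(f"false alarms per unit per month: {alarms_per_unit_month:.2f}")     # 1.41
print(f"net benefit: ${net_benefit_usd:,}")                                # $144,000
```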

Key Insight: Isolation Forest's contamination parameter directly controls the precision-recall tradeoff. Start with your historical anomaly rate, then adjust based on the cost ratio of false negatives (missed failures) to false positives (unnecessary investigations). For HVAC where emergency repairs cost 5-10x preventive maintenance, accepting higher false positive rates (2-5%) to achieve 85%+ recall is economically justified.

Note: Worked Example: Autoencoder for Multi-Sensor Industrial Anomaly Detection

Scenario: A semiconductor fab deploys deep learning anomaly detection on plasma etch chambers. Each chamber has 47 sensors (gas flows, pressures, RF power, temperatures, optical emission spectra) sampled at 10 Hz during 3-minute etch processes.

Given:

  • Training data: 45,000 normal etch processes (no defective wafers) over 8 months
  • Input shape per process: 1,800 timesteps x 47 features = 84,600 values
  • Anomaly definition: Process that produces defective wafer (known from downstream metrology)
  • Historical defect rate: 0.3% of processes
  • Deployment: NVIDIA Jetson AGX Xavier (32GB RAM, 512 CUDA cores)
  • Requirement: Process-level decision within 10 seconds of etch completion

Steps:

  1. Feature engineering for temporal data:
    • Segment each process into 6 phases (30s each): gas stabilization, plasma ignition, main etch (3 phases), purge
    • Extract per-phase statistics: mean, std, min, max, slope for each sensor (47 x 5 x 6 = 1,410 features)
    • Dimensionality: 84,600 raw values compressed to 1,410 phase-aggregated features
  2. Autoencoder architecture design:
    • Encoder: 1410 -> 512 -> 256 -> 64 (bottleneck)
    • Decoder: 64 -> 256 -> 512 -> 1410
    • Activation: ReLU (hidden), linear (output)
    • Total parameters: ~1.74 million (about 7.0 MB in float32)
  3. Training configuration:
    • Loss function: MSE reconstruction loss
    • Optimizer: Adam, learning rate 1e-4 with cosine decay
    • Batch size: 128 processes
    • Training epochs: 150 (early stopping at epoch 112)
    • Training time: 2.4 hours on 4x V100 GPUs
    • Validation reconstruction error (normal): mean 0.023, std 0.008
  4. Threshold determination:
    • Compute reconstruction error on 5,000 validation processes (all normal)
    • Set threshold at 99.5th percentile: 0.047 (3 sigma above mean)
    • Test on 200 known-defective processes: 178 exceed threshold (89% recall)
    • Test on 5,000 normal processes: 24 exceed threshold (0.48% false positive rate)
  5. Deployment optimization for Jetson:
    • Convert to TensorRT: 2.1x inference speedup
    • Mixed precision (FP16): Model size 3.6 MB, inference time 82ms
    • Batch inference: 10 processes in parallel, 340ms total
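
The phase-based feature engineering step can be sketched in a few lines of NumPy. The function name `phase_features` and the equal-length split are assumptions; the split matches the six 30-second phases described here, but a real recipe might use phase boundaries from the tool's own step markers. The input is random data used only to check shapes.

```python
import numpy as np

def phase_features(process, n_phases=6):
    """Turn a (timesteps, sensors) array into per-phase mean, std, min, max, slope."""
    feats = []
    for phase in np.array_split(process, n_phases, axis=0):
        t = np.arange(len(phase))
        # polyfit accepts a 2-D y and fits every sensor column at once;
        # row 0 of the result holds the linear slopes (per timestep).
        slope = np.polyfit(t, phase, 1)[0]
        feats.extend([
            phase.mean(axis=0),
            phase.std(axis=0),
            phase.min(axis=0),
            phase.max(axis=0),
            slope,
        ])
    return np.concatenate(feats)

rng = np.random.default_rng(0)
process = rng.normal(size=(1800, 47))      # 3 minutes at 10 Hz, 47 sensors
process[:, 0] = 0.5 * np.arange(1800)      # sensor 0: steady ramp, for a sanity check

features = phase_features(process)
print(features.shape)
```

The output is ordered phase by phase, with the five statistics for all 47 sensors stored consecutively within each phase. Any scaler fitted on these features must be applied with the same ordering at inference time.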

Result: TensorRT-optimized autoencoder achieves 89% recall on defective processes with 0.48% false positive rate. End-to-end latency is 3.2 seconds (feature extraction) + 0.34s (inference) = 3.5 seconds per process. Over 6-month deployment, the system flagged 156 processes; 141 were true defects (90.4% precision), preventing $2.1M in downstream processing of defective wafers. 15 false alarms cost ~$45K in additional metrology.

Key Insight: For high-dimensional time-series anomaly detection, phase-based feature aggregation dramatically reduces autoencoder input size while preserving discriminative information. A 60x dimensionality reduction (84,600 to 1,410) enables faster training, smaller models, and more robust generalization. The key is domain knowledge - knowing that plasma etch has distinct phases allows meaningful aggregation rather than arbitrary windowing.
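
The sizes in this example follow from the layer widths and sampling figures, and can be checked with plain arithmetic. The sketch assumes fully connected layers with one bias per output unit, which is the default for dense layers.

```python
# Layer widths from the encoder/decoder description
layer_sizes = [1410, 512, 256, 64, 256, 512, 1410]

# Each dense layer has (inputs x outputs) weights plus one bias per output.
params = sum(n_in * n_out + n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))
size_mb_fp32 = params * 4 / 1e6

raw_values = 1800 * 47          # timesteps x sensors
phase_feats = 47 * 5 * 6        # sensors x statistics x phases

print(params)                       # 1741762
print(f"{size_mb_fp32:.2f} MB")     # 6.97 MB
print(raw_values, phase_feats, raw_values / phase_feats)   # 84600 1410 60.0
```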

1353.8 Summary

Machine learning methods excel at detecting complex anomalies that statistical methods miss:

  • Isolation Forest: Best for multi-dimensional data, no labels required, edge-deployable
  • Autoencoders: Excel at high-dimensional sensor data with complex correlations
  • LSTM: Captures temporal sequence patterns for time-series anomalies

Key Takeaway: Start with statistical methods for point anomalies. Graduate to ML for collective anomalies, high dimensions, or when statistical methods have too many false positives.

1353.9 What's Next

Continue learning about production anomaly detection: