1353 Machine Learning for Anomaly Detection

1353.1 Learning Objectives

By the end of this chapter, you will be able to:

Apply Isolation Forest: Detect anomalies in high-dimensional data without labeled examples
Build Autoencoders: Use neural networks to learn normal patterns and flag deviations
Implement LSTM Networks: Detect temporal sequence anomalies in streaming data
Choose Between ML Methods: Select appropriate algorithms based on data characteristics and deployment constraints

Minimum Viable Understanding: ML Anomaly Detection

Core Concept: Machine learning methods learn what “normal” looks like from historical data and flag points that don’t fit the learned patterns. No hand-written rules are required, although a cut-off on the model's anomaly score still has to be chosen.

Why It Matters: ML methods detect complex, multi-dimensional anomalies that statistical methods miss. A motor with normal temperature, normal vibration, and normal current individually might still be anomalous if the combination of these values is unusual.

Key Takeaway: Use Isolation Forest when you have multi-dimensional data and no labeled anomalies. Use autoencoders for high-dimensional sensor data. Use LSTM for sequential patterns over time.
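The motor example above can be made concrete with a small sketch. It assumes synthetic, hypothetical data in which temperature and current rise and fall together, and it uses the Mahalanobis distance as a simple stand-in for "the combination is unusual" (it is not one of the ML methods covered in this chapter).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical motor data: temperature and current both follow the load
n = 5000
load = rng.normal(0.0, 1.0, n)
temperature = 75 + 2.0 * load + rng.normal(0.0, 0.5, n)    # deg C
current = 8.5 + 0.3 * load + rng.normal(0.0, 0.075, n)     # A
X = np.column_stack([temperature, current])

mean = X.mean(axis=0)
std = X.std(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))

def per_feature_z(x):
    """Absolute Z-score of each feature, checked independently."""
    return np.abs((x - mean) / std)

def mahalanobis(x):
    """Distance from the mean that accounts for feature correlation."""
    d = x - mean
    return float(np.sqrt(d @ cov_inv @ d))

# Warm motor drawing LOW current: each reading alone looks ordinary
suspect = np.array([78.0, 8.05])
z = per_feature_z(suspect)
dist = mahalanobis(suspect)

print("per-feature |z|:", np.round(z, 2))
print("Mahalanobis distance:", round(dist, 2))
```

Each per-feature Z-score stays well below a typical cut-off of 3, while the joint distance is large, which is the kind of case that per-sensor thresholds miss.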

1353.2 Prerequisites

Before diving into this chapter, you should be familiar with:

Statistical Methods: Understanding when simpler methods fail and ML is needed
Anomaly Types: Understanding collective anomalies that require pattern-based detection

~20 min | Advanced | P10.C01.U04

1353.3 Introduction

When statistical methods reach their limits - complex patterns, high-dimensional data, or no known distribution - machine learning becomes essential.

1353.4 Unsupervised Learning

Key Advantage: No labeled anomaly data required. The model learns “normal” from unlabeled data and flags deviations.

1353.4.1 Isolation Forest

Core Concept: Anomalies are “easier to isolate” than normal points. Build random decision trees that split data - anomalies require fewer splits to isolate.

Why It Works:
- Normal points are clustered, so many splits are needed to isolate one point
- Anomalies sit apart from the clusters, so few splits are needed

Algorithm:
1. Randomly select a feature and a split value
2. Partition the data recursively
3. Anomalies end up on shorter paths (fewer splits)
4. The anomaly score is derived from the average path length across trees: the shorter the average path, the more anomalous the point
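The steps above can be sketched from scratch to show why path length separates outliers from clustered points. This is a simplified illustration, not the scikit-learn implementation: it isolates a point directly in the full dataset rather than building subsampled trees, and the helper names (`isolation_depth`, `mean_depth`) are hypothetical.

```python
import numpy as np

def isolation_depth(X, idx, rng, max_depth=64):
    """Count random axis-aligned splits until row `idx` is alone in its partition."""
    members = np.arange(len(X))            # rows still sharing a partition with idx
    depth = 0
    while len(members) > 1 and depth < max_depth:
        feature = rng.integers(X.shape[1])         # random feature
        values = X[members, feature]
        lo, hi = values.min(), values.max()
        depth += 1
        if lo == hi:
            continue                               # nothing to split on this feature
        split = rng.uniform(lo, hi)                # random split value
        if X[idx, feature] < split:                # keep the side that holds idx
            members = members[values < split]
        else:
            members = members[values >= split]
    return depth

rng = np.random.default_rng(0)
cluster = rng.normal(0.0, 1.0, size=(256, 2))      # dense "normal" region
outlier = np.array([[8.0, 8.0]])                   # one point far from the cluster
X = np.vstack([cluster, outlier])

inlier_idx = int(np.argmin(np.linalg.norm(cluster, axis=1)))  # point nearest the centre
outlier_idx = len(X) - 1

def mean_depth(idx, n_trees=200):
    """Average isolation depth over many independent random partitionings."""
    return float(np.mean([isolation_depth(X, idx, rng) for _ in range(n_trees)]))

inlier_depth = mean_depth(inlier_idx)
outlier_depth = mean_depth(outlier_idx)
print(f"average splits to isolate inlier:  {inlier_depth:.1f}")
print(f"average splits to isolate outlier: {outlier_depth:.1f}")
```

The outlier is usually cut off within a few splits, while the central point needs many more.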

Implementation:

from sklearn.ensemble import IsolationForest
import numpy as np

class IsolationForestDetector:
    def __init__(self, contamination=0.01, n_estimators=100):
        """
        contamination: expected proportion of anomalies (1% default)
        n_estimators: number of trees
        """
        self.model = IsolationForest(
            contamination=contamination,
            n_estimators=n_estimators,
            random_state=42
        )
        self.is_trained = False

    def train(self, normal_data):
        """
        Train on normal operating data
        normal_data: shape (n_samples, n_features)
        """
        self.model.fit(normal_data)
        self.is_trained = True

    def predict(self, data):
        """
        Predict anomalies in new data
        Returns: array of -1 (anomaly) or 1 (normal)
        """
        if not self.is_trained:
            raise ValueError("Model must be trained first")
        return self.model.predict(data)

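    def score(self, data):
        """
        Get anomaly scores (negative = anomalous, positive = normal)
        """
        if not self.is_trained:
            raise ValueError("Model must be trained first")
        # decision_function is centered on the contamination threshold, so its
        # sign agrees with predict(); score_samples alone is always negative
        return self.model.decision_function(data)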

# Example: Multi-sensor motor monitoring
# Features: [temperature, vibration, current, RPM]

# Normal operating data (1000 samples)
normal_data = np.array([
    [75 + np.random.normal(0, 2),      # Temperature ~75C
     1.2 + np.random.normal(0, 0.1),   # Vibration ~1.2 mm/s
     8.5 + np.random.normal(0, 0.3),   # Current ~8.5 A
     1450 + np.random.normal(0, 10)]   # RPM ~1450
    for _ in range(1000)
])

detector = IsolationForestDetector(contamination=0.01)
detector.train(normal_data)

# Test data including anomalies
test_data = np.array([
    [76, 1.3, 8.4, 1455],   # Normal
    [74, 1.2, 8.6, 1448],   # Normal
    [92, 3.8, 12.1, 1480],  # Anomaly: overheating + high vibration
    [75, 1.1, 8.5, 1200],   # Anomaly: RPM drop
])

predictions = detector.predict(test_data)
scores = detector.score(test_data)

for i, (pred, score) in enumerate(zip(predictions, scores)):
    status = "ANOMALY" if pred == -1 else "Normal"
    print(f"Sample {i}: {status} (score: {score:.3f})")

# Example output (exact scores vary from run to run because the training data is random):
# Sample 0: Normal (score: 0.215)
# Sample 1: Normal (score: 0.198)
# Sample 2: ANOMALY (score: -0.156)  <- Detected!
# Sample 3: ANOMALY (score: -0.089)  <- Detected!

Knowledge Check (difficulty: medium)

Question: You are deploying anomaly detection for a new type of industrial pump where you have 6 months of normal operational data but zero examples of failures (the pumps have never failed yet). Which ML approach is most suitable?

A) Supervised classification (Random Forest) trained to distinguish normal from anomalous samples
   Incorrect. Supervised classification requires labeled examples of both classes. With zero failure examples, you cannot train a classifier to recognize anomalies - there is nothing for the model to learn about what anomalies look like.
B) Isolation Forest trained only on normal operational data
   Correct! Isolation Forest is an unsupervised method that learns the structure of normal data. It identifies anomalies as points that are 'easy to isolate' - different from the learned normal patterns. No anomaly labels are required, making it ideal for this scenario.
C) Logistic regression with anomaly probability output
   Incorrect. Logistic regression is a supervised method requiring labeled training examples from both classes. Without failure examples, you cannot train it to predict anomaly probability.
D) K-means clustering to identify failure clusters
   Incorrect. K-means could identify clusters in the data, but without failure examples, you do not know which clusters represent anomalies. It also assumes anomalies form distinct clusters rather than being isolated points, which may not be true.

Advantages:
- No labeled anomalies needed for training
- Handles high-dimensional data well
- Computationally efficient (can run on edge gateways)

Limitations:
- “contamination” parameter must be set correctly
- Struggles with evolving “normal” (concept drift)
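To see why the contamination setting matters, it helps to treat it as a percentile cut-off on the training scores. The sketch below uses hypothetical, synthetic scores rather than a fitted model; scikit-learn's IsolationForest appears to derive its decision threshold in a comparable way when contamination is given as a number, but that is stated here as an assumption.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical anomaly scores for 10,000 training samples (lower = more anomalous)
train_scores = rng.normal(-0.45, 0.05, 10_000)

flagged_fraction = {}
for contamination in (0.01, 0.005, 0.001):
    # The cut-off is the score below which `contamination` of the training data falls
    threshold = np.percentile(train_scores, 100 * contamination)
    flagged_fraction[contamination] = float(np.mean(train_scores < threshold))
    print(f"contamination={contamination}: threshold={threshold:.3f}, "
          f"flagged={flagged_fraction[contamination]:.2%}")
```

Whatever the true anomaly rate is, roughly that share of training-like data gets flagged, so a value set too high produces false alarms and one set too low hides real faults.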

1353.4.2 One-Class SVM

Core Concept: Learn a boundary that encloses the normal data points in a high-dimensional space. Points outside the boundary are anomalies.

When to Use:
- Small, well-defined “normal” region
- You have clean training data (no anomalies in training set)
- Moderate dimensionality (up to ~50 features)

Trade-Off: One-Class SVM can fit a tighter boundary than Isolation Forest when the normal region is compact and the training data is clean, but kernel training scales poorly with sample count, so it is typically deployed in the cloud or on a gateway rather than on constrained devices.
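A minimal sketch of this approach using scikit-learn's `OneClassSVM` is shown below. The sensor values, `nu=0.01`, and `gamma="scale"` are illustrative assumptions, not tuned settings; features are standardized first because the RBF kernel is sensitive to scale.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(42)

# Hypothetical clean training data: [temperature, vibration, current, RPM]
normal_train = rng.normal([75, 1.2, 8.5, 1450], [2, 0.1, 0.3, 10], size=(500, 4))

scaler = StandardScaler().fit(normal_train)

# nu is an upper bound on the fraction of training points treated as outliers
ocsvm = OneClassSVM(kernel="rbf", nu=0.01, gamma="scale")
ocsvm.fit(scaler.transform(normal_train))

test = np.array([
    [76, 1.3, 8.4, 1455],    # close to normal operation
    [92, 3.8, 12.1, 1480],   # overheating with high vibration
])
labels = ocsvm.predict(scaler.transform(test))   # 1 = inside boundary, -1 = outside
print(labels)
```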

1353.4.3 K-Nearest Neighbors (KNN)

Core Concept: For each point, find its K nearest neighbors. If the average distance to those neighbors is high, the point is anomalous.

Algorithm:

For each test point:
  1. Find K nearest neighbors in training data
  2. Calculate average distance to K neighbors
  3. If distance > threshold, flag as anomaly
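The pseudocode above can be sketched with NumPy alone. The brute-force distance computation is what makes KNN slow and memory-hungry on large datasets; the data, `k=5`, and the 99th-percentile threshold taken from a held-out set of normal readings are assumptions for illustration.

```python
import numpy as np

def knn_scores(train, test, k=5):
    """Average Euclidean distance from each test point to its k nearest training points."""
    diffs = test[:, None, :] - train[None, :, :]       # shape (n_test, n_train, n_features)
    dists = np.sqrt((diffs ** 2).sum(axis=2))          # shape (n_test, n_train)
    nearest = np.sort(dists, axis=1)[:, :k]            # k smallest distances per test point
    return nearest.mean(axis=1)

rng = np.random.default_rng(7)
train = rng.normal(0.0, 1.0, (1000, 3))                # stored "normal" data
calibration = rng.normal(0.0, 1.0, (200, 3))           # held-out normal data

# Threshold: 99th percentile of scores seen on held-out normal data
threshold = np.percentile(knn_scores(train, calibration), 99)

test = np.array([
    [0.1, -0.2, 0.3],    # inside the normal cloud
    [6.0, 6.0, 6.0],     # far from every training point
])
scores = knn_scores(train, test)
flags = scores > threshold
print(np.round(scores, 2), flags)
```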

Pros/Cons:

Aspect	Rating	Notes
Accuracy	High	Very accurate for low-dimensional data
Speed	Low	Slow for large datasets (must search all points)
Memory	High	Must store entire training dataset
Edge Deploy	No	Too resource-intensive for constrained devices

Tradeoff: Statistical Methods vs ML-Based Anomaly Detection

Option A: Statistical methods (Z-score, IQR, ARIMA residuals)

Option B: Machine learning approaches (Isolation Forest, Autoencoders, LSTM)

Decision Factors: Statistical methods require minimal training data, are interpretable, run efficiently on edge devices, and work well for point anomalies with known distributions. ML methods excel at detecting complex multi-dimensional patterns, collective anomalies, and subtle deviations without explicit threshold tuning, but require representative training data and more computational resources.

For edge deployment with <100KB RAM, use statistical methods. For cloud-based detection of complex industrial equipment patterns, use ML. Many production systems use statistical methods at the edge for fast response, with ML in the cloud for deeper analysis.
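As a rough illustration of why the statistical tier fits on small devices, the sketch below keeps a running Z-score using Welford's online update, so the detector state is three numbers regardless of how many readings it has seen. The class name, threshold, warm-up length, and readings are all hypothetical.

```python
import math

class RunningZScore:
    """Streaming Z-score detector with constant memory (Welford's algorithm)."""

    def __init__(self, threshold=3.0, warmup=30):
        self.threshold = threshold
        self.warmup = warmup      # readings to observe before flagging anything
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0             # running sum of squared deviations from the mean

    def update(self, x):
        """Return True if x is anomalous relative to the readings seen so far."""
        is_anomaly = False
        if self.n >= self.warmup:
            std = math.sqrt(self.m2 / (self.n - 1))
            is_anomaly = std > 0 and abs(x - self.mean) / std > self.threshold
        if not is_anomaly:
            # Only normal readings update the baseline, so outliers do not inflate it
            self.n += 1
            delta = x - self.mean
            self.mean += delta / self.n
            self.m2 += delta * (x - self.mean)
        return is_anomaly

detector = RunningZScore()
readings = [75 + 0.5 * math.sin(i / 3) for i in range(100)] + [90.0]  # final value is a spike
flags = [detector.update(r) for r in readings]
print("anomalies at indices:", [i for i, f in enumerate(flags) if f])
```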

1353.5 Deep Learning

When anomalies have complex temporal or spatial patterns, deep learning becomes powerful.

1353.5.1 Autoencoders

Core Concept: Neural network that compresses data (encodes) then reconstructs it (decodes). Train on normal data - anomalies reconstruct poorly (high reconstruction error).

Architecture:

Input -> Encoder (compress) -> Bottleneck -> Decoder (reconstruct) -> Output
        [Normal data only]                                         |
                                                          Compare with Input
                                                          High error = Anomaly

Implementation:

import tensorflow as tf
from tensorflow import keras
import numpy as np

def build_autoencoder(input_dim, encoding_dim=8):
    """
    Build autoencoder for anomaly detection

    input_dim: number of features
    encoding_dim: compressed representation size
    """
    # Encoder
    input_layer = keras.layers.Input(shape=(input_dim,))
    encoded = keras.layers.Dense(32, activation='relu')(input_layer)
    encoded = keras.layers.Dense(16, activation='relu')(encoded)
    encoded = keras.layers.Dense(encoding_dim, activation='relu')(encoded)

    # Decoder
    decoded = keras.layers.Dense(16, activation='relu')(encoded)
    decoded = keras.layers.Dense(32, activation='relu')(decoded)
    output_layer = keras.layers.Dense(input_dim, activation='linear')(decoded)

    # Autoencoder model
    autoencoder = keras.Model(inputs=input_layer, outputs=output_layer)
    autoencoder.compile(optimizer='adam', loss='mse')

    return autoencoder

# Example: Sensor data with 10 features
input_dim = 10
model = build_autoencoder(input_dim, encoding_dim=4)

# Train on normal data (shape: samples x features)
normal_data = np.random.normal(0, 1, (10000, input_dim))
model.fit(normal_data, normal_data, epochs=50, batch_size=32,
          validation_split=0.1, verbose=0)

# Test on normal and anomalous data
test_normal = np.random.normal(0, 1, (100, input_dim))
test_anomaly = np.random.normal(5, 2, (10, input_dim))  # Different distribution

# Calculate reconstruction errors
normal_errors = np.mean(np.square(test_normal - model.predict(test_normal)), axis=1)
anomaly_errors = np.mean(np.square(test_anomaly - model.predict(test_anomaly)), axis=1)

# Set threshold at 95th percentile of normal errors
threshold = np.percentile(normal_errors, 95)

print(f"Normal reconstruction error: {np.mean(normal_errors):.4f}")
print(f"Anomaly reconstruction error: {np.mean(anomaly_errors):.4f}")
print(f"Threshold: {threshold:.4f}")
print(f"Anomalies detected: {np.sum(anomaly_errors > threshold)}/10")
# Typical result: all 10 shifted samples exceed the threshold, because their
# distribution (mean 5) is far from the training data (mean 0)

Knowledge Check (difficulty: medium)

Question: An autoencoder is trained on 50,000 samples of normal sensor data from a chemical plant (35 sensors). During deployment, the reconstruction error threshold is set at the 95th percentile of training data errors. What does it mean when a new sample has a reconstruction error above this threshold?

A) The sample is definitely a critical equipment failure requiring immediate shutdown
   Incorrect. High reconstruction error indicates the sample differs from learned patterns, but does not determine severity or root cause. It could be a sensor glitch, process change, or actual failure - further investigation is needed.
B) The sample's sensor readings deviate from the patterns the autoencoder learned as 'normal'
   Correct! The autoencoder learned to compress and reconstruct normal patterns. High reconstruction error means the new sample does not match these learned patterns - it is anomalous relative to training data. This triggers investigation but does not diagnose the cause.
C) The model has overfitted and needs retraining with more data
   Incorrect. High reconstruction error on a new sample is expected behavior for anomaly detection. Overfitting would cause high errors on ALL new data, not just specific anomalous samples.
D) The 95th percentile threshold is set too low and should be increased to 99th
   Incorrect. Threshold selection depends on the cost of false positives vs false negatives, not on whether individual samples exceed it. The threshold should be tuned based on operational requirements, not adjusted every time an anomaly is detected.

When Autoencoders Excel:
- High-dimensional sensor data (>20 features)
- Complex, nonlinear relationships between sensors
- Sufficient training data (>10,000 samples)

Deployment Considerations:
- Training: Cloud-based (GPUs)
- Inference: Can run on edge gateways (Raspberry Pi 4, NVIDIA Jetson)

1353.5.2 LSTM (Long Short-Term Memory) Networks

Core Concept: Recurrent neural network that learns temporal sequences. Predict next value(s) based on recent history - large prediction errors indicate anomalies.

Architecture for Anomaly Detection:

Sliding window of past N timesteps -> LSTM layers -> Predict next value
                                                   |
                                      Compare prediction vs actual
                                      High error = Anomaly
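The windowing and error-thresholding parts of this pipeline can be sketched without a deep learning framework. In the example below a linear least-squares predictor is a deliberately simple stand-in for the LSTM (an assumption made to keep the sketch small and runnable); the signal, window length, and mean-plus-4-sigma threshold are likewise hypothetical.

```python
import numpy as np

def make_windows(series, window):
    """Return (X, y): X[i] holds `window` past values and y[i] is the value that follows."""
    X = np.stack([series[i:i + window] for i in range(len(series) - window)])
    y = series[window:]
    return X, y

rng = np.random.default_rng(3)
t = np.arange(2000)
signal = np.sin(2 * np.pi * t / 50) + rng.normal(0, 0.05, t.size)   # periodic sensor trace

window = 20
X_train, y_train = make_windows(signal[:1500], window)

# Stand-in predictor: linear least squares. An LSTM would replace these lines.
coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

def predict_next(X):
    return X @ coef

# Threshold from prediction errors on normal training data
train_err = np.abs(y_train - predict_next(X_train))
threshold = train_err.mean() + 4 * train_err.std()

test = signal[1500:].copy()
test[300] += 1.5                                   # inject a spike at test index 300
X_test, y_test = make_windows(test, window)
test_err = np.abs(y_test - predict_next(X_test))

# Shift by `window` so indices refer to positions in `test`
anomalies = np.flatnonzero(test_err > threshold) + window
print("flagged test indices:", anomalies.tolist())
```

Points shortly after the spike may also be flagged, because the spike sits inside their input window and distorts the prediction.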

Ideal for:
- Vibration patterns (bearing wear develops over time)
- Network traffic (DDoS attacks show temporal patterns)
- Power consumption (usage patterns span hours/days)

1353.6 ML Method Comparison

Method	Training Data Needed	Computational Cost	Edge Deployable	Best For
Isolation Forest	1,000-10,000 samples	Low	Yes	Multi-sensor, moderate features
One-Class SVM	500-5,000 samples	Medium	Gateway only	Well-defined normal region
KNN	1,000+ samples	High (inference)	No	Low dimensions, small datasets
Autoencoder	10,000+ samples	High (training)	Gateway only	High dimensions, nonlinear
LSTM	50,000+ sequences	Very High	No (cloud)	Temporal patterns, sequences

Tradeoff: Batch Model Training vs Online Learning

Option A (Batch Training): Train models offline on historical data (weeks to months of sensor readings), deploy frozen models that remain static until manually retrained. Training takes hours on cloud GPUs, but inference is fast (10-100ms) with predictable performance.

Option B (Online Learning): Continuously update model parameters as new data arrives, adapting to concept drift in real-time. Each new reading triggers incremental model updates (100-500ms), allowing detection thresholds to evolve with changing equipment behavior.

Decision Factors: Choose batch training when normal behavior is stable over the deployment lifetime (factory equipment with fixed operating conditions), when regulatory compliance requires reproducible model behavior (medical devices, financial systems), or when compute resources at the edge cannot support training workloads. Choose online learning when “normal” shifts frequently (seasonal HVAC patterns, user behavior in smart homes), when deploying to environments without historical training data (new equipment types), or when concept drift causes batch models to degrade within weeks.

Hybrid approach: A batch-trained base model with online threshold adaptation handles most industrial IoT scenarios - the model structure stays frozen but decision boundaries adjust monthly based on confirmed anomaly feedback.

1353.7 Worked Examples

Worked Example: Isolation Forest for HVAC Anomaly Detection

Scenario: A commercial building management company deploys anomaly detection across 200 HVAC units to identify failing compressors before breakdown. Each unit reports temperature, humidity, power consumption, and compressor vibration every 30 seconds.

Given:
- Training data: 6 months of normal operation (5.2 million samples from 200 units)
- Feature vector per sample: [supply_temp, return_temp, humidity, power_kW, vibration_rms, outdoor_temp]
- Known anomaly rate from historical maintenance records: ~0.8%
- Deployment target: Raspberry Pi 4 gateway (4GB RAM) serving 50 units
- Requirement: <2 second detection latency, <5% false positive rate

Steps:

1. Data preparation:
- Remove known maintenance periods and startup transients (first 10 minutes after power-on)
- Normalize features: Z-score standardization using training set statistics
- Clean dataset: 4.8 million samples after filtering

2. Model training with Isolation Forest:
- Parameters: n_estimators=100, contamination=0.008 (matching historical rate), max_samples=256
- Training time: 8.2 minutes on 8-core server
- Model size: 12.4 MB (serialized with joblib)

3. Threshold calibration using validation set (500K samples, 1 month holdout):
- Default threshold (contamination=0.008): 4.1% false positive rate (too high)
- Adjusted contamination=0.005: 2.8% false positive rate, 94% recall
- Adjusted contamination=0.003: 1.9% false positive rate, 89% recall

4. Edge deployment optimization:
- Quantize model features to float16: Model size 6.8 MB
- Reduce n_estimators from 100 to 50: Model size 3.4 MB, recall drops 2% to 87%
- Batch inference: Process 50 units in single call (50 samples x 6 features)

5. Production performance metrics:
- Inference time: 45ms for 50-sample batch on RPi4
- Memory usage: 380 MB including model, buffers, and Python runtime
- Detection latency: 45ms from sample arrival to decision, within the <2 second requirement (assuming the requirement is measured from sample arrival); because units report every 30s, a fault can take up to ~30 seconds to reach the detector
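The normalization step in the data preparation can be sketched as follows. The point it illustrates is that the mean and standard deviation are computed once from training data and then reused unchanged on live batches; the feature values below are synthetic placeholders, not figures from the deployment.

```python
import numpy as np

rng = np.random.default_rng(11)

feature_names = ["supply_temp", "return_temp", "humidity",
                 "power_kW", "vibration_rms", "outdoor_temp"]

# Hypothetical training data (values are placeholders for illustration only)
train = rng.normal([13, 24, 45, 3.5, 0.8, 20], [1, 1.5, 5, 0.6, 0.1, 8], size=(10_000, 6))

# Statistics come from the training set only and are stored with the model
mu = train.mean(axis=0)
sigma = train.std(axis=0)

def standardize(batch):
    """Apply the frozen training statistics to any batch of readings."""
    return (batch - mu) / sigma

# One reading from each of the 50 units served by a gateway
live_batch = rng.normal([13, 24, 45, 3.5, 0.8, 20], [1, 1.5, 5, 0.6, 0.1, 8], size=(50, 6))
z_batch = standardize(live_batch)

z_train = standardize(train)
print("train mean after scaling:", np.round(z_train.mean(axis=0), 3))
print("live batch shape:", z_batch.shape)
```

Recomputing the statistics on each live batch would hide exactly the shifts the detector is meant to catch.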

Result: Isolation Forest with 50 trees achieves 87% recall and 2.3% false positive rate. Over 3 months of production operation, the system detected 23 of 26 actual failures (88.5% real-world recall) with 847 false alarms across 200 units (1.4 false alarms per unit per month). Estimated savings: $156,000 in prevented emergency repairs vs. $12,000 investigation cost for false alarms.
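The production figures quoted in the result can be checked with a few lines of arithmetic. The alert-level precision is not stated in the example; it is derived here under the assumption that each detected failure corresponds to one true alert.

```python
detected, actual_failures = 23, 26
false_alarms, units, months = 847, 200, 3
savings, investigation_cost = 156_000, 12_000

recall = detected / actual_failures
alarms_per_unit_month = false_alarms / (units * months)
precision = detected / (detected + false_alarms)      # assumes one alert per detected failure
net_benefit = savings - investigation_cost
cost_per_false_alarm = investigation_cost / false_alarms

print(f"recall: {recall:.1%}")                                        # 88.5%
print(f"false alarms per unit per month: {alarms_per_unit_month:.2f}")  # 1.41
print(f"alert precision: {precision:.1%}")                            # 2.6%
print(f"net benefit: ${net_benefit:,}")                               # $144,000
print(f"cost per false alarm: ${cost_per_false_alarm:.2f}")           # $14.17
```

Most alerts are false alarms, yet the deployment still pays off because each investigation is cheap relative to an emergency repair.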

Key Insight: Isolation Forest’s contamination parameter directly controls the precision-recall tradeoff. Start with your historical anomaly rate, then adjust based on the cost ratio of false negatives (missed failures) to false positives (unnecessary investigations). For HVAC where emergency repairs cost 5-10x preventive maintenance, accepting higher false positive rates (2-5%) to achieve 85%+ recall is economically justified.

Worked Example: Autoencoder for Multi-Sensor Industrial Anomaly Detection

Scenario: A semiconductor fab deploys deep learning anomaly detection on plasma etch chambers. Each chamber has 47 sensors (gas flows, pressures, RF power, temperatures, optical emission spectra) sampled at 10 Hz during 3-minute etch processes.

Given:
- Training data: 45,000 normal etch processes (no defective wafers) over 8 months
- Input shape per process: 1,800 timesteps x 47 features = 84,600 values
- Anomaly definition: Process that produces defective wafer (known from downstream metrology)
- Historical defect rate: 0.3% of processes
- Deployment: NVIDIA Jetson AGX Xavier (32GB RAM, 512 CUDA cores)
- Requirement: Process-level decision within 10 seconds of etch completion

Steps:

1. Feature engineering for temporal data:
- Segment each process into 6 phases (30s each): gas stabilization, plasma ignition, main etch (3 phases), purge
- Extract per-phase statistics: mean, std, min, max, slope for each sensor (47 x 5 x 6 = 1,410 features)
- Dimensionality: 84,600 raw values compressed to 1,410 phase-aggregated features

2. Autoencoder architecture design:
- Encoder: 1410 -> 512 -> 256 -> 64 (bottleneck)
- Decoder: 64 -> 256 -> 512 -> 1410
- Activation: ReLU (hidden), linear (output)
- Total parameters: 1.8 million (7.2 MB float32)

3. Training configuration:
- Loss function: MSE reconstruction loss
- Optimizer: Adam, learning rate 1e-4 with cosine decay
- Batch size: 128 processes
- Training epochs: 150 (early stopping at epoch 112)
- Training time: 2.4 hours on 4x V100 GPUs
- Validation reconstruction error (normal): mean 0.023, std 0.008

4. Threshold determination:
- Compute reconstruction error on 5,000 validation processes (all normal)
- Set threshold at the 99.5th percentile: 0.047 (about 3 standard deviations above the mean: 0.023 + 3 x 0.008)
- Test on 200 known-defective processes: 178 exceed threshold (89% recall)
- Test on 5,000 normal processes: 24 exceed threshold (0.48% false positive rate)

5. Deployment optimization for Jetson:
- Convert to TensorRT: 2.1x inference speedup
- Mixed precision (FP16): Model size 3.6 MB, inference time 82ms
- Batch inference: 10 processes in parallel, 340ms total
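The phase-based feature engineering in the first step can be sketched with NumPy. The function name and the random input are hypothetical; the shapes (1,800 timesteps, 47 sensors, 6 equal phases, 5 statistics) follow the example. The slope is the least-squares trend within each phase, in sensor units per timestep.

```python
import numpy as np

def phase_features(process, n_phases=6):
    """Turn a (timesteps, sensors) array into one vector of per-phase statistics."""
    # Equal-length phases; timesteps must be divisible by n_phases
    phases = np.split(process, n_phases, axis=0)
    blocks = []
    for phase in phases:
        t = np.arange(phase.shape[0])
        slope = np.polyfit(t, phase, 1)[0]         # one fitted slope per sensor column
        stats = np.stack([
            phase.mean(axis=0),
            phase.std(axis=0),
            phase.min(axis=0),
            phase.max(axis=0),
            slope,
        ])                                         # shape (5, sensors)
        blocks.append(stats.ravel())
    return np.concatenate(blocks)

rng = np.random.default_rng(2)
process = rng.normal(0.0, 1.0, (1800, 47))         # one etch process: 180 s at 10 Hz
process[:, 0] = 0.01 * np.arange(1800)             # sensor 0 ramps steadily

features = phase_features(process)
print("raw values:", process.size)                 # 84600
print("aggregated features:", features.size)       # 1410
```

Equal 30-second phases match the example; a real process with variable phase lengths would need phase boundaries from the recipe or from a signal such as RF power.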

Result: TensorRT-optimized autoencoder achieves 89% recall on defective processes with 0.48% false positive rate. End-to-end latency is 3.2 seconds (feature extraction) + 0.34s (inference) = 3.5 seconds per process. Over 6-month deployment, the system flagged 156 processes; 141 were true defects (90.4% precision), preventing $2.1M in downstream processing of defective wafers. 15 false alarms cost ~$45K in additional metrology.

Key Insight: For high-dimensional time-series anomaly detection, phase-based feature aggregation dramatically reduces autoencoder input size while preserving discriminative information. A 60x dimensionality reduction (84,600 to 1,410) enables faster training, smaller models, and more robust generalization. The key is domain knowledge - knowing that plasma etch has distinct phases allows meaningful aggregation rather than arbitrary windowing.
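The effect of aggregation on model size can be estimated by counting dense-layer parameters for the layer sizes used in the example. The raw-input variant is a hypothetical comparison (the example does not train one); it simply swaps the 1,410-wide input and output for 84,600.

```python
def dense_params(layer_sizes):
    """Weights plus biases for a stack of fully connected layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

aggregated = dense_params([1410, 512, 256, 64, 256, 512, 1410])
raw = dense_params([84_600, 512, 256, 64, 256, 512, 84_600])   # hypothetical raw-input model

print(f"aggregated input: {aggregated:,} parameters "
      f"({aggregated * 4 / 1e6:.1f} MB in float32)")
print(f"raw input:        {raw:,} parameters "
      f"({raw * 4 / 1e6:.1f} MB in float32)")
print(f"ratio: {raw / aggregated:.0f}x")
```

The aggregated model comes to about 1.74 million parameters, in line with the roughly 1.8 million quoted in the example, while the same hidden layers on raw input would need about 87 million.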

Knowledge Check (difficulty: medium)

Question: A manufacturing company has two deployment options for anomaly detection: (A) Z-score on individual sensors at the edge (ESP32 microcontroller) or (B) LSTM autoencoder in the cloud analyzing 1-hour sequences. What is the primary advantage of deploying option A at the edge despite its lower detection sophistication?

A) Edge detection achieves higher accuracy than cloud-based deep learning
   Incorrect. LSTM autoencoders can detect more complex patterns and generally achieve higher accuracy for sophisticated anomalies. The advantage of edge deployment is not accuracy.
B) Edge detection provides immediate response (<100ms) for safety-critical anomalies without network dependency
   Correct! Edge detection enables real-time response (milliseconds) for critical failures like overcurrent or overtemperature. If the motor is about to destroy itself, you cannot wait for cloud round-trip latency. Edge detection also works during network outages when cloud connectivity is lost.
C) Edge microcontrollers have more processing power than cloud servers
   Incorrect. Cloud servers have vastly more processing power. ESP32 microcontrollers are severely resource-constrained (520KB RAM, 240MHz CPU) compared to cloud GPUs. The advantage is latency and reliability, not compute power.
D) Z-score requires more training data than LSTM autoencoders
   Incorrect. Z-score requires minimal or no training data (can calculate online), while LSTM autoencoders require substantial training data (10,000+ sequences). This is actually an advantage of Z-score, not the primary deployment consideration.

1353.8 Summary

Machine learning methods excel at detecting complex anomalies that statistical methods miss:

Isolation Forest: Best for multi-dimensional data, no labels required, edge-deployable
Autoencoders: Excel at high-dimensional sensor data with complex correlations
LSTM: Captures temporal sequence patterns for time-series anomalies

Key Takeaway: Start with statistical methods for point anomalies. Graduate to ML for collective anomalies, high dimensions, or when statistical methods have too many false positives.

1353.9 What’s Next

Continue learning about production anomaly detection:

Real-Time Pipelines: Building production detection systems with edge-cloud architecture
Performance Metrics: Evaluating detection accuracy and tuning thresholds for your domain