1351  Production ML: Monitoring and Anomaly Detection

1351.1 Learning Objectives

By the end of this chapter, you will be able to:

  • Monitor Production ML: Track key metrics and detect model degradation
  • Design Anomaly Detection Pipelines: Build predictive maintenance systems
  • Handle Concept Drift: Detect and respond to changing data distributions
  • Debug IoT ML Systems: Diagnose and fix common production issues

1351.2 Prerequisites

Note - Chapter Series: Modeling and Inferencing

This is part 7 (final) of the IoT Machine Learning series:

  1. ML Fundamentals - Core concepts
  2. Mobile Sensing - HAR, transportation
  3. IoT ML Pipeline - 7-step pipeline
  4. Edge ML & Deployment - TinyML
  5. Audio Feature Processing - MFCC
  6. Feature Engineering - Feature design
  7. Production ML (this chapter) - Monitoring and anomaly detection

1351.3 Monitoring IoT ML Models

Deploying ML to IoT introduces unique challenges. Unlike cloud ML with direct server access, IoT models run on distributed, often disconnected edge devices.

1351.3.1 Key Metrics to Track

| Metric Category | Critical Metrics | Alert Threshold |
|---|---|---|
| Model Performance | Inference accuracy, confidence distribution | < 80% of baseline accuracy; KL divergence > 0.3 |
| Inference Latency | P50/P99 latency, timeout rate | P50 > 100 ms / P99 > 500 ms; timeout rate > 1% |
| Resource Usage | CPU, memory, battery drain | > 80% sustained; > 5 mW average drain |
| Data Quality | Missing features, out-of-range values | > 5% of devices; > 1% of readings |
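
The latency row is the easiest to operationalize on-device: keep a rolling buffer of per-inference timings and compare the percentiles against the alert thresholds. A minimal sketch (the 100 ms / 500 ms limits come from the table; the buffer size and reporting mechanism are assumptions):

import numpy as np

# Hypothetical rolling buffer of recent inference latencies, in milliseconds.
latency_buffer_ms = []

def record_latency(latency_ms, max_samples=1000):
    """Append a measurement, keeping only the most recent window."""
    latency_buffer_ms.append(latency_ms)
    if len(latency_buffer_ms) > max_samples:
        del latency_buffer_ms[0]

def check_latency_alerts(p50_limit_ms=100.0, p99_limit_ms=500.0):
    """Return alert strings based on the thresholds in the table above."""
    if not latency_buffer_ms:
        return []
    p50 = np.percentile(latency_buffer_ms, 50)
    p99 = np.percentile(latency_buffer_ms, 99)
    alerts = []
    if p50 > p50_limit_ms:
        alerts.append(f"P50 latency {p50:.1f} ms exceeds {p50_limit_ms} ms")
    if p99 > p99_limit_ms:
        alerts.append(f"P99 latency {p99:.1f} ms exceeds {p99_limit_ms} ms")
    return alerts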

1351.3.2 Common Issues and Solutions

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ecf0f1'}}}%%
flowchart TB
    Issues[Production Issues]

    Issues --> Drift[Accuracy Degraded<br/>Concept Drift]
    Issues --> Latency[High Latency<br/>Model Too Large]
    Issues --> Battery[Battery Drain<br/>Too Frequent]
    Issues --> Memory[Model Crashes<br/>Out of Memory]

    Drift --> DriftSol[Compare distributions<br/>Retrain with recent data]
    Latency --> LatencySol[Profile inference<br/>Quantize/prune model]
    Battery --> BatterySol[Check duty cycle<br/>Reduce sampling rate]
    Memory --> MemSol[Monitor heap usage<br/>Optimize batch size]

    style Issues fill:#E67E22,stroke:#2C3E50,color:#fff
    style Drift fill:#E74C3C,stroke:#2C3E50,color:#fff
    style Latency fill:#E74C3C,stroke:#2C3E50,color:#fff
    style Battery fill:#E74C3C,stroke:#2C3E50,color:#fff
    style Memory fill:#E74C3C,stroke:#2C3E50,color:#fff
    style DriftSol fill:#27AE60,stroke:#2C3E50,color:#fff
    style LatencySol fill:#27AE60,stroke:#2C3E50,color:#fff
    style BatterySol fill:#27AE60,stroke:#2C3E50,color:#fff
    style MemSol fill:#27AE60,stroke:#2C3E50,color:#fff

Figure 1351.1: IoT ML Production Issues with Diagnostic Solutions

1351.4 Case Study: Fall Detection False Positives

Problem: A fall detection system achieves 95% accuracy and 99.9% specificity in testing, yet generates thousands of false alarms per user per year in deployment.

Root Cause: Class imbalance. Falls are rare (roughly 1 per user per year), but the sensor stream arrives every 100 ms and the detector emits a decision on roughly 3.15M non-fall windows per user per year. At a 0.1% false positive rate, 3.15M Γ— 0.1% β‰ˆ 3,150 false alarms per user per year!

1351.4.1 Solution: Three-Stage Pipeline

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ecf0f1'}}}%%
flowchart LR
    Stream[Sensor Stream<br/>Every 100ms]

    Stream --> Stage1[Stage 1:<br/>High Sensitivity<br/>Accel > 3.0g<br/>99% sensitivity<br/>90% specificity]

    Stage1 -->|Alert| Stage2[Stage 2:<br/>High Specificity<br/>Full ML Model<br/>95% sensitivity<br/>99.99% specificity]

    Stage2 -->|Confirmed| Stage3[Stage 3:<br/>User Confirmation<br/>10-second window]

    Stage3 -->|No response| Alert[Emergency<br/>Services]

    Stage1 -->|Normal| Stream
    Stage2 -->|False Alarm| Stream
    Stage3 -->|User OK| Stream

    style Stream fill:#7F8C8D,stroke:#2C3E50,color:#fff
    style Stage1 fill:#2C3E50,stroke:#16A085,color:#fff
    style Stage2 fill:#16A085,stroke:#2C3E50,color:#fff
    style Stage3 fill:#E67E22,stroke:#2C3E50,color:#fff
    style Alert fill:#E74C3C,stroke:#2C3E50,color:#fff

Figure 1351.2: Three-Stage Fall Detection with Progressive Filtering
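
A minimal sketch of the staged decision logic for one incoming window (the 3.0 g gate and 0.5 probability cut-off mirror the figure; the function name and the sklearn-style ml_model.predict_proba interface are assumptions):

def classify_window(accel_magnitude_g, window_features, ml_model, gate_g=3.0):
    """Progressive filtering: cheap gate first, full model second, user confirmation last."""
    # Stage 1: high-sensitivity threshold on acceleration magnitude (rejects ~90% of non-falls).
    if accel_magnitude_g < gate_g:
        return "normal"
    # Stage 2: high-specificity ML model, run only on windows that pass the gate.
    fall_probability = ml_model.predict_proba([window_features])[0][1]
    if fall_probability < 0.5:
        return "false_alarm"
    # Stage 3: prompt the user; emergency services are contacted only if there is no response.
    return "await_user_confirmation"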

Results: False positives reduced from 3,150 to 1.5 per user per year (acceptable!)

1351.5 Monitoring Tools Comparison

| Tool | Best For | IoT Support | Key Features |
|---|---|---|---|
| TensorBoard | Training metrics | Limited | Visualization, profiling |
| MLflow | Experiment tracking | Good | Model versioning |
| Prometheus + Grafana | Infrastructure monitoring | Excellent | Time-series metrics, alerting |
| Edge Impulse | TinyML | Native | Device profiling, OTA updates |
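
Prometheus + Grafana pairs well with a gateway process that exposes inference metrics over HTTP. A minimal sketch using the official prometheus_client Python package (metric names, bucket edges, and the scrape port are assumptions):

from prometheus_client import Gauge, Histogram, start_http_server

# Per-inference latency; bucket edges bracket the 100 ms / 500 ms alert thresholds.
INFERENCE_LATENCY = Histogram(
    "iot_inference_latency_seconds",
    "Model inference latency per prediction",
    buckets=(0.01, 0.05, 0.1, 0.25, 0.5, 1.0),
)
# Most recent prediction confidence, useful for spotting distribution shifts in Grafana.
PREDICTION_CONFIDENCE = Gauge(
    "iot_prediction_confidence",
    "Confidence of the most recent prediction",
)

def record_inference(latency_seconds, confidence):
    INFERENCE_LATENCY.observe(latency_seconds)
    PREDICTION_CONFIDENCE.set(confidence)

start_http_server(9100)  # expose /metrics for Prometheus to scrape (port is an assumption)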

1351.6 Production ML Checklist

Before deploying ML models to IoT devices:

  • Model Performance: Establish baseline accuracy, track per-class metrics
  • Data Quality: Track feature drift using the KS test and KL divergence (see the sketch after this list)
  • Inference Performance: Monitor latency percentiles (P50/P90/P99)
  • Fleet Management: Track model versions, implement gradual rollout
  • Operational: Automate retraining, configure drift alerts
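
The data-quality item above names the KS test and KL divergence; a minimal sketch of a per-feature drift check against the training baseline (the function name and histogram binning are assumptions; the 0.3 KL alert threshold echoes the monitoring table):

import numpy as np
from scipy import stats

def feature_drift_report(baseline, current, bins=50):
    """Compare one feature's production distribution against its training baseline."""
    ks_stat, ks_pvalue = stats.ks_2samp(baseline, current)
    # Histogram both samples on a shared grid, then compute KL(baseline || current).
    edges = np.histogram_bin_edges(np.concatenate([baseline, current]), bins=bins)
    p, _ = np.histogram(baseline, bins=edges, density=True)
    q, _ = np.histogram(current, bins=edges, density=True)
    kl = stats.entropy(p + 1e-10, q + 1e-10)
    return {"ks_statistic": ks_stat, "ks_pvalue": ks_pvalue, "kl_divergence": float(kl)}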

1351.7 Worked Example: Predictive Maintenance Pipeline

Scenario: A water utility operates 200 industrial pumps with vibration sensors (4 kHz accelerometer). Goal: Detect failures 2-4 weeks before they occur.

1351.7.1 Step 1: Feature Engineering from Vibration Data

import numpy as np
from scipy import stats

def extract_vibration_features(raw_signal, sample_rate=4000):
    """
    Extract predictive-maintenance features from one second of raw
    accelerometer data (4000 samples at 4 kHz).
    Returns a dictionary of features capturing mechanical health
    (a representative subset of the 18 features used in production).
    """
    features = {}

    # Time-domain features
    features['rms'] = np.sqrt(np.mean(raw_signal**2))
    features['peak'] = np.max(np.abs(raw_signal))
    features['crest_factor'] = features['peak'] / features['rms']
    features['kurtosis'] = stats.kurtosis(raw_signal)

    # Frequency-domain features (positive frequencies only)
    fft_vals = np.abs(np.fft.fft(raw_signal))[:len(raw_signal)//2]
    freqs = np.fft.fftfreq(len(raw_signal), 1/sample_rate)[:len(raw_signal)//2]

    # Spectral energy in fixed bands
    features['energy_0_500hz'] = np.sum(fft_vals[freqs < 500]**2)
    features['energy_500_1000hz'] = np.sum(fft_vals[(freqs >= 500) & (freqs < 1000)]**2)

    # Energy around the bearing fault frequency
    rpm = 1800  # nominal shaft speed
    bpfo = 0.4 * (rpm / 60) * 9  # approximate Ball Pass Frequency, Outer race (9 rolling elements)
    features['bpfo_energy'] = np.sum(fft_vals[(freqs >= bpfo - 5) & (freqs <= bpfo + 5)]**2)

    return features
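
A quick usage example on synthetic data (a 30 Hz sine plus noise standing in for a real 1-second accelerometer window; values are illustrative):

import numpy as np

# One second of simulated vibration: a 30 Hz shaft component plus broadband noise.
rng = np.random.default_rng(0)
t = np.arange(4000) / 4000.0
window = 0.5 * np.sin(2 * np.pi * 30 * t) + 0.05 * rng.standard_normal(4000)

feats = extract_vibration_features(window, sample_rate=4000)
print(f"RMS={feats['rms']:.3f}, kurtosis={feats['kurtosis']:.2f}, "
      f"BPFO energy={feats['bpfo_energy']:.1f}")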

Feature importance from domain knowledge:

| Feature | Physical Meaning | Fault Correlation |
|---|---|---|
| RMS | Overall vibration level | General degradation (r = 0.85) |
| Kurtosis | Impulsiveness of the signal | Bearing pitting (r = 0.92) |
| Crest factor | Peak-to-RMS ratio | Localized damage (r = 0.78) |
| BPFO energy | Bearing fault frequency band | Outer race wear (r = 0.95) |

1351.7.2 Step 2: Model Selection

With only 12 failures vs 400,000+ hours of normal data, supervised classification would overfit. Use semi-supervised anomaly detection.

| Model | Approach | Strengths | Best For |
|---|---|---|---|
| Isolation Forest | Tree-based isolation | Fast, handles high-dimensional features | General anomalies |
| One-Class SVM | Boundary around normal data | Robust to outliers | Small datasets |
| LSTM-Autoencoder | Temporal reconstruction error | Captures sequences | Time-series |
Selected: Ensemble of Isolation Forest + LSTM-Autoencoder:

  • Isolation Forest: fast (< 10 ms inference), runs on the edge gateway
  • LSTM-Autoencoder: captures temporal patterns, runs in the cloud for flagged pumps
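
A minimal sketch of training the edge-side detector, assuming the feature vectors come from the Step 1 extractor applied to known-healthy windows (the placeholder training array and the contamination setting are assumptions):

import numpy as np
from sklearn.ensemble import IsolationForest

# Placeholder for real training data: (n_windows, n_features) vectors from healthy pumps.
healthy_features = np.random.default_rng(0).normal(size=(5000, 7))

iso_forest = IsolationForest(n_estimators=100, contamination=0.01, random_state=42)
iso_forest.fit(healthy_features)

# score_samples is higher for normal points; negate so larger values mean "more anomalous".
anomaly_scores = -iso_forest.score_samples(healthy_features[:10])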

1351.7.3 Step 3: Threshold Tuning (Cost-Based)

| Event | Cost |
|---|---|
| Unplanned failure | $15,000 (repair + downtime) |
| Planned maintenance (true positive) | $3,000 (scheduled repair) |
| False alarm inspection | $500 (technician visit) |

Threshold analysis:

| Threshold | Recall | FP/year | FN/year | Annual Cost |
|---|---|---|---|---|
| 0.70 | 100% | 80 | 0 | $76,000 |
| 0.80 | 95% | 35 | 0.6 | $48,500 |
| 0.85 | 92% | 18 | 1 | $42,000 |
| 0.90 | 83% | 8 | 2 | $44,000 |

Selected: 0.85 - Optimal cost with 92% recall
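
One plausible way to reproduce the cost comparison, assuming true positives are converted into planned maintenance and roughly 12 failures per year (this simple cost model reproduces the first row; the published figures for the other rows may reflect additional assumptions):

def annual_cost(failures_per_year, recall, false_positives,
                cost_unplanned=15000, cost_planned=3000, cost_false_alarm=500):
    """Expected annual cost under a simple cost model (an assumption, not the authors' exact model)."""
    caught = failures_per_year * recall            # failures converted to planned maintenance
    missed = failures_per_year * (1 - recall)      # failures that still occur unplanned
    return caught * cost_planned + missed * cost_unplanned + false_positives * cost_false_alarm

print(annual_cost(failures_per_year=12, recall=1.00, false_positives=80))  # 76000.0, matches the 0.70 row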

1351.7.4 Step 4: Deployment Architecture

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ecf0f1'}}}%%
flowchart TB
    subgraph Sensors["200 Pumps"]
        P1[Pump 1]
        P2[Pump 2]
        PN[Pump N]
    end

    subgraph Edge["Edge Gateway"]
        FE[Feature Extraction<br/>4 kHz to 18 features/sec]
        IF[Isolation Forest<br/>Stage 1 Inference]
        ALERT[Alert if<br/>score > 0.7]
    end

    subgraph Cloud["Cloud Platform"]
        LSTM[LSTM-Autoencoder<br/>Stage 2 Inference]
        TREND[Trend Analysis<br/>7-day rolling]
        DASH[Maintenance<br/>Dashboard]
    end

    P1 & P2 & PN --> FE --> IF --> ALERT
    ALERT -->|Cellular| LSTM --> TREND --> DASH

    style Sensors fill:#2C3E50,stroke:#16A085,color:#fff
    style Edge fill:#E67E22,stroke:#2C3E50,color:#fff
    style Cloud fill:#16A085,stroke:#2C3E50,color:#fff

Figure 1351.3: Predictive Maintenance Edge-to-Cloud Architecture

Edge gateway specs:

  • Hardware: Raspberry Pi 4 ($75)
  • Feature extraction: 4 kHz raw samples reduced to 1 feature vector per second (4000Γ— rate reduction)
  • Isolation Forest: 2 MB model, 5 ms inference
  • Connectivity: 4G LTE, about 1 MB/day uplink
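
Putting the edge pieces together, the per-second gateway loop might look like the following sketch (read_one_second_window, send_to_cloud, and the 0.7 score threshold from the figure are assumptions; extract_vibration_features is the Step 1 function):

import time
import numpy as np

def gateway_loop(iso_forest, read_one_second_window, send_to_cloud, score_threshold=0.7):
    """Extract features locally, score with the Isolation Forest, and uplink only flagged windows."""
    while True:
        window = read_one_second_window()             # 4000 raw samples from the accelerometer
        feats = extract_vibration_features(window)    # feature dictionary (see Step 1)
        vector = np.array([list(feats.values())])     # keep key order consistent with training
        score = -iso_forest.score_samples(vector)[0]  # larger = more anomalous
        if score > score_threshold:
            send_to_cloud({"features": feats, "anomaly_score": float(score)})
        time.sleep(1.0)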

1351.7.5 Step 5: Continuous Learning Pipeline

import numpy as np
from sklearn.ensemble import IsolationForest

def monthly_model_update():
    """
    Monthly retraining job:
      1. Collect new data
      2. Validate data quality
      3. Retrain models
      4. A/B test the new model against the current one
      5. Deploy if performance has not regressed

    The fetch_*, check_feature_drift, evaluate_model, and deploy_to_edge_gateways
    helpers, and the historical_* / current_model globals, are pipeline-specific placeholders.
    """
    # Step 1: Data collection
    new_normal_data = fetch_last_month_normal_data()
    new_failure_data = fetch_confirmed_failures()  # confirmed failures feed the holdout set in Step 4

    # Step 2: Data quality checks
    assert len(new_normal_data) > 10000, "not enough new data to retrain"
    assert check_feature_drift(new_normal_data) < 0.3, "drift too large; investigate before retraining"

    # Step 3: Retrain on historical + new normal data
    combined_normal = np.vstack([historical_normal_data, new_normal_data])
    new_iso_forest = IsolationForest(n_estimators=100)
    new_iso_forest.fit(combined_normal)

    # Step 4: A/B validation on held-out confirmed failures
    old_recall = evaluate_model(current_model, holdout_failures)
    new_recall = evaluate_model(new_iso_forest, holdout_failures)

    # Step 5: Deploy only if recall has not dropped by more than 5 points
    if new_recall >= old_recall - 0.05:
        deploy_to_edge_gateways(new_iso_forest)
Drift detection thresholds:

| Metric | Threshold | Action |
|---|---|---|
| Feature distribution shift | KL divergence > 0.3 | Investigate |
| False positive rate increase | > 50% | Retrain |
| Missed failure | Any | Retrain immediately |

1351.7.6 Results

| Metric | Value |
|---|---|
| Failure detection rate | 92% (11 of 12 failures caught) |
| Lead time | 2-4 weeks before failure |
| False positive rate | 18/year fleet-wide (< 2 per pump per year) |
| Annual cost reduction | $123,000 (68% savings) |
| ROI | 820% (development cost: $15,000) |

1351.8 Production Monitoring Pitfall

Caution - Pitfall: Deploying ML Without Production Monitoring

The Mistake: Treating deployment as the final step without continuous monitoring for model degradation, data drift, or silent failures.

Why It Happens: Once accuracy looks good in testing, teams move on. ML monitoring is less established than application monitoring.

The Fix:

  1. Prediction distribution monitoring: track model outputs over time
  2. Feature drift detection: monitor input distributions (KL divergence, PSI)
  3. Ground truth sampling: manually review 1-5% of predictions
  4. Automatic alerting: set thresholds for key metrics
  5. Model versioning: enable instant rollback

Warning sign: If you cannot answer β€œwhat was our model accuracy last week?”, you are operating blind.
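
Item 2 of the fix mentions PSI (Population Stability Index) alongside KL divergence; a minimal sketch of a per-feature PSI check (the 0.1 / 0.25 interpretation bands are a common rule of thumb, not a figure from this chapter):

import numpy as np

def population_stability_index(expected, actual, bins=10):
    """
    PSI between a baseline sample (expected) and a production sample (actual).
    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 significant shift.
    """
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    # Clip both samples into the baseline range so out-of-range production values land in the end bins.
    e_counts, _ = np.histogram(np.clip(expected, edges[0], edges[-1]), bins=edges)
    a_counts, _ = np.histogram(np.clip(actual, edges[0], edges[-1]), bins=edges)
    e_pct = np.clip(e_counts / e_counts.sum(), 1e-6, None)
    a_pct = np.clip(a_counts / a_counts.sum(), 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))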

1351.9 Knowledge Check

Question 1: A fall detection system achieves 95% accuracy but has a 5% false positive rate. In a population of 1,000 users, each with 1 fall per year and roughly 10 candidate motion events per day, approximately how many false alarms occur annually?

Explanation: The false positive rate applies to NON-FALL events. Each user produces about 3,650 non-fall events per year (10/day Γ— 365), so 1,000 users produce 3.65M non-fall events. False alarms = 3.65M Γ— 5% = 182,500 per year. Solution: increase specificity to 99.9%, which cuts false alarms to about 3,650 in total.

1351.10 Summary

This chapter covered production ML for IoT:

  • Monitoring: Track accuracy, latency, resource usage, and data quality
  • Anomaly Detection: Use semi-supervised learning for rare-event detection
  • Cost-Based Thresholds: Optimize for business cost, not just accuracy
  • Hybrid Architecture: Edge for real-time, cloud for complex analysis
  • Continuous Learning: Monthly retraining with drift detection

Key Insight: Production ML requires as much engineering as model development. Monitoring, drift detection, and automated retraining are essential for long-term success.

1351.11 What’s Next

You have completed the IoT Machine Learning series. Explore related topics: