15  Anomaly Detection Metrics Lab

15.1 Learning Objectives

By the end of this chapter, you will be able to:

  • Evaluate Detection Performance: Use precision, recall, F1, and confusion matrices for imbalanced data
  • Tune Thresholds: Optimize detection thresholds based on business costs
  • Reduce False Alarms: Apply temporal persistence and multi-sensor correlation
  • Build a Detection System: Implement multi-method anomaly detection on embedded hardware

In 60 Seconds

Standard accuracy is a misleading metric for anomaly detection because normal readings dominate; precision, recall, and F1-score reveal whether your system actually finds rare critical events. Tuning the detection threshold is ultimately a business decision driven by the relative cost of missed anomalies versus false alarms.

Anomaly detection metrics are like a report card for your alarm system. They measure how often it correctly catches real problems versus how often it cries wolf. Getting this balance right is crucial – too many false alarms and people start ignoring alerts, too few and real problems slip through unnoticed.

Minimum Viable Understanding: Anomaly Detection Metrics

Core Concept: Standard accuracy is meaningless for anomaly detection. A detector that always says “normal” achieves 99.9% accuracy but catches zero anomalies. Use precision, recall, and F1 instead.

Why It Matters: The cost of a missed anomaly (false negative) is often 10-100x the cost of a false alarm (false positive). Threshold tuning is a business decision, not a statistical one.

Key Takeaway: Set thresholds based on the cost ratio of false negatives to false positives. For safety-critical systems, optimize for recall (>99%). For consumer systems, optimize for precision (>90%).

15.2 Prerequisites

Before diving into this chapter, you should be familiar with:

~10 min | Intermediate | P10.C01.U06

15.3 Introduction

How do you know if your anomaly detector is working well? Standard accuracy is misleading for imbalanced data (99.9% normal, 0.1% anomalies). This chapter introduces the metrics that actually matter for anomaly detection – precision, recall, and F1 score – and shows how to tune detection thresholds based on the real-world costs of false alarms versus missed anomalies. You will also build a multi-method anomaly detection system on an ESP32 to see these concepts in action.

15.4 The Fundamental Trade-Off

Sensitivity vs Specificity:

  • High Sensitivity (Recall): Catch all anomalies, but many false alarms
  • High Specificity: Few false alarms, but miss some anomalies

Real-world costs:

  • False Positive: Operator investigates, finds nothing – wasted time and alarm fatigue
  • False Negative: Missed critical failure – equipment damage, safety risk

The balance depends on domain:

Domain                 | Priority  | Target Metrics | Rationale
Industrial Safety      | Recall    | >99% recall    | Cannot miss critical failures
Consumer IoT           | Precision | >80% precision | Users ignore frequent false alarms
Predictive Maintenance | Balanced  | F1 > 0.85      | Balance early detection vs maintenance costs

15.5 Key Metrics Explained

Confusion Matrix:

                  Predicted
                Normal  Anomaly
Actual Normal     TN      FP    (False Positive = False Alarm)
      Anomaly     FN      TP    (False Negative = Missed Anomaly)

Derived Metrics:

  1. Precision (Positive Predictive Value)

    Precision = TP / (TP + FP)
    "Of alerts raised, how many were real anomalies?"
    • High precision means few false alarms
    • Critical for systems with alert fatigue risk
  2. Recall (Sensitivity, True Positive Rate)

    Recall = TP / (TP + FN)
    "Of real anomalies, how many did we catch?"
    • High recall means critical events are rarely missed
    • Critical for safety systems
  3. F1 Score (Harmonic Mean)

    F1 = 2 x (Precision x Recall) / (Precision + Recall)
    • Balanced metric when both precision and recall matter
    • Single number for model comparison
  4. False Positive Rate

    FPR = FP / (FP + TN)
    "Of normal samples, how many did we incorrectly flag?"
    • Critical for operational burden
    • Target: <0.1% for industrial (1 false alarm per 1000 samples)
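
In practice, these four numbers come straight from labeled data. A minimal scikit-learn sketch (the tiny label arrays are invented for illustration):

from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score
import numpy as np

# Hypothetical labels: 0 = normal, 1 = anomaly
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 0, 1, 1, 0])

# sklearn orders the 2x2 matrix as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")                   # TN=6 FP=1 FN=1 TP=2
print(f"Precision: {precision_score(y_true, y_pred):.2f}")  # 2/3 = 0.67
print(f"Recall:    {recall_score(y_true, y_pred):.2f}")     # 2/3 = 0.67
print(f"F1:        {f1_score(y_true, y_pred):.2f}")         # 0.67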

The cost-benefit optimal threshold balances false positive and false negative costs. For cost \(C_{FP}\) per false alarm and \(C_{FN}\) per missed anomaly, minimize:

\[\text{Total Cost} = (FP \times C_{FP}) + (FN \times C_{FN})\]

Example: Predictive maintenance with \(C_{FP} = \$250\) (truck roll) and \(C_{FN} = \$8,500\) (emergency repair). Current model over 4 weeks: 95% recall, 20 real failures in 1M readings:

  • \(TP = 19\), \(FN = 1\), \(FP = 35\) (precision = 19/54 = 35%)
  • Cost = \((35 \times \$250) + (1 \times \$8,500) = \$17,250\)

Tune threshold to 99% recall, 62% precision:

  • \(TP = 19.8\), \(FN = 0.2\), \(FP = 12\) (precision = 19.8/31.8 = 62%)
  • Cost = \((12 \times \$250) + (0.2 \times \$8,500) = \$4,700\)

Savings: $12,550 per 4-week period by optimizing for the cost ratio (\(C_{FN}/C_{FP} = 34\)).
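
The same arithmetic generalizes to a quick cost sweep over candidate operating points. A minimal sketch reusing the example's numbers (the operating points themselves are hypothetical):

# Cost per false alarm and per missed failure, from the example above
C_FP, C_FN = 250.0, 8500.0
N_FAILURES = 20  # real failures per 4-week period

operating_points = [  # (recall, false positives) per 4 weeks
    (0.95, 35),
    (0.99, 12),
]

for recall, fp in operating_points:
    fn = N_FAILURES * (1 - recall)  # expected missed failures
    cost = fp * C_FP + fn * C_FN
    print(f"recall={recall:.0%}  FP={fp}  expected cost=${cost:,.0f}")
# recall=95%  FP=35  expected cost=$17,250
# recall=99%  FP=12  expected cost=$4,700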

Try It: Detection Cost Calculator

Adjust the costs and confusion matrix to calculate the total cost of your anomaly detection system:

Worked Example:

# Motor monitoring system over 1 week
# 1,000,000 sensor readings, 100 real anomalies

TP = 95   # Detected 95 real anomalies
FN = 5    # Missed 5 real anomalies
FP = 200  # 200 false alarms
TN = 999700  # Correctly identified 999,700 normal samples

precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * (precision * recall) / (precision + recall)
fpr = FP / (FP + TN)

print(f"Precision: {precision:.3f} (95/295 alerts were real)")
print(f"Recall: {recall:.3f} (caught 95/100 anomalies)")
print(f"F1 Score: {f1:.3f}")
print(f"False Positive Rate: {fpr:.5f} (0.02%)")

# Output:
# Precision: 0.322 (95/295 alerts were real)   <- LOW, too many false alarms
# Recall: 0.950 (caught 95/100 anomalies)      <- HIGH, good detection
# F1 Score: 0.481                              <- Mediocre balance
# False Positive Rate: 0.00020 (0.02%)         <- EXCELLENT, few false alarms per sample

# Interpretation: System catches most anomalies but operator receives
# ~200 false alarms per week (28/day) -> likely alarm fatigue
# Solution: Increase detection threshold to improve precision

Try It: Anomaly Detection Metrics Calculator

Adjust the confusion matrix values below to see how precision, recall, F1, and false positive rate change:

15.6 Real-World Performance Targets

Industry Benchmarks:

Application            | Precision Target | Recall Target | False Alarm Tolerance
Manufacturing Safety   | >70%             | >99%          | <10 false alarms/day
Predictive Maintenance | >80%             | >95%          | <5 false alarms/week
Energy Management      | >85%             | >90%          | <2 false alarms/week
Smart Home             | >90%             | >80%          | <1 false alarm/month
Network Security       | >60%             | >99.9%        | <100 false alarms/day

15.7 Threshold Tuning

Tuning Strategy:

import numpy as np
from sklearn.metrics import precision_recall_curve

def find_optimal_threshold(y_true, y_scores, target_recall=0.95):
    """
    Find threshold that achieves target recall while maximizing precision

    y_true: actual labels (0=normal, 1=anomaly)
    y_scores: anomaly scores from model
    target_recall: minimum recall required
    """
    precision, recall, thresholds = precision_recall_curve(y_true, y_scores)

    # precision and recall have len(thresholds)+1 elements;
    # drop the last element (recall=0, precision=1) to align with thresholds
    precision = precision[:-1]
    recall = recall[:-1]

    # Find thresholds that meet recall target
    valid_mask = recall >= target_recall

    if not any(valid_mask):
        print(f"Cannot achieve {target_recall} recall")
        return None

    # Among valid thresholds, pick one with best precision
    best_idx = np.argmax(precision[valid_mask])
    valid_thresholds = thresholds[valid_mask]
    valid_precisions = precision[valid_mask]

    best_threshold = valid_thresholds[best_idx]
    best_precision = valid_precisions[best_idx]

    print(f"Threshold: {best_threshold:.3f}")
    print(f"Achieves: {target_recall:.1%} recall, {best_precision:.1%} precision")

    return best_threshold
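
To sanity-check the function, feed it synthetic scores in which anomalies rank higher on average (values arbitrary; assumes the imports and definition above):

# Synthetic test: 9,900 normal readings, 100 anomalies with higher scores
rng = np.random.default_rng(42)
normal_scores = rng.normal(0.2, 0.10, 9900)
anomaly_scores = rng.normal(0.8, 0.15, 100)

y_true = np.concatenate([np.zeros(9900), np.ones(100)])
y_scores = np.concatenate([normal_scores, anomaly_scores])

threshold = find_optimal_threshold(y_true, y_scores, target_recall=0.95)
# Expect a threshold near the overlap of the two score distributions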

A facilities management company monitors 500 commercial HVAC units using vibration sensors. Initial deployment: Isolation Forest model with default threshold yields 95% recall but only 10% precision, generating 171 false alarms over 4 weeks.

Step 1 - Establish Business Costs:

  • False positive cost: Technician truck roll + 2 hours labor = $250/incident
  • False negative cost: Emergency repair + downtime = $8,500/incident (HVAC failure during business hours)
  • Cost ratio: FN/FP = $8,500 / $250 = 34:1

Step 2 - Current Performance Analysis: Over 4 weeks (1,120,000 sensor readings, 20 actual HVAC failures):

  • TP = 19 (caught 19 failures)
  • FN = 1 (missed 1 failure)
  • FP = 171 (171 false alarms over 4 weeks, ~43/week)
  • Precision = 19/(19+171) = 10%
  • Recall = 19/20 = 95%
  • 4-week cost = 171 x $250 + 1 x $8,500 = $42,750 + $8,500 = $51,250 ($12,813/week)

Step 3 - Optimize Threshold for Business: Using the precision-recall curve, find the threshold that achieves recall >= 99% (cannot miss failures):

  • New threshold: anomaly score > 0.72 (raised from 0.55)
  • TP = 19.8 (99% of 20)
  • FN = 0.2
  • FP = 40 (reduced from 171)
  • Precision = 19.8/(19.8+40) = 33%
  • 4-week cost = 40 x $250 + 0.2 x $8,500 = $10,000 + $1,700 = $11,700 ($2,925/week)

Step 4 - Add Temporal Persistence: Require 3 consecutive anomalous readings (15-minute window) before alerting:

  • Filters transient sensor noise (vibration from nearby traffic, door slams)
  • TP = 19.6 (98% recall), FN = 0.4
  • FP = 12 over 4 weeks (3/week)
  • Precision = 19.6/(19.6+12) = 62%
  • 4-week cost = 12 x $250 + 0.4 x $8,500 = $3,000 + $3,400 = $6,400 ($1,600/week)
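
Step 4's persistence rule is easy to express in code. A minimal sketch (function name hypothetical):

def persistence_filter(raw_flags, required=3):
    """Suppress alerts until `required` consecutive raw detections occur."""
    consecutive = 0
    alerts = []
    for flag in raw_flags:
        consecutive = consecutive + 1 if flag else 0
        alerts.append(consecutive >= required)
    return alerts

# A transient blip vs. a sustained excursion (synthetic readings)
print(persistence_filter([0, 1, 0, 0, 1, 1, 1, 1]))
# [False, False, False, False, False, False, True, True]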

Results: Tuned system saves $44,850 per 4-week period ($11,213/week) versus initial deployment, while maintaining 98% recall to catch critical failures before emergency breakdowns.

Tradeoff Decision Guide: Statistical vs ML Anomaly Detection

Factor                 | Statistical (Z-score/IQR) | ML (Isolation Forest/Autoencoder) | When to Choose
Compute Requirements   | Minimal (<1KB RAM)        | Significant (MB-GB RAM)           | Statistical for edge devices; ML for cloud/gateway
Training Data Needed   | None (online calculation) | 1000+ normal samples              | Statistical for cold start; ML with historical data
Interpretability       | High (clear thresholds)   | Low (black-box scores)            | Statistical for regulated/auditable systems
Multivariate Patterns  | Poor (single variable)    | Excellent (cross-sensor)          | ML for complex correlations; statistical for single sensors
Concept Drift Handling | Manual threshold updates  | Automatic with retraining         | Statistical with domain expertise; ML for autonomous operation
False Positive Rate    | Higher (simple rules)     | Lower (learned patterns)          | ML when false alarm cost is high
Setup Time             | Minutes                   | Days to weeks                     | Statistical for rapid deployment; ML for mature systems

Quick Decision Rule: Start with Z-score/IQR for immediate value with minimal setup; graduate to ML methods only when you have sufficient training data AND the false positive reduction justifies the computational and maintenance overhead.
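
As a concrete starting point for that rule, here is a minimal online Z-score detector in Python (a sketch; the window and threshold mirror the lab's parameters, everything else is illustrative):

from collections import deque
import math

class RollingZScore:
    """Flag readings more than `threshold` standard deviations from a rolling mean."""

    def __init__(self, window=50, threshold=2.5):
        self.values = deque(maxlen=window)
        self.threshold = threshold

    def update(self, x):
        is_anomaly = False
        if len(self.values) >= 10:  # require a minimal baseline first
            mean = sum(self.values) / len(self.values)
            var = sum((v - mean) ** 2 for v in self.values) / len(self.values)
            is_anomaly = var > 0 and abs(x - mean) / math.sqrt(var) > self.threshold
        self.values.append(x)
        return is_anomaly

detector = RollingZScore()
readings = [20.1, 20.3, 19.9, 20.2, 20.0, 20.1, 20.2, 19.8, 20.0, 20.1, 35.0]
print([detector.update(r) for r in readings])  # only the final spike is flagged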


15.8 Lab: Build an Anomaly Detection System

~45 min | Intermediate | P10.C01.LAB01

15.8.1 Learning Objectives

By completing this hands-on lab, you will be able to:

  • Implement Z-score based anomaly detection on embedded hardware
  • Build a moving average baseline for adaptive thresholds
  • Apply IQR (Interquartile Range) method for robust outlier detection
  • Design threshold-based alerts with hysteresis to reduce false positives
  • Compare different anomaly detection methods and understand their tradeoffs
  • Visualize anomaly detection decisions in real-time

What You’ll Build

A complete anomaly detection system on ESP32 that demonstrates four detection methods running simultaneously: Z-score (statistical), Moving Average deviation, IQR-based outliers, and threshold with hysteresis. You’ll see how each method responds differently to the same sensor data, helping you understand when to use each approach in production IoT systems.

15.8.2 Anomaly Detection Methods Demonstrated

This lab implements several key anomaly detection patterns:

Method                    | How It Works                           | Strengths                                         | Weaknesses
Z-Score                   | Measures standard deviations from mean | Mathematically rigorous, well-understood          | Assumes normal distribution, sensitive to outliers in baseline
Moving Average            | Compares to rolling baseline           | Adapts to slow changes, simple                    | Lag in detection, window size tuning needed
IQR (Interquartile Range) | Uses quartiles for robust bounds       | Resistant to outliers, no distribution assumption | Requires sorted data buffer, higher memory
Threshold + Hysteresis    | Fixed bounds with entry/exit gap       | Prevents oscillation, deterministic               | Requires domain knowledge for thresholds

15.8.3 Wokwi Simulator

Use the embedded simulator below to build your anomaly detection system:

15.8.4 Circuit Setup

Connect the temperature sensor and indicator LEDs to the ESP32:

Component                | ESP32 Pin | Purpose
Temperature Sensor (NTC) | GPIO 34   | Primary data source for anomaly detection
Potentiometer            | GPIO 35   | Simulate temperature variations/anomalies
Red LED                  | GPIO 18   | Z-score anomaly indicator
Orange LED               | GPIO 19   | Moving average anomaly indicator
Yellow LED               | GPIO 21   | IQR anomaly indicator
Green LED                | GPIO 22   | Normal operation indicator
Blue LED                 | GPIO 23   | Threshold + hysteresis anomaly indicator

Add this diagram.json configuration in Wokwi:

{
  "version": 1,
  "author": "IoT Class - Anomaly Detection Lab",
  "editor": "wokwi",
  "parts": [
    { "type": "wokwi-esp32-devkit-v1", "id": "esp", "top": 0, "left": 0 },
    { "type": "wokwi-ntc-temperature-sensor", "id": "temp1", "top": -120, "left": 120 },
    { "type": "wokwi-potentiometer", "id": "pot1", "top": -120, "left": 260, "attrs": { "value": "50" } },
    { "type": "wokwi-led", "id": "led_red", "top": 180, "left": 80, "attrs": { "color": "red" } },
    { "type": "wokwi-led", "id": "led_orange", "top": 180, "left": 130, "attrs": { "color": "orange" } },
    { "type": "wokwi-led", "id": "led_yellow", "top": 180, "left": 180, "attrs": { "color": "yellow" } },
    { "type": "wokwi-led", "id": "led_green", "top": 180, "left": 230, "attrs": { "color": "green" } },
    { "type": "wokwi-led", "id": "led_blue", "top": 180, "left": 280, "attrs": { "color": "blue" } },
    { "type": "wokwi-resistor", "id": "r1", "top": 240, "left": 80, "attrs": { "value": "220" } },
    { "type": "wokwi-resistor", "id": "r2", "top": 240, "left": 130, "attrs": { "value": "220" } },
    { "type": "wokwi-resistor", "id": "r3", "top": 240, "left": 180, "attrs": { "value": "220" } },
    { "type": "wokwi-resistor", "id": "r4", "top": 240, "left": 230, "attrs": { "value": "220" } },
    { "type": "wokwi-resistor", "id": "r5", "top": 240, "left": 280, "attrs": { "value": "220" } }
  ],
  "connections": [
    ["esp:GND.1", "temp1:GND", "black", ["h0"]],
    ["esp:3V3", "temp1:VCC", "red", ["h0"]],
    ["esp:34", "temp1:OUT", "green", ["h0"]],
    ["esp:GND.1", "pot1:GND", "black", ["h0"]],
    ["esp:3V3", "pot1:VCC", "red", ["h0"]],
    ["esp:35", "pot1:SIG", "purple", ["h0"]],
    ["esp:18", "led_red:A", "red", ["h0"]],
    ["led_red:C", "r1:1", "black", ["h0"]],
    ["r1:2", "esp:GND.2", "black", ["h0"]],
    ["esp:19", "led_orange:A", "orange", ["h0"]],
    ["led_orange:C", "r2:1", "black", ["h0"]],
    ["r2:2", "esp:GND.2", "black", ["h0"]],
    ["esp:21", "led_yellow:A", "yellow", ["h0"]],
    ["led_yellow:C", "r3:1", "black", ["h0"]],
    ["r3:2", "esp:GND.2", "black", ["h0"]],
    ["esp:22", "led_green:A", "green", ["h0"]],
    ["led_green:C", "r4:1", "black", ["h0"]],
    ["r4:2", "esp:GND.2", "black", ["h0"]],
    ["esp:23", "led_blue:A", "blue", ["h0"]],
    ["led_blue:C", "r5:1", "black", ["h0"]],
    ["r5:2", "esp:GND.2", "black", ["h0"]]
  ]
}

15.8.5 Step-by-Step Instructions

15.8.5.1 Step 1: Set Up the Simulator

  1. Open the Wokwi simulator embedded above (or visit wokwi.com)
  2. Create a new ESP32 project
  3. Click the diagram.json tab and paste the circuit configuration
  4. Copy the Arduino code from the collapsible section below

15.8.5.2 Step 2: Run and Observe Normal Operation

  1. Click the Play button to start the simulation
  2. Open the Serial Monitor to see detection output
  3. Keep the potentiometer at center position (normal temperature range)
  4. Observe: All four methods should show “Normal” with the green LED lit
  5. Watch the buffer fill as the system collects baseline data

15.8.5.3 Step 3: Trigger Anomalies with the Potentiometer

  1. Slowly rotate the potentiometer to the right (increase temperature)
  2. Watch which method detects the anomaly first:
    • Hysteresis: Triggers when crossing 45C threshold
    • Z-score: Triggers when 2.5 standard deviations from mean
    • Moving Average: Triggers at 15% deviation
    • IQR: Triggers outside 1.5x interquartile range
  3. Note: Different LEDs light up as each method triggers

15.8.5.4 Step 4: Observe Hysteresis Behavior

  1. Push temperature above 45C (potentiometer far right)
  2. Blue LED turns ON (entered anomaly state)
  3. Slowly decrease temperature by rotating potentiometer left
  4. Notice: Blue LED stays ON until temperature drops below 38C
  5. This gap (45C to 38C) is the hysteresis band - prevents oscillation

Copy this code into the Wokwi editor:

// Anomaly Detection Lab: Multi-Method Comparison System
// Demonstrates: Z-Score, Moving Average, IQR, Hysteresis

const int TEMP_PIN = 34;
const int POT_PIN = 35;
const int LED_ZSCORE = 18;
const int LED_MAVG = 19;
const int LED_IQR = 21;
const int LED_NORMAL = 22;
const int LED_HYSTERESIS = 23;

const int SAMPLE_INTERVAL_MS = 200;
const int WINDOW_SIZE = 50;
const float ZSCORE_THRESHOLD = 2.5;
const float MAVG_DEVIATION_PCT = 15.0;
const float IQR_MULTIPLIER = 1.5;
const float HYSTERESIS_HIGH = 45.0;
const float HYSTERESIS_LOW = 38.0;
const int CONSECUTIVE_REQUIRED = 3;

float dataBuffer[WINDOW_SIZE];
float sortedBuffer[WINDOW_SIZE];
int bufferIndex = 0;
int bufferCount = 0;

float runningSum = 0;
float runningSumSq = 0;
float movingAverage = 0;
bool inHysteresisAnomaly = false;

int zscoreConsecutive = 0;
int mavgConsecutive = 0;
int iqrConsecutive = 0;

unsigned long totalSamples = 0;
unsigned long lastSampleTime = 0;

void setup() {
  Serial.begin(115200);
  delay(1000);

  pinMode(TEMP_PIN, INPUT);
  pinMode(POT_PIN, INPUT);
  pinMode(LED_ZSCORE, OUTPUT);
  pinMode(LED_MAVG, OUTPUT);
  pinMode(LED_IQR, OUTPUT);
  pinMode(LED_NORMAL, OUTPUT);
  pinMode(LED_HYSTERESIS, OUTPUT);

  Serial.println("Anomaly Detection Lab Started");
  Serial.println("Adjust potentiometer to simulate anomalies");
}

// Combine NTC reading (mapped to 20-40 C) with potentiometer offset (-20 to +30 C)
float readTemperature() {
  int ntcRaw = analogRead(TEMP_PIN);
  int potRaw = analogRead(POT_PIN);
  float baseTemp = map(ntcRaw, 0, 4095, 2000, 4000) / 100.0;
  float offset = map(potRaw, 0, 4095, -2000, 3000) / 100.0;
  return baseTemp + offset;
}

// Circular buffer with incremental sum and sum-of-squares for O(1) mean/variance
void addToBuffer(float value) {
  if (bufferCount == WINDOW_SIZE) {
    float oldValue = dataBuffer[bufferIndex];
    runningSum -= oldValue;
    runningSumSq -= oldValue * oldValue;
  }

  dataBuffer[bufferIndex] = value;
  runningSum += value;
  runningSumSq += value * value;

  bufferIndex = (bufferIndex + 1) % WINDOW_SIZE;
  if (bufferCount < WINDOW_SIZE) bufferCount++;
}

// |value - mean| in standard deviations, using variance = E[x^2] - mean^2
float calculateZScore(float value) {
  if (bufferCount < 10) return 0;
  float mean = runningSum / bufferCount;
  float variance = (runningSumSq / bufferCount) - (mean * mean);
  if (variance <= 0) return 0;
  return abs((value - mean) / sqrt(variance));
}

float calculateMADeviation(float value) {
  if (bufferCount == 0) return 0;
  float avg = runningSum / bufferCount;
  if (avg == 0) return 0;
  return abs((value - avg) / avg) * 100.0;
}

// Copy the buffer and bubble-sort it so quartiles can be read by index
void sortBuffer() {
  for (int i = 0; i < bufferCount; i++) {
    sortedBuffer[i] = dataBuffer[i];
  }
  for (int i = 0; i < bufferCount - 1; i++) {
    for (int j = 0; j < bufferCount - i - 1; j++) {
      if (sortedBuffer[j] > sortedBuffer[j + 1]) {
        float temp = sortedBuffer[j];
        sortedBuffer[j] = sortedBuffer[j + 1];
        sortedBuffer[j + 1] = temp;
      }
    }
  }
}

// Tukey fences: anomalous if outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
bool isIQRAnomaly(float value) {
  if (bufferCount < 20) return false;
  sortBuffer();
  float q1 = sortedBuffer[bufferCount / 4];
  float q3 = sortedBuffer[(3 * bufferCount) / 4];
  float iqr = q3 - q1;
  float lowerFence = q1 - (IQR_MULTIPLIER * iqr);
  float upperFence = q3 + (IQR_MULTIPLIER * iqr);
  return (value < lowerFence || value > upperFence);
}

// Two-threshold state machine: enter anomaly above HIGH, exit below LOW
bool checkHysteresis(float value) {
  if (!inHysteresisAnomaly) {
    if (value > HYSTERESIS_HIGH) inHysteresisAnomaly = true;
  } else {
    if (value < HYSTERESIS_LOW) inHysteresisAnomaly = false;
  }
  return inHysteresisAnomaly;
}

// Temporal persistence: alert only after CONSECUTIVE_REQUIRED raw detections
bool updateConsecutive(bool detected, int* counter) {
  if (detected) {
    (*counter)++;
    return (*counter) >= CONSECUTIVE_REQUIRED;
  } else {
    *counter = 0;
    return false;
  }
}

void loop() {
  unsigned long now = millis();

  if (now - lastSampleTime >= SAMPLE_INTERVAL_MS) {
    lastSampleTime = now;
    totalSamples++;

    float temp = readTemperature();
    addToBuffer(temp);

    float zscore = calculateZScore(temp);
    float maDeviation = calculateMADeviation(temp);

    bool zscoreRaw = (zscore > ZSCORE_THRESHOLD) && (bufferCount >= 10);
    bool mavgRaw = (maDeviation > MAVG_DEVIATION_PCT) && (bufferCount >= 5);
    bool iqrRaw = isIQRAnomaly(temp);
    bool hystAnomaly = checkHysteresis(temp);

    bool zscoreAnomaly = updateConsecutive(zscoreRaw, &zscoreConsecutive);
    bool mavgAnomaly = updateConsecutive(mavgRaw, &mavgConsecutive);
    bool iqrAnomaly = updateConsecutive(iqrRaw, &iqrConsecutive);

    digitalWrite(LED_ZSCORE, zscoreAnomaly ? HIGH : LOW);
    digitalWrite(LED_MAVG, mavgAnomaly ? HIGH : LOW);
    digitalWrite(LED_IQR, iqrAnomaly ? HIGH : LOW);
    digitalWrite(LED_HYSTERESIS, hystAnomaly ? HIGH : LOW);

    bool anyAnomaly = zscoreAnomaly || mavgAnomaly || iqrAnomaly || hystAnomaly;
    digitalWrite(LED_NORMAL, anyAnomaly ? LOW : HIGH);

    Serial.print("T:");
    Serial.print(temp, 1);
    Serial.print(" Z:");
    Serial.print(zscore, 2);
    Serial.print(" MA%:");
    Serial.print(maDeviation, 1);
    Serial.print(" | ");
    if (zscoreAnomaly) Serial.print("Z-SCORE ");
    if (mavgAnomaly) Serial.print("MA ");
    if (iqrAnomaly) Serial.print("IQR ");
    if (hystAnomaly) Serial.print("HYST ");
    if (!anyAnomaly) Serial.print("NORMAL");
    Serial.println();
  }
}

15.8.6 Key Concepts Explained

How It Works:

  • Define two thresholds: HIGH (enter anomaly state) and LOW (exit anomaly state)
  • Once in anomaly state, must drop below LOW to exit
  • The gap between thresholds prevents oscillation

In This Lab:

  • HIGH = 45C (enter anomaly)
  • LOW = 38C (exit anomaly)
  • 7C hysteresis band

Strengths:

  • Deterministic and predictable
  • Prevents “bouncing” alerts at threshold boundary
  • Simple state machine implementation

When to Use: Safety-critical systems with known limits, regulatory compliance
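
For readers following along outside the Arduino sketch, the same state machine in Python (a minimal sketch mirroring the lab's checkHysteresis; names hypothetical):

def hysteresis_step(value, in_alarm, high=45.0, low=38.0):
    """One update of the two-threshold state machine."""
    if not in_alarm and value > high:
        return True   # entered anomaly state
    if in_alarm and value < low:
        return False  # exited anomaly state
    return in_alarm   # inside the hysteresis band: hold current state

state = False
for v in [40, 46, 44, 39, 37]:
    state = hysteresis_step(v, state)
    print(v, state)  # 40 False, 46 True, 44 True, 39 True, 37 False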

Domain                 | Primary Metric       | Target Value | Secondary Metric          | Reason
Manufacturing Safety   | Recall (Sensitivity) | ≥99.5%       | FPR <0.1%                 | Cannot miss critical failures; false alarms tolerable if <10/day
Predictive Maintenance | F1 Score             | ≥0.85        | Balanced precision/recall | Balance early detection vs maintenance cost
Network Intrusion      | Recall               | ≥99.9%       | Precision ≥60%            | Missing attacks catastrophic; SOC can triage false positives
Energy Optimization    | Precision            | ≥85%         | Recall ≥90%               | Frequent false alarms cause user fatigue and system distrust
Smart Home             | Precision            | ≥90%         | Recall ≥80%               | Users ignore systems with >1 false alarm/month
Medical Monitoring     | Recall               | ≥99.99%      | FPR <0.01%                | False negatives life-threatening; false positives trigger clinician review

Quick Selection Rules:

  • Safety-critical (human life/injury risk): Optimize for recall ≥99%, tolerate FPR <1%
  • Financial loss >$10K per missed anomaly: Optimize for recall ≥95%, precision ≥70%
  • User-facing consumer products: Optimize for precision ≥90%, recall ≥80% (trust matters)
  • High-volume data with human review: Optimize for F1 score (balance efficiency)

Threshold Tuning Strategy:

  1. Plot precision-recall curve for your detector
  2. Identify minimum acceptable recall (business requirement)
  3. Among thresholds meeting recall target, select one maximizing precision
  4. If F1 score is the priority metric, find the threshold at the harmonic-mean peak (see the sketch below)
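
For step 4, the F1-optimal threshold can be read straight off the precision-recall curve. A small sketch with invented labels and scores:

import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical validation labels (0=normal, 1=anomaly) and model scores
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_scores = np.array([0.10, 0.20, 0.15, 0.30, 0.50, 0.35, 0.60, 0.80, 0.70, 0.40])

precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
# Drop the final (recall=0, precision=1) point to align with thresholds
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
best_threshold = thresholds[np.argmax(f1)]
print(f"F1-optimal threshold: {best_threshold:.3f} (F1 = {f1.max():.3f})")
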
Common Mistake: Using Accuracy for Imbalanced Anomaly Detection

The Error: A manufacturing quality control team deploys an anomaly detector for defect detection. They celebrate achieving “99.92% accuracy!” But production line managers notice most defects still reach customers.

The Reality:

  • Dataset: 1,000,000 products/week, 800 actual defects (0.08% defect rate)
  • Detector predicts “normal” for every single product
  • Accuracy = (999,200 correct “normal” predictions) / 1,000,000 = 99.92%
  • But recall = 0% – catches ZERO defects!

Why Accuracy Fails: Accuracy = (TP + TN) / (TP + TN + FP + FN)

For imbalanced data (defects are rare):

  • TN dominates (999,200 true negatives)
  • TP, FP, FN are tiny (800 combined)
  • A “predict everything is normal” model achieves 99.92% accuracy while being completely useless
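
Two lines of arithmetic make the failure concrete, using the numbers above:

# The degenerate "always predict normal" model from this example
TP, FN = 0, 800          # catches none of the 800 defects
FP, TN = 0, 999_200      # never flags a normal product

accuracy = (TP + TN) / (TP + TN + FP + FN)
recall = TP / (TP + FN)
print(f"Accuracy: {accuracy:.2%}")  # 99.92% -- looks excellent
print(f"Recall:   {recall:.0%}")    # 0% -- catches nothing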

Correct Metrics for Anomaly Detection:

  • Precision: Of products flagged as defects, how many truly are? (TP / (TP + FP))
  • Recall: Of actual defects, how many did we catch? (TP / (TP + FN))
  • F1 Score: Harmonic mean balancing precision and recall

Real-World Example - Corrected Approach: Same 1M products, 800 defects. Detector tuned for quality control:

  • TP = 760 (caught 95% of defects)
  • FN = 40 (missed 5%)
  • FP = 2,000 (flagged 2,000 normal products as defects)
  • TN = 997,200

Metrics:

  • Accuracy: 99.8% (misleading – looks nearly identical to the useless model)
  • Precision: 760/2,760 = 27.5% (about 1 in 4 flags is a real defect)
  • Recall: 760/800 = 95% (catches 95% of defects)
  • F1 Score: 0.42

While 27.5% precision means human inspectors review 2,760 products/week (versus 1M without detector), this is acceptable – manual inspection of 2,760 items costs $5,520/week while catching defects worth $152,000 in returns/reputation damage.

Key Lesson: For rare events (anomalies <5% of data), never use accuracy. Use precision, recall, and F1 score. Optimize based on business costs of false positives vs false negatives.

Concept Relationships

Concept relationship diagram showing Anomaly Detection System branching into Detection Algorithm, Performance Metrics, and Threshold Tuning, with cost analysis and business requirements converging on Threshold Selection

How These Concepts Connect:

  • Metrics evaluate algorithms: Precision/recall/F1 measure how well your chosen detection method performs
  • Thresholds bridge algorithms and business: Same algorithm at different thresholds produces different precision/recall trade-offs
  • Cost drives threshold selection: If false negatives cost 100x more than false positives, optimize for recall
  • Temporal persistence reduces false alarms: Requiring 3 consecutive anomalies improves precision while maintaining recall

Common Pitfalls

A detector predicting ‘normal’ for every reading achieves 99.9% accuracy on data with 0.1% anomalies — and catches nothing. Always report precision and recall alongside accuracy.

Lowering the threshold raises recall but floods operators with false alarms. Calculate the cost ratio (missed failure cost / false alarm cost) and set the threshold where expected total cost is minimized.

If you up-sample anomalies to 50/50 for evaluation, reported metrics will not reflect real-world performance. Test on held-out data that preserves the natural class imbalance.

Two alerts 50 ms apart on the same sensor are almost certainly one event. Count grouped alert bursts as single events when computing recall to avoid inflated true-positive counts.
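
A hypothetical burst-grouping helper makes this concrete:

def count_events(alert_times_ms, gap_ms=1000):
    """Collapse alerts separated by less than gap_ms into single events."""
    events, last = 0, None
    for t in sorted(alert_times_ms):
        if last is None or t - last >= gap_ms:
            events += 1
        last = t
    return events

print(count_events([0, 50, 120, 5000, 5040]))  # 2 events, not 5 alerts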

In safety-critical systems, missing an anomaly is catastrophically more expensive than investigating a false alarm. Optimize for recall ≥ 99% first, then tighten precision as operational capacity allows.

15.9 Summary

Performance metrics and threshold tuning are critical for production anomaly detection:

  • Metrics: Use precision, recall, F1 - never accuracy for imbalanced data
  • Threshold Tuning: Based on cost ratio of false negatives to false positives
  • False Alarm Reduction: Temporal persistence and multi-sensor correlation
  • Lab: Multiple methods have different trade-offs - use the right tool for the job

Key Takeaway: Anomaly detection is as much about operational tuning as algorithm selection. The best algorithm with poor thresholds performs worse than a simple method with well-calibrated thresholds.

Sammy the Sensor had a problem. He was guarding the school’s science lab refrigerator, and every time the temperature went up even a tiny bit, he shouted “DANGER! DANGER!” to Lila the LED.

After the first day, the teacher had received 47 false alarms and only 1 real problem (someone left the door open at lunch). The teacher started ignoring all of Sammy’s warnings!

“This is terrible!” cried Bella the Battery. “If there is a REAL emergency, nobody will listen anymore!”

Max the Microcontroller called a team meeting. “We need to learn about something called precision and recall,” he said, drawing on the whiteboard.

“Precision means: when we DO sound the alarm, how often is it a REAL problem? Right now, only 1 out of 48 alarms was real. That is awful precision!”

“Recall means: of all the REAL problems, how many did we catch? We caught 1 out of 1 real problem – perfect recall!”

“So we are great at catching problems but terrible at NOT crying wolf,” Sammy summarized.

“Right!” said Max. “Here is my plan: instead of alarming on ONE high reading, we wait for THREE readings in a row. That way, a tiny blip will not trigger an alarm, but a real problem – like a door left open – will still get caught.”

The next week, Sammy only raised 3 alarms – and all 3 were real problems! The teacher started trusting the alerts again.

Key lesson: A detector that cries wolf too often gets ignored. The best alarm systems balance catching real problems (recall) with not raising false alarms (precision)!

15.10 What’s Next

If you want to…                            | Read this
Understand the full detection pipeline     | Real-Time Anomaly Pipelines
Learn the anomaly classification framework | Types of Anomalies
Apply lightweight edge detection           | Statistical Methods
Deploy ML-based detection                  | Machine Learning Approaches
Return to the module overview              | Anomaly Detection Overview