Evaluate Detection Performance: Use precision, recall, F1, and confusion matrices for imbalanced data
Tune Thresholds: Optimize detection thresholds based on business costs
Reduce False Alarms: Apply temporal persistence and multi-sensor correlation
Build a Detection System: Implement multi-method anomaly detection on embedded hardware
In 60 Seconds
Standard accuracy is a misleading metric for anomaly detection because normal readings dominate; precision, recall, and F1-score reveal whether your system actually finds rare critical events. Tuning the detection threshold is ultimately a business decision driven by the relative cost of missed anomalies versus false alarms.
For Beginners: Anomaly Detection Metrics
Anomaly detection metrics are like a report card for your alarm system. They measure how often it correctly catches real problems versus how often it cries wolf. Getting this balance right is crucial – too many false alarms and people start ignoring alerts, too few and real problems slip through unnoticed.
Core Concept: Standard accuracy is meaningless for anomaly detection. A detector that always says “normal” achieves 99.9% accuracy but catches zero anomalies. Use precision, recall, and F1 instead.
Why It Matters: The cost of a missed anomaly (false negative) is often 10-100x the cost of a false alarm (false positive). Threshold tuning is a business decision, not a statistical one.
Key Takeaway: Set thresholds based on the cost ratio of false negatives to false positives. For safety-critical systems, optimize for recall (>99%). For consumer systems, optimize for precision (>80%).
15.2 Prerequisites
Before diving into this chapter, you should be familiar with:
How do you know if your anomaly detector is working well? Standard accuracy is misleading for imbalanced data (99.9% normal, 0.1% anomalies). This chapter introduces the metrics that actually matter for anomaly detection – precision, recall, and F1 score – and shows how to tune detection thresholds based on the real-world costs of false alarms versus missed anomalies. You will also build a multi-method anomaly detection system on an ESP32 to see these concepts in action.
15.4 The Fundamental Trade-Off
Sensitivity vs Specificity:
High Sensitivity (Recall): Catch all anomalies, but many false alarms
High Specificity: Few false alarms, but miss some anomalies
False Positive: False alarm - wasted investigation time, alert fatigue
False Negative: Miss critical failure - equipment damage, safety risk
The balance depends on domain:
| Domain | Priority | Target Metrics | Rationale |
|---|---|---|---|
| Industrial Safety | Recall | >99% recall | Cannot miss critical failures |
| Consumer IoT | Precision | >80% precision | Users ignore frequent false alarms |
| Predictive Maintenance | Balanced | F1 > 0.85 | Balance early detection vs maintenance costs |
15.5 Key Metrics Explained
Confusion Matrix:
|  | Predicted Normal | Predicted Anomaly |
|---|---|---|
| Actual Normal | TN | FP (False Positive = False Alarm) |
| Actual Anomaly | FN (False Negative = Missed Anomaly) | TP |
Derived Metrics:
Precision (Positive Predictive Value)
Precision = TP / (TP + FP)
"Of alerts raised, how many were real anomalies?"
High precision means few false alarms
Critical for systems with alert fatigue risk
Recall (Sensitivity, True Positive Rate)
Recall = TP / (TP + FN)
"Of real anomalies, how many did we catch?"
High recall means don’t miss critical events
Critical for safety systems
F1 Score (Harmonic Mean)
F1 = 2 x (Precision x Recall) / (Precision + Recall)
Balanced metric when both precision and recall matter
Single number for model comparison
False Positive Rate
FPR = FP / (FP + TN)
"Of normal samples, how many did we incorrectly flag?"
Critical for operational burden
Target: <0.1% for industrial (1 false alarm per 1000 samples)
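To make the definitions concrete, here is a small helper that derives all four metrics from raw confusion-matrix counts. The example counts are illustrative, not from any case study in this chapter:

```python
def detection_metrics(tp, fp, fn, tn):
    """Derive precision, recall, F1, and FPR from confusion-matrix counts."""
    precision = tp / (tp + fp)   # Of alerts raised, how many were real?
    recall = tp / (tp + fn)      # Of real anomalies, how many did we catch?
    f1 = 2 * precision * recall / (precision + recall)
    fpr = fp / (fp + tn)         # Of normal samples, how many were flagged?
    return precision, recall, f1, fpr

# Illustrative: 95 of 100 anomalies caught, 50 false alarms among 100,000 normals
p, r, f1, fpr = detection_metrics(tp=95, fp=50, fn=5, tn=99_950)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f} fpr={fpr:.5f}")
```

Note how a detector with 95% recall and a 0.05% false positive rate still has only ~66% precision here, because normal samples vastly outnumber anomalies.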
Putting Numbers to It
The cost-optimal threshold balances false positive and false negative costs. For cost \(C_{FP}\) per false alarm and \(C_{FN}\) per missed anomaly, minimize the expected total cost \(C_{FP} \cdot FP + C_{FN} \cdot FN\):
Example: Predictive maintenance with \(C_{FP} = \$250\) (truck roll) and \(C_{FN} = \$8,500\) (emergency repair). Current model over 4 weeks: 95% recall, 20 real failures in 1M readings:
```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def find_optimal_threshold(y_true, y_scores, target_recall=0.95):
    """
    Find threshold that achieves target recall while maximizing precision
    y_true: actual labels (0=normal, 1=anomaly)
    y_scores: anomaly scores from model
    target_recall: minimum recall required
    """
    precision, recall, thresholds = precision_recall_curve(y_true, y_scores)

    # precision and recall have len(thresholds)+1 elements;
    # drop the last element (recall=0, precision=1) to align with thresholds
    precision = precision[:-1]
    recall = recall[:-1]

    # Find thresholds that meet recall target
    valid_mask = recall >= target_recall
    if not any(valid_mask):
        print(f"Cannot achieve {target_recall} recall")
        return None

    # Among valid thresholds, pick one with best precision
    best_idx = np.argmax(precision[valid_mask])
    valid_thresholds = thresholds[valid_mask]
    valid_precisions = precision[valid_mask]
    best_threshold = valid_thresholds[best_idx]
    best_precision = valid_precisions[best_idx]

    print(f"Threshold: {best_threshold:.3f}")
    print(f"Achieves: {target_recall:.1%} recall, {best_precision:.1%} precision")
    return best_threshold
```
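Plugging the example costs into the cost objective is a one-liner. The false-alarm count of 150 below is a hypothetical figure chosen for illustration, not a number from the chapter:

```python
def expected_cost(fp, fn, c_fp=250.0, c_fn=8500.0):
    """Total dollar cost of detection errors over an evaluation window."""
    return fp * c_fp + fn * c_fn

# 20 real failures at 95% recall -> 1 missed (FN = 1); FP = 150 is hypothetical
print(expected_cost(fp=150, fn=1))  # 150*250 + 1*8500 = 46000.0
```

Sweeping this cost over candidate thresholds, instead of fixing a recall target, gives the threshold that minimizes expected total cost.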
Worked Example: Tuning Anomaly Detection for HVAC Predictive Maintenance
A facilities management company monitors 500 commercial HVAC units using vibration sensors. Initial deployment: Isolation Forest model with default threshold yields 95% recall but only 10% precision, generating 171 false alarms over 4 weeks.
Results: Tuned system saves $44,850 per 4-week period ($11,213/week) versus initial deployment, while maintaining 98% recall to catch critical failures before emergency breakdowns.
Tradeoff Decision Guide: Statistical vs ML Anomaly Detection
| Factor | Statistical (Z-score/IQR) | ML (Isolation Forest/Autoencoder) | When to Choose |
|---|---|---|---|
| Compute Requirements | Minimal (<1KB RAM) | Significant (MB-GB RAM) | Statistical for edge devices; ML for cloud/gateway |
| Training Data Needed | None (online calculation) | 1000+ normal samples | Statistical for cold-start; ML with historical data |
| Interpretability | High (clear thresholds) | Low (black-box scores) | Statistical for regulated/auditable systems |
| Multivariate Patterns | Poor (single variable) | Excellent (cross-sensor) | ML for complex correlations; statistical for single sensors |
| Concept Drift Handling | Manual threshold updates | Automatic with retraining | Statistical with domain expertise; ML for autonomous |
| False Positive Rate | Higher (simple rules) | Lower (learned patterns) | ML when false alarm cost is high |
| Setup Time | Minutes | Days to weeks | Statistical for rapid deployment; ML for mature systems |
Quick Decision Rule: Start with Z-score/IQR for immediate value with minimal setup; graduate to ML methods only when you have sufficient training data AND the false positive reduction justifies the computational and maintenance overhead.
15.8 Lab: Build an Anomaly Detection System
~45 min | Intermediate | P10.C01.LAB01
15.8.1 Learning Objectives
By completing this hands-on lab, you will be able to:
Implement Z-score based anomaly detection on embedded hardware
Build a moving average baseline for adaptive thresholds
Apply IQR (Interquartile Range) method for robust outlier detection
Design threshold-based alerts with hysteresis to reduce false positives
Compare different anomaly detection methods and understand their tradeoffs
Visualize anomaly detection decisions in real-time
What You’ll Build
A complete anomaly detection system on ESP32 that demonstrates four detection methods running simultaneously: Z-score (statistical), Moving Average deviation, IQR-based outliers, and threshold with hysteresis. You’ll see how each method responds differently to the same sensor data, helping you understand when to use each approach in production IoT systems.
15.8.2 Anomaly Detection Methods Demonstrated
This lab implements several key anomaly detection patterns:
| Method | How It Works | Strengths | Weaknesses |
|---|---|---|---|
| Z-Score | Measures standard deviations from mean | Mathematically rigorous, well-understood | Assumes normal distribution, sensitive to outliers in baseline |
| Moving Average | Compares to rolling baseline | Adapts to slow changes, simple | Lag in detection, window size tuning needed |
| IQR (Interquartile Range) | Uses quartiles for robust bounds | Resistant to outliers, no distribution assumption | Requires sorted data buffer, higher memory |
| Threshold + Hysteresis | Fixed bounds with entry/exit gap | Prevents oscillation, deterministic | Requires domain knowledge for thresholds |
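As a sketch of how the first and last methods combine, here is a sliding-window Z-score detector with entry/exit hysteresis. The window size and z limits are illustrative defaults, and the Python is for clarity only; the same logic ports directly to C on the ESP32:

```python
from collections import deque
import statistics

class ZScoreDetector:
    """Sliding-window Z-score with hysteresis (illustrative parameters)."""

    def __init__(self, window=50, enter_z=3.0, exit_z=2.0):
        self.buf = deque(maxlen=window)   # rolling baseline of recent readings
        self.enter_z = enter_z            # z needed to raise the alarm
        self.exit_z = exit_z              # z below which the alarm clears
        self.alarm = False

    def update(self, x):
        if len(self.buf) >= 2:
            mu = statistics.fmean(self.buf)
            sd = statistics.pstdev(self.buf) or 1e-9  # guard against sd == 0
            z = abs(x - mu) / sd
            # Hysteresis: different thresholds for entering vs leaving alarm state
            self.alarm = (z >= self.exit_z) if self.alarm else (z > self.enter_z)
        self.buf.append(x)
        return self.alarm

det = ZScoreDetector(window=20)
for _ in range(20):
    det.update(20.0)          # stable baseline, no alarm
print(det.update(100.0))      # spike: True
print(det.update(20.0))       # back near baseline, z < exit_z: False
```

The gap between `enter_z` and `exit_z` is what prevents a reading hovering near the threshold from toggling the alarm on and off every sample.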
15.8.3 Wokwi Simulator
Use the embedded simulator below to build your anomaly detection system:
Threshold selection strategies:
Among thresholds meeting the recall target, select the one that maximizes precision
If F1 score is the priority metric, choose the threshold where F1 (the harmonic mean) peaks
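Both strategies reduce to a sweep over candidate thresholds. The sketch below implements the F1-peak version in plain NumPy; the toy labels and scores are made up for illustration:

```python
import numpy as np

def f1_peak_threshold(y_true, y_scores):
    """Return the threshold (score >= threshold -> anomaly) that maximizes F1."""
    y_true = np.asarray(y_true)
    y_scores = np.asarray(y_scores)
    best_thr, best_f1 = None, -1.0
    for thr in np.unique(y_scores):      # each distinct score is a candidate
        pred = y_scores >= thr
        tp = np.sum(pred & (y_true == 1))
        fp = np.sum(pred & (y_true == 0))
        fn = np.sum(~pred & (y_true == 1))
        if tp == 0:
            continue                     # F1 undefined with no true positives
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_thr, best_f1 = thr, f1
    return best_thr, best_f1

# Toy data: the two anomalies score well above the four normals
thr, f1 = f1_peak_threshold([0, 0, 0, 0, 1, 1], [0.1, 0.2, 0.3, 0.4, 0.8, 0.9])
print(thr, f1)  # 0.8 1.0
```

For the recall-constrained strategy, the same loop would skip candidates whose recall falls below the target and maximize precision instead of F1.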
Common Mistake: Using Accuracy for Imbalanced Anomaly Detection
The Error: A manufacturing quality control team deploys an anomaly detector for defect detection. They celebrate achieving “99.92% accuracy!” But production line managers notice most defects still reach customers.
The Reality:
Dataset: 1,000,000 products/week, 800 actual defects (0.08% defect rate)
Detector predicts “normal” for every single product
For imbalanced data (defects are rare):
- TN dominates (999,200 true negatives)
- TP, FP, FN are tiny (800 combined)
- A “predict everything is normal” model achieves 99.92% accuracy while being completely useless
Correct Metrics for Anomaly Detection:
Precision: Of products flagged as defects, how many truly are? (TP / (TP + FP))
Recall: Of actual defects, how many did we catch? (TP / (TP + FN))
F1 Score: Harmonic mean balancing precision and recall
Real-World Example - Corrected Approach: Same 1M products, 800 defects. Detector tuned for quality control:
- TP = 760 (caught 95% of defects)
- FN = 40 (missed 5%)
- FP = 2,000 (flagged 2,000 normal products as defects)
- TN = 997,200
Metrics:
- Accuracy: 99.8% (misleading – looks nearly identical to useless model)
- Precision: 760/2,760 = 27.5% (roughly 1 in 4 flags is a real defect)
- Recall: 760/800 = 95% (catches 95% of defects)
- F1 Score: 0.43
While 27.5% precision means human inspectors review 2,760 products/week (versus 1M without detector), this is acceptable – manual inspection of 2,760 items costs $5,520/week while catching defects worth $152,000 in returns/reputation damage.
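The corrected-approach numbers above can be reproduced in a few lines:

```python
# Confusion-matrix counts from the corrected quality-control example
tp, fn, fp, tn = 760, 40, 2_000, 997_200

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"{accuracy:.4f}")   # 0.9980 -- still looks excellent, hence misleading
print(f"{precision:.3f}")  # 0.275
print(f"{recall:.2f}")     # 0.95
print(f"{f1:.2f}")         # 0.43
```

Accuracy barely moves between the useless detector and the tuned one; precision and recall are what actually distinguish them.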
Key Lesson: For rare events (anomalies <5% of data), never use accuracy. Use precision, recall, and F1 score. Optimize based on business costs of false positives vs false negatives.
Concept Relationships
How These Concepts Connect:
Metrics evaluate algorithms: Precision/recall/F1 measure how well your chosen detection method performs
Thresholds bridge algorithms and business: Same algorithm at different thresholds produces different precision/recall trade-offs
Cost drives threshold selection: If false negatives cost 100x more than false positives, optimize for recall
1. Reporting accuracy instead of precision and recall
A detector predicting ‘normal’ for every reading achieves 99.9% accuracy on data with 0.1% anomalies — and catches nothing. Always report precision and recall alongside accuracy.
2. Choosing a threshold without a cost model
Lowering the threshold raises recall but floods operators with false alarms. Calculate the cost ratio (missed failure cost / false alarm cost) and set the threshold where expected total cost is minimised.
3. Evaluating on balanced test sets
If you up-sample anomalies to 50/50 for evaluation, reported metrics will not reflect real-world performance. Test on held-out data that preserves the natural class imbalance.
4. Ignoring temporal correlation of alerts
Two alerts 50 ms apart on the same sensor are almost certainly one event. Count grouped alert bursts as single events when computing recall to avoid inflated true-positive counts.
5. Optimising for F1 when recall matters more
In safety-critical systems, missing an anomaly is catastrophically more expensive than investigating a false alarm. Optimise for recall ≥ 99% first, then tighten precision as operational capacity allows.
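The grouping described in pitfall 4 can be sketched as a gap-based clustering of alert timestamps; the 1-second gap below is an illustrative choice that should be tuned to the sensor's event duration:

```python
def group_alerts(timestamps_ms, gap_ms=1000):
    """Collapse alerts closer than gap_ms into single events."""
    events = []
    for t in sorted(timestamps_ms):
        if not events or t - events[-1][-1] > gap_ms:
            events.append([t])        # start a new event
        else:
            events[-1].append(t)      # same burst as the previous alert
    return events

# Alerts at 0 ms and 50 ms are one event; the alert at 5000 ms is separate
print(len(group_alerts([0, 50, 5000])))  # 2
```

Computing recall over grouped events rather than raw alerts prevents one detected failure from counting as several true positives.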
15.9 Summary
Performance metrics and threshold tuning are critical for production anomaly detection:
Metrics: Use precision, recall, F1 - never accuracy for imbalanced data
Threshold Tuning: Based on cost ratio of false negatives to false positives
False Alarm Reduction: Temporal persistence and multi-sensor correlation
Lab: Multiple methods have different trade-offs - use the right tool for the job
Key Takeaway: Anomaly detection is as much about operational tuning as algorithm selection. The best algorithm with poor thresholds performs worse than a simple method with well-calibrated thresholds.
For Kids: Meet the Sensor Squad!
Sammy the Sensor had a problem. He was guarding the school’s science lab refrigerator, and every time the temperature went up even a tiny bit, he shouted “DANGER! DANGER!” to Lila the LED.
After the first day, the teacher had received 47 false alarms and only 1 real problem (someone left the door open at lunch). The teacher started ignoring all of Sammy’s warnings!
“This is terrible!” cried Bella the Battery. “If there is a REAL emergency, nobody will listen anymore!”
Max the Microcontroller called a team meeting. “We need to learn about something called precision and recall,” he said, drawing on the whiteboard.
“Precision means: when we DO sound the alarm, how often is it a REAL problem? Right now, only 1 out of 48 alarms was real. That is awful precision!”
“Recall means: of all the REAL problems, how many did we catch? We caught 1 out of 1 real problem – perfect recall!”
“So we are great at catching problems but terrible at NOT crying wolf,” Sammy summarized.
“Right!” said Max. “Here is my plan: instead of alarming on ONE high reading, we wait for THREE readings in a row. That way, a tiny blip will not trigger an alarm, but a real problem – like a door left open – will still get caught.”
The next week, Sammy only raised 3 alarms – and all 3 were real problems! The teacher started trusting the alerts again.
Key lesson: A detector that cries wolf too often gets ignored. The best alarm systems balance catching real problems (recall) with not raising false alarms (precision)!