Evaluate Detection Performance: Use precision, recall, F1, and confusion matrices for imbalanced data
Tune Thresholds: Optimize detection thresholds based on business costs
Reduce False Alarms: Apply temporal persistence and multi-sensor correlation
Build a Detection System: Implement multi-method anomaly detection on embedded hardware
In 60 Seconds
Standard accuracy is a misleading metric for anomaly detection because normal readings dominate; precision, recall, and F1-score reveal whether your system actually finds rare critical events. Tuning the detection threshold is ultimately a business decision driven by the relative cost of missed anomalies versus false alarms.
For Beginners: Anomaly Detection Metrics
Anomaly detection metrics are like a report card for your alarm system. They measure how often it correctly catches real problems versus how often it cries wolf. Getting this balance right is crucial – too many false alarms and people start ignoring alerts, too few and real problems slip through unnoticed.
Core Concept: Standard accuracy is meaningless for anomaly detection. A detector that always says “normal” achieves 99.9% accuracy but catches zero anomalies. Use precision, recall, and F1 instead.
Why It Matters: The cost of a missed anomaly (false negative) is often 10-100x the cost of a false alarm (false positive). Threshold tuning is a business decision, not a statistical one.
Key Takeaway: Set thresholds based on the cost ratio of false negatives to false positives. For safety-critical systems, optimize for recall (>99%). For consumer systems, optimize for precision (>80%).
15.2 Prerequisites
Before diving into this chapter, you should be familiar with:
How do you know if your anomaly detector is working well? Standard accuracy is misleading for imbalanced data (99.9% normal, 0.1% anomalies). This chapter introduces the metrics that actually matter for anomaly detection – precision, recall, and F1 score – and shows how to tune detection thresholds based on the real-world costs of false alarms versus missed anomalies. You will also build a multi-method anomaly detection system on an ESP32 to see these concepts in action.
15.4 The Fundamental Trade-Off
Sensitivity vs Specificity:
High Sensitivity (Recall): Catch all anomalies, but many false alarms
High Specificity: Few false alarms, but miss some anomalies
False Positive: False alarm - wasted investigation time, alert fatigue
False Negative: Miss critical failure - equipment damage, safety risk
The balance depends on domain:
| Domain | Priority | Target Metrics | Rationale |
|---|---|---|---|
| Industrial Safety | Recall | >99% recall | Cannot miss critical failures |
| Consumer IoT | Precision | >80% precision | Users ignore frequent false alarms |
| Predictive Maintenance | Balanced | F1 > 0.85 | Balance early detection vs maintenance costs |
15.5 Key Metrics Explained
Confusion Matrix:
|  | Predicted Normal | Predicted Anomaly |
|---|---|---|
| Actual Normal | TN | FP (False Positive = False Alarm) |
| Actual Anomaly | FN (False Negative = Missed Anomaly) | TP |
Derived Metrics:
Precision (Positive Predictive Value)
Precision = TP / (TP + FP)
"Of alerts raised, how many were real anomalies?"
High precision means few false alarms
Critical for systems with alert fatigue risk
Recall (Sensitivity, True Positive Rate)
Recall = TP / (TP + FN)
"Of real anomalies, how many did we catch?"
High recall means don’t miss critical events
Critical for safety systems
F1 Score (Harmonic Mean)
F1 = 2 x (Precision x Recall) / (Precision + Recall)
Balanced metric when both precision and recall matter
Single number for model comparison
False Positive Rate
FPR = FP / (FP + TN)
"Of normal samples, how many did we incorrectly flag?"
Critical for operational burden
Target: <0.1% for industrial (1 false alarm per 1000 samples)
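To make the definitions concrete, here is a small helper that derives all four metrics from raw confusion-matrix counts. The example counts are illustrative, not from any case study in this chapter:

```python
def detection_metrics(tp, fp, fn, tn):
    """Derive precision, recall, F1, and FPR from confusion-matrix counts."""
    precision = tp / (tp + fp)   # Of alerts raised, how many were real?
    recall = tp / (tp + fn)      # Of real anomalies, how many did we catch?
    f1 = 2 * precision * recall / (precision + recall)
    fpr = fp / (fp + tn)         # Of normal samples, how many were flagged?
    return precision, recall, f1, fpr

# Illustrative: 95 of 100 anomalies caught, 50 false alarms among 100,000 normals
p, r, f1, fpr = detection_metrics(tp=95, fp=50, fn=5, tn=99_950)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f} fpr={fpr:.5f}")
```

Note how a detector with 95% recall and a 0.05% false positive rate still has only ~66% precision here, because normal samples vastly outnumber anomalies.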
Putting Numbers to It
The cost-optimal threshold balances false positive and false negative costs. For cost \(C_{FP}\) per false alarm and \(C_{FN}\) per missed anomaly, minimize the expected total cost \(C_{FP} \cdot FP + C_{FN} \cdot FN\):
Example: Predictive maintenance with \(C_{FP} = \$250\) (truck roll) and \(C_{FN} = \$8,500\) (emergency repair). Current model over 4 weeks: 95% recall, 20 real failures in 1M readings:
```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def find_optimal_threshold(y_true, y_scores, target_recall=0.95):
    """
    Find threshold that achieves target recall while maximizing precision
    y_true: actual labels (0=normal, 1=anomaly)
    y_scores: anomaly scores from model
    target_recall: minimum recall required
    """
    precision, recall, thresholds = precision_recall_curve(y_true, y_scores)

    # precision and recall have len(thresholds)+1 elements;
    # drop the last element (recall=0, precision=1) to align with thresholds
    precision = precision[:-1]
    recall = recall[:-1]

    # Find thresholds that meet recall target
    valid_mask = recall >= target_recall
    if not any(valid_mask):
        print(f"Cannot achieve {target_recall} recall")
        return None

    # Among valid thresholds, pick one with best precision
    best_idx = np.argmax(precision[valid_mask])
    valid_thresholds = thresholds[valid_mask]
    valid_precisions = precision[valid_mask]
    best_threshold = valid_thresholds[best_idx]
    best_precision = valid_precisions[best_idx]

    print(f"Threshold: {best_threshold:.3f}")
    print(f"Achieves: {target_recall:.1%} recall, {best_precision:.1%} precision")
    return best_threshold
```
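Plugging the example costs into the cost objective is a one-liner. The false-alarm count of 150 below is a hypothetical figure chosen for illustration, not a number from the chapter:

```python
def expected_cost(fp, fn, c_fp=250.0, c_fn=8500.0):
    """Total dollar cost of detection errors over an evaluation window."""
    return fp * c_fp + fn * c_fn

# 20 real failures at 95% recall -> 1 missed (FN = 1); FP = 150 is hypothetical
print(expected_cost(fp=150, fn=1))  # 150*250 + 1*8500 = 46000.0
```

Sweeping this cost over candidate thresholds, instead of fixing a recall target, gives the threshold that minimizes expected total cost.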
Worked Example: Tuning Anomaly Detection for HVAC Predictive Maintenance
A facilities management company monitors 500 commercial HVAC units using vibration sensors. Initial deployment: Isolation Forest model with default threshold yields 95% recall but only 10% precision, generating 171 false alarms over 4 weeks.
Results: Tuned system saves $44,850 per 4-week period ($11,213/week) versus initial deployment, while maintaining 98% recall to catch critical failures before emergency breakdowns.
Tradeoff Decision Guide: Statistical vs ML Anomaly Detection
| Factor | Statistical (Z-score/IQR) | ML (Isolation Forest/Autoencoder) | When to Choose |
|---|---|---|---|
| Compute Requirements | Minimal (<1KB RAM) | Significant (MB-GB RAM) | Statistical for edge devices; ML for cloud/gateway |
| Training Data Needed | None (online calculation) | 1000+ normal samples | Statistical for cold-start; ML with historical data |
| Interpretability | High (clear thresholds) | Low (black-box scores) | Statistical for regulated/auditable systems |
| Multivariate Patterns | Poor (single variable) | Excellent (cross-sensor) | ML for complex correlations; statistical for single sensors |
| Concept Drift Handling | Manual threshold updates | Automatic with retraining | Statistical with domain expertise; ML for autonomous |
| False Positive Rate | Higher (simple rules) | Lower (learned patterns) | ML when false alarm cost is high |
| Setup Time | Minutes | Days to weeks | Statistical for rapid deployment; ML for mature systems |
Quick Decision Rule: Start with Z-score/IQR for immediate value with minimal setup; graduate to ML methods only when you have sufficient training data AND the false positive reduction justifies the computational and maintenance overhead.
15.8 Lab: Build an Anomaly Detection System
~45 min | Intermediate | P10.C01.LAB01
15.8.1 Learning Objectives
By completing this hands-on lab, you will be able to:
Implement Z-score based anomaly detection on embedded hardware
Build a moving average baseline for adaptive thresholds
Apply IQR (Interquartile Range) method for robust outlier detection
Design threshold-based alerts with hysteresis to reduce false positives
Compare different anomaly detection methods and understand their tradeoffs
Visualize anomaly detection decisions in real-time
What You’ll Build
A complete anomaly detection system on ESP32 that demonstrates four detection methods running simultaneously: Z-score (statistical), Moving Average deviation, IQR-based outliers, and threshold with hysteresis. You’ll see how each method responds differently to the same sensor data, helping you understand when to use each approach in production IoT systems.
15.8.2 Anomaly Detection Methods Demonstrated
This lab implements several key anomaly detection patterns:
| Method | How It Works | Strengths | Weaknesses |
|---|---|---|---|
| Z-Score | Measures standard deviations from mean | Mathematically rigorous, well-understood | Assumes normal distribution, sensitive to outliers in baseline |
| Moving Average | Compares to rolling baseline | Adapts to slow changes, simple | Lag in detection, window size tuning needed |
| IQR (Interquartile Range) | Uses quartiles for robust bounds | Resistant to outliers, no distribution assumption | Requires sorted data buffer, higher memory |
| Threshold + Hysteresis | Fixed bounds with entry/exit gap | Prevents oscillation, deterministic | Requires domain knowledge for thresholds |
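As a sketch of how the first and last methods combine, here is a sliding-window Z-score detector with entry/exit hysteresis. The window size and z limits are illustrative defaults, and the Python is for clarity only; the same logic ports directly to C on the ESP32:

```python
from collections import deque
import statistics

class ZScoreDetector:
    """Sliding-window Z-score with hysteresis (illustrative parameters)."""

    def __init__(self, window=50, enter_z=3.0, exit_z=2.0):
        self.buf = deque(maxlen=window)   # rolling baseline of recent readings
        self.enter_z = enter_z            # z needed to raise the alarm
        self.exit_z = exit_z              # z below which the alarm clears
        self.alarm = False

    def update(self, x):
        if len(self.buf) >= 2:
            mu = statistics.fmean(self.buf)
            sd = statistics.pstdev(self.buf) or 1e-9  # guard against sd == 0
            z = abs(x - mu) / sd
            # Hysteresis: different thresholds for entering vs leaving alarm state
            self.alarm = (z >= self.exit_z) if self.alarm else (z > self.enter_z)
        self.buf.append(x)
        return self.alarm

det = ZScoreDetector(window=20)
for _ in range(20):
    det.update(20.0)          # stable baseline, no alarm
print(det.update(100.0))      # spike: True
print(det.update(20.0))       # back near baseline, z < exit_z: False
```

The gap between `enter_z` and `exit_z` is what prevents a reading hovering near the threshold from toggling the alarm on and off every sample.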
15.8.3 Wokwi Simulator
Use the embedded simulator below to build your anomaly detection system:
Threshold selection strategies:
Among thresholds meeting the recall target, select the one that maximizes precision
If F1 score is the priority metric, choose the threshold where F1 (the harmonic mean) peaks
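Both strategies reduce to a sweep over candidate thresholds. The sketch below implements the F1-peak version in plain NumPy; the toy labels and scores are made up for illustration:

```python
import numpy as np

def f1_peak_threshold(y_true, y_scores):
    """Return the threshold (score >= threshold -> anomaly) that maximizes F1."""
    y_true = np.asarray(y_true)
    y_scores = np.asarray(y_scores)
    best_thr, best_f1 = None, -1.0
    for thr in np.unique(y_scores):      # each distinct score is a candidate
        pred = y_scores >= thr
        tp = np.sum(pred & (y_true == 1))
        fp = np.sum(pred & (y_true == 0))
        fn = np.sum(~pred & (y_true == 1))
        if tp == 0:
            continue                     # F1 undefined with no true positives
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_thr, best_f1 = thr, f1
    return best_thr, best_f1

# Toy data: the two anomalies score well above the four normals
thr, f1 = f1_peak_threshold([0, 0, 0, 0, 1, 1], [0.1, 0.2, 0.3, 0.4, 0.8, 0.9])
print(thr, f1)  # 0.8 1.0
```

For the recall-constrained strategy, the same loop would skip candidates whose recall falls below the target and maximize precision instead of F1.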
Common Mistake: Using Accuracy for Imbalanced Anomaly Detection
The Error: A manufacturing quality control team deploys an anomaly detector for defect detection. They celebrate achieving “99.92% accuracy!” But production line managers notice most defects still reach customers.
The Reality:
Dataset: 1,000,000 products/week, 800 actual defects (0.08% defect rate)
Detector predicts “normal” for every single product
For imbalanced data (defects are rare):
- TN dominates (999,200 true negatives)
- TP, FP, FN are tiny (800 combined)
- A “predict everything is normal” model achieves 99.92% accuracy while being completely useless
Correct Metrics for Anomaly Detection:
Precision: Of products flagged as defects, how many truly are? (TP / (TP + FP))
Recall: Of actual defects, how many did we catch? (TP / (TP + FN))
F1 Score: Harmonic mean balancing precision and recall
Real-World Example - Corrected Approach: Same 1M products, 800 defects. Detector tuned for quality control:
- TP = 760 (caught 95% of defects)
- FN = 40 (missed 5%)
- FP = 2,000 (flagged 2,000 normal products as defects)
- TN = 997,200
Metrics:
- Accuracy: 99.8% (misleading – looks nearly identical to useless model)
- Precision: 760/2,760 = 27.5% (roughly 1 in 4 flags is a real defect)
- Recall: 760/800 = 95% (catches 95% of defects)
- F1 Score: 0.43
While 27.5% precision means human inspectors review 2,760 products/week (versus 1M without detector), this is acceptable – manual inspection of 2,760 items costs $5,520/week while catching defects worth $152,000 in returns/reputation damage.
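The corrected-approach numbers above can be reproduced in a few lines:

```python
# Confusion-matrix counts from the corrected quality-control example
tp, fn, fp, tn = 760, 40, 2_000, 997_200

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"{accuracy:.4f}")   # 0.9980 -- still looks excellent, hence misleading
print(f"{precision:.3f}")  # 0.275
print(f"{recall:.2f}")     # 0.95
print(f"{f1:.2f}")         # 0.43
```

Accuracy barely moves between the useless detector and the tuned one; precision and recall are what actually distinguish them.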
Key Lesson: For rare events (anomalies <5% of data), never use accuracy. Use precision, recall, and F1 score. Optimize based on business costs of false positives vs false negatives.
Concept Relationships
How These Concepts Connect:
Metrics evaluate algorithms: Precision/recall/F1 measure how well your chosen detection method performs
Thresholds bridge algorithms and business: Same algorithm at different thresholds produces different precision/recall trade-offs
Cost drives threshold selection: If false negatives cost 100x more than false positives, optimize for recall
1. Reporting accuracy instead of precision and recall
A detector predicting ‘normal’ for every reading achieves 99.9% accuracy on data with 0.1% anomalies — and catches nothing. Always report precision and recall alongside accuracy.
2. Choosing a threshold without a cost model
Lowering the threshold raises recall but floods operators with false alarms. Calculate the cost ratio (missed failure cost / false alarm cost) and set the threshold where expected total cost is minimised.
3. Evaluating on balanced test sets
If you up-sample anomalies to 50/50 for evaluation, reported metrics will not reflect real-world performance. Test on held-out data that preserves the natural class imbalance.
4. Ignoring temporal correlation of alerts
Two alerts 50 ms apart on the same sensor are almost certainly one event. Count grouped alert bursts as single events when computing recall to avoid inflated true-positive counts.
5. Optimising for F1 when recall matters more
In safety-critical systems, missing an anomaly is catastrophically more expensive than investigating a false alarm. Optimise for recall ≥ 99% first, then tighten precision as operational capacity allows.
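The grouping described in pitfall 4 can be sketched as a gap-based clustering of alert timestamps; the 1-second gap below is an illustrative choice that should be tuned to the sensor's event duration:

```python
def group_alerts(timestamps_ms, gap_ms=1000):
    """Collapse alerts closer than gap_ms into single events."""
    events = []
    for t in sorted(timestamps_ms):
        if not events or t - events[-1][-1] > gap_ms:
            events.append([t])        # start a new event
        else:
            events[-1].append(t)      # same burst as the previous alert
    return events

# Alerts at 0 ms and 50 ms are one event; the alert at 5000 ms is separate
print(len(group_alerts([0, 50, 5000])))  # 2
```

Computing recall over grouped events rather than raw alerts prevents one detected failure from counting as several true positives.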
15.9 Summary
Performance metrics and threshold tuning are critical for production anomaly detection:
Metrics: Use precision, recall, F1 - never accuracy for imbalanced data
Threshold Tuning: Based on cost ratio of false negatives to false positives
False Alarm Reduction: Temporal persistence and multi-sensor correlation
Lab: Multiple methods have different trade-offs - use the right tool for the job
Key Takeaway: Anomaly detection is as much about operational tuning as algorithm selection. The best algorithm with poor thresholds performs worse than a simple method with well-calibrated thresholds.
For Kids: Meet the Sensor Squad!
Sammy the Sensor had a problem. He was guarding the school’s science lab refrigerator, and every time the temperature went up even a tiny bit, he shouted “DANGER! DANGER!” to Lila the LED.
After the first day, the teacher had received 47 false alarms and only 1 real problem (someone left the door open at lunch). The teacher started ignoring all of Sammy’s warnings!
“This is terrible!” cried Bella the Battery. “If there is a REAL emergency, nobody will listen anymore!”
Max the Microcontroller called a team meeting. “We need to learn about something called precision and recall,” he said, drawing on the whiteboard.
“Precision means: when we DO sound the alarm, how often is it a REAL problem? Right now, only 1 out of 48 alarms was real. That is awful precision!”
“Recall means: of all the REAL problems, how many did we catch? We caught 1 out of 1 real problem – perfect recall!”
“So we are great at catching problems but terrible at NOT crying wolf,” Sammy summarized.
“Right!” said Max. “Here is my plan: instead of alarming on ONE high reading, we wait for THREE readings in a row. That way, a tiny blip will not trigger an alarm, but a real problem – like a door left open – will still get caught.”
The next week, Sammy only raised 3 alarms – and all 3 were real problems! The teacher started trusting the alerts again.
Key lesson: A detector that cries wolf too often gets ignored. The best alarm systems balance catching real problems (recall) with not raising false alarms (precision)!