Design end-to-end anomaly detection pipelines balancing edge and cloud processing
Assess detection systems using appropriate metrics for imbalanced IoT data
In 60 Seconds
Anomaly detection identifies the rare critical events hidden within billions of normal IoT sensor readings by scoring data against statistical or learned baselines. The key takeaway: start with Z-score or IQR at the edge and escalate to ML models only when patterns exceed statistical explanations.
10.2 How It Works
10.2.1 Overview
A single anomalous vibration pattern detected at a wind turbine bearing could indicate imminent failure—catching it early saves $250,000 in repair costs and prevents 2 weeks of downtime. Missing that subtle signal costs millions in lost generation and emergency repairs. This is the critical role of anomaly detection in IoT.
Anomaly detection systems operate through a three-stage pipeline:
Feature Extraction: Raw sensor data (vibration, temperature, pressure) is transformed into statistical features (mean, variance, frequency spectrum components) or time-series representations (ARIMA residuals, autoencoder reconstruction errors)
Anomaly Scoring: Each data point receives an anomaly score using statistical methods (Z-score distance from mean), time-series forecasts (prediction error magnitude), or ML models (Isolation Forest path lengths, autoencoder reconstruction loss)
Threshold Classification: Scores above a tuned threshold trigger alerts - the threshold balances precision (avoiding false alarms) against recall (catching all real anomalies) based on domain-specific cost functions
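As a minimal sketch, the three stages above can be wired together in a few lines of Python (the feature set, baseline data, and 3.0 threshold here are illustrative, not prescriptive):

```python
import math

def extract_features(window):
    """Stage 1: turn raw readings into statistical features."""
    mean = sum(window) / len(window)
    var = sum((x - mean) ** 2 for x in window) / len(window)
    return {"mean": mean, "std": math.sqrt(var)}

def anomaly_score(value, features):
    """Stage 2: score a point as its Z-distance from the baseline."""
    std = features["std"] or 1e-6   # guard against zero variance
    return abs(value - features["mean"]) / std

def classify(score, threshold=3.0):
    """Stage 3: threshold the score into an alert decision."""
    return score > threshold

baseline = extract_features([22.1, 22.4, 21.9, 22.6, 22.3, 22.0])
print(classify(anomaly_score(22.2, baseline)))   # normal office reading
print(classify(anomaly_score(85.0, baseline)))   # fire-range spike
```

Swapping the scoring function (prediction error from a forecast, reconstruction loss from an autoencoder) changes the method without changing the pipeline shape.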
In traditional IT systems, anomalies are relatively rare—a server crash or security breach. In IoT, we face a unique challenge: billions of sensors generating trillions of data points, where 99.99% is normal and only 0.01% represents critical anomalies. How do we find that needle in the haystack, in real-time, at scale?
Putting Numbers to It
Consider a wind farm with 100 turbines, each reporting 12 sensor readings every second. That’s \(100 \times 12 \times 86400 = 103.68\) million readings per day. If anomalies occur at 0.01% rate, we expect \(103.68M \times 0.0001 = 10,368\) anomalous readings daily to investigate.
Traditional storage of all readings: \(103.68M \times 50\text{ bytes} = 5.18\text{ GB/day}\), costing ~$4/month in S3. But transmitting this data from remote turbines at \(\$0.09/\text{GB}\) costs \(5.18 \times 30 \times 0.09 = \$14/\text{month}\) in bandwidth alone.
Edge anomaly detection reduces cloud transmission by more than 99.9%: only flagged anomalies plus hourly summaries go to cloud. New daily volume: \((10,368 \times 50) + (100 \times 12 \times 24 \times 20) \approx 1.1\text{ MB/day}\). Monthly bandwidth cost drops to roughly \(\$0.003\)—a reduction of more than 4,000x. The anomaly detection algorithm running on a \(\$200\) edge gateway pays for itself in under 15 months through bandwidth savings alone—and much faster when factoring in avoided downtime costs.
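As a sanity check, the arithmetic above can be reproduced in a few lines of Python (the prices and record sizes are the illustrative figures used in this section, not current vendor rates):

```python
# Back-of-envelope numbers for the 100-turbine wind farm
READINGS_PER_DAY = 100 * 12 * 86_400   # 103.68M readings/day
BYTES_PER_READING = 50
EGRESS_PER_GB = 0.09                   # assumed $/GB transfer price

raw_gb_day = READINGS_PER_DAY * BYTES_PER_READING / 1e9
raw_cost_month = raw_gb_day * 30 * EGRESS_PER_GB

anomalies_day = READINGS_PER_DAY * 0.0001          # 0.01% anomaly rate
summary_bytes = 100 * 12 * 24 * 20                 # hourly summaries
edge_mb_day = (anomalies_day * BYTES_PER_READING + summary_bytes) / 1e6

print(f"Raw transmission: {raw_gb_day:.2f} GB/day, "
      f"${raw_cost_month:.2f}/month egress")
print(f"Edge-filtered: {edge_mb_day:.2f} MB/day")
```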
Core Concept: Anomaly detection identifies data points or patterns that deviate significantly from expected behavior - finding the 0.01% of critical events in the 99.99% of normal sensor readings.
Why It Matters: A missed anomaly in predictive maintenance costs $250,000+ in emergency repairs; too many false alarms cause “alert fatigue” where operators ignore real warnings. The business value lies in finding the balance.
Key Takeaway: Start simple - Z-score thresholds catch 80% of anomalies with 10% of the complexity. Only add ML models (Isolation Forest, autoencoders) when statistical methods fail. Always evaluate with precision/recall, never accuracy - in imbalanced IoT data, a “99% accurate” model that never detects anomalies is worthless.
For Beginners: What is Anomaly Detection?
Think of anomaly detection like having a really attentive crossing guard at a school:
Normal traffic: Cars and buses pass by every day at about the same times, following the speed limit
Anomaly: A car speeding through at 80 mph during school hours - the crossing guard blows the whistle!
In IoT systems, your sensors are constantly watching for “speeding cars”:
A temperature sensor normally reads 20-25°C in an office building
Suddenly it reads 85°C - ANOMALY! Something is wrong (maybe a fire!)
The system alerts the operator before damage occurs
Why anomaly detection matters: Without it, you’d have to manually monitor millions of sensor readings. That’s impossible! Anomaly detection acts like thousands of crossing guards watching your entire system 24/7.
For Kids: Meet the Sensor Squad!
Imagine your sensors are like security guards at a museum!
10.2.2 The Sensor Squad Adventure: Finding the Sneaky Thief
The Sensor Squad has been hired to guard a famous museum at night. Their job? Find anything UNUSUAL!
The Normal Pattern:
Security cameras show empty hallways from 6 PM to 6 AM
Temperature stays at 68°F all night
Motion sensors detect NOTHING after closing time
One Night, Something Strange Happens:
🔍 Sammy the Motion Sensor notices: “Hey! I detected movement in Gallery 3 at 2 AM!”
🌡️ Tina the Temperature Sensor adds: “And the temperature near the back door dropped to 45°F - someone opened it!”
🎥 Cam the Camera confirms: “I see a shadowy figure near the paintings!”
The Sensor Squad used ANOMALY DETECTION! They knew what “normal” looked like, so when something “abnormal” happened, they sounded the alarm!
10.2.3 Real Examples:
Normal: Your smartwatch tracks 5,000-10,000 steps daily
Anomaly: Your smartwatch shows 0 steps for 3 days (Did you lose it? Are you sick?)
Normal: Your family’s smart thermostat runs the AC for 2 hours in summer evenings
Anomaly: The AC runs for 12 hours straight (Maybe someone left a window open!)
10.2.4 The Three Types of “Weird Stuff” Sensors Find:
One Really Weird Thing (Point Anomaly): A temperature of 1000°F in your kitchen - something is VERY wrong!
Weird Timing (Contextual Anomaly): 80°F is normal in summer, but WEIRD in winter if you live in Alaska!
Weird Patterns (Collective Anomaly): One loud noise at night is okay (maybe a car). But 50 loud noises in a pattern might mean someone is trying to break in!
Try This at Home: Keep track of how many times your family opens the refrigerator each day for a week. What’s normal? If one day it’s opened 100 times, that’s an anomaly! (Maybe you’re having a party? 🎉)
10.3 Prerequisites
Before diving into this chapter, you should be familiar with:
Big Data Overview: Understanding IoT data characteristics—volume, velocity, variety—provides context for why anomaly detection requires specialized techniques that handle streaming data at scale
Modeling and Inferencing: Knowledge of machine learning fundamentals, feature extraction, and model deployment prepares you for ML-based anomaly detection approaches
Edge Compute Patterns: Familiarity with edge vs cloud processing trade-offs helps you design anomaly detection pipelines that balance latency, bandwidth, and computational constraints
Multi-Sensor Data Fusion: Understanding sensor correlation and fusion techniques is essential for detecting collective anomalies that span multiple sensors
How This Chapter Fits Into Data Analytics
Anomaly detection is a critical real-time analytics capability that sits at the intersection of streaming data, machine learning, and edge computing:
Big Data Overview and Data Storage and Databases explain how massive volumes of sensor data are collected and stored—this chapter shows how to find the rare but critical anomalous events within that data flood
Modeling and Inferencing covers ML model deployment—this chapter specializes in unsupervised and semi-supervised techniques optimized for anomaly detection
If you’re unsure about time-series analysis or ML fundamentals, review those earlier chapters before diving into advanced detection algorithms.
10.4 Chapter Guide
This chapter is split into focused sections covering different aspects of anomaly detection. Work through them in order for a comprehensive understanding.
What You’ll Learn: Understand the three fundamental anomaly types—point, contextual, and collective—and how to match each type to appropriate detection methods and deployment locations.
Key Topics:
Point anomalies: Single outliers detected with statistical methods
For Practitioners (Comprehensive coverage):
1. Work through all chapters in order
2. Complete the hands-on lab in Performance Metrics
3. Experiment with the interactive tools in each chapter
Explore how edge-based anomaly detection reduces data transmission costs. Adjust the parameters to see how sensor count, sampling rate, and anomaly rate affect bandwidth savings.
Experiment with anomaly detection algorithms using these interactive simulations:
Interactive: Anomaly Algorithm Comparison
Interactive: Anomaly Detection Demo
10.7 Cross-Hub Connections
Enhance your learning with these interactive resources:
Practice & Simulation:
Simulations Hub: Test anomaly detection algorithms with interactive sensor data simulations—experiment with Z-score, IQR, and Isolation Forest on realistic IoT datasets
Quizzes Hub: Self-assess your understanding of statistical methods, confusion matrices, and edge/cloud deployment trade-offs
Clarify Concepts:
Knowledge Gaps Hub: Common misconceptions about false positive rates, concept drift, and when to use ML vs statistical methods
Videos Hub: Visual explanations of ARIMA forecasting, autoencoder architectures, and real-world anomaly detection case studies
Navigate Connections:
Knowledge Map: See how anomaly detection connects to edge computing, time-series databases, and predictive maintenance workflows
Test your understanding of anomaly detection fundamentals:
Question 1: Anomaly Types
A manufacturing plant monitors vibration patterns from a motor. The vibration suddenly spikes 10x above normal for a single reading, then returns to normal. What type of anomaly is this?
Contextual anomaly
Collective anomaly
Point anomaly
Temporal anomaly
Answer
C) Point anomaly - A point anomaly is a single data point that deviates significantly from the rest of the data. The spike is isolated to one reading, making it a classic point anomaly. Contextual anomalies depend on context (time, season), collective anomalies involve patterns across multiple points, and temporal anomaly is not a standard category.
Question 2: Method Selection
A smart building tracks HVAC energy consumption. The system needs to detect when energy usage is anomalously high FOR THE CURRENT SEASON (summer vs winter). Which method is most appropriate?
Z-score with fixed threshold
Isolation Forest
Contextual anomaly detection with seasonal decomposition
One-Class SVM
Answer
C) Contextual anomaly detection with seasonal decomposition - The key phrase is “for the current season.” The same energy reading might be normal in winter (heating) but anomalous in summer. This requires contextual awareness - time-series decomposition (STL) separates seasonal patterns, allowing detection of deviations from expected seasonal behavior. Fixed Z-score thresholds don’t account for seasonality.
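A minimal sketch of the contextual idea—scoring each reading against a baseline for its own season—looks like this (the kWh figures are invented for illustration; a production system would use STL decomposition rather than fixed seasonal buckets):

```python
import statistics

# Hypothetical daily HVAC energy readings (kWh) grouped by season
history = {
    "summer": [30, 32, 31, 29, 33, 30, 31],  # heavy AC load
    "winter": [12, 11, 13, 12, 10, 12, 11],  # little cooling load
}

def contextual_zscore(value, season):
    """Score a reading against the baseline for ITS season."""
    baseline = history[season]
    mean = statistics.mean(baseline)
    std = statistics.pstdev(baseline) or 1e-6  # guard zero variance
    return abs(value - mean) / std

# 30 kWh is unremarkable in summer but a large deviation in winter
print(f"summer z: {contextual_zscore(30, 'summer'):.1f}")
print(f"winter z: {contextual_zscore(30, 'winter'):.1f}")
```

The same value lands well inside 3σ for summer and far outside it for winter—exactly the behavior a fixed global threshold cannot provide.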
Question 3: Metric Selection
An anomaly detection system for a nuclear power plant has the following results: 98% accuracy, 60% precision, 95% recall. Which statement is TRUE?
The system is excellent because accuracy is 98%
The system is problematic because 40% of alerts are false alarms
The system should be tuned for higher precision even if recall drops
Accuracy is the most important metric for safety-critical systems
Answer
B) The system is problematic because 40% of alerts are false alarms - With 60% precision, 40% of detected anomalies are false positives. In safety-critical systems, this matters because operators may develop “alert fatigue” and ignore real warnings. However, 95% recall means we’re catching most actual anomalies. For nuclear plants, missing a real anomaly (low recall) is worse than false alarms, so B is true and C is dangerous advice. Accuracy is misleading in imbalanced data - with 99.99% normal data, a model that always predicts “normal” would have 99.99% accuracy but zero anomaly detection.
Question 4: Edge Deployment
Which anomaly detection method is MOST suitable for deployment on a battery-powered edge device with 32KB RAM?
Deep autoencoder with 5 hidden layers
Isolation Forest with 1000 trees
Z-score with exponential moving average
LSTM network for sequence analysis
Answer
C) Z-score with exponential moving average - Resource constraints on edge devices require lightweight algorithms. Z-score calculation needs only mean and standard deviation (or exponentially weighted versions that update incrementally with O(1) memory). Autoencoders, Isolation Forests, and LSTMs require significant memory for model parameters and cannot run on 32KB RAM devices. Statistical methods are the go-to choice for edge deployment.
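To illustrate the O(1)-memory point, here is a hedged sketch of an exponentially weighted Z-score detector (the smoothing factor alpha and the warm-up readings are invented for the example; the first few scores are unreliable while the baseline warms up):

```python
import math

class EmaZScore:
    """Incremental Z-score using exponentially weighted mean/variance.

    State is just three floats -- comfortably within a 32KB RAM budget."""

    def __init__(self, alpha=0.05):
        self.alpha = alpha   # smoothing factor (assumed value)
        self.mean = None
        self.var = 0.0

    def update(self, x):
        """Score x against the current baseline, then fold it in (O(1))."""
        if self.mean is None:          # first sample seeds the baseline
            self.mean = x
            return 0.0
        std = math.sqrt(self.var) if self.var > 0 else 1e-6
        z = abs(x - self.mean) / std
        diff = x - self.mean
        self.mean += self.alpha * diff
        self.var = (1 - self.alpha) * (self.var + self.alpha * diff * diff)
        return z

det = EmaZScore()
for reading in [22.1, 22.3, 21.9, 22.4, 22.0, 22.2]:
    det.update(reading)                # warm up on normal data

z_spike = det.update(85.0)
print(f"Spike z-score: {z_spike:.0f} -> anomaly: {z_spike > 3.0}")
```

Unlike the rolling-window variant, this keeps no buffer at all, which is why exponentially weighted statistics are the usual choice on microcontrollers.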
Question 5: Imbalanced Data
An IoT system processes 1 million sensor readings daily, with approximately 100 genuine anomalies (0.01%). A detection model reports 500 anomalies with 80 true positives. Calculate the precision and recall.
Answer
Precision = 80 / 500 = 16%; Recall = 80 / 100 = 80%. This illustrates a key challenge: even with 80% recall (catching 80 of 100 real anomalies), the 16% precision means 420 of the 500 alerts are false alarms. Operators would be overwhelmed reviewing 500 alerts daily when only 80 are real. This is why precision matters critically in production systems.
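The confusion-matrix arithmetic behind this scenario:

```python
# Counts from the scenario above
total_alerts = 500
true_positives = 80
actual_anomalies = 100

false_positives = total_alerts - true_positives    # 420 false alarms
precision = true_positives / total_alerts          # fraction of alerts that are real
recall = true_positives / actual_anomalies         # fraction of real anomalies caught
f1 = 2 * precision * recall / (precision + recall)

print(f"Precision: {precision:.0%}  Recall: {recall:.0%}  F1: {f1:.2f}")
```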
Question 6: Hybrid Architecture Design
A smart city traffic system monitors 10,000 intersections with cameras. Engineers need to detect accidents in <5 seconds while minimizing cloud bandwidth costs. Which architecture is BEST?
Send all video frames to cloud for centralized ML processing
Run Isolation Forest on edge cameras, send all anomaly scores to cloud
Run motion detection and simple rule-based filtering on edge, send only flagged frames to cloud for deep learning verification
Store all video locally, process in batch overnight with autoencoders
Answer
C) Run motion detection and simple rule-based filtering on edge, send only flagged frames to cloud for deep learning verification
This hybrid approach satisfies both constraints:
Latency (<5 seconds): Simple motion detection and rule-based checks (sudden stop, unusual object positions) run instantly on edge devices
Bandwidth: Only flagged frames (~1% of video) are sent to cloud, reducing bandwidth by 99%
Accuracy: Cloud-based deep learning verifies edge decisions, reducing false positives
Option A violates bandwidth constraints (sending all frames is expensive). Option B still sends too much data (anomaly scores from every frame). Option D violates the <5 second latency requirement (overnight batch processing).
Try It: Z-Score Anomaly Detection for Sensor Data
Objective: Implement real-time Z-score anomaly detection on simulated IoT sensor data and visualize which readings are flagged as anomalous.
import random
import math

# Simulate IoT temperature sensor with occasional anomalies
random.seed(42)
normal_mean, normal_std = 22.5, 1.2  # Normal office temperature

readings = []
for i in range(100):
    if random.random() < 0.05:  # 5% chance of anomaly
        # Inject anomalous readings (sensor malfunction or real event)
        value = random.choice([
            random.gauss(50, 3),    # Spike high
            random.gauss(-5, 2),    # Spike low
            random.gauss(22.5, 8),  # High variance
        ])
    else:
        value = random.gauss(normal_mean, normal_std)
    readings.append(round(value, 1))

# Z-score anomaly detection with rolling window
WINDOW_SIZE = 20
THRESHOLD = 3.0
anomalies = []
for i in range(WINDOW_SIZE, len(readings)):
    window = readings[i - WINDOW_SIZE:i]
    mean = sum(window) / len(window)
    variance = sum((x - mean) ** 2 for x in window) / len(window)
    std = math.sqrt(variance) if variance > 0 else 0.001
    z_score = abs(readings[i] - mean) / std
    if z_score > THRESHOLD:
        anomalies.append((i, readings[i], round(z_score, 2)))
        print(f"  [ANOMALY] Index {i:3d}: {readings[i]:6.1f}C "
              f"(z-score: {z_score:.2f}, mean: {mean:.1f}, std: {std:.1f})")

print(f"\nTotal readings: {len(readings)}")
print(f"Anomalies detected: {len(anomalies)} "
      f"({100 * len(anomalies) / len(readings):.1f}%)")
print(f"Normal readings: {len(readings) - len(anomalies)}")
What to Observe:
Z-score measures how many standard deviations a reading is from the rolling mean
A threshold of 3.0 catches extreme outliers while allowing normal variation
The rolling window adapts to gradual changes (concept drift)
This algorithm uses minimal memory (just the window)—ideal for edge deployment
Try It: IQR-Based Outlier Detection
Objective: Compare IQR-based detection with Z-score on the same data, demonstrating how IQR is more robust to existing outliers.
import random

# Simulated sensor data with outliers already present
random.seed(42)
data = [random.gauss(25, 1.5) for _ in range(50)]

# Inject 5 outliers
data[10] = 55.0   # Equipment malfunction
data[22] = -8.0   # Sensor dropout
data[35] = 48.2   # Heat event
data[41] = 60.0   # Fire alarm range
data[47] = -12.0  # Freezing anomaly

def detect_iqr(values, multiplier=1.5):
    """IQR-based outlier detection"""
    sorted_vals = sorted(values)
    n = len(sorted_vals)
    q1 = sorted_vals[n // 4]
    q3 = sorted_vals[3 * n // 4]
    iqr = q3 - q1
    lower = q1 - multiplier * iqr
    upper = q3 + multiplier * iqr
    return lower, upper

def detect_zscore(values, threshold=3.0):
    """Z-score outlier detection"""
    mean = sum(values) / len(values)
    std = (sum((x - mean) ** 2 for x in values) / len(values)) ** 0.5
    return mean - threshold * std, mean + threshold * std

# Compare methods
iqr_low, iqr_high = detect_iqr(data)
z_low, z_high = detect_zscore(data)

print("Detection Bounds Comparison:")
print(f"  IQR method:     [{iqr_low:.1f}, {iqr_high:.1f}]")
print(f"  Z-score method: [{z_low:.1f}, {z_high:.1f}]")

print("\nOutlier Detection Results:")
print(f"{'Index':>5}{'Value':>7}{'IQR':>8}{'Z-Score':>8}")
print("-" * 32)
for i, v in enumerate(data):
    is_iqr = v < iqr_low or v > iqr_high
    is_z = v < z_low or v > z_high
    if is_iqr or is_z:
        iqr_flag = "OUTLIER" if is_iqr else "normal"
        z_flag = "OUTLIER" if is_z else "normal"
        print(f"{i:5d}{v:7.1f}{iqr_flag:>8}{z_flag:>8}")

# Count detections
iqr_count = sum(1 for v in data if v < iqr_low or v > iqr_high)
z_count = sum(1 for v in data if v < z_low or v > z_high)
print(f"\nIQR detected: {iqr_count} outliers")
print(f"Z-score detected: {z_count} outliers")
print("\nKey insight: Existing outliers inflate the Z-score's mean and std,")
print("making it LESS sensitive. IQR uses quartiles and is robust to outliers.")
What to Observe:
Z-score’s bounds are wider because existing outliers inflate the mean and standard deviation
IQR uses percentiles (Q1, Q3) which are resistant to extreme values
IQR typically catches more outliers when the data already contains some
Choose Z-score for clean data; choose IQR when outliers may already be present
10.10 Worked Example: Industrial Motor Monitoring
Let’s walk through a complete anomaly detection scenario for an industrial motor in a manufacturing plant.
Scenario: Predicting Motor Bearing Failure
Context: A factory monitors 500 motors using vibration sensors (accelerometers) sampling at 1 kHz. Each motor generates 86.4 million readings per day. Total data: 43.2 billion readings daily.
Challenge: Detect bearing degradation before catastrophic failure (typical lead time: 2-4 weeks of subtle vibration changes before failure).
Cost Impact: Early detection saves $50,000-$250,000 per motor in emergency repairs and lost production.
10.10.1 Step-by-Step Detection Pipeline
Step 1: Edge Processing (per motor)
Input: Raw vibration signal at 1 kHz (86.4M readings/day)
Processing: FFT to extract frequency components, Z-score on RMS vibration level
Output: Only readings exceeding 3σ threshold (~0.1% = 86,400 candidates/day)
Resource: 32KB RAM microcontroller, <1ms latency
Step 2: Fog Aggregation (per floor)
Input: Anomaly candidates from 50 motors (~4.3M candidates/day)
Processing: Cross-motor correlation (environmental vs. motor-specific)
The three-tier architecture achieves massive data reduction (99.9%) while maintaining high detection accuracy. Statistical methods at the edge handle the bulk of data; ML at the cloud handles the complexity. This is the hybrid approach in action.
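The edge tier from Step 1 can be sketched as a simulation (a simplified illustration: the FFT feature extraction is omitted, and the vibration amplitudes, window sizes, and burst location are invented for the example):

```python
import math
import random

random.seed(7)

def rms(window):
    """Root-mean-square vibration level of one window."""
    return math.sqrt(sum(x * x for x in window) / len(window))

# Simulate 60 one-second windows of 1 kHz vibration data,
# with a bearing-like burst injected at window 45
windows = []
for w in range(60):
    amplitude = 6.0 if w == 45 else 1.0
    windows.append([random.gauss(0, amplitude) for _ in range(1000)])

# Edge tier: Z-score each window's RMS against the previous 10
# windows, forwarding only windows beyond the 3-sigma threshold
levels = [rms(w) for w in windows]
forwarded = []
for i in range(10, len(levels)):
    hist = levels[i - 10:i]
    mean = sum(hist) / len(hist)
    var = sum((x - mean) ** 2 for x in hist) / len(hist)
    std = math.sqrt(var) if var > 0 else 1e-6
    if abs(levels[i] - mean) / std > 3.0:
        forwarded.append(i)

print(f"Windows forwarded to the fog tier: {forwarded}")
print(f"Data reduction at the edge: {1 - len(forwarded) / len(levels):.0%}")
```

Only the flagged windows (and any periodic summaries) ever leave the device; the fog and cloud tiers work on this drastically reduced stream.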
10.11 Concept Check
Quick Check: Precision vs Recall Trade-off
Scenario: A predictive maintenance system processes 1 million sensor readings daily with 100 genuine failures (0.01%). Two detection models:
Model A: Detects 95 of 100 failures (95% recall) but generates 500 total alerts (405 false positives, 81% false alarm rate)
Model B: Detects 80 of 100 failures (80% recall) but generates 100 total alerts (20 false positives, 20% false alarm rate)
Which model should a plant operator choose?
Answer: Model B. While Model A catches more failures (95 vs 80), operators must investigate 500 alerts daily (81% of which are false alarms). Model B’s 100 alerts/day is manageable, and the 80% recall still catches most critical failures. In practice, alert fatigue from Model A would cause operators to ignore warnings, reducing effective recall below 80% anyway. Precision matters as much as recall in production systems.
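The trade-off can be made explicit with precision and F1 (a quick sketch; the counts come from the scenario above):

```python
def metrics(tp, alerts, actual=100):
    """Precision, recall, and F1 from alert counts."""
    precision = tp / alerts
    recall = tp / actual
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

for name, tp, alerts in [("Model A", 95, 500), ("Model B", 80, 100)]:
    p, r, f1 = metrics(tp, alerts)
    print(f"{name}: precision {p:.0%}, recall {r:.0%}, F1 {f1:.2f}")
```

Model A's higher recall comes at 19% precision; Model B's balanced 80%/80% yields a far higher F1, matching the operator's choice above.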
To Statistical Methods (Time-Series Analytics): Z-score and IQR detection provide lightweight edge-deployable algorithms catching 80% of anomalies with <1ms latency - suitable for battery-powered devices with 32KB RAM.
To Machine Learning (Modeling and Inferencing): Isolation Forest (unsupervised), autoencoders (reconstruction-based), and LSTM (sequence) networks handle multivariate patterns and collective anomalies beyond statistical methods’ capabilities.
To Edge Computing (Edge Compute Patterns): Three-tier architecture places statistical filtering at edge (99.9% data reduction), correlation at fog tier (cross-sensor validation), and complex ML at cloud (computational resources) - achieving <5 second end-to-end latency.
To Security (IoT Intrusion Detection): The same algorithms (Isolation Forest, autoencoders) detect both sensor data anomalies (predictive maintenance) and network traffic anomalies (intrusion detection) - different domain, same techniques.
10.13 See Also
For detection method deep dives:
Anomaly Types - Point, contextual, collective classifications with method selection framework
Predictive Maintenance - Industrial motor monitoring case study ($250K savings per avoided failure)
Energy Management - Smart building HVAC anomaly detection with seasonal context
Common Pitfalls
1. Using accuracy as the primary metric
In imbalanced IoT data (0.01% anomalies), a detector that always predicts ‘normal’ achieves 99.99% accuracy yet catches zero real events. Always evaluate with precision, recall, and F1-score.
2. Setting thresholds without domain context
A 3σ Z-score threshold is a starting point, not a rule. Tune thresholds using real cost ratios: missed failure cost vs false alarm investigation cost.
3. Ignoring concept drift
Models trained on summer patterns will generate false alarms in winter. Build adaptive thresholds or retrain periodically so ‘normal’ evolves with the environment.
4. Deploying ML models on resource-constrained edge devices
Isolation Forest with 100 trees will not fit in 32 KB of RAM. Profile memory before choosing an algorithm; statistical methods are the only viable option on microcontrollers.
5. Treating all false positives the same
A false alarm at 3 AM on a non-critical pump and one during peak production on a safety valve carry very different costs. Weight alert priority by asset criticality.
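The first pitfall above can be demonstrated with two lines of arithmetic — a "detector" that never fires, evaluated on data with 0.01% anomalies:

```python
total_readings = 1_000_000
real_anomalies = 100   # 0.01% of readings

# A "detector" that always predicts normal: zero alerts, zero catches
true_negatives = total_readings - real_anomalies
accuracy = true_negatives / total_readings
recall = 0 / real_anomalies

print(f"Accuracy: {accuracy:.2%}  Recall: {recall:.0%}")  # 99.99% vs 0%
```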
10.14 Summary
Anomaly detection is a critical capability for IoT systems that process billions of sensor readings to find the rare but critical events that indicate failures, security breaches, or opportunities.
Anomaly Types: Point (single outliers), contextual (depends on time/season), collective (patterns across multiple points) - classification drives method selection
Statistical Methods: Z-score and IQR are lightweight, suitable for edge deployment, and catch 80% of anomalies with 10% of the complexity
Time-Series Methods: ARIMA and STL decomposition handle seasonality and trend - essential for contextual anomalies
ML Approaches: Isolation Forest (efficient), autoencoders (multivariate), LSTM (sequences) - use when statistical methods fail
Pipeline Design: Three-tier architecture: edge for filtering, fog for correlation, cloud for complex ML
Metrics: Use precision/recall for imbalanced data - accuracy is misleading when anomalies are <1%
Key Decision Framework
Use this decision tree to select the right anomaly detection approach:
When to use statistical methods (Z-score, IQR):
Point anomaly detection with known normal distributions
Edge deployment with severe resource constraints
Real-time detection with <10ms latency requirements
Data follows relatively stable patterns
When to use time-series methods (ARIMA, STL):
Strong seasonal or trend components in data
Contextual anomalies that depend on time of day/year
Need to handle concept drift over time
When to use machine learning (Isolation Forest, autoencoders):
Multivariate patterns spanning many correlated sensors
Collective anomalies beyond what statistical thresholds capture
Statistical methods fail or generate excessive false positives
Hybrid approach (recommended): Use statistical methods at edge for fast filtering, ML at cloud for complex analysis - this catches obvious anomalies instantly while allowing sophisticated detection of subtle patterns.
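The decision framework above can be condensed into a toy helper (the rules and return strings are illustrative, not exhaustive):

```python
def recommend_method(resource_constrained, seasonal, multivariate):
    """Toy decision helper mirroring the framework above."""
    if resource_constrained:
        return "Z-score / IQR at the edge"      # tight RAM/latency budget
    if seasonal:
        return "STL / ARIMA residual detection"  # contextual anomalies
    if multivariate:
        return "Isolation Forest or autoencoder in the cloud"
    return "Z-score / IQR (start simple)"

print(recommend_method(True, False, False))
print(recommend_method(False, True, False))
print(recommend_method(False, False, True))
```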
Connection: Data Anomaly Detection meets Security Intrusion Detection
The same algorithms used for detecting sensor data anomalies (Z-score, Isolation Forest, autoencoders) are used in IoT network intrusion detection systems (NIDS). A temperature spike that triggers a maintenance alert and a suspicious traffic pattern that triggers a security alert are both “anomalies”—they just have different consequences. Isolation Forest trained on normal network traffic patterns can detect port scans, data exfiltration, and botnet C2 communication with the same unsupervised approach used for predictive maintenance. The key difference is the cost of errors: in maintenance, a false negative means a missed failure; in security, a false negative means an undetected breach. See IoT Intrusion Detection for security-specific applications of these techniques.