16  Real-Time Anomaly Pipelines

In 60 Seconds

Production anomaly detection requires a three-tier pipeline architecture spanning edge, fog, and cloud. Simple statistical checks at the edge respond in under 10 ms for safety-critical alerts, ML models at fog gateways aggregate multi-sensor data in about one second, and deep learning in the cloud discovers long-term patterns across thousands of devices. This layered approach achieves roughly 99% bandwidth reduction while preserving all anomalies, and feedback loops from operator decisions continuously improve detection accuracy across all tiers.

16.1 Learning Objectives

By the end of this chapter, you will be able to:

  • Design Three-Tier Architecture: Build anomaly detection systems spanning edge, fog, and cloud
  • Balance Latency and Accuracy: Choose where to deploy detection based on requirements
  • Handle Concept Drift: Implement adaptive systems that evolve with changing data patterns
  • Build Production Pipelines: Create end-to-end systems from data ingestion to alerting

An anomaly detection pipeline is like an assembly line for finding unusual events in sensor data. Raw readings enter at one end, get cleaned and analyzed step by step, and alerts come out the other end when something abnormal is detected. Building this as a pipeline means each step can be tested and improved independently.

Minimum Viable Understanding: Production Pipelines

Core Concept: Production anomaly detection requires a layered architecture - simple statistical methods at the edge for immediate response, ML models at gateways for deeper analysis, and cloud systems for training and pattern discovery.

Why It Matters: Edge-only detection is fast but misses complex patterns. Cloud-only detection has high latency and network dependency. The right architecture balances both.

Key Takeaway: Deploy threshold detection at the edge (<100ms), ML models at gateways (~1s), and deep learning in the cloud (~10s). Use feedback loops to continuously improve all layers.

16.2 Prerequisites

Before diving into this chapter, you should be familiar with:

~15 min | Advanced | P10.C01.U05

Key Concepts

  • Three-tier pipeline: An architecture placing threshold detection at the edge (<100 ms), ML-based detection at fog gateways (~1 s), and deep-learning analysis in the cloud (~10 s).
  • Concept drift: The gradual change in statistical properties of sensor data over time that causes a previously accurate model to degrade without retraining.
  • Ensemble detection: Combining outputs from multiple algorithms and only raising an alert when a majority agree, reducing false positives without sacrificing recall.
  • Alert fusion: Aggregating related alerts from multiple sensors or time windows into a single prioritised incident to reduce operator notification volume.
  • Backpressure: A flow-control mechanism that prevents downstream pipeline stages from being overwhelmed by upstream data bursts, typically implemented with bounded queues.
  • Feedback loop: A mechanism routing operator decisions (false positive/confirmed anomaly) back to the model training pipeline so the detector continuously improves.
  • Dead-letter queue: A storage location for data packets that failed pipeline processing, preserving them for later analysis rather than silently dropping them.
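Two of these concepts, backpressure and the dead-letter queue, fit in a short sketch. This is an illustrative in-memory version (the class and method names are our own, not from any specific framework); production systems would more often rely on a message broker's bounded buffers:

```python
from collections import deque

class BoundedStage:
    """Pipeline stage with backpressure (bounded queue) and a dead-letter queue."""

    def __init__(self, maxsize=4):
        self.queue = deque()
        self.maxsize = maxsize
        self.dead_letter = []   # failed packets preserved for later analysis
        self.rejected = 0       # backpressure counter

    def offer(self, packet):
        """Accept a packet unless the queue is full (backpressure)."""
        if len(self.queue) >= self.maxsize:
            self.rejected += 1  # signal upstream to slow down or buffer locally
            return False
        self.queue.append(packet)
        return True

    def drain(self, process):
        """Process queued packets; failures go to the dead-letter queue."""
        while self.queue:
            packet = self.queue.popleft()
            try:
                process(packet)
            except Exception:
                self.dead_letter.append(packet)
```

Upstream producers can watch the `offer()` return value (or the `rejected` counter) to know when to back off, while `dead_letter` keeps failed packets for inspection instead of silently dropping them.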

16.3 Introduction

Moving from algorithms to production systems requires careful architecture design balancing latency, accuracy, and resource constraints. This chapter walks through a three-tier pipeline architecture – edge, fog, and cloud – showing what each tier does, how data flows between them, and how to keep the system accurate as conditions change over time.

How It Works

Production anomaly detection pipelines operate as multi-tier systems where each layer has distinct responsibilities:

Edge Tier (Device/Sensor Level):

  • Runs simple statistical checks (Z-score, threshold) in <10ms
  • Triggers immediate safety responses (motor shutdown, valve close)
  • Filters 90% of data locally before transmission
  • Example: ESP32 with 520KB RAM running Z-score on temperature readings

Fog Tier (Gateway/Concentrator Level):

  • Aggregates data from 10-1000 sensors
  • Runs ML models (Isolation Forest) on multi-sensor windows in ~1 second
  • Provides local dashboards and alerts
  • Example: Raspberry Pi 4 processing 100 sensors with trained Isolation Forest model

Cloud Tier (Data Center Level):

  • Trains ML models on historical data from all devices
  • Runs deep learning (LSTM autoencoders) for pattern discovery
  • Provides cross-site analytics and long-term trends
  • Example: AWS running Spark to train models on 6 months of data from 10,000 devices

The Flow: Raw data enters at the edge tier, is filtered to summaries at the fog tier (typically 90%+ reduction), and is then aggregated into compact insights sent to the cloud. Alerts flow in reverse: the cloud discovers a gradual degradation pattern, updates the fog model, which in turn refines edge thresholds.
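The edge end of this flow can be sketched in a few lines. The sketch assumes a fixed baseline mean and standard deviation for simplicity (a real device would maintain a running baseline), and the function name `edge_filter` is ours:

```python
import random

def edge_filter(readings, z_threshold=3.0, sample_rate=0.10, mean=25.0, std=2.0):
    """Forward anomalous readings plus a small random sample of normal ones.

    The random sample keeps the cloud's picture of 'normal' fresh while
    still discarding ~90% of in-range data.
    """
    forwarded = []
    for r in readings:
        z = abs(r - mean) / std
        if z > z_threshold or random.random() < sample_rate:
            forwarded.append(r)
    return forwarded
```

Feeding it 100 normal readings and one outlier forwards the outlier plus roughly ten sampled normals, matching the 90%+ reduction described above.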

The Sensor Squad was protecting a big chocolate factory, and they had a clever three-layer defense system!

Sammy the Sensor was stationed right next to the chocolate mixer. “I am the FIRST line of defense!” he said proudly. “If the temperature shoots above 80 degrees, I sound the alarm in less than one second. I am fast but I only know simple rules.”

Lila the LED was at the factory gateway computer. “I am the SECOND line of defense! Every few seconds, I look at readings from ALL the sensors together – temperature, stirring speed, and chocolate thickness. If the combination looks weird, I alert the manager. I am smarter than Sammy but a bit slower.”

Max the Microcontroller was up in the cloud. “And I am the THIRD line of defense! I study weeks of data to spot slow changes. Last month, I noticed the mixer was getting slightly louder each day – a pattern nobody else could see. I predicted the motor would break in two weeks, and the factory fixed it before it failed!”

Bella the Battery summed it up: “It is like having three guards – a fast one at the door who checks your badge, a smart one in the hallway who checks your face, and a detective in the back room who studies patterns over time. Together, nothing gets past us!”

Key lesson: Production anomaly detection uses layers – fast simple checks at the edge, smarter analysis at gateways, and deep pattern detection in the cloud. Each layer catches different types of problems!

16.4 End-to-End Architecture

With the three-tier model established, the next question is how data actually flows through the pipeline – from raw sensor readings to actionable alerts:

Figure 16.1: Three-tier anomaly detection architecture balancing latency and accuracy

Pipeline Stages:

  1. Data Ingestion
    • Protocols: MQTT (lightweight, edge), Kafka (high-throughput, cloud)
    • Buffering: Handle network interruptions with local queues
    • Sampling: Adaptive sampling that increases frequency when anomalies are suspected
  2. Feature Extraction
    • Windowing: Sliding windows (e.g., 10-second windows, 5-second overlap)
    • Aggregation: Mean, std, min, max, percentiles per window
    • Domain-specific: FFT for vibration, wavelets for signals
    • Normalization: Scale features to common range (0-1 or z-score)
  3. Model Scoring
    • Edge: Simple statistical methods (<10ms per sample)
    • Gateway: Trained ML models (10-100ms per batch)
    • Cloud: Deep learning ensembles (1-10s per batch)
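The feature-extraction stage above, sliding windows plus per-window aggregates, might look like this in outline (window and step sizes are illustrative):

```python
import numpy as np

def window_features(samples, window=10, step=5):
    """Slide a `window`-sample window with `window - step` samples of overlap
    and compute the per-window aggregates listed in stage 2."""
    feats = []
    for start in range(0, len(samples) - window + 1, step):
        w = np.asarray(samples[start:start + window], dtype=float)
        feats.append({
            "mean": w.mean(),
            "std": w.std(),
            "min": w.min(),
            "max": w.max(),
            "p95": np.percentile(w, 95),
        })
    return feats
```

Each dictionary of aggregates (optionally normalized afterwards) becomes one feature vector for the model-scoring stage.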

Data reduction through the three-tier pipeline follows geometric progression. For 280 sensors at 1 Hz each:

Raw Rate: \(280 \text{ sensors} \times 1 \text{ Hz} \times 100 \text{ bytes} = 28 \text{ KB/s}\)

Edge Tier (Z-score filter, keep anomalies + 10% samples): \[\text{Edge Out} = 28 \text{ KB/s} \times (0.01 \text{ anomaly rate} + 0.10 \text{ sampling}) = 3.08 \text{ KB/s}\]

Fog Tier (aggregate to 1-min summaries, forward anomalies): \[\text{Fog Out} = \frac{280 \times 50 \text{ bytes}}{60 \text{ s}} + (3.08 \text{ KB/s} \times 0.01) \approx 233 + 31 = 264 \text{ bytes/s}\]

Cloud receives \(264 \text{ bytes/s} = 22.8 \text{ MB/day}\) vs. \(28 \text{ KB/s} = 2.4 \text{ GB/day}\) raw.

Reduction: \(99.1\%\) bandwidth savings while preserving all anomalies and statistical trends.
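The reduction arithmetic above can be reproduced in a few lines (the function and parameter names are ours):

```python
def pipeline_rates(n_sensors=280, hz=1.0, bytes_per_reading=100,
                   anomaly_rate=0.01, sample_rate=0.10,
                   summary_bytes=50, summary_period_s=60):
    """Data rates (bytes/s) at each tier, plus the overall reduction."""
    raw = n_sensors * hz * bytes_per_reading                    # raw sensor rate
    edge_out = raw * (anomaly_rate + sample_rate)               # anomalies + sample
    fog_out = (n_sensors * summary_bytes) / summary_period_s \
              + edge_out * anomaly_rate                         # summaries + forwards
    reduction = 1 - fog_out / raw
    return raw, edge_out, fog_out, reduction
```

With the defaults matching the worked numbers, the function returns 28 KB/s raw, 3.08 KB/s out of the edge tier, about 264 bytes/s out of the fog tier, and a reduction just over 99%.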


  4. Alert Generation
    • Thresholding: Convert confidence scores to a binary decision
    • Hysteresis: Require N consecutive anomalies to reduce false alarms
    • Severity levels: Critical (act immediately) vs Warning (monitor)
  5. Feedback Loop
    • Operator confirmation: Was the anomaly real or a false positive?
    • Model updates: Retrain with confirmed labels
    • Threshold tuning: Adjust based on false positive rate
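The hysteresis rule in the alert-generation stage is small enough to show directly; here is a sketch that requires N consecutive over-threshold scores before alerting (class and parameter names are illustrative):

```python
class HysteresisAlert:
    """Raise an alert only after `n_consecutive` anomalous scores in a row."""

    def __init__(self, threshold=0.5, n_consecutive=3):
        self.threshold = threshold
        self.n_consecutive = n_consecutive
        self.streak = 0

    def update(self, score):
        if score > self.threshold:
            self.streak += 1
        else:
            self.streak = 0    # any normal sample resets the streak
        return self.streak >= self.n_consecutive
```

A single noisy spike never alerts; only a sustained run of anomalous scores does, which is the false-alarm reduction the stage description promises.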

16.5 Edge vs Cloud Detection

Decision Framework:

| Requirement  | Deploy at Edge if…               | Deploy in Cloud if…        |
|--------------|----------------------------------|----------------------------|
| Latency      | <100 ms required                 | 1-10 s acceptable          |
| Criticality  | Safety-critical (motor shutdown) | Informational (dashboards) |
| Complexity   | Simple patterns (out-of-range)   | Complex temporal/spatial   |
| Connectivity | Intermittent or unreliable       | Always-on, high bandwidth  |
| Power        | Battery-powered devices          | Mains-powered or cloud     |

Hybrid Approach (Recommended):

  • Edge: Detect critical anomalies (safety), filter obvious noise
  • Cloud: Detect subtle patterns, perform root cause analysis, retrain models

Example: Factory Motor Monitoring

# Shared imports for the three examples below
from collections import deque

import numpy as np

# Edge: Simple threshold detection (runs on ESP32 microcontroller)
def edge_detector(temperature, vibration):
    """
    Runs every 100ms on edge device
    Returns: (is_critical, should_alert)
    """
    critical = (temperature > 95) or (vibration > 5.0)
    if critical:
        # Immediate action: Stop motor
        stop_motor()
        return True, True

    # Send to gateway for advanced detection
    return False, False

# Gateway: ML-based detection (Raspberry Pi)
class GatewayDetector:
    def __init__(self):
        self.model = load_isolation_forest()  # Pre-trained
        self.window = deque(maxlen=100)  # 10 seconds at 10Hz
        self.sample_count = 0

    def update(self, sample):
        """
        Runs every 100ms, batches for ML every 1s
        sample: [temp, vibration, current, rpm]
        """
        self.window.append(sample)
        self.sample_count += 1

        if len(self.window) >= 100 and self.sample_count % 10 == 0:
            # Extract features from full window every 10 samples (~1s)
            features = self.extract_features(self.window)

            # ML prediction (decision_function returns anomaly scores)
            anomaly_score = self.model.decision_function([features])[0]

            if anomaly_score < -0.1:  # Threshold
                send_alert("Moderate anomaly detected")
                send_to_cloud(self.window)  # For deep analysis

            return anomaly_score < -0.1

        return False

    def extract_features(self, window):
        """Extract statistical features from time window"""
        window_array = np.array(window)
        return np.concatenate([
            np.mean(window_array, axis=0),  # Mean of each sensor
            np.std(window_array, axis=0),   # Std of each sensor
            np.max(window_array, axis=0),   # Max of each sensor
            np.min(window_array, axis=0),   # Min of each sensor
        ])

# Cloud: Deep learning pattern analysis (runs every 10s)
class CloudDetector:
    def __init__(self):
        self.lstm_model = load_lstm_model()  # Pre-trained

    def analyze_pattern(self, historical_windows):
        """
        Analyze long-term patterns (hours/days)
        historical_windows: Last 1000 windows (100s of data)
        """
        # LSTM prediction: forecast the latest window from the preceding ones,
        # then compare against what actually happened
        predicted_next = self.lstm_model.predict(historical_windows[:-1])
        actual_next = historical_windows[-1]

        error = np.mean((predicted_next - actual_next)**2)

        if error > THRESHOLD:
            # Pattern anomaly: gradual degradation detected
            create_maintenance_ticket("Bearing wear pattern detected")
            predict_time_to_failure()

        return error

16.6 Handling Concept Drift

The Problem: “Normal” changes over time:

  • Sensors degrade and drift
  • Seasonal patterns (summer vs winter temperatures)
  • Equipment wear (vibration baseline increases)
  • Process changes (new production schedules)

Strategies:

  1. Continuous Retraining
    • Retrain models weekly/monthly on recent data
    • Use sliding window (last 30 days, not all historical data)
    • Automate via MLOps pipelines
  2. Adaptive Thresholds
    • Dynamic thresholds that adjust to recent baseline
    • Example: “Flag if >3 sigma from last 7 days mean” (not all-time mean)
  3. Ensemble Models
    • Combine short-term (catches sudden changes) and long-term (catches drift) models
    • Example: Z-score (1-day window) + ARIMA (30-day window)
  4. Human-in-the-Loop
    • Operator labels false positives/negatives
    • Feedback trains model incrementally (online learning)
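Strategy 2, adaptive thresholds, can be sketched as a z-score against a rolling baseline instead of an all-time mean. The window size stands in for "the last 7 days"; only samples judged normal feed the baseline, so a sustained anomaly does not immediately poison it:

```python
import math
from collections import deque

class RollingZScore:
    """Flag samples more than `sigma` standard deviations from a rolling baseline."""

    def __init__(self, window=7 * 24, sigma=3.0):
        self.buf = deque(maxlen=window)   # e.g. hourly readings over 7 days
        self.sigma = sigma

    def is_anomaly(self, x):
        flagged = False
        if len(self.buf) >= 10:           # need a minimal baseline first
            mean = sum(self.buf) / len(self.buf)
            var = sum((v - mean) ** 2 for v in self.buf) / len(self.buf)
            std = math.sqrt(var) or 1e-9  # guard against a constant baseline
            flagged = abs(x - mean) / std > self.sigma
        if not flagged:
            self.buf.append(x)            # only normal data updates the baseline
        return flagged
```

As the baseline drifts (say, a slow seasonal temperature rise), the rolling mean drifts with it, so readings that are normal for the current season stay unflagged.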

Implementation:

import time
from collections import deque

from sklearn.ensemble import IsolationForest

class AdaptiveDetector:
    def __init__(self, retrain_interval=86400):  # Retrain daily
        self.model = IsolationForest()
        self.fitted = False                      # predict() fails before first fit
        self.training_buffer = deque(maxlen=10000)  # Keep last 10k samples
        self.last_retrain = time.time()
        self.retrain_interval = retrain_interval

    def update(self, sample):
        """Process new sample and periodically retrain"""
        # Predict with current model (treat everything as normal until first fit)
        is_anomaly = self.fitted and self.model.predict([sample])[0] == -1

        # Add to training buffer (if considered normal)
        if not is_anomaly:
            self.training_buffer.append(sample)

        # Check if it is time to retrain
        if time.time() - self.last_retrain > self.retrain_interval:
            self.retrain()

        return is_anomaly

    def retrain(self):
        """Retrain model on recent data"""
        if len(self.training_buffer) > 1000:
            print(f"Retraining model on {len(self.training_buffer)} samples")
            self.model.fit(list(self.training_buffer))
            self.fitted = True
            self.last_retrain = time.time()

Pitfall: Ignoring Concept Drift in Production Systems

The Mistake: Training anomaly detection models once on historical data and deploying them permanently, assuming the definition of “normal” remains constant over the system’s operational lifetime.

Why It Happens: Model training is resource-intensive and disrupts operations. “If it ain’t broke, don’t fix it” thinking prevails. Concept drift is gradual and invisible - performance degrades slowly without obvious failure points. Teams lack automated drift detection and retraining pipelines.

The Fix: Design for continuous learning from the start:

  1. Drift detection: Monitor model performance indicators continuously - track the distribution of anomaly scores over time using Page-Hinkley test or ADWIN algorithm. If the score distribution shifts significantly (KL divergence > 0.1), trigger investigation.

  2. Scheduled retraining: Establish periodic retraining cycles (weekly, monthly) using recent data. For seasonal systems (HVAC, agriculture), retrain at season boundaries.

  3. Online learning: For fast-drifting environments, use incremental learning algorithms (online Isolation Forest, streaming autoencoders) that update continuously without full retraining.

  4. Human-in-the-loop feedback: When operators dismiss alerts as false positives or confirm true anomalies, feed this labeled data back into the training pipeline. Over 6 months, this feedback can improve precision by 20-40%.

  5. Model versioning with A/B testing: Deploy new models alongside old ones, compare performance on held-out data, automatically promote better performers.

Reality check: a predictive maintenance model trained in summer may have 30% higher false positive rate in winter due to thermal effects on vibration patterns.
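The Page-Hinkley test mentioned in fix 1 is compact enough to sketch. This version detects an upward shift in the stream's mean; the `delta` and `lam` defaults are illustrative, not tuned values:

```python
class PageHinkley:
    """Page-Hinkley test for an upward shift in the mean of a stream."""

    def __init__(self, delta=0.005, lam=5.0):
        self.delta = delta     # tolerated drift magnitude
        self.lam = lam         # detection threshold
        self.n = 0
        self.mean = 0.0
        self.cum = 0.0         # cumulative deviation m_t
        self.cum_min = 0.0     # running minimum M_t

    def update(self, x):
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.cum += x - self.mean - self.delta
        self.cum_min = min(self.cum_min, self.cum)
        return (self.cum - self.cum_min) > self.lam   # True means drift detected
```

Run on anomaly scores rather than raw readings, a detection here is the trigger for the investigation (and possibly retraining) described above.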

16.7 Worked Example: Three-Tier Anomaly Detection for Water Treatment

Worked Example: Designing a Layered Anomaly Detection Pipeline for a Municipal Water Plant

Scenario: Thames Water operates the Beckton Sewage Treatment Works in east London, processing 900 million litres of wastewater daily. The plant monitors chlorine residual, turbidity, pH, flow rate, and dissolved oxygen across 45 treatment stages. A contamination event must be detected within 60 seconds (safety-critical), while gradual process degradation should be caught within hours.

Given:

  • 280 sensors across 45 treatment stages
  • Sampling rates: Chlorine/pH at 1 Hz, turbidity at 0.5 Hz, flow at 0.1 Hz
  • Total data rate: 280 sensors × 0.5 Hz average = 140 readings/second
  • Safety requirement: Chlorine deviation >0.5 mg/L detected within 60 seconds
  • Process optimization: Detect sludge thickener degradation within 4 hours
  • Historical data: 3 years of sensor readings with 47 labeled contamination events

Step 1 – Deploy edge detection (safety-critical, <60 second response):

Each treatment stage has a PLC (Siemens S7-1500) performing edge detection:

| Check                   | Method              | Threshold               | Response Time | Action                              |
|-------------------------|---------------------|-------------------------|---------------|-------------------------------------|
| Chlorine out of range   | Hard bounds         | <0.2 or >4.0 mg/L       | <1 second     | Emergency shutdown of dosing system |
| pH extreme              | Hard bounds         | <5.0 or >10.0           | <1 second     | Divert flow to holding tank         |
| Chlorine rate-of-change | Delta check         | >0.3 mg/L in 30 seconds | 30 seconds    | Alert operator + increase sampling  |
| Turbidity spike         | Z-score (window=60) | Z > 4.0                 | 60 seconds    | Alert + trigger grab sample         |

Edge detection catches: instantaneous sensor failures, sudden contamination events, dosing equipment malfunctions.

Edge detection misses: gradual fouling, seasonal baseline shifts, multi-sensor correlated degradation.
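The hard-bound and rate-of-change checks from the table can be sketched together. The thresholds come from the table; the class name and the string return values standing in for PLC actions are our own shorthand:

```python
from collections import deque

class ChlorineEdgeCheck:
    """Hard-bound and delta (rate-of-change) checks for 1 Hz chlorine readings."""

    def __init__(self, low=0.2, high=4.0, max_delta=0.3, delta_window_s=30):
        self.low, self.high = low, high
        self.max_delta = max_delta
        self.history = deque(maxlen=delta_window_s)  # last 30 s at 1 Hz

    def check(self, mg_per_l):
        if not (self.low <= mg_per_l <= self.high):
            return "SHUTDOWN"        # hard bounds: emergency dosing shutdown
        self.history.append(mg_per_l)
        if len(self.history) == self.history.maxlen:
            delta = max(self.history) - min(self.history)
            if delta > self.max_delta:
                return "ALERT"       # rate-of-change: alert + increase sampling
        return "OK"
```

Note how the delta check catches a reading that is still in range but moving too fast, which is exactly the class of event hard bounds alone would miss.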

Step 2 – Deploy gateway detection (process anomalies, ~5 minute response):

A gateway server (Dell PowerEdge, on-premises) runs Isolation Forest models:

  • Input: 5-minute rolling features from 280 sensors (mean, std, min, max, trend slope)
  • Model: 45 Isolation Forest models (one per treatment stage), trained on 2 years of normal operation
  • Detection targets: Multi-sensor anomalies where individual readings are in-range but the combination is abnormal
  • Example: Turbidity normal (2.1 NTU), pH normal (7.2), but dissolved oxygen dropping from 6.0 to 4.5 mg/L while flow is constant – indicates biological treatment failure

| Detection Type            | Feature Window                           | Model Update          | False Positive Rate                   |
|---------------------------|------------------------------------------|-----------------------|---------------------------------------|
| Multi-sensor correlation  | 5-minute aggregates                      | Monthly retrain       | 2.1% (acceptable for operator review) |
| Diurnal pattern deviation | 1-hour features vs time-of-day profile   | Weekly profile update | 1.8%                                  |
| Inter-stage anomaly       | Difference between stage N and stage N+1 | Monthly               | 0.9%                                  |

Step 3 – Deploy cloud detection (long-term degradation, hours to days):

Azure cloud runs LSTM autoencoders on historical patterns:

  • Training data: 3 years, 47 labeled events, ~10 billion data points
  • Model: LSTM autoencoder per treatment stage, 168-hour (1-week) input window
  • Detection: Reconstruction error exceeding 99th percentile of training distribution
  • Targets: Membrane fouling (gradual over 2-4 weeks), sludge thickener degradation (5-10 day progression), seasonal algae bloom effects

Retraining: Quarterly, incorporating new labeled events. Model A/B testing before promotion to production.

Step 4 – Calculate detection coverage:

| Anomaly Type                                | Edge          | Gateway       | Cloud          | Overall        |
|---------------------------------------------|---------------|---------------|----------------|----------------|
| Sudden contamination (47 events in 3 years) | 43/47 (91%)   | 46/47 (98%)   | 47/47 (100%)   | 47/47 (100%)   |
| Gradual degradation (23 events)             | 2/23 (9%)     | 14/23 (61%)   | 21/23 (91%)    | 21/23 (91%)    |
| Sensor malfunction (156 events)             | 148/156 (95%) | 155/156 (99%) | 156/156 (100%) | 156/156 (100%) |
| Weighted detection rate                     | 77%           | 90%           | 96%            | 98%            |

Cost of the pipeline:

| Tier                     | Hardware                  | Software             | Annual Cost |
|--------------------------|---------------------------|----------------------|-------------|
| Edge (45 PLCs, existing) | GBP 0 (already installed) | Configuration only   | GBP 5,000   |
| Gateway (1 server)       | GBP 4,000 (one-time)      | Open-source ML stack | GBP 2,000   |
| Cloud (Azure)            |                           | Azure ML + storage   | GBP 14,000  |
| Total annual             |                           |                      | GBP 21,000  |

Value: A single undetected contamination event resulting in a Drinking Water Inspectorate prosecution costs GBP 250,000-500,000 in fines plus remediation. The pipeline has prevented 3 near-miss events in its first year of operation.

Result: The three-tier architecture achieves 98% weighted detection rate at GBP 21,000/year. Edge tier handles 77% of anomalies with <1 second response. Gateway adds 13% coverage for multi-sensor correlations. Cloud adds 6% for long-term degradation patterns. Each tier catches anomaly types that lower tiers miss.

Key Insight: The three tiers are not redundant – they detect fundamentally different anomaly types. Removing any tier creates blind spots. Edge catches sudden failures (critical for safety). Gateway catches cross-sensor correlations (critical for process quality). Cloud catches slow degradation (critical for maintenance planning). The layered cost also scales appropriately: edge detection is nearly free (reuses existing PLCs), gateway costs GBP 2,000/year, and cloud costs GBP 14,000/year.

Concept Relationships

Concept relationship diagram showing Production Anomaly Pipeline branching into Edge Detection, Fog Analysis, and Cloud Training tiers with their respective methods, latency requirements, and specific algorithms

How These Concepts Connect:

  • Detection complexity increases from edge to cloud: Edge handles simple point anomalies, fog handles multi-sensor patterns, cloud handles complex temporal sequences
  • Latency requirements drive deployment: Safety systems need edge (<100ms), predictive maintenance tolerates cloud (~10s)
  • Feedback loops maintain accuracy: Cloud trains models → fog deploys models → edge applies thresholds → all send results back to cloud for retraining
  • Concept drift requires all tiers: Edge detects immediate shifts, fog adapts thresholds, cloud retrains long-term models

See Also

Data Pipeline Components:

  • Stream Processing - Real-time data ingestion and processing
  • Message Queuing - MQTT for edge-to-fog data transport
  • Data Lakes - Storing historical data for model training

Try It Yourself

Experiment 1: Measure Latency at Each Tier

Build a three-tier demo pipeline and measure end-to-end latency:

  1. Edge: Python script simulating sensor with Z-score detection
  2. Fog: Receive via MQTT, run Isolation Forest, publish alerts
  3. Cloud: Store in database, display on dashboard

Inject an anomaly and measure: sensor detection time, MQTT transmission time, fog processing time, cloud visualization time. Which component dominates?

Experiment 2: Simulate Concept Drift

Generate temperature data with a gradual baseline shift (22 °C → 26 °C over 30 days):

  1. Fixed thresholds: Alert when >25 °C (fails as the baseline drifts)
  2. Adaptive thresholds: Z-score on a 7-day rolling window
  3. Model retraining: Retrain Isolation Forest weekly

Which approach maintains accuracy? What is the trade-off in responsiveness vs false positive rate?

Experiment 3: Edge Processing ROI

Calculate bandwidth and cost with and without edge filtering:

  • 1,000 sensors at 10 Hz, 100 bytes/reading
  • Cloud ingress: $0.09/GB
  • Edge device: Raspberry Pi Zero ($10) running Z-score

How many days until edge processing pays for itself? What if sensors run at 1 kHz (vibration monitoring)?
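A sketch of the payback arithmetic (the function name, and the assumption that anomalies plus a 10% sample survive the edge filter, are ours):

```python
def edge_roi_days(n_sensors=1000, hz=10, bytes_per_reading=100,
                  ingress_per_gb=0.09, device_cost=10.0, kept_fraction=0.11):
    """Days until an edge filter pays for itself in saved cloud ingress fees."""
    raw_gb_day = n_sensors * hz * bytes_per_reading * 86400 / 1e9
    saved_per_day = raw_gb_day * (1 - kept_fraction) * ingress_per_gb
    return device_cost / saved_per_day
```

With the defaults above, raw traffic is 86.4 GB/day and the device pays for itself in under two days; at 1 kHz sampling it pays for itself within the first hour.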


Challenge: Build a Feedback Loop

Implement operator confirmation feedback:

  1. Operator marks alerts as true positive or false positive
  2. System recomputes precision/recall weekly
  3. If precision <80%, automatically increase threshold
  4. If recall <95%, automatically decrease threshold

Test with simulated operator feedback. Does the system converge to optimal thresholds?
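A minimal version of the adjustment rules (the step size and function name are assumptions; rules 3 and 4 are applied in that order):

```python
def tune_threshold(threshold, tp, fp, fn,
                   min_precision=0.80, min_recall=0.95, step=0.05):
    """Weekly threshold adjustment from operator-labeled alerts."""
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    if precision < min_precision:
        return threshold + step   # too many false alarms: be stricter
    if recall < min_recall:
        return threshold - step   # missing anomalies: be more sensitive
    return threshold
```

Applied weekly with fresh counts, the threshold walks toward a region where both constraints hold; whether it converges or oscillates depends on the step size and on how noisy the weekly counts are, which is exactly what the simulated-feedback test should reveal.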

Common Pitfalls

Edge-only detection misses complex multivariate patterns; cloud-only detection introduces unacceptable latency for safety shutdowns. Layer algorithms across all three tiers.

A sudden burst of anomalies can flood the pipeline. Without bounded queues, the system crashes exactly when accurate detection matters most.

A pipeline without operator feedback never improves. Wire confirmed anomaly and false-positive labels back to retraining jobs so models adapt to changing conditions.

Generating 10,000 alerts per day is not sensitivity — it is misconfiguration. Track the ratio of actionable alerts to total alerts; aim for > 60% actionable before declaring production-ready.

Simulate network partitions, sensor dropouts, and clock skew before going live. A pipeline that silently stops detecting anomalies during a network outage is more dangerous than one that fails loudly.

16.8 Summary

Production anomaly detection requires layered architecture:

  • Edge: Fast threshold detection for safety-critical anomalies (<100ms)
  • Gateway: ML models for pattern detection (~1s latency)
  • Cloud: Deep learning and continuous retraining (~10s latency)

Key Takeaway: Design for concept drift from the start. Implement feedback loops and scheduled retraining to maintain detection accuracy over time.

16.9 What’s Next

If you want to… Read this
Evaluate detection accuracy with correct metrics Performance Metrics
Learn the anomaly types your pipeline must handle Types of Anomalies
Apply statistical methods at the edge tier Statistical Methods
Deploy ML models at the gateway tier Machine Learning Approaches
Return to the module overview Anomaly Detection Overview