2  Stream Processing for IoT

| Chapter | Topic |
|---|---|
| Stream Processing | Overview of batch vs. stream processing for IoT |
| Fundamentals | Windowing strategies, watermarks, and event-time semantics |
| Architectures | Kafka, Flink, and Spark Streaming comparison |
| Pipelines | End-to-end ingestion, processing, and output design |
| Challenges | Late data, exactly-once semantics, and backpressure |
| Pitfalls | Common mistakes and production worked examples |
| Basic Lab | ESP32 circular buffers, windows, and event detection |
| Advanced Lab | CEP, pattern matching, and anomaly detection on ESP32 |
| Game & Summary | Interactive review game and module summary |
| Interoperability | Four levels of IoT interoperability |
| Interop Fundamentals | Technical, syntactic, semantic, and organizational layers |
| Interop Standards | SenML, JSON-LD, W3C WoT, and oneM2M |
| Integration Patterns | Protocol adapters, gateways, and ontology mapping |

2.1 Learning Objectives

By the end of this chapter, you will be able to:

  • Justify when stream processing is required over batch processing based on latency, throughput, and data freshness constraints
  • Select appropriate windowing strategies (tumbling, sliding, session) for different IoT use cases
  • Compare major stream processing platforms (Kafka, Flink, Spark Streaming) and select the right tool for specific requirements
  • Design end-to-end IoT streaming pipelines from ingestion to action
  • Handle real-world challenges including late data, exactly-once semantics, and backpressure
  • Implement basic stream processing on embedded devices using circular buffers and event detection
In 60 Seconds

Stream processing continuously analyzes IoT data as it arrives — enabling real-time anomaly detection, aggregation, and alerting in milliseconds rather than the hours required by batch processing. Core concepts: events flow through a directed pipeline of operators (filter, transform, aggregate, join) with temporal windows managing time-based grouping. Choose true streaming (Flink, Kafka Streams) for <100ms latency requirements or micro-batch (Spark Structured Streaming) for simpler development with 100ms-1s latency tolerance.

2.2 Overview

Stream processing is essential infrastructure for modern IoT systems requiring real-time insights and actions. This comprehensive guide covers everything from fundamental concepts to production implementation patterns.

A modern autonomous vehicle generates approximately 1 gigabyte of sensor data every second from LIDAR, cameras, radar, GPS, and inertial measurement units. If you attempt to store this data to disk and then query it for analysis, you’ve already crashed into the obstacle ahead. The vehicle needs to make decisions in under 10 milliseconds based on continuously arriving sensor streams.

This is the fundamental challenge that stream processing solves: processing data in motion, not data at rest.

2.3 Putting Numbers to It

An autonomous vehicle generates 1 GB/second of sensor data (LIDAR, cameras, radar, GPS). How much data does a 2-hour test drive produce, and what sustained throughput must the stream processor deliver?

Data rate: \(R = 1 \text{ GB/s} = 8 \text{ Gbps}\) (gigabits per second).

Time window for decision-making: Autonomous systems need decisions within 10 milliseconds. In 10 ms, the vehicle generates:

\[D_{10ms} = 1 \text{ GB/s} \times 0.01 \text{ s} = 10 \text{ MB}\]

Total test drive data: \(T = 1 \text{ GB/s} \times 2 \text{ hours} \times 3600 \text{ s/hour} = 7,200 \text{ GB} = 7.2 \text{ TB}\).

Why batch processing fails: If you buffer 1 second of data before processing (batch approach), you’ve traveled 25 meters at highway speed (90 km/h = 25 m/s) before detecting an obstacle. Stream processing operates on 10 MB chunks in 10 ms windows, making decisions within safe reaction distance.

Stream processing latency budget: End-to-end: sensor → processing → actuation = 10 ms. Assume sensor latency 2 ms, actuation 2 ms. Stream processor has 6 ms budget. At 10 MB input, throughput requirement = \(10 \text{ MB} / 0.006 \text{ s} \approx 1.67 \text{ GB/s}\) sustained processing rate.
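The arithmetic above can be sanity-checked with a few lines of Python (all values taken from the text):

```python
rate_gb_s = 1.0                         # sensor data rate: 1 GB/s

# Data generated inside one 10 ms decision window
d_10ms_mb = rate_gb_s * 0.01 * 1000     # GB -> MB

# Total data over a 2-hour test drive
total_tb = rate_gb_s * 2 * 3600 / 1000  # GB -> TB

# Processor budget: 10 ms end-to-end minus 2 ms sensor and 2 ms actuation
budget_s = 0.010 - 0.002 - 0.002
throughput_gb_s = (d_10ms_mb / 1000) / budget_s

print(f"per-decision data:    {d_10ms_mb:.0f} MB")
print(f"test-drive total:     {total_tb:.1f} TB")
print(f"sustained throughput: {throughput_gb_s:.2f} GB/s")
```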


Think of stream processing like a factory assembly line versus a warehouse:

  • Batch processing (warehouse): Collect all products in a warehouse, then sort and analyze them all at once during the night shift
  • Stream processing (assembly line): Inspect and act on each product as it moves down the conveyor belt in real-time

In IoT, your sensors are constantly generating data like items on a conveyor belt. Stream processing lets you:

  • Detect a temperature spike immediately instead of discovering it hours later in a daily report
  • Trigger an emergency shutdown within milliseconds when pressure exceeds safe limits
  • Update dashboards continuously so operators see current conditions, not stale data

When to use stream processing: Ask yourself “Does the value of this insight decrease over time?” If yes, stream it. If analyzing last week’s trends is just as useful as analyzing right now, batch processing is simpler and cheaper.

Imagine your sensors are like reporters at a news station!

2.3.1 The Sensor Squad Adventure: Breaking News vs. Yesterday’s Paper

The Sensor Squad has two ways to share news with the world:

The Newspaper Way (Batch Processing): Sammy the Temperature Sensor collects ALL the temperatures throughout the day. At midnight, everything gets printed in tomorrow’s newspaper. The problem? If something exciting happened at 2 PM, nobody finds out until they read the paper the next morning!

The Live TV News Way (Stream Processing): Sammy stands in front of a camera ALL DAY. The moment something happens - “BREAKING NEWS! Temperature just hit 100 degrees!” - everyone watching TV sees it RIGHT AWAY!

Real Examples:

  • Batch: Your report card comes once per semester - you find out your grades weeks after taking tests
  • Stream: A scoreboard at a basketball game - you see points the instant they’re scored!

The Sensor Squad learned: When something is URGENT (like a fire alarm), you need the “Live TV” way. When you just need a summary (like “how many steps did I take this month?”), the “Newspaper” way works fine!

2.3.2 Try This at Home!

Play the “Data Stream Game” with a friend:

  1. Friend whispers numbers to you one at a time (that’s streaming data!)
  2. Batch version: Write all numbers down, add them up at the end
  3. Stream version: Keep a running total in your head as each number arrives

Which way tells you the answer faster? That’s why IoT uses stream processing for important alerts!
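The game above is the batch-versus-stream distinction in miniature; a minimal Python sketch of both strategies:

```python
numbers = [3, 7, 2, 9, 4]   # the numbers a friend "whispers", one at a time

# Batch version: wait for everything, then aggregate once at the end
batch_total = sum(numbers)

# Stream version: update a running total as each number arrives --
# an up-to-date answer exists after every single event
running_total = 0
for n in numbers:
    running_total += n
    print(f"heard {n}, running total so far: {running_total}")

assert running_total == batch_total == 25
```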

Minimum Viable Understanding: Streaming vs Batch Processing

Core Concept: Stream processing analyzes data continuously as it arrives (event-by-event), while batch processing accumulates data into chunks and processes them at scheduled intervals.

Why It Matters: The choice determines your system’s latency profile. A factory safety shutdown requiring <100ms response cannot wait for hourly batch jobs. Conversely, training ML models on streaming data adds unnecessary complexity when overnight batch processing suffices.

Key Takeaway: Evaluate the cost of delayed insight. If your anomaly alert loses value after 10 seconds, stream process it. If your trend analysis is equally useful whether computed at midnight or noon, batch process it and save on infrastructure complexity. As a rule of thumb: stream processing adds 3-5x infrastructure complexity over batch, so the latency benefit must justify the cost.

2.4 Batch vs Stream Processing Comparison

Comparison diagram showing batch processing with scheduled jobs versus stream processing with continuous event-driven flow
Figure 2.1: Comparison of batch vs stream processing data flows. Batch processing collects data in storage, processes on schedule, and outputs to a results store. Stream processing handles events continuously with immediate processing, windowed aggregations, and real-time outputs.

Key Differences:

| Aspect | Batch Processing | Stream Processing |
|---|---|---|
| Latency | Hours to days | Milliseconds to seconds |
| Data Model | Complete, bounded datasets | Unbounded, continuous events |
| Processing | All at once, scheduled | Event-by-event, continuous |
| Use Case | Historical analysis, ML training | Real-time alerts, dashboards |
| Cost Model | Pay per job execution | Always-on infrastructure |

2.5 Chapter Overview

Diagram illustrating chapter overview
Figure 2.2: Chapter learning path showing four sequential phases: Foundations (stream vs batch, event time, windowing), Platforms (Kafka, Flink, Spark), Implementation (pipelines, late data, exactly-once, backpressure), and Practice (labs and games).

2.6 Chapter Contents

This chapter is organized into the following sections:

2.6.1 Stream Processing Fundamentals

Core concepts including batch vs stream processing, event time vs processing time, and windowing strategies (tumbling, sliding, session windows). Includes interactive pipeline demo and worked examples.

2.6.2 Stream Processing Architectures

Deep dive into Apache Kafka + Kafka Streams, Apache Flink, and Spark Structured Streaming. Architecture comparison, performance benchmarks, and technology selection guidance.

2.6.3 Building IoT Streaming Pipelines

Complete guide to designing and implementing production IoT streaming pipelines, from requirements through architecture to implementation stages.

2.6.4 Handling Real-World Challenges

Late data and watermarks, exactly-once processing semantics, backpressure management, checkpointing, and fault tolerance strategies.

2.6.5 Common Pitfalls and Worked Examples

Real-world pitfalls including non-idempotent processing and missing backpressure. Features comprehensive fraud detection pipeline worked example.

2.6.6 Hands-On Lab: Basic Stream Processing

45-minute Wokwi lab implementing continuous streaming, windowed aggregations, event detection, and circular buffers on ESP32.

2.6.7 Hands-On Lab: Advanced CEP

60-minute advanced lab covering pattern matching, complex event processing (CEP), session windows, and statistical aggregation.

2.6.8 Interactive Game and Summary

Interactive Data Stream Challenge game with 3 difficulty levels testing windowing, CEP, and anomaly detection skills. Plus chapter summary and next steps.

2.7 Windowing Strategies Visual Guide

Diagram illustrating windowing strategies
Figure 2.3: Visual comparison of three windowing strategies. Tumbling windows are fixed, non-overlapping intervals. Sliding windows overlap based on slide interval. Session windows group events by activity with gaps causing new sessions.
Windowing Strategy Selection Guide

Tumbling Windows (Fixed, non-overlapping):

  • Use for: Periodic aggregations like “events per minute”
  • Example: Count sensor readings every 5 minutes

Sliding Windows (Overlapping):

  • Use for: Continuous monitoring with smooth results
  • Example: Moving average temperature over last 5 minutes, updated every 30 seconds

Session Windows (Activity-based):

  • Use for: User interactions, intermittent activity
  • Example: Group user clicks with a 30-second timeout between sessions

Explore how window size and slide interval affect detection latency and computational cost.
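A minimal sketch of how each strategy assigns events to windows (illustrative helper functions, not any framework's API; timestamps and sizes are in arbitrary time units):

```python
def tumbling_window(t, size):
    """Fixed, non-overlapping: each event lands in exactly one window."""
    start = (t // size) * size
    return [(start, start + size)]

def sliding_windows(t, size, slide):
    """Overlapping: an event lands in every window that covers it."""
    wins = []
    start = (t // slide) * slide
    while start > t - size:
        wins.append((start, start + size))
        start -= slide
    return [w for w in reversed(wins) if w[0] >= 0]

def sessionize(times, gap):
    """Activity-based: a silence longer than `gap` starts a new session."""
    sessions, current = [], [times[0]]
    for t in times[1:]:
        if t - current[-1] > gap:
            sessions.append(current)
            current = [t]
        else:
            current.append(t)
    sessions.append(current)
    return sessions

print(tumbling_window(17, size=10))              # [(10, 20)]
print(sliding_windows(17, size=10, slide=5))     # [(10, 20), (15, 25)]
print(sessionize([1, 2, 3, 40, 41, 90], gap=30)) # [[1, 2, 3], [40, 41], [90]]
```

Note how the same event at t=17 appears in one tumbling window but in two sliding windows: that overlap is what prevents events from slipping through window boundaries.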

2.8 Quick Reference

| Topic | Key Concepts | Go To |
|---|---|---|
| Windowing basics | Tumbling, sliding, session windows | Fundamentals |
| Technology selection | Kafka vs Flink vs Spark | Architectures |
| Pipeline design | Ingestion, enrichment, aggregation | Pipelines |
| Late data handling | Watermarks, allowed lateness | Challenges |
| Exactly-once semantics | Idempotency, transactions | Challenges |
| Fraud detection example | Complete worked example | Pitfalls |
| ESP32 streaming lab | Hands-on embedded implementation | Basic Lab |
| CEP patterns | Pattern matching, sequences | Advanced Lab |

| Requirement | Apache Kafka Streams | Apache Flink | Spark Structured Streaming | Best For |
|---|---|---|---|---|
| Latency SLA | <100ms | <10ms | 100ms-1s (micro-batch) | Flink for ultra-low, Spark for batch+stream |
| Exactly-once semantics | Yes (transactional) | Yes (checkpoints) | Yes (structured API) | All support; Flink has lowest overhead |
| Complex event processing | Limited | Native CEP library | Via Spark SQL | Flink for CEP, Kafka for simple transforms |
| Stateful operations | RocksDB state store | Managed state (heap/RocksDB) | DataFrame checkpoints | Flink for large state, Kafka for simple |
| Existing ecosystem | Already using Kafka | Greenfield deployment | Already using Spark batch | Kafka Streams = simpler ops if Kafka present |
| Operational complexity | Low (part of Kafka) | Medium (separate cluster) | High (Spark cluster) | Kafka Streams easiest to deploy |

Quick Decision Tree:

  1. Need sub-10ms latency? -> Flink
  2. Already using Kafka heavily? -> Kafka Streams (simpler ops)
  3. Need to unify batch + stream processing? -> Spark Structured Streaming
  4. Need complex CEP (multi-event patterns)? -> Flink CEP
  5. Constrained DevOps team? -> Kafka Streams (fewer moving parts)
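The decision tree can be encoded as a small first-match-wins rule function (purely illustrative; the function name and parameters are invented for this sketch):

```python
def pick_platform(latency_ms=None, kafka_heavy=False,
                  unify_batch=False, needs_cep=False, small_ops_team=False):
    """Illustrative encoding of the decision tree above; first match wins."""
    if latency_ms is not None and latency_ms < 10:
        return "Flink"
    if kafka_heavy:
        return "Kafka Streams"
    if unify_batch:
        return "Spark Structured Streaming"
    if needs_cep:
        return "Flink CEP"
    if small_ops_team:
        return "Kafka Streams"
    return "evaluate further"

print(pick_platform(latency_ms=5))       # Flink
print(pick_platform(kafka_heavy=True))   # Kafka Streams
print(pick_platform(needs_cep=True))     # Flink CEP
```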
Common Mistake: Not Accounting for Event-Time Skew

What practitioners do wrong: They use processing time (when the event arrives at the streaming system) instead of event time (when the event actually occurred) for windowing.

Why it fails: Network delays, device buffering, and batch uploads cause event-time skew. A sensor reading from 10:45:30 might arrive at the server at 10:47:15 due to intermittent connectivity. If you use processing time for a “10:45-10:46 window,” this event gets assigned to the wrong window (10:47-10:48), skewing aggregations.

Correct approach:

  1. Use event-time windows based on the timestamp embedded in the event payload
  2. Configure watermarks to track event-time progress (e.g., “events may arrive up to 2 minutes out of order” sets watermark = max observed event time - 2 minutes)
  3. Set allowed lateness (grace period) for late-arriving events (e.g., 1 minute after watermark)
  4. Emit window results when watermark passes window end + grace period

Real-world consequence: A 2018 smart city traffic monitoring system used processing time and reported “rush hour at 3 AM” because 3-hour-old traffic camera data was uploaded in a batch at 3 AM. The city ran incorrect traffic light timings for 6 weeks. After switching to event-time windowing with 5-minute watermarks, congestion metrics matched ground truth within 2%.

2.9 Knowledge Check

Test your understanding of stream processing concepts with these questions.

A manufacturing plant needs to analyze sensor data from production equipment. The analysis generates weekly reports for management showing trends in equipment performance. Which processing paradigm is most appropriate?

  1. Stream processing with tumbling windows
  2. Batch processing with scheduled jobs
  3. Stream processing with session windows
  4. Real-time streaming with sub-second latency

Correct Answer: B) Batch processing with scheduled jobs

Why this is correct:

  • Weekly reports have no real-time urgency - the value doesn’t degrade over hours or days
  • Management trend analysis requires complete data context, not partial streaming results
  • Batch processing is simpler to implement and more cost-effective for scheduled reports
  • Processing can run during off-peak hours, reducing infrastructure costs

Why the other options are incorrect:

  • A) Stream processing with tumbling windows: Unnecessary complexity for weekly reports. Streaming is overkill when you don’t need sub-minute insights.
  • C) Stream processing with session windows: Session windows are designed for activity-based grouping (user sessions), not time-based reports.
  • D) Real-time streaming with sub-second latency: Paying for always-on streaming infrastructure when weekly batches suffice wastes resources.

Key Principle: “Ask yourself: Does the value of this insight decrease over time?” For weekly management reports, a day-old analysis is just as valuable as a minute-old one.

An IoT system monitors patient vital signs in a hospital. You need to detect sustained abnormal heart rates - specifically, an average heart rate above 120 BPM over any 5-minute period. Which windowing strategy is most appropriate?

  1. Tumbling windows of 5 minutes
  2. Sliding windows of 5 minutes with 1-minute slide
  3. Session windows with 5-minute timeout
  4. Global windows with no boundaries

Correct Answer: B) Sliding windows of 5 minutes with 1-minute slide

Why this is correct:

  • Sliding windows provide continuous monitoring - you check every minute whether the last 5 minutes were abnormal
  • A patient’s heart rate could spike at minute 3 of a tumbling window - with tumbling windows, you’d only detect it at minute 5
  • With sliding windows (5-min window, 1-min slide), a fresh 5-minute average is evaluated every minute, so the alert fires at most 1 minute after the trailing average first exceeds 120 BPM
  • For patient safety, earlier detection is critical

Why the other options are incorrect:

  • A) Tumbling windows of 5 minutes: These are non-overlapping. If abnormal readings span minutes 3-8, a tumbling window [0-5] might show normal, and window [5-10] might also show normal, missing the alert entirely!
  • C) Session windows with 5-minute timeout: Session windows are for grouping activity bursts (like user clicks). Heart rate monitoring is continuous, not session-based.
  • D) Global windows: Without boundaries, you’d accumulate all data forever, making the “last 5 minutes” calculation impossible.

Key Principle: Use sliding windows when you need smooth, continuous monitoring without gaps. The overlap ensures no critical event falls between window boundaries.

Your IoT sensor network has devices that occasionally experience network delays of up to 5 minutes. You’re calculating 10-minute tumbling window aggregations and emitting results eagerly (as the watermark advances). A sensor reading with event time 10:08:30 arrives at processing time 10:14:00 – after the watermark has already passed 10:10:00 and the [10:00-10:10] window result was emitted. How should a stream processor handle this?

  1. Drop the late event since the window has already closed
  2. Assign it to the [10:10-10:20] window based on its processing time
  3. Buffer all events until every possible late arrival has been received
  4. Update the already-emitted [10:00-10:10] result using watermarks and allowed lateness

Correct Answer: D) Update the already-emitted [10:00-10:10] result using watermarks and allowed lateness

Why this is correct:

  • The event (event time 10:08:30) belongs in the [10:00-10:10] window, but arrived 5.5 minutes late at processing time 10:14:00
  • The watermark already passed 10:10, so the window was considered “complete” and its result was emitted
  • Allowed lateness (e.g., 10 minutes) lets the system accept late events and update previously emitted results
  • The aggregation result is refined as late events arrive, improving accuracy over time

Why the other options are incorrect:

  • A) Drop the late event: Losing data is unacceptable. A 5-minute delay is common in IoT networks with intermittent connectivity, and shouldn’t cause data loss.
  • B) Assign it to the [10:10-10:20] window based on processing time: This confuses processing time (when the event arrived: 10:14) with event time (when the event occurred: 10:08:30). The event belongs in [10:00-10:10] based on event time, not [10:10-10:20] based on arrival time. Using processing time would produce incorrect analytics.
  • C) Buffer all events until every possible late arrival has been received: This defeats the purpose of stream processing. You’d wait indefinitely since you can never be certain no more late events will arrive.

Key Principle: In stream processing, distinguish event time (when it happened) from processing time (when it arrived). Use watermarks to track event-time progress and allowed lateness to accept late events, balancing completeness against latency.

You’re designing a streaming system for a smart city that needs to:

  • Ingest sensor data from 10,000+ traffic sensors
  • Perform complex event pattern detection (e.g., detect traffic jam propagation)
  • Maintain exactly-once processing guarantees
  • Scale horizontally as more sensors are added

Which platform combination is most appropriate?

  1. Apache Kafka only
  2. Apache Flink only
  3. Apache Kafka + Apache Flink
  4. Spark Structured Streaming only

Correct Answer: C) Apache Kafka + Apache Flink

Why this is correct:

  • Kafka excels at high-throughput ingestion and durable event storage - perfect for 10,000+ sensors
  • Flink provides best-in-class Complex Event Processing (CEP) for detecting traffic jam patterns
  • Flink has native exactly-once semantics with checkpointing
  • This combination is the de-facto standard for production IoT streaming at scale

Why the other options are less suitable:

  • A) Apache Kafka only: Kafka Streams is good for simpler transformations but lacks Flink’s sophisticated CEP library for complex pattern matching.
  • B) Apache Flink only: While Flink can ingest data directly, adding Kafka provides durability, replay capability, and decouples ingestion from processing.
  • D) Spark Structured Streaming only: Spark uses micro-batching with minimum latencies of ~100ms. For detecting rapidly propagating traffic jams, Flink’s true event-by-event processing provides lower latency.

Key Principle: In production IoT streaming, Kafka + Flink is a powerful combination - Kafka as the durable event backbone, Flink as the processing engine. Choose Spark when you need unified batch/stream processing with existing Spark infrastructure.

Your streaming pipeline normally processes 10,000 events/second. During a traffic spike, the input rate jumps to 50,000 events/second, overwhelming your aggregation operators. What is the correct backpressure response?

  1. Drop events that exceed capacity to maintain latency SLAs
  2. Signal upstream to slow down while buffering what you can
  3. Automatically scale to 5x capacity immediately
  4. Switch to batch processing mode until the spike passes

Correct Answer: B) Signal upstream to slow down while buffering what you can

Why this is correct:

  • Backpressure is a flow control mechanism that propagates “slow down” signals upstream when downstream operators are overwhelmed
  • Buffering provides temporary capacity while the system stabilizes
  • This preserves data integrity - no events are lost
  • It’s the fundamental pattern implemented by Flink, Kafka, and other production streaming systems

Why the other options are problematic:

  • A) Drop events: Data loss is almost never acceptable in IoT systems. Lost sensor readings mean incomplete analysis and potentially missed critical alerts.
  • C) Automatically scale to 5x capacity: Auto-scaling helps long-term but takes minutes. Backpressure handles second-by-second fluctuations. Also, scaling 5x for a brief spike is expensive.
  • D) Switch to batch processing: Switching processing paradigms mid-stream is architecturally complex and would disrupt real-time analytics.

Key Principle: Backpressure is a fundamental reliability mechanism in streaming systems. It’s not a failure - it’s the correct response to temporary overload. Monitor backpressure metrics to know when to scale.
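A minimal illustration of “signal upstream to slow down” using Python's bounded `queue.Queue`: when the buffer is full, `put()` blocks, which throttles the producer instead of dropping events (a toy stand-in for the credit-based flow control real streaming systems implement):

```python
import threading
import queue
import time

N = 20
buf = queue.Queue(maxsize=5)      # small bounded buffer between operators
produced, consumed = [], []

def producer():
    for i in range(N):
        buf.put(i)                # blocks while full: the "slow down" signal
        produced.append(i)

def consumer():
    while len(consumed) < N:
        item = buf.get()          # deliberately slow downstream operator
        time.sleep(0.005)
        consumed.append(item)

t_prod = threading.Thread(target=producer)
t_cons = threading.Thread(target=consumer)
t_prod.start(); t_cons.start()
t_prod.join(); t_cons.join()

print(f"produced={len(produced)}, consumed={len(consumed)}, dropped=0")
```

Because the producer is paced by the consumer, every event survives the burst; latency rises temporarily, but nothing is lost.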

2.10 Summary

Stream processing is fundamental infrastructure for IoT systems requiring real-time insights and actions. This chapter covered:

| Topic | Key Takeaway |
|---|---|
| Batch vs Stream | Choose stream when insight value degrades over time; batch when complete data context is needed |
| Windowing | Tumbling for periodic reports, sliding for smooth trends, session for activity bursts |
| Platforms | Kafka for event streaming backbone, Flink for complex event processing, Spark for unified batch/stream |
| Late Data | Watermarks + grace periods balance completeness against latency |
| Exactly-Once | Requires idempotent processing or transactional commits |
| Production | Backpressure, checkpointing, and monitoring are essential for reliability |
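The exactly-once takeaway deserves a concrete sketch: with at-least-once delivery, a sink that deduplicates by event ID makes replays harmless (illustrative class, not a framework API):

```python
class IdempotentSink:
    """Deduplicating sink: replayed events (at-least-once delivery)
    are applied only once, giving effectively-exactly-once results."""

    def __init__(self):
        self.seen_ids = set()
        self.total = 0.0

    def apply(self, event):
        if event["id"] in self.seen_ids:
            return False              # duplicate from a retry/replay: skip
        self.seen_ids.add(event["id"])
        self.total += event["value"]
        return True

sink = IdempotentSink()
events = [{"id": "e1", "value": 10.0},
          {"id": "e2", "value": 5.0},
          {"id": "e1", "value": 10.0}]   # e1 redelivered after a failure
for e in events:
    sink.apply(e)
print(sink.total)   # 15.0 -- e1 counted once despite redelivery
```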
Key Decision Framework

When to use stream processing:

  1. Safety-critical alerts requiring <1 second response
  2. Real-time dashboards and operational visibility
  3. Fraud/anomaly detection where delay means damage
  4. Event-driven automation and control systems

When batch is sufficient:

  1. Historical trend analysis and reporting
  2. ML model training on complete datasets
  3. Regulatory reports with fixed deadlines
  4. Cost optimization (pay-per-job vs always-on)

Hybrid (Lambda Architecture): Most production IoT systems use both - stream for real-time approximations, batch for accurate historical corrections.

Objective: Implement both tumbling and sliding window aggregations on streaming sensor data and compare how they handle boundary events.

import random

# Simulate streaming temperature data (1 reading per second for 60 seconds)
random.seed(42)
stream = []
for i in range(60):
    temp = 22.0 + 0.1 * i  # Gradual increase
    if 25 <= i <= 30:       # Spike event at seconds 25-30
        temp += 8.0
    temp += random.gauss(0, 0.3)
    stream.append({"time": i, "temp": round(temp, 2)})

# Tumbling Window: non-overlapping 10-second windows
print("=== Tumbling Window (10s, non-overlapping) ===")
for start in range(0, 60, 10):
    window = [s["temp"] for s in stream if start <= s["time"] < start + 10]
    avg = sum(window) / len(window)
    mx = max(window)
    print(f"  [{start:2d}-{start+10:2d}s] avg={avg:.1f}C, max={mx:.1f}C"
          + (" <-- SPIKE" if mx > 28 else ""))

# Sliding Window: 10-second window, sliding every 2 seconds
print("\n=== Sliding Window (10s window, 2s slide) ===")
for start in range(0, 52, 2):
    window = [s["temp"] for s in stream if start <= s["time"] < start + 10]
    if not window:
        continue
    avg = sum(window) / len(window)
    mx = max(window)
    print(f"  [{start:2d}-{start+10:2d}s] avg={avg:.1f}C, max={mx:.1f}C"
          + (" <-- SPIKE" if mx > 28 else ""))

print("\nKey difference:")
print("  Tumbling: spike may straddle window boundary (split detection)")
print("  Sliding:  overlapping windows catch spike regardless of timing")

What to Observe:

  1. Tumbling windows are simpler but may split events across boundaries
  2. Sliding windows overlap, catching events regardless of when they occur
  3. Sliding windows produce more output (one result per slide) but are more responsive
  4. For alert detection, sliding windows reduce worst-case detection latency

Objective: Build a simple stream processing pipeline demonstrating how producers, processors, and consumers interact, including backpressure handling.

import random
from collections import deque

class StreamPipeline:
    """Simplified stream pipeline with a bounded buffer (overload demo)"""

    def __init__(self, buffer_size=20):
        self.buffer = deque(maxlen=buffer_size)
        self.processed = 0
        self.dropped = 0
        self.alerts = 0

    def produce(self, readings):
        """Ingest sensor readings into the bounded buffer"""
        for r in readings:
            if len(self.buffer) >= self.buffer.maxlen:
                # Buffer full: deque(maxlen=...) evicts the oldest reading.
                # This is load shedding; a real system would also signal
                # the producer to slow down (true backpressure).
                self.dropped += 1
            self.buffer.append(r)

    def process(self, n=5):
        """Process up to n readings from buffer"""
        results = []
        for _ in range(min(n, len(self.buffer))):
            reading = self.buffer.popleft()
            self.processed += 1
            if reading["temp"] > 30:
                results.append({"alert": "HIGH_TEMP",
                    "value": reading["temp"], "device": reading["device"]})
                self.alerts += 1
            results.append({"type": "processed", "value": reading["temp"]})
        return results

    def status(self):
        return (f"Buffer: {len(self.buffer)}/{self.buffer.maxlen} | "
                f"Processed: {self.processed} | Dropped: {self.dropped}")

Simulation driver (run after defining the class above):

pipeline = StreamPipeline(buffer_size=20)
random.seed(42)
scenarios = [("Normal", 5), ("Normal", 5), ("Burst!", 30),
             ("Burst!", 25), ("Recovery", 5), ("Recovery", 3)]

for desc, count in scenarios:
    readings = [{"device": f"sensor_{random.randint(1,5):03d}",
                 "temp": round(22 + random.gauss(0,3), 1)}
                for _ in range(count)]
    pipeline.produce(readings)
    results = pipeline.process(n=10)
    print(f"{desc:12s} | {pipeline.status()}")

What to Observe:

  1. Normal load: buffer stays manageable, all readings processed
  2. Burst load: the bounded buffer fills and the oldest readings are shed (a production system would also push back on producers)
  3. Recovery: pipeline drains the buffer as load decreases
  4. Alerts are generated in real-time even during high load
  5. This is a simplified version of what Kafka + Flink do at scale

An online marketplace processes 50,000 transactions/minute and needs to detect potential fraud in real-time. The system uses Apache Kafka + Flink to implement a multi-stage streaming pipeline.

Pipeline Architecture:

  1. Ingestion: Kafka topic receives transaction events (user_id, amount, location, timestamp)
  2. Enrichment: Flink joins transactions with user profile stream (account age, history)
  3. Windowing: 5-minute sliding window (1-minute slide) calculates per-user spending velocity
  4. CEP Pattern Matching: Detect sequence: high-value purchase → immediate address change → second purchase <30 min
  5. Alerting: Suspicious transactions route to fraud review queue

Sample Calculation: User ID 12345 makes 3 purchases in 5 minutes totaling $1,200. Historical average spending: $240 per 5-minute window (roughly 3 transactions at $80 each). Velocity alert threshold: 5x historical average = 5 x $240 = $1,200/5-min window. Current velocity ($1,200) meets the threshold, triggering review. In practice, thresholds are often set lower (e.g., 3x) to catch fraud earlier.
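The velocity check in Python, using the numbers from the text (the split of the $1,200 across three purchases is hypothetical):

```python
# Spending-velocity check from the worked example
window_purchases = [500.0, 400.0, 300.0]   # hypothetical split of the $1,200
historical_avg = 240.0                     # user's average $ per 5-min window
threshold_multiplier = 5                   # alert when velocity >= 5x average

velocity = sum(window_purchases)
threshold = threshold_multiplier * historical_avg
alert = velocity >= threshold
print(f"velocity=${velocity:.0f}, threshold=${threshold:.0f}, alert={alert}")
# velocity=$1200, threshold=$1200, alert=True
```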

Backpressure Handling: During Black Friday traffic spike (150,000 events/min), Kafka buffers excess events while Flink processes at max capacity (80,000/min). Backpressure signals upstream producers to slow publishing rate. Window results may lag by 2-3 minutes but all events are eventually processed without data loss.

Production Metrics: 99.2% of fraud caught within 2 minutes, 0.3% false positive rate, $4.2M prevented losses in 2023.

Concept Relationships

Stream processing for IoT connects to several related concepts:

Contrast with: Batch processing (complete datasets, scheduled execution, hours-to-days latency) vs. stream processing (unbounded data, continuous execution, milliseconds-to-seconds latency)

2.12 What’s Next

| If you want to… | Read this |
|---|---|
| Study stream processing fundamentals in depth | Stream Processing Fundamentals |
| Learn about streaming architecture patterns | Stream Processing Architectures |
| Get hands-on with the basic streaming lab | Lab: Stream Processing |
| Understand IoT interoperability alongside stream processing | Interoperability |

Continue your journey in IoT data management with the sections listed above.