Justify when stream processing is required over batch processing based on latency, throughput, and data freshness constraints
Select appropriate windowing strategies (tumbling, sliding, session) for different IoT use cases
Compare major stream processing platforms (Kafka, Flink, Spark Streaming) and select the right tool for specific requirements
Design end-to-end IoT streaming pipelines from ingestion to action
Handle real-world challenges including late data, exactly-once semantics, and backpressure
Implement basic stream processing on embedded devices using circular buffers and event detection
In 60 Seconds
Stream processing continuously analyzes IoT data as it arrives — enabling real-time anomaly detection, aggregation, and alerting in milliseconds rather than the hours required by batch processing. Core concepts: events flow through a directed pipeline of operators (filter, transform, aggregate, join) with temporal windows managing time-based grouping. Choose true streaming (Flink, Kafka Streams) for <100ms latency requirements or micro-batch (Spark Structured Streaming) for simpler development with 100ms-1s latency tolerance.
2.2 Overview
Stream processing is essential infrastructure for modern IoT systems requiring real-time insights and actions. This comprehensive guide covers everything from fundamental concepts to production implementation patterns.
A modern autonomous vehicle generates approximately 1 gigabyte of sensor data every second from LIDAR, cameras, radar, GPS, and inertial measurement units. If you attempt to store this data to disk and then query it for analysis, you’ve already crashed into the obstacle ahead. The vehicle needs to make decisions in under 10 milliseconds based on continuously arriving sensor streams.
This is the fundamental challenge that stream processing solves: processing data in motion, not data at rest.
2.3 Putting Numbers to It
An autonomous vehicle generates 1 GB/second of sensor data (LIDAR, cameras, radar, GPS). Over a 2-hour test drive, how much stream processing throughput is required?
Data rate: \(R = 1 \text{ GB/s} = 8 \text{ Gbps}\) (gigabits per second). Sustained over a 2-hour (7,200 s) drive, the pipeline must keep up with \(1 \text{ GB/s} \times 7200 \text{ s} = 7.2 \text{ TB}\) of data without falling behind.
Time window for decision-making: Autonomous systems need decisions within 10 milliseconds. In 10 ms, the vehicle generates:
\(D = R \times t = 1 \text{ GB/s} \times 0.01 \text{ s} = 10 \text{ MB}\)
Why batch processing fails: If you buffer 1 second of data before processing (batch approach), you’ve traveled 25 meters at highway speed (90 km/h = 25 m/s) before detecting an obstacle. Stream processing operates on 10 MB chunks in 10 ms windows, making decisions within safe reaction distance.
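The arithmetic above can be checked in a few lines (a back-of-envelope sketch; the 1 GB/s rate and 90 km/h = 25 m/s speed come from the example):

```python
# Back-of-envelope check of the numbers above. Assumed values (from the
# example): 1 GB/s sensor data rate, 90 km/h = 25 m/s highway speed.
DATA_RATE_GB_S = 1.0
SPEED_M_S = 25.0

def travel_distance_m(latency_s):
    """Meters traveled before a decision is made."""
    return SPEED_M_S * latency_s

def data_per_window_mb(latency_s):
    """Megabytes of sensor data generated during one decision window."""
    return DATA_RATE_GB_S * 1000 * latency_s

for label, latency in [("batch (1 s buffer)", 1.0), ("stream (10 ms window)", 0.01)]:
    print(f"{label}: {travel_distance_m(latency):5.2f} m traveled, "
          f"{data_per_window_mb(latency):6.1f} MB per window")
```

The batch row reproduces the 25 m of blind travel; the stream row shows the 10 MB / 0.25 m decision window.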
Think of stream processing like a factory assembly line versus a warehouse:
Batch processing (warehouse): Collect all products in a warehouse, then sort and analyze them all at once during the night shift
Stream processing (assembly line): Inspect and act on each product as it moves down the conveyor belt in real-time
In IoT, your sensors are constantly generating data like items on a conveyor belt. Stream processing lets you:
Detect a temperature spike immediately instead of discovering it hours later in a daily report
Trigger an emergency shutdown within milliseconds when pressure exceeds safe limits
Update dashboards continuously so operators see current conditions, not stale data
When to use stream processing: Ask yourself “Does the value of this insight decrease over time?” If yes, stream it. If analyzing last week’s trends is just as useful as analyzing right now, batch processing is simpler and cheaper.
For Kids: Meet the Sensor Squad!
Imagine your sensors are like reporters at a news station!
2.3.1 The Sensor Squad Adventure: Breaking News vs. Yesterday’s Paper
The Sensor Squad has two ways to share news with the world:
The Newspaper Way (Batch Processing): Sammy the Temperature Sensor collects ALL the temperatures throughout the day. At midnight, everything gets printed in tomorrow’s newspaper. The problem? If something exciting happened at 2 PM, nobody finds out until they read the paper the next morning!
The Live TV News Way (Stream Processing): Sammy stands in front of a camera ALL DAY. The moment something happens - “BREAKING NEWS! Temperature just hit 100 degrees!” - everyone watching TV sees it RIGHT AWAY!
Real Examples:
Batch: Your report card comes once per semester - you find out your grades weeks after taking tests
Stream: A scoreboard at a basketball game - you see points the instant they’re scored!
The Sensor Squad learned: When something is URGENT (like a fire alarm), you need the “Live TV” way. When you just need a summary (like “how many steps did I take this month?”), the “Newspaper” way works fine!
2.3.2 Try This at Home!
Play the “Data Stream Game” with a friend:
Friend whispers numbers to you one at a time (that’s streaming data!)
Batch version: Write all numbers down, add them up at the end
Stream version: Keep a running total in your head as each number arrives
Which way tells you the answer faster? That’s why IoT uses stream processing for important alerts!
Minimum Viable Understanding: Streaming vs Batch Processing
Core Concept: Stream processing analyzes data continuously as it arrives (event-by-event), while batch processing accumulates data into chunks and processes them at scheduled intervals.
Why It Matters: The choice determines your system’s latency profile. A factory safety shutdown requiring <100ms response cannot wait for hourly batch jobs. Conversely, training ML models on streaming data adds unnecessary complexity when overnight batch processing suffices.
Key Takeaway: Evaluate the cost of delayed insight. If your anomaly alert loses value after 10 seconds, stream process it. If your trend analysis is equally useful whether computed at midnight or noon, batch process it and save on infrastructure complexity. As a rule of thumb: stream processing adds 3-5x infrastructure complexity over batch, so the latency benefit must justify the cost.
2.4 Batch vs Stream Processing Comparison
Figure 2.1: Comparison of batch vs stream processing data flows. Batch processing collects data in storage, processes on schedule, and outputs to a results store. Stream processing handles events continuously with immediate processing, windowed aggregations, and real-time outputs.
Key Differences:
| Aspect | Batch Processing | Stream Processing |
|--------|------------------|-------------------|
| Latency | Hours to days | Milliseconds to seconds |
| Data Model | Complete, bounded datasets | Unbounded, continuous events |
| Processing | All at once, scheduled | Event-by-event, continuous |
| Use Case | Historical analysis, ML training | Real-time alerts, dashboards |
| Cost Model | Pay per job execution | Always-on infrastructure |
2.5 Chapter Overview
Figure 2.2: Chapter learning path showing four sequential phases: Foundations (stream vs batch, event time, windowing), Platforms (Kafka, Flink, Spark), Implementation (pipelines, late data, exactly-once, backpressure), and Practice (labs and games).
2.6 Chapter Contents
This chapter is organized into the following sections:
Core concepts including batch vs stream processing, event time vs processing time, and windowing strategies (tumbling, sliding, session windows). Includes interactive pipeline demo and worked examples.
Interactive Data Stream Challenge game with 3 difficulty levels testing windowing, CEP, and anomaly detection skills. Plus chapter summary and next steps.
2.7 Windowing Strategies Visual Guide
Figure 2.3: Visual comparison of three windowing strategies. Tumbling windows are fixed, non-overlapping intervals. Sliding windows overlap based on slide interval. Session windows group events by activity with gaps causing new sessions.
Windowing Strategy Selection Guide
Tumbling Windows (Fixed, non-overlapping):
Use for: Periodic aggregations like “events per minute”
Example: Count sensor readings every 5 minutes
Sliding Windows (Overlapping):
Use for: Continuous monitoring with smooth results
Example: Moving average temperature over last 5 minutes, updated every 30 seconds
Session Windows (Activity-based):
Use for: User interactions, intermittent activity
Example: Group user clicks with a 30-second timeout between sessions
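The session-window rule can be sketched as gap-based grouping (a simplified illustration using the 30-second timeout from the example; real engines maintain this state per key):

```python
# Gap-based session grouping: events separated by more than `gap`
# seconds start a new session (30 s timeout, as in the example).
def session_windows(timestamps, gap=30):
    sessions, current = [], []
    for t in sorted(timestamps):
        if current and t - current[-1] > gap:
            sessions.append(current)   # gap exceeded: close the session
            current = []
        current.append(t)
    if current:
        sessions.append(current)
    return sessions

# Clicks at 0-10 s, silence, then a burst at 100-105 s -> two sessions
print(session_windows([0, 5, 10, 100, 102, 105]))
# -> [[0, 5, 10], [100, 102, 105]]
```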
Interactive Calculator: Window Size Trade-offs
Explore how window size and slide interval affect detection latency and computational cost.
viewof windowSizeMin = Inputs.range([1, 30], {value: 5, step: 1, label: "Window size (minutes)"})
viewof slideIntervalSec = Inputs.range([5, 300], {value: 30, step: 5, label: "Slide interval (seconds)"})
viewof sensorCount = Inputs.range([10, 10000], {value: 1000, step: 10, label: "Number of sensors"})
viewof readingsPerSec = Inputs.range([0.1, 10], {value: 1, step: 0.1, label: "Readings per sensor per second"})
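The same trade-offs can be computed directly (an illustrative model, using the calculator's default inputs: each slide triggers one evaluation over sensors × rate × window-size events, and results lag real time by at most one slide interval):

```python
# Rough model of the window-size trade-offs (illustrative formulas).
def window_tradeoffs(window_size_min=5, slide_interval_sec=30,
                     sensor_count=1000, readings_per_sec=1.0):
    window_sec = window_size_min * 60
    return {
        # Events touched by each window evaluation
        "events_per_window": sensor_count * readings_per_sec * window_sec,
        # How often a new result is produced
        "evaluations_per_min": 60 / slide_interval_sec,
        # A reading is reflected at the next window close, at most one slide later
        "worst_case_added_latency_sec": slide_interval_sec,
    }

print(window_tradeoffs())
# Defaults: 300,000 events per window, 2 evaluations/min, <= 30 s added latency
```

Shrinking the slide interval lowers detection latency but multiplies the evaluation rate, which is exactly the trade-off the sliders expose.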
Common Mistake: Not Accounting for Event-Time Skew
What practitioners do wrong: They use processing time (when the event arrives at the streaming system) instead of event time (when the event actually occurred) for windowing.
Why it fails: Network delays, device buffering, and batch uploads cause event-time skew. A sensor reading from 10:45:30 might arrive at the server at 10:47:15 due to intermittent connectivity. If you use processing time for a “10:45-10:46 window,” this event gets assigned to the wrong window (10:47-10:48), skewing aggregations.
Correct approach:
Use event-time windows based on the timestamp embedded in the event payload
Configure watermarks to track event-time progress (e.g., “events may arrive up to 2 minutes out of order” sets watermark = max observed event time - 2 minutes)
Set allowed lateness (grace period) for late-arriving events (e.g., 1 minute after watermark)
Emit window results when watermark passes window end + grace period
Real-world consequence: A 2018 smart city traffic monitoring system used processing time and reported “rush hour at 3 AM” because 3-hour-old traffic camera data uploaded in a batch at 3 AM. The city deployed wrong traffic light timings for 6 weeks. After switching to event-time windowing with 5-minute watermarks, congestion metrics matched ground truth within 2%.
2.9 Knowledge Check
Test your understanding of stream processing concepts with these questions.
Question 1: Stream vs Batch Processing Selection
A manufacturing plant needs to analyze sensor data from production equipment. The analysis generates weekly reports for management showing trends in equipment performance. Which processing paradigm is most appropriate?
Stream processing with tumbling windows
Batch processing with scheduled jobs
Stream processing with session windows
Real-time streaming with sub-second latency
Answer and Explanation
Correct Answer: B) Batch processing with scheduled jobs
Why this is correct:
Weekly reports have no real-time urgency - the value doesn’t degrade over hours or days
Management trend analysis requires complete data context, not partial streaming results
Batch processing is simpler to implement and more cost-effective for scheduled reports
Processing can run during off-peak hours, reducing infrastructure costs
Why the other options are incorrect:
A) Stream processing with tumbling windows: Unnecessary complexity for weekly reports. Streaming is overkill when you don’t need sub-minute insights.
C) Stream processing with session windows: Session windows are designed for activity-based grouping (user sessions), not time-based reports.
D) Real-time streaming with sub-second latency: Paying for always-on streaming infrastructure when weekly batches suffice wastes resources.
Key Principle: “Ask yourself: Does the value of this insight decrease over time?” For weekly management reports, a day-old analysis is just as valuable as a minute-old one.
Question 2: Windowing Strategy Selection
An IoT system monitors patient vital signs in a hospital. You need to detect sustained abnormal heart rates - specifically, an average heart rate above 120 BPM over any 5-minute period. Which windowing strategy is most appropriate?
Tumbling windows of 5 minutes
Sliding windows of 5 minutes with 1-minute slide
Session windows with 5-minute timeout
Global windows with no boundaries
Answer and Explanation
Correct Answer: B) Sliding windows of 5 minutes with 1-minute slide
Why this is correct:
Sliding windows provide continuous monitoring - you check every minute whether the last 5 minutes were abnormal
A patient’s heart rate could spike at minute 3 of a tumbling window - with tumbling windows, you’d only detect it at minute 5
With sliding windows (5-min window, 1-min slide), a new 5-minute average is evaluated every minute, so an alert fires within 1 minute of the first moment the 5-minute average crosses the threshold
For patient safety, earlier detection is critical
Why the other options are incorrect:
A) Tumbling windows of 5 minutes: These are non-overlapping. If abnormal readings span minutes 3-8, a tumbling window [0-5] might show normal, and window [5-10] might also show normal, missing the alert entirely!
C) Session windows with 5-minute timeout: Session windows are for grouping activity bursts (like user clicks). Heart rate monitoring is continuous, not session-based.
D) Global windows: Without boundaries, you’d accumulate all data forever, making the “last 5 minutes” calculation impossible.
Key Principle: Use sliding windows when you need smooth, continuous monitoring without gaps. The overlap ensures no critical event falls between window boundaries.
Question 3: Late Data Handling
Your IoT sensor network has devices that occasionally experience network delays of up to 5 minutes. You’re calculating 10-minute tumbling window aggregations and emitting results eagerly (as the watermark advances). A sensor reading with event time 10:08:30 arrives at processing time 10:14:00 – after the watermark has already passed 10:10:00 and the [10:00-10:10] window result was emitted. How should a stream processor handle this?
Drop the late event since the window has already closed
Assign it to the [10:10-10:20] window based on its processing time
Buffer all events until every possible late arrival has been received
Update the already-emitted [10:00-10:10] result using watermarks and allowed lateness
Answer and Explanation
Correct Answer: D) Update the already-emitted [10:00-10:10] result using watermarks and allowed lateness
Why this is correct:
The event (event time 10:08:30) belongs in the [10:00-10:10] window, but arrived 5.5 minutes late at processing time 10:14:00
The watermark already passed 10:10, so the window was considered “complete” and its result was emitted
Allowed lateness (e.g., 10 minutes) lets the system accept late events and update previously emitted results
The aggregation result is refined as late events arrive, improving accuracy over time
Why the other options are incorrect:
A) Drop the late event: Losing data is unacceptable. A 5-minute delay is common in IoT networks with intermittent connectivity, and shouldn’t cause data loss.
B) Assign it to the [10:10-10:20] window based on processing time: This confuses processing time (when the event arrived: 10:14) with event time (when the event occurred: 10:08:30). The event belongs in [10:00-10:10] based on event time, not [10:10-10:20] based on arrival time. Using processing time would produce incorrect analytics.
C) Buffer all events until every possible late arrival has been received: This defeats the purpose of stream processing. You’d wait indefinitely since you can never be certain no more late events will arrive.
Key Principle: In stream processing, distinguish event time (when it happened) from processing time (when it arrived). Use watermarks to track event-time progress and allowed lateness to accept late events, balancing completeness against latency.
Question 4: Platform Selection
You’re designing a streaming system for a smart city that needs to:
Ingest sensor data from 10,000+ traffic sensors
Perform complex event pattern detection (e.g., detect traffic jam propagation)
Maintain exactly-once processing guarantees
Scale horizontally as more sensors are added
Which platform combination is most appropriate?
Apache Kafka only
Apache Flink only
Apache Kafka + Apache Flink
Spark Structured Streaming only
Answer and Explanation
Correct Answer: C) Apache Kafka + Apache Flink
Why this is correct:
Kafka excels at high-throughput ingestion and durable event storage - perfect for 10,000+ sensors
Flink provides best-in-class Complex Event Processing (CEP) for detecting traffic jam patterns
Flink has native exactly-once semantics with checkpointing
This combination is the de-facto standard for production IoT streaming at scale
Why the other options are less suitable:
A) Apache Kafka only: Kafka Streams is good for simpler transformations but lacks Flink’s sophisticated CEP library for complex pattern matching.
B) Apache Flink only: While Flink can ingest data directly, adding Kafka provides durability, replay capability, and decouples ingestion from processing.
D) Spark Structured Streaming only: Spark uses micro-batching with minimum latencies of ~100ms. For detecting rapidly propagating traffic jams, Flink’s true event-by-event processing provides lower latency.
Key Principle: In production IoT streaming, Kafka + Flink is a powerful combination - Kafka as the durable event backbone, Flink as the processing engine. Choose Spark when you need unified batch/stream processing with existing Spark infrastructure.
Question 5: Backpressure Management
Your streaming pipeline normally processes 10,000 events/second. During a traffic spike, the input rate jumps to 50,000 events/second, overwhelming your aggregation operators. What is the correct backpressure response?
Drop events that exceed capacity to maintain latency SLAs
Signal upstream to slow down while buffering what you can
Automatically scale to 5x capacity immediately
Switch to batch processing mode until the spike passes
Answer and Explanation
Correct Answer: B) Signal upstream to slow down while buffering what you can
Why this is correct:
Backpressure is a flow control mechanism that propagates “slow down” signals upstream when downstream operators are overwhelmed
Buffering provides temporary capacity while the system stabilizes
This preserves data integrity - no events are lost
It’s the fundamental pattern implemented by Flink, Kafka, and other production streaming systems
Why the other options are problematic:
A) Drop events: Data loss is almost never acceptable in IoT systems. Lost sensor readings mean incomplete analysis and potentially missed critical alerts.
C) Automatically scale to 5x capacity: Auto-scaling helps long-term but takes minutes. Backpressure handles second-by-second fluctuations. Also, scaling 5x for a brief spike is expensive.
D) Switch to batch processing: Switching processing paradigms mid-stream is architecturally complex and would disrupt real-time analytics.
Key Principle: Backpressure is a fundamental reliability mechanism in streaming systems. It’s not a failure - it’s the correct response to temporary overload. Monitor backpressure metrics to know when to scale.
2.10 Summary
Stream processing is fundamental infrastructure for IoT systems requiring real-time insights and actions. This chapter covered:
| Topic | Key Takeaway |
|-------|--------------|
| Batch vs Stream | Choose stream when insight value degrades over time; batch when complete data context is needed |
| Windowing | Tumbling for periodic reports, sliding for smooth trends, session for activity bursts |
| Platforms | Kafka for event streaming backbone, Flink for complex event processing, Spark for unified batch/stream |
| Late Data | Watermarks + grace periods balance completeness against latency |
| Exactly-Once | Requires idempotent processing or transactional commits |
| Production | Backpressure, checkpointing, and monitoring are essential for reliability |
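The exactly-once takeaway can be illustrated with an idempotent consumer that deduplicates on event IDs (a minimal sketch; production systems typically combine this with transactional sinks or engine checkpointing):

```python
# Idempotent consumer: effective exactly-once by deduplicating event IDs,
# so an event redelivered after a failure is not double-counted.
class IdempotentConsumer:
    def __init__(self):
        self.seen_ids = set()
        self.total = 0.0

    def consume(self, event):
        if event["id"] in self.seen_ids:
            return False               # duplicate delivery: skip
        self.seen_ids.add(event["id"])
        self.total += event["value"]
        return True

c = IdempotentConsumer()
c.consume({"id": "e1", "value": 10.0})
c.consume({"id": "e1", "value": 10.0})   # redelivered after a retry
print(c.total)  # 10.0, not 20.0
```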
Key Decision Framework
When to use stream processing:
Safety-critical alerts requiring <1 second response
Real-time dashboards and operational visibility
Fraud/anomaly detection where delay means damage
Event-driven automation and control systems
When batch is sufficient:
Historical trend analysis and reporting
ML model training on complete datasets
Regulatory reports with fixed deadlines
Cost optimization (pay-per-job vs always-on)
Hybrid (Lambda Architecture): Most production IoT systems use both - stream for real-time approximations, batch for accurate historical corrections.
Try It: Tumbling vs Sliding Window Aggregation
Objective: Implement both tumbling and sliding window aggregations on streaming sensor data and compare how they handle boundary events.
import random

# Simulate streaming temperature data (1 reading per second for 60 seconds)
random.seed(42)
stream = []
for i in range(60):
    temp = 22.0 + 0.1 * i          # gradual increase
    if 25 <= i <= 30:              # spike event at seconds 25-30
        temp += 8.0
    temp += random.gauss(0, 0.3)
    stream.append({"time": i, "temp": round(temp, 2)})

# Tumbling Window: non-overlapping 10-second windows
print("=== Tumbling Window (10s, non-overlapping) ===")
for start in range(0, 60, 10):
    window = [s["temp"] for s in stream if start <= s["time"] < start + 10]
    avg = sum(window) / len(window)
    mx = max(window)
    print(f"  [{start:2d}-{start+10:2d}s] avg={avg:.1f}C, max={mx:.1f}C"
          + ("  <-- SPIKE" if mx > 28 else ""))

# Sliding Window: 10-second window, sliding every 2 seconds
print("\n=== Sliding Window (10s window, 2s slide) ===")
for start in range(0, 52, 2):
    window = [s["temp"] for s in stream if start <= s["time"] < start + 10]
    if not window:
        continue
    avg = sum(window) / len(window)
    mx = max(window)
    print(f"  [{start:2d}-{start+10:2d}s] avg={avg:.1f}C, max={mx:.1f}C"
          + ("  <-- SPIKE" if mx > 28 else ""))

print("\nKey difference:")
print("  Tumbling: spike may straddle window boundary (split detection)")
print("  Sliding: overlapping windows catch spike regardless of timing")
What to Observe:
Tumbling windows are simpler but may split events across boundaries
Sliding windows overlap, catching events regardless of when they occur
Sliding windows produce more output (one result per slide) but are more responsive
For alert detection, sliding windows reduce worst-case detection latency
Try It: Stream Processing Pipeline with Backpressure
Objective: Build a simple stream processing pipeline demonstrating how producers, processors, and consumers interact, including backpressure handling.
import random
from collections import deque

class StreamPipeline:
    """Simplified stream processing pipeline with backpressure"""

    def __init__(self, buffer_size=20):
        self.buffer = deque(maxlen=buffer_size)
        self.processed = 0
        self.dropped = 0
        self.alerts = 0

    def produce(self, readings):
        """Ingest sensor readings into buffer"""
        for r in readings:
            if len(self.buffer) >= self.buffer.maxlen:
                self.dropped += 1      # backpressure: drop oldest
            self.buffer.append(r)

    def process(self, n=5):
        """Process up to n readings from buffer"""
        results = []
        for _ in range(min(n, len(self.buffer))):
            reading = self.buffer.popleft()
            self.processed += 1
            if reading["temp"] > 30:
                results.append({"alert": "HIGH_TEMP",
                                "value": reading["temp"],
                                "device": reading["device"]})
                self.alerts += 1
            results.append({"type": "processed", "value": reading["temp"]})
        return results

    def status(self):
        return (f"Buffer: {len(self.buffer)}/{self.buffer.maxlen} | "
                f"Processed: {self.processed} | Dropped: {self.dropped}")
Simulation driver (run after defining the class above):
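A minimal driver sketch follows (the spike rates and phase lengths are illustrative assumptions; a compact copy of StreamPipeline is included so the snippet runs on its own):

```python
# Minimal driver sketch: a load spike overwhelms a fixed-capacity
# processor. Spike rates and phase lengths are illustrative assumptions;
# the class below is a compact copy of StreamPipeline so this runs alone.
import random
from collections import deque

class StreamPipeline:
    def __init__(self, buffer_size=20):
        self.buffer = deque(maxlen=buffer_size)
        self.processed = self.dropped = self.alerts = 0

    def produce(self, readings):
        for r in readings:
            if len(self.buffer) >= self.buffer.maxlen:
                self.dropped += 1          # buffer full: oldest is evicted
            self.buffer.append(r)

    def process(self, n=5):
        results = []
        for _ in range(min(n, len(self.buffer))):
            reading = self.buffer.popleft()
            self.processed += 1
            if reading["temp"] > 30:
                self.alerts += 1
                results.append({"alert": "HIGH_TEMP", "value": reading["temp"],
                                "device": reading["device"]})
            results.append({"type": "processed", "value": reading["temp"]})
        return results

    def status(self):
        return (f"Buffer: {len(self.buffer)}/{self.buffer.maxlen} | "
                f"Processed: {self.processed} | Dropped: {self.dropped}")

random.seed(7)
pipeline = StreamPipeline(buffer_size=20)
for step in range(12):
    rate = 15 if 4 <= step <= 7 else 3     # steps 4-7 simulate a load spike
    readings = [{"device": f"sensor-{i}", "temp": round(random.uniform(20, 35), 1)}
                for i in range(rate)]
    pipeline.produce(readings)
    results = pipeline.process(n=5)        # fixed processing capacity
    n_alerts = sum(1 for r in results if "alert" in r)
    print(f"step {step:2d}: {pipeline.status()} | alerts this step: {n_alerts}")
```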
What to Observe:
Overload: when input outpaces processing capacity, the buffer fills and the oldest readings are dropped
Recovery: pipeline drains the buffer as load decreases
Alerts are generated in real-time even during high load
This is a simplified version of what Kafka + Flink do at scale
Worked Example: E-Commerce Fraud Detection Pipeline
An online marketplace processes 50,000 transactions/minute and needs to detect potential fraud in real-time. The system uses Apache Kafka + Flink to implement a multi-stage streaming pipeline.
CEP Pattern Matching: Detect sequence: high-value purchase → immediate address change → second purchase <30 min
Alerting: Suspicious transactions route to fraud review queue
Sample Calculation: User ID 12345 makes 3 purchases in 5 minutes totaling $1,200. Historical average spending: $240 per 5-minute window (roughly 3 transactions at $80 each). Velocity alert threshold: 5x historical average = 5 x $240 = $1,200/5-min window. Current velocity ($1,200) meets the threshold, triggering review. In practice, thresholds are often set lower (e.g., 3x) to catch fraud earlier.
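The velocity rule in the sample calculation can be expressed in a few lines (a sketch; the 5x multiplier is the example's threshold, with lower values such as 3x used in practice):

```python
# Velocity rule from the sample calculation: flag a user when spending in
# the current window reaches `multiplier` x their historical average.
def velocity_alert(window_total, historical_avg, multiplier=5.0):
    threshold = multiplier * historical_avg
    return window_total >= threshold, threshold

# User 12345: $1,200 in 5 minutes vs a $240 historical 5-minute average
triggered, threshold = velocity_alert(window_total=1200.0, historical_avg=240.0)
print(triggered, threshold)  # True 1200.0
```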
Backpressure Handling: During Black Friday traffic spike (150,000 events/min), Kafka buffers excess events while Flink processes at max capacity (80,000/min). Backpressure signals upstream producers to slow publishing rate. Window results may lag by 2-3 minutes but all events are eventually processed without data loss.
Production Metrics: 99.2% of fraud caught within 2 minutes, 0.3% false positive rate, $4.2M prevented losses in 2023.
Concept Relationships
Stream processing for IoT connects to several related concepts:
Foundational: Big Data Overview - Understanding data volume and velocity that drive streaming needs