5  Building IoT Streaming Pipelines

| Chapter | Topic |
|---|---|
| Stream Processing | Overview of batch vs. stream processing for IoT |
| Fundamentals | Windowing strategies, watermarks, and event-time semantics |
| Architectures | Kafka, Flink, and Spark Streaming comparison |
| Pipelines | End-to-end ingestion, processing, and output design |
| Challenges | Late data, exactly-once semantics, and backpressure |
| Pitfalls | Common mistakes and production worked examples |
| Basic Lab | ESP32 circular buffers, windows, and event detection |
| Advanced Lab | CEP, pattern matching, and anomaly detection on ESP32 |
| Game & Summary | Interactive review game and module summary |
| Interoperability | Four levels of IoT interoperability |
| Interop Fundamentals | Technical, syntactic, semantic, and organizational layers |
| Interop Standards | SenML, JSON-LD, W3C WoT, and oneM2M |
| Integration Patterns | Protocol adapters, gateways, and ontology mapping |

Learning Objectives

After completing this chapter, you will be able to:

  • Design a three-stage IoT streaming pipeline covering ingestion, processing, and output
  • Implement protocol translation and schema validation in the ingestion layer to normalize heterogeneous IoT data sources
  • Build windowed aggregation and anomaly detection logic using Apache Flink SQL
  • Configure multi-sink output routing that directs critical alerts, metrics, and historical data to appropriate destinations
  • Analyze real-world streaming architectures such as Netflix’s 8-million-events-per-second pipeline

A stream processing pipeline is a series of automated steps that data flows through in real time. Think of a water treatment plant where water passes through filters, purifiers, and quality checks continuously. Similarly, IoT data flows through stages that clean it, analyze it, and trigger alerts – all happening automatically and instantly.

In 60 Seconds

A production IoT streaming pipeline has three stages: ingestion (protocol translation from MQTT/CoAP/HTTP to a unified Kafka format, partitioned by sensor ID for ordered processing), stream processing (parsing, filtering, enrichment, windowed aggregation, and anomaly detection in Flink or Kafka Streams), and output (routing critical alerts to PagerDuty, metrics to time-series databases, and historical data to data lakes). Netflix processes 8 million events per second through this architecture to deliver sub-second recommendation updates.

Key Concepts
  • IoT Streaming Pipeline: End-to-end data flow from IoT sensors through ingestion, processing, enrichment, and storage to consuming applications
  • Ingestion Layer: First pipeline stage receiving raw sensor data from MQTT, HTTP, or direct connection — typically Kafka, Kinesis, or Pub/Sub acting as a durable buffer
  • Processing Layer: Middle pipeline stage applying transformations (filtering, aggregation, enrichment, anomaly detection) using stream processors (Flink, Spark, Kafka Streams)
  • Serving Layer: Final pipeline stage delivering processed results to dashboards, APIs, databases, or downstream systems for consumption
  • Schema Registry: Centralized service managing IoT message schemas (Avro, Protobuf, JSON Schema) ensuring producers and consumers use compatible schemas
  • Sink Connector: Component writing processed streaming results to storage systems (InfluxDB, Cassandra, S3, Elasticsearch) with configurable batching and error handling
  • Pipeline Observability: Monitoring throughput, latency, error rates, and consumer lag at each pipeline stage to detect bottlenecks and data quality issues
  • Dead Letter Topic: Kafka topic receiving events that fail processing (schema validation errors, transformation exceptions) for inspection and reprocessing without blocking the main pipeline
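The dead-letter concept above can be sketched without a broker. This is a minimal illustration — the `route_event` helper and topic names are hypothetical — showing how malformed payloads get diverted to a side topic with error context instead of blocking the main pipeline:

```python
# Sketch: route events that fail validation to a dead-letter topic
# instead of failing the pipeline. Topic names are illustrative.
import json

REQUIRED_FIELDS = {"sensor_id", "temperature", "timestamp"}

def route_event(raw: bytes) -> tuple[str, dict]:
    """Return (topic, payload): valid events go to the main topic,
    malformed ones to the dead-letter topic with error context."""
    try:
        event = json.loads(raw)
        if not isinstance(event, dict):
            raise ValueError("payload is not a JSON object")
        missing = REQUIRED_FIELDS - event.keys()
        if missing:
            raise ValueError(f"missing fields: {sorted(missing)}")
        return "iot-raw-events", event
    except ValueError as exc:  # JSONDecodeError is a ValueError subclass
        # Preserve the original bytes so the event can be reprocessed later
        return "iot-dead-letter", {"error": str(exc),
                                   "raw": raw.decode(errors="replace")}
```

In production the second return value would be produced to the dead-letter Kafka topic for inspection and replay.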

5.1 Building IoT Streaming Pipelines

⏱️ ~15 min | ⭐⭐⭐ Advanced | 📋 P10.C14.U04

A complete IoT streaming pipeline transforms raw sensor data into actionable insights through multiple processing stages. This section covers the design and implementation of production-ready streaming pipelines.

In the interactive version of this section, you can:

  • Drag and drop pipeline components (sources, processors, sinks) to assemble a working IoT streaming pipeline.
  • Watch data flow through each pipeline stage in real time, with throughput and latency metrics at every node.
  • Explore how pipeline architecture changes when scaling from thousands to millions of events per second.

5.1.1 Pipeline Architecture

[Architecture diagram: a three-stage IoT streaming pipeline — ingestion layer receiving data from edge devices via MQTT, CoAP, and HTTP into Kafka; processing layer performing enrichment, windowed aggregation, and anomaly detection in Flink; output layer routing results to dashboards, databases, and alerting systems]

Figure 5.1: Complete IoT streaming pipeline from edge devices through ingestion, enrichment, aggregation, and output stages

5.1.2 Stage 1: Data Ingestion

The ingestion layer handles receiving data from diverse IoT sources and normalizing it for processing.

Design Principles:

  1. Protocol Translation: Convert MQTT, CoAP, HTTP into unified message format
  2. Schema Validation: Validate incoming data against defined schemas
  3. Partitioning: Route messages to appropriate partitions for parallel processing
  4. Backpressure: Handle traffic spikes without data loss
# Kafka producer with IoT-optimized configuration
import json
import logging
from confluent_kafka import Producer

logger = logging.getLogger(__name__)

producer_config = {
    'bootstrap.servers': 'kafka:9092',
    'client.id': 'iot-gateway',
    'acks': 'all',  # Wait for all replicas
    'retries': 10,  # Retry on transient failures
    'batch.size': 65536,  # 64KB batches for efficiency
    'linger.ms': 5,  # Wait up to 5ms for batching
    'compression.type': 'lz4',  # Fast compression
}

def on_delivery(err, msg):
    if err:
        logger.error(f"Delivery failed: {err}")
        # Implement retry or dead-letter queue

producer = Producer(producer_config)

def publish_sensor_reading(sensor_id, reading):
    # Partition by sensor_id for ordered per-sensor processing
    producer.produce(
        topic='iot-raw-events',
        key=sensor_id.encode(),
        value=json.dumps(reading).encode(),
        callback=on_delivery
    )
    producer.poll(0)  # Trigger delivery callbacks
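Design principle 4 (backpressure) deserves its own illustration. The sketch below uses a bounded in-memory buffer as a stand-in for the Kafka client's internal queue — the helper name is hypothetical — so that a saturated pipeline gives the caller an explicit signal rather than silently dropping readings:

```python
# Sketch: backpressure via a bounded buffer. A full buffer returns a
# signal so the caller can slow down, shed load, or spill to local storage.
import queue

def enqueue_reading(buf: queue.Queue, reading: dict, timeout: float = 0.1) -> bool:
    """Try to buffer a reading; False means downstream is saturated."""
    try:
        buf.put(reading, timeout=timeout)
        return True
    except queue.Full:
        return False
```

The real confluent_kafka producer behaves similarly: `produce()` raises `BufferError` when its internal queue is full, which the gateway should handle the same way.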

5.1.3 Stage 2: Stream Processing

The processing layer applies transformations, aggregations, and analytics to the data stream.

Common Operations:

  1. Parsing: Extract structured data from raw messages
  2. Filtering: Remove invalid or irrelevant events
  3. Enrichment: Add metadata, geolocation, device info
  4. Aggregation: Compute statistics over windows
  5. Alerting: Detect anomalies and threshold violations
# Flink processing pipeline
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import StreamTableEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
t_env = StreamTableEnvironment.create(env)

# Define source table from Kafka
t_env.execute_sql("""
    CREATE TABLE sensor_readings (
        sensor_id STRING,
        temperature DOUBLE,
        humidity DOUBLE,
        `timestamp` TIMESTAMP(3),
        WATERMARK FOR `timestamp` AS `timestamp` - INTERVAL '5' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'iot-raw-events',
        'properties.bootstrap.servers' = 'kafka:9092',
        'properties.group.id' = 'flink-sensor-aggregation',
        'scan.startup.mode' = 'latest-offset',
        'format' = 'json'
    )
""")

# Tumbling window aggregation
result = t_env.sql_query("""
    SELECT
        sensor_id,
        TUMBLE_START(`timestamp`, INTERVAL '5' MINUTE) AS window_start,
        AVG(temperature) AS avg_temp,
        MAX(temperature) AS max_temp,
        MIN(temperature) AS min_temp,
        COUNT(*) AS reading_count
    FROM sensor_readings
    GROUP BY sensor_id, TUMBLE(`timestamp`, INTERVAL '5' MINUTE)
""")
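The aggregated `result` table still needs somewhere to go. A sketch of a matching Kafka sink — the `iot-aggregates` topic name is hypothetical — with columns mirroring the query above:

```sql
-- Hypothetical sink: publish the 5-minute aggregates back to Kafka so
-- downstream consumers (dashboards, alerting) can subscribe.
CREATE TABLE sensor_aggregates (
    sensor_id STRING,
    window_start TIMESTAMP(3),
    avg_temp DOUBLE,
    max_temp DOUBLE,
    min_temp DOUBLE,
    reading_count BIGINT
) WITH (
    'connector' = 'kafka',
    'topic' = 'iot-aggregates',
    'properties.bootstrap.servers' = 'kafka:9092',
    'format' = 'json'
)
```

After registering the sink with `t_env.execute_sql(...)`, calling `result.execute_insert('sensor_aggregates')` submits the continuous job.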

5.1.4 Stage 3: Output and Actions

The output layer delivers processed results to consumers and triggers automated actions.

Output Destinations:

  1. Time-Series Databases: InfluxDB, TimescaleDB for metrics storage
  2. Data Lakes: S3, HDFS for historical analysis
  3. Real-Time Dashboards: WebSocket pushes to Grafana, custom UIs
  4. Alerting Systems: PagerDuty, Slack, SMS for critical alerts
  5. Control Systems: Actuator commands, automated responses
# Multi-sink output configuration
from pyflink.table import DataTypes
from pyflink.table.udf import udf

def send_pagerduty_alert(message: str) -> None:
    """Placeholder: call the PagerDuty Events API here."""

@udf(result_type=DataTypes.STRING())
def route_message(severity: str, message: str) -> str:
    """Tag each message with its destination sink based on severity.

    Note: triggering the alert inside a UDF is a pragmatic shortcut;
    a dedicated alerting sink is the more robust production design.
    """
    if severity == 'CRITICAL':
        send_pagerduty_alert(message)  # immediate alert
        return 'pagerduty'
    elif severity == 'WARNING':
        return 'monitoring'  # log to monitoring
    else:
        return 'timeseries'  # standard metrics storage

5.2 Interactive Pipeline Demo

Experiment with a simplified streaming pipeline:


This interactive demo simulates a real streaming pipeline with:

  • Configurable event rate: Adjust how fast sensor readings arrive
  • Windowed aggregation: Compute average temperature over configurable window sizes
  • Anomaly detection: Flag readings above the threshold
  • Real-time alerts: Display detected anomalies immediately

Use this calculator to estimate infrastructure requirements for your IoT streaming pipeline. Adjust the parameters to see how device count, reporting frequency, and message size affect throughput, storage, and Kafka cluster sizing.

5.3 Real-World Case Study: Netflix Streaming Analytics

Netflix processes over 8 million events per second from their streaming platform to provide real-time recommendations, detect playback issues, and optimize content delivery.

5.3.1 System Architecture

Scale:

  • 230+ million subscribers globally
  • 8 million events/second during peak hours
  • 1.3 petabytes of data processed daily
  • Sub-second latency for recommendation updates

How does 8 million events/second translate to infrastructure? Not all 8M events/sec are playback heartbeats – Netflix collects many event types (playback, UI interactions, error telemetry, CDN metrics, A/B test exposure logs). If we estimate that heartbeat events account for roughly half the total at 4M events/sec, with heartbeats sent every 30 seconds per active stream:

\[ \text{Concurrent streams} \approx 4{,}000{,}000\text{ heartbeats/sec} \times 30\text{ sec/heartbeat} = 120{,}000{,}000\text{ active streams} \]

This aligns with estimates of peak concurrent viewers across 230M+ subscriber accounts (multiple profiles, shared accounts, and global time-zone spread).

At 100 bytes per event across all types:

\[ 8\text{M events/sec} \times 100\text{ bytes} = 800\text{ MB/sec} = 2.88\text{ TB/hour} \approx 69\text{ TB/day raw ingestion} \]

The reported 1.3 PB/day of “processed” data includes derived streams, enrichment joins, replication across availability zones, and materialized aggregations – roughly a 19x fan-out from raw ingestion. With 20:1 compression in Kafka, raw events occupy about 3.45 TB of storage per day; Netflix keeps 7 days hot (~24 TB) plus downsampled history.

Kafka cluster capacity at a conservative 50K events/sec/broker:

\[ \frac{8\text{M events/sec}}{50{,}000\text{ events/sec/broker}} \approx 160\text{ brokers} \]

as a minimum for throughput; production clusters use replication factor 3, so approximately 480 broker replicas.
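These back-of-envelope figures can be reproduced in a few lines (same assumptions: decimal units, 20:1 compression, 50K events/sec/broker, replication factor 3):

```python
# Reproducing the capacity estimates above (decimal units: 1 TB = 1e12 bytes).
EVENTS_PER_SEC = 8_000_000
BYTES_PER_EVENT = 100

raw_tb_per_day = EVENTS_PER_SEC * BYTES_PER_EVENT * 86_400 / 1e12  # ~69.1 TB/day
stored_tb_per_day = raw_tb_per_day / 20                            # 20:1 compression
hot_tb = stored_tb_per_day * 7                                     # 7-day hot retention
brokers = EVENTS_PER_SEC // 50_000                                 # throughput minimum
broker_replicas = brokers * 3                                      # replication factor 3
```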

Pipeline:

  1. Event Collection: Client devices (TV, mobile, browser) send playback events every 30 seconds
    • Play, pause, seek, error, buffering events
    • Device capabilities, network conditions
  2. Real-time Processing (Apache Flink):
    • Detect playback failures within 5 seconds
    • Calculate real-time popularity metrics
    • Update recommendation models continuously
  3. Windowed Aggregations:
    • 1-minute tumbling windows for error rates by region
    • 10-minute sliding windows for trending content
    • Session windows for viewing sessions (gap timeout: 30 minutes)
  4. Outputs:
    • Real-time dashboards for operations teams
    • A/B testing metrics for product features
    • Content delivery network (CDN) optimization
    • Personalized homepage updated within 1 second

Key Challenges Solved:

  • Backpressure during outages: When CDN fails, millions of error events flood the pipeline. Kafka buffering prevents data loss.
  • Global clock skew: Devices in different timezones and with inaccurate clocks require watermark strategies with 5-minute lag.
  • Exactly-once for billing: View duration metrics must be exact for royalty payments to content creators.

Results:

  • 99.99% uptime for streaming analytics
  • $1 billion annual savings through network optimization
  • 80% of viewing driven by recommendation engine

5.4 Understanding Check

You’re designing a stream processing system for a smart grid utility with the following requirements:

  • Scale: 100,000 smart meters
  • Frequency: Each meter reports power consumption every 15 seconds
  • Detection: Identify power theft within 1 minute of occurrence
  • Theft Pattern: Consumption drops by >50% compared to historical average but meter remains connected

Question 1: What is your event rate?

Solution:

  • 100,000 meters x (1 reading / 15 seconds) = 6,667 events per second
  • Peak rate (considering network retries): ~10,000 events per second

Question 2: What windowing strategy would you use for theft detection?

Solution: Sliding windows with:

  • Window size: 1 hour (for historical average)
  • Slide interval: 15 seconds (update on each reading)
  • Rationale: Need to compare current reading against recent average, updated continuously

Alternative approach: Two windows:

  • Long tumbling window (24 hours) for baseline average
  • Short sliding window (1 minute) for current consumption
  • Alert when ratio < 0.5
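A minimal sketch of the two-window approach in plain Python — window sizes are expressed in readings, assuming the 15-second cadence, and the `TheftDetector` class is illustrative rather than production fraud logic:

```python
# Sketch: compare a 1-minute current window against a 24-hour baseline;
# flag meters whose current average drops below 0.5x the baseline.
from collections import deque

class TheftDetector:
    def __init__(self, baseline_size=5760, current_size=4, ratio=0.5):
        # 5760 = 24 h of 15-second readings; 4 readings = 1 minute
        self.baseline = deque(maxlen=baseline_size)
        self.current = deque(maxlen=current_size)
        self.ratio = ratio

    def add_reading(self, kw: float) -> bool:
        """Return True when current consumption falls below
        ratio * baseline average (possible theft)."""
        self.baseline.append(kw)
        self.current.append(kw)
        if len(self.baseline) < self.current.maxlen * 2:
            return False  # not enough history yet
        baseline_avg = sum(self.baseline) / len(self.baseline)
        current_avg = sum(self.current) / len(self.current)
        return baseline_avg > 0 and current_avg < self.ratio * baseline_avg
```

A real deployment would keep this state per meter (keyed by meter ID, as in a Flink keyed stream) and exclude legitimate zero-consumption periods such as vacancies.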

Question 3: Design a simple pipeline to detect anomalies

Solution:

[Pipeline diagram: smart meters feeding into Kafka ingestion, Flink processing with sliding-window comparison against a historical baseline, and output routing to the fraud alert system and a time-series database]

Technology Choice: Apache Flink. Reason: sophisticated event-time windowing, exactly-once semantics critical for fraud detection, and sub-second latency.

5.5 Worked Example: Fleet Telematics Streaming Pipeline

Worked Example: Designing a Real-Time Pipeline for 50,000 Connected Vehicles

Scenario: Geotab, a fleet management company, processes telematics data from 50,000 commercial vehicles (delivery trucks, buses, service vans) across the UK. Each vehicle has an OBD-II dongle reporting GPS, speed, fuel level, engine RPM, and diagnostic trouble codes. The pipeline must detect speeding violations in real-time (for regulatory compliance) and compute daily fuel efficiency reports.

Given:

  • 50,000 vehicles, each reporting every 2 seconds while engine is running
  • Average 10 hours/day engine-on per vehicle = 18,000 readings/vehicle/day
  • Peak hours (7-9am, 4-6pm): 80% of fleet active = 40,000 vehicles x 0.5 msg/sec = 20,000 messages/second
  • Message size: 180 bytes (JSON: vehicle ID, timestamp, lat/lon, speed km/h, fuel %, RPM, DTC codes)
  • Peak data rate: 20,000 x 180 bytes = 3.6 MB/second = 28.8 Mbps
  • Speeding alert latency: <5 seconds from violation to fleet manager notification
  • Daily report: Fuel efficiency (km/L) per vehicle, idle time, distance traveled
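The given figures check out with quick arithmetic (decimal units assumed):

```python
# Verifying the fleet sizing numbers above.
VEHICLES_PEAK = 40_000           # 80% of the 50,000-vehicle fleet
MSG_PER_SEC_PER_VEHICLE = 0.5    # one report every 2 seconds
MSG_BYTES = 180

peak_msgs = VEHICLES_PEAK * MSG_PER_SEC_PER_VEHICLE   # 20,000 msg/s
peak_mb_per_sec = peak_msgs * MSG_BYTES / 1e6         # 3.6 MB/s
peak_mbps = peak_mb_per_sec * 8                       # 28.8 Mbps
readings_per_vehicle_day = 10 * 3600 // 2             # 18,000 readings/day
```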

Step 1 – Design the ingestion layer:

| Component | Technology | Configuration | Why |
|---|---|---|---|
| Protocol bridge | MQTT broker (EMQX) | 50,000 persistent connections, TLS | Vehicles use MQTT over cellular |
| Message queue | Apache Kafka (3 brokers) | 12 partitions, keyed by vehicle_id | Ordered per-vehicle processing, 7-day retention |
| Schema registry | Confluent Schema Registry | Avro schema with backward compatibility | Validate incoming messages, handle schema evolution |

Kafka partitioning strategy: partition = hash(vehicle_id) % 12. This ensures all messages from one vehicle go to the same partition, maintaining temporal ordering per vehicle (critical for speed calculations).
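A sketch of that partitioner in Python. Note that the built-in `hash()` is randomized per process for strings, so a deterministic hash is needed (CRC-32 here; Kafka's Java client defaults to murmur2) to keep a vehicle pinned to the same partition across producer restarts:

```python
# Stable partition assignment: same vehicle_id always maps to the
# same partition, preserving per-vehicle temporal ordering.
import zlib

NUM_PARTITIONS = 12

def partition_for(vehicle_id: str) -> int:
    return zlib.crc32(vehicle_id.encode()) % NUM_PARTITIONS
```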

Step 2 – Design the processing layer:

Two Flink jobs running in parallel:

Job 1: Real-Time Speeding Detection (latency-critical)

  • Input: Raw Kafka topic
  • Processing: Compare speed to road speed limit (enriched from HERE Maps API, cached locally)
  • Window: Event-time tumbling window of 5 seconds per vehicle
  • Rule: If average speed in window exceeds limit by >10%, emit alert
  • Output: PagerDuty webhook for fleet managers, speeding_alerts Kafka topic
  • Latency budget: 2s Kafka + 1s Flink + 1s PagerDuty = 4 seconds end-to-end

Job 2: Daily Fuel Efficiency Aggregation (accuracy-critical)

  • Input: Same raw Kafka topic (consumer group isolation)
  • Processing: Per-vehicle daily aggregation
  • Windows: Event-time tumbling window of 24 hours, with watermark tolerance of 30 minutes (late cellular reports)
  • Computations per vehicle per day:
    • Total distance: sum of haversine distances between consecutive GPS points
    • Total fuel consumed: delta between first and last fuel level readings, adjusted for refueling events
    • Idle time: count of readings where RPM > 600 and speed = 0
    • Fuel efficiency: distance / fuel consumed (km/L)
  • Output: PostgreSQL daily_reports table, S3 Parquet for data lake
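The total-distance computation relies on the standard haversine formula; a sketch using a mean Earth radius of 6371 km:

```python
# Great-circle distance between two GPS points (haversine formula).
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Distance in km between two (latitude, longitude) pairs in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))
```

Job 2 sums this over consecutive GPS points per vehicle; at a 2-second reporting interval the straight-line segment closely approximates the driven path.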

Step 3 – Calculate infrastructure costs:

| Component | Specification | Monthly Cost (AWS) |
|---|---|---|
| EMQX broker (3 nodes) | c5.2xlarge x 3 | GBP 780 |
| Kafka cluster (3 brokers) | r5.xlarge x 3, 2TB EBS each | GBP 1,140 |
| Flink cluster (Job 1) | c5.xlarge x 4 (parallelism=16) | GBP 520 |
| Flink cluster (Job 2) | r5.large x 2 (stateful, checkpointing) | GBP 260 |
| PostgreSQL RDS | db.r5.large, 500GB | GBP 340 |
| S3 data lake | ~4.9 TB/month (180 B x 18K x 50K x 30) | GBP 113 |
| Total monthly | | GBP 3,153 |

Per-vehicle cost: GBP 3,153 / 50,000 = GBP 0.063/vehicle/month (6.3 pence per vehicle per month for the entire streaming infrastructure).

Result: The pipeline processes 20,000 messages/second at peak with 4-second end-to-end latency for speeding alerts. Daily fuel reports are available by 01:00 the following day (30-minute watermark ensures late cellular data is captured). The 30-minute late data tolerance catches 99.7% of delayed messages. Infrastructure cost of GBP 0.063/vehicle/month is negligible compared to the GBP 45/vehicle/month subscription fee.

Key Insight: Separating the speeding detection (latency-optimized, 5-second windows, small state) from fuel reporting (accuracy-optimized, 24-hour windows, 30-minute watermark) into two independent Flink jobs is essential. A single job trying to serve both requirements would force compromises in either latency (waiting for late data) or accuracy (missing late reports). Independent jobs can be scaled, monitored, and upgraded independently.

Concept Relationships

Understanding streaming pipelines connects to several related concepts:

Contrast with: Batch processing (processes complete datasets on schedule) vs. streaming pipelines (process events continuously as they arrive)

Key Takeaway

A production IoT streaming pipeline separates concerns into three stages: ingestion (protocol normalization and schema validation), processing (windowed aggregation and anomaly detection), and output (multi-sink routing based on severity). Centralizing protocol translation in the ingestion layer means implementing each protocol once rather than in every downstream consumer. Netflix demonstrates this architecture at extreme scale, processing 8 million events/second with sub-second recommendation updates.

How does Netflix know what to recommend RIGHT NOW? The Sensor Squad builds a data pipeline!

Sammy the Sensor is amazed: “Netflix has 230 MILLION people watching shows, and they update recommendations in less than 1 SECOND!”

The Sensor Squad decides to build their own data pipeline to understand how it works. Think of it like a pizza restaurant with three stations:

Stage 1: Taking Orders (The Counter) Customers place orders in different ways – some call on the phone, some walk in, some order online. Max the Microcontroller works the counter and writes every order down on the same kind of ticket, no matter how it came in.

“It is like having one order pad for everyone,” explains Max. “Phone orders, online orders, walk-ins – they all end up as the same ticket for the kitchen!”

Stage 2: Making Pizzas (The Kitchen) Now the order tickets flow through the kitchen where different cooks handle different jobs: - Quality Cook: Throws away any ticket that does not make sense (“No, we cannot put ice cream ON a pizza”) - Prep Cook: Adds extra details (“Table 5 wants extra cheese – add that to the ticket!”) - Line Cook: Makes groups of similar orders at once for speed - Safety Cook: Checks the oven temperature and shouts “TOO HOT!” if something is burning

Lila the LED says: “It is like a kitchen assembly line – each cook does one job really well!”

Stage 3: Delivery (Getting Food Out) The finished pizzas go to different places: - Burning oven? Emergency alert to the manager RIGHT NOW! - Regular orders go to the correct table (stored neatly) - End-of-day counts go into a big report (for planning tomorrow)

Bella the Battery summarizes: “Take orders, cook them, deliver the results. That is the recipe for handling millions of data points per second!”

5.5.1 Try This at Home!

Build your own “data pipeline” with toy cars! Set up three stations: (1) Sorting station – separate cars by color. (2) Counting station – count how many of each color. (3) Delivery station – put results on different shelves. Time yourself! This is exactly what stream processing does with sensor data.

5.7 What’s Next

| If you want to… | Read this |
|---|---|
| Study the architecture patterns behind these pipelines | Stream Processing Architectures |
| Practice building pipelines in the basic lab | Lab: Stream Processing |
| Learn about streaming challenges to design for | Handling Stream Processing Challenges |
| Study common pipeline pitfalls and fixes | Common Pitfalls and Worked Examples |

Continue to Handling Real-World Challenges to learn about late data handling, exactly-once processing, backpressure management, and fault tolerance strategies.