4  Stream Processing Architectures

Chapter Topics

  • Stream Processing: Overview of batch vs. stream processing for IoT
  • Fundamentals: Windowing strategies, watermarks, and event-time semantics
  • Architectures: Kafka, Flink, and Spark Streaming comparison
  • Pipelines: End-to-end ingestion, processing, and output design
  • Challenges: Late data, exactly-once semantics, and backpressure
  • Pitfalls: Common mistakes and production worked examples
  • Basic Lab: ESP32 circular buffers, windows, and event detection
  • Advanced Lab: CEP, pattern matching, and anomaly detection on ESP32
  • Game & Summary: Interactive review game and module summary
  • Interoperability: Four levels of IoT interoperability
  • Interop Fundamentals: Technical, syntactic, semantic, and organizational layers
  • Interop Standards: SenML, JSON-LD, W3C WoT, and oneM2M
  • Integration Patterns: Protocol adapters, gateways, and ontology mapping

Learning Objectives

After completing this chapter, you will be able to:

  • Compare Apache Kafka Streams, Apache Flink, and Spark Structured Streaming across latency, throughput, and deployment complexity
  • Select the appropriate stream processing architecture based on event volume, latency requirements, and team expertise
  • Evaluate push-based versus pull-based consumer models for IoT message queues
  • Apply key-based and round-robin partitioning strategies for Kafka topics based on processing requirements
  • Analyze trade-offs between managed and self-hosted Kafka deployments for IoT workloads

Stream processing architecture is the design for handling data that flows continuously, like a river of sensor readings. Unlike traditional systems where you store everything first and analyze later, stream processing examines data as it arrives – like a factory assembly line where each station processes items as they pass through.

In 60 Seconds

Apache Kafka, Apache Flink, and Spark Structured Streaming are the three dominant stream processing architectures for IoT. Kafka Streams is a lightweight library achieving 100K-500K events/sec per instance with 10-100ms latency and no separate cluster; Flink provides true event-by-event processing with 1-10ms latency and built-in CEP; Spark offers unified batch-plus-streaming with native ML integration. Most IoT projects should start with Kafka Streams for simplicity and migrate to Flink only when throughput exceeds 500K events/sec per instance or complex event patterns are needed.

Key Concepts
  • Lambda Architecture: Dual-path processing combining a batch layer (accurate, slow, Spark) with a speed layer (approximate, real-time, Storm/Flink) and serving layer merging results
  • Kappa Architecture: Simplified lambda variant using only a stream processing layer (Kafka + Flink) for both real-time and historical reprocessing, eliminating batch layer complexity
  • Micro-Batch Processing: Processing small batches (100ms-1s intervals) to balance throughput and latency — Apache Spark Structured Streaming, Storm Trident
  • True Streaming: Event-by-event processing with millisecond-scale latency — Apache Flink, Google Dataflow — providing the lowest possible end-to-end latency (Kafka Streams also processes record-by-record, but its commit interval typically pushes end-to-end latency to 10-100ms)
  • State Management: Maintaining aggregation state (running counts, sums, windows) across events in a fault-tolerant manner, with checkpointing for recovery
  • Windowing: Temporal grouping of events into bounded sets (tumbling, sliding, session windows) for aggregate computation over time periods
  • Event Time vs. Processing Time: Event time is when data was generated at the sensor; processing time is when it arrives at the processor — out-of-order delivery creates gaps between the two
  • Exactly-Once Processing: Guarantee that each event is processed exactly once even during failures, requiring distributed transactions or idempotent operations
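Several of these concepts can be made concrete in a few lines. The sketch below (plain Python, no framework; the function name and sample data are illustrative) groups out-of-order sensor readings into tumbling windows by event time rather than arrival time:

```python
from collections import defaultdict

def tumbling_windows(events, window_size_s):
    """Group (event_time_s, value) pairs into fixed, non-overlapping
    windows keyed by the window's start time (event-time semantics)."""
    windows = defaultdict(list)
    for event_time, value in events:
        window_start = (event_time // window_size_s) * window_size_s
        windows[window_start].append(value)
    # Aggregate each window (here: average) once grouping is complete.
    return {start: sum(vals) / len(vals) for start, vals in sorted(windows.items())}

# Readings arrive out of order (processing order != event order),
# but grouping by event time still assigns each to the right window.
readings = [(0, 20.0), (12, 21.0), (5, 22.0), (17, 23.0), (22, 19.0)]
print(tumbling_windows(readings, window_size_s=10))
# {0: 21.0, 10: 22.0, 20: 19.0}
```

Note that the reading with event time 5 arrives after the one with event time 12, yet still lands in the first window — the gap between event time and processing time that watermarks exist to manage.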

4.1 Stream Processing Architectures

⏱️ ~20 min | ⭐⭐⭐ Advanced | 📋 P10.C14.U03

4.1.1 Apache Kafka + Kafka Streams

Apache Kafka is a distributed streaming platform that combines message queuing with stream processing capabilities through Kafka Streams.

Tradeoff: Push-Based vs Pull-Based Consumer Models for IoT Message Queues

Option A (Push-Based Consumers - RabbitMQ/AMQP style):

  • Delivery model: Broker pushes messages to consumers immediately upon arrival
  • Latency: 1-5ms message delivery (minimal delay between publish and consume)
  • Consumer control: Limited - prefetch count only, broker controls pace
  • Throughput: 10K-50K msg/sec per consumer (limited by consumer processing speed)
  • Flow control: Prefetch limits prevent overwhelming slow consumers
  • Use cases: Real-time alerts, chat systems, command-and-control where immediate delivery is critical
  • Failure handling: Messages requeued on NACK, broker tracks delivery state

Option B (Pull-Based Consumers - Kafka style):

  • Delivery model: Consumers fetch batches of messages when ready
  • Latency: 10-100ms typical (batch accumulation + poll interval)
  • Consumer control: Full - consumers decide batch size, timing, and offset commits
  • Throughput: 100K-500K msg/sec per consumer (batch efficiency)
  • Flow control: Natural backpressure - consumers only pull what they can process
  • Use cases: Log aggregation, metrics collection, stream processing where throughput matters
  • Failure handling: Consumer rewinds offset, re-reads from log (no broker state)

Decision Factors:

  • Choose Push-Based when: Sub-10ms latency required (financial trading, emergency alerts), message ordering per consumer is critical, consumers are always connected and fast, simple request-reply patterns needed
  • Choose Pull-Based when: Throughput exceeds 50K msg/sec, consumers need replay capability (reprocess historical data), batch processing is acceptable, consumers may be temporarily slow or offline, at-least-once semantics with consumer-controlled commits needed
  • Hybrid approach: Use push (RabbitMQ) for real-time command channels, pull (Kafka) for telemetry streams - many IoT systems use both patterns for different data types
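A minimal sketch of the pull model can clarify why it yields natural backpressure and replay. The toy Log class below stands in for a single Kafka partition; it is not the Kafka client API, and all names are illustrative:

```python
class Log:
    """Toy append-only log standing in for one Kafka partition."""
    def __init__(self):
        self.messages = []

    def append(self, msg):
        self.messages.append(msg)

    def fetch(self, offset, max_batch):
        # Pull model: the consumer asks for up to max_batch messages at its
        # own offset; the broker keeps no per-consumer delivery state.
        return self.messages[offset:offset + max_batch]

log = Log()
for i in range(10):
    log.append(f"reading-{i}")

offset = 0   # consumer-owned position: rewinding it replays history
while offset < len(log.messages):
    batch = log.fetch(offset, max_batch=4)   # natural backpressure: pull
    for msg in batch:                        # only what we can process
        pass                                 # ... process msg ...
    offset += len(batch)                     # "commit" after the batch succeeds

print(offset)  # 10 — consumed in three pulls (4 + 4 + 2)
```

Because the consumer owns the offset, a crashed or slow consumer simply resumes (or rewinds) later — the push model would instead need the broker to track and requeue undelivered messages.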

Tradeoff: Key-Based Partitioning vs Round-Robin for Kafka Topics

Option A (Key-Based Partitioning - Same sensor always goes to same partition):

  • Routing: hash(sensor_id) % partition_count → deterministic partition assignment
  • Message ordering: Guaranteed per sensor (all readings from sensor X arrive in order)
  • Load distribution: Uneven if some sensors publish more than others (hot partitions)
  • Partition locality: Same consumer processes all data for a sensor (enables stateful processing)
  • Consumer scaling: Adding consumers may not help if load is concentrated in few partitions
  • Use cases: Sensor aggregations, session windowing, any stateful processing requiring per-key ordering

Option B (Round-Robin Partitioning - Messages distributed evenly across partitions):

  • Routing: Next message goes to next partition in sequence
  • Message ordering: None guaranteed (sensor X messages spread across all partitions)
  • Load distribution: Perfect even distribution regardless of publisher patterns
  • Partition locality: Each consumer sees random subset of sensors
  • Consumer scaling: Adding consumers linearly increases throughput
  • Use cases: Stateless processing (validation, enrichment), independent event handling, maximum throughput scenarios

Decision Factors:

  • Choose Key-Based when: Processing requires per-sensor state (running averages, anomaly detection), message ordering within a sensor matters (state machines, sequence detection), downstream systems expect sensor affinity (per-device dashboards)
  • Choose Round-Robin when: Each message is independent (fire-and-forget telemetry), maximum throughput is critical and ordering doesn’t matter, stateless transformations only (JSON parsing, format conversion), want to avoid hot partition problems from uneven sensor traffic
  • Metrics to monitor: Partition lag variance (key-based often shows 10x+ variance vs round-robin), consumer CPU imbalance across partitions, end-to-end latency p99 per partition

Example calculation: 100 sensors publishing 100,000 msg/sec in total, where 10% of the sensors generate 90% of the traffic, with 10 partitions:

  • Key-based: Hash collisions are likely to concentrate hot sensors onto fewer partitions, creating severe imbalance (some partitions handle 2-3 hot sensors while others sit nearly idle)
  • Round-robin: All 10 partitions run at a uniform ~10,000 msg/sec, but per-sensor ordering is lost
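The skew effect can be simulated directly. The sketch below (plain Python with illustrative names; md5 stands in for Kafka's murmur2 partitioner hash) routes the assumed 100K msg/sec through both strategies:

```python
import hashlib
from collections import Counter

def stable_hash(key):
    # Python's built-in hash() is salted per process; use a stable digest.
    # (Kafka's default partitioner actually uses murmur2 on the key bytes.)
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

def partition_loads(partitions=10, sensors=100, total_rate=100_000):
    """Model the example: 10% of sensors carry 90% of the traffic."""
    hot = sensors // 10
    rates = {f"sensor-{i}": (0.9 * total_rate / hot if i < hot
                             else 0.1 * total_rate / (sensors - hot))
             for i in range(sensors)}

    # Key-based: hash(sensor_id) % partition_count — deterministic routing.
    key_based = Counter()
    for sensor, rate in rates.items():
        key_based[stable_hash(sensor) % partitions] += rate
    kb_loads = [key_based.get(p, 0.0) for p in range(partitions)]

    # Round-robin: perfectly even regardless of per-sensor skew.
    rr_loads = [total_rate / partitions] * partitions
    return kb_loads, rr_loads

kb, rr = partition_loads()
print("key-based   msg/sec per partition:", [round(x) for x in kb])   # typically uneven
print("round-robin msg/sec per partition:", [round(x) for x in rr])  # all 10000
```

Running this shows the key-based loads varying several-fold across partitions whenever hot sensors hash to the same partition, while round-robin stays flat at 10,000 msg/sec.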


Figure 4.1: Kafka Streams architecture showing sensor data ingestion through a distributed broker cluster with partitioned topics, a stream processing pipeline applying transformations and aggregations, and output routing to a time-series database and alert services

Figure 4.2: Alternative view: sequence diagram showing the temporal flow of a single sensor reading through the Kafka ecosystem. This timeline perspective illustrates the latency budget at each stage, from sensor publishing through broker replication, stream processing, and final routing to storage or alerting based on anomaly detection results.

Strengths:

  • Durability: All messages persisted to disk with configurable retention
  • Exactly-once semantics: Guarantees each message is processed exactly once (requires idempotent producers and transactional consumers; adds 3-5% throughput overhead)
  • High throughput: 1+ million messages per second per broker
  • Horizontal scalability: Add brokers and partitions as needed
  • Fault tolerance: Automatic failover with replication

Performance Metrics (Kafka Broker):

  • Broker throughput: 1-2 million messages/second per broker (producer-to-disk)
  • Broker latency: 2-10 milliseconds median (producer publish to consumer fetch)
  • Retention: Days to years of historical data
  • Partitions: Up to 200,000 per cluster (with KRaft, formerly limited by ZooKeeper)

Kafka Streams Processing Metrics (on top of broker):

  • Throughput: 100K-500K events/sec per application instance
  • End-to-end latency: 10-100 ms (depends on commit interval and processing complexity)
  • Scaling: Horizontal, by adding application instances (up to the partition count)

Best For: IoT applications requiring guaranteed delivery, historical replay capability, and integration with multiple downstream systems.

The Kafka architecture excels at decoupling producers from consumers, enabling independent scaling and resilience. This producer-consumer model is fundamental to modern stream processing systems.

4.1.1.1 How It Works: Consumer Group Rebalancing

Understanding how Kafka distributes partition consumption across consumers is essential for scaling IoT workloads:

Initial State (3 partitions, 1 consumer):

Topic: sensor-data (3 partitions)
├─ Partition 0 ───> Consumer A
├─ Partition 1 ───> Consumer A
└─ Partition 2 ───> Consumer A
Result: 100% CPU on Consumer A, no parallelism

Add Consumer (3 partitions, 2 consumers – triggers rebalance):

Rebalancing... (brief pause in consumption)
├─ Partition 0 ───> Consumer A
├─ Partition 1 ───> Consumer B
└─ Partition 2 ───> Consumer A
Result: Load distributed, ~2x throughput

Add Third Consumer (3 partitions, 3 consumers):

├─ Partition 0 ───> Consumer A
├─ Partition 1 ───> Consumer B
└─ Partition 2 ───> Consumer C
Result: Perfect 1:1 mapping, maximum parallelism for this topic

Add Fourth Consumer (3 partitions, 4 consumers – wasted capacity):

├─ Partition 0 ───> Consumer A
├─ Partition 1 ───> Consumer B
├─ Partition 2 ───> Consumer C
└─ (no partition) → Consumer D (IDLE!)
Result: Consumer D wastes resources, partition count limits parallelism

Key Insight: Maximum consumer parallelism equals the partition count. Over-provisioning consumers provides no benefit. Under-provisioning creates bottlenecks. This is why choosing the right partition count at topic creation is critical for IoT scaling.
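The rebalancing outcomes above follow from a simple assignment rule. This toy round-robin assignor (a simplification of Kafka's pluggable partition-assignment strategies) reproduces the four scenarios:

```python
def assign_partitions(partitions, consumers):
    """Round-robin assignment: partition i goes to consumer i mod N.
    (A simplified stand-in for Kafka's pluggable partition assignors.)"""
    assignment = {c: [] for c in consumers}
    for p in range(partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment

for group in (["A"], ["A", "B"], ["A", "B", "C"], ["A", "B", "C", "D"]):
    print(len(group), "consumer(s):", assign_partitions(3, group))
# With 4 consumers, D is assigned [] — idle, because 3 partitions cap
# the group's parallelism at 3.
```

The empty assignment for the fourth consumer is the code-level view of the "wasted capacity" scenario above.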

Tradeoff: Kafka Streams vs Apache Flink for IoT Stream Processing

Option A: Kafka Streams (Library-based)

  • Deployment model: Embedded library in your Java/Scala application
  • Infrastructure: No separate cluster - runs in your application containers
  • Operational cost: $200-800/month (Kafka cluster only, no separate Flink cluster)
  • State management: RocksDB embedded, state stored in Kafka topics
  • Throughput: 100K-500K events/sec per instance (scales with app instances)
  • Latency: 10-100ms typical (depends on commit interval)
  • Use cases: Stateful transformations, simple aggregations, microservice integration

Option B: Apache Flink (Distributed Framework)

  • Deployment model: Dedicated cluster with Job Manager and Task Managers
  • Infrastructure: Separate Flink cluster (3-10+ nodes typical)
  • Operational cost: $1,000-5,000/month (Kafka + Flink managed clusters)
  • State management: RocksDB or heap, checkpointed to S3/HDFS
  • Throughput: 1M-10M events/sec per cluster (horizontal scaling)
  • Latency: 1-10ms typical (true event-by-event processing)
  • Use cases: Complex event processing, ML inference, cross-stream joins, very high volume

Decision Factors:

  • Choose Kafka Streams when: Team already operates Kafka, processing logic fits in Java/Scala, 100K events/sec is sufficient, operational simplicity is critical, tight integration with Kafka ecosystem (Connect, Schema Registry)
  • Choose Flink when: Event volume exceeds 500K/sec, complex event patterns (CEP) needed, sub-10ms latency required, processing requires Python/SQL, cross-stream joins or sessionization needed
  • Real-world guidance: Start with Kafka Streams for most IoT projects - it handles 90% of use cases with simpler operations. Migrate to Flink when you hit throughput limits or need CEP capabilities.

Figure 4.3: Kafka stream architecture with distributed topic partitions and consumer groups, illustrating the log-based message ordering and parallel consumer scaling that enable high-throughput IoT data streaming

4.1.3 Apache Spark Structured Streaming

Spark Structured Streaming extends Spark’s batch processing capabilities to streaming data using a micro-batch approach.

Figure 4.7: Spark Structured Streaming architecture showing a driver program coordinating executor nodes that process micro-batches of IoT data, with DataFrame API, SQL query support, and ML pipeline integration for real-time predictions

Strengths:

  • Unified API: Same code for batch and streaming
  • ML integration: Native integration with MLlib for real-time predictions
  • SQL support: Query streams with SQL
  • Ecosystem: Leverage entire Spark ecosystem
  • Fault tolerance: Checkpoint-based recovery

Performance Metrics:

  • Latency: 100 ms to seconds (micro-batch mode); experimental continuous mode achieves ~1 ms but with limited operator support
  • Throughput: Millions of events per second per cluster
  • Trigger interval: Configurable from 100ms (high overhead) to 10+ seconds (efficient batching)
  • Scalability: Thousands of executors across hundreds of nodes

Best For: Applications requiring machine learning on streaming data, teams already using Spark, use cases where 1-second latency is acceptable.

4.1.4 Architecture Comparison

The following table consolidates the key differences between all three frameworks to support side-by-side evaluation:

Feature | Kafka Streams | Apache Flink | Spark Streaming
Processing Model | Record-by-record | Event-by-event | Micro-batch
Latency | 10-100 ms | 1-10 ms | 100 ms - 10 s
Throughput | 100K-500K events/sec (per instance) | Millions/sec (per cluster) | Millions/sec (per cluster)
Exactly-Once | Yes (via Kafka transactions) | Yes (via checkpoints) | Yes (via write-ahead log)
State Management | RocksDB / in-memory | RocksDB / heap | In-memory / HDFS / S3
Event Time | Yes | Advanced (watermarks, allowed lateness) | Yes
CEP Support | Limited (custom code) | Built-in (FlinkCEP library) | Limited
ML Integration | External | External (FlinkML experimental) | Native (MLlib)
Deployment | Library (no cluster) | Cluster (JobManager + TaskManagers) | Cluster (Driver + Executors)
Learning Curve | Medium | Steep | Medium
Best Use Case | Kafka-centric microservices | Complex patterns, low latency | ML + streaming, batch unification

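The decision guidance in this section can be condensed into a small heuristic. The function below is an illustrative sketch of those rules of thumb, not an official sizing tool:

```python
def recommend_architecture(events_per_sec, max_latency_ms,
                           needs_cep=False, needs_native_ml=False):
    """Heuristic encoding of this section's rules of thumb
    (an illustrative sketch, not an official sizing tool)."""
    # Flink: CEP patterns, sub-10ms latency, or >500K events/sec per instance.
    if needs_cep or max_latency_ms < 10 or events_per_sec > 500_000:
        return "Apache Flink"
    # Spark: native ML integration, when 100ms+ latency is acceptable.
    if needs_native_ml and max_latency_ms >= 100:
        return "Spark Structured Streaming"
    # Default: Kafka Streams covers most IoT workloads with simpler operations.
    return "Kafka Streams"

print(recommend_architecture(20_000, 5_000))                      # Kafka Streams
print(recommend_architecture(1_000_000, 5, needs_cep=True))       # Apache Flink
print(recommend_architecture(50_000, 500, needs_native_ml=True))  # Spark Structured Streaming
```

The ordering of the checks matters: hard constraints (CEP, latency, throughput) are tested before preferences (ML integration), mirroring the elimination steps used in the worked example later in this chapter.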

Beyond choosing a processing framework, teams must decide how to deploy and operate their Kafka infrastructure:

Tradeoff: Managed Kafka Services vs Self-Hosted Kafka for IoT

Option A (Managed - Confluent Cloud/AWS MSK/Azure Event Hubs):

  • Setup time: 15-30 minutes (console wizard, no infrastructure provisioning)
  • Availability SLA: 99.95% (Confluent), 99.9% (MSK), automatic broker replacement
  • Scaling: Automatic partition rebalancing, 10,000+ partitions supported
  • Throughput cost at 100MB/sec: ~$2,000-4,000/month (pay per throughput unit)
  • Ops overhead: Zero broker management, automated patches, built-in monitoring
  • Features included: Schema Registry, Connect, KSQL (Confluent), CDC connectors
  • Vendor lock-in: Moderate (Kafka protocol portable, but proprietary extensions)

Option B (Self-Hosted - Apache Kafka on EC2/Kubernetes):

  • Setup time: 1-3 days (cluster provisioning, KRaft metadata quorum or legacy ZooKeeper, security hardening)
  • Availability: Manual multi-AZ setup required, failover in 30-120 seconds
  • Scaling: Manual partition reassignment, rolling restarts for config changes
  • Throughput cost at 100MB/sec: ~$800-1,500/month (3-5 broker cluster on reserved instances)
  • Ops overhead: 10-20 hours/month (upgrades, monitoring, incident response, capacity planning)
  • Features included: Core Kafka only (Schema Registry, Connect require separate setup)
  • Vendor lock-in: None (full control, portable to any cloud or on-premises)

Decision Factors:

  • Choose Managed when: Team lacks Kafka expertise (learning curve is steep), time-to-market critical (<4 weeks), need enterprise features (RBAC, audit logs, tiered storage), throughput <500MB/sec (cost-effective range)
  • Choose Self-Hosted when: Throughput exceeds 500MB/sec (managed costs escalate), regulatory requirements mandate infrastructure control, need custom patches or bleeding-edge features, team has 2+ Kafka-experienced engineers
  • Break-even analysis: At 50MB/sec, managed wins. At 500MB/sec, self-hosted saves $20K-40K/year. Factor in $150K/year fully-loaded engineer cost for ops time.
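The break-even reasoning can be made explicit by pricing self-hosted ops time at the fully loaded engineer rate. The midpoint figures below are assumptions for illustration, not vendor pricing:

```python
def self_hosted_true_cost(infra_per_month, ops_hours_per_month,
                          engineer_cost_per_year=150_000):
    """Self-hosted 'sticker price' understates cost: add ops time priced
    at the fully loaded engineer rate (~2,080 working hours/year)."""
    hourly = engineer_cost_per_year / 2080
    return infra_per_month + ops_hours_per_month * hourly

# At 100 MB/s, using midpoints of the ranges above (an assumption):
managed = 3_000                                  # $2,000-4,000/month managed
self_hosted = self_hosted_true_cost(1_150, 15)   # $800-1,500 infra, 10-20 h ops
print(f"managed ~${managed:,.0f}/month, self-hosted ~${self_hosted:,.0f}/month")
# The gap narrows once ops hours are priced in; it widens in self-hosting's
# favor at higher throughput, where managed pricing scales with volume.
```

At this scale the two options land within roughly $800/month of each other, which is why the decision usually hinges on team expertise and throughput growth rather than raw infrastructure cost.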

Tradeoff: Real-Time Dashboards vs Batch Analytics for IoT Monitoring

Option A (Real-Time Dashboards - Sub-second refresh):

  • Data freshness: 1-10 seconds (live sensor values)
  • Query latency: 50-200ms (pre-aggregated metrics in Redis/TimescaleDB)
  • Infrastructure cost: $500-2,000/month (always-on stream processing + hot storage)
  • Development complexity: High (streaming pipeline, windowed aggregations, push updates)
  • Use cases: Operations centers, live fleet tracking, real-time alerting, trading floors
  • Accuracy: Approximate (windowed aggregates may miss late data)
  • User experience: Animated updates, live charts, immediate feedback

Option B (Batch Analytics - Scheduled refresh):

  • Data freshness: 1-24 hours (scheduled ETL jobs)
  • Query latency: 1-30 seconds (ad-hoc queries on data warehouse)
  • Infrastructure cost: $100-500/month (serverless batch + cold storage)
  • Development complexity: Lower (SQL transformations, scheduled jobs, standard BI tools)
  • Use cases: Executive reporting, trend analysis, compliance audits, capacity planning
  • Accuracy: Exact (complete dataset, no approximations)
  • User experience: Static reports, PDF exports, email digests

Decision Factors:

  • Choose Real-Time when: Operators need to react within minutes (factory floor, NOC, dispatch), users expect live feedback (ride-sharing, delivery tracking), anomaly detection must trigger immediate action, competitive advantage depends on speed
  • Choose Batch when: Insights drive strategic decisions (monthly reviews, budgeting), accuracy matters more than speed (financial reconciliation, compliance), users prefer scheduled reports (email digests, PDF exports), cost efficiency is critical and hourly latency is acceptable
  • Tiered approach: Real-time dashboards for operations (last 24 hours), batch analytics for history (30+ days) - most mature IoT platforms implement both, serving different user personas and use cases

4.2 Worked Example: Choosing an Architecture for Smart Factory Monitoring

With the three architectures and their tradeoffs now established, let us walk through a realistic decision process.

Scenario: A factory with 2,000 sensors (vibration, temperature, current, acoustic) needs a stream processing system. Requirements:

  • Throughput: 2,000 sensors x 10 readings/sec = 20,000 events/sec (200 bytes each = 4 MB/sec)
  • Latency: Vibration anomaly alerts must fire within 5 seconds. Daily reports can be hours old.
  • Patterns: Need to detect “vibration spike followed by temperature rise within 60 seconds” (bearing failure precursor)
  • Team: 3 backend engineers, no Kafka or Flink experience. Budget: $3,000/month infrastructure.
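The sizing arithmetic is worth verifying mechanically; the few lines below reproduce the throughput figures from the requirements:

```python
SENSORS = 2_000
READINGS_PER_SEC = 10
BYTES_PER_READING = 200

events_per_sec = SENSORS * READINGS_PER_SEC
mb_per_sec = events_per_sec * BYTES_PER_READING / 1_000_000
print(f"{events_per_sec:,} events/sec, {mb_per_sec:.1f} MB/sec")
# 20,000 events/sec, 4.0 MB/sec

# Headroom against Kafka Streams' 100K-500K events/sec per instance:
print(f"utilization: {events_per_sec / 500_000:.0%}-{events_per_sec / 100_000:.0%}")
# utilization: 4%-20%
```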


Step 1: Eliminate by throughput

All three architectures handle 20K events/sec easily. Kafka Streams handles 100K-500K, Flink handles 1M+, Spark handles millions. No elimination here.

Step 2: Eliminate by latency

Spark Structured Streaming can achieve sub-5s latency with short micro-batch intervals (100-500ms), so it is not strictly eliminated by the latency requirement alone. However, Spark’s micro-batch model adds scheduling overhead per batch (driver coordination, task serialization), making latency less predictable under load spikes compared to Kafka Streams (10-100ms) and Flink (1-10ms), which both process records continuously without batch boundaries.

Step 3: Evaluate pattern detection needs

The “vibration spike THEN temperature rise within 60 seconds” pattern is a temporal sequence – exactly what Flink CEP excels at. Kafka Streams can implement this with custom windowed joins and state stores, but the code is significantly more complex and error-prone. Spark Structured Streaming lacks built-in CEP and would require even more custom logic using flatMapGroupsWithState, making it the weakest option for this pattern.
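To make the comparison concrete, here is a sketch of the custom stateful logic this pattern requires when no CEP library is available. It is a toy per-sensor state machine in plain Python, not Kafka Streams code; thresholds and field names are illustrative:

```python
def detect_bearing_precursor(events, window_s=60,
                             vib_threshold=5.0, temp_rise=2.0):
    """Flag 'vibration spike THEN temperature rise within window_s'.
    Toy per-sensor state machine; a real Kafka Streams version would
    keep this state in a state store keyed by sensor_id."""
    last_spike = {}      # sensor_id -> time of most recent vibration spike
    baseline_temp = {}   # sensor_id -> last temperature seen
    alerts = []
    for t, sensor, kind, value in sorted(events):  # (time_s, id, type, value)
        if kind == "vibration" and value > vib_threshold:
            last_spike[sensor] = t
        elif kind == "temperature":
            prev = baseline_temp.get(sensor)
            spike_t = last_spike.get(sensor)
            if (prev is not None and spike_t is not None
                    and value - prev >= temp_rise
                    and t - spike_t <= window_s):
                alerts.append((sensor, spike_t, t))
            baseline_temp[sensor] = value
    return alerts

events = [
    (0,  "m1", "temperature", 40.0),
    (10, "m1", "vibration",   7.2),   # spike
    (45, "m1", "temperature", 43.5),  # +3.5 C within 35 s -> alert
    (5,  "m2", "vibration",   7.9),   # spike, but...
    (90, "m2", "temperature", 41.0),  # first temp reading: no baseline yet
]
print(detect_bearing_precursor(events))
# [('m1', 10, 45)]
```

Even this simplified version must track per-sensor state, handle missing baselines, and bound the match window — the bookkeeping that FlinkCEP's declarative pattern API (`A followedBy B within 60s`) would handle for you, which is why the table below estimates weeks of effort for the hand-rolled versions.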

Step 4: Factor in team expertise and operational cost

Factor | Kafka Streams | Apache Flink | Spark Streaming
Learning curve for 3 engineers | 2-3 weeks | 4-6 weeks | 3-4 weeks
Infrastructure | Managed Kafka only (~$400/month) | Managed Kafka + Flink cluster (~$1,800/month) | Managed Kafka + Spark cluster (~$1,500/month)
CEP implementation effort | 3-4 weeks custom code | 1 week using FlinkCEP library | 4-5 weeks custom stateful code
Ongoing ops burden | Low (library, no cluster) | Medium (cluster monitoring, upgrades) | Medium (cluster + driver monitoring)
Total Year-1 cost (infra + engineering time) | ~$5,000 infra + 6 weeks eng | ~$22,000 infra + 5 weeks eng | ~$18,000 infra + 7 weeks eng

Decision: Kafka Streams with custom pattern detection.

At 20K events/sec, the factory is using only 4-20% of Kafka Streams’ capacity. The pattern detection is implementable (one temporal sequence, not dozens). The ~$17,000/year infrastructure savings versus Flink and simpler operations (no separate cluster) outweigh the 3-4 weeks of additional engineering effort for custom CEP logic. Spark is ruled out because its CEP implementation would require the most custom code while adding cluster management overhead.

When to reconsider: If the factory scales to 10 plants (200K events/sec) or needs 10+ complex event patterns, migrate to Flink. The Kafka topic structure remains unchanged – only the processing layer changes.

4.3 Architectural Patterns: Combining Batch and Stream

In practice, many IoT systems do not choose exclusively between stream and batch processing. Two hybrid architectures address this:

Lambda architecture showing data flowing through both batch layer for comprehensive processing and speed layer for real-time results, merging in a serving layer for unified query access

Lambda architecture runs parallel pipelines: a batch layer (Spark, Hadoop) recomputes complete, accurate results periodically, while a speed layer (Flink, Kafka Streams) provides low-latency approximate results from recent data. A serving layer merges both views. The tradeoff is operational complexity – maintaining two codebases that must produce consistent results.

Stream processing architecture showing IoT data sources feeding into stream processing engines with windowing, aggregation, and output sinks for real-time analytics pipelines

Kappa architecture simplifies Lambda by using a single stream processing pipeline for both real-time and historical reprocessing (replaying the event log). This reduces operational complexity but requires a stream processor capable of handling both use cases efficiently. Kafka’s log retention plus Flink or Kafka Streams is a common Kappa implementation.

Comparison diagram showing batch processing with periodic large-volume jobs versus stream processing with continuous small-increment processing, highlighting latency, throughput, and use case differences

Batch processing excels at comprehensive historical analysis with high throughput and exact results, while stream processing provides low-latency real-time insights with approximate (windowed) results. The choice depends on whether your IoT use case prioritizes completeness or timeliness.

Key Takeaway

For most IoT projects, start with Kafka Streams (library-based, no separate cluster, 100K-500K events/sec) for simplicity and lower operational cost. Migrate to Apache Flink only when throughput exceeds 500K events/sec, sub-10ms latency is required, or complex event processing (CEP) patterns are needed. Choose Spark Streaming when ML integration is essential and 100ms-1s latency is acceptable. The biggest mistake teams make is adopting Flink prematurely and spending more on operations than on business logic.

Which data highway should the Sensor Squad use? They compare three amazing roads!

The Sensor Squad needs to send millions of sensor messages every second. But which delivery system should they use? They test three options!

Road 1: Kafka Highway (the Librarian) Sammy the Sensor tries Kafka first. “It is like a giant library where every message is written in a book. Anyone can come read the books whenever they want, and the books stay on the shelves for days!”

Max the Microcontroller adds: “And the best part – you do not need a special building! Kafka Streams is a library that lives RIGHT INSIDE your own program. It handles 500,000 messages per second!”

Road 2: Flink Expressway (the Speed Demon) Lila the LED tests Flink: “This is a SUPER-FAST highway with special lanes! It processes each message one by one in just 1-10 milliseconds. And it has a special detective feature called CEP that can spot patterns like ‘alarm, then temperature spike, then smoke’ automatically!”

But Flink needs its own special building (cluster) with managers and workers. “It is faster but more complicated to set up,” notes Lila.

Road 3: Spark Boulevard (the Brainy One) Bella the Battery tries Spark: “This road collects messages into small bundles and processes them together, like a mail truck that delivers packages in batches. It takes about 1 second per batch, but it is AMAZING at machine learning – it can predict things while processing!”

The Verdict: “For most of our missions,” says Max, “Kafka Highway is perfect – it is simple, fast enough, and easy to manage. We only need Flink Expressway when we have millions of messages or need that special detective feature. And Spark Boulevard is great when we need our smart brains (ML) to work on live data!”

4.3.1 Try This at Home!

Think about three ways to deliver a birthday party invitation: (1) Text message (instant, like Flink), (2) Email (fast, like Kafka – and you can re-read it later!), (3) Mail a letter (slower, like Spark – but you can include a beautiful card!). Different messages need different delivery methods, just like different IoT data needs different processing systems!

4.4 Concept Check

4.5 Concept Relationships

Stream processing architectures connect to broader system design:

These connections show how the choice of architecture is shaped by the broader system ecosystem — storage, transport, and downstream consumers all constrain which framework fits.

4.6 See Also

Official Documentation:

Benchmarks and Comparisons:

4.7 Common Pitfalls

Lambda architecture requires maintaining two separate processing codebases (batch and streaming) that must produce equivalent results. This doubles testing burden, operational complexity, and code surface area. Before choosing Lambda, verify that Kappa architecture (streaming-only with historical reprocessing) cannot meet requirements — it often can.

Spark Structured Streaming micro-batching introduces 100ms-1s batching delay by design. For IoT applications requiring <100ms end-to-end latency (real-time control, safety alerts), micro-batch architectures are architecturally incompatible. Use true streaming frameworks (Flink, Kafka Streams) that process events individually.

Streaming pipelines that consume faster than downstream systems can absorb cause unbounded memory growth as buffers fill. Without backpressure mechanisms, the pipeline OOMs or drops data silently. Always design streaming pipelines with explicit backpressure — slow down ingestion when downstream is overwhelmed rather than buffering indefinitely.
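The fix is a bounded buffer between stages. The minimal sketch below (illustrative names, standard library only) shows the mechanism: a full queue blocks the producer instead of letting memory grow.

```python
import queue
import threading

buf = queue.Queue(maxsize=100)   # bounded buffer = explicit backpressure

def producer(n):
    for i in range(n):
        buf.put(i)   # blocks while the buffer is full, throttling ingestion
                     # instead of letting memory grow without bound

def consumer(n, out):
    for _ in range(n):
        out.append(buf.get())    # a slow sink here simply slows the producer

out = []
t = threading.Thread(target=producer, args=(1_000,))
t.start()
consumer(1_000, out)
t.join()
print(len(out), buf.qsize())  # 1000 0 — nothing dropped, memory capped at 100 items
```

Stream frameworks implement the same idea at scale: Flink propagates backpressure through its network buffers, and Kafka's pull model makes it implicit (consumers fetch only what they can handle).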

IoT sensors in poor connectivity areas send events late — sometimes minutes or hours after they were generated. Streaming windows that close strictly at event-time boundaries will drop these late events silently. Use watermarks with configurable allowed lateness to handle late events gracefully without indefinitely holding all windows open.
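A watermark with allowed lateness can be sketched in a few lines. The toy, framework-free implementation below accepts moderately late readings and drops only those beyond the grace period:

```python
def windows_with_lateness(events, window_s=60, allowed_lateness_s=120):
    """Event-time tumbling windows with a watermark: a window closes only
    when the watermark (max event time seen minus allowed lateness)
    passes its end, so moderately late sensor readings still count."""
    open_windows = {}   # window_start -> list of values
    closed = {}
    dropped = 0
    watermark = float("-inf")
    for event_time, value in events:            # arrival order, not event order
        watermark = max(watermark, event_time - allowed_lateness_s)
        start = (event_time // window_s) * window_s
        if start + window_s <= watermark:
            dropped += 1                        # too late even for the grace period
            continue
        open_windows.setdefault(start, []).append(value)
        # Finalize every window whose end the watermark has passed.
        for s in [s for s in open_windows if s + window_s <= watermark]:
            closed[s] = open_windows.pop(s)
    return closed, open_windows, dropped

arrivals = [(10, "a"), (70, "b"), (50, "c"),   # "c" is 20 s late but accepted
            (400, "d"), (30, "e")]             # "e" is far too late: dropped
closed, still_open, dropped = windows_with_lateness(arrivals)
print(closed, dropped)
# {0: ['a', 'c'], 60: ['b']} 1
```

The design tension is visible in the two parameters: a longer `allowed_lateness_s` loses fewer events but holds windows open (and state in memory) longer — exactly the trade-off Flink exposes through watermarks and allowed lateness.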

4.8 What’s Next

If you want to… | Read this
Study the fundamentals of stream processing | Stream Processing Fundamentals
Learn about specific streaming challenges | Handling Stream Processing Challenges
Build streaming pipelines hands-on | Building IoT Streaming Pipelines
Practice in the advanced CEP lab | Hands-On Lab: Advanced CEP

Continue to Building IoT Streaming Pipelines to learn how to design and implement complete real-time IoT data pipelines with appropriate components and stages.