1297  Stream Processing Architectures

1297.1 Stream Processing Architectures

⏱️ ~20 min | ⭐⭐⭐ Advanced | 📋 P10.C14.U03

1297.1.1 Apache Kafka + Kafka Streams

Apache Kafka is a distributed streaming platform that combines message queuing with stream processing capabilities through Kafka Streams.

⚠️ Tradeoff: Push-Based vs Pull-Based Consumer Models for IoT Message Queues

Option A (Push-Based Consumers - RabbitMQ/AMQP style):

- Delivery model: Broker pushes messages to consumers immediately upon arrival
- Latency: 1-5ms message delivery (minimal delay between publish and consume)
- Consumer control: Limited - prefetch count only, broker controls pace
- Throughput: 10K-50K msg/sec per consumer (limited by consumer processing speed)
- Flow control: Prefetch limits prevent overwhelming slow consumers
- Use cases: Real-time alerts, chat systems, command-and-control where immediate delivery is critical
- Failure handling: Messages requeued on NACK, broker tracks delivery state

Option B (Pull-Based Consumers - Kafka style):

- Delivery model: Consumers fetch batches of messages when ready
- Latency: 10-100ms typical (batch accumulation + poll interval)
- Consumer control: Full - consumers decide batch size, timing, and offset commits
- Throughput: 100K-500K msg/sec per consumer (batch efficiency)
- Flow control: Natural backpressure - consumers only pull what they can process
- Use cases: Log aggregation, metrics collection, stream processing where throughput matters
- Failure handling: Consumer rewinds offset, re-reads from log (no broker state)

Decision Factors:

- Choose Push-Based when: Sub-10ms latency is required (financial trading, emergency alerts), message ordering per consumer is critical, consumers are always connected and fast, or simple request-reply patterns are needed
- Choose Pull-Based when: Throughput exceeds 50K msg/sec, consumers need replay capability (reprocessing historical data), batch processing is acceptable, consumers may be temporarily slow or offline, or at-least-once semantics with consumer-controlled commits are needed (see the consumer sketch below)
- Hybrid approach: Use push (RabbitMQ) for real-time command channels and pull (Kafka) for telemetry streams - many IoT systems use both patterns for different data types
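
The pull model's consumer-driven loop is easy to see in code. Below is a minimal Java sketch of a Kafka pull consumer, assuming a local broker and an illustrative raw-sensors topic with string payloads; the batch size, commit strategy, and process() handler are placeholder choices rather than a prescribed implementation:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class PullConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // illustrative address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "telemetry-processors");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // Consumer-controlled flow: disable auto-commit and bound the batch size.
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "500");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("raw-sensors")); // illustrative topic name
            while (true) {
                // Pull a batch only when ready: this is the natural backpressure.
                ConsumerRecords<String, String> batch = consumer.poll(Duration.ofMillis(100));
                for (ConsumerRecord<String, String> record : batch) {
                    process(record.key(), record.value()); // application-specific handler
                }
                // Commit offsets only after the batch is fully processed.
                consumer.commitSync();
            }
        }
    }

    private static void process(String sensorId, String payload) {
        System.out.printf("sensor=%s payload=%s%n", sensorId, payload);
    }
}
```

Because commitSync() runs only after the batch is handled, a crash simply rewinds the consumer to the last committed offset and re-reads the log - the consumer-controlled at-least-once behavior described above.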

⚠️ Tradeoff: Key-Based Partitioning vs Round-Robin for Kafka Topics

Option A (Key-Based Partitioning - same sensor always goes to the same partition):

- Routing: hash(sensor_id) % partition_count → deterministic partition assignment
- Message ordering: Guaranteed per sensor (all readings from sensor X arrive in order)
- Load distribution: Uneven if some sensors publish more than others (hot partitions)
- Partition locality: Same consumer processes all data for a sensor (enables stateful processing)
- Consumer scaling: Adding consumers may not help if load is concentrated in few partitions
- Use cases: Sensor aggregations, session windowing, any stateful processing requiring per-key ordering

Option B (Round-Robin Partitioning - messages distributed evenly across partitions):

- Routing: Next message goes to the next partition in sequence
- Message ordering: None guaranteed (sensor X messages spread across all partitions)
- Load distribution: Perfectly even distribution regardless of publisher patterns
- Partition locality: Each consumer sees a random subset of sensors
- Consumer scaling: Adding consumers linearly increases throughput
- Use cases: Stateless processing (validation, enrichment), independent event handling, maximum-throughput scenarios

Decision Factors:

- Choose Key-Based when: Processing requires per-sensor state (running averages, anomaly detection), message ordering within a sensor matters (state machines, sequence detection), or downstream systems expect sensor affinity (per-device dashboards)
- Choose Round-Robin when: Each message is independent (fire-and-forget telemetry), maximum throughput is critical and ordering doesn't matter, only stateless transformations are needed (JSON parsing, format conversion), or you want to avoid hot-partition problems from uneven sensor traffic
- Metrics to monitor: Partition lag variance (key-based often shows 10x+ variance vs round-robin), consumer CPU imbalance across partitions, end-to-end p99 latency per partition

Example calculation: 100 sensors across 100 partitions, where 10% of sensors generate 90% of traffic:

- Key-based: The 10 hot partitions each carry ~9% of total traffic while the other 90 sit near 0.1% each - roughly a 90x imbalance
- Round-robin: Every partition carries a uniform 1% of traffic, but per-sensor ordering is lost
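
To make the routing concrete, here is a small Java producer sketch contrasting keyed and null-key records, along with the modulo computation Kafka's default partitioner applies to keys; the broker address, topic name, and partition count are illustrative assumptions:

```java
import java.nio.charset.StandardCharsets;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import org.apache.kafka.common.utils.Utils;

public class PartitioningSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // illustrative
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Key-based: the same sensor_id always hashes to the same partition,
            // preserving per-sensor ordering.
            producer.send(new ProducerRecord<>("raw-sensors", "sensor-042", "{\"temp\": 85.0}"));

            // Null key: the default partitioner spreads records across partitions
            // (recent clients batch stickily rather than strict round-robin).
            producer.send(new ProducerRecord<>("raw-sensors", null, "{\"temp\": 21.5}"));
        }

        // The keyed routing of the default partitioner is essentially:
        //   partition = murmur2(keyBytes) % numPartitions  (masked to non-negative)
        int numPartitions = 12; // illustrative
        byte[] keyBytes = "sensor-042".getBytes(StandardCharsets.UTF_8);
        int partition = (Utils.murmur2(keyBytes) & 0x7fffffff) % numPartitions;
        System.out.println("sensor-042 -> partition " + partition);
    }
}
```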

Figure 1297.1: Kafka Streams architecture showing sensor data ingestion through distributed broker cluster with stream processing pipeline outputting to time-series database and alert services

%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#7F8C8D'}}}%%
sequenceDiagram
    participant S as IoT Sensor
    participant P as Kafka Producer
    participant B as Kafka Broker
    participant KS as Kafka Streams
    participant C1 as Time-Series DB
    participant C2 as Alert Service

    Note over S,C2: Message Lifecycle: 2-10ms end-to-end

    S->>P: Temperature: 85°C
    Note right of S: 0ms

    P->>B: Produce to raw-sensors
    Note right of P: +1ms (serialize, send)

    B->>B: Replicate to followers
    Note right of B: +2ms (durability)

    B->>KS: Consume from partition
    Note right of B: +1ms (fetch)

    KS->>KS: Parse → Aggregate → Detect
    Note right of KS: +3ms (processing)

    alt Anomaly Detected (Z > 3)
        KS->>C2: Produce to alerts topic
        C2->>C2: Trigger PagerDuty
        Note right of C2: Critical alert!
    else Normal Reading
        KS->>C1: Produce to metrics topic
        C1->>C1: Store in InfluxDB
        Note right of C1: Query from Grafana
    end

    Note over S,C2: Total latency: ~7ms median

Figure 1297.2: Alternative view: Sequence diagram showing the temporal flow of a single sensor reading through the Kafka ecosystem. This timeline perspective illustrates the latency budget at each stage - from sensor publishing through broker replication, stream processing, and final routing to storage or alerting based on anomaly detection results.

Strengths:

- Durability: All messages are persisted to disk with configurable retention
- Exactly-once semantics: Each message is processed exactly once (an opt-in configuration)
- High throughput: 1+ million messages per second per broker
- Horizontal scalability: Add brokers and partitions as needed
- Fault tolerance: Automatic failover with replication

Performance Metrics:

- Throughput: 1-2 million messages/second per broker
- Latency: 2-10 milliseconds (median)
- Retention: Days to years of historical data
- Partitions: 1000s per cluster

Best For: IoT applications requiring guaranteed delivery, historical replay capability, and integration with multiple downstream systems.

The Kafka architecture excels at decoupling producers from consumers, enabling independent scaling and resilience. This producer-consumer model is fundamental to modern stream processing systems.
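
The parse → detect → route flow of Figure 1297.2 maps naturally onto the Kafka Streams DSL. The sketch below shows the topology shape under simplifying assumptions: string payloads, a toy parser, and a fixed 80°C threshold standing in for the z-score detector (a real detector would keep per-sensor statistics in a state store); the topic and application names are illustrative:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class SensorPipelineSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "sensor-pipeline"); // illustrative id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> readings = builder.stream("raw-sensors");

        // Parse: extract the temperature (real code would use a JSON/Avro serde).
        KStream<String, Double> temps = readings
            .mapValues(SensorPipelineSketch::parseTemperature);

        // Detect and route: anomalies to the alerts topic, normal readings to metrics.
        temps.filter((sensorId, t) -> t > 80.0)
             .mapValues(t -> "ALERT temp=" + t)
             .to("alerts");
        temps.filter((sensorId, t) -> t <= 80.0)
             .mapValues(String::valueOf)
             .to("metrics");

        new KafkaStreams(builder.build(), props).start();
    }

    private static Double parseTemperature(String json) {
        // Toy parser for payloads like {"temp": 85.0}
        String v = json.replaceAll("[^0-9.]", "");
        return v.isEmpty() ? 0.0 : Double.parseDouble(v);
    }
}
```

Note how the topology runs inside an ordinary main() method: the "deployment" is just your application process, which is the library-based model discussed in the tradeoff that follows.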

⚠️ Tradeoff: Kafka Streams vs Apache Flink for IoT Stream Processing

Option A (Kafka Streams - library-based):

- Deployment model: Embedded library in your Java/Scala application
- Infrastructure: No separate cluster - runs in your application containers
- Operational cost: $200-800/month (Kafka cluster only, no separate Flink cluster)
- State management: Embedded RocksDB, with state changelogs stored in Kafka topics
- Throughput: 100K-500K events/sec per instance (scales with app instances)
- Latency: 10-100ms typical (depends on commit interval)
- Use cases: Stateful transformations, simple aggregations, microservice integration

Option B (Apache Flink - distributed framework):

- Deployment model: Dedicated cluster with a Job Manager and Task Managers
- Infrastructure: Separate Flink cluster (3-10+ nodes typical)
- Operational cost: $1,000-5,000/month (Kafka + Flink managed clusters)
- State management: RocksDB or heap, checkpointed to S3/HDFS
- Throughput: 1M-10M events/sec per cluster (horizontal scaling)
- Latency: 1-10ms typical (true event-by-event processing)
- Use cases: Complex event processing, ML inference, cross-stream joins, very high volume

Decision Factors:

- Choose Kafka Streams when: The team already operates Kafka, processing logic fits in Java/Scala, 100K events/sec is sufficient, operational simplicity is critical, or tight integration with the Kafka ecosystem (Connect, Schema Registry) matters
- Choose Flink when: Event volume exceeds 500K/sec, complex event patterns (CEP) are needed, sub-10ms latency is required, processing requires Python/SQL, or cross-stream joins and sessionization are needed (a minimal Flink job is sketched below)
- Real-world guidance: Start with Kafka Streams for most IoT projects - it handles 90% of use cases with simpler operations. Migrate to Flink when you hit throughput limits or need CEP capabilities.
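
For contrast with the embedded Kafka Streams model above, a Flink job is packaged as a jar and submitted to the cluster's Job Manager, which distributes it to Task Managers. A minimal sketch using Flink's KafkaSource connector, with illustrative broker, topic, and group names and a trivial filter standing in for real processing logic:

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class FlinkJobSketch {
    public static void main(String[] args) throws Exception {
        // Unlike Kafka Streams, this main() runs on a Flink cluster,
        // not inside your own application container.
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        KafkaSource<String> source = KafkaSource.<String>builder()
            .setBootstrapServers("localhost:9092")   // illustrative address
            .setTopics("raw-sensors")                // illustrative topic
            .setGroupId("flink-telemetry")
            .setStartingOffsets(OffsetsInitializer.latest())
            .setValueOnlyDeserializer(new SimpleStringSchema())
            .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "sensors")
           .filter(payload -> payload.contains("\"temp\"")) // stand-in for real logic
           .print();

        env.execute("sensor-telemetry");
    }
}
```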

Figure 1297.3: Kafka stream architecture with distributed topic partitions, producer and consumer groups, illustrating the log-based message ordering and parallel consumer scaling that enable high-throughput IoT data streaming

1297.1.3 Apache Spark Structured Streaming

Spark Structured Streaming extends Spark’s batch processing capabilities to streaming data using a micro-batch approach.

Figure 1297.7: Spark Structured Streaming architecture showing driver program coordinating executors processing micro-batches with DataFrame API, SQL, and ML pipeline integration

Strengths:

- Unified API: Same code for batch and streaming
- ML integration: Native integration with MLlib for real-time predictions
- SQL support: Query streams with SQL
- Ecosystem: Leverage the entire Spark ecosystem
- Fault tolerance: Checkpoint-based recovery

Performance Metrics:

- Latency: 100 milliseconds to seconds (micro-batch interval)
- Throughput: Millions of events per second
- Batch interval: Typically 500ms to 10 seconds
- Scalability: Thousands of nodes

Best For: Applications requiring machine learning on streaming data, teams already using Spark, use cases where 1-second latency is acceptable.
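
A minimal Java sketch of the unified API in action: the same DataFrame operations used for batch work are applied to a Kafka stream, with a watermark and event-time window producing per-sensor averages. The payload schema, topic, broker address, and console sink are illustrative assumptions, not a prescribed pipeline:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;
import org.apache.spark.sql.streaming.Trigger;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;
import static org.apache.spark.sql.functions.*;

public class SparkStreamingSketch {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
            .appName("sensor-aggregates")
            .getOrCreate();

        // Assumed JSON payload: {"sensorId": "...", "temperature": 85.0, "ts": "..."}
        StructType schema = new StructType()
            .add("sensorId", DataTypes.StringType)
            .add("temperature", DataTypes.DoubleType)
            .add("ts", DataTypes.TimestampType);

        Dataset<Row> readings = spark.readStream()
            .format("kafka")
            .option("kafka.bootstrap.servers", "localhost:9092") // illustrative
            .option("subscribe", "raw-sensors")                  // illustrative
            .load()
            .selectExpr("CAST(value AS STRING) AS json")
            .select(from_json(col("json"), schema).alias("r"))
            .select("r.*");

        // Event-time windowed average per sensor; data later than 1 minute is dropped.
        Dataset<Row> averages = readings
            .withWatermark("ts", "1 minute")
            .groupBy(window(col("ts"), "30 seconds"), col("sensorId"))
            .agg(avg("temperature").alias("avgTemp"));

        // Micro-batches fire on the trigger interval (console sink for the sketch).
        StreamingQuery query = averages.writeStream()
            .outputMode("update")
            .format("console")
            .trigger(Trigger.ProcessingTime("1 second"))
            .start();

        query.awaitTermination();
    }
}
```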

1297.1.4 Architecture Comparison

| Feature | Kafka Streams | Apache Flink | Spark Streaming |
|---|---|---|---|
| Processing Model | Event-by-event | Event-by-event | Micro-batch |
| Latency | 2-10 ms | 1-10 ms | 100 ms - 10 s |
| Throughput | 1M+ msg/sec | Millions/sec | Millions/sec |
| Exactly-Once | Yes | Yes | Yes |
| State Management | RocksDB/In-memory | RocksDB/Memory | HDFS/S3 |
| Event Time | Yes | Advanced | Yes |
| CEP Support | Limited | Built-in | Limited |
| ML Integration | External | External | Native (MLlib) |
| Deployment | Library (no cluster) | Cluster | Cluster |
| Learning Curve | Medium | Steep | Medium |
| Best Use Case | Kafka-centric apps | Complex patterns, low latency | ML + streaming |

⚠️ Tradeoff: Managed Kafka Services vs Self-Hosted Kafka for IoT

Option A (Managed - Confluent Cloud/AWS MSK/Azure Event Hubs):

- Setup time: 15-30 minutes (console wizard, no infrastructure provisioning)
- Availability SLA: 99.95% (Confluent), 99.9% (MSK), automatic broker replacement
- Scaling: Automatic partition rebalancing, 10,000+ partitions supported
- Throughput cost at 100MB/sec: ~$2,000-4,000/month (pay per throughput unit)
- Ops overhead: Zero broker management, automated patches, built-in monitoring
- Features included: Schema Registry, Connect, KSQL (Confluent), CDC connectors
- Vendor lock-in: Moderate (the Kafka protocol is portable, but proprietary extensions are not)

Option B (Self-Hosted - Apache Kafka on EC2/Kubernetes):

- Setup time: 1-3 days (cluster provisioning, ZooKeeper/KRaft, security hardening)
- Availability: Manual multi-AZ setup required, failover in 30-120 seconds
- Scaling: Manual partition reassignment, rolling restarts for config changes
- Throughput cost at 100MB/sec: ~$800-1,500/month (3-5 broker cluster on reserved instances)
- Ops overhead: 10-20 hours/month (upgrades, monitoring, incident response, capacity planning)
- Features included: Core Kafka only (Schema Registry and Connect require separate setup)
- Vendor lock-in: None (full control, portable to any cloud or on-premises)

Decision Factors:

- Choose Managed when: The team lacks Kafka expertise (the learning curve is steep), time-to-market is critical (<4 weeks), you need enterprise features (RBAC, audit logs, tiered storage), or throughput is below 500MB/sec (the cost-effective range)
- Choose Self-Hosted when: Throughput exceeds 500MB/sec (managed costs escalate), regulatory requirements mandate infrastructure control, you need custom patches or bleeding-edge features, or the team has 2+ Kafka-experienced engineers
- Break-even analysis: At 50MB/sec, managed wins. At 500MB/sec, self-hosted saves $20K-40K/year. Factor in a $150K/year fully-loaded engineer cost for ops time.
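
As a rough worked example at the 100MB/sec price points above (mid-range figures, assumed rather than quoted): managed runs about $3,000/month, while self-hosted costs about $1,150/month in infrastructure plus roughly 15 ops hours at ~$72/hour (a $150K/year engineer over ~2,080 working hours), or ~$1,080 in labor, for ~$2,230/month total. At that scale the two options are nearly a wash; because ops hours grow far more slowly than managed throughput pricing, the savings widen as volume rises toward the 500MB/sec figure cited above.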

⚠️ Tradeoff: Real-Time Dashboards vs Batch Analytics for IoT Monitoring

Option A (Real-Time Dashboards - sub-second refresh):

- Data freshness: 1-10 seconds (live sensor values)
- Query latency: 50-200ms (pre-aggregated metrics in Redis/TimescaleDB)
- Infrastructure cost: $500-2,000/month (always-on stream processing + hot storage)
- Development complexity: High (streaming pipeline, windowed aggregations, push updates)
- Use cases: Operations centers, live fleet tracking, real-time alerting, trading floors
- Accuracy: Approximate (windowed aggregates may miss late data)
- User experience: Animated updates, live charts, immediate feedback

Option B (Batch Analytics - scheduled refresh):

- Data freshness: 1-24 hours (scheduled ETL jobs)
- Query latency: 1-30 seconds (ad-hoc queries on a data warehouse)
- Infrastructure cost: $100-500/month (serverless batch + cold storage)
- Development complexity: Lower (SQL transformations, scheduled jobs, standard BI tools)
- Use cases: Executive reporting, trend analysis, compliance audits, capacity planning
- Accuracy: Exact (complete dataset, no approximations)
- User experience: Static reports, PDF exports, email digests

Decision Factors:

- Choose Real-Time when: Operators must react within minutes (factory floor, NOC, dispatch), users expect live feedback (ride-sharing, delivery tracking), anomaly detection must trigger immediate action, or competitive advantage depends on speed
- Choose Batch when: Insights drive strategic decisions (monthly reviews, budgeting), accuracy matters more than speed (financial reconciliation, compliance), users prefer scheduled reports (email digests, PDF exports), or cost efficiency is critical and hourly latency is acceptable
- Tiered approach: Real-time dashboards for operations (last 24 hours), batch analytics for history (30+ days) - most mature IoT platforms implement both, serving different user personas and use cases

1297.3 What’s Next

Continue to Building IoT Streaming Pipelines to learn how to design and implement complete real-time IoT data pipelines with appropriate components and stages.