1297  Stream Processing Architectures

1297.1 Stream Processing Architectures

⏱️ ~20 min | ⭐⭐⭐ Advanced | 📋 P10.C14.U03

1297.1.1 Apache Kafka + Kafka Streams

Apache Kafka is a distributed streaming platform that combines message queuing with stream processing capabilities through Kafka Streams.

⚠️ Tradeoff: Push-Based vs Pull-Based Consumer Models for IoT Message Queues

Option A (Push-Based Consumers - RabbitMQ/AMQP style):

- Delivery model: Broker pushes messages to consumers immediately upon arrival
- Latency: 1-5ms message delivery (minimal delay between publish and consume)
- Consumer control: Limited - prefetch count only, broker controls pace
- Throughput: 10K-50K msg/sec per consumer (limited by consumer processing speed)
- Flow control: Prefetch limits prevent overwhelming slow consumers
- Use cases: Real-time alerts, chat systems, command-and-control where immediate delivery is critical
- Failure handling: Messages requeued on NACK, broker tracks delivery state

Option B (Pull-Based Consumers - Kafka style):

- Delivery model: Consumers fetch batches of messages when ready
- Latency: 10-100ms typical (batch accumulation + poll interval)
- Consumer control: Full - consumers decide batch size, timing, and offset commits
- Throughput: 100K-500K msg/sec per consumer (batch efficiency)
- Flow control: Natural backpressure - consumers only pull what they can process
- Use cases: Log aggregation, metrics collection, stream processing where throughput matters
- Failure handling: Consumer rewinds offset, re-reads from log (no broker state)

Decision Factors:

- Choose Push-Based when: Sub-10ms latency is required (financial trading, emergency alerts), message ordering per consumer is critical, consumers are always connected and fast, or simple request-reply patterns are needed
- Choose Pull-Based when: Throughput exceeds 50K msg/sec, consumers need replay capability (reprocessing historical data), batch processing is acceptable, consumers may be temporarily slow or offline, or at-least-once semantics with consumer-controlled commits are needed (see the consumer sketch below)
- Hybrid approach: Use push (RabbitMQ) for real-time command channels and pull (Kafka) for telemetry streams - many IoT systems use both patterns for different data types
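
The pull model's consumer-driven loop is easy to see in code. Below is a minimal Java sketch of a Kafka pull consumer, assuming a local broker and an illustrative raw-sensors topic with string payloads; the batch size, commit strategy, and process() handler are placeholder choices rather than a prescribed implementation:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class PullConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // illustrative address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "telemetry-processors");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // Consumer-controlled flow: disable auto-commit and bound the batch size.
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "500");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("raw-sensors")); // illustrative topic name
            while (true) {
                // Pull a batch only when ready: this is the natural backpressure.
                ConsumerRecords<String, String> batch = consumer.poll(Duration.ofMillis(100));
                for (ConsumerRecord<String, String> record : batch) {
                    process(record.key(), record.value()); // application-specific handler
                }
                // Commit offsets only after the batch is fully processed.
                consumer.commitSync();
            }
        }
    }

    private static void process(String sensorId, String payload) {
        System.out.printf("sensor=%s payload=%s%n", sensorId, payload);
    }
}
```

Because commitSync() runs only after the batch is handled, a crash simply rewinds the consumer to the last committed offset and re-reads the log - the consumer-controlled at-least-once behavior described above.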

⚠️ Tradeoff: Key-Based Partitioning vs Round-Robin for Kafka Topics

Option A (Key-Based Partitioning - same sensor always goes to the same partition):

- Routing: hash(sensor_id) % partition_count → deterministic partition assignment
- Message ordering: Guaranteed per sensor (all readings from sensor X arrive in order)
- Load distribution: Uneven if some sensors publish more than others (hot partitions)
- Partition locality: Same consumer processes all data for a sensor (enables stateful processing)
- Consumer scaling: Adding consumers may not help if load is concentrated in few partitions
- Use cases: Sensor aggregations, session windowing, any stateful processing requiring per-key ordering

Option B (Round-Robin Partitioning - messages distributed evenly across partitions):

- Routing: Next message goes to the next partition in sequence
- Message ordering: None guaranteed (sensor X messages spread across all partitions)
- Load distribution: Perfectly even distribution regardless of publisher patterns
- Partition locality: Each consumer sees a random subset of sensors
- Consumer scaling: Adding consumers linearly increases throughput
- Use cases: Stateless processing (validation, enrichment), independent event handling, maximum-throughput scenarios

Decision Factors:

- Choose Key-Based when: Processing requires per-sensor state (running averages, anomaly detection), message ordering within a sensor matters (state machines, sequence detection), or downstream systems expect sensor affinity (per-device dashboards)
- Choose Round-Robin when: Each message is independent (fire-and-forget telemetry), maximum throughput is critical and ordering doesn't matter, only stateless transformations are needed (JSON parsing, format conversion), or you want to avoid hot-partition problems from uneven sensor traffic
- Metrics to monitor: Partition lag variance (key-based often shows 10x+ variance vs round-robin), consumer CPU imbalance across partitions, end-to-end p99 latency per partition

Example calculation: 100 sensors across 100 partitions, where 10% of sensors generate 90% of traffic:

- Key-based: The 10 hot partitions each carry ~9% of total traffic while the other 90 sit near 0.1% each - roughly a 90x imbalance
- Round-robin: Every partition carries a uniform 1% of traffic, but per-sensor ordering is lost
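
To make the routing concrete, here is a small Java producer sketch contrasting keyed and null-key records, along with the modulo computation Kafka's default partitioner applies to keys; the broker address, topic name, and partition count are illustrative assumptions:

```java
import java.nio.charset.StandardCharsets;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import org.apache.kafka.common.utils.Utils;

public class PartitioningSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // illustrative
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Key-based: the same sensor_id always hashes to the same partition,
            // preserving per-sensor ordering.
            producer.send(new ProducerRecord<>("raw-sensors", "sensor-042", "{\"temp\": 85.0}"));

            // Null key: the default partitioner spreads records across partitions
            // (recent clients batch stickily rather than strict round-robin).
            producer.send(new ProducerRecord<>("raw-sensors", null, "{\"temp\": 21.5}"));
        }

        // The keyed routing of the default partitioner is essentially:
        //   partition = murmur2(keyBytes) % numPartitions  (masked to non-negative)
        int numPartitions = 12; // illustrative
        byte[] keyBytes = "sensor-042".getBytes(StandardCharsets.UTF_8);
        int partition = (Utils.murmur2(keyBytes) & 0x7fffffff) % numPartitions;
        System.out.println("sensor-042 -> partition " + partition);
    }
}
```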

Figure 1297.1: Kafka Streams architecture showing sensor data ingestion through distributed broker cluster with stream processing pipeline outputting to time-series database and alert services

%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#7F8C8D'}}}%%
sequenceDiagram
    participant S as IoT Sensor
    participant P as Kafka Producer
    participant B as Kafka Broker
    participant KS as Kafka Streams
    participant C1 as Time-Series DB
    participant C2 as Alert Service

    Note over S,C2: Message Lifecycle: 2-10ms end-to-end

    S->>P: Temperature: 85°C
    Note right of S: 0ms

    P->>B: Produce to raw-sensors
    Note right of P: +1ms (serialize, send)

    B->>B: Replicate to followers
    Note right of B: +2ms (durability)

    B->>KS: Consume from partition
    Note right of B: +1ms (fetch)

    KS->>KS: Parse → Aggregate → Detect
    Note right of KS: +3ms (processing)

    alt Anomaly Detected (Z > 3)
        KS->>C2: Produce to alerts topic
        C2->>C2: Trigger PagerDuty
        Note right of C2: Critical alert!
    else Normal Reading
        KS->>C1: Produce to metrics topic
        C1->>C1: Store in InfluxDB
        Note right of C1: Query from Grafana
    end

    Note over S,C2: Total latency: ~7ms median

Figure 1297.2: Alternative view: Sequence diagram showing the temporal flow of a single sensor reading through the Kafka ecosystem. This timeline perspective illustrates the latency budget at each stage - from sensor publishing through broker replication, stream processing, and final routing to storage or alerting based on anomaly detection results.

Strengths:

- Durability: All messages are persisted to disk with configurable retention
- Exactly-once semantics: Each message is processed exactly once (an opt-in configuration)
- High throughput: 1+ million messages per second per broker
- Horizontal scalability: Add brokers and partitions as needed
- Fault tolerance: Automatic failover with replication

Performance Metrics:

- Throughput: 1-2 million messages/second per broker
- Latency: 2-10 milliseconds (median)
- Retention: Days to years of historical data
- Partitions: 1000s per cluster

Best For: IoT applications requiring guaranteed delivery, historical replay capability, and integration with multiple downstream systems.

The Kafka architecture excels at decoupling producers from consumers, enabling independent scaling and resilience. This producer-consumer model is fundamental to modern stream processing systems.
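
The parse → detect → route flow of Figure 1297.2 maps naturally onto the Kafka Streams DSL. The sketch below shows the topology shape under simplifying assumptions: string payloads, a toy parser, and a fixed 80°C threshold standing in for the z-score detector (a real detector would keep per-sensor statistics in a state store); the topic and application names are illustrative:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class SensorPipelineSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "sensor-pipeline"); // illustrative id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> readings = builder.stream("raw-sensors");

        // Parse: extract the temperature (real code would use a JSON/Avro serde).
        KStream<String, Double> temps = readings
            .mapValues(SensorPipelineSketch::parseTemperature);

        // Detect and route: anomalies to the alerts topic, normal readings to metrics.
        temps.filter((sensorId, t) -> t > 80.0)
             .mapValues(t -> "ALERT temp=" + t)
             .to("alerts");
        temps.filter((sensorId, t) -> t <= 80.0)
             .mapValues(String::valueOf)
             .to("metrics");

        new KafkaStreams(builder.build(), props).start();
    }

    private static Double parseTemperature(String json) {
        // Toy parser for payloads like {"temp": 85.0}
        String v = json.replaceAll("[^0-9.]", "");
        return v.isEmpty() ? 0.0 : Double.parseDouble(v);
    }
}
```

Note how the topology runs inside an ordinary main() method: the "deployment" is just your application process, which is the library-based model discussed in the tradeoff that follows.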

⚠️ Tradeoff: Kafka Streams vs Apache Flink for IoT Stream Processing

Option A (Kafka Streams - library-based):

- Deployment model: Embedded library in your Java/Scala application
- Infrastructure: No separate cluster - runs in your application containers
- Operational cost: $200-800/month (Kafka cluster only, no separate Flink cluster)
- State management: Embedded RocksDB, with state changelogs stored in Kafka topics
- Throughput: 100K-500K events/sec per instance (scales with app instances)
- Latency: 10-100ms typical (depends on commit interval)
- Use cases: Stateful transformations, simple aggregations, microservice integration

Option B (Apache Flink - distributed framework):

- Deployment model: Dedicated cluster with a Job Manager and Task Managers
- Infrastructure: Separate Flink cluster (3-10+ nodes typical)
- Operational cost: $1,000-5,000/month (Kafka + Flink managed clusters)
- State management: RocksDB or heap, checkpointed to S3/HDFS
- Throughput: 1M-10M events/sec per cluster (horizontal scaling)
- Latency: 1-10ms typical (true event-by-event processing)
- Use cases: Complex event processing, ML inference, cross-stream joins, very high volume

Decision Factors:

- Choose Kafka Streams when: The team already operates Kafka, processing logic fits in Java/Scala, 100K events/sec is sufficient, operational simplicity is critical, or tight integration with the Kafka ecosystem (Connect, Schema Registry) matters
- Choose Flink when: Event volume exceeds 500K/sec, complex event patterns (CEP) are needed, sub-10ms latency is required, processing requires Python/SQL, or cross-stream joins and sessionization are needed (a minimal Flink job is sketched below)
- Real-world guidance: Start with Kafka Streams for most IoT projects - it handles 90% of use cases with simpler operations. Migrate to Flink when you hit throughput limits or need CEP capabilities.
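
For contrast with the embedded Kafka Streams model above, a Flink job is packaged as a jar and submitted to the cluster's Job Manager, which distributes it to Task Managers. A minimal sketch using Flink's KafkaSource connector, with illustrative broker, topic, and group names and a trivial filter standing in for real processing logic:

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class FlinkJobSketch {
    public static void main(String[] args) throws Exception {
        // Unlike Kafka Streams, this main() runs on a Flink cluster,
        // not inside your own application container.
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        KafkaSource<String> source = KafkaSource.<String>builder()
            .setBootstrapServers("localhost:9092")   // illustrative address
            .setTopics("raw-sensors")                // illustrative topic
            .setGroupId("flink-telemetry")
            .setStartingOffsets(OffsetsInitializer.latest())
            .setValueOnlyDeserializer(new SimpleStringSchema())
            .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "sensors")
           .filter(payload -> payload.contains("\"temp\"")) // stand-in for real logic
           .print();

        env.execute("sensor-telemetry");
    }
}
```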

Figure 1297.3: Kafka stream architecture with distributed topic partitions, producer and consumer groups, illustrating the log-based message ordering and parallel consumer scaling that enable high-throughput IoT data streaming

1297.1.3 Apache Spark Structured Streaming

Spark Structured Streaming extends Spark’s batch processing capabilities to streaming data using a micro-batch approach.

Figure 1297.7: Spark Structured Streaming architecture showing driver program coordinating executors processing micro-batches with DataFrame API, SQL, and ML pipeline integration

Strengths:

- Unified API: Same code for batch and streaming
- ML integration: Native integration with MLlib for real-time predictions
- SQL support: Query streams with SQL
- Ecosystem: Leverage the entire Spark ecosystem
- Fault tolerance: Checkpoint-based recovery

Performance Metrics:

- Latency: 100 milliseconds to seconds (micro-batch interval)
- Throughput: Millions of events per second
- Batch interval: Typically 500ms to 10 seconds
- Scalability: Thousands of nodes

Best For: Applications requiring machine learning on streaming data, teams already using Spark, use cases where 1-second latency is acceptable.
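
A minimal Java sketch of the unified API in action: the same DataFrame operations used for batch work are applied to a Kafka stream, with a watermark and event-time window producing per-sensor averages. The payload schema, topic, broker address, and console sink are illustrative assumptions, not a prescribed pipeline:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;
import org.apache.spark.sql.streaming.Trigger;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;
import static org.apache.spark.sql.functions.*;

public class SparkStreamingSketch {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
            .appName("sensor-aggregates")
            .getOrCreate();

        // Assumed JSON payload: {"sensorId": "...", "temperature": 85.0, "ts": "..."}
        StructType schema = new StructType()
            .add("sensorId", DataTypes.StringType)
            .add("temperature", DataTypes.DoubleType)
            .add("ts", DataTypes.TimestampType);

        Dataset<Row> readings = spark.readStream()
            .format("kafka")
            .option("kafka.bootstrap.servers", "localhost:9092") // illustrative
            .option("subscribe", "raw-sensors")                  // illustrative
            .load()
            .selectExpr("CAST(value AS STRING) AS json")
            .select(from_json(col("json"), schema).alias("r"))
            .select("r.*");

        // Event-time windowed average per sensor; data later than 1 minute is dropped.
        Dataset<Row> averages = readings
            .withWatermark("ts", "1 minute")
            .groupBy(window(col("ts"), "30 seconds"), col("sensorId"))
            .agg(avg("temperature").alias("avgTemp"));

        // Micro-batches fire on the trigger interval (console sink for the sketch).
        StreamingQuery query = averages.writeStream()
            .outputMode("update")
            .format("console")
            .trigger(Trigger.ProcessingTime("1 second"))
            .start();

        query.awaitTermination();
    }
}
```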

1297.1.4 Architecture Comparison

| Feature | Kafka Streams | Apache Flink | Spark Streaming |
|---|---|---|---|
| Processing Model | Event-by-event | Event-by-event | Micro-batch |
| Latency | 2-10 ms | 1-10 ms | 100 ms - 10 s |
| Throughput | 1M+ msg/sec | Millions/sec | Millions/sec |
| Exactly-Once | Yes | Yes | Yes |
| State Management | RocksDB/In-memory | RocksDB/Memory | HDFS/S3 |
| Event Time | Yes | Advanced | Yes |
| CEP Support | Limited | Built-in | Limited |
| ML Integration | External | External | Native (MLlib) |
| Deployment | Library (no cluster) | Cluster | Cluster |
| Learning Curve | Medium | Steep | Medium |
| Best Use Case | Kafka-centric apps | Complex patterns, low latency | ML + streaming |

⚠️ Tradeoff: Managed Kafka Services vs Self-Hosted Kafka for IoT

Option A (Managed - Confluent Cloud/AWS MSK/Azure Event Hubs):

- Setup time: 15-30 minutes (console wizard, no infrastructure provisioning)
- Availability SLA: 99.95% (Confluent), 99.9% (MSK), automatic broker replacement
- Scaling: Automatic partition rebalancing, 10,000+ partitions supported
- Throughput cost at 100MB/sec: ~$2,000-4,000/month (pay per throughput unit)
- Ops overhead: Zero broker management, automated patches, built-in monitoring
- Features included: Schema Registry, Connect, KSQL (Confluent), CDC connectors
- Vendor lock-in: Moderate (the Kafka protocol is portable, but proprietary extensions are not)

Option B (Self-Hosted - Apache Kafka on EC2/Kubernetes):

- Setup time: 1-3 days (cluster provisioning, ZooKeeper/KRaft, security hardening)
- Availability: Manual multi-AZ setup required, failover in 30-120 seconds
- Scaling: Manual partition reassignment, rolling restarts for config changes
- Throughput cost at 100MB/sec: ~$800-1,500/month (3-5 broker cluster on reserved instances)
- Ops overhead: 10-20 hours/month (upgrades, monitoring, incident response, capacity planning)
- Features included: Core Kafka only (Schema Registry and Connect require separate setup)
- Vendor lock-in: None (full control, portable to any cloud or on-premises)

Decision Factors:

- Choose Managed when: The team lacks Kafka expertise (the learning curve is steep), time-to-market is critical (<4 weeks), you need enterprise features (RBAC, audit logs, tiered storage), or throughput is below 500MB/sec (the cost-effective range)
- Choose Self-Hosted when: Throughput exceeds 500MB/sec (managed costs escalate), regulatory requirements mandate infrastructure control, you need custom patches or bleeding-edge features, or the team has 2+ Kafka-experienced engineers
- Break-even analysis: At 50MB/sec, managed wins. At 500MB/sec, self-hosted saves $20K-40K/year. Factor in a $150K/year fully-loaded engineer cost for ops time.
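
As a rough worked example at the 100MB/sec price points above (mid-range figures, assumed rather than quoted): managed runs about $3,000/month, while self-hosted costs about $1,150/month in infrastructure plus roughly 15 ops hours at ~$72/hour (a $150K/year engineer over ~2,080 working hours), or ~$1,080 in labor, for ~$2,230/month total. At that scale the two options are nearly a wash; because ops hours grow far more slowly than managed throughput pricing, the savings widen as volume rises toward the 500MB/sec figure cited above.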

⚠️ Tradeoff: Real-Time Dashboards vs Batch Analytics for IoT Monitoring

Option A (Real-Time Dashboards - sub-second refresh):

- Data freshness: 1-10 seconds (live sensor values)
- Query latency: 50-200ms (pre-aggregated metrics in Redis/TimescaleDB)
- Infrastructure cost: $500-2,000/month (always-on stream processing + hot storage)
- Development complexity: High (streaming pipeline, windowed aggregations, push updates)
- Use cases: Operations centers, live fleet tracking, real-time alerting, trading floors
- Accuracy: Approximate (windowed aggregates may miss late data)
- User experience: Animated updates, live charts, immediate feedback

Option B (Batch Analytics - scheduled refresh):

- Data freshness: 1-24 hours (scheduled ETL jobs)
- Query latency: 1-30 seconds (ad-hoc queries on a data warehouse)
- Infrastructure cost: $100-500/month (serverless batch + cold storage)
- Development complexity: Lower (SQL transformations, scheduled jobs, standard BI tools)
- Use cases: Executive reporting, trend analysis, compliance audits, capacity planning
- Accuracy: Exact (complete dataset, no approximations)
- User experience: Static reports, PDF exports, email digests

Decision Factors:

- Choose Real-Time when: Operators must react within minutes (factory floor, NOC, dispatch), users expect live feedback (ride-sharing, delivery tracking), anomaly detection must trigger immediate action, or competitive advantage depends on speed
- Choose Batch when: Insights drive strategic decisions (monthly reviews, budgeting), accuracy matters more than speed (financial reconciliation, compliance), users prefer scheduled reports (email digests, PDF exports), or cost efficiency is critical and hourly latency is acceptable
- Tiered approach: Real-time dashboards for operations (last 24 hours), batch analytics for history (30+ days) - most mature IoT platforms implement both, serving different user personas and use cases

1297.3 What’s Next

Continue to Building IoT Streaming Pipelines to learn how to design and implement complete real-time IoT data pipelines with appropriate components and stages.