7  Common Pitfalls and Worked Examples

| Chapter | Topic |
|---------|-------|
| Stream Processing | Overview of batch vs. stream processing for IoT |
| Fundamentals | Windowing strategies, watermarks, and event-time semantics |
| Architectures | Kafka, Flink, and Spark Streaming comparison |
| Pipelines | End-to-end ingestion, processing, and output design |
| Challenges | Late data, exactly-once semantics, and backpressure |
| Pitfalls | Common mistakes and production worked examples |
| Basic Lab | ESP32 circular buffers, windows, and event detection |
| Advanced Lab | CEP, pattern matching, and anomaly detection on ESP32 |
| Game & Summary | Interactive review game and module summary |
| Interoperability | Four levels of IoT interoperability |
| Interop Fundamentals | Technical, syntactic, semantic, and organizational layers |
| Interop Standards | SenML, JSON-LD, W3C WoT, and oneM2M |
| Integration Patterns | Protocol adapters, gateways, and ontology mapping |

7.1 Learning Objectives

  • Implement idempotent message processing using request IDs and deduplication to prevent double billing, double counting, and duplicate actions on retries
  • Configure backpressure handling with bounded queues and blocking producers to prevent memory exhaustion when fast producers overwhelm slow consumers
  • Design production fraud detection pipelines processing 2,000 TPS with <200ms latency using key-based partitioning, sliding windows, and exactly-once semantics
  • Apply watermark strategies and allowed lateness handling to process out-of-order events in real-time payment systems
  • Monitor stream processing health using Flink metrics for throughput, latency, checkpoint duration, and backpressure ratios

Stream processing pitfalls are the common mistakes engineers make when handling real-time IoT data. Think of them as potholes on a road – knowing where they are helps you avoid them. Common traps include assuming data always arrives in order, ignoring late readings, and building systems that cannot recover from failures.

In 60 Seconds

The two most dangerous stream processing pitfalls are non-idempotent message processing (causing double billing, double counting, or duplicate actions on retries) and missing backpressure handling (causing memory exhaustion when producers overwhelm consumers). This chapter also presents a complete fraud detection pipeline worked example processing 2,000 transactions per second with <200ms P50 latency, exactly-once delivery, and velocity attack detection using key-based partitioning and sliding windows.

Key Concepts
  • Worked Example: Detailed walkthrough of a specific streaming scenario showing both the pitfall pattern and the corrected implementation with quantified impact
  • Throughput Bottleneck: Pipeline stage consuming events slower than the upstream stage produces them, causing consumer lag to grow and eventually data loss
  • Memory Leak in State Store: Unbounded accumulation of state (event history, pattern matching state) in streaming operator state stores, eventually causing OOM failures
  • Late Event Handling Strategy: Design decision for how to treat events arriving after their window has closed — drop silently, route to dead letter, or include with extended watermark
  • Idempotent Processing: Stream operation that can be applied multiple times to the same event without changing the result beyond the first application, enabling safe at-least-once delivery
  • Thundering Herd: Large number of IoT devices reconnecting simultaneously after a network outage, creating a traffic spike that overwhelms the stream ingestion tier
  • Schema Incompatibility: Situation where a new message version cannot be deserialized by an older consumer, causing pipeline failures when producer and consumer are updated independently
  • Debugging Streaming Pipelines: Techniques for tracing events through distributed streaming pipelines — sampling, event logging, distributed tracing with correlation IDs

7.2 Common Pitfalls

⏱️ ~15 min | ⭐⭐⭐ Advanced | 📋 P10.C14.U06

Common Pitfall: Non-Idempotent Message Processing

The mistake: Without idempotency, retried messages cause duplicate processing. This leads to double counting, double billing, or repeated actions.

Symptoms:

  • Duplicate actions (double billing, double counting)
  • Data inconsistency
  • Retry storms cause cascading effects
  • Incorrect totals and aggregations

Wrong approach:

# Non-idempotent: Each message increments counter
def process(msg):
    counter += msg.value  # Duplicate messages double-count!

# Non-idempotent command
def process_command(cmd):
    if cmd.action == "dispense":
        dispense_item()  # Retry = dispense twice!

Correct approach:

# Idempotent: Use message ID for deduplication
def process(msg):
    if msg.id in processed_ids:
        return  # Already processed
    processed_ids.add(msg.id)
    counter += msg.value

# Idempotent command with request ID
def process_command(cmd):
    if cmd.request_id in completed:
        return completed[cmd.request_id]  # Return same result
    result = dispense_item()
    completed[cmd.request_id] = result
    return result

How to avoid:

  • Include unique message IDs
  • Track processed message IDs
  • Design idempotent operations
  • Use database constraints to prevent duplicates
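
The last bullet can be sketched with SQLite: a primary-key constraint on the message ID makes the insert itself the deduplication check, so the database rejects replays atomically. Table and column names here are illustrative, not from the chapter:

```python
import sqlite3

# Illustrative schema: the PRIMARY KEY on msg_id enforces dedup in the database
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE processed (msg_id TEXT PRIMARY KEY, value INTEGER)")

def process_once(msg_id: str, value: int) -> bool:
    """Apply the message only if its ID was never seen; True if applied."""
    try:
        with db:  # Commits on success, rolls back on error
            db.execute("INSERT INTO processed VALUES (?, ?)", (msg_id, value))
        return True               # First delivery: side effects go here
    except sqlite3.IntegrityError:
        return False              # Duplicate delivery: safely ignored

process_once("msg-1", 5)   # applied
process_once("msg-1", 5)   # duplicate, ignored
```

Unlike the in-memory `processed_ids` set above, the constraint survives process restarts, which is what makes it safe for at-least-once delivery.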

A vending machine receives a $2 snack dispense command twice due to network retry. Without idempotency checks:

Non-idempotent outcome:

  • Command 1: Dispense snack → Inventory: \(-1\), Customer charge: \(+\$2\)
  • Command 2 (retry): Dispense again → Inventory: \(-1\), Customer charge: \(+\$2\)
  • Total: \(-2\) items, \(+\$4\) charged, 100% overcharge

Idempotent fix (request ID tracking):

if request_id in completed:
    return completed[request_id]  # No repeat action; return cached result
result = dispense_item()
completed[request_id] = result

Impact at scale: 10,000 transactions/day × 0.1% retry rate = 10 duplicates/day. Without idempotency: \(10 \times \$2 = \$20/day\) overcharge = $7,300/year in reconciliation costs.

Try It: Idempotency Impact Calculator

Explore how non-idempotent processing causes financial damage at scale. Adjust the transaction volume and retry rate to see the cost of missing deduplication.
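
The arithmetic behind the impact estimate is simple enough to script. The numbers below mirror the chapter's example; transaction volume, retry rate, and item price are the knobs such a calculator would expose:

```python
def duplicate_cost(tx_per_day: int, retry_rate: float, price: float) -> dict:
    """Estimate the annual overcharge caused by non-idempotent retry handling."""
    dupes_per_day = tx_per_day * retry_rate
    daily_cost = dupes_per_day * price
    return {
        "duplicates_per_day": dupes_per_day,
        "daily_cost": daily_cost,
        "annual_cost": daily_cost * 365,
    }

# Chapter example: 10,000 tx/day, 0.1% retry rate, $2 items -> $7,300/year
impact = duplicate_cost(10_000, 0.001, 2.00)
```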

Common Pitfall: Missing Backpressure Handling

The mistake: Without backpressure, slow consumers are overwhelmed by fast producers. This causes memory exhaustion, dropped messages, and system crashes during traffic spikes.

Symptoms:

  • Memory exhaustion
  • System crashes under load
  • Lost data during traffic spikes
  • Increasing latency

Wrong approach:

# Unbounded queue - memory grows indefinitely
queue = []
def producer():
    while True:
        queue.append(generate_data())

def consumer():
    while queue:
        process(queue.pop(0))  # Slower than producer
# Eventually: OutOfMemoryError

Correct approach:

# Bounded queue with backpressure
from queue import Queue

queue = Queue(maxsize=1000)

def producer():
    while True:
        queue.put(generate_data(), block=True)  # Blocks when full

# Or drop/sample under pressure
from random import random

def producer_with_sampling():
    data = generate_data()
    if queue.full():
        if random() < 0.1:  # Keep ~10% of events under pressure
            queue.put_nowait(data)
    else:
        queue.put(data)

How to avoid:

  • Use bounded buffers and queues
  • Implement blocking producers
  • Add sampling under load
  • Monitor queue depths
  • Scale consumers dynamically

Try It: Backpressure Throughput Simulator

See what happens when producers generate data faster than consumers can process it. Adjust the rates and watch queue depth, memory usage, and dropped events in real time.
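
A deterministic, single-threaded sketch of such a simulation (rates, capacity, and tick count are made-up knobs): each tick the producer offers `produce_rate` events, the bounded queue absorbs what fits, the consumer drains `consume_rate`, and overflow is counted as dropped.

```python
def simulate(produce_rate: int, consume_rate: int, capacity: int, ticks: int):
    """Bounded-queue simulation: returns (final_depth, total_dropped)."""
    depth = dropped = 0
    for _ in range(ticks):
        # Producer offers events; the queue absorbs up to capacity, rest dropped
        space = capacity - depth
        accepted = min(produce_rate, space)
        dropped += produce_rate - accepted
        depth += accepted
        # Consumer drains what it can this tick
        depth -= min(consume_rate, depth)
    return depth, dropped

# Producer twice as fast as consumer: the queue pins near capacity and
# drops accumulate every tick after saturation
print(simulate(200, 100, 1000, 50))
```

With an unbounded list instead of `capacity`, `depth` would grow by 100 every tick forever — the OutOfMemoryError scenario from the wrong approach above.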

7.3 Worked Example: Real-Time Fraud Detection for Payment Terminals

Scenario: A payment processor operates 50,000 IoT payment terminals (card readers, POS systems) generating 2,000 transactions per second. The fraud team needs to detect suspicious patterns in real time:

  • Velocity attacks: Same card used at multiple distant locations within minutes
  • Amount anomalies: Unusual transaction amounts for a given merchant
  • Burst attacks: Many small transactions in rapid succession

Goal: Design a stream processing pipeline that detects fraud within 500ms of transaction arrival while ensuring exactly-once processing to prevent false positives and missed detections.

What we do: Define the transaction event schema and set up Kafka ingestion.

Event Schema:

# Transaction event from payment terminal
transaction_schema = {
    "transaction_id": "uuid",       # Unique identifier
    "card_hash": "string",          # Anonymized card identifier
    "terminal_id": "string",        # Payment terminal ID
    "merchant_id": "string",        # Merchant identifier
    "amount_cents": "int",          # Transaction amount in cents
    "currency": "string",           # ISO currency code
    "location": {
        "lat": "double",
        "lon": "double"
    },
    "timestamp": "timestamp",       # Event time (when card was swiped)
    "terminal_time": "timestamp"    # Processing time (when terminal sent)
}

# Example event
{
    "transaction_id": "txn-a1b2c3d4",
    "card_hash": "hash_xyz789",
    "terminal_id": "term_12345",
    "merchant_id": "merch_grocery_001",
    "amount_cents": 4523,
    "currency": "USD",
    "location": {"lat": 40.7128, "lon": -74.0060},
    "timestamp": "2026-01-10T14:23:45.123Z",
    "terminal_time": "2026-01-10T14:23:45.456Z"
}
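
Before events enter the pipeline, a lightweight schema check can reject malformed messages at the edge. This is a simplified sketch — a production pipeline would use Avro or Protobuf with a schema registry rather than hand-rolled validation:

```python
# Required fields and their (simplified) Python types
REQUIRED_FIELDS = {
    "transaction_id": str, "card_hash": str, "terminal_id": str,
    "merchant_id": str, "amount_cents": int, "currency": str,
    "location": dict, "timestamp": str,
}

def validate(event: dict) -> list[str]:
    """Return a list of schema violations (empty list = valid)."""
    errors = [f"missing or mistyped: {f}"
              for f, t in REQUIRED_FIELDS.items()
              if not isinstance(event.get(f), t)]
    if isinstance(event.get("location"), dict):
        if not {"lat", "lon"} <= event["location"].keys():
            errors.append("location must contain lat and lon")
    return errors

# The chapter's example event passes cleanly
good = {"transaction_id": "txn-a1b2c3d4", "card_hash": "hash_xyz789",
        "terminal_id": "term_12345", "merchant_id": "merch_grocery_001",
        "amount_cents": 4523, "currency": "USD",
        "location": {"lat": 40.7128, "lon": -74.0060},
        "timestamp": "2026-01-10T14:23:45.123Z"}
```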

Kafka Topic Configuration:

# Kafka topic for raw transactions
topic_config = {
    "name": "transactions-raw",
    "partitions": 32,              # Parallelism for 2K TPS
    "replication_factor": 3,       # Durability
    "retention_ms": 7 * 24 * 3600 * 1000,  # 7 days
    "cleanup_policy": "delete"
}

# Partition by card_hash for ordered per-card processing
# This ensures all transactions for a card go to same partition
producer.send(
    "transactions-raw",
    key=transaction["card_hash"],
    value=transaction
)

Try It: Kafka Partition Calculator

Explore how partition count and replication factor affect throughput capacity and fault tolerance for the fraud detection pipeline.
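
The key-to-partition mapping can be illustrated in a few lines. Kafka's default partitioner hashes the key bytes (murmur2 in the Java client); md5 is used here purely as a stable stand-in to show the property that matters — one key always lands on one partition:

```python
import hashlib

def partition_for(key: str, num_partitions: int) -> int:
    """Stable key -> partition mapping (md5 stand-in for Kafka's murmur2)."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Every transaction for a card hashes to the same partition, so one Flink
# subtask sees that card's full, ordered transaction history
p1 = partition_for("hash_xyz789", 32)
p2 = partition_for("hash_xyz789", 32)
assert p1 == p2
```

Note the corollary: changing the partition count remaps keys, so resizing the topic mid-stream breaks per-card ordering until old windows drain.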

Why: Partitioning by card_hash ensures all transactions for a single card arrive at the same Flink task instance in order. This is critical for velocity detection (same card at multiple locations) without expensive cross-partition coordination.

What we do: Implement sliding windows to detect velocity and burst attacks.

Flink Pipeline for Velocity Detection:

from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.window import SlidingEventTimeWindows
from pyflink.common.time import Time

env = StreamExecutionEnvironment.get_execution_environment()
env.set_parallelism(32)

# Velocity detection: Same card, different locations within 5 min
velocity_alerts = (
    env.add_source(kafka_source)
    .key_by(lambda t: t["card_hash"])
    .window(SlidingEventTimeWindows.of(
        Time.minutes(5), Time.seconds(30)))
    .apply(VelocityDetector())
)

class VelocityDetector:
    def apply(self, card_hash, window, transactions):
        """Detect physically impossible travel speed (>500 km/h)."""
        locations = [(t["location"], t["timestamp"])
                     for t in transactions]
        for i, (loc1, t1) in enumerate(locations):
            for loc2, t2 in locations[i+1:]:
                hours = abs((t2 - t1).total_seconds()) / 3600
                if hours == 0:
                    continue  # Same timestamp: avoid division by zero
                speed = haversine(loc1, loc2) / hours  # km/h
                if speed > 500:  # Faster than any car or train
                    yield FraudAlert(card_hash=card_hash,
                        alert_type="VELOCITY_ATTACK",
                        confidence=min(0.99, speed / 1000))
                    return
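
The `haversine` helper the detector calls is not shown above; a standard great-circle implementation (Earth radius 6,371 km, inputs as the schema's `{"lat", "lon"}` dicts) would be:

```python
from math import radians, sin, cos, asin, sqrt

def haversine(loc1: dict, loc2: dict) -> float:
    """Great-circle distance in km between two {"lat", "lon"} points."""
    lat1, lon1 = radians(loc1["lat"]), radians(loc1["lon"])
    lat2, lon2 = radians(loc2["lat"]), radians(loc2["lon"])
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))  # 6371 km = mean Earth radius

# London to New York: roughly 5,500-5,600 km
d = haversine({"lat": 51.5074, "lon": -0.1278},
              {"lat": 40.7128, "lon": -74.0060})
```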

Burst Attack Detection:

# Burst detection: >10 transactions in 2 minutes from same card
burst_alerts = (
    transactions
    .key_by(lambda t: t["card_hash"])
    .window(SlidingEventTimeWindows.of(
        Time.minutes(2),    # Window size: 2 minutes
        Time.seconds(15)    # Slide: check every 15 seconds
    ))
    .apply(BurstDetector())
)

class BurstDetector:
    def apply(self, card_hash, window, transactions):
        """Detect card testing attacks (many small transactions)."""
        txn_list = list(transactions)
        count = len(txn_list)
        avg_amount = sum(t["amount_cents"] for t in txn_list) / count

        # Flag: >10 transactions OR >5 transactions with avg <$5
        if count > 10 or (count > 5 and avg_amount < 500):
            yield FraudAlert(
                card_hash=card_hash,
                alert_type="BURST_ATTACK",
                confidence=min(0.95, count / 20),
                transactions=[t["transaction_id"] for t in txn_list],
                details=f"{count} txns, avg ${avg_amount/100:.2f}"
            )

Try It: Window Size Impact Visualizer

Explore the trade-offs of sliding window configuration for fraud detection. Larger windows catch more patterns but increase latency and memory. Smaller slides detect faster but cost more computation.

Why: Sliding windows allow continuous evaluation without waiting for window boundaries. A 5-minute window sliding every 30 seconds catches fraud quickly while reducing computational overhead compared to per-event evaluation.
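
The burst rule itself is independent of Flink and easy to exercise with plain Python; the sample transactions below are invented for illustration:

```python
def is_burst(transactions: list[dict]) -> bool:
    """The chapter's burst rule: >10 txns, or >5 txns averaging under $5."""
    count = len(transactions)
    if count == 0:
        return False
    avg_cents = sum(t["amount_cents"] for t in transactions) / count
    return count > 10 or (count > 5 and avg_cents < 500)

# Card-testing pattern: six $1.50 probes in one window -> flagged
probes = [{"amount_cents": 150} for _ in range(6)]
assert is_burst(probes)

# Normal shopping: three ordinary purchases -> not flagged
normal = [{"amount_cents": 4523}, {"amount_cents": 1299}, {"amount_cents": 899}]
assert not is_burst(normal)
```

Keeping the rule as a pure function like this also makes it unit-testable outside the streaming job, which helps when tuning thresholds against historical fraud data.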

What we do: Configure checkpointing and idempotent sinks to prevent duplicates.

Checkpointing Configuration:

from pyflink.datastream import CheckpointingMode

# Enable exactly-once checkpointing
env.enable_checkpointing(
    interval=10000,  # Checkpoint every 10 seconds
    mode=CheckpointingMode.EXACTLY_ONCE
)

# Checkpoint configuration
env.get_checkpoint_config().set_min_pause_between_checkpoints(5000)
env.get_checkpoint_config().set_checkpoint_timeout(60000)
env.get_checkpoint_config().set_max_concurrent_checkpoints(1)
env.get_checkpoint_config().enable_externalized_checkpoints(
    ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION
)

# State backend for large state (card history)
env.set_state_backend(RocksDBStateBackend("s3://flink-state/checkpoints"))

Idempotent Alert Sink:

class IdempotentAlertSink:
    """Ensures each alert is processed exactly once."""

    def __init__(self, redis_client, alert_topic):
        self.redis = redis_client
        self.kafka_producer = KafkaProducer(
            bootstrap_servers="kafka:9092",
            enable_idempotence=True,  # Kafka idempotent producer
            acks="all"
        )
        self.alert_topic = alert_topic

    def invoke(self, alert: FraudAlert):
        # Deduplication key: alert_type + card + window_end
        dedup_key = f"alert:{alert.alert_type}:{alert.card_hash}:{alert.window_end}"

        # Check if already processed (Redis SET NX with TTL)
        if self.redis.set(dedup_key, "1", nx=True, ex=3600):
            # First time seeing this alert - process it
            self.kafka_producer.send(
                self.alert_topic,
                key=alert.card_hash,
                value=alert.to_json()
            )
            # Also write to PostgreSQL for audit
            self.write_to_postgres(alert)
        # else: Already processed, skip (idempotent)

    def write_to_postgres(self, alert):
        """Idempotent insert using ON CONFLICT."""
        self.cursor.execute("""
            INSERT INTO fraud_alerts
                (alert_id, card_hash, alert_type, confidence, created_at)
            VALUES (%s, %s, %s, %s, %s)
            ON CONFLICT (alert_id) DO NOTHING
        """, (alert.id, alert.card_hash, alert.alert_type,
              alert.confidence, alert.created_at))

Try It: Delivery Guarantee and Checkpoint Simulator

Compare delivery semantics and see how checkpoint intervals affect data safety and overhead. What happens when a failure occurs mid-processing?

Why: Exactly-once requires coordination between Flink checkpoints, Kafka transactions, and downstream sinks. The Redis deduplication key and PostgreSQL ON CONFLICT ensure that even if a failure causes reprocessing, each alert is delivered exactly once.
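
The Redis `SET key NX EX ttl` pattern can be mimicked in-process to make the logic concrete. A plain dict with expiry timestamps stands in for Redis here — fine for seeing the behavior, not a substitute for a shared store in a distributed pipeline (the window-end timestamp in the key is illustrative):

```python
import time

class DedupStore:
    """In-memory stand-in for Redis `SET key 1 NX EX ttl`."""

    def __init__(self):
        self._expiry = {}  # key -> expiry timestamp (seconds)

    def set_nx(self, key, ttl_s, now=None):
        """True if the key was newly set (first sighting), False if duplicate."""
        now = time.time() if now is None else now
        expiry = self._expiry.get(key)
        if expiry is not None and expiry > now:
            return False               # Duplicate within TTL: skip processing
        self._expiry[key] = now + ttl_s
        return True                    # First sighting: caller emits the alert

store = DedupStore()
key = "alert:VELOCITY_ATTACK:hash_xyz789:1736519100"
first = store.set_nx(key, ttl_s=3600, now=0.0)    # True  -> emit alert
replay = store.set_nx(key, ttl_s=3600, now=60.0)  # False -> reprocessed, skip
```

The TTL bounds state growth: after an hour a replayed alert key could fire again, which is acceptable because Flink reprocessing windows are much shorter than that.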

What we do: Configure watermarks and allowed lateness for out-of-order events.

Watermark Strategy:

from pyflink.datastream import WatermarkStrategy
from pyflink.common import Duration

# Watermark: event_time - 30 seconds (allow 30s network delay)
watermark_strategy = (
    WatermarkStrategy
    .for_bounded_out_of_orderness(Duration.of_seconds(30))
    .with_timestamp_assigner(
        lambda t, _: t["timestamp"].timestamp() * 1000
    )
)

transactions_with_watermarks = transactions.assign_timestamps_and_watermarks(
    watermark_strategy
)

# Window configuration with allowed lateness
velocity_alerts = (
    transactions_with_watermarks
    .key_by(lambda t: t["card_hash"])
    .window(SlidingEventTimeWindows.of(Time.minutes(5), Time.seconds(30)))
    .allowed_lateness(Time.minutes(2))  # Accept late data up to 2 min
    .side_output_late_data(late_data_tag)  # Capture very late data
    .apply(VelocityDetector())
)

# Handle late data separately (log for investigation)
late_transactions = velocity_alerts.get_side_output(late_data_tag)
late_transactions.add_sink(late_data_sink)  # Write to audit log

Late Data Handling Policy:

# Terminal clock drift detection
class ClockDriftMonitor:
    def process(self, transaction):
        event_time = transaction["timestamp"]
        processing_time = datetime.now(timezone.utc)
        drift_seconds = (processing_time - event_time).total_seconds()

        # Flag terminals with >60s clock drift
        if abs(drift_seconds) > 60:
            yield TerminalAlert(
                terminal_id=transaction["terminal_id"],
                alert_type="CLOCK_DRIFT",
                drift_seconds=drift_seconds
            )

Try It: Late Event Simulator

Explore how watermarks and allowed lateness determine which events get processed and which are dropped. Inject late events and see how the system handles them.

Why: Payment terminals may have network delays or clock drift. A 30-second watermark delay handles normal network latency, while 2-minute allowed lateness captures most edge cases. Very late data goes to side output for manual review rather than being silently dropped.
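
The routing decision — process normally, accept late, or divert to the side output — reduces to comparing the current watermark against the event's window end. This is a simplified model (real Flink tracks watermarks per partition and fires windows incrementally), with times in plain seconds:

```python
def route_event(window_end_s: float, watermark_s: float,
                allowed_lateness_s: float = 120.0) -> str:
    """Where does an event go, given its window and the current watermark?"""
    if watermark_s < window_end_s:
        return "on_time"        # Window still open: normal processing
    if watermark_s < window_end_s + allowed_lateness_s:
        return "late_accepted"  # Window re-fires including the late element
    return "side_output"        # Too late: audit log, never a silent drop

# Window ends at t=100s; allowed lateness is the chapter's 2 minutes
print(route_event(100, 90))    # watermark behind window end
print(route_event(100, 150))   # window fired, but within lateness
print(route_event(100, 300))   # beyond lateness -> side output
```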

What we do: Add metrics and alerting for pipeline reliability.

Key Metrics (exposed to Prometheus):

pipeline_metrics = {
    "transactions_processed_total": Counter,
    "transactions_processed_rate": Gauge,     # TPS
    "processing_latency_ms": Histogram,       # End-to-end
    "event_time_lag_ms": Gauge,               # Real-time lag
    "fraud_alerts_total": Counter,
    "checkpoint_duration_ms": Histogram,
    "checkpoint_failures_total": Counter,
    "backpressure_ratio": Gauge
}

Critical Alert Rules (Prometheus/Alertmanager):

- alert: HighProcessingLatency
  expr: histogram_quantile(0.99, sum(rate(processing_latency_ms_bucket[5m])) by (le)) > 500
  for: 2m
  annotations:
    summary: "Fraud detection latency exceeds 500ms SLA"
- alert: CheckpointFailures
  expr: increase(checkpoint_failures_total[5m]) > 0
  for: 1m
- alert: HighBackpressure
  expr: backpressure_ratio > 0.5
  for: 5m

Try It: Pipeline Health Dashboard

Simulate a fraud detection pipeline and see how key metrics change under different conditions. Adjust throughput, latency, and failure rates to see when alerts trigger.

Why: Real-time fraud detection is useless if the pipeline is slow or losing data. These metrics ensure the team knows immediately if processing latency exceeds the 500ms SLA or if checkpoints fail (risking exactly-once guarantees).
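
A toy version of the alert evaluation makes the thresholds concrete (Prometheus of course does this server-side against time series, not point-in-time dicts):

```python
def evaluate_alerts(metrics: dict) -> list[str]:
    """Check current metric values against the chapter's alert thresholds."""
    alerts = []
    if metrics.get("p99_latency_ms", 0) > 500:          # 500ms SLA
        alerts.append("HighProcessingLatency")
    if metrics.get("checkpoint_failures_5m", 0) > 0:    # Any failure is critical
        alerts.append("CheckpointFailures")
    if metrics.get("backpressure_ratio", 0.0) > 0.5:    # Sustained backpressure
        alerts.append("HighBackpressure")
    return alerts

healthy = {"p99_latency_ms": 340, "checkpoint_failures_5m": 0,
           "backpressure_ratio": 0.1}
degraded = {"p99_latency_ms": 620, "checkpoint_failures_5m": 2,
            "backpressure_ratio": 0.7}
```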

Outcome: A production fraud detection pipeline processing 2,000 TPS with <200ms average latency and exactly-once delivery guarantees.

Pipeline Performance:

| Metric | Target | Achieved |
|--------|--------|----------|
| Throughput | 2,000 TPS | 2,500 TPS (headroom) |
| P50 Latency | <200ms | 85ms |
| P99 Latency | <500ms | 340ms |
| False Positive Rate | <1% | 0.3% |
| Missed Fraud Rate | <0.1% | 0.05% |
| Exactly-Once Delivery | 100% | 100% (verified) |

Architecture Summary:

Figure 7.1: Fraud detection pipeline architecture showing payment terminals feeding into Kafka with card_hash partitioning, Flink processing with velocity and burst detection sliding windows, Redis deduplication, and output to PostgreSQL alerts and monitoring dashboards

Key Decisions Made:

  1. Kafka partitioning by card_hash: Ensures per-card ordering without coordination
  2. Sliding windows: Balance detection speed vs computational cost
  3. 30-second watermarks: Handle network delays while keeping latency low
  4. Redis deduplication: Ensures exactly-once even across Flink restarts
  5. RocksDB state backend: Handles large per-card state (transaction history)
  6. Side output for late data: Never silently drop data, always audit

Concept Relationships

Understanding pitfalls connects to several related concepts:

Contrast with: Batch processing errors (can be fixed by re-running entire batch) vs. streaming pitfalls (require careful design to prevent data loss or duplication in real-time)

Key Takeaway

The two most critical stream processing pitfalls are non-idempotent processing (causing duplicate billing, counting, or actions on message retries – fix with unique message IDs and dedup checks) and missing backpressure handling (causing memory exhaustion when producers overwhelm consumers – fix with bounded queues and blocking producers). The fraud detection worked example demonstrates a complete production pipeline achieving <200ms P50 latency and 0.05% missed fraud rate through key-based partitioning, sliding windows, and Redis-backed deduplication for exactly-once delivery.

The Sensor Squad catches a sneaky credit card thief using stream processing!

Max the Microcontroller works at a bank, watching 2,000 credit card transactions every second. One day, something suspicious happens!

At 2:00 PM, a card is used at a coffee shop in London. At 2:03 PM, the SAME card is used at a store in New York! That is 5,500 km away in just 3 minutes!

“Nobody can fly that fast!” shouts Sammy the Sensor. “That is a VELOCITY ATTACK – someone stole the card number!”

But how did they catch it so fast? The Sensor Squad used a special pipeline:

Step 1: All transactions from the same card go to the same detective (key-based partitioning). This way, one detective sees ALL of a card’s transactions.

Step 2: The detective keeps a 5-minute sliding window – looking at the last 5 minutes of purchases for each card, checking every 30 seconds.

Step 3: For each pair of transactions, the detective calculates: “How far apart are these locations? How much time passed? Is the speed humanly possible?”

Step 4: London to New York in 3 minutes = 110,000 km/h. That is WAY faster than any airplane! ALERT!

Lila the LED adds: “And we have to be really careful about duplicates! If a retry makes us flag the same transaction twice, we might block a card that is actually fine.”

Bella the Battery finishes: “That is why we use deduplication – every alert gets a unique ID. If we see the same ID twice, we skip it. One alert, one action, no mistakes!”

The pipeline catches fraud in under 200 milliseconds – less than the blink of an eye!

7.3.1 Try This at Home!

Play detective! Ask a friend to write down two cities and the time they “visited” each one. Your job: calculate if they could have traveled between the cities in that time! For example, your friend says “I was in my bedroom at 3:00 and the kitchen at 3:01.” Totally possible! But “I was in London at 3:00 and Tokyo at 3:01?” Impossible – that is fraud!

7.6 What’s Next

| If you want to… | Read this |
|-----------------|-----------|
| Study the fundamentals to understand these pitfalls | Stream Processing Fundamentals |
| Learn the architectures that prevent these pitfalls | Stream Processing Architectures |
| Practice avoiding pitfalls in a lab | Lab: Stream Processing |
| See the full stream processing overview | Stream Processing for IoT |

Continue to Hands-On Lab: Basic Stream Processing to implement streaming concepts on ESP32 with a 45-minute Wokwi lab.