7 Common Pitfalls and Worked Examples
| Chapter | Topic |
|---|---|
| Stream Processing | Overview of batch vs. stream processing for IoT |
| Fundamentals | Windowing strategies, watermarks, and event-time semantics |
| Architectures | Kafka, Flink, and Spark Streaming comparison |
| Pipelines | End-to-end ingestion, processing, and output design |
| Challenges | Late data, exactly-once semantics, and backpressure |
| Pitfalls | Common mistakes and production worked examples |
| Basic Lab | ESP32 circular buffers, windows, and event detection |
| Advanced Lab | CEP, pattern matching, and anomaly detection on ESP32 |
| Game & Summary | Interactive review game and module summary |
7.1 Learning Objectives
- Implement idempotent message processing using request IDs and deduplication to prevent double billing, double counting, and duplicate actions on retries
- Configure backpressure handling with bounded queues and blocking producers to prevent memory exhaustion when fast producers overwhelm slow consumers
- Design production fraud detection pipelines processing 2,000 TPS with <200ms latency using key-based partitioning, sliding windows, and exactly-once semantics
- Apply watermark strategies and allowed lateness handling to process out-of-order events in real-time payment systems
- Monitor stream processing health using Flink metrics for throughput, latency, checkpoint duration, and backpressure ratios
For Beginners: Stream Processing Pitfalls
Stream processing pitfalls are the common mistakes engineers make when handling real-time IoT data. Think of them as potholes on a road – knowing where they are helps you avoid them. Common traps include assuming data always arrives in order, ignoring late readings, and building systems that cannot recover from failures.
Key Concepts
- Worked Example: Detailed walkthrough of a specific streaming scenario showing both the pitfall pattern and the corrected implementation with quantified impact
- Throughput Bottleneck: Pipeline stage consuming events slower than the upstream stage produces them, causing consumer lag to grow and eventually data loss
- Memory Leak in State Store: Unbounded accumulation of state (event history, pattern matching state) in streaming operator state stores, eventually causing OOM failures
- Late Event Handling Strategy: Design decision for how to treat events arriving after their window has closed — drop silently, route to dead letter, or include with extended watermark
- Idempotent Processing: Stream operation that can be applied multiple times to the same event without changing the result beyond the first application, enabling safe at-least-once delivery
- Thundering Herd: Large number of IoT devices reconnecting simultaneously after a network outage, creating a traffic spike that overwhelms the stream ingestion tier
- Schema Incompatibility: Situation where a new message version cannot be deserialized by an older consumer, causing pipeline failures when producer and consumer are updated independently
- Debugging Streaming Pipelines: Techniques for tracing events through distributed streaming pipelines — sampling, event logging, distributed tracing with correlation IDs
7.2 Common Pitfalls
Common Pitfall: Non-Idempotent Message Processing
The mistake: Without idempotency, retried messages cause duplicate processing. This leads to double counting, double billing, or repeated actions.
Symptoms:
- Duplicate actions (double billing, double counting)
- Data inconsistency
- Retry storms cause cascading effects
- Incorrect totals and aggregations
Wrong approach:
```python
# Non-idempotent: each message increments the counter
def process(msg):
    counter += msg.value  # Duplicate messages double-count!

# Non-idempotent command
def process_command(cmd):
    if cmd.action == "dispense":
        dispense_item()  # Retry = dispense twice!
```
Correct approach:
```python
# Idempotent: use the message ID for deduplication
processed_ids = set()

def process(msg):
    if msg.id in processed_ids:
        return  # Already processed
    processed_ids.add(msg.id)
    counter += msg.value

# Idempotent command with request ID
completed = {}

def process_command(cmd):
    if cmd.request_id in completed:
        return completed[cmd.request_id]  # Return the same result
    result = dispense_item()
    completed[cmd.request_id] = result
    return result
```
How to avoid:
- Include unique message IDs
- Track processed message IDs
- Design idempotent operations
- Use database constraints to prevent duplicates
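The last bullet can be sketched with SQLite, where a primary-key constraint makes the insert itself idempotent, so a retry cannot double-count even if the process restarts and loses its in-memory dedup set (table and column names here are illustrative):

```python
import sqlite3

# Sketch: let a database UNIQUE/PRIMARY KEY constraint enforce idempotency
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE readings (
        msg_id TEXT PRIMARY KEY,  -- unique message ID
        value  INTEGER NOT NULL
    )
""")

def process(msg_id, value):
    # INSERT OR IGNORE is a no-op when msg_id already exists
    conn.execute(
        "INSERT OR IGNORE INTO readings (msg_id, value) VALUES (?, ?)",
        (msg_id, value),
    )

process("msg-1", 10)   # first delivery
process("msg-1", 10)   # network retry: ignored by the constraint
process("msg-2", 5)

total = conn.execute("SELECT SUM(value) FROM readings").fetchone()[0]
assert total == 15  # no double-counting
```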
Putting Numbers to It
A vending machine receives a $2 snack dispense command twice due to network retry. Without idempotency checks:
Non-idempotent outcome:
- Command 1: Dispense snack → Inventory: \(-1\), Customer charge: \(+\$2\)
- Command 2 (retry): Dispense again → Inventory: \(-1\), Customer charge: \(+\$2\)
- Total: \(-2\) items, \(+\$4\) charged, 100% overcharge
Idempotent fix (request ID tracking):
```python
# Request-ID tracking: the retried command becomes a no-op
if request_id in completed:
    return completed[request_id]  # Cached result, no second dispense
result = dispense_item()
completed[request_id] = result
```
Impact at scale: 10,000 transactions/day × 0.1% retry rate = 10 duplicates/day. Without idempotency: \(10 \times \$2 = \$20\) per day in overcharges, roughly $7,300/year in reconciliation costs.
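The impact-at-scale arithmetic can be checked in a couple of lines:

```python
# Quick check of the impact-at-scale figures
transactions_per_day = 10_000
retry_rate = 0.001          # 0.1% of commands are retried
price_dollars = 2

duplicates_per_day = transactions_per_day * retry_rate
overcharge_per_day = duplicates_per_day * price_dollars
overcharge_per_year = overcharge_per_day * 365

assert duplicates_per_day == 10
assert overcharge_per_day == 20
assert overcharge_per_year == 7_300
```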
Common Pitfall: Missing Backpressure Handling
The mistake: Without backpressure, slow consumers are overwhelmed by fast producers. This causes memory exhaustion, dropped messages, and system crashes during traffic spikes.
Symptoms:
- Memory exhaustion
- System crashes under load
- Lost data during traffic spikes
- Increasing latency
Wrong approach:
```python
# Unbounded queue: memory grows indefinitely
queue = []

def producer():
    while True:
        queue.append(generate_data())

def consumer():
    while queue:
        process(queue.pop(0))  # Slower than the producer
# Eventually: OutOfMemoryError
```
Correct approach:
```python
# Bounded queue with backpressure
from queue import Queue
from random import random

queue = Queue(maxsize=1000)

def producer():
    while True:
        queue.put(generate_data(), block=True)  # Blocks when full

# Or drop/sample under pressure
def producer_with_sampling():
    while True:
        data = generate_data()
        if not queue.full():
            queue.put(data)
        elif random() < 0.1:             # Keep ~10% of events under pressure
            queue.put(data, block=True)  # Block only for the sampled events
        # else: drop the event
```
How to avoid:
- Use bounded buffers and queues
- Implement blocking producers
- Add sampling under load
- Monitor queue depths
- Scale consumers dynamically
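A self-contained sketch of the bounded-queue pattern above, with a deliberately small buffer so the producer actually blocks (`producer` and `consumer` stand in for real pipeline stages):

```python
import threading
from queue import Queue

q = Queue(maxsize=5)  # small bound to force backpressure quickly
consumed = []

def producer():
    for i in range(20):
        q.put(i, block=True)  # blocks whenever the consumer lags
    q.put(None)               # sentinel: tell the consumer to stop

def consumer():
    while True:
        item = q.get()
        if item is None:
            break
        consumed.append(item)

t_prod = threading.Thread(target=producer)
t_cons = threading.Thread(target=consumer)
t_prod.start(); t_cons.start()
t_prod.join(); t_cons.join()

assert consumed == list(range(20))  # nothing lost, order preserved
assert q.qsize() == 0               # queue drained; memory stays bounded
```

Because `put(block=True)` pauses the producer instead of letting the queue grow, memory stays bounded no matter how far the consumer falls behind.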
7.3 Worked Example: Real-Time Fraud Detection for Payment Terminals
Scenario: A payment processor operates 50,000 IoT payment terminals (card readers, POS systems) generating 2,000 transactions per second. The fraud team needs to detect suspicious patterns in real time:
- Velocity attacks: The same card used at multiple distant locations within minutes
- Amount anomalies: Unusual transaction amounts for a given merchant
- Burst attacks: Many small transactions in rapid succession
Goal: Design a stream processing pipeline that detects fraud within 500ms of transaction arrival while ensuring exactly-once processing to prevent false positives and missed detections.
What we do: Define the transaction event schema and set up Kafka ingestion.
Event Schema:
```python
# Transaction event from a payment terminal
transaction_schema = {
    "transaction_id": "uuid",     # Unique identifier
    "card_hash": "string",        # Anonymized card identifier
    "terminal_id": "string",      # Payment terminal ID
    "merchant_id": "string",      # Merchant identifier
    "amount_cents": "int",        # Transaction amount in cents
    "currency": "string",         # ISO currency code
    "location": {
        "lat": "double",
        "lon": "double"
    },
    "timestamp": "timestamp",     # Event time (when the card was swiped)
    "terminal_time": "timestamp"  # Processing time (when the terminal sent it)
}

# Example event
example_event = {
    "transaction_id": "txn-a1b2c3d4",
    "card_hash": "hash_xyz789",
    "terminal_id": "term_12345",
    "merchant_id": "merch_grocery_001",
    "amount_cents": 4523,
    "currency": "USD",
    "location": {"lat": 40.7128, "lon": -74.0060},
    "timestamp": "2026-01-10T14:23:45.123Z",
    "terminal_time": "2026-01-10T14:23:45.456Z"
}
```
Kafka Topic Configuration:
```python
# Kafka topic for raw transactions
topic_config = {
    "name": "transactions-raw",
    "partitions": 32,                      # Parallelism for 2K TPS
    "replication_factor": 3,               # Durability
    "retention_ms": 7 * 24 * 3600 * 1000,  # 7 days
    "cleanup_policy": "delete"
}

# Partition by card_hash for ordered per-card processing:
# all transactions for a card go to the same partition
producer.send(
    "transactions-raw",
    key=transaction["card_hash"],
    value=transaction
)
```
Why: Partitioning by card_hash ensures all transactions for a single card arrive at the same Flink task instance in order. This is critical for velocity detection (same card at multiple locations) without expensive cross-partition coordination.
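The routing guarantee can be illustrated with the usual hash-mod-partition-count scheme. Kafka's default partitioner uses murmur2 rather than the MD5 used here, so this only demonstrates the idea that equal keys always map to equal partitions:

```python
import hashlib

NUM_PARTITIONS = 32

def partition_for(key: str) -> int:
    # Stable hash of the key, reduced modulo the partition count.
    # (Illustrative only: Kafka's default partitioner uses murmur2.)
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

# Every transaction for the same card lands in the same partition...
assert partition_for("hash_xyz789") == partition_for("hash_xyz789")
# ...and all partition numbers stay within the configured range
assert 0 <= partition_for("hash_xyz789") < NUM_PARTITIONS
```

Since one consumer owns each partition, that consumer sees a given card's events in production order without any cross-partition coordination.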
What we do: Implement sliding windows to detect velocity and burst attacks.
Flink Pipeline for Velocity Detection:
```python
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.window import SlidingEventTimeWindows
from pyflink.common.time import Time

env = StreamExecutionEnvironment.get_execution_environment()
env.set_parallelism(32)

# Velocity detection: same card, different locations within 5 minutes
velocity_alerts = (
    env.add_source(kafka_source)
    .key_by(lambda t: t["card_hash"])
    .window(SlidingEventTimeWindows.of(
        Time.minutes(5), Time.seconds(30)))
    .apply(VelocityDetector())
)

class VelocityDetector:
    def apply(self, card_hash, window, transactions):
        """Detect impossible travel (implied speed > 500 km/h)."""
        locations = [(t["location"], t["timestamp"])
                     for t in transactions]
        for i, (loc1, t1) in enumerate(locations):
            for loc2, t2 in locations[i+1:]:
                hours = abs((t2 - t1).total_seconds()) / 3600
                if hours == 0:
                    continue  # Identical timestamps: skip the speed check
                speed = haversine(loc1, loc2) / hours  # km/h
                if speed > 500:  # Impossible by car
                    yield FraudAlert(card_hash=card_hash,
                                     alert_type="VELOCITY_ATTACK",
                                     confidence=min(0.99, speed / 1000))
```
Burst Attack Detection:
```python
# Burst detection: >10 transactions in 2 minutes from the same card
burst_alerts = (
    transactions
    .key_by(lambda t: t["card_hash"])
    .window(SlidingEventTimeWindows.of(
        Time.minutes(2),   # Window size: 2 minutes
        Time.seconds(15)   # Slide: check every 15 seconds
    ))
    .apply(BurstDetector())
)

class BurstDetector:
    def apply(self, card_hash, window, transactions):
        """Detect card-testing attacks (many small transactions)."""
        txn_list = list(transactions)
        count = len(txn_list)
        avg_amount = sum(t["amount_cents"] for t in txn_list) / count
        # Flag: >10 transactions OR >5 transactions with avg < $5
        if count > 10 or (count > 5 and avg_amount < 500):
            yield FraudAlert(
                card_hash=card_hash,
                alert_type="BURST_ATTACK",
                confidence=min(0.95, count / 20),
                transactions=[t["transaction_id"] for t in txn_list],
                details=f"{count} txns, avg ${avg_amount/100:.2f}"
            )
```
Why: Sliding windows allow continuous evaluation without waiting for window boundaries. A 5-minute window sliding every 30 seconds catches fraud quickly while reducing computational overhead compared to per-event evaluation.
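The `VelocityDetector` relies on a `haversine` helper that is not shown above; a standard great-circle implementation (distances in km) would be:

```python
from math import radians, sin, cos, asin, sqrt

def haversine(loc1, loc2):
    """Great-circle distance in km between two {lat, lon} points."""
    lat1, lon1 = radians(loc1["lat"]), radians(loc1["lon"])
    lat2, lon2 = radians(loc2["lat"]), radians(loc2["lon"])
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))  # Earth radius ~ 6371 km

# London to New York is roughly 5,570 km, so two swipes
# minutes apart at these coordinates imply impossible travel
nyc = {"lat": 40.7128, "lon": -74.0060}
london = {"lat": 51.5074, "lon": -0.1278}
d = haversine(london, nyc)
assert 5400 < d < 5700
```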
What we do: Configure checkpointing and idempotent sinks to prevent duplicates.
Checkpointing Configuration:
```python
from pyflink.datastream import (CheckpointingMode,
                                ExternalizedCheckpointCleanup)
from pyflink.datastream.state_backend import RocksDBStateBackend

# Enable exactly-once checkpointing
env.enable_checkpointing(
    interval=10000,  # Checkpoint every 10 seconds
    mode=CheckpointingMode.EXACTLY_ONCE
)

# Checkpoint configuration
env.get_checkpoint_config().set_min_pause_between_checkpoints(5000)
env.get_checkpoint_config().set_checkpoint_timeout(60000)
env.get_checkpoint_config().set_max_concurrent_checkpoints(1)
env.get_checkpoint_config().enable_externalized_checkpoints(
    ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION
)

# State backend for large state (card history)
env.set_state_backend(RocksDBStateBackend("s3://flink-state/checkpoints"))
```
Idempotent Alert Sink:
```python
class IdempotentAlertSink:
    """Ensures each alert is processed exactly once."""

    def __init__(self, redis_client, alert_topic):
        self.redis = redis_client
        self.kafka_producer = KafkaProducer(
            bootstrap_servers="kafka:9092",
            enable_idempotence=True,  # Kafka idempotent producer
            acks="all"
        )
        self.alert_topic = alert_topic

    def invoke(self, alert: FraudAlert):
        # Deduplication key: alert_type + card + window_end
        dedup_key = f"alert:{alert.alert_type}:{alert.card_hash}:{alert.window_end}"
        # Check if already processed (Redis SET NX with TTL)
        if self.redis.set(dedup_key, "1", nx=True, ex=3600):
            # First time seeing this alert: process it
            self.kafka_producer.send(
                self.alert_topic,
                key=alert.card_hash,
                value=alert.to_json()
            )
            # Also write to PostgreSQL for audit
            self.write_to_postgres(alert)
        # else: already processed, skip (idempotent)

    def write_to_postgres(self, alert):
        """Idempotent insert using ON CONFLICT."""
        self.cursor.execute("""
            INSERT INTO fraud_alerts
                (alert_id, card_hash, alert_type, confidence, created_at)
            VALUES (%s, %s, %s, %s, %s)
            ON CONFLICT (alert_id) DO NOTHING
        """, (alert.id, alert.card_hash, alert.alert_type,
              alert.confidence, alert.created_at))
```
Why: Exactly-once requires coordination between Flink checkpoints, Kafka transactions, and downstream sinks. The Redis deduplication key and PostgreSQL ON CONFLICT ensure that even if a failure causes reprocessing, each alert is delivered exactly once.
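The dedup guarantee can be simulated without Redis: a plain dict stands in for SET NX below, showing how at-least-once redelivery plus an idempotent sink yields exactly-once effects (key format follows the sink above):

```python
# Simulating the sink's dedup guarantee: even if the same alert is
# delivered twice (at-least-once), it is emitted downstream only once.
store = {}    # stands in for Redis (no TTL in this sketch)
emitted = []  # stands in for the Kafka/Postgres writes

def set_nx(key):
    """Mimics Redis SET key NX: True only for the first writer."""
    if key in store:
        return False
    store[key] = "1"
    return True

def invoke(alert_key):
    if set_nx(alert_key):
        emitted.append(alert_key)  # send downstream exactly once

invoke("alert:VELOCITY_ATTACK:card123:1700000000")  # first delivery
invoke("alert:VELOCITY_ATTACK:card123:1700000000")  # redelivery after restart
assert emitted == ["alert:VELOCITY_ATTACK:card123:1700000000"]
```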
What we do: Configure watermarks and allowed lateness for out-of-order events.
Watermark Strategy:
```python
from pyflink.common import Duration, WatermarkStrategy

# Watermark: event_time - 30 seconds (allow 30 s of network delay)
watermark_strategy = (
    WatermarkStrategy
    .for_bounded_out_of_orderness(Duration.of_seconds(30))
    .with_timestamp_assigner(
        lambda t, _: t["timestamp"].timestamp() * 1000
    )
)

transactions_with_watermarks = transactions.assign_timestamps_and_watermarks(
    watermark_strategy
)

# Window configuration with allowed lateness
velocity_alerts = (
    transactions_with_watermarks
    .key_by(lambda t: t["card_hash"])
    .window(SlidingEventTimeWindows.of(Time.minutes(5), Time.seconds(30)))
    .allowed_lateness(Time.minutes(2))     # Accept late data up to 2 min
    .side_output_late_data(late_data_tag)  # Capture very late data
    .apply(VelocityDetector())
)

# Handle late data separately (log for investigation)
late_transactions = velocity_alerts.get_side_output(late_data_tag)
late_transactions.add_sink(late_data_sink)  # Write to audit log
```
Late Data Handling Policy:
```python
from datetime import datetime, timezone

# Terminal clock drift detection
class ClockDriftMonitor:
    def process(self, transaction):
        event_time = transaction["timestamp"]
        processing_time = datetime.now(timezone.utc)
        drift_seconds = (processing_time - event_time).total_seconds()
        # Flag terminals with >60 s clock drift
        if abs(drift_seconds) > 60:
            yield TerminalAlert(
                terminal_id=transaction["terminal_id"],
                alert_type="CLOCK_DRIFT",
                drift_seconds=drift_seconds
            )
```
Why: Payment terminals may have network delays or clock drift. A 30-second watermark delay handles normal network latency, while 2-minute allowed lateness captures most edge cases. Very late data goes to side output for manual review rather than being silently dropped.
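The bounded-out-of-orderness strategy can be emulated in a few lines to show why a 30-second bound tolerates typical reordering (timestamps below are illustrative epoch seconds):

```python
MAX_OUT_OF_ORDERNESS = 30  # seconds, matching the Flink strategy above

def watermark(max_event_time_seen: float) -> float:
    # The watermark trails the highest event time seen by the bound;
    # any event with timestamp <= watermark is considered late.
    return max_event_time_seen - MAX_OUT_OF_ORDERNESS

events = [100.0, 105.0, 90.0, 130.0]  # third event arrives out of order
max_seen = 0.0
late = []
for ts in events:
    if ts <= watermark(max_seen):
        late.append(ts)  # would go to the side output
    max_seen = max(max_seen, ts)

# 90.0 is only 15 s behind the max seen (105.0), within the 30 s bound
assert late == []

# An event 31 s behind the max event time would be flagged late:
assert 99.0 <= watermark(130.0)
```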
What we do: Add metrics and alerting for pipeline reliability.
Key Metrics (exposed to Prometheus):
```python
pipeline_metrics = {
    "transactions_processed_total": Counter,
    "transactions_processed_rate": Gauge,  # TPS
    "processing_latency_ms": Histogram,    # End-to-end
    "event_time_lag_ms": Gauge,            # Real-time lag
    "fraud_alerts_total": Counter,
    "checkpoint_duration_ms": Histogram,
    "checkpoint_failures_total": Counter,
    "backpressure_ratio": Gauge
}
```
Critical Alert Rules (Prometheus/Alertmanager):
```yaml
- alert: HighProcessingLatency
  expr: histogram_quantile(0.99, processing_latency_ms) > 500
  for: 2m
  annotations:
    summary: "Fraud detection latency exceeds 500ms SLA"
- alert: CheckpointFailures
  expr: increase(checkpoint_failures_total[5m]) > 0
  for: 1m
- alert: HighBackpressure
  expr: backpressure_ratio > 0.5
  for: 5m
```
Why: Real-time fraud detection is useless if the pipeline is slow or losing data. These metrics ensure the team knows immediately if processing latency exceeds the 500ms SLA or if checkpoints fail (risking exactly-once guarantees).
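The p99-versus-SLA comparison that the HighProcessingLatency rule encodes can be sanity-checked offline; the latency samples below are illustrative, loosely matching the achieved figures reported later:

```python
import statistics

# 100 latency samples: mostly fast, a few slow tail events
latencies_ms = [85] * 90 + [200] * 9 + [340]

# statistics.quantiles with n=100 yields 99 cut points;
# index 98 approximates the 99th percentile
p99 = statistics.quantiles(latencies_ms, n=100)[98]

assert p99 <= 500   # within the 500 ms SLA: alert stays quiet
assert p99 > 200    # but the tail is well above the median
```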
Outcome: A production fraud detection pipeline processing 2,000 TPS with <200ms average latency and exactly-once delivery guarantees.
Pipeline Performance:

| Metric | Target | Achieved |
|---|---|---|
| Throughput | 2,000 TPS | 2,500 TPS (headroom) |
| P50 Latency | <200ms | 85ms |
| P99 Latency | <500ms | 340ms |
| False Positive Rate | <1% | 0.3% |
| Missed Fraud Rate | <0.1% | 0.05% |
| Exactly-Once Delivery | 100% | 100% (verified) |
Architecture Summary:
Key Decisions Made:
- Kafka partitioning by card_hash: Ensures per-card ordering without coordination
- Sliding windows: Balance detection speed vs computational cost
- 30-second watermarks: Handle network delays while keeping latency low
- Redis deduplication: Ensures exactly-once even across Flink restarts
- RocksDB state backend: Handles large per-card state (transaction history)
- Side output for late data: Never silently drop data, always audit
Concept Relationships
Understanding pitfalls connects to several related concepts:
- Upstream: Stream Processing Pipelines - Architecture patterns that these pitfalls can compromise
- Parallel: Handling Real-World Challenges - Solutions for late data, exactly-once semantics, fault tolerance
- Foundational: Stream Processing Fundamentals - Core concepts of windowing and event-time processing
Contrast with: Batch processing errors (can be fixed by re-running entire batch) vs. streaming pitfalls (require careful design to prevent data loss or duplication in real-time)
See Also
- Database Transactions - ACID guarantees that complement streaming exactly-once semantics
- MQTT QoS Fundamentals - At-least-once vs. exactly-once delivery guarantees
- Resilience Patterns - Circuit breakers and retry strategies
Key Takeaway
The two most critical stream processing pitfalls are non-idempotent processing (causing duplicate billing, counting, or actions on message retries – fix with unique message IDs and dedup checks) and missing backpressure handling (causing memory exhaustion when producers overwhelm consumers – fix with bounded queues and blocking producers). The fraud detection worked example demonstrates a complete production pipeline achieving <200ms P50 latency and 0.05% missed fraud rate through key-based partitioning, sliding windows, and Redis-backed deduplication for exactly-once delivery.
For Kids: Meet the Sensor Squad!
The Sensor Squad catches a sneaky credit card thief using stream processing!
Max the Microcontroller works at a bank, watching 2,000 credit card transactions every second. One day, something suspicious happens!
At 2:00 PM, a card is used at a coffee shop in London. At 2:03 PM, the SAME card is used at a store in New York! That is 5,500 km away in just 3 minutes!
“Nobody can fly that fast!” shouts Sammy the Sensor. “That is a VELOCITY ATTACK – someone stole the card number!”
But how did they catch it so fast? The Sensor Squad used a special pipeline:
Step 1: All transactions from the same card go to the same detective (key-based partitioning). This way, one detective sees ALL of a card’s transactions.
Step 2: The detective keeps a 5-minute sliding window – looking at the last 5 minutes of purchases for each card, checking every 30 seconds.
Step 3: For each pair of transactions, the detective calculates: “How far apart are these locations? How much time passed? Is the speed humanly possible?”
Step 4: London to New York in 3 minutes = 110,000 km/h. That is WAY faster than any airplane! ALERT!
Lila the LED adds: “And we have to be really careful about duplicates! If a retry makes us flag the same transaction twice, we might block a card that is actually fine.”
Bella the Battery finishes: “That is why we use deduplication – every alert gets a unique ID. If we see the same ID twice, we skip it. One alert, one action, no mistakes!”
The pipeline catches fraud in under 200 milliseconds – less than the blink of an eye!
7.4.1 Try This at Home!
Play detective! Ask a friend to write down two cities and the time they “visited” each one. Your job: calculate if they could have traveled between the cities in that time! For example, your friend says “I was in my bedroom at 3:00 and the kitchen at 3:01.” Totally possible! But “I was in London at 3:00 and Tokyo at 3:01?” Impossible – that is fraud!
7.6 What’s Next
| If you want to… | Read this |
|---|---|
| Study the fundamentals to understand these pitfalls | Stream Processing Fundamentals |
| Learn the architectures that prevent these pitfalls | Stream Processing Architectures |
| Practice avoiding pitfalls in a lab | Lab: Stream Processing |
| See the full stream processing overview | Stream Processing for IoT |
Continue to Hands-On Lab: Basic Stream Processing to implement streaming concepts on ESP32 with a 45-minute Wokwi lab.