29  Big Data Pipelines

In 60 Seconds

Big data pipelines use Lambda Architecture to combine batch processing (accurate historical analysis) and stream processing (real-time approximate results), with windowing strategies dividing infinite data streams into processable chunks and watermarks handling late-arriving data.

Learning Objectives

After completing this chapter, you will be able to:

  • Design Lambda architecture combining batch and stream processing
  • Implement stream processing with windowing strategies
  • Handle late-arriving data with watermarks
  • Choose between batch, stream, and hybrid processing approaches

Key Concepts

  • Data ingestion: The first stage of a pipeline that receives raw sensor data from devices, validates format and completeness, and routes it to the appropriate storage or processing system.
  • ETL (Extract, Transform, Load): A pipeline pattern that extracts data from sources, transforms it (cleaning, aggregation, format conversion), and loads it into a target system for analysis.
  • Message broker: A middleware component (e.g., Apache Kafka, AWS Kinesis) that buffers and routes high-velocity IoT data streams between producers (sensors) and consumers (processors).
  • Schema registry: A centralised service that stores and validates data schemas, ensuring that sensor payloads conform to expected formats as the system evolves.
  • Dead-letter queue: A storage location for messages that fail processing validation, preserving them for investigation rather than silently dropping them.
  • Pipeline orchestration: The automated scheduling, dependency management, and failure handling of multi-step data workflows using tools like Apache Airflow or AWS Step Functions.
  • Data lineage: The ability to trace a derived analytics result back through every transformation step to its original sensor reading, essential for debugging and regulatory compliance.

A big data pipeline is like a water treatment plant for information. Raw sensor data flows in from thousands of IoT devices, gets filtered and cleaned, is processed and transformed, and flows out as useful insights. Each stage handles a specific job, and the whole system runs continuously and automatically.

29.1 Lambda Architecture

Lambda architecture processes IoT data through parallel layers that work together: a batch layer for accuracy and a speed layer for low latency, with a serving layer merging their results to provide both fast and correct answers.

Step-by-Step Process:

  1. Data Arrives: Raw sensor event (e.g., “temperature: 23.5°C at 14:32:15”)
  2. Split Path: Event simultaneously flows to BOTH batch and speed layers
  3. Speed Layer Processing (Milliseconds):
    • Flink reads event from Kafka
    • Updates in-memory window aggregate (last 5 minutes average)
    • Writes approximate result to serving layer
    • User sees update within 100ms
  4. Batch Layer Processing (Hours):
    • Event written to HDFS data lake
    • Nightly Spark job recomputes ALL historical aggregates
    • Overwrites batch views with exact corrected values
    • Includes late-arriving data that speed layer missed
  5. Serving Layer Merge:
    • Query: “What was average temp yesterday 14:30-14:35?”
    • If asking about yesterday (complete): serve from batch views (exact)
    • If asking about last 10 minutes (incomplete): serve from speed views (approximate)
    • For recent completed windows: batch view replaces speed view automatically

Why This Works: Speed layer sacrifices accuracy for speed (drops late data). Batch layer sacrifices speed for accuracy (waits for all data). Serving layer gives you whichever is available - fast approximate now, exact answer later. Users get immediate feedback, and wrong answers self-correct within hours.

The Lambda Architecture solves a fundamental problem: real-time analytics vs accurate historical analysis. You need both, but they have conflicting requirements.

Lambda Architecture diagram showing IoT data flowing simultaneously to a batch layer for accurate historical processing and a speed layer for real-time approximate results, with both feeding into a unified serving layer
Figure 29.1: Lambda Architecture: Batch Layer for Accuracy, Speed Layer for Latency

Lambda Architecture: IoT data flows to both batch layer (accurate historical analysis) and speed layer (real-time approximate), merged in serving layer to provide low-latency queries with eventual accuracy.

29.1.1 Why Lambda Architecture?

Requirement | Batch Only | Stream Only | Lambda (Both)
Latency | Hours | Milliseconds | Milliseconds
Accuracy | 100% correct | May miss late data | 100% eventually
Cost | Low | High | Medium
Complexity | Simple | Medium | High

29.1.2 Lambda Implementation Example

# Assumes an active SparkSession named `spark`
from pyspark.sql.functions import sum, window

# BATCH LAYER: Complete recompute (accurate, slow)
def batch_layer():
    # Read entire historical dataset
    all_sensor_data = spark.read.parquet("s3://iot-data/raw/")

    # Full aggregation (correct answer)
    daily_totals = all_sensor_data \
        .groupBy("date", "sensor_id") \
        .agg(sum("energy_kwh").alias("total_energy"))

    # Write to batch views (replaces previous)
    daily_totals.write.mode("overwrite") \
        .parquet("s3://iot-data/batch-views/daily-energy/")

# SPEED LAYER: Incremental updates (approximate, fast)
def speed_layer():
    # Subscribe to the live sensor stream from Kafka
    # (kafka.bootstrap.servers is required; broker address is a placeholder)
    streaming_data = spark.readStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", "broker:9092") \
        .option("subscribe", "sensor-readings") \
        .load()

    # Incremental aggregation (approximate - may miss late data)
    hourly_totals = streaming_data \
        .withWatermark("timestamp", "10 minutes") \
        .groupBy(window("timestamp", "1 hour"), "sensor_id") \
        .agg(sum("energy_kwh").alias("total_energy"))

    # Write to real-time views
    hourly_totals.writeStream \
        .outputMode("update") \
        .format("delta") \
        .start("s3://iot-data/speed-views/hourly-energy/")

# SERVING LAYER: Merge batch + speed
def serving_layer(query_date):
    if query_date < today():  # today() assumed to return the current date
        # Historical: use accurate batch view
        return spark.read.parquet("s3://iot-data/batch-views/daily-energy/")
    else:
        # Current: use speed layer (approximate)
        return spark.read.parquet("s3://iot-data/speed-views/hourly-energy/")

29.2 Stream Processing with Windowing

IoT data arrives continuously - you can’t wait for “all” data before computing. Windowing divides infinite streams into finite, processable chunks.

Comparison of three stream processing window types: tumbling windows with fixed non-overlapping intervals, sliding windows with overlapping intervals, and session windows grouped by activity gaps
Figure 29.2: Three Window Types: Tumbling, Sliding, and Session

29.2.1 Window Type Comparison

Window Type | Best For | Example
Tumbling | Periodic aggregates (hourly/daily totals) | “Total energy per hour”
Sliding | Moving averages, trend detection | “5-minute average, updated every minute”
Session | User activity, event sequences | “Group clicks until 30s inactivity”

29.2.2 Implementing Windowed Aggregation

# Assumes: from pyspark.sql.functions import window, session_window, avg, count

# Tumbling window: Non-overlapping 1-minute buckets
tumbling_result = sensor_stream \
    .groupBy(window("timestamp", "1 minute"), "sensor_id") \
    .agg(avg("temperature").alias("avg_temp"))

# Sliding window: 5-minute window, slides every 1 minute
sliding_result = sensor_stream \
    .groupBy(window("timestamp", "5 minutes", "1 minute"), "sensor_id") \
    .agg(avg("temperature").alias("avg_temp"))

# Session window: Group events with < 30 second gap
session_result = sensor_stream \
    .groupBy(session_window("timestamp", "30 seconds"), "user_id") \
    .agg(count("*").alias("events_in_session"))

29.3 Handling Late-Arriving Data

IoT networks are imperfect - data arrives late due to network delays, sensor buffering, or connectivity issues. Watermarks define how long to wait for late data.

Watermark handling diagram showing how events arriving within the tolerance window are processed while late events exceeding the watermark threshold are dropped
Figure 29.3: Watermark Handling: Late Events with Different Tolerance Settings

Watermark Strategy for Late Events: Events arriving after the watermark tolerance are dropped; setting appropriate watermark balances completeness (higher tolerance) against latency (lower tolerance).

29.3.1 Watermark Configuration

# Configure watermark: Accept events up to 2 minutes late
sensor_stream_with_watermark = sensor_stream \
    .withWatermark("event_timestamp", "2 minutes") \
    .groupBy(
        window("event_timestamp", "1 minute"),
        "sensor_id"
    ) \
    .agg(avg("temperature").alias("avg_temp"))

# Watermark tradeoffs:
# - 10 seconds: Fast results, may miss delayed sensors
# - 2 minutes: More complete, but delayed output
# - 1 hour: Very complete, but results arrive 1 hour late

29.3.2 Watermark and Processing-Model Tradeoffs

Watermark tolerance trades data completeness against output latency. The scenario below shows how latency requirements drive the choice of processing model for a typical IoT sensor deployment.

Scenario: A smart city has 5,000 parking sensors that send occupancy status (occupied/free) whenever it changes. During rush hour, this generates 50,000 updates/hour. The city needs: 1. Real-time parking availability map (updated within 5 seconds) 2. Daily parking utilization reports for urban planning 3. Monthly revenue tracking for parking meters

Think about: Which processing model should you use for each requirement?

Key Insights:

  1. Real-time Parking Map - Stream Processing
    • Why: Users need immediate information (“Is parking available NOW?”)
    • Latency: 5 seconds acceptable
    • Volume: 50,000 updates/hour = 13.9 updates/second (manageable for streaming)
    • Technology: Apache Kafka + Spark Streaming
    • Real number: Stream processing delivers results in 2-5 seconds vs 1-24 hours for batch
  2. Daily Utilization Reports - Batch Processing
    • Why: Historical analysis doesn’t need real-time processing
    • Latency: 24 hours acceptable (run nightly)
    • Volume: Process full day’s data in one job
    • Technology: Apache Spark batch job scheduled at midnight
    • Real number: Batch processing is 10x cheaper than maintaining real-time streaming for historical queries
  3. Monthly Revenue - Batch Processing
    • Why: Aggregating 30 days of data, accuracy > speed
    • Latency: Can run monthly (1st of each month)
    • Volume: 50,000 events/hour x 720 hours/month = 36 million events
    • Technology: SQL query on data warehouse
    • Real number: Monthly batch job costs $5 vs $150/month for continuous stream processing
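The volume figures in these insights follow from quick arithmetic; this sketch uses only the rates stated in the scenario:

```python
# Back-of-envelope check of the parking-scenario volumes (rates from the text)
UPDATES_PER_HOUR = 50_000          # rush-hour update rate
HOURS_PER_MONTH = 720

updates_per_second = UPDATES_PER_HOUR / 3600
events_per_month = UPDATES_PER_HOUR * HOURS_PER_MONTH

print(f"{updates_per_second:.1f} updates/second")        # ~13.9, easily handled by streaming
print(f"{events_per_month / 1e6:.0f}M events/month")     # 36M
```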

Lambda Architecture Cost-Benefit: Smart parking system with 5,000 sensors reporting every minute. Daily messages: \(5000 \times 1440 = 7.2M\) messages. Monthly: \(7.2M \times 30 = 216M\) messages.

Speed layer (Kafka + Flink real-time): \(216M \times \$0.10/\text{million} = \$21.60\) messaging + \(\$150\) compute (2 workers 24/7) = \(\$171.60/\text{month}\).

Batch layer (Spark nightly): Storage \(30 \times 1.08\text{ GB/day} = 32.4\text{ GB} \times \$0.023 = \$0.75\) + nightly job \(\$3\) = \(\$3.75/\text{month}\).

Total Lambda: \(\$171.60 + \$3.75 = \$175.35/\text{month}\).

Stream-only alternative: Run Flink 24/7 for historical queries = \(\$171.60 + \$100\) (additional history storage) = \(\$271.60\) (55% more expensive). Batch-only: Wait 24 hours for insights = unacceptable for real-time parking availability.

Lambda provides real-time speed layer for \(\$171.60\) + exact historical accuracy for \(\$3.75\), optimal cost-performance balance.
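The cost-benefit arithmetic above can be reproduced directly; this sketch hard-codes the unit prices the example assumes:

```python
# Reproduces the Lambda cost-benefit arithmetic (unit prices from the text)
MESSAGES_PER_DAY = 5_000 * 1_440             # 5,000 sensors, one message/minute = 7.2M
messages_per_month = MESSAGES_PER_DAY * 30   # 216M

speed_messaging = messages_per_month / 1e6 * 0.10  # $0.10 per million messages
speed_compute = 150.0                              # 2 workers running 24/7
speed_total = speed_messaging + speed_compute      # $171.60

batch_storage = round(30 * 1.08 * 0.023, 2)        # 32.4 GB at $0.023/GB ~= $0.75
batch_compute = 3.0                                # nightly Spark job
batch_total = batch_storage + batch_compute        # $3.75

print(f"Lambda total: ${speed_total + batch_total:.2f}/month")  # $175.35
```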

Cost Comparison:

Approach | Infrastructure | Cost/Month | Use Case
Stream (All 3) | Kafka + Spark Streaming 24/7 | $500 | Overkill for batch needs
Batch (All 3) | Nightly Spark jobs | $50 | Too slow for real-time map
Hybrid (Stream + Batch) | Streaming for map, batch for reports | $200 | Optimal cost/performance

Decision Rule:

Use STREAM processing when:
- Latency requirement < 1 minute
- Results needed continuously
- Data arrives in real-time
- Immediate action required (alerts, dashboards)

Use BATCH processing when:
- Latency requirement > 1 hour
- Results needed periodically (daily/weekly/monthly)
- Processing full datasets for accuracy
- Complex analytics on historical data

Use LAMBDA (Both) when:
- Need real-time + historical views
- Real-time alerts + periodic reports
- Different latency needs for different queries
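The decision rules above can be captured in a small helper; the function name and thresholds here are illustrative, taken straight from the rules:

```python
# Hypothetical helper encoding the stream/batch/lambda decision rules above
def choose_processing_model(latency_seconds: float, needs_exact_history: bool) -> str:
    """Map a latency requirement (and need for exact history) to a processing model."""
    if latency_seconds < 60:                       # sub-minute latency demands streaming
        return "lambda" if needs_exact_history else "stream"
    if latency_seconds > 3600:                     # hour-plus latency: batch is fine
        return "batch"
    return "lambda"                                # mixed requirements in between

print(choose_processing_model(5, needs_exact_history=False))      # stream
print(choose_processing_model(86_400, needs_exact_history=False)) # batch
print(choose_processing_model(5, needs_exact_history=True))       # lambda
```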

29.4 Batch vs Stream Processing Decision

Factor | Use Batch | Use Stream | Use Lambda (Both)
Latency Need | > 1 hour | < 1 minute | Mixed (real-time + reports)
Data Volume | TB-PB scale | MB-GB/sec | Both
Query Pattern | Historical trends | Live dashboards | Real-time + historical
Cost (1 TB/day) | ~$1,300/month | ~$10,800/month | ~$1,500/month (hybrid)
Example | Monthly reports | Fraud detection | E-commerce analytics

29.4.1 Cost Breakdown Example (1 TB/day processing)

Batch Only:

Storage: 30 TB/month x $0.023/GB = $690/month
Compute: 1 daily Spark job x 100 nodes x 2 hours x $0.10/hour = $20/day = $600/month
Total: $1,290/month

Stream Only:

Kafka: 3 brokers x $200/month = $600/month
Spark Streaming: 50 workers x 24/7 x $0.20/hour = $7,200/month
Storage (30 days): 30 TB x $0.10/GB = $3,000/month
Total: $10,800/month (8.4x more expensive!)

Lambda Architecture (Hybrid):

Stream Layer: Last 24 hours (real-time alerts) = $300/month
Batch Layer: Historical (daily aggregates) = $1,000/month
Serving Layer: Query merging = $200/month
Total: $1,500/month
Best of both: Real-time insights + historical accuracy
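A few lines verify the breakdown (unit prices as stated above; the 8.4x ratio falls out of the arithmetic):

```python
# Verifies the 1 TB/day cost breakdown using the text's unit prices
GB_PER_TB = 1_000

batch = 30 * GB_PER_TB * 0.023 + 20 * 30                 # storage + daily Spark job
stream = 3 * 200 + 50 * 24 * 30 * 0.20 + 30 * GB_PER_TB * 0.10  # Kafka + workers + storage

print(f"Batch-only:  ${batch:,.0f}/month")   # $1,290
print(f"Stream-only: ${stream:,.0f}/month")  # $10,800
print(f"Ratio: {stream / batch:.1f}x")       # 8.4x
```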


29.5 Knowledge Checks

The mistake: Scheduling hourly or daily batch jobs to process data that requires real-time or near-real-time responses, causing missed SLAs and frustrated users.

Symptoms:

  • Users complain that dashboards show “stale” data (hours old)
  • Alerts for critical events arrive long after the incident
  • Fraud detection catches issues after damage is done
  • Business requests for “real-time” lead to increasingly frequent batch jobs (hourly -> every 15 min -> every 5 min)

Why it happens: Batch processing is familiar (SQL queries, scheduled cron jobs) and cheaper. Teams underestimate latency requirements or don’t distinguish between “nice to have real-time” and “must have real-time.” Stream processing seems complex and expensive.

The fix: Classify your use cases by actual latency requirements:

# Use Case Classification Framework
latency_requirements = {
    # BATCH OK (> 1 hour latency acceptable)
    "monthly_reports": "batch",
    "historical_analysis": "batch",
    "ML_model_training": "batch",

    # STREAM REQUIRED (< 1 minute latency needed)
    "fraud_detection": "stream",
    "equipment_alerts": "stream",
    "live_dashboards": "stream",
    "surge_pricing": "stream",

    # HYBRID (Lambda architecture)
    "inventory_tracking": "lambda",
    "user_analytics": "lambda"
}

Prevention: During requirements gathering, explicitly ask: “If this data is 1 hour old, what’s the business impact?” If the answer involves money loss, safety risk, or customer complaints, you need stream processing.

Imagine you have two helpers sorting your mail – one is super careful but slow, and the other is lightning-fast but sometimes makes mistakes!

29.5.1 The Sensor Squad Adventure: The Two Mail Sorters

Sammy the Sensor was getting buried in letters (data!). “I’m sending SO many temperature readings every second, nobody can keep up!” she cried.

Max the Microcontroller had a clever idea. “Let’s hire TWO helpers! Speedy Steve will quickly glance at each letter as it arrives and shout out the important stuff right away – like if the temperature is dangerously hot. Meanwhile, Careful Clara will take ALL the letters at the end of the day, sort them perfectly, and make beautiful reports.”

Lila the LED asked, “But what if Speedy Steve misses something?” Max smiled. “That’s the beauty of it! Clara checks everything again later and fixes any mistakes. So we get fast warnings AND perfect reports!”

Bella the Battery added, “And we use ‘windows’ – like sorting letters into hourly piles instead of looking at every single one. It’s way easier to count 24 piles than a million letters!”

The team also learned about “watermarks” – a rule that says, “Wait 2 extra minutes for any late letters before closing the pile.” That way, even if a letter gets delayed in the mail, it still gets counted!

29.5.2 Key Words for Kids

Word | What It Means
Pipeline | A set of steps data travels through, like a water slide with different sections
Batch | Processing a big pile of data all at once, like doing all your homework after school
Stream | Processing data as it arrives, like answering questions during class
Window | A time bucket that groups data together, like sorting mail into hourly piles
Watermark | A rule for how long to wait for late data before moving on

29.6 Worked Example: Smart Grid Pipeline for 2 Million Meters

Worked Example: Designing a Lambda Pipeline for Utility Smart Meters

Scenario: Enel, one of Europe’s largest utilities, deploys 2 million smart meters across northern Italy. Meters report consumption every 15 minutes. The utility needs both real-time demand monitoring (for grid balancing) and accurate monthly billing (for regulatory compliance).

Given:

  • 2,000,000 smart meters, 1 reading per 15 minutes = 192 million readings/day
  • Each reading: 120 bytes (meter ID, timestamp, kWh, voltage, power factor, reactive power)
  • Daily raw data: 192M x 120 bytes = 23 GB/day = 8.4 TB/year
  • Real-time requirement: Grid operators need aggregated demand per substation every 30 seconds
  • Billing requirement: 100% accurate monthly totals per meter, auditable

Step 1 – Design the speed layer (real-time demand):

  • Ingestion: Apache Kafka cluster (3 brokers), meters publish via MQTT bridge
  • Stream processing: Apache Flink job with 30-second tumbling windows
  • Aggregation: Group by substation (500 substations, ~4,000 meters each)
  • Output: Real-time dashboard showing per-substation demand in MW
  • Latency: Meter reading to dashboard update in under 5 seconds
  • Late data policy: Watermark of 60 seconds (meters occasionally report with network delay)

Processing load: 192M readings / 86,400 seconds = 2,222 readings/second average, with morning peak at 4x = ~9,000 readings/second.
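The ingest rates quoted here follow directly from the given numbers; a quick sketch:

```python
# Sanity-check of the smart-grid ingest rates (inputs from the worked example)
METERS = 2_000_000
READINGS_PER_DAY = METERS * (24 * 60 // 15)   # one reading per 15 min -> 192M/day

avg_rate = READINGS_PER_DAY / 86_400          # ~2,222 readings/second
peak_rate = avg_rate * 4                      # morning peak at 4x -> ~9,000/second
daily_gb = READINGS_PER_DAY * 120 / 1e9       # 120-byte readings -> ~23 GB/day

print(f"{avg_rate:.0f} r/s avg, {peak_rate:.0f} r/s peak, {daily_gb:.1f} GB/day")
```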

Step 2 – Design the batch layer (billing accuracy):

  • Storage: Apache Parquet files on HDFS, partitioned by date and meter region
  • Processing: Apache Spark job runs nightly at 02:00 (low-demand period)
  • Computation: Sum kWh per meter per billing period, apply time-of-use tariffs, compute reactive power charges
  • Validation: Cross-check totals against speed layer aggregates (must match within 0.01%)
  • Output: Per-meter billing records, substation loss calculations, regulatory reports

Nightly batch processes 23 GB in approximately 12 minutes on a 20-node Spark cluster.

Step 3 – Cost comparison of three architectures:

Architecture | Infrastructure | Monthly Cost | Real-Time? | Billing Accurate?
Batch only (Spark nightly) | 20-node HDFS + Spark | EUR 3,200 | No (24h delay) | Yes
Stream only (Flink 24/7) | 8-node Kafka + Flink | EUR 5,800 | Yes (<5s) | No (late data causes 0.3% error)
Lambda (Spark + Flink) | 20-node HDFS + 8-node Kafka/Flink | EUR 6,500 | Yes (<5s) | Yes (batch corrects errors)

Step 4 – Quantify the value of real-time:

Without real-time demand visibility, grid operators must maintain 15% spinning reserve (backup generators running idle). With 30-second demand visibility:

  • Spinning reserve reduced to 8% (confident in demand forecasts)
  • Northern Italy peak demand: 12 GW
  • Reserve reduction: 12 GW x (15% - 8%) = 840 MW less spinning reserve
  • Gas turbine cost for spinning reserve: EUR 30/MWh
  • Daily saving: 840 MW x 24h x EUR 30/MWh x 30% utilization = EUR 181,000/day
  • Monthly saving: EUR 5.4 million
  • Pipeline cost: EUR 6,500/month
  • ROI: 830:1
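The savings estimate can be recomputed from the stated inputs; the exact ROI lands near the rounded 830:1 figure:

```python
# Reproduces the spinning-reserve saving estimate (inputs from the text)
peak_mw = 12 * 1_000                                  # northern Italy peak: 12 GW
reserve_cut_mw = peak_mw * (0.15 - 0.08)              # 840 MW less spinning reserve

daily_saving = reserve_cut_mw * 24 * 30 * 0.30        # MWh/day x EUR 30/MWh x 30% utilisation
monthly_saving = daily_saving * 30                    # ~EUR 5.4 million

print(f"{reserve_cut_mw:.0f} MW freed")
print(f"EUR {daily_saving:,.0f}/day")                 # ~EUR 181,000/day as in the text
print(f"ROI ~{monthly_saving / 6_500:.0f}:1")         # text rounds via EUR 5.4M to 830:1
```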

Result: The Lambda pipeline costs EUR 6,500/month and enables EUR 5.4 million/month in grid optimization savings. The batch layer ensures 100% billing accuracy (a regulatory requirement); the speed layer provides 30-second grid visibility that cuts the spinning reserve requirement from 15% to 8%, a reduction of nearly half.

Key Insight: For utilities, the Lambda architecture is not optional – regulatory billing accuracy demands batch processing, while grid stability demands real-time stream processing. The incremental cost of Lambda over batch-only (EUR 3,300/month) is trivial compared to the value of real-time grid visibility.

Common Pitfalls

Missing schema validation: Without schema enforcement at the entry point, malformed sensor payloads propagate into storage and corrupt downstream analytics. Always validate and reject non-conforming messages at ingestion, routing rejects to a dead-letter queue.

Blocking pipeline stages: Sensors transmit at irregular intervals and with variable latency. Synchronous pipeline stages that block waiting for data will accumulate queue backlogs. Design all stages as asynchronous, event-driven consumers.

Silent data loss: A pipeline that silently drops 5% of sensor messages will produce systematically biased analytics without any visible error. Instrument every stage with message count, lag, and error rate metrics, and set alerts on anomalous drops.

Untested failure recovery: IoT pipelines run 24/7 with hardware failures, network partitions, and data bursts. Test failure recovery explicitly: kill a stage mid-batch and verify the pipeline resumes correctly from the last checkpoint.

29.7 Summary

  • Lambda architecture combines batch processing (accurate historical analysis) and stream processing (real-time approximate), merging results in a serving layer for low-latency queries with eventual accuracy.
  • Windowing strategies divide infinite streams into processable chunks: tumbling windows for periodic totals, sliding windows for moving averages, and session windows for user activity tracking.
  • Watermarks define how long to wait for late-arriving data - higher tolerance means more complete data but delayed results; lower tolerance means faster results but potential data loss.
  • Cost comparison: Batch-only processing costs $1,290/month for 1 TB/day; stream-only costs $10,800/month (8.4x more); Lambda hybrid provides the best value at $1,500/month with both real-time and historical capabilities.

Implement a simplified Lambda architecture using Python to see how batch and speed layers work together.

from collections import deque

# ============ SPEED LAYER (Real-time approximate) ============
class SpeedLayer:
    def __init__(self, window_size=60):
        self.window = deque(maxlen=window_size)  # Keep last 60 readings
        self.watermark_delay = 5  # Wait 5 seconds for late data

    def process_event(self, event_time, value, arrival_delay):
        """Process event in real-time, may miss late arrivals"""
        if arrival_delay <= self.watermark_delay:
            self.window.append(value)
            return self.get_average()
        else:
            print(f"  [SPEED] Late event dropped: {value} (arrived {arrival_delay}s late)")
            return None

    def get_average(self):
        if not self.window:
            return 0
        return sum(self.window) / len(self.window)

# ============ BATCH LAYER (Accurate historical) ============
class BatchLayer:
    def __init__(self):
        self.all_data = []

    def accumulate(self, timestamp, value):
        """Store all data, even late arrivals"""
        self.all_data.append((timestamp, value))

    def recompute_all(self):
        """Nightly batch job: recompute exact historical average"""
        if not self.all_data:
            return 0
        total = sum(val for _, val in self.all_data)
        return total / len(self.all_data)

# ============ SERVING LAYER (Merge results) ============
class ServingLayer:
    def __init__(self, speed_layer, batch_layer):
        self.speed = speed_layer
        self.batch = batch_layer
        self.last_batch_run = 0

    def query_average(self, use_batch=False):
        """Serve from speed layer for recent, batch for historical"""
        if use_batch:
            return self.batch.recompute_all()
        else:
            return self.speed.get_average()

# ============ SIMULATION ============
speed = SpeedLayer(window_size=10)
batch = BatchLayer()
serving = ServingLayer(speed, batch)

print("=== Simulating IoT Temperature Sensor Stream ===\n")
print("Events arrive out of order, some late...\n")

# (event_time, value, arrival_delay_seconds)
events = [
    (1000, 22.0, 0),    # On-time events
    (1001, 23.5, 0),
    (1002, 24.0, 0),
    (1003, 22.5, 0),
    (1004, 23.0, 0),
    (999, 21.5, 7),     # LATE arrival (7 seconds late)
    (1005, 24.5, 0),
    (1006, 23.8, 0),
    (998, 21.0, 10),    # VERY late (10 seconds late)
]

for event_time, value, delay in events:
    # Both layers receive event
    batch.accumulate(event_time, value)
    speed_avg = speed.process_event(event_time, value, delay)

    if speed_avg is not None:
        print(f"Event: {value}°C | Speed layer avg: {speed_avg:.2f}°C (approximate)")

print("\n--- Nightly Batch Job Runs ---")
batch_avg = batch.recompute_all()
print(f"Batch layer avg: {batch_avg:.2f}°C (exact, includes late data)\n")

print("=== Query Results ===")
print(f"Real-time query (speed layer): {serving.query_average(use_batch=False):.2f}°C")
print(f"Historical query (batch layer): {serving.query_average(use_batch=True):.2f}°C")
print(f"\nDifference: {abs(serving.query_average(True) - serving.query_average(False)):.2f}°C")
print("  (Speed layer missed 2 late events, batch layer caught them)")

What to Observe:

  1. Speed Layer Behavior: Late events (>5 seconds) are dropped for speed - approximate average computed instantly
  2. Batch Layer Accuracy: All events stored, including late arrivals - exact average after recomputation
  3. Serving Layer Trade-off: Real-time queries use approximate speed layer; historical queries use exact batch layer
  4. Self-Correction: After batch run, historical queries show the corrected average that includes late data

Extend It:

  • Add windowing (hourly aggregates instead of all-time average)
  • Implement watermark adjustment (make late tolerance configurable)
  • Simulate Kafka topics for speed layer input
  • Add a third “view” for yesterday’s data (always serve from batch)
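The first “Extend It” item (hourly windowing) could start from a sketch like this, which buckets (timestamp, value) records, like those BatchLayer stores, into tumbling one-hour windows:

```python
from collections import defaultdict

# Sketch of the hourly-windowing extension: tumbling 1-hour buckets over
# (timestamp, value) pairs (names here are illustrative)
def hourly_averages(records):
    buckets = defaultdict(list)
    for ts, value in records:
        buckets[ts // 3600].append(value)   # integer division = tumbling bucket key
    return {hour: sum(vals) / len(vals) for hour, vals in buckets.items()}

readings = [(0, 20.0), (1800, 22.0), (3600, 25.0), (5400, 27.0)]
print(hourly_averages(readings))  # {0: 21.0, 1: 26.0}
```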

Big Data Pipelines connects to:

  • Upstream Dependencies: Data ingestion (Kafka) feeds both batch and stream pipelines
  • Downstream Applications: Pipelines power dashboards (speed layer) and ML training (batch layer)
  • Alternative Patterns: Kappa architecture (stream-only, no batch layer) vs Lambda (both)
  • Foundational Concepts: Windowing and watermarks from stream processing theory

Decision Tree:

Need real-time results?
  ├─ Yes + Historical accuracy matters → Lambda (batch + stream)
  ├─ Yes + Can tolerate approximate → Kappa (stream-only)
  └─ No → Batch-only (Spark scheduled jobs)

Key Insight: Lambda architecture is the “best of both worlds” but at 1.5-2x infrastructure cost. Use it when business requires both real-time alerts AND accurate reports.

External Resources:

  • Nathan Marz and James Warren, “Big Data” (2015) - the book that introduced the Lambda architecture
  • Confluent Stream Processing Guide - Kafka Streams patterns
  • Tyler Akidau, “Streaming Systems” (2018) - Windowing and late data handling

29.8 What’s Next

If you want to… | Read this
Understand the technologies powering these pipelines | Big Data Technologies
Apply edge processing to reduce pipeline input volume | Big Data Edge Processing
Explore real-world pipeline implementations | Big Data Case Studies
Learn anomaly detection within pipelines | Anomaly Detection Overview
Return to the module overview | Big Data Overview