1261  Big Data Pipelines

Learning Objectives

After completing this chapter, you will be able to:

  • Design Lambda architecture combining batch and stream processing
  • Implement stream processing with windowing strategies
  • Handle late-arriving data with watermarks
  • Choose between batch, stream, and hybrid processing approaches

1261.1 Lambda Architecture

The Lambda Architecture addresses a fundamental tension: real-time analytics demands low latency, while accurate historical analysis demands complete data. You need both, but their requirements conflict.

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ecf0f1'}}}%%
flowchart TB
    subgraph Input["Data Ingestion"]
        Sensors[IoT Sensors<br/>Real-Time Data]
    end

    subgraph Batch["Batch Layer (Accuracy)"]
        HDFS[(HDFS<br/>Master Dataset)]
        Spark[Spark Batch<br/>Complete Recompute]
        BatchViews[(Batch Views<br/>Historical Truth)]
    end

    subgraph Speed["Speed Layer (Latency)"]
        Kafka[Kafka<br/>Stream Buffer]
        SparkStream[Spark Streaming<br/>Incremental Updates]
        RealTime[(Real-Time Views<br/>Approximate)]
    end

    subgraph Serving["Serving Layer"]
        Merge[View Merger<br/>Combine Batch + Speed]
        Query[Query Results<br/>Low Latency + Accurate]
    end

    Sensors --> HDFS
    Sensors --> Kafka
    HDFS --> Spark
    Spark --> BatchViews
    Kafka --> SparkStream
    SparkStream --> RealTime
    BatchViews --> Merge
    RealTime --> Merge
    Merge --> Query

    style Sensors fill:#2C3E50,stroke:#16A085,color:#fff
    style HDFS fill:#16A085,stroke:#2C3E50,color:#fff
    style Spark fill:#2C3E50,stroke:#16A085,color:#fff
    style BatchViews fill:#16A085,stroke:#2C3E50,color:#fff
    style Kafka fill:#E67E22,stroke:#2C3E50,color:#fff
    style SparkStream fill:#E67E22,stroke:#2C3E50,color:#fff
    style RealTime fill:#E67E22,stroke:#2C3E50,color:#fff
    style Merge fill:#7F8C8D,stroke:#2C3E50,color:#fff
    style Query fill:#27AE60,stroke:#2C3E50,color:#fff

Figure 1261.1: Lambda Architecture: Batch Layer for Accuracy, Speed Layer for Latency

Lambda Architecture: IoT data flows to both the batch layer (accurate historical analysis) and the speed layer (real-time, approximate); the serving layer merges the two to provide low-latency queries with eventual accuracy.

1261.1.1 Why Lambda Architecture?

Requirement | Batch Only | Stream Only | Lambda (Both)
Latency | Hours | Milliseconds | Milliseconds
Accuracy | 100% correct | May miss late data | 100% eventually
Cost | Low | High | Medium
Complexity | Simple | Medium | High

1261.1.2 Lambda Implementation Example

# Assumes an active SparkSession named `spark`
from datetime import date

from pyspark.sql.functions import col, sum as sum_, window

# BATCH LAYER: Complete recompute (accurate, slow)
def batch_layer():
    # Read entire historical dataset
    all_sensor_data = spark.read.parquet("s3://iot-data/raw/")

    # Full aggregation (correct answer)
    daily_totals = all_sensor_data \
        .groupBy("date", "sensor_id") \
        .agg(sum_("energy_kwh").alias("total_energy"))

    # Write to batch views (replaces previous)
    daily_totals.write.mode("overwrite") \
        .parquet("s3://iot-data/batch-views/daily-energy/")

# SPEED LAYER: Incremental updates (approximate, fast)
def speed_layer():
    # Read only recent data from Kafka (the raw value must still be parsed
    # into columns such as timestamp, sensor_id, energy_kwh before aggregating)
    streaming_data = spark.readStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", "broker:9092") \
        .option("subscribe", "sensor-readings") \
        .load()

    # Incremental aggregation (approximate - may miss late data)
    hourly_totals = streaming_data \
        .withWatermark("timestamp", "10 minutes") \
        .groupBy(window("timestamp", "1 hour"), "sensor_id") \
        .agg(sum_("energy_kwh").alias("total_energy"))

    # Write to real-time views (streaming sinks need a checkpoint location)
    hourly_totals.writeStream \
        .outputMode("update") \
        .format("delta") \
        .option("checkpointLocation", "s3://iot-data/checkpoints/hourly-energy/") \
        .start("s3://iot-data/speed-views/hourly-energy/")

# SERVING LAYER: Route queries to batch or speed views
def serving_layer(query_date):
    if query_date < date.today():
        # Historical: Use accurate batch view
        return spark.read.parquet("s3://iot-data/batch-views/daily-energy/") \
            .filter(col("date") == query_date)
    else:
        # Current: Use speed layer (approximate)
        return spark.read.format("delta") \
            .load("s3://iot-data/speed-views/hourly-energy/")
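
The serving layer above routes each query to a single view. A true Lambda serving layer merges both: finalized batch views for history plus speed views for the data the batch job has not yet recomputed. Below is a minimal sketch under that assumption, reusing the paths above and rolling the hourly speed view up to the batch view's daily grain; the cutoff logic and column names are illustrative.

from datetime import date

from pyspark.sql.functions import col, sum as sum_, to_date

def merged_view():
    # Hypothetical cutoff: everything before today comes from the batch layer
    cutoff = date.today().isoformat()

    # Batch view: fully recomputed daily totals per sensor
    batch_view = spark.read.parquet("s3://iot-data/batch-views/daily-energy/") \
        .filter(col("date") < cutoff) \
        .select("date", "sensor_id", "total_energy")

    # Speed view: hourly windows for the not-yet-batched current day,
    # rolled up to the same (date, sensor_id) grain as the batch view
    speed_view = spark.read.format("delta") \
        .load("s3://iot-data/speed-views/hourly-energy/") \
        .withColumn("date", to_date(col("window.start"))) \
        .filter(col("date") >= cutoff) \
        .groupBy("date", "sensor_id") \
        .agg(sum_("total_energy").alias("total_energy"))

    # Identical schemas on both sides, so a union gives the combined view
    return batch_view.unionByName(speed_view)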

1261.2 Stream Processing with Windowing

IoT data arrives continuously - you can’t wait for “all” data before computing. Windowing divides infinite streams into finite, processable chunks.

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ecf0f1'}}}%%
flowchart TB
    subgraph Tumbling["Tumbling Windows (Non-Overlapping)"]
        T1["[0-60s]<br/>avg: 72F"] --> T2["[60-120s]<br/>avg: 74F"]
        T2 --> T3["[120-180s]<br/>avg: 73F"]
    end

    subgraph Sliding["Sliding Windows (Overlapping)"]
        S1["[0-60s]<br/>avg: 72F"]
        S2["[30-90s]<br/>avg: 73F"]
        S3["[60-120s]<br/>avg: 74F"]
        S4["[90-150s]<br/>avg: 73.5F"]
        S1 --> S2 --> S3 --> S4
    end

    subgraph Session["Session Windows (Gap-Based)"]
        SE1["[0-45s]<br/>User active"]
        SE2["[gap > 30s]"]
        SE3["[90-150s]<br/>User active"]
        SE1 --> SE2 --> SE3
    end

    style T1 fill:#2C3E50,stroke:#16A085,color:#fff
    style T2 fill:#2C3E50,stroke:#16A085,color:#fff
    style T3 fill:#2C3E50,stroke:#16A085,color:#fff
    style S1 fill:#16A085,stroke:#2C3E50,color:#fff
    style S2 fill:#16A085,stroke:#2C3E50,color:#fff
    style S3 fill:#16A085,stroke:#2C3E50,color:#fff
    style S4 fill:#16A085,stroke:#2C3E50,color:#fff
    style SE1 fill:#E67E22,stroke:#2C3E50,color:#fff
    style SE2 fill:#7F8C8D,stroke:#2C3E50,color:#fff
    style SE3 fill:#E67E22,stroke:#2C3E50,color:#fff

Figure 1261.2: Three Window Types: Tumbling, Sliding, and Session

1261.2.1 Window Type Comparison

Window Type | Best For | Example
Tumbling | Periodic aggregates (hourly/daily totals) | “Total energy per hour”
Sliding | Moving averages, trend detection | “5-minute average, updated every minute”
Session | User activity, event sequences | “Group clicks until 30s inactivity”

1261.2.2 Implementing Windowed Aggregation

from pyspark.sql.functions import avg, count, session_window, window

# Tumbling window: Non-overlapping 1-minute buckets
tumbling_result = sensor_stream \
    .groupBy(window("timestamp", "1 minute"), "sensor_id") \
    .agg(avg("temperature").alias("avg_temp"))

# Sliding window: 5-minute window, slides every 1 minute
sliding_result = sensor_stream \
    .groupBy(window("timestamp", "5 minutes", "1 minute"), "sensor_id") \
    .agg(avg("temperature").alias("avg_temp"))

# Session window: Group events separated by less than a 30-second gap
# (session_window requires Spark 3.2+)
session_result = sensor_stream \
    .groupBy(session_window("timestamp", "30 seconds"), "user_id") \
    .agg(count("*").alias("events_in_session"))
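
A quick way to inspect any of these aggregations during development is to start the query against the console sink; the sink choice and trigger interval below are illustrative, not part of a production pipeline.

# Print each micro-batch of the tumbling aggregation to stdout for inspection
query = tumbling_result.writeStream \
    .outputMode("update") \
    .format("console") \
    .option("truncate", "false") \
    .trigger(processingTime="30 seconds") \
    .start()

query.awaitTermination()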

1261.3 Handling Late-Arriving Data

IoT networks are imperfect - data arrives late due to network delays, sensor buffering, or connectivity issues. Watermarks define how long to wait for late data.

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ecf0f1'}}}%%
sequenceDiagram
    participant S as Sensor
    participant K as Kafka
    participant P as Processor
    participant O as Output

    Note over S,O: Event Time: 10:00:00

    S->>K: Reading @ 10:00:00
    K->>P: Delivered @ 10:00:02 (2s delay)
    P->>O: Processed in 10:00 window

    Note over S,O: Late Event Scenario

    S->>K: Reading @ 10:00:30
    Note right of K: Network issue causes 90s delay
    K->>P: Delivered @ 10:02:00 (90s late!)

    alt Watermark = 60s
        P->>P: Window 10:00-10:01 CLOSED at 10:02:00
        P-->>O: Event DROPPED (too late)
    else Watermark = 120s
        P->>P: Window still OPEN until 10:03:00
        P->>O: Event INCLUDED in window
    end

Figure 1261.3: Watermark Handling: Late Events with Different Tolerance Settings

Watermark Strategy for Late Events: Events arriving after the watermark tolerance are dropped; choosing an appropriate watermark balances completeness (higher tolerance) against latency (lower tolerance).

1261.3.1 Watermark Configuration

from pyspark.sql.functions import avg, window

# Configure watermark: Accept events up to 2 minutes late
sensor_stream_with_watermark = sensor_stream \
    .withWatermark("event_timestamp", "2 minutes") \
    .groupBy(
        window("event_timestamp", "1 minute"),
        "sensor_id"
    ) \
    .agg(avg("temperature").alias("avg_temp"))

# Watermark tradeoffs:
# - 10 seconds: Fast results, may miss delayed sensors
# - 2 minutes: More complete, but delayed output
# - 1 hour: Very complete, but results arrive 1 hour late
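
To tune the tolerance, watch how many rows the watermark actually drops. A minimal monitoring sketch, assuming the aggregation above is running as a streaming query bound to the name query and a Spark 3.1+ runtime (which reports numRowsDroppedByWatermark in its progress metrics):

import time

# Assumes `query` is the running StreamingQuery for the aggregation above
while query.isActive:
    progress = query.lastProgress  # most recent micro-batch progress report
    if progress:
        watermark = progress.get("eventTime", {}).get("watermark")
        dropped = sum(
            op.get("numRowsDroppedByWatermark", 0)
            for op in progress.get("stateOperators", [])
        )
        print(f"watermark={watermark} dropped_by_watermark={dropped}")
    time.sleep(60)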

1261.4 Understanding Check: Batch vs Stream Processing

Scenario: A smart city has 5,000 parking sensors that send occupancy status (occupied/free) whenever it changes. During rush hour, this generates 50,000 updates/hour. The city needs:

  1. A real-time parking availability map (updated within 5 seconds)
  2. Daily parking utilization reports for urban planning
  3. Monthly revenue tracking for parking meters

Think about: Which processing model should you use for each requirement?

Key Insights:

  1. Real-time Parking Map - Stream Processing
    • Why: Users need immediate information (“Is parking available NOW?”)
    • Latency: 5 seconds acceptable
    • Volume: 50,000 updates/hour = 13.9 updates/second (manageable for streaming)
    • Technology: Apache Kafka + Spark Streaming
    • Real number: Stream processing delivers results in 2-5 seconds vs 1-24 hours for batch (a minimal streaming sketch follows this list)
  2. Daily Utilization Reports - Batch Processing
    • Why: Historical analysis doesn’t need real-time processing
    • Latency: 24 hours acceptable (run nightly)
    • Volume: Process full day’s data in one job
    • Technology: Apache Spark batch job scheduled at midnight
    • Real number: Batch processing is 10x cheaper than maintaining real-time streaming for historical queries
  3. Monthly Revenue - Batch Processing
    • Why: Aggregating 30 days of data, accuracy > speed
    • Latency: Can run monthly (1st of each month)
    • Volume: 50,000 events/hour x 720 hours/month = 36 million events
    • Technology: SQL query on data warehouse
    • Real number: Monthly batch job costs $5 vs $150/month for continuous stream processing
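
For the real-time parking map, here is a minimal Structured Streaming sketch. It assumes a hypothetical Kafka topic parking-events whose JSON messages carry sensor_id, status, and event_time, and it keeps only the latest status per sensor so a dashboard can poll the result (max_by requires Spark 3.3+; the memory sink is for illustration only).

from pyspark.sql.functions import col, from_json, max_by
from pyspark.sql.types import StringType, StructField, StructType, TimestampType

schema = StructType([
    StructField("sensor_id", StringType()),
    StructField("status", StringType()),        # "occupied" / "free"
    StructField("event_time", TimestampType()),
])

parking_events = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "broker:9092") \
    .option("subscribe", "parking-events") \
    .load() \
    .select(from_json(col("value").cast("string"), schema).alias("e")) \
    .select("e.*")

# Keep the most recent status per sensor (the "available now" answer)
latest_status = parking_events \
    .groupBy("sensor_id") \
    .agg(max_by("status", "event_time").alias("status"))

# Expose as an in-memory table the map frontend can poll every few seconds
latest_status.writeStream \
    .outputMode("complete") \
    .format("memory") \
    .queryName("parking_map") \
    .start()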

Cost Comparison:

Approach | Infrastructure | Cost/Month | Use Case
Stream (All 3) | Kafka + Spark Streaming 24/7 | $500 | Overkill for batch needs
Batch (All 3) | Nightly Spark jobs | $50 | Too slow for real-time map
Hybrid (Stream + Batch) | Streaming for map, batch for reports | $200 | Optimal cost/performance

Decision Rule:

Use STREAM processing when:
- Latency requirement < 1 minute
- Results needed continuously
- Data arrives in real-time
- Immediate action required (alerts, dashboards)

Use BATCH processing when:
- Latency requirement > 1 hour
- Results needed periodically (daily/weekly/monthly)
- Processing full datasets for accuracy
- Complex analytics on historical data

Use LAMBDA (Both) when:
- Need real-time + historical views
- Real-time alerts + periodic reports
- Different latency needs for different queries

1261.5 Batch vs Stream Processing Decision

Factor | Use Batch | Use Stream | Use Lambda (Both)
Latency Need | > 1 hour | < 1 minute | Mixed (real-time + reports)
Data Volume | TB-PB scale | MB-GB/sec | Both
Query Pattern | Historical trends | Live dashboards | Real-time + historical
Cost (1 TB/day) | ~$1,290/month | ~$10,800/month | ~$1,500/month (hybrid)
Example | Monthly reports | Fraud detection | E-commerce analytics

1261.5.1 Cost Breakdown Example (1 TB/day processing)

Batch Only:

Storage: 30 TB/month x $0.023/GB = $690/month
Compute: 1 daily Spark job x 100 nodes x 2 hours x $0.10/hour = $20/day = $600/month
Total: $1,290/month

Stream Only:

Kafka: 3 brokers x $200/month = $600/month
Spark Streaming: 50 workers x 24/7 x $0.20/hour = $7,200/month
Storage (30 days): 30 TB x $0.10/GB = $3,000/month
Total: $10,800/month (8.4x more expensive!)

Lambda Architecture (Hybrid):

Stream Layer: Last 24 hours (real-time alerts) = $300/month
Batch Layer: Historical (daily aggregates) = $1,000/month
Serving Layer: Query merging = $200/month
Total: $1,500/month
Best of both: Real-time insights + historical accuracy
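
The arithmetic above can be captured in a small cost model; the unit prices are the same illustrative figures used in the breakdown, not vendor quotes.

# Illustrative cost model for processing 1 TB/day, reusing the unit prices above
STORAGE_PER_GB = 0.023        # $/GB-month, batch object storage
HOT_STORAGE_PER_GB = 0.10     # $/GB-month, stream-oriented storage
NODE_HOUR = 0.10              # $/node-hour, batch compute
STREAM_WORKER_HOUR = 0.20     # $/worker-hour, always-on streaming compute
KAFKA_BROKER_MONTH = 200      # $/broker-month

def batch_monthly_cost(tb_per_day=1, nodes=100, hours_per_job=2):
    storage = tb_per_day * 30 * 1000 * STORAGE_PER_GB        # 30 TB -> $690
    compute = nodes * hours_per_job * NODE_HOUR * 30         # $20/day -> $600
    return storage + compute                                  # $1,290

def stream_monthly_cost(tb_per_day=1, workers=50, brokers=3):
    kafka = brokers * KAFKA_BROKER_MONTH                      # $600
    compute = workers * 24 * 30 * STREAM_WORKER_HOUR          # $7,200
    storage = tb_per_day * 30 * 1000 * HOT_STORAGE_PER_GB     # $3,000
    return kafka + compute + storage                          # $10,800

print(batch_monthly_cost(), stream_monthly_cost())  # 1290.0 10800.0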

1261.6 Knowledge Checks

Question: A smart meter system uses a watermark of 2 minutes for handling late-arriving data. An event with event_time=12:00:00 arrives at processing_time=12:02:30. The current watermark is at 12:02:00. What happens to this event?

Explanation: Watermarks define when windows close: a window is finalized once the watermark passes its end. Here the event’s event time (12:00:00) is already behind the current watermark (12:02:00), so any window containing it has been finalized and the event is dropped, even though it only arrived at 12:02:30. In production, track late-event and dropped-row metrics to tune watermark settings.

Question: A data engineer is choosing between tumbling windows (60s, non-overlapping) and sliding windows (60s window, 30s slide) for IoT temperature monitoring. The sliding window approach will produce more window outputs for the same data stream. What is the primary advantage that justifies this increased computational cost?

Explanation: Sliding windows update more frequently (every 30s vs 60s for tumbling), creating smoother visualizations on dashboards. Each event is processed in multiple overlapping windows. Use sliding when user experience benefits from gradual trend changes; use tumbling when cost efficiency matters.

Question: A smart building system processes HVAC sensor data. The data quality framework flags 20% of readings as “INVALID_TEMPERATURE” (out of range -40 to 85C) and writes them to a quarantine zone. Six months later, engineers discover the sensors were correctly measuring steam pipe temperatures (120C), but the validation rule was wrong. What data pipeline principle was violated?

Explanation: Data lineage means tracking data from source through transformations. When validation rules change, you should be able to reprocess quarantined data with corrected rules. The fix: preserve raw data with metadata (rule_version, quarantine_reason, timestamp) to enable reprocessing without data loss.
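
A minimal sketch of that fix, using hypothetical paths and column names: invalid readings are quarantined together with the full raw row plus the metadata needed to replay them once the rule is corrected.

from pyspark.sql.functions import col, current_timestamp, lit

RULE_VERSION = "temp_range_v1"   # hypothetical identifier for the active rule

readings = spark.read.parquet("s3://hvac/raw-readings/")   # hypothetical path

valid_range = (col("temperature_c") >= -40) & (col("temperature_c") <= 85)

# Quarantine keeps the complete raw record plus reprocessing metadata
quarantined = readings.filter(~valid_range) \
    .withColumn("quarantine_reason", lit("INVALID_TEMPERATURE")) \
    .withColumn("rule_version", lit(RULE_VERSION)) \
    .withColumn("quarantined_at", current_timestamp())

quarantined.write.mode("append").parquet("s3://hvac/quarantine/")
readings.filter(valid_range).write.mode("append").parquet("s3://hvac/validated/")

# When the rule is later corrected (e.g., steam pipes legitimately read 120C),
# the quarantine zone can be re-read and re-validated with no data loss.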

1261.7 Common Pitfall: Using Batch for Real-time Requirements

Caution (Pitfall): Using Batch Processing for Real-time Requirements

The mistake: Scheduling hourly or daily batch jobs to process data that requires real-time or near-real-time responses, causing missed SLAs and frustrated users.

Symptoms:

  • Users complain that dashboards show “stale” data (hours old)
  • Alerts for critical events arrive long after the incident
  • Fraud detection catches issues after damage is done
  • Business requests for “real-time” lead to increasingly frequent batch jobs (hourly -> every 15 min -> every 5 min)

Why it happens: Batch processing is familiar (SQL queries, scheduled cron jobs) and cheaper. Teams underestimate latency requirements or don’t distinguish between “nice to have real-time” and “must have real-time.” Stream processing seems complex and expensive.

The fix: Classify your use cases by actual latency requirements:

# Use Case Classification Framework
latency_requirements = {
    # BATCH OK (> 1 hour latency acceptable)
    "monthly_reports": "batch",
    "historical_analysis": "batch",
    "ML_model_training": "batch",

    # STREAM REQUIRED (< 1 minute latency needed)
    "fraud_detection": "stream",
    "equipment_alerts": "stream",
    "live_dashboards": "stream",
    "surge_pricing": "stream",

    # HYBRID (Lambda architecture)
    "inventory_tracking": "lambda",
    "user_analytics": "lambda"
}

Prevention: During requirements gathering, explicitly ask: “If this data is 1 hour old, what’s the business impact?” If the answer involves money loss, safety risk, or customer complaints, you need stream processing.

1261.8 Summary

  • Lambda architecture combines batch processing (accurate historical analysis) and stream processing (real-time approximate), merging results in a serving layer for low-latency queries with eventual accuracy.
  • Windowing strategies divide infinite streams into processable chunks: tumbling windows for periodic totals, sliding windows for moving averages, and session windows for user activity tracking.
  • Watermarks define how long to wait for late-arriving data - higher tolerance means more complete data but delayed results; lower tolerance means faster results but potential data loss.
  • Cost comparison: Batch-only processing costs $1,290/month for 1 TB/day; stream-only costs $10,800/month (8x more); Lambda hybrid provides best value at $1,500/month with both real-time and historical capabilities.

1261.9 What’s Next

Now that you understand pipeline architectures, continue to: