1262  Big Data Technologies

Learning Objectives

After completing this chapter, you will be able to:

  • Understand the Apache Hadoop ecosystem and its components
  • Differentiate between batch processing (Spark) and stream processing (Kafka, Flink)
  • Select appropriate time-series databases for IoT workloads
  • Design technology stacks for specific IoT scenarios

1262.1 Apache Hadoop Ecosystem

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ecf0f1', 'background': '#ffffff', 'mainBkg': '#2C3E50', 'secondBkg': '#16A085', 'tertiaryBkg': '#E67E22'}}}%%
graph TB
    subgraph Applications["Application Layer"]
        App1[Hive<br/>SQL Queries]
        App2[Pig<br/>Scripting]
        App3[Mahout<br/>Machine Learning]
    end

    subgraph Processing["Processing Layer"]
        Spark[Apache Spark<br/>In-Memory Processing]
        MR[MapReduce<br/>Batch Processing]
    end

    subgraph Resource["Resource Management"]
        YARN[YARN<br/>Resource Scheduler]
    end

    subgraph Storage["Storage Layer"]
        HDFS[HDFS<br/>Distributed Storage]
    end

    App1 & App2 & App3 --> Processing
    Spark & MR --> YARN
    YARN --> HDFS

    style HDFS fill:#2C3E50,stroke:#16A085,color:#fff
    style YARN fill:#16A085,stroke:#2C3E50,color:#fff
    style Spark fill:#E67E22,stroke:#2C3E50,color:#fff
    style MR fill:#2C3E50,stroke:#16A085,color:#fff

Figure 1262.1: Apache Hadoop Ecosystem Stack with HDFS, YARN, and Spark

Apache Hadoop Ecosystem: The four-layer architecture with HDFS for distributed storage, YARN for resource management, Spark/MapReduce for processing, and Hive/Pig/Mahout for applications.

1262.1.1 Key Components

| Component | Purpose | Use in IoT |
|---|---|---|
| HDFS | Distributed storage | Store petabytes of sensor data across commodity hardware |
| MapReduce | Batch processing | Process historical sensor data in parallel |
| Spark | In-memory processing | Real-time analytics, 10-100x faster than MapReduce |
| Hive | SQL on Hadoop | Query sensor data with SQL instead of code |
| Kafka | Streaming data | Handle real-time sensor streams |
| HBase | NoSQL database | Store time-series sensor readings |
| YARN | Resource management | Schedule processing jobs across the cluster |
| ZooKeeper | Coordination | Manage distributed system configuration |

1262.1.2 HDFS: Distributed File System

Why HDFS for IoT?
- Handles petabytes of data across thousands of machines
- Automatic replication (default: 3 copies) prevents data loss
- Designed for large files (GB-TB) and sequential reads

HDFS Architecture:

NameNode (Master):
- Stores file metadata (which blocks on which DataNodes)
- Single point of coordination

DataNodes (Workers):
- Store actual data blocks (128 MB default)
- Report health to NameNode
- Replicate blocks to other DataNodes

Example: 1 TB file stored as:
- 8,192 blocks x 128 MB each
- 3 copies = 24,576 blocks total
- Distributed across 100+ DataNodes
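The block arithmetic above can be sketched in a few lines of Python (a toy calculation for illustration, not a call into any HDFS client API):

```python
# Toy calculation of HDFS block counts for the 1 TB example above.
# (Illustrative arithmetic only -- not an actual HDFS operation.)

BLOCK_SIZE_MB = 128   # HDFS default block size
REPLICATION = 3       # HDFS default replication factor

def hdfs_block_count(file_size_mb: int) -> tuple[int, int]:
    """Return (logical_blocks, total_replicated_blocks) for a file."""
    logical = -(-file_size_mb // BLOCK_SIZE_MB)  # ceiling division
    return logical, logical * REPLICATION

blocks, replicated = hdfs_block_count(1024 * 1024)  # 1 TB expressed in MB
print(blocks, replicated)  # 8192 24576
```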

1262.1.3 MapReduce: The Original Big Data Engine

How It Works:
1. Map phase: Split data, process in parallel
2. Shuffle phase: Group results by key
3. Reduce phase: Aggregate grouped data

IoT Example: Calculate Average Temperature by City

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ecf0f1'}}}%%
flowchart TB
    subgraph Input["Input Data"]
        I1[Block 1: NYC:72, LA:85, NYC:74]
        I2[Block 2: LA:82, NYC:71, CHI:65]
        I3[Block 3: CHI:68, LA:88, NYC:73]
    end

    subgraph Map["Map Phase (Parallel)"]
        M1[Mapper 1<br/>NYC:72, LA:85, NYC:74]
        M2[Mapper 2<br/>LA:82, NYC:71, CHI:65]
        M3[Mapper 3<br/>CHI:68, LA:88, NYC:73]
    end

    subgraph Shuffle["Shuffle (Group by Key)"]
        S1[NYC: 72, 74, 71, 73]
        S2[LA: 85, 82, 88]
        S3[CHI: 65, 68]
    end

    subgraph Reduce["Reduce Phase"]
        R1[NYC: AVG = 72.5F]
        R2[LA: AVG = 85F]
        R3[CHI: AVG = 66.5F]
    end

    I1 --> M1
    I2 --> M2
    I3 --> M3
    M1 & M2 & M3 --> S1 & S2 & S3
    S1 --> R1
    S2 --> R2
    S3 --> R3

    style I1 fill:#2C3E50,stroke:#16A085,color:#fff
    style I2 fill:#2C3E50,stroke:#16A085,color:#fff
    style I3 fill:#2C3E50,stroke:#16A085,color:#fff
    style M1 fill:#16A085,stroke:#2C3E50,color:#fff
    style M2 fill:#16A085,stroke:#2C3E50,color:#fff
    style M3 fill:#16A085,stroke:#2C3E50,color:#fff
    style S1 fill:#E67E22,stroke:#2C3E50,color:#fff
    style S2 fill:#E67E22,stroke:#2C3E50,color:#fff
    style S3 fill:#E67E22,stroke:#2C3E50,color:#fff
    style R1 fill:#27AE60,stroke:#2C3E50,color:#fff
    style R2 fill:#27AE60,stroke:#2C3E50,color:#fff
    style R3 fill:#27AE60,stroke:#2C3E50,color:#fff

Figure 1262.2: MapReduce Example: Temperature Aggregation by City

MapReduce Data Flow: Input data is split across mappers, shuffled by city key, and reduced to calculate average temperatures. Parallel execution across nodes enables processing terabytes in minutes.
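The data flow in Figure 1262.2 can be mimicked in a single process (a minimal sketch of the map/shuffle/reduce stages; real MapReduce distributes each stage across nodes):

```python
# Single-process sketch of the map/shuffle/reduce flow from Figure 1262.2.
from collections import defaultdict

blocks = [
    [("NYC", 72), ("LA", 85), ("NYC", 74)],
    [("LA", 82), ("NYC", 71), ("CHI", 65)],
    [("CHI", 68), ("LA", 88), ("NYC", 73)],
]

# Map phase: each block is processed independently (here, a pass-through).
mapped = [pair for block in blocks for pair in block]

# Shuffle phase: group values by key (the city).
grouped = defaultdict(list)
for city, temp in mapped:
    grouped[city].append(temp)

# Reduce phase: aggregate each group.
averages = {city: sum(temps) / len(temps) for city, temps in grouped.items()}
print(averages)  # {'NYC': 72.5, 'LA': 85.0, 'CHI': 66.5}
```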

1262.2 Apache Spark: In-Memory Processing

Spark is the modern replacement for MapReduce, offering 10-100x faster processing through in-memory computation.

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ecf0f1'}}}%%
graph TB
    subgraph SparkCore["Spark Core (RDD)"]
        SC[Resilient Distributed<br/>Datasets]
    end

    subgraph Libraries["Spark Libraries"]
        SQL[Spark SQL<br/>Structured Data]
        ML[MLlib<br/>Machine Learning]
        Stream[Spark Streaming<br/>Real-Time]
        Graph[GraphX<br/>Graph Processing]
    end

    SQL & ML & Stream & Graph --> SparkCore

    subgraph Cluster["Cluster Manager"]
        YARN[YARN]
        Mesos[Mesos]
        K8s[Kubernetes]
        Standalone[Standalone]
    end

    SparkCore --> Cluster

    subgraph Storage["Storage"]
        HDFS2[HDFS]
        S3[S3]
        Cassandra[Cassandra]
        Kafka2[Kafka]
    end

    Cluster --> Storage

    style SparkCore fill:#E67E22,stroke:#2C3E50,color:#fff
    style SQL fill:#2C3E50,stroke:#16A085,color:#fff
    style ML fill:#16A085,stroke:#2C3E50,color:#fff
    style Stream fill:#2C3E50,stroke:#16A085,color:#fff
    style Graph fill:#16A085,stroke:#2C3E50,color:#fff

Figure 1262.3: Apache Spark Architecture with Core Libraries and Cluster Managers

Apache Spark Architecture: Spark Core with RDDs provides the foundation, supporting SQL, ML, streaming, and graph libraries that run on various cluster managers and storage backends.

1262.2.1 Why Spark Over MapReduce?

| Feature | MapReduce | Spark |
|---|---|---|
| Processing speed | Disk-based (slow) | In-memory (fast) |
| Ease of use | Complex Java code | Simple Python/SQL |
| Iterative algorithms | Multiple disk writes | Single memory pass |
| Real-time | Not supported | Spark Streaming |
| Machine learning | Basic | MLlib (100+ algorithms) |

1262.2.2 Spark for IoT: Real-World Example

Processing 1 Billion Sensor Readings:

# Imports and session setup (the app name is illustrative)
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, col, max, min, window

spark = SparkSession.builder.appName("iot-hourly-averages").getOrCreate()

# Read sensor data from distributed storage
sensor_data = spark.read.parquet("s3://iot-data/sensors/")

# Filter, aggregate, and analyze in memory
hourly_averages = sensor_data \
    .filter(col("timestamp") > "2024-01-01") \
    .groupBy(window("timestamp", "1 hour"), "sensor_id") \
    .agg(avg("temperature").alias("avg_temp"),
         max("temperature").alias("max_temp"),
         min("temperature").alias("min_temp"))

# Write results back to storage
hourly_averages.write.parquet("s3://iot-data/hourly-summaries/")

# Performance:
# - MapReduce: 45 minutes (multiple disk read/write cycles)
# - Spark: 3 minutes (in-memory processing)

1262.3 Apache Kafka: Real-Time Streaming

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ecf0f1'}}}%%
flowchart LR
    subgraph Producers["Data Producers"]
        P1[Temperature<br/>Sensors]
        P2[Motion<br/>Sensors]
        P3[Camera<br/>Metadata]
    end

    subgraph Kafka["Apache Kafka Cluster"]
        T1[Topic: temperature<br/>Partitions: 10]
        T2[Topic: motion<br/>Partitions: 5]
        T3[Topic: camera<br/>Partitions: 20]
    end

    subgraph Consumers["Data Consumers"]
        C1[Real-Time<br/>Dashboard]
        C2[Anomaly<br/>Detection]
        C3[Long-Term<br/>Storage]
    end

    P1 --> T1
    P2 --> T2
    P3 --> T3

    T1 --> C1 & C2 & C3
    T2 --> C1 & C2 & C3
    T3 --> C1 & C2 & C3

    style P1 fill:#2C3E50,stroke:#16A085,color:#fff
    style P2 fill:#2C3E50,stroke:#16A085,color:#fff
    style P3 fill:#2C3E50,stroke:#16A085,color:#fff
    style T1 fill:#E67E22,stroke:#2C3E50,color:#fff
    style T2 fill:#E67E22,stroke:#2C3E50,color:#fff
    style T3 fill:#E67E22,stroke:#2C3E50,color:#fff
    style C1 fill:#16A085,stroke:#2C3E50,color:#fff
    style C2 fill:#16A085,stroke:#2C3E50,color:#fff
    style C3 fill:#16A085,stroke:#2C3E50,color:#fff

Figure 1262.4: Apache Kafka Architecture with Producers, Topics, and Consumers

Apache Kafka Architecture: Multiple IoT producers publish to topic partitions, which are consumed by real-time dashboards, anomaly detection systems, and storage pipelines simultaneously.

1262.3.1 Why Kafka for IoT?

| Feature | Value | IoT Benefit |
|---|---|---|
| Throughput | 100K-1M messages/sec | Handle sensor firehose |
| Latency | Single-digit ms | Real-time alerts |
| Durability | Replicated, persistent | No data loss |
| Scalability | Add brokers on demand | Grow with sensors |
| Multiple consumers | Read same data multiple times | Analytics + storage + alerts |

1262.3.2 Kafka Configuration for IoT

# Production Kafka configuration for IoT
kafka_config:
  brokers: 3                    # Minimum for production
  replication_factor: 3         # Survive 2 broker failures
  partitions_per_topic: 10      # Parallelism for consumers

  topics:
    sensor-readings:
      partitions: 20            # High throughput
      retention_hours: 168      # 7 days for replay
      compression: lz4          # 50% size reduction

    alerts:
      partitions: 5             # Lower volume
      retention_hours: 720      # 30 days for audit

  performance:
    batch_size: 16384           # 16 KB batches
    linger_ms: 5                # Wait 5ms for batch fill
    buffer_memory: 33554432     # 32 MB producer buffer
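Partition counts matter because Kafka assigns each keyed message to a partition by hashing its key, so all readings from one sensor stay ordered on one partition. The sketch below illustrates that idea; the real Java client uses murmur2, and crc32 here is a deterministic, dependency-free stand-in:

```python
# Simplified sketch of Kafka's key-based partition assignment.
# (The actual client uses murmur2 hashing; crc32 is a stand-in.)
import zlib

NUM_PARTITIONS = 20  # matches the sensor-readings topic above

def partition_for(sensor_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """All messages with the same key land on the same partition,
    preserving per-sensor ordering while spreading load across consumers."""
    return zlib.crc32(sensor_id.encode()) % num_partitions

# Distribute 1,000 hypothetical sensor IDs across the partitions.
counts = [0] * NUM_PARTITIONS
for i in range(1000):
    counts[partition_for(f"sensor-{i:04d}")] += 1

print(min(counts), max(counts))  # load is roughly balanced across partitions
```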

1262.4 Time-Series Databases

IoT data is inherently time-series - continuous streams of timestamped readings. Databases specialized for this pattern can sustain up to 100x the write throughput of general-purpose databases.

1262.4.1 Time-Series Database Comparison

| Database | Type | Best For | Write Speed | Query Speed | Compression |
|---|---|---|---|---|---|
| InfluxDB | Time-series | High-frequency sensor data | 100K+ writes/sec | Fast time-range | 10x |
| TimescaleDB | Time-series (PostgreSQL) | SQL-compatible analytics | 50K writes/sec | Complex joins | 8x |
| Prometheus | Metrics | System monitoring | 100K samples/sec | PromQL queries | 5x |
| Cassandra | Wide-column | Massive scale writes | 1M+ writes/sec | Key-based lookup | 3x |

1262.4.2 Why Time-Series DBs Are Different

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ecf0f1'}}}%%
flowchart TB
    subgraph Traditional["Traditional DB (Row-Based)"]
        T1[Row 1: timestamp, sensor_id, temp, humidity, pressure]
        T2[Row 2: timestamp, sensor_id, temp, humidity, pressure]
        T3[Row 3: timestamp, sensor_id, temp, humidity, pressure]
        T4[Query: AVG temp<br/>Reads ALL columns<br/>100% I/O]
    end

    subgraph TimeSeries["Time-Series DB (Columnar)"]
        C1[timestamp column: t1, t2, t3, ...]
        C2[temp column: 72, 73, 71, ...]
        C3[humidity column: 45, 46, 44, ...]
        C4[pressure column: 1013, 1012, 1014, ...]
        C5[Query: AVG temp<br/>Reads ONLY temp column<br/>20% I/O]
    end

    style T1 fill:#E67E22,stroke:#2C3E50,color:#fff
    style T2 fill:#E67E22,stroke:#2C3E50,color:#fff
    style T3 fill:#E67E22,stroke:#2C3E50,color:#fff
    style T4 fill:#E74C3C,stroke:#2C3E50,color:#fff
    style C1 fill:#16A085,stroke:#2C3E50,color:#fff
    style C2 fill:#27AE60,stroke:#2C3E50,color:#fff
    style C3 fill:#16A085,stroke:#2C3E50,color:#fff
    style C4 fill:#16A085,stroke:#2C3E50,color:#fff
    style C5 fill:#27AE60,stroke:#2C3E50,color:#fff

Figure 1262.5: Row-Based versus Columnar Storage for Time-Series Queries

Columnar vs Row-Based Storage: Traditional databases read all columns for every query (100% I/O). Time-series columnar databases read only the requested columns (20% I/O for temperature queries), providing 5x performance improvement.
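The I/O difference in Figure 1262.5 can be shown with two toy in-memory layouts (illustrative only; real engines add compression and indexing on top):

```python
# Toy comparison of row-based vs columnar layouts for AVG(temperature).
# With 5 fields per record, a row layout touches all 5 fields per record,
# while a columnar layout touches only the temperature array (1 of 5 = 20%).

rows = [  # row-based: one record per reading
    {"ts": t, "sensor": "s1", "temp": 70 + t, "humidity": 45, "pressure": 1013}
    for t in range(3)
]

columns = {  # columnar: one array per field
    "ts": [0, 1, 2],
    "temp": [70, 71, 72],
    "humidity": [45, 45, 45],
    "pressure": [1013, 1013, 1013],
}

# Row layout: every field of every record is read to answer the query.
row_values_read = sum(len(r) for r in rows)  # 15 values touched
avg_row = sum(r["temp"] for r in rows) / len(rows)

# Columnar layout: only the temp array is read.
col_values_read = len(columns["temp"])       # 3 values touched
avg_col = sum(columns["temp"]) / len(columns["temp"])

print(avg_row, avg_col, col_values_read / row_values_read)  # 71.0 71.0 0.2
```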

1262.4.3 InfluxDB for IoT: Configuration Example

# InfluxDB configuration for IoT workload
influxdb_config:
  # Retention policies - automatic data lifecycle
  retention_policies:
    - name: "realtime"
      duration: "7d"           # Keep raw data 7 days
      replication: 2           # 2 copies for durability
      shard_duration: "1d"     # New shard daily

    - name: "downsampled"
      duration: "365d"         # Keep aggregates 1 year
      replication: 2

  # Continuous queries - automatic aggregation
  continuous_queries:
    - name: "downsample_temperature"
      query: |
        SELECT mean(temperature) AS avg_temp,
               max(temperature) AS max_temp,
               min(temperature) AS min_temp
        INTO downsampled.temperature_hourly
        FROM realtime.sensor_readings
        GROUP BY time(1h), sensor_id

  # Storage optimization
  storage:
    cache_snapshot_memory: "256m"
    compact_full_write_cold: "4h"
    index_version: "tsi1"      # Time-series index

1262.5 Understanding Check: Time-Series DB vs Regular DB

Scenario: A smart building has 10,000 sensors (temperature, humidity, occupancy, energy consumption) sampling every 5 seconds. You need to:
- Query: "Show me energy consumption trends for Floor 3 over the past week"
- Downsample: Keep 5-second data for 30 days, then 1-minute averages forever
- Retention: Automatically delete raw data older than 30 days

Think about: Why would you use InfluxDB (time-series DB) instead of PostgreSQL (relational DB)?

Key Insights:

  1. Write Performance:

    • Volume: 10,000 sensors x 12 samples/min = 120,000 writes/minute = 2,000 writes/second
    • PostgreSQL: Optimized for transactional updates (hundreds of writes/sec)
    • InfluxDB: Optimized for time-series writes (millions of writes/sec)
    • Real number: InfluxDB handles 100x more time-series writes than PostgreSQL
  2. Storage Efficiency:

    • Daily data: 10,000 sensors x 17,280 samples/day x 50 bytes = 8.64 GB/day raw
    • PostgreSQL: Row-based storage, no time-series compression: 8.64 GB/day
    • InfluxDB: Columnar + time-series compression (similar values compress well): 864 MB/day (90% reduction)
    • Real number: InfluxDB uses 10x less storage for time-series data
  3. Query Performance for Time Ranges:

    PostgreSQL approach:

    SELECT date_trunc('hour', timestamp), AVG(temperature)
    FROM sensor_readings
    WHERE sensor_location = 'Floor3'
      AND timestamp > NOW() - INTERVAL '7 days'
    GROUP BY date_trunc('hour', timestamp);
    • Scans full table, filters rows, then aggregates
    • Query time: 30-60 seconds on billions of rows

    InfluxDB approach:

    SELECT MEAN(temperature)
    FROM sensors
    WHERE location = 'Floor3'
      AND time > now() - 7d
    GROUP BY time(1h);
    • Time-based indexing, automatic downsampling
    • Query time: 100-500ms (100x faster!)
  4. Automatic Retention Policies:

    PostgreSQL:

    -- Manual cron job required
    DELETE FROM sensor_readings WHERE timestamp < NOW() - INTERVAL '30 days';
    -- Runs for hours, locks table

    InfluxDB:

    -- Built-in retention policy
    CREATE RETENTION POLICY "30days" ON "sensors" DURATION 30d REPLICATION 1 DEFAULT;
    -- Automatic, no performance impact

Decision Rule:

Use TIME-SERIES DB when:
- High write volume (>1,000 writes/sec)
- Time-range queries common ("last hour", "past week")
- Data has timestamps and is append-only
- Automatic downsampling/retention needed
- Examples: InfluxDB, TimescaleDB, Prometheus

Use RELATIONAL DB when:
- Transactional updates (UPDATE/DELETE common)
- Complex joins across many tables
- ACID compliance critical
- Data is not primarily time-ordered
- Examples: PostgreSQL, MySQL
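The decision rule above can be encoded as a small checklist function (a sketch; the thresholds and parameter names are illustrative, taken from this section rather than any standard):

```python
# Checklist function encoding the decision rule above.
# Thresholds and parameter names are illustrative.

def suggest_database(writes_per_sec: int,
                     time_range_queries: bool,
                     append_only: bool,
                     needs_transactions: bool) -> str:
    """Return a rough database-category suggestion for a workload."""
    if needs_transactions:
        return "relational"  # ACID and UPDATE/DELETE-heavy workloads
    if writes_per_sec > 1000 and time_range_queries and append_only:
        return "time-series"
    return "relational"

# Smart-building scenario from Section 1262.5: 2,000 writes/sec,
# frequent time-range queries, append-only sensor data.
print(suggest_database(2000, True, True, False))  # time-series
```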

1262.6 Technology Selection Framework

1262.6.1 Technology Comparison Table

| Requirement | Traditional DB (PostgreSQL) | Time-Series DB (InfluxDB) | Stream Processing (Kafka + Spark) | Batch Processing (Hadoop) |
|---|---|---|---|---|
| Write Throughput | 1K-10K writes/sec | 100K-1M writes/sec | 100K-10M events/sec | N/A (batch loads) |
| Storage Cost | $0.10/GB/month | $0.10/GB (but 10x compressed) | $0.023/GB (S3) | $0.023/GB (HDFS/S3) |
| Query Latency | 100ms-10s | 10ms-1s (time-range) | 100ms-5s (real-time) | Minutes to hours |
| Data Retention | Manual deletion | Auto retention policies | Kafka: 1-30 days | Years (petabytes) |
| Use Case | Transactional apps | Sensor data, metrics | Real-time dashboards | Historical analytics |
| Cost (1M events/day) | $500/month | $50/month | $200/month | $20/month (batch) |

1262.6.2 Common Pitfall: Using the Wrong Database

Caution
Pitfall: Using the Wrong Database for Time-Series Data

The Mistake:

-- Store sensor data in MySQL (row-oriented, general-purpose)
CREATE TABLE sensors (
    timestamp DATETIME,
    sensor_id INT,
    temperature FLOAT,
    humidity FLOAT
);

-- Query: Get average temperature per hour for last 30 days
SELECT DATE_FORMAT(timestamp, '%Y-%m-%d %H:00:00'),
       AVG(temperature)
FROM sensors
WHERE timestamp > NOW() - INTERVAL 30 DAY
GROUP BY DATE_FORMAT(timestamp, '%Y-%m-%d %H:00:00');

Why It's Slow:
- MySQL scans the entire table even when only the temperature column is needed
- Row-oriented storage reads all columns (timestamp, sensor_id, temperature, humidity)
- No automatic time-based partitioning
- Grouping by hour requires DATE_FORMAT string parsing
- Query time: 45 seconds on 100M rows

The Fix: Use time-series database like InfluxDB

-- InfluxDB query (columnar, time-optimized)
SELECT MEAN(temperature)
FROM sensors
WHERE time > now() - 30d
GROUP BY time(1h)

Why It's Fast:
- Columnar storage: only the time and temperature columns are read
- Automatic time-based sharding (data partitioned by day/hour)
- Native time aggregation (GROUP BY time(1h) is optimized)
- Query time: 2 seconds on the same 100M rows

Performance: roughly 22x faster with a purpose-built database

1262.7 Summary

  • Apache Hadoop ecosystem provides distributed processing through HDFS for storage, MapReduce and Spark for computation, and frameworks like Kafka for real-time streaming.
  • Apache Spark offers 10-100x faster processing than MapReduce through in-memory computation, supporting SQL queries, machine learning, and streaming in a unified platform.
  • Apache Kafka handles millions of events per second with millisecond latency, enabling real-time IoT data pipelines with built-in durability and scalability.
  • Time-series databases like InfluxDB provide 100x better write performance and 10x storage compression compared to traditional databases for sensor data workloads.

1262.8 What's Next

Now that you understand the technology landscape, continue to: