1262  Big Data Technologies

Learning Objectives

After completing this chapter, you will be able to:

  • Understand the Apache Hadoop ecosystem and its components
  • Differentiate between batch processing (Spark) and stream processing (Kafka, Flink)
  • Select appropriate time-series databases for IoT workloads
  • Design technology stacks for specific IoT scenarios

1262.1 Apache Hadoop Ecosystem

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ecf0f1', 'background': '#ffffff', 'mainBkg': '#2C3E50', 'secondBkg': '#16A085', 'tertiaryBkg': '#E67E22'}}}%%
graph TB
    subgraph Applications["Application Layer"]
        App1[Hive<br/>SQL Queries]
        App2[Pig<br/>Scripting]
        App3[Mahout<br/>Machine Learning]
    end

    subgraph Processing["Processing Layer"]
        Spark[Apache Spark<br/>In-Memory Processing]
        MR[MapReduce<br/>Batch Processing]
    end

    subgraph Resource["Resource Management"]
        YARN[YARN<br/>Resource Scheduler]
    end

    subgraph Storage["Storage Layer"]
        HDFS[HDFS<br/>Distributed Storage]
    end

    App1 & App2 & App3 --> Processing
    Spark & MR --> YARN
    YARN --> HDFS

    style HDFS fill:#2C3E50,stroke:#16A085,color:#fff
    style YARN fill:#16A085,stroke:#2C3E50,color:#fff
    style Spark fill:#E67E22,stroke:#2C3E50,color:#fff
    style MR fill:#2C3E50,stroke:#16A085,color:#fff

Figure 1262.1: Apache Hadoop Ecosystem Stack with HDFS, YARN, and Spark

Apache Hadoop Ecosystem: The four-layer architecture with HDFS for distributed storage, YARN for resource management, Spark/MapReduce for processing, and Hive/Pig/Mahout for applications.

1262.1.1 Key Components

| Component | Purpose | Use in IoT |
|---|---|---|
| HDFS | Distributed storage | Store petabytes of sensor data across commodity hardware |
| MapReduce | Batch processing | Process historical sensor data in parallel |
| Spark | In-memory processing | Real-time analytics, 10-100x faster than MapReduce |
| Hive | SQL on Hadoop | Query sensor data with SQL instead of code |
| Kafka | Streaming data | Handle real-time sensor streams |
| HBase | NoSQL database | Store time-series sensor readings |
| YARN | Resource management | Schedule processing jobs across the cluster |
| ZooKeeper | Coordination | Manage distributed system configuration |

1262.1.2 HDFS: Distributed File System

Why HDFS for IoT?
- Handles petabytes of data across thousands of machines
- Automatic replication (default: 3 copies) prevents data loss
- Designed for large files (GB-TB) and sequential reads

HDFS Architecture:

NameNode (Master):
- Stores file metadata (which blocks on which DataNodes)
- Single point of coordination

DataNodes (Workers):
- Store actual data blocks (128 MB default)
- Report health to NameNode
- Replicate blocks to other DataNodes

Example: 1 TB file stored as:
- 8,192 blocks x 128 MB each
- 3 copies = 24,576 blocks total
- Distributed across 100+ DataNodes
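The block arithmetic above can be sketched in a few lines of Python (a toy calculation for illustration, not a call into any HDFS client API):

```python
# Toy calculation of HDFS block counts for the 1 TB example above.
# (Illustrative arithmetic only -- not an actual HDFS operation.)

BLOCK_SIZE_MB = 128   # HDFS default block size
REPLICATION = 3       # HDFS default replication factor

def hdfs_block_count(file_size_mb: int) -> tuple[int, int]:
    """Return (logical_blocks, total_replicated_blocks) for a file."""
    logical = -(-file_size_mb // BLOCK_SIZE_MB)  # ceiling division
    return logical, logical * REPLICATION

blocks, replicated = hdfs_block_count(1024 * 1024)  # 1 TB expressed in MB
print(blocks, replicated)  # 8192 24576
```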

1262.1.3 MapReduce: The Original Big Data Engine

How It Works:
1. Map phase: Split data, process in parallel
2. Shuffle phase: Group results by key
3. Reduce phase: Aggregate grouped data

IoT Example: Calculate Average Temperature by City

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ecf0f1'}}}%%
flowchart TB
    subgraph Input["Input Data"]
        I1[Block 1: NYC:72, LA:85, NYC:74]
        I2[Block 2: LA:82, NYC:71, CHI:65]
        I3[Block 3: CHI:68, LA:88, NYC:73]
    end

    subgraph Map["Map Phase (Parallel)"]
        M1[Mapper 1<br/>NYC:72, LA:85, NYC:74]
        M2[Mapper 2<br/>LA:82, NYC:71, CHI:65]
        M3[Mapper 3<br/>CHI:68, LA:88, NYC:73]
    end

    subgraph Shuffle["Shuffle (Group by Key)"]
        S1[NYC: 72, 74, 71, 73]
        S2[LA: 85, 82, 88]
        S3[CHI: 65, 68]
    end

    subgraph Reduce["Reduce Phase"]
        R1[NYC: AVG = 72.5F]
        R2[LA: AVG = 85F]
        R3[CHI: AVG = 66.5F]
    end

    I1 --> M1
    I2 --> M2
    I3 --> M3
    M1 & M2 & M3 --> S1 & S2 & S3
    S1 --> R1
    S2 --> R2
    S3 --> R3

    style I1 fill:#2C3E50,stroke:#16A085,color:#fff
    style I2 fill:#2C3E50,stroke:#16A085,color:#fff
    style I3 fill:#2C3E50,stroke:#16A085,color:#fff
    style M1 fill:#16A085,stroke:#2C3E50,color:#fff
    style M2 fill:#16A085,stroke:#2C3E50,color:#fff
    style M3 fill:#16A085,stroke:#2C3E50,color:#fff
    style S1 fill:#E67E22,stroke:#2C3E50,color:#fff
    style S2 fill:#E67E22,stroke:#2C3E50,color:#fff
    style S3 fill:#E67E22,stroke:#2C3E50,color:#fff
    style R1 fill:#27AE60,stroke:#2C3E50,color:#fff
    style R2 fill:#27AE60,stroke:#2C3E50,color:#fff
    style R3 fill:#27AE60,stroke:#2C3E50,color:#fff

Figure 1262.2: MapReduce Example: Temperature Aggregation by City

MapReduce Data Flow: Input data is split across mappers, shuffled by city key, and reduced to calculate average temperatures. Parallel execution across nodes enables processing terabytes in minutes.
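The data flow in Figure 1262.2 can be mimicked in a single process (a minimal sketch of the map/shuffle/reduce stages; real MapReduce distributes each stage across nodes):

```python
# Single-process sketch of the map/shuffle/reduce flow from Figure 1262.2.
from collections import defaultdict

blocks = [
    [("NYC", 72), ("LA", 85), ("NYC", 74)],
    [("LA", 82), ("NYC", 71), ("CHI", 65)],
    [("CHI", 68), ("LA", 88), ("NYC", 73)],
]

# Map phase: each block is processed independently (here, a pass-through).
mapped = [pair for block in blocks for pair in block]

# Shuffle phase: group values by key (the city).
grouped = defaultdict(list)
for city, temp in mapped:
    grouped[city].append(temp)

# Reduce phase: aggregate each group.
averages = {city: sum(temps) / len(temps) for city, temps in grouped.items()}
print(averages)  # {'NYC': 72.5, 'LA': 85.0, 'CHI': 66.5}
```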

1262.2 Apache Spark: In-Memory Processing

Spark is the modern replacement for MapReduce, offering 10-100x faster processing through in-memory computation.

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ecf0f1'}}}%%
graph TB
    subgraph SparkCore["Spark Core (RDD)"]
        SC[Resilient Distributed<br/>Datasets]
    end

    subgraph Libraries["Spark Libraries"]
        SQL[Spark SQL<br/>Structured Data]
        ML[MLlib<br/>Machine Learning]
        Stream[Spark Streaming<br/>Real-Time]
        Graph[GraphX<br/>Graph Processing]
    end

    SQL & ML & Stream & Graph --> SparkCore

    subgraph Cluster["Cluster Manager"]
        YARN[YARN]
        Mesos[Mesos]
        K8s[Kubernetes]
        Standalone[Standalone]
    end

    SparkCore --> Cluster

    subgraph Storage["Storage"]
        HDFS2[HDFS]
        S3[S3]
        Cassandra[Cassandra]
        Kafka2[Kafka]
    end

    Cluster --> Storage

    style SparkCore fill:#E67E22,stroke:#2C3E50,color:#fff
    style SQL fill:#2C3E50,stroke:#16A085,color:#fff
    style ML fill:#16A085,stroke:#2C3E50,color:#fff
    style Stream fill:#2C3E50,stroke:#16A085,color:#fff
    style Graph fill:#16A085,stroke:#2C3E50,color:#fff

Figure 1262.3: Apache Spark Architecture with Core Libraries and Cluster Managers

Apache Spark Architecture: Spark Core with RDDs provides the foundation, supporting SQL, ML, streaming, and graph libraries that run on various cluster managers and storage backends.

1262.2.1 Why Spark Over MapReduce?

| Feature | MapReduce | Spark |
|---|---|---|
| Processing speed | Disk-based (slow) | In-memory (fast) |
| Ease of use | Complex Java code | Simple Python/SQL |
| Iterative algorithms | Multiple disk writes | Single memory pass |
| Real-time | Not supported | Spark Streaming |
| Machine learning | Basic | MLlib (100+ algorithms) |

1262.2.2 Spark for IoT: Real-World Example

Processing 1 Billion Sensor Readings:

# Imports and session setup (the app name is illustrative)
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, col, max, min, window

spark = SparkSession.builder.appName("iot-hourly-averages").getOrCreate()

# Read sensor data from distributed storage
sensor_data = spark.read.parquet("s3://iot-data/sensors/")

# Filter, aggregate, and analyze in memory
hourly_averages = sensor_data \
    .filter(col("timestamp") > "2024-01-01") \
    .groupBy(window("timestamp", "1 hour"), "sensor_id") \
    .agg(avg("temperature").alias("avg_temp"),
         max("temperature").alias("max_temp"),
         min("temperature").alias("min_temp"))

# Write results back to storage
hourly_averages.write.parquet("s3://iot-data/hourly-summaries/")

# Performance:
# - MapReduce: 45 minutes (multiple disk read/write cycles)
# - Spark: 3 minutes (in-memory processing)

1262.3 Apache Kafka: Real-Time Streaming

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ecf0f1'}}}%%
flowchart LR
    subgraph Producers["Data Producers"]
        P1[Temperature<br/>Sensors]
        P2[Motion<br/>Sensors]
        P3[Camera<br/>Metadata]
    end

    subgraph Kafka["Apache Kafka Cluster"]
        T1[Topic: temperature<br/>Partitions: 10]
        T2[Topic: motion<br/>Partitions: 5]
        T3[Topic: camera<br/>Partitions: 20]
    end

    subgraph Consumers["Data Consumers"]
        C1[Real-Time<br/>Dashboard]
        C2[Anomaly<br/>Detection]
        C3[Long-Term<br/>Storage]
    end

    P1 --> T1
    P2 --> T2
    P3 --> T3

    T1 --> C1 & C2 & C3
    T2 --> C1 & C2 & C3
    T3 --> C1 & C2 & C3

    style P1 fill:#2C3E50,stroke:#16A085,color:#fff
    style P2 fill:#2C3E50,stroke:#16A085,color:#fff
    style P3 fill:#2C3E50,stroke:#16A085,color:#fff
    style T1 fill:#E67E22,stroke:#2C3E50,color:#fff
    style T2 fill:#E67E22,stroke:#2C3E50,color:#fff
    style T3 fill:#E67E22,stroke:#2C3E50,color:#fff
    style C1 fill:#16A085,stroke:#2C3E50,color:#fff
    style C2 fill:#16A085,stroke:#2C3E50,color:#fff
    style C3 fill:#16A085,stroke:#2C3E50,color:#fff

Figure 1262.4: Apache Kafka Architecture with Producers, Topics, and Consumers

Apache Kafka Architecture: Multiple IoT producers publish to topic partitions, which are consumed by real-time dashboards, anomaly detection systems, and storage pipelines simultaneously.

1262.3.1 Why Kafka for IoT?

| Feature | Value | IoT Benefit |
|---|---|---|
| Throughput | 100K-1M messages/sec | Handle sensor firehose |
| Latency | Single-digit ms | Real-time alerts |
| Durability | Replicated, persistent | No data loss |
| Scalability | Add brokers on demand | Grow with sensors |
| Multiple consumers | Read same data multiple times | Analytics + storage + alerts |

1262.3.2 Kafka Configuration for IoT

# Production Kafka configuration for IoT
kafka_config:
  brokers: 3                    # Minimum for production
  replication_factor: 3         # Survive 2 broker failures
  partitions_per_topic: 10      # Parallelism for consumers

  topics:
    sensor-readings:
      partitions: 20            # High throughput
      retention_hours: 168      # 7 days for replay
      compression: lz4          # 50% size reduction

    alerts:
      partitions: 5             # Lower volume
      retention_hours: 720      # 30 days for audit

  performance:
    batch_size: 16384           # 16 KB batches
    linger_ms: 5                # Wait 5ms for batch fill
    buffer_memory: 33554432     # 32 MB producer buffer
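Partition counts matter because Kafka assigns each keyed message to a partition by hashing its key, so all readings from one sensor stay ordered on one partition. The sketch below illustrates that idea; the real Java client uses murmur2, and crc32 here is a deterministic, dependency-free stand-in:

```python
# Simplified sketch of Kafka's key-based partition assignment.
# (The actual client uses murmur2 hashing; crc32 is a stand-in.)
import zlib

NUM_PARTITIONS = 20  # matches the sensor-readings topic above

def partition_for(sensor_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """All messages with the same key land on the same partition,
    preserving per-sensor ordering while spreading load across consumers."""
    return zlib.crc32(sensor_id.encode()) % num_partitions

# Distribute 1,000 hypothetical sensor IDs across the partitions.
counts = [0] * NUM_PARTITIONS
for i in range(1000):
    counts[partition_for(f"sensor-{i:04d}")] += 1

print(min(counts), max(counts))  # load is roughly balanced across partitions
```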

1262.4 Time-Series Databases

IoT data is inherently time-series - continuous streams of timestamped readings. Databases specialized for this pattern can sustain up to 100x the write throughput of general-purpose databases.

1262.4.1 Time-Series Database Comparison

| Database | Type | Best For | Write Speed | Query Speed | Compression |
|---|---|---|---|---|---|
| InfluxDB | Time-series | High-frequency sensor data | 100K+ writes/sec | Fast time-range | 10x |
| TimescaleDB | Time-series (PostgreSQL) | SQL-compatible analytics | 50K writes/sec | Complex joins | 8x |
| Prometheus | Metrics | System monitoring | 100K samples/sec | PromQL queries | 5x |
| Cassandra | Wide-column | Massive scale writes | 1M+ writes/sec | Key-based lookup | 3x |

1262.4.2 Why Time-Series DBs Are Different

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ecf0f1'}}}%%
flowchart TB
    subgraph Traditional["Traditional DB (Row-Based)"]
        T1[Row 1: timestamp, sensor_id, temp, humidity, pressure]
        T2[Row 2: timestamp, sensor_id, temp, humidity, pressure]
        T3[Row 3: timestamp, sensor_id, temp, humidity, pressure]
        T4[Query: AVG temp<br/>Reads ALL columns<br/>100% I/O]
    end

    subgraph TimeSeries["Time-Series DB (Columnar)"]
        C1[timestamp column: t1, t2, t3, ...]
        C2[temp column: 72, 73, 71, ...]
        C3[humidity column: 45, 46, 44, ...]
        C4[pressure column: 1013, 1012, 1014, ...]
        C5[Query: AVG temp<br/>Reads ONLY temp column<br/>20% I/O]
    end

    style T1 fill:#E67E22,stroke:#2C3E50,color:#fff
    style T2 fill:#E67E22,stroke:#2C3E50,color:#fff
    style T3 fill:#E67E22,stroke:#2C3E50,color:#fff
    style T4 fill:#E74C3C,stroke:#2C3E50,color:#fff
    style C1 fill:#16A085,stroke:#2C3E50,color:#fff
    style C2 fill:#27AE60,stroke:#2C3E50,color:#fff
    style C3 fill:#16A085,stroke:#2C3E50,color:#fff
    style C4 fill:#16A085,stroke:#2C3E50,color:#fff
    style C5 fill:#27AE60,stroke:#2C3E50,color:#fff

Figure 1262.5: Row-Based versus Columnar Storage for Time-Series Queries

Columnar vs Row-Based Storage: Traditional databases read all columns for every query (100% I/O). Time-series columnar databases read only the requested columns (20% I/O for temperature queries), providing 5x performance improvement.
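The I/O difference in Figure 1262.5 can be shown with two toy in-memory layouts (illustrative only; real engines add compression and indexing on top):

```python
# Toy comparison of row-based vs columnar layouts for AVG(temperature).
# With 5 fields per record, a row layout touches all 5 fields per record,
# while a columnar layout touches only the temperature array (1 of 5 = 20%).

rows = [  # row-based: one record per reading
    {"ts": t, "sensor": "s1", "temp": 70 + t, "humidity": 45, "pressure": 1013}
    for t in range(3)
]

columns = {  # columnar: one array per field
    "ts": [0, 1, 2],
    "temp": [70, 71, 72],
    "humidity": [45, 45, 45],
    "pressure": [1013, 1013, 1013],
}

# Row layout: every field of every record is read to answer the query.
row_values_read = sum(len(r) for r in rows)  # 15 values touched
avg_row = sum(r["temp"] for r in rows) / len(rows)

# Columnar layout: only the temp array is read.
col_values_read = len(columns["temp"])       # 3 values touched
avg_col = sum(columns["temp"]) / len(columns["temp"])

print(avg_row, avg_col, col_values_read / row_values_read)  # 71.0 71.0 0.2
```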

1262.4.3 InfluxDB for IoT: Configuration Example

# InfluxDB configuration for IoT workload
influxdb_config:
  # Retention policies - automatic data lifecycle
  retention_policies:
    - name: "realtime"
      duration: "7d"           # Keep raw data 7 days
      replication: 2           # 2 copies for durability
      shard_duration: "1d"     # New shard daily

    - name: "downsampled"
      duration: "365d"         # Keep aggregates 1 year
      replication: 2

  # Continuous queries - automatic aggregation
  continuous_queries:
    - name: "downsample_temperature"
      query: |
        SELECT mean(temperature) AS avg_temp,
               max(temperature) AS max_temp,
               min(temperature) AS min_temp
        INTO downsampled.temperature_hourly
        FROM realtime.sensor_readings
        GROUP BY time(1h), sensor_id

  # Storage optimization
  storage:
    cache_snapshot_memory: "256m"
    compact_full_write_cold: "4h"
    index_version: "tsi1"      # Time-series index

1262.5 Understanding Check: Time-Series DB vs Regular DB

Scenario: A smart building has 10,000 sensors (temperature, humidity, occupancy, energy consumption) sampling every 5 seconds. You need to:
- Query: "Show me energy consumption trends for Floor 3 over the past week"
- Downsample: Keep 5-second data for 30 days, then 1-minute averages forever
- Retention: Automatically delete raw data older than 30 days

Think about: Why would you use InfluxDB (time-series DB) instead of PostgreSQL (relational DB)?

Key Insights:

  1. Write Performance:

    • Volume: 10,000 sensors x 12 samples/min = 120,000 writes/minute = 2,000 writes/second
    • PostgreSQL: Optimized for transactional updates (hundreds of writes/sec)
    • InfluxDB: Optimized for time-series writes (millions of writes/sec)
    • Real number: InfluxDB handles 100x more time-series writes than PostgreSQL
  2. Storage Efficiency:

    • Daily data: 10,000 sensors x 17,280 samples/day x 50 bytes = 8.64 GB/day raw
    • PostgreSQL: Row-based storage, no time-series compression: 8.64 GB/day
    • InfluxDB: Columnar + time-series compression (similar values compress well): 864 MB/day (90% reduction)
    • Real number: InfluxDB uses 10x less storage for time-series data
  3. Query Performance for Time Ranges:

    PostgreSQL approach:

    SELECT date_trunc('hour', timestamp), AVG(temperature)
    FROM sensor_readings
    WHERE sensor_location = 'Floor3'
      AND timestamp > NOW() - INTERVAL '7 days'
    GROUP BY date_trunc('hour', timestamp);
    • Scans full table, filters rows, then aggregates
    • Query time: 30-60 seconds on billions of rows

    InfluxDB approach:

    SELECT MEAN(temperature)
    FROM sensors
    WHERE location = 'Floor3'
      AND time > now() - 7d
    GROUP BY time(1h);
    • Time-based indexing, automatic downsampling
    • Query time: 100-500ms (100x faster!)
  4. Automatic Retention Policies:

    PostgreSQL:

    -- Manual cron job required
    DELETE FROM sensor_readings WHERE timestamp < NOW() - INTERVAL '30 days';
    -- Runs for hours, locks table

    InfluxDB:

    -- Built-in retention policy
    CREATE RETENTION POLICY "30days" ON "sensors" DURATION 30d REPLICATION 1 DEFAULT;
    -- Automatic, no performance impact

Decision Rule:

Use TIME-SERIES DB when:
- High write volume (>1,000 writes/sec)
- Time-range queries common ("last hour", "past week")
- Data has timestamps and is append-only
- Automatic downsampling/retention needed
- Examples: InfluxDB, TimescaleDB, Prometheus

Use RELATIONAL DB when:
- Transactional updates (UPDATE/DELETE common)
- Complex joins across many tables
- ACID compliance critical
- Data is not primarily time-ordered
- Examples: PostgreSQL, MySQL
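The decision rule above can be encoded as a small checklist function (a sketch; the thresholds and parameter names are illustrative, taken from this section rather than any standard):

```python
# Checklist function encoding the decision rule above.
# Thresholds and parameter names are illustrative.

def suggest_database(writes_per_sec: int,
                     time_range_queries: bool,
                     append_only: bool,
                     needs_transactions: bool) -> str:
    """Return a rough database-category suggestion for a workload."""
    if needs_transactions:
        return "relational"  # ACID and UPDATE/DELETE-heavy workloads
    if writes_per_sec > 1000 and time_range_queries and append_only:
        return "time-series"
    return "relational"

# Smart-building scenario from Section 1262.5: 2,000 writes/sec,
# frequent time-range queries, append-only sensor data.
print(suggest_database(2000, True, True, False))  # time-series
```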

1262.6 Technology Selection Framework

1262.6.1 Technology Comparison Table

| Requirement | Traditional DB (PostgreSQL) | Time-Series DB (InfluxDB) | Stream Processing (Kafka + Spark) | Batch Processing (Hadoop) |
|---|---|---|---|---|
| Write Throughput | 1K-10K writes/sec | 100K-1M writes/sec | 100K-10M events/sec | N/A (batch loads) |
| Storage Cost | $0.10/GB/month | $0.10/GB (but 10x compressed) | $0.023/GB (S3) | $0.023/GB (HDFS/S3) |
| Query Latency | 100ms-10s | 10ms-1s (time-range) | 100ms-5s (real-time) | Minutes to hours |
| Data Retention | Manual deletion | Auto retention policies | Kafka: 1-30 days | Years (petabytes) |
| Use Case | Transactional apps | Sensor data, metrics | Real-time dashboards | Historical analytics |
| Cost (1M events/day) | $500/month | $50/month | $200/month | $20/month (batch) |

1262.6.2 Common Pitfall: Using the Wrong Database

Caution
Pitfall: Using the Wrong Database for Time-Series Data

The Mistake:

-- Store sensor data in MySQL (row-oriented, general-purpose)
CREATE TABLE sensors (
    timestamp DATETIME,
    sensor_id INT,
    temperature FLOAT,
    humidity FLOAT
);

-- Query: Get average temperature per hour for last 30 days
SELECT DATE_FORMAT(timestamp, '%Y-%m-%d %H:00:00'),
       AVG(temperature)
FROM sensors
WHERE timestamp > NOW() - INTERVAL 30 DAY
GROUP BY DATE_FORMAT(timestamp, '%Y-%m-%d %H:00:00');

Why It's Slow:
- MySQL scans the entire table even when only the temperature column is needed
- Row-oriented storage reads all columns (timestamp, sensor_id, temperature, humidity)
- No automatic time-based partitioning
- Grouping by hour requires DATE_FORMAT string parsing
- Query time: 45 seconds on 100M rows

The Fix: Use time-series database like InfluxDB

-- InfluxDB query (columnar, time-optimized)
SELECT MEAN(temperature)
FROM sensors
WHERE time > now() - 30d
GROUP BY time(1h)

Why It's Fast:
- Columnar storage: only the time and temperature columns are read
- Automatic time-based sharding (data partitioned by day/hour)
- Native time aggregation (GROUP BY time(1h) is optimized)
- Query time: 2 seconds on the same 100M rows

Performance: roughly 22x faster with a purpose-built database

1262.7 Summary

  • Apache Hadoop ecosystem provides distributed processing through HDFS for storage, MapReduce and Spark for computation, and frameworks like Kafka for real-time streaming.
  • Apache Spark offers 10-100x faster processing than MapReduce through in-memory computation, supporting SQL queries, machine learning, and streaming in a unified platform.
  • Apache Kafka handles millions of events per second with millisecond latency, enabling real-time IoT data pipelines with built-in durability and scalability.
  • Time-series databases like InfluxDB provide 100x better write performance and 10x storage compression compared to traditional databases for sensor data workloads.

1262.8 What's Next

Now that you understand the technology landscape, continue to: