%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ecf0f1', 'background': '#ffffff', 'mainBkg': '#2C3E50', 'secondBkg': '#16A085', 'tertiaryBkg': '#E67E22'}}}%%
graph TB
subgraph Applications["Application Layer"]
App1[Hive<br/>SQL Queries]
App2[Pig<br/>Scripting]
App3[Mahout<br/>Machine Learning]
end
subgraph Processing["Processing Layer"]
Spark[Apache Spark<br/>In-Memory Processing]
MR[MapReduce<br/>Batch Processing]
end
subgraph Resource["Resource Management"]
YARN[YARN<br/>Resource Scheduler]
end
subgraph Storage["Storage Layer"]
HDFS[HDFS<br/>Distributed Storage]
end
App1 & App2 & App3 --> Processing
Spark & MR --> YARN
YARN --> HDFS
style HDFS fill:#2C3E50,stroke:#16A085,color:#fff
style YARN fill:#16A085,stroke:#2C3E50,color:#fff
style Spark fill:#E67E22,stroke:#2C3E50,color:#fff
style MR fill:#2C3E50,stroke:#16A085,color:#fff
1262 Big Data Technologies
Learning Objectives
After completing this chapter, you will be able to:
- Understand the Apache Hadoop ecosystem and its components
- Differentiate between batch processing (Spark) and stream processing (Kafka, Flink)
- Select appropriate time-series databases for IoT workloads
- Design technology stacks for specific IoT scenarios
1262.1 Apache Hadoop Ecosystem
Apache Hadoop Ecosystem: The four-layer architecture with HDFS for distributed storage, YARN for resource management, Spark/MapReduce for processing, and Hive/Pig/Mahout for applications.
1262.1.1 Key Components
| Component | Purpose | Use in IoT |
|---|---|---|
| HDFS | Distributed storage | Store petabytes of sensor data across commodity hardware |
| MapReduce | Batch processing | Process historical sensor data in parallel |
| Spark | In-memory processing | Real-time analytics, 10-100x faster than MapReduce |
| Hive | SQL on Hadoop | Query sensor data with SQL instead of code |
| Kafka | Streaming data | Handle real-time sensor streams |
| HBase | NoSQL database | Store time-series sensor readings |
| YARN | Resource management | Schedule processing jobs across the cluster |
| ZooKeeper | Coordination | Manage distributed system configuration |
1262.1.2 HDFS: Distributed File System
Why HDFS for IoT?
- Handles petabytes of data across thousands of machines
- Automatic replication (default: 3 copies) prevents data loss
- Designed for large files (GB-TB) and sequential reads
HDFS Architecture:
NameNode (Master):
- Stores file metadata (which blocks on which DataNodes)
- Single point of coordination
DataNodes (Workers):
- Store actual data blocks (128 MB default)
- Report health to NameNode
- Replicate blocks to other DataNodes
Example: 1 TB file stored as:
- 8,192 blocks x 128 MB each
- 3 copies = 24,576 blocks total
- Distributed across 100+ DataNodes
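A quick back-of-the-envelope sketch of that arithmetic in Python (block size and replication factor are the HDFS defaults; both are configurable):
# HDFS sizing for the 1 TB example above
block_size_mb = 128              # HDFS default block size
replication = 3                  # HDFS default replication factor
file_size_mb = 1 * 1024 * 1024   # 1 TB expressed in MB

blocks = file_size_mb // block_size_mb   # 8,192 logical blocks
replicas = blocks * replication          # 24,576 stored block replicas
raw_storage_tb = file_size_mb * replication / (1024 * 1024)

print(f"{blocks:,} blocks, {replicas:,} replicas, {raw_storage_tb:.0f} TB raw storage")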
1262.1.3 MapReduce: The Original Big Data Engine
How It Works:
1. Map phase: Split data, process in parallel
2. Shuffle phase: Group results by key
3. Reduce phase: Aggregate grouped data
IoT Example: Calculate Average Temperature by City
%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ecf0f1'}}}%%
flowchart TB
subgraph Input["Input Data"]
I1[Block 1: NYC:72, LA:85, NYC:74]
I2[Block 2: LA:82, NYC:71, CHI:65]
I3[Block 3: CHI:68, LA:88, NYC:73]
end
subgraph Map["Map Phase (Parallel)"]
M1[Mapper 1<br/>NYC:72, LA:85, NYC:74]
M2[Mapper 2<br/>LA:82, NYC:71, CHI:65]
M3[Mapper 3<br/>CHI:68, LA:88, NYC:73]
end
subgraph Shuffle["Shuffle (Group by Key)"]
S1[NYC: 72, 74, 71, 73]
S2[LA: 85, 82, 88]
S3[CHI: 65, 68]
end
subgraph Reduce["Reduce Phase"]
R1[NYC: AVG = 72.5F]
R2[LA: AVG = 85F]
R3[CHI: AVG = 66.5F]
end
I1 --> M1
I2 --> M2
I3 --> M3
M1 & M2 & M3 --> S1 & S2 & S3
S1 --> R1
S2 --> R2
S3 --> R3
style I1 fill:#2C3E50,stroke:#16A085,color:#fff
style I2 fill:#2C3E50,stroke:#16A085,color:#fff
style I3 fill:#2C3E50,stroke:#16A085,color:#fff
style M1 fill:#16A085,stroke:#2C3E50,color:#fff
style M2 fill:#16A085,stroke:#2C3E50,color:#fff
style M3 fill:#16A085,stroke:#2C3E50,color:#fff
style S1 fill:#E67E22,stroke:#2C3E50,color:#fff
style S2 fill:#E67E22,stroke:#2C3E50,color:#fff
style S3 fill:#E67E22,stroke:#2C3E50,color:#fff
style R1 fill:#27AE60,stroke:#2C3E50,color:#fff
style R2 fill:#27AE60,stroke:#2C3E50,color:#fff
style R3 fill:#27AE60,stroke:#2C3E50,color:#fff
MapReduce Data Flow: Input data is split across mappers, shuffled by city key, and reduced to calculate average temperatures. Parallel execution across nodes enables processing terabytes in minutes.
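To make the three phases concrete, here is a minimal single-machine sketch of the same computation in plain Python. A real MapReduce job distributes each phase across many nodes, but the per-phase logic is the same:
from collections import defaultdict

# Input blocks from the diagram above, as (city, temperature) pairs
blocks = [
    [("NYC", 72), ("LA", 85), ("NYC", 74)],
    [("LA", 82), ("NYC", 71), ("CHI", 65)],
    [("CHI", 68), ("LA", 88), ("NYC", 73)],
]

# Map phase: each "mapper" emits key-value pairs for its block
mapped = [pair for block in blocks for pair in block]

# Shuffle phase: group values by key (city)
grouped = defaultdict(list)
for city, temp in mapped:
    grouped[city].append(temp)

# Reduce phase: aggregate each group
for city, temps in grouped.items():
    print(f"{city}: AVG = {sum(temps) / len(temps):.1f}F")
# NYC: 72.5F, LA: 85.0F, CHI: 66.5F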
1262.2 Apache Spark: In-Memory Processing
Spark is the modern replacement for MapReduce, offering 10-100x faster processing through in-memory computation.
%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ecf0f1'}}}%%
graph TB
subgraph SparkCore["Spark Core (RDD)"]
SC[Resilient Distributed<br/>Datasets]
end
subgraph Libraries["Spark Libraries"]
SQL[Spark SQL<br/>Structured Data]
ML[MLlib<br/>Machine Learning]
Stream[Spark Streaming<br/>Real-Time]
Graph[GraphX<br/>Graph Processing]
end
SQL & ML & Stream & Graph --> SparkCore
subgraph Cluster["Cluster Manager"]
YARN[YARN]
Mesos[Mesos]
K8s[Kubernetes]
Standalone[Standalone]
end
SparkCore --> Cluster
subgraph Storage["Storage"]
HDFS2[HDFS]
S3[S3]
Cassandra[Cassandra]
Kafka2[Kafka]
end
Cluster --> Storage
style SparkCore fill:#E67E22,stroke:#2C3E50,color:#fff
style SQL fill:#2C3E50,stroke:#16A085,color:#fff
style ML fill:#16A085,stroke:#2C3E50,color:#fff
style Stream fill:#2C3E50,stroke:#16A085,color:#fff
style Graph fill:#16A085,stroke:#2C3E50,color:#fff
Apache Spark Architecture: Spark Core with RDDs provides the foundation, supporting SQL, ML, streaming, and graph libraries that run on various cluster managers and storage backends.
1262.2.1 Why Spark Over MapReduce?
| Feature | MapReduce | Spark |
|---|---|---|
| Processing speed | Disk-based (slow) | In-memory (fast) |
| Ease of use | Complex Java code | Simple Python/SQL |
| Iterative algorithms | Multiple disk writes | Single memory pass |
| Real-time | Not supported | Spark Streaming |
| Machine learning | Basic | MLlib (100+ algorithms) |
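The "Iterative algorithms" row is the difference that matters most in practice. The sketch below (the dataset path and column name are illustrative) reads the data once, caches it in executor memory, and reuses it across several passes; MapReduce would re-read from disk on every pass.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("iterative-demo").getOrCreate()

readings = spark.read.parquet("s3://iot-data/sensors/")  # illustrative path
readings.cache()  # keep the DataFrame in memory after the first action

# Each pass reuses the cached data instead of re-reading from storage
for threshold in (30.0, 35.0, 40.0):
    count = readings.filter(col("temperature") > threshold).count()
    print(f"readings above {threshold}: {count}")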
1262.2.2 Spark for IoT: Real-World Example
Processing 1 Billion Sensor Readings:
# Imports and session setup (assumes a configured Spark cluster with S3 access)
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window, avg, max, min  # shadows built-in max/min

spark = SparkSession.builder.appName("iot-hourly-averages").getOrCreate()

# Read sensor data from distributed storage
sensor_data = spark.read.parquet("s3://iot-data/sensors/")

# Filter, aggregate, and analyze in memory
hourly_averages = sensor_data \
    .filter(col("timestamp") > "2024-01-01") \
    .groupBy(window("timestamp", "1 hour"), "sensor_id") \
    .agg(avg("temperature").alias("avg_temp"),
         max("temperature").alias("max_temp"),
         min("temperature").alias("min_temp"))

# Write results back to storage
hourly_averages.write.parquet("s3://iot-data/hourly-summaries/")

# Performance:
# - MapReduce: 45 minutes (multiple disk read/write cycles)
# - Spark: 3 minutes (in-memory processing)
1262.3 Apache Kafka: Real-Time Streaming
%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ecf0f1'}}}%%
flowchart LR
subgraph Producers["Data Producers"]
P1[Temperature<br/>Sensors]
P2[Motion<br/>Sensors]
P3[Camera<br/>Metadata]
end
subgraph Kafka["Apache Kafka Cluster"]
T1[Topic: temperature<br/>Partitions: 10]
T2[Topic: motion<br/>Partitions: 5]
T3[Topic: camera<br/>Partitions: 20]
end
subgraph Consumers["Data Consumers"]
C1[Real-Time<br/>Dashboard]
C2[Anomaly<br/>Detection]
C3[Long-Term<br/>Storage]
end
P1 --> T1
P2 --> T2
P3 --> T3
T1 --> C1 & C2 & C3
T2 --> C1 & C2 & C3
T3 --> C1 & C2 & C3
style P1 fill:#2C3E50,stroke:#16A085,color:#fff
style P2 fill:#2C3E50,stroke:#16A085,color:#fff
style P3 fill:#2C3E50,stroke:#16A085,color:#fff
style T1 fill:#E67E22,stroke:#2C3E50,color:#fff
style T2 fill:#E67E22,stroke:#2C3E50,color:#fff
style T3 fill:#E67E22,stroke:#2C3E50,color:#fff
style C1 fill:#16A085,stroke:#2C3E50,color:#fff
style C2 fill:#16A085,stroke:#2C3E50,color:#fff
style C3 fill:#16A085,stroke:#2C3E50,color:#fff
Apache Kafka Architecture: Multiple IoT producers publish to topic partitions, which are consumed by real-time dashboards, anomaly detection systems, and storage pipelines simultaneously.
1262.3.1 Why Kafka for IoT?
| Feature | Value | IoT Benefit |
|---|---|---|
| Throughput | 100K-1M messages/sec | Handle sensor firehose |
| Latency | Single-digit ms | Real-time alerts |
| Durability | Replicated, persistent | No data loss |
| Scalability | Add brokers on demand | Grow with sensors |
| Multiple consumers | Read same data multiple times | Analytics + storage + alerts |
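The sketch below shows both ends of that pipeline using the kafka-python client (one of several Kafka client libraries); the broker address, topic name, and consumer group are placeholders. Because each consumer group keeps its own position in the topic, the same stream can feed a dashboard, anomaly detection, and long-term storage at once.
import json
from kafka import KafkaProducer, KafkaConsumer

# Producer side: batch small sensor messages for throughput
producer = KafkaProducer(
    bootstrap_servers=["localhost:9092"],   # placeholder broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    batch_size=16384,                       # 16 KB batches
    linger_ms=5,                            # wait up to 5 ms to fill a batch
)
producer.send("sensor-readings", {"sensor_id": "temp-042", "temperature": 21.7})
producer.flush()

# Consumer side: one consumer group among several reading the same topic
consumer = KafkaConsumer(
    "sensor-readings",
    bootstrap_servers=["localhost:9092"],
    group_id="realtime-dashboard",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.value)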
1262.3.2 Kafka Configuration for IoT
# Production Kafka configuration for IoT
kafka_config:
  brokers: 3                  # Minimum for production
  replication_factor: 3       # Survive 2 broker failures
  partitions_per_topic: 10    # Parallelism for consumers
  topics:
    sensor-readings:
      partitions: 20          # High throughput
      retention_hours: 168    # 7 days for replay
      compression: lz4        # 50% size reduction
    alerts:
      partitions: 5           # Lower volume
      retention_hours: 720    # 30 days for audit
  performance:
    batch_size: 16384         # 16 KB batches
    linger_ms: 5              # Wait 5 ms for batch fill
    buffer_memory: 33554432   # 32 MB producer buffer
1262.4 Time-Series Databases
IoT data is inherently time-series - continuous streams of timestamped readings. Specialized databases handle this pattern 100x better than general-purpose databases.
1262.4.1 Time-Series Database Comparison
| Database | Type | Best For | Write Speed | Query Speed | Compression |
|---|---|---|---|---|---|
| InfluxDB | Time-series | High-frequency sensor data | 100K+ writes/sec | Fast time-range | 10x |
| TimescaleDB | Time-series (PostgreSQL) | SQL-compatible analytics | 50K writes/sec | Complex joins | 8x |
| Prometheus | Metrics | System monitoring | 100K samples/sec | PromQL queries | 5x |
| Cassandra | Wide-column | Massive scale writes | 1M+ writes/sec | Key-based lookup | 3x |
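To make the InfluxDB row concrete, here is a minimal write-path sketch using the influxdb-client package for InfluxDB 2.x; the URL, token, org, and bucket names are placeholders.
from datetime import datetime, timezone

from influxdb_client import InfluxDBClient, Point, WritePrecision
from influxdb_client.client.write_api import SYNCHRONOUS

client = InfluxDBClient(url="http://localhost:8086", token="my-token", org="my-org")
write_api = client.write_api(write_options=SYNCHRONOUS)

# One point = one timestamped reading; tags are indexed, fields hold the values
point = (
    Point("sensor_readings")
    .tag("sensor_id", "floor3-001")
    .field("temperature", 22.5)
    .time(datetime.now(timezone.utc), WritePrecision.NS)
)
write_api.write(bucket="iot", record=point)
client.close()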
1262.4.2 Why Time-Series DBs Are Different
%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ecf0f1'}}}%%
flowchart TB
subgraph Traditional["Traditional DB (Row-Based)"]
T1[Row 1: timestamp, sensor_id, temp, humidity, pressure]
T2[Row 2: timestamp, sensor_id, temp, humidity, pressure]
T3[Row 3: timestamp, sensor_id, temp, humidity, pressure]
T4[Query: AVG temp<br/>Reads ALL columns<br/>100% I/O]
end
subgraph TimeSeries["Time-Series DB (Columnar)"]
C1[timestamp column: t1, t2, t3, ...]
C2[temp column: 72, 73, 71, ...]
C3[humidity column: 45, 46, 44, ...]
C4[pressure column: 1013, 1012, 1014, ...]
C5[Query: AVG temp<br/>Reads ONLY temp column<br/>20% I/O]
end
style T1 fill:#E67E22,stroke:#2C3E50,color:#fff
style T2 fill:#E67E22,stroke:#2C3E50,color:#fff
style T3 fill:#E67E22,stroke:#2C3E50,color:#fff
style T4 fill:#E74C3C,stroke:#2C3E50,color:#fff
style C1 fill:#16A085,stroke:#2C3E50,color:#fff
style C2 fill:#27AE60,stroke:#2C3E50,color:#fff
style C3 fill:#16A085,stroke:#2C3E50,color:#fff
style C4 fill:#16A085,stroke:#2C3E50,color:#fff
style C5 fill:#27AE60,stroke:#2C3E50,color:#fff
Columnar vs Row-Based Storage: Traditional databases read all columns for every query (100% I/O). Time-series columnar databases read only the requested columns (20% I/O for temperature queries), providing 5x performance improvement.
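A toy Python sketch of the same access pattern: with a row layout, averaging temperature walks over every field of every record; with a column layout, only the temperature values are touched. (Real columnar engines get this benefit at the storage and I/O layer, not in application code.)
# Row-based: every record carries every column
rows = [
    {"timestamp": 1, "temp": 72, "humidity": 45, "pressure": 1013},
    {"timestamp": 2, "temp": 73, "humidity": 46, "pressure": 1012},
    {"timestamp": 3, "temp": 71, "humidity": 44, "pressure": 1014},
]
row_avg = sum(r["temp"] for r in rows) / len(rows)   # touches whole records

# Columnar: each column is stored (and read) independently
columns = {
    "timestamp": [1, 2, 3],
    "temp": [72, 73, 71],
    "humidity": [45, 46, 44],
    "pressure": [1013, 1012, 1014],
}
col_avg = sum(columns["temp"]) / len(columns["temp"])  # touches only temp

print(row_avg, col_avg)  # both 72.0 - the difference is how much data was read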
1262.4.3 InfluxDB for IoT: Configuration Example
# InfluxDB configuration for IoT workload
influxdb_config:
  # Retention policies - automatic data lifecycle
  retention_policies:
    - name: "realtime"
      duration: "7d"          # Keep raw data 7 days
      replication: 2          # 2 copies for durability
      shard_duration: "1d"    # New shard daily
    - name: "downsampled"
      duration: "365d"        # Keep aggregates 1 year
      replication: 2
  # Continuous queries - automatic aggregation
  continuous_queries:
    - name: "downsample_temperature"
      query: |
        SELECT mean(temperature) AS avg_temp,
               max(temperature) AS max_temp,
               min(temperature) AS min_temp
        INTO downsampled.temperature_hourly
        FROM realtime.sensor_readings
        GROUP BY time(1h), sensor_id
  # Storage optimization
  storage:
    cache_snapshot_memory: "256m"
    compact_full_write_cold: "4h"
    index_version: "tsi1"     # Time-series index
1262.5 Understanding Check: Time-Series DB vs Regular DB
Scenario: A smart building has 10,000 sensors (temperature, humidity, occupancy, energy consumption) sampling every 5 seconds. You need to:
- Query: "Show me energy consumption trends for Floor 3 over the past week"
- Downsample: Keep 5-second data for 30 days, then 1-minute averages forever
- Retention: Automatically delete raw data older than 30 days
Think about: Why would you use InfluxDB (time-series DB) instead of PostgreSQL (relational DB)?
Key Insights:
Write Performance:
- Volume: 10,000 sensors x 12 samples/min = 120,000 writes/minute = 2,000 writes/second
- PostgreSQL: Optimized for transactional updates (hundreds of writes/sec)
- InfluxDB: Optimized for time-series writes (millions of writes/sec)
- Real number: InfluxDB handles 100x more time-series writes than PostgreSQL
Storage Efficiency:
- Daily data: 10,000 sensors x 17,280 samples/day x 50 bytes = 8.64 GB/day raw
- PostgreSQL: Row-based storage, no time-series compression: 8.64 GB/day
- InfluxDB: Columnar + time-series compression (similar values compress well): 864 MB/day (90% reduction)
- Real number: InfluxDB uses 10x less storage for time-series data
Query Performance for Time Ranges:
PostgreSQL approach:
SELECT date_trunc('hour', timestamp), AVG(temperature)
FROM sensor_readings
WHERE sensor_location = 'Floor3'
  AND timestamp > NOW() - INTERVAL '7 days'
GROUP BY date_trunc('hour', timestamp);
- Scans full table, filters rows, then aggregates
- Query time: 30-60 seconds on billions of rows
InfluxDB approach:
SELECT MEAN(temperature)
FROM sensors
WHERE location = 'Floor3' AND time > now() - 7d
GROUP BY time(1h);
- Time-based indexing, automatic downsampling
- Query time: 100-500ms (100x faster!)
Automatic Retention Policies:
PostgreSQL:
-- Manual cron job required
DELETE FROM sensor_readings
WHERE timestamp < NOW() - INTERVAL '30 days';
-- Runs for hours, locks table
InfluxDB:
-- Built-in retention policy
CREATE RETENTION POLICY "30days" ON "sensors"
  DURATION 30d REPLICATION 1 DEFAULT;
-- Automatic, no performance impact
Decision Rule:
Use TIME-SERIES DB when:
- High write volume (>1,000 writes/sec)
- Time-range queries common ("last hour", "past week")
- Data has timestamps and is append-only
- Automatic downsampling/retention needed
- Examples: InfluxDB, TimescaleDB, Prometheus
Use RELATIONAL DB when:
- Transactional updates (UPDATE/DELETE common)
- Complex joins across many tables
- ACID compliance critical
- Data is not primarily time-ordered
- Examples: PostgreSQL, MySQL
1262.6 Technology Selection Framework
1262.6.1 Technology Comparison Table
| Requirement | Traditional DB (PostgreSQL) | Time-Series DB (InfluxDB) | Stream Processing (Kafka + Spark) | Batch Processing (Hadoop) |
|---|---|---|---|---|
| Write Throughput | 1K-10K writes/sec | 100K-1M writes/sec | 100K-10M events/sec | N/A (batch loads) |
| Storage Cost | $0.10/GB/month | $0.10/GB (but 10x compressed) | $0.023/GB (S3) | $0.023/GB (HDFS/S3) |
| Query Latency | 100ms-10s | 10ms-1s (time-range) | 100ms-5s (real-time) | Minutes to hours |
| Data Retention | Manual deletion | Auto retention policies | Kafka: 1-30 days | Years (petabytes) |
| Use Case | Transactional apps | Sensor data, metrics | Real-time dashboards | Historical analytics |
| Cost (1M events/day) | $500/month | $50/month | $200/month | $20/month (batch) |
1262.6.2 Common Pitfall: Using the Wrong Database
The Mistake:
-- Store sensor data in MySQL (row-oriented, general-purpose)
CREATE TABLE sensors (
timestamp DATETIME,
sensor_id INT,
temperature FLOAT,
humidity FLOAT
);
-- Query: Get average temperature per hour for last 30 days
SELECT DATE_FORMAT(timestamp, '%Y-%m-%d %H:00:00'),
AVG(temperature)
FROM sensors
WHERE timestamp > NOW() - INTERVAL 30 DAY
GROUP BY DATE_FORMAT(timestamp, '%Y-%m-%d %H:00:00');
Why It's Slow:
- MySQL scans the entire table even if you only need the temperature column
- Row-oriented storage reads all columns (timestamp, sensor_id, temperature, humidity)
- No automatic time-based partitioning
- Grouping by hour requires complex DATE_FORMAT parsing
- Query time: 45 seconds on 100M rows
The Fix: Use a time-series database like InfluxDB
-- InfluxDB query (columnar, time-optimized)
SELECT MEAN(temperature)
FROM sensors
WHERE time > now() - 30d
GROUP BY time(1h)
Why It's Fast:
- Columnar storage: only reads the temperature and time columns
- Automatic time-based sharding (data partitioned by day/hour)
- Native time aggregation (GROUP BY time(1h) is optimized)
- Query time: 2 seconds on the same 100M rows
Performance: 22x faster with purpose-built database
1262.7 Summary
- Apache Hadoop ecosystem provides distributed processing through HDFS for storage, MapReduce and Spark for computation, and frameworks like Kafka for real-time streaming.
- Apache Spark offers 10-100x faster processing than MapReduce through in-memory computation, supporting SQL queries, machine learning, and streaming in a unified platform.
- Apache Kafka handles millions of events per second with millisecond latency, enabling real-time IoT data pipelines with built-in durability and scalability.
- Time-series databases like InfluxDB provide 100x better write performance and 10x storage compression compared to traditional databases for sensor data workloads.
1262.8 What's Next
Now that you understand the technology landscape, continue to:
- Big Data Pipelines - Design stream and batch processing architectures with Lambda architecture
- Big Data Operations - Learn monitoring, debugging, and production best practices
- Big Data Overview - Return to the chapter index