28 Big Data Technologies
Learning Objectives
After completing this chapter, you will be able to:
- Explain the Apache Hadoop ecosystem architecture including HDFS, YARN, Spark, and Hive components
- Compare batch processing (Spark) and stream processing (Kafka, Flink) by throughput, latency, and cost
- Select appropriate time-series databases for IoT workloads based on write speed and query patterns
- Design technology stacks for specific IoT scenarios matching tools to data volume and latency requirements
Key Concepts
- Apache Kafka: A distributed message streaming platform designed for high-throughput, low-latency event pipelines; the de facto standard for ingesting high-velocity IoT sensor streams.
- Apache Spark: A distributed in-memory data processing engine supporting batch analytics, stream processing (Structured Streaming), and ML (MLlib) on large IoT datasets.
- Apache Flink: A distributed stream processing framework optimised for stateful, low-latency, exactly-once stream processing — preferred over Spark for sub-second IoT latency requirements.
- HDFS (Hadoop Distributed File System): A fault-tolerant distributed storage system that partitions large datasets across commodity servers with 3x replication, the foundational storage layer for Hadoop-based analytics.
- InfluxDB: A purpose-built time-series database optimised for storing and querying timestamped sensor readings with automatic data retention and downsampling policies.
- Apache Arrow: A cross-language in-memory columnar data format that enables zero-copy reads between processing frameworks, dramatically reducing serialisation overhead in multi-system IoT pipelines.
For Beginners: Big Data Technologies
Big data technologies are specialized tools built for data that is too large, too fast, or too varied for ordinary databases. Think of them as industrial kitchen equipment compared to home appliances – built for scale and speed. Tools like Apache Kafka handle fast-moving streams while Apache Spark processes massive datasets in parallel.
28.1 Apache Hadoop Ecosystem
How It Works: Hadoop MapReduce for IoT Sensor Data
MapReduce processes massive IoT datasets by splitting work across hundreds of machines in three phases.
Example: Calculate Average Temperature by City from 1 Billion Sensor Readings
Step 1 - Map Phase (Parallel across 100 nodes):
Input (raw sensor log):
sensor_123, NYC, 2024-01-15T10:00:00, 22.5°C
sensor_456, LA, 2024-01-15T10:00:05, 28.3°C
sensor_789, NYC, 2024-01-15T10:00:10, 23.1°C
...
Map function (runs on each node):
def map(line):
    sensor_id, city, timestamp, temp = parse(line)
    emit(city, temp)  # Output: (NYC, 22.5), (LA, 28.3), (NYC, 23.1)
Result: 1 billion (city, temperature) pairs distributed across nodes
Step 2 - Shuffle Phase (Hadoop framework does this automatically):
Group all temperatures by city key:
NYC → [22.5, 23.1, 22.8, ... 230 million values]
LA → [28.3, 27.9, 29.1, ... 180 million values]
CHI → [... 150 million values]
Step 3 - Reduce Phase (Parallel across cities):
Reduce function (runs once per city):
def reduce(city, temperature_list):
    total = sum(temperature_list)
    count = len(temperature_list)
    avg = total / count
    emit(city, avg)
Final output:
NYC, 22.9°C
LA, 28.4°C
CHI, 19.2°C
Why This Is Fast: 100 nodes each process 10 million rows in parallel. What would take one node 100 hours finishes in roughly 1 hour across 100 nodes. Hadoop also handles node failures automatically: if a node crashes mid-job, its unfinished tasks are reassigned to a healthy node.
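The three phases above can be sketched on a single machine in plain Python. This is a toy illustration of the control flow only; in real Hadoop the framework distributes map tasks across nodes, shuffles by key over the network, and runs reducers in parallel. All function names here are illustrative, not Hadoop APIs.

```python
from collections import defaultdict

# Toy single-machine sketch of the Map -> Shuffle -> Reduce phases.
raw_lines = [
    "sensor_123,NYC,2024-01-15T10:00:00,22.5",
    "sensor_456,LA,2024-01-15T10:00:05,28.3",
    "sensor_789,NYC,2024-01-15T10:00:10,23.1",
]

def map_phase(line):
    """Emit a (city, temperature) pair from one raw log line."""
    _, city, _, temp = line.split(",")
    return (city, float(temp))

def shuffle(pairs):
    """Group values by key, as the Hadoop framework does automatically."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(city, temps):
    """Average all temperatures collected for one city."""
    return (city, sum(temps) / len(temps))

pairs = [map_phase(line) for line in raw_lines]       # Map
grouped = shuffle(pairs)                              # Shuffle
averages = dict(reduce_phase(c, t) for c, t in grouped.items())  # Reduce
print(averages)  # NYC averages (22.5 + 23.1) / 2 = 22.8
```

At scale, only `map_phase` and `reduce_phase` are user code; the shuffle is what the framework provides for free.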
Apache Hadoop Ecosystem: The four-layer architecture with HDFS for distributed storage, YARN for resource management, Spark/MapReduce for processing, and Hive/Pig/Mahout for applications.
Hadoop Component Deep Dive
28.1.1 Key Components
| Component | Purpose | Use in IoT |
|---|---|---|
| HDFS | Distributed storage | Store petabytes of sensor data across commodity hardware |
| MapReduce | Batch processing | Process historical sensor data in parallel |
| Spark | In-memory processing | Real-time analytics, 10-100x faster than MapReduce |
| Hive | SQL on Hadoop | Query sensor data with SQL instead of code |
| Kafka | Streaming data | Handle real-time sensor streams |
| HBase | NoSQL database | Store time-series sensor readings |
| YARN | Resource management | Schedule processing jobs across the cluster |
| ZooKeeper | Coordination | Manage distributed system configuration |
28.1.2 HDFS: Distributed File System
Why HDFS for IoT?
- Handles petabytes of data across thousands of machines
- Automatic replication (default: 3 copies) prevents data loss
- Designed for large files (GB-TB) and sequential reads
HDFS Architecture:
NameNode (Master):
- Stores file metadata (which blocks on which DataNodes)
- Single point of coordination
DataNodes (Workers):
- Store actual data blocks (128 MB default)
- Report health to NameNode
- Replicate blocks to other DataNodes
Example: 1 TB file stored as:
- 8,192 blocks x 128 MB each
- 3 copies = 24,576 blocks total
- Distributed across 100+ DataNodes
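The block arithmetic above is easy to reproduce. The following sketch computes the same figures for any file size; `hdfs_footprint` is a hypothetical helper, not part of any Hadoop API.

```python
import math

def hdfs_footprint(file_size_gb, block_mb=128, replication=3):
    """Estimate how HDFS splits and replicates a file of the given size,
    using the default 128 MB block size and 3x replication."""
    blocks = math.ceil(file_size_gb * 1024 / block_mb)
    return {
        "blocks": blocks,
        "replicated_blocks": blocks * replication,
        "raw_storage_gb": file_size_gb * replication,
    }

# A 1 TB (1024 GB) file: 8,192 blocks, 24,576 with replication
print(hdfs_footprint(1024))
```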
28.1.3 MapReduce: The Original Big Data Engine
How It Works:
- Map phase: Split data, process in parallel
- Shuffle phase: Group results by key
- Reduce phase: Aggregate grouped data
IoT Example: Calculate Average Temperature by City
MapReduce Data Flow: Input data is split across mappers, shuffled by city key, and reduced to calculate average temperatures. Parallel execution across nodes enables processing terabytes in minutes.
28.1.4 HDFS Storage Calculator
Estimate how HDFS distributes and replicates your IoT data across a cluster.
28.2 Apache Spark: In-Memory Processing
Spark is the modern replacement for MapReduce, offering 10-100x faster processing through in-memory computation.
Apache Spark Architecture: Spark Core with RDDs provides the foundation, supporting SQL, ML, streaming, and graph libraries that run on various cluster managers and storage backends.
28.2.1 Why Spark Over MapReduce?
| Feature | MapReduce | Spark |
|---|---|---|
| Processing speed | Disk-based (slow) | In-memory (fast) |
| Ease of use | Complex Java code | Simple Python/SQL |
| Iterative algorithms | Multiple disk writes | Single memory pass |
| Real-time | Not supported | Spark Streaming |
| Machine learning | Basic | MLlib (100+ algorithms) |
28.2.2 Spark for IoT: Real-World Example
Processing 1 Billion Sensor Readings:
# Assumes PySpark with an active SparkSession named `spark`
from pyspark.sql.functions import col, window, avg, max, min

# Read sensor data from distributed storage
sensor_data = spark.read.parquet("s3://iot-data/sensors/")
# Filter, aggregate, and analyze in memory
hourly_averages = sensor_data \
    .filter(col("timestamp") > "2024-01-01") \
    .groupBy(window("timestamp", "1 hour"), "sensor_id") \
    .agg(avg("temperature").alias("avg_temp"),
         max("temperature").alias("max_temp"),
         min("temperature").alias("min_temp"))
# Write results back to storage
hourly_averages.write.parquet("s3://iot-data/hourly-summaries/")
# Performance:
# - MapReduce: 45 minutes (multiple disk read/write cycles)
# - Spark: 3 minutes (in-memory processing)
28.3 Apache Kafka: Real-Time Streaming
While Spark excels at processing data already in storage, Apache Kafka handles the real-time ingestion challenge – getting millions of sensor events per second from producers to consumers with millisecond latency.
Apache Kafka Architecture: Multiple IoT producers publish to topic partitions, which are consumed by real-time dashboards, anomaly detection systems, and storage pipelines simultaneously.
28.3.1 Why Kafka for IoT?
| Feature | Value | IoT Benefit |
|---|---|---|
| Throughput | 100K-1M messages/sec | Handle sensor firehose |
| Latency | Single-digit ms | Real-time alerts |
| Durability | Replicated, persistent | No data loss |
| Scalability | Add brokers on demand | Grow with sensors |
| Multiple consumers | Read same data multiple times | Analytics + storage + alerts |
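Kafka achieves parallelism by splitting each topic into partitions and routing messages by key, so all readings from one sensor land in the same partition and stay ordered. The sketch below imitates that key-to-partition mapping. Real Kafka producers hash keys with murmur2; `crc32` is used here only as a deterministic stand-in, and `partition_for` is a hypothetical helper, not a Kafka API.

```python
import zlib

NUM_PARTITIONS = 10  # illustrative partition count for a sensor topic

def partition_for(sensor_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a message key to a partition, Kafka-style: same key, same
    partition, which preserves per-sensor ordering."""
    return zlib.crc32(sensor_id.encode()) % num_partitions

# The same sensor always routes to the same partition
p1 = partition_for("sensor_123")
p2 = partition_for("sensor_123")
assert p1 == p2
print(f"sensor_123 -> partition {p1}")
```

This is also why adding partitions to a live topic must be done carefully: the key-to-partition mapping changes, breaking per-key ordering across the boundary.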
28.3.2 Kafka Configuration for IoT
# Production Kafka configuration for IoT
kafka_config:
brokers: 3 # Minimum for production
replication_factor: 3 # Survive 2 broker failures
partitions_per_topic: 10 # Parallelism for consumers
topics:
sensor-readings:
partitions: 20 # High throughput
retention_hours: 168 # 7 days for replay
compression: lz4 # 50% size reduction
alerts:
partitions: 5 # Lower volume
retention_hours: 720 # 30 days for audit
performance:
batch_size: 16384 # 16 KB batches
linger_ms: 5 # Wait 5ms for batch fill
buffer_memory: 33554432 # 32 MB producer buffer
28.4 Technology Selection Decision Framework
Choosing the right big data technology depends on your latency requirement, data volume, and query pattern. Use this decision tree:
Decision Framework: Batch vs Stream vs Hybrid
Step 1 – What is your latency requirement?
| If insights needed within… | Use | Example |
|---|---|---|
| Milliseconds to seconds | Stream processing (Kafka + Flink) | Real-time anomaly detection |
| Minutes to hours | Micro-batch (Spark Streaming) | Dashboard updates every 5 min |
| Hours to days | Batch processing (Spark) | Daily energy reports |
| Mixed requirements | Lambda Architecture (Spark + Kafka) | Real-time alerts + daily analytics |
Step 2 – What is your data volume?
| Daily data volume | Recommended stack | Estimated monthly cost (AWS) |
|---|---|---|
| < 10 GB/day | Single-node PostgreSQL or InfluxDB | $50-200 |
| 10-100 GB/day | Managed Kafka + Spark (3-node cluster) | $500-2,000 |
| 100 GB - 1 TB/day | Kafka cluster + Spark cluster (10+ nodes) | $2,000-10,000 |
| > 1 TB/day | Multi-region Kafka + auto-scaling Spark | $10,000-50,000+ |
Step 3 – What is your query pattern?
| Primary query type | Best technology | Why |
|---|---|---|
| Time-range aggregations | InfluxDB / TimescaleDB | Optimized for WHERE time > X |
| Complex joins across tables | TimescaleDB (PostgreSQL) | Full SQL support |
| Key-value lookups | Redis / DynamoDB | Sub-millisecond reads |
| Full-text search on logs | Elasticsearch | Inverted index for text |
| ML feature engineering | Spark MLlib + Delta Lake | In-memory iterative processing |
Worked Example – Smart Factory (5,000 sensors at 1 Hz):
- Daily raw data: 5,000 sensors x 86,400 seconds x 50 bytes = 21.6 GB/day
- Latency requirement: Real-time alerts (< 1s) + daily reports
- Decision: Lambda Architecture with Kafka (streaming) + Spark (batch)
- Estimated cost: Kafka 3-broker cluster ($900/mo) + Spark 5-node cluster ($1,500/mo) + InfluxDB ($400/mo) = $2,800/month
- Compare to: all-cloud without edge processing adds cloud ingestion and compute costs – managed Kafka ingestion ($1,200/mo), additional Spark instances for raw data ($800/mo), and network transfer ($200/mo for 21.6 GB/day x 30 days) = $5,000/month total
- Edge processing (filtering 90% of data locally, sending only anomalies and aggregates) reduces cloud volume to 2.16 GB/day, cutting cloud costs to $1,510/month total
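The sizing arithmetic in the worked example can be reproduced in a few lines; the variable names are illustrative, and the 90% edge-filtering ratio is the assumption stated above.

```python
# Smart-factory sizing: 5,000 sensors at 1 Hz, ~50 bytes per reading
sensors, hz, bytes_per_reading = 5_000, 1, 50

daily_gb = sensors * hz * 86_400 * bytes_per_reading / 1e9
print(f"raw volume: {daily_gb:.1f} GB/day")  # 21.6 GB/day

# Edge processing keeps only ~10% (anomalies and aggregates)
edge_filtered_gb = daily_gb * 0.10
print(f"sent to cloud: {edge_filtered_gb:.2f} GB/day")  # 2.16 GB/day
```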
28.4.1 IoT Data Volume Estimator
Use this calculator to estimate your daily data volume and find the recommended technology stack for your IoT deployment.
28.5 Time-Series Databases
IoT data is inherently time-series - continuous streams of timestamped readings. Specialized databases handle this pattern 100x better than general-purpose databases.
28.5.1 Time-Series Database Comparison
| Database | Type | Best For | Write Speed | Query Speed | Compression |
|---|---|---|---|---|---|
| InfluxDB | Time-series | High-frequency sensor data | 100K+ writes/sec | Fast time-range | 10x |
| TimescaleDB | Time-series (PostgreSQL) | SQL-compatible analytics | 50K writes/sec | Complex joins | 8x |
| Prometheus | Metrics | System monitoring | 100K samples/sec | PromQL queries | 5x |
| Cassandra | Wide-column | Massive scale writes | 1M+ writes/sec | Key-based lookup | 3x |
28.5.2 Why Time-Series DBs Are Different
Columnar vs Row-Based Storage: Traditional databases read all columns for every query (100% I/O). Time-series columnar databases read only the requested columns (20% I/O for temperature queries), providing 5x performance improvement.
28.5.3 InfluxDB for IoT: Configuration Example
# InfluxDB configuration for IoT workload
influxdb_config:
# Retention policies - automatic data lifecycle
retention_policies:
- name: "realtime"
duration: "7d" # Keep raw data 7 days
replication: 2 # 2 copies for durability
shard_duration: "1d" # New shard daily
- name: "downsampled"
duration: "365d" # Keep aggregates 1 year
replication: 2
# Continuous queries - automatic aggregation
continuous_queries:
- name: "downsample_temperature"
query: |
SELECT mean(temperature) AS avg_temp,
max(temperature) AS max_temp,
min(temperature) AS min_temp
INTO downsampled.temperature_hourly
FROM realtime.sensor_readings
GROUP BY time(1h), sensor_id
# Storage optimization
storage:
cache_snapshot_memory: "256m"
compact_full_write_cold: "4h"
index_version: "tsi1" # Time-series index
28.6 Understanding Check: Time-Series DB vs Regular DB
Understanding Check: When to Use Time-Series Databases
Scenario: A smart building has 10,000 sensors (temperature, humidity, occupancy, energy consumption) sampling every 5 seconds. You need to:
- Query: “Show me energy consumption trends for Floor 3 over the past week”
- Downsample: Keep 5-second data for 30 days, then 1-minute averages forever
- Retention: Automatically delete raw data older than 30 days
Think about: Why would you use InfluxDB (time-series DB) instead of PostgreSQL (relational DB)?
Key Insights:
Write Performance:
- Volume: 10,000 sensors x 12 samples/min = 120,000 writes/minute = 2,000 writes/second
- PostgreSQL: Optimized for transactional updates (hundreds of writes/sec)
- InfluxDB: Optimized for time-series writes (millions of writes/sec)
- Real number: InfluxDB handles 100x more time-series writes than PostgreSQL
Storage Efficiency:
- Daily data: 10,000 sensors x 17,280 samples/day x 50 bytes = 8.64 GB/day raw
- PostgreSQL: Row-based storage, no time-series compression: 8.64 GB/day
- InfluxDB: Columnar + time-series compression (similar values compress well): 864 MB/day (90% reduction)
- Real number: InfluxDB uses 10x less storage for time-series data
Query Performance for Time Ranges:
PostgreSQL approach:
SELECT date_trunc('hour', timestamp), AVG(temperature)
FROM sensor_readings
WHERE sensor_location = 'Floor3'
  AND timestamp > NOW() - INTERVAL '7 days'
GROUP BY date_trunc('hour', timestamp);
- Scans full table, filters rows, then aggregates
- Query time: 30-60 seconds on billions of rows
InfluxDB approach:
SELECT MEAN(temperature)
FROM sensors
WHERE location = 'Floor3' AND time > now() - 7d
GROUP BY time(1h);
- Time-based indexing, automatic downsampling
- Query time: 100-500ms (100x faster!)
Automatic Retention Policies:
PostgreSQL:
-- Manual cron job required
DELETE FROM sensor_readings
WHERE timestamp < NOW() - INTERVAL '30 days';
-- Runs for hours, locks table
InfluxDB:
-- Built-in retention policy
CREATE RETENTION POLICY "30days" ON "sensors"
DURATION 30d REPLICATION 1 DEFAULT;
-- Automatic, no performance impact
Decision Rule:
Use TIME-SERIES DB when:
- High write volume (>1,000 writes/sec)
- Time-range queries common ("last hour", "past week")
- Data has timestamps and is append-only
- Automatic downsampling/retention needed
- Examples: InfluxDB, TimescaleDB, Prometheus
Use RELATIONAL DB when:
- Transactional updates (UPDATE/DELETE common)
- Complex joins across many tables
- ACID compliance critical
- Data is not primarily time-ordered
- Examples: PostgreSQL, MySQL
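The decision rule above can be captured in a small function. This is a hypothetical helper using the rough thresholds from this section as guidelines, not hard limits.

```python
def recommend_db(writes_per_sec: int, time_range_queries: bool,
                 needs_complex_joins: bool, transactional: bool) -> str:
    """Encode the time-series vs relational decision rule above.
    Thresholds are illustrative guidelines from this chapter."""
    # Transactional updates or heavy joins favor a relational database
    if transactional or needs_complex_joins:
        return "relational (PostgreSQL/MySQL)"
    # High append-only write volume or time-range queries favor a TSDB
    if writes_per_sec > 1_000 or time_range_queries:
        return "time-series (InfluxDB/TimescaleDB/Prometheus)"
    return "relational (PostgreSQL/MySQL)"

# The smart-building scenario: ~2,000 writes/sec, time-range queries
print(recommend_db(2_000, True, False, False))
```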
28.7 Technology Comparison
With all the major technologies covered, the following table provides a side-by-side comparison to guide your technology selection.
| Requirement | Traditional DB (PostgreSQL) | Time-Series DB (InfluxDB) | Stream Processing (Kafka + Spark) | Batch Processing (Hadoop) |
|---|---|---|---|---|
| Write Throughput | 1K-10K writes/sec | 100K-1M writes/sec | 100K-10M events/sec | N/A (batch loads) |
| Storage Cost | $0.10/GB/month | $0.10/GB (but 10x compressed) | $0.023/GB (S3) | $0.023/GB (HDFS/S3) |
| Query Latency | 100ms-10s | 10ms-1s (time-range) | 100ms-5s (real-time) | Minutes to hours |
| Data Retention | Manual deletion | Auto retention policies | Kafka: 1-30 days | Years (petabytes) |
| Use Case | Transactional apps | Sensor data, metrics | Real-time dashboards | Historical analytics |
| Cost (1M events/day) | $500/month | $50/month | $200/month | $20/month (batch) |
28.7.1 Common Pitfall: Using the Wrong Database
Pitfall: Using the Wrong Database for Time-Series Data
The Mistake:
-- Store sensor data in MySQL (row-oriented, general-purpose)
CREATE TABLE sensors (
timestamp DATETIME,
sensor_id INT,
temperature FLOAT,
humidity FLOAT
);
-- Query: Get average temperature per hour for last 30 days
SELECT DATE_FORMAT(timestamp, '%Y-%m-%d %H:00:00'),
AVG(temperature)
FROM sensors
WHERE timestamp > NOW() - INTERVAL 30 DAY
GROUP BY DATE_FORMAT(timestamp, '%Y-%m-%d %H:00:00');
Why It’s Slow:
- MySQL scans entire table even if you only need temperature column
- Row-oriented storage reads all columns (timestamp, sensor_id, temperature, humidity)
- No automatic time-based partitioning
- Grouping by hour requires complex DATE_FORMAT parsing
- Query time: 45 seconds on 100M rows
The Fix: Use time-series database like InfluxDB
-- InfluxDB query (columnar, time-optimized)
SELECT MEAN(temperature)
FROM sensors
WHERE time > now() - 30d
GROUP BY time(1h)
Why It’s Fast:
- Columnar storage: only reads the temperature and time columns
- Automatic time-based sharding (data partitioned by day/hour)
- Native time aggregation (GROUP BY time(1h) is optimized)
- Query time: 2 seconds on same 100M rows
Performance: 22x faster with purpose-built database
Putting Numbers to It
Time-Series DB Storage Efficiency: 10,000 sensors report temperature every 5 seconds. Daily readings: \(10,000 \times 17,280 = 172.8M\) readings.
Row-based storage (PostgreSQL): Each row stores timestamp (8 bytes), sensor_id (4 bytes), temperature (4 bytes), humidity (4 bytes) = 20 bytes/row. Daily storage: \(172.8M \times 20 = 3.46 \text{ GB/day}\). No compression.
Columnar storage (InfluxDB): Store columns separately. Temperature column: similar values compress well with delta encoding. Instead of storing \([22.1, 22.2, 22.1, 22.3]\), store \(22.1\) plus the deltas \([+0.1, -0.1, +0.2]\); the small deltas encode in far fewer bits than full floats. Compression ratio: 10:1.
Daily InfluxDB storage: \(3.46 / 10 = 0.346 \text{ GB/day}\). Annual savings: \((3.46 - 0.346) \times 365 \times \$0.023 = \$26.14\) storage cost alone. Query speedup: temperature column scans \(172.8M \times 4\text{ bytes} = 691\text{ MB}\) instead of \(3.46\text{ GB}\) (row-based must read all columns). Result: 5x faster queries, 10x less storage.
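The delta-encoding idea can be sketched in a few lines. Real time-series engines apply this per column with bit-level packing; this toy version just shows that the transform is lossless and that the deltas are small. The rounding step keeps the one-decimal readings exact despite floating-point arithmetic.

```python
def delta_encode(values):
    """Store the first value plus successive differences."""
    deltas = [round(b - a, 6) for a, b in zip(values, values[1:])]
    return values[0], deltas

def delta_decode(first, deltas):
    """Reconstruct the original series from first value + deltas."""
    out = [first]
    for d in deltas:
        out.append(round(out[-1] + d, 6))
    return out

readings = [22.1, 22.2, 22.1, 22.3]
first, deltas = delta_encode(readings)
print(first, deltas)  # 22.1 [0.1, -0.1, 0.2]

# Round-trip is lossless
assert delta_decode(first, deltas) == readings
```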
For Kids: Meet the Sensor Squad!
Think of big data tools like different helpers in a giant library – each one is best at a specific job!
28.7.2 The Sensor Squad Adventure: The Giant Library
Sammy the Sensor had been collecting temperature readings for a whole year. “I have MILLIONS of numbers!” she exclaimed. “How will I ever find patterns?”
Max the Microcontroller took Sammy to the Giant Data Library. “Meet our helpers!” he said.
First, they met Hadoop the Librarian. “I split big books into chapters and give each chapter to a different reader,” Hadoop explained. “Instead of one person reading the whole book, a hundred people each read one chapter at the same time. We finish 100 times faster!”
Next was Sparky the Speed Reader. “I’m like Hadoop, but I memorize everything instead of writing notes on paper. That makes me 10 to 100 times faster!” Sparky zoomed through Sammy’s data in minutes.
Then they met Kafka the Messenger. “I’m like a super-fast mail carrier,” Kafka said. “When new readings arrive, I deliver them to everyone who needs them – the dashboard, the alarm system, and the storage room – all at the same time, in just milliseconds!”
Finally, Lila the LED introduced them to Indy the Time-Keeper (InfluxDB). “I’m special because I ONLY organize things by time,” Indy said. “When Sammy asks ‘What was the temperature last Tuesday at 3pm?’ I can answer in a blink because that’s ALL I do!”
Bella the Battery smiled. “Each helper is amazing at their special job. The key is picking the right helper for the right task!”
28.7.3 Key Words for Kids
| Word | What It Means |
|---|---|
| Hadoop | A system that splits big jobs into small pieces for many computers to work on together |
| Spark | A super-fast data processor that keeps everything in memory instead of writing to disk |
| Kafka | A messenger that delivers data to many places at the same time, super quickly |
| Time-series database | A special database that organizes everything by time, perfect for sensor data |
Key Takeaway
Choosing the right big data technology depends on your IoT workload: use Hadoop/Spark for historical batch analysis of terabytes, Kafka for real-time streaming at millions of events per second, and time-series databases like InfluxDB for sensor data that needs fast time-range queries and automatic retention policies. Using a general-purpose database for time-series IoT data can be 22x slower than a purpose-built solution.
Common Pitfalls
1. Choosing Kafka for low-volume deployments
Kafka requires a cluster (typically 3+ brokers + ZooKeeper/KRaft) with significant operational overhead. For deployments with fewer than 1,000 messages per second, MQTT brokers or managed services (AWS IoT Core, Azure IoT Hub) are simpler and cheaper.
2. Using Spark for sub-second latency requirements
Spark’s micro-batch model introduces at least 100ms latency per batch. For real-time anomaly detection requiring <100ms response, use Apache Flink or a dedicated stream processing engine.
3. Running a general-purpose database for time-series IoT data
Storing millions of timestamped sensor readings in PostgreSQL without partitioning will produce table scans taking minutes for simple time-range queries. Use InfluxDB, TimescaleDB, or Cassandra with time-based partitioning.
4. Over-engineering with multiple frameworks for simple use cases
A small factory with 50 sensors does not need Kafka, Spark, and HDFS. A time-series database with a batch Python job often satisfies the requirement with 1/10th the operational complexity.
28.8 Summary
- Apache Hadoop ecosystem provides distributed processing through HDFS for storage, MapReduce and Spark for computation, and frameworks like Kafka for real-time streaming.
- Apache Spark offers 10-100x faster processing than MapReduce through in-memory computation, supporting SQL queries, machine learning, and streaming in a unified platform.
- Apache Kafka handles millions of events per second with millisecond latency, enabling real-time IoT data pipelines with built-in durability and scalability.
- Time-series databases like InfluxDB provide 100x better write performance and 10x storage compression compared to traditional databases for sensor data workloads.
28.9 Knowledge Check
Concept Relationships
Big Data Technologies connects to:
- Technology Selection: Storage tier (HDFS vs S3) → Processing tier (MapReduce vs Spark) → Query tier (Hive vs Presto) must align
- Lambda Architecture Dependencies: Batch layer uses Spark/Hadoop, Speed layer uses Kafka/Flink (Big Data Pipelines)
- Alternative Approaches: Time-series DBs (InfluxDB) replace general-purpose DBs (PostgreSQL) for 100x better IoT performance
- Cost Optimization: Edge processing (Edge Compute Patterns) reduces data volume before these technologies run
Technology Decision Matrix:
Data Volume → Processing Frequency → Technology Choice
├─ <1 TB/day, Hourly → Spark batch jobs on single node
├─ 1-10 TB/day, Real-time → Kafka + Flink cluster
└─ >10 TB/day, Both → Lambda (Kafka + Spark + Flink)
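The decision matrix above can be encoded as a small function for quick sanity checks. `pick_stack` is a hypothetical helper; the branches mirror the tree, and the 1 TB/day threshold is the guideline from this section.

```python
def pick_stack(tb_per_day: float, needs_realtime: bool, needs_batch: bool) -> str:
    """Mirror the Data Volume -> Processing Frequency decision tree."""
    if needs_realtime and needs_batch:
        return "Lambda (Kafka + Spark + Flink)"
    if needs_realtime:
        return "Kafka + Flink cluster"
    # Batch-only workloads: size the Spark deployment by volume
    if tb_per_day < 1:
        return "Spark batch jobs (single node)"
    return "Spark batch cluster"

print(pick_stack(0.5, False, True))   # small batch-only workload
print(pick_stack(20, True, True))     # large mixed workload -> Lambda
```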
Key Insight: Match tool to workload. Hadoop excels at massive batch jobs. Spark wins for iterative ML. Kafka dominates real-time streams. Time-series DBs beat all for sensor queries. Mixing wrong tool costs 10-100x performance.
See Also
Within This Module:
- Big Data Overview - Technology selection framework
- Big Data Pipelines - How technologies integrate in Lambda architecture
- Big Data Operations - Monitoring and debugging these systems
Related Modules:
- Data Storage and Databases - Storage technology foundations
- Stream Processing - Stream processing patterns
- Cloud Data Platforms - Managed versions of these technologies
External Resources:
- Apache Spark Documentation - Official Spark programming guide
- Confluent Kafka Tutorials - Hands-on Kafka patterns
- InfluxDB Best Practices - Time-series optimization
- Jay Kreps, “I ❤ Logs” (2014) - Kafka design philosophy
28.10 What’s Next
| If you want to… | Read this |
|---|---|
| Understand the pipeline architecture these technologies support | Big Data Pipelines |
| See how these technologies handle edge processing | Big Data Edge Processing |
| Explore operational deployments of these technologies | Big Data Case Studies |
| Apply ML on top of big data infrastructure | Modeling and Inferencing |
| Return to the module overview | Big Data Overview |