26  Big Data Overview

26.1 Big Data Overview

This section provides a stable anchor for cross-references to big data concepts across the module. Big data in IoT represents a fundamental shift from traditional data management, requiring distributed processing where data is filtered at the edge, aggregated at intermediate nodes, and only valuable insights reach cloud storage.

Learning Objectives

After completing this series of chapters, you will be able to:

  • Explain the characteristics of big data in IoT contexts using the 5 V’s framework
  • Analyze why traditional databases cannot handle IoT scale by comparing throughput limits
  • Implement edge processing strategies for bandwidth and cost optimization
  • Select appropriate big data technologies (Hadoop, Spark, Kafka, time-series databases) for specific workloads
  • Design stream processing systems with Lambda architecture combining batch and real-time layers
  • Debug and monitor production big data pipelines using key metrics and alerting thresholds
  • Apply lessons from real-world smart city deployments to new IoT projects

Key Concepts

  • 5 V's of big data: Volume (scale), Velocity (speed), Variety (types), Veracity (quality), and Value (business benefit) — the five dimensions that characterise IoT data challenges.
  • Lambda architecture: A data processing pattern combining a batch layer (high accuracy, high latency) with a speed layer (lower accuracy, low latency) to serve both historical and real-time queries.
  • Data lake: A centralised repository storing raw IoT data in its native format at massive scale, allowing any processing framework to read and analyse it as needed.
  • Time-series database: A database optimised for storing and querying sequential sensor readings indexed by timestamp, offering far better performance than general-purpose databases for IoT workloads.
  • Ingestion pipeline: The infrastructure that receives, validates, and routes incoming IoT sensor data from devices to storage and processing systems.
  • Horizontal scalability: The ability to increase system capacity by adding more commodity servers rather than upgrading existing ones — the fundamental design principle of big data infrastructure.

In 60 Seconds

IoT systems generate data at a scale — billions of sensor readings per day — that far exceeds what traditional databases can handle, requiring distributed big data infrastructure. The 5 V's framework (Volume, Velocity, Variety, Veracity, Value) is therefore the essential starting point for any IoT analytics architecture design. The key decision is choosing between batch processing for historical analysis and stream processing for real-time responses.

Prerequisites

Before diving into these chapters, make sure you have the minimum viable understanding below.

Minimum Viable Understanding
  • Big data in IoT is defined by 5 V’s – Volume (terabytes daily from thousands of sensors), Velocity (real-time streams arriving in milliseconds), Variety (structured telemetry mixed with images, audio, and logs), Veracity (noisy, incomplete, and sometimes incorrect sensor data), and Value (the actionable insights hidden in the noise).
  • Traditional tools break at IoT scale – a single SQL database cannot ingest millions of events per second, store petabytes economically, or process unbounded real-time streams, so you must adopt distributed architectures that split work across edge, fog, and cloud tiers.
  • Architecture follows latency requirements – use stream processing (Kafka/Flink) when you need sub-second responses, data lakes (S3/HDFS) for raw storage, and batch processing (Spark) for historical analysis; the Lambda architecture combines all three into one coherent pipeline.

Figure: The 5 V's of Big Data in IoT arranged as a star pattern with Volume at the center connecting to Velocity, Variety, Veracity, and Value, each with a brief description of its meaning in IoT contexts.

Sammy the Sensor was staring at a screen full of numbers when Lila the LED blinked over.

“What is ALL that, Sammy?” Lila asked, her light flickering with curiosity.

“This is one day of data from the Smart City,” Sammy replied. “Every traffic light, every air quality sensor, every parking meter, every weather station – they have ALL been sending me messages. Look at these numbers: 10 million temperature readings, 5 million traffic counts, 2 million air quality samples, and 500,000 parking updates. That is 17.5 million data points… from just ONE day!”

Max the Microcontroller whirred over. “That is what the engineers call Big Data! It is like trying to drink from a fire hose – the data comes SO fast that a normal computer cannot keep up.”

Bella the Battery added, “And here is the clever part – we do NOT send everything to the cloud. That would use too much of my energy! Instead, each sensor does a little math first. The traffic camera counts cars locally and only sends the COUNT, not every single video frame. That saves 99% of the data!”

“So Big Data is not just about having LOTS of data,” Sammy summarized. “It is about being SMART about which data matters. We filter at the edge, summarize in the middle, and only the really important stuff goes all the way to the cloud.”

“Just like how you do not tell your teacher every single thing that happened at recess,” Lila laughed. “You just share the highlights!”

The lesson: Big data in IoT is not about storing everything – it is about processing data at the right place (edge, fog, or cloud) so you keep only what matters.

If you are new to big data concepts, here is a simple way to think about it.

The Library Analogy

Imagine a library that receives 10,000 new books every single hour, 24 hours a day, 7 days a week. The books arrive in different languages (English, Spanish, Mandarin), different formats (paperback, hardcover, audiobook, e-book), and some books have pages missing or typos. Your job is to organize them, find useful information, and answer questions from readers – in real time.

That is exactly the challenge of big data in IoT:

  • Volume = the 10,000 books per hour (massive amounts of data)
  • Velocity = they arrive continuously, not in batches (real-time streams)
  • Variety = different languages and formats (structured sensor readings, unstructured images, semi-structured logs)
  • Veracity = missing pages and typos (noisy, incomplete sensor data)
  • Value = finding useful answers for readers (extracting actionable insights)

Why can’t a normal database handle this?

A traditional database is like a single librarian with one desk. When books arrive slowly, they can catalog each one carefully. But when 10,000 books per hour start arriving, the librarian falls hopelessly behind. You need a TEAM of librarians (distributed systems), working in parallel, with a smart system for dividing the work.

The key insight: IoT big data is not just “more data.” It requires fundamentally different tools and architectures – distributed storage (data lakes), real-time processing (stream engines), and intelligent filtering (edge computing) – because the old tools simply cannot keep up.

Three things to remember:

  1. Filter early – process data at the edge to reduce volume by 90-99%
  2. Process in parallel – use distributed systems like Spark and Kafka to split the work
  3. Store smartly – use time-series databases for telemetry, data lakes for raw storage, and only keep what you need

26.2 The IoT Big Data Challenge

IoT systems generate data at a scale that fundamentally breaks traditional data management approaches. To understand why, consider the numbers:

Figure: Data flow from IoT devices through edge processing, fog aggregation, and cloud analytics, with data volume reducing at each stage from terabytes at the device level to gigabytes at the edge to megabytes of insights in the cloud.

Scale Factor | Traditional System | IoT Big Data System
Devices | Dozens of users | 10,000 to 1,000,000 sensors
Data rate | Hundreds of queries/sec | Millions of events/sec
Data volume | Gigabytes per month | Terabytes per day
Latency requirement | Seconds acceptable | Milliseconds required
Data types | Structured rows | Mixed: numbers, images, logs, video
Retention | Months | Years (regulatory compliance)
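The data-rate gap in the table above can be turned into a quick sizing estimate. The sketch below uses assumed, illustrative throughput ceilings (`SINGLE_DB_WRITES_PER_SEC` and `KAFKA_BROKER_EVENTS_PER_SEC` are rough orders of magnitude, not benchmarks) to show why horizontal scaling is unavoidable at IoT event rates:

```python
# Back-of-envelope node counts for a given ingest rate.
# Capacity figures are illustrative assumptions, not measured benchmarks.
SINGLE_DB_WRITES_PER_SEC = 10_000      # rough ceiling for one relational DB node
KAFKA_BROKER_EVENTS_PER_SEC = 500_000  # rough per-broker ingest capacity

def nodes_needed(events_per_sec, per_node_capacity):
    """Minimum nodes to absorb an event rate (ceiling division, no replication)."""
    return -(-events_per_sec // per_node_capacity)

iot_rate = 2_000_000  # 2 million events/sec, within the table's range

print(f"Relational DB nodes needed: {nodes_needed(iot_rate, SINGLE_DB_WRITES_PER_SEC)}")
print(f"Kafka brokers needed:       {nodes_needed(iot_rate, KAFKA_BROKER_EVENTS_PER_SEC)}")
```

Scaling one relational database to 200 coordinated nodes is impractical, while a handful of Kafka brokers absorb the same rate by design; that asymmetry is the core argument for a purpose-built ingestion tier.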

26.3 Big Data Technology Landscape

Choosing the right technology depends on where your data sits in the processing pipeline. The following diagram maps the major tool categories to their roles:

Figure: The big data technology landscape organized into four layers: ingestion with Kafka and MQTT, stream processing with Flink and Spark Streaming, batch processing with Spark and Hadoop MapReduce, and storage with HDFS data lakes, time-series databases, and data warehouses.

26.4 Lambda Architecture for IoT

The Lambda architecture is the dominant pattern for IoT big data systems because it handles both real-time and historical analysis in a single unified design:

Figure: Lambda architecture, showing IoT data entering the system and splitting into two paths: a speed layer for real-time stream processing with sub-second latency, and a batch layer for comprehensive historical processing, both merging at the serving layer to provide complete query results.

Layer | Purpose | Technology | Latency | Accuracy
Speed layer | Real-time approximate results | Flink, Kafka Streams | Milliseconds | Approximate
Batch layer | Complete historical reprocessing | Spark, Hadoop | Hours | Exact
Serving layer | Merge and serve query results | Druid, Cassandra | Sub-second | Best available
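A minimal, illustrative sketch (hypothetical class names, not production code) makes the three layers concrete for a per-sensor event-counting query: the speed layer updates counts incrementally as events arrive, the batch layer recomputes exact counts from the full log, and the serving layer merges the exact batch view with the speed layer's recent delta:

```python
from collections import defaultdict

class SpeedLayer:
    """Approximate, low-latency: updates per-sensor counts as events arrive."""
    def __init__(self):
        self.counts = defaultdict(int)
    def ingest(self, event):
        self.counts[event["sensor"]] += 1

class BatchLayer:
    """Exact, high-latency: periodically recomputes counts from the full log."""
    def __init__(self):
        self.views = {}
    def recompute(self, log):
        counts = defaultdict(int)
        for event in log:
            counts[event["sensor"]] += 1
        self.views = dict(counts)

def serving_query(sensor, batch, speed):
    """Merge the exact batch view with the speed layer's post-batch delta."""
    return batch.views.get(sensor, 0) + speed.counts.get(sensor, 0)

log = [{"sensor": "t1"}] * 5
batch = BatchLayer()
batch.recompute(log)            # exact view: t1 -> 5

speed = SpeedLayer()            # in practice, reset after each batch run
for event in [{"sensor": "t1"}] * 2:   # events arriving after the batch run
    log.append(event)
    speed.ingest(event)

print(serving_query("t1", batch, speed))  # 7 = 5 exact + 2 recent
```

The essential design choice is that the speed layer only ever answers for the window since the last batch run, so its approximations are continually corrected by the batch layer's exact recomputation.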

26.5 Chapter Series

This topic has been organized into focused chapters for easier learning:

26.5.1 Big Data Fundamentals

Understanding the scale challenge: the 5 V’s of big data, why traditional databases fail at IoT scale, and the economics of distributed systems. Includes the Sensor Squad introduction for beginners.

26.5.2 Edge Processing for Big Data

The 90/10 rule: how edge computing reduces data volume by 99%, making impossible big data problems manageable. Cost comparisons and real-world traffic camera examples.

26.5.3 Big Data Technologies

Technology deep-dive: Apache Hadoop ecosystem (HDFS, Spark, Hive, Kafka), time-series databases (InfluxDB, TimescaleDB), and when to use each technology.

26.5.4 Big Data Pipelines

Architecture patterns: Lambda architecture combining batch and stream processing, ETL vs ELT, data lakes vs data warehouses, and windowing strategies for stream processing.

26.5.5 Big Data Operations

Production readiness: monitoring metrics, debugging common issues (Kafka lag, OOM errors, late arrivals), and avoiding the seven deadly pitfalls of IoT big data.

26.5.6 Big Data Case Studies

Real-world applications: Barcelona Smart City deployment with concrete numbers, worked examples for data lake architecture, and decision frameworks for technology selection.

Common Pitfalls in IoT Big Data

1. Sending everything to the cloud: The most expensive mistake in IoT. A traffic camera generating 30 frames/second at 2 MB/frame produces about 5 TB/day. Cloud ingestion at $0.02/GB means roughly $100/day per camera. Edge processing that sends only vehicle counts reduces this to pennies.

2. Using a relational database for time-series data: PostgreSQL can handle small IoT deployments, but query performance degrades dramatically as tables grow beyond a few hundred million rows. Time-series databases (InfluxDB, TimescaleDB) automatically partition by time and provide 10-100x faster range queries.

3. Ignoring data quality at the source: Bad sensor data (stuck readings, calibration drift, noise) propagates through the entire pipeline, poisoning ML models, triggering false alerts, and leading to wrong business decisions. Always validate at the edge before ingestion.

4. Building batch-only pipelines for real-time needs: If your application needs sub-second alerts (equipment failure, security breach), a nightly Spark job is too late. Match your architecture to your latency requirements: stream processing for real-time responses, batch processing for historical analysis.

5. No data lifecycle management: IoT data grows forever if you do not plan for it. Without retention policies and tiered storage (hot/warm/cold), storage costs grow linearly and eventually consume your entire budget. Implement automatic archival from day one.

26.6 Worked Example: Smart Factory Data Architecture

Scenario: A manufacturing plant has 2,000 sensors monitoring vibration, temperature, pressure, and power consumption across 50 production machines. Each sensor reports every second. The plant needs real-time anomaly detection (sub-second alerts for equipment failure) and historical trend analysis (monthly efficiency reports).

Step 1 – Calculate the data volume

Parameter | Value
Sensors | 2,000
Readings per second per sensor | 1
Bytes per reading (timestamp + value + metadata) | ~100 bytes
Raw data rate | 2,000 × 1 × 100 = 200 KB/sec
Daily raw volume | 200 KB/sec × 86,400 sec ≈ 17 GB/day
Monthly raw volume | 17 GB × 30 ≈ 510 GB/month
Annual raw volume | 510 GB × 12 ≈ 6.1 TB/year

Edge Processing ROI: Factory generates \(200 \text{ KB/sec} = 17 \text{ GB/day}\) raw data. Cloud data transfer at \(\$0.09/\text{GB}\) costs \(17 \times 30 \times 0.09 = \$45.90/\text{month}\) bandwidth. Storage at \(\$0.023/\text{GB/month}\): \(510 \times 0.023 = \$11.73/\text{month}\). Total: \(\$57.63/\text{month}\) or \(\$692/\text{year}\).

With edge aggregation (1-minute averages): reduce 60 readings to 1 per minute. New rate: \(200/60 = 3.33 \text{ KB/sec} = 280 \text{ MB/day}\). Cloud costs: bandwidth \((0.28 \times 30 \times 0.09) = \$0.76/\text{month}\), storage \((8.4 \times 0.023) = \$0.19/\text{month}\). Total: \(\$0.95/\text{month}\) or \(\$11.40/\text{year}\).

Savings: \(\$692 - \$11.40 = \$680.60/\text{year}\). Edge gateway costs \(\$500\) with 5-year lifetime (amortized \(\$100/\text{year}\)). Net benefit: \(\$580/\text{year}\), or 580% ROI. Edge processing achieves 98.4% data reduction, paying for itself in under 9 months.
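The ROI arithmetic above can be checked in a few lines, using the text's rounded inputs (17 GB/day raw, 0.28 GB/day after 1-minute averaging, $0.09/GB transfer, $0.023/GB-month storage):

```python
def monthly_cloud_cost(gb_per_day, transfer_per_gb=0.09, storage_per_gb_month=0.023):
    """Bandwidth plus storage cost for one month of accumulated data."""
    gb_per_month = gb_per_day * 30
    return gb_per_month * (transfer_per_gb + storage_per_gb_month)

raw = monthly_cloud_cost(17)     # raw telemetry, ~ $57.63/month
edge = monthly_cloud_cost(0.28)  # after 1-minute averaging, ~ $0.95/month

annual_savings = (raw - edge) * 12
gateway_amortized = 500 / 5      # $500 gateway over a 5-year lifetime
net_benefit = annual_savings - gateway_amortized
payback_months = 500 / (annual_savings / 12)

print(f"Raw: ${raw:.2f}/month   Edge: ${edge:.2f}/month")
print(f"Net benefit ~${net_benefit:.0f}/year, payback in {payback_months:.1f} months")
```

Small differences from the prose figures (a dollar or so per year) come from rounding 17.28 GB/day down to 17 before multiplying.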

Step 2 – Design the edge processing

At the edge, each gateway aggregates 40 sensors (50 gateways total):

  • Real-time forwarding: vibration readings above threshold immediately sent to stream processor (estimated 2% of readings = 0.34 GB/day)
  • Local aggregation: 1-minute averages of temperature, pressure, power (reduces 60 readings to 1 per minute = ~280 MB/day)
  • Result: edge processing reduces 17 GB/day to approximately 630 MB/day forwarded to cloud (96% reduction)
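The gateway behaviour described above can be sketched in a few lines. This is a simplified single-sensor model; the 8.0 threshold value and mm/s units are assumed for illustration, not recommendations:

```python
from statistics import mean

VIBRATION_THRESHOLD = 8.0  # illustrative alert threshold, units assumed mm/s

def process_minute(readings):
    """One minute of 1 Hz readings from one sensor.

    Returns (alerts, summary): threshold-breaching readings are forwarded
    immediately; the summary replaces the 60 raw readings for upload."""
    alerts = [r for r in readings if r > VIBRATION_THRESHOLD]
    summary = {"avg": mean(readings), "min": min(readings), "max": max(readings)}
    return alerts, summary

# 59 normal readings plus one anomalous spike
minute = [3.0] * 59 + [9.5]
alerts, summary = process_minute(minute)
print(f"Forwarded immediately: {alerts}")   # [9.5]
print(f"Uploaded once/minute:  {summary}")
```

Sixty raw readings become one immediate alert plus one summary record, the same order of reduction the plant-wide numbers above describe.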

Step 3 – Select the technology stack

Ingestion:     Apache Kafka (handles burst traffic, decouples producers/consumers)
Stream:        Apache Flink (stateful anomaly detection with <100ms latency)
Time-series:   InfluxDB (optimized for sensor telemetry queries)
Batch:         Apache Spark (monthly efficiency reports from data lake)
Data lake:     S3 (raw data archive for reprocessing, $0.023/GB/month)
Dashboard:     Grafana (real-time visualization connected to InfluxDB)

Step 4 – Estimate monthly costs

Component | Specification | Monthly Cost
Kafka cluster (3 brokers) | m5.large instances | ~$300
Flink cluster (2 workers) | c5.xlarge instances | ~$250
InfluxDB (90-day retention) | 54 GB hot storage | ~$150
S3 data lake (raw archive) | ~6 TB/year growing | ~$140/year
Spark (on-demand, monthly) | 4-hour job | ~$20
Total | | ~$740

Comparison: sending all raw data (17 GB/day) to a cloud SQL database would cost approximately $2,500/month in compute and storage, and queries would be 10-100x slower. The edge-first architecture saves over 70%.

26.7 Knowledge Checks

Test your understanding of big data concepts in IoT:

Run this code to calculate data volumes for real IoT deployments and see how edge processing reduces cloud costs by orders of magnitude.

import random

def calculate_iot_data_volume(sensors, bytes_per_reading,
                               readings_per_sec, days=1):
    """Calculate raw data volume for an IoT deployment."""
    secs_per_day = 86_400
    daily_bytes = sensors * bytes_per_reading * readings_per_sec * secs_per_day
    total_bytes = daily_bytes * days
    return total_bytes

def format_bytes(b):
    for unit in ["B", "KB", "MB", "GB", "TB", "PB"]:
        if abs(b) < 1024:
            return f"{b:.1f} {unit}"
        b /= 1024
    return f"{b:.1f} EB"

def cloud_cost(bytes_val, cost_per_gb=0.023):
    """Estimate monthly cloud storage cost (S3 Standard pricing)."""
    gb = bytes_val / (1024**3)
    return gb * cost_per_gb

# === Scenario 1: Smart Factory ===
print("=== Scenario 1: Smart Factory ===")
factory_daily = calculate_iot_data_volume(
    sensors=2000, bytes_per_reading=100,
    readings_per_sec=1, days=1
)
factory_monthly = factory_daily * 30
factory_yearly = factory_daily * 365
print(f"  Sensors: 2,000 at 1 reading/sec (100 bytes each)")
print(f"  Daily:   {format_bytes(factory_daily)}")
print(f"  Monthly: {format_bytes(factory_monthly)}")
print(f"  Yearly:  {format_bytes(factory_yearly)}")
print(f"  Cloud storage cost: ${cloud_cost(factory_monthly):.0f}/month")

# === Scenario 2: Smart City Traffic ===
print("\n=== Scenario 2: Smart City (Traffic Cameras) ===")
# 500 cameras, 30 fps, 2 MB per frame
camera_daily = calculate_iot_data_volume(
    sensors=500, bytes_per_reading=2_000_000,
    readings_per_sec=30, days=1
)
print(f"  Cameras: 500 at 30 fps (2 MB/frame)")
print(f"  Daily raw: {format_bytes(camera_daily)}")
print(f"  Cloud cost if sent raw: ${cloud_cost(camera_daily * 30):,.0f}/month")

# Edge processing: only send vehicle counts (100 bytes/min)
camera_edge = calculate_iot_data_volume(
    sensors=500, bytes_per_reading=100,
    readings_per_sec=1/60, days=1
)
reduction = (1 - camera_edge / camera_daily) * 100
print(f"  After edge processing (vehicle counts only):")
print(f"  Daily:    {format_bytes(camera_edge)}")
print(f"  Reduction: {reduction:.2f}%")
print(f"  Cloud cost: ${cloud_cost(camera_edge * 30):.2f}/month")
savings = cloud_cost(camera_daily * 30) - cloud_cost(camera_edge * 30)
print(f"  Monthly savings: ${savings:,.0f}")

# === Edge Processing Simulation ===
print("\n=== Edge Processing Strategies Comparison ===")
random.seed(42)

# Simulate 1 hour of temperature readings (1/sec = 3600 readings)
raw_readings = [22.0 + random.gauss(0, 0.3) for _ in range(3600)]
raw_bytes = len(raw_readings) * 8  # 8 bytes per float

# Count readings whose change from the previous reading exceeds 0.5 C
delta_count = sum(1 for i in range(1, len(raw_readings))
                  if abs(raw_readings[i] - raw_readings[i-1]) > 0.5)

strategies = {
    "Raw (no processing)": {
        "readings": len(raw_readings),
        "bytes": raw_bytes,
    },
    "1-min averages": {
        "readings": 60,  # 3600/60
        "bytes": 60 * 8,
    },
    "5-min averages": {
        "readings": 12,  # 3600/300
        "bytes": 12 * 8,
    },
    "Delta encoding (>0.5C change)": {
        "readings": delta_count,
        "bytes": delta_count * 8,
    },
    "Min/max/avg per 5 min": {
        "readings": 12 * 3,  # 3 values per window
        "bytes": 12 * 3 * 8,
    },
}

print(f"{'Strategy':<32} {'Readings':>10} {'Bytes':>10} {'Reduction':>10}")
print("-" * 65)
for name, info in strategies.items():
    reduction = (1 - info["bytes"] / raw_bytes) * 100
    print(f"{name:<32} {info['readings']:>10,} "
          f"{format_bytes(info['bytes']):>10} {reduction:>9.1f}%")

# Scale to deployment
print(f"\nScaled to 10,000 sensors for 30 days:")
for name, info in strategies.items():
    monthly = info["bytes"] * 10_000 * 24 * 30  # per hour * sensors * hours * days
    cost = cloud_cost(monthly)
    print(f"  {name:<32} {format_bytes(monthly):>10}  ${cost:>8,.0f}/month")

What to Observe:

  • A modest factory (2,000 sensors) generates ~17 GB/day – manageable, but growing fast
  • Traffic cameras generate petabytes per day in aggregate (~5 TB per camera) – impossible to send raw to the cloud
  • Edge processing reduces camera data by more than 99.99%, cutting storage costs from millions of dollars to pennies per month
  • Even simple strategies (1-min averages) achieve 98%+ reduction for temperature data
  • Delta encoding is most efficient for stable signals but loses context during gradual drift

26.8 Tiered Storage Cost Explorer

Tiered storage policies reduce costs for IoT data lakes by keeping only recent, frequently queried data in expensive hot storage. The savings depend on the total data volume and the proportion stored in each tier.
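In place of an interactive explorer, the same comparison can be computed directly. The tier prices below are assumptions modelled on typical object-storage tiers; check your provider's current pricing:

```python
# Assumed tier prices (USD per GB-month), illustrative only.
TIER_PRICE = {"hot": 0.023, "warm": 0.0125, "cold": 0.004}

def tiered_monthly_cost(total_gb, fractions):
    """Monthly storage cost for data split across tiers.

    `fractions` maps tier name -> share of total_gb (shares must sum to 1)."""
    assert abs(sum(fractions.values()) - 1.0) < 1e-9
    return sum(total_gb * share * TIER_PRICE[tier]
               for tier, share in fractions.items())

total_gb = 6_000  # roughly one year of the smart-factory raw archive
all_hot = tiered_monthly_cost(total_gb, {"hot": 1.0, "warm": 0.0, "cold": 0.0})
tiered = tiered_monthly_cost(total_gb, {"hot": 0.05, "warm": 0.15, "cold": 0.80})

print(f"All hot storage: ${all_hot:.2f}/month")
print(f"Tiered policy:   ${tiered:.2f}/month")
```

With 80% of the archive in cold storage, the monthly bill drops from about $138 to about $37, a roughly 73% saving, without deleting any data.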

26.9 Summary

Big data in IoT is not simply “more data” – it is a fundamentally different challenge that requires purpose-built architectures and technologies. This chapter series equips you with the knowledge to design, build, and operate big data systems at IoT scale.

Key takeaways:

  • The 5 V’s define the challenge: Volume, Velocity, Variety, Veracity, and Value each demand specific architectural responses. No single technology addresses all five.
  • Edge processing is the single most impactful optimization: Filtering and aggregating data at the source reduces cloud costs by 90-99% and eliminates bandwidth bottlenecks before they occur.
  • Lambda architecture unifies real-time and batch: Stream processing (Flink/Kafka Streams) delivers sub-second alerts while batch processing (Spark) provides complete historical accuracy. The serving layer merges both views.
  • Technology selection follows data patterns: Use Kafka for ingestion, Flink for streaming, Spark for batch analytics, InfluxDB/TimescaleDB for time-series queries, and S3/HDFS for data lake storage. Match the tool to the access pattern.
  • Tiered storage prevents runaway costs: Hot storage for recent data, warm for intermediate, and cold/glacier for archives. Without lifecycle policies, storage costs grow linearly and eventually dominate your budget.
  • Data quality must be enforced at the source: A bad sensor reading at the edge becomes a false alert in the stream processor, a corrupted record in the data lake, and a biased model in the ML pipeline. Validate early, validate often.

Big Data Overview connects to:

  • Upstream Dependencies: Edge computing provides the first filtering tier that reduces big data volume by 90-99% before cloud ingestion (Edge Processing)
  • Downstream Applications: Big data technologies enable real-time analytics (Stream Processing), machine learning (ML for IoT), and business intelligence
  • Parallel Concepts: Data lakes vs data warehouses (Cloud Data Architecture) represent different approaches to storing and querying big data
  • Cross-Module Links: Distributed system architectures (Edge-Fog-Cloud) provide the infrastructure where big data processing occurs

Key Concept Map:

Edge Filtering (90-99% reduction)
    ↓
Big Data Ingestion (Kafka)
    ↓
├─ Stream Processing (real-time) → Alerts, Dashboards
└─ Batch Processing (historical) → Reports, ML Training

The 5 V’s provide the decision framework for which technology to use at each stage.


26.10 What’s Next

If you want to… | Read this
Learn the core concepts in depth | Big Data Fundamentals
Explore specific processing technologies | Big Data Technologies
Understand end-to-end pipeline design | Big Data Pipelines
Apply edge processing to reduce data volume | Big Data Edge Processing
See deployments in real IoT systems | Big Data Case Studies