26 Big Data Overview
26.1 Big Data Overview
This section provides a stable anchor for cross-references to big data concepts across the module. Big data in IoT represents a fundamental shift from traditional data management, requiring distributed processing where data is filtered at the edge, aggregated at intermediate nodes, and only valuable insights reach cloud storage.
Learning Objectives
After completing this series of chapters, you will be able to:
- Explain the characteristics of big data in IoT contexts using the 5 V’s framework
- Analyze why traditional databases cannot handle IoT scale by comparing throughput limits
- Implement edge processing strategies for bandwidth and cost optimization
- Select appropriate big data technologies (Hadoop, Spark, Kafka, time-series databases) for specific workloads
- Design stream processing systems with Lambda architecture combining batch and real-time layers
- Debug and monitor production big data pipelines using key metrics and alerting thresholds
- Apply lessons from real-world smart city deployments to new IoT projects
Key Concepts
- 5 V's of big data: Volume (scale), Velocity (speed), Variety (types), Veracity (quality), and Value (business benefit) — the five dimensions that characterize IoT data challenges.
- Lambda architecture: A data processing pattern combining a batch layer (high accuracy, high latency) with a speed layer (lower accuracy, low latency) to serve both historical and real-time queries.
- Data lake: A centralized repository storing raw IoT data in its native format at massive scale, allowing any processing framework to read and analyze it as needed.
- Time-series database: A database optimized for storing and querying sequential sensor readings indexed by timestamp, offering far better performance than general-purpose databases for IoT workloads.
- Ingestion pipeline: The infrastructure that receives, validates, and routes incoming IoT sensor data from devices to storage and processing systems.
- Horizontal scalability: The ability to increase system capacity by adding more commodity servers rather than upgrading existing ones — the fundamental design principle of big data infrastructure.
Prerequisites
Before diving into these chapters, you should be familiar with:
- Edge, Fog, and Cloud Overview: Understanding the three-tier architecture helps contextualize where big data processing occurs
- Data Storage and Databases: Knowledge of database types provides foundation for understanding distributed storage
- Networking Basics: Familiarity with network protocols is crucial for understanding data flow
Minimum Viable Understanding
- Big data in IoT is defined by 5 V’s – Volume (terabytes daily from thousands of sensors), Velocity (real-time streams arriving in milliseconds), Variety (structured telemetry mixed with images, audio, and logs), Veracity (noisy, incomplete, and sometimes incorrect sensor data), and Value (the actionable insights hidden in the noise).
- Traditional tools break at IoT scale – a single SQL database cannot ingest millions of events per second, store petabytes economically, or process unbounded real-time streams, so you must adopt distributed architectures that split work across edge, fog, and cloud tiers.
- Architecture follows latency requirements – use stream processing (Kafka/Flink) when you need sub-second responses, data lakes (S3/HDFS) for raw storage, and batch processing (Spark) for historical analysis; the Lambda architecture combines all three into one coherent pipeline.
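The "architecture follows latency requirements" rule can be sketched as a tiny decision helper. This is a toy illustration: the thresholds and technology pairings are rough rules of thumb drawn from the bullets above, not hard boundaries.

```python
# Toy decision helper: map a latency requirement to a processing style.
# Thresholds are illustrative rules of thumb, not hard boundaries.
def pick_processing(latency_s: float) -> str:
    """Suggest a processing style for a given end-to-end latency budget."""
    if latency_s < 1:
        return "stream processing (Kafka/Flink)"       # sub-second alerts
    if latency_s < 3600:
        return "time-series queries (InfluxDB/TimescaleDB)"  # dashboards
    return "batch processing (Spark on a data lake)"   # historical reports

print(pick_processing(0.1))     # sub-second alert path
print(pick_processing(86_400))  # daily report path
```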
Sensor Squad: The Big Data Adventure
Sammy the Sensor was staring at a screen full of numbers when Lila the LED blinked over.
“What is ALL that, Sammy?” Lila asked, her light flickering with curiosity.
“This is one day of data from the Smart City,” Sammy replied. “Every traffic light, every air quality sensor, every parking meter, every weather station – they have ALL been sending me messages. Look at these numbers: 10 million temperature readings, 5 million traffic counts, 2 million air quality samples, and 500,000 parking updates. That is 17.5 million data points… from just ONE day!”
Max the Microcontroller whirred over. “That is what the engineers call Big Data! It is like trying to drink from a fire hose – the data comes SO fast that a normal computer cannot keep up.”
Bella the Battery added, “And here is the clever part – we do NOT send everything to the cloud. That would use too much of my energy! Instead, each sensor does a little math first. The traffic camera counts cars locally and only sends the COUNT, not every single video frame. That saves 99% of the data!”
“So Big Data is not just about having LOTS of data,” Sammy summarized. “It is about being SMART about which data matters. We filter at the edge, summarize in the middle, and only the really important stuff goes all the way to the cloud.”
“Just like how you do not tell your teacher every single thing that happened at recess,” Lila laughed. “You just share the highlights!”
The lesson: Big data in IoT is not about storing everything – it is about processing data at the right place (edge, fog, or cloud) so you keep only what matters.
For Beginners: What Is Big Data in IoT?
If you are new to big data concepts, here is a simple way to think about it.
The Library Analogy
Imagine a library that receives 10,000 new books every single hour, 24 hours a day, 7 days a week. The books arrive in different languages (English, Spanish, Mandarin), different formats (paperback, hardcover, audiobook, e-book), and some books have pages missing or typos. Your job is to organize them, find useful information, and answer questions from readers – in real time.
That is exactly the challenge of big data in IoT:
- Volume = the 10,000 books per hour (massive amounts of data)
- Velocity = they arrive continuously, not in batches (real-time streams)
- Variety = different languages and formats (structured sensor readings, unstructured images, semi-structured logs)
- Veracity = missing pages and typos (noisy, incomplete sensor data)
- Value = finding useful answers for readers (extracting actionable insights)
Why can’t a normal database handle this?
A traditional database is like a single librarian with one desk. When books arrive slowly, they can catalog each one carefully. But when 10,000 books per hour start arriving, the librarian falls hopelessly behind. You need a TEAM of librarians (distributed systems), working in parallel, with a smart system for dividing the work.
The key insight: IoT big data is not just “more data.” It requires fundamentally different tools and architectures – distributed storage (data lakes), real-time processing (stream engines), and intelligent filtering (edge computing) – because the old tools simply cannot keep up.
Three things to remember:
- Filter early – process data at the edge to reduce volume by 90-99%
- Process in parallel – use distributed systems like Spark and Kafka to split the work
- Store smartly – use time-series databases for telemetry, data lakes for raw storage, and only keep what you need
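The "filter early" idea can be sketched in a few lines of Python. This is a toy illustration rather than production edge code; the anomaly threshold and 60-sample window are assumed values.

```python
from statistics import mean

# Toy edge-filtering sketch: forward anomalous readings immediately,
# and collapse the rest of a 60-second window into one summary record.
ANOMALY_THRESHOLD = 80.0  # assumed alert threshold (e.g. degrees C)

def filter_window(readings):
    """Reduce a 60-sample window to alerts plus a single aggregate.

    readings: list of per-second float samples from one sensor.
    Returns (alerts, summary): raw out-of-range samples to forward now,
    and one min/avg/max summary replacing the remaining 60 samples.
    """
    alerts = [r for r in readings if r > ANOMALY_THRESHOLD]
    summary = {"avg": mean(readings), "min": min(readings), "max": max(readings)}
    return alerts, summary

# 60 seconds of mostly-normal telemetry with one spike
window = [21.0] * 59 + [95.0]
alerts, summary = filter_window(window)
print(len(alerts), round(summary["avg"], 2))  # prints: 1 22.23
```

Sixty raw samples become one alert plus one summary record, which is exactly the 90-99% volume reduction the bullets describe.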
26.2 The IoT Big Data Challenge
IoT systems generate data at a scale that fundamentally breaks traditional data management approaches. To understand why, consider the numbers:
| Scale Factor | Traditional System | IoT Big Data System |
|---|---|---|
| Devices | Dozens of users | 10,000 to 1,000,000 sensors |
| Data rate | Hundreds of queries/sec | Millions of events/sec |
| Data volume | Gigabytes per month | Terabytes per day |
| Latency requirement | Seconds acceptable | Milliseconds required |
| Data types | Structured rows | Mixed: numbers, images, logs, video |
| Retention | Months | Years (regulatory compliance) |
26.3 Big Data Technology Landscape
Choosing the right technology depends on where your data sits in the processing pipeline. Broadly, each tool category maps to one stage: Kafka for ingestion, Flink or Kafka Streams for stream processing, Spark and Hadoop for batch analytics, InfluxDB or TimescaleDB for time-series storage and queries, and S3 or HDFS for the data lake.
26.4 Lambda Architecture for IoT
The Lambda architecture is the dominant pattern for IoT big data systems because it handles both real-time and historical analysis in a single unified design:
| Layer | Purpose | Technology | Latency | Accuracy |
|---|---|---|---|---|
| Speed layer | Real-time approximate results | Flink, Kafka Streams | Milliseconds | Approximate |
| Batch layer | Complete historical reprocessing | Spark, Hadoop | Hours | Exact |
| Serving layer | Merge and serve query results | Druid, Cassandra | Sub-second | Best available |
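A minimal sketch of what the serving layer's merge step looks like: the batch view holds exact counts up to the last batch run, the speed view holds the approximate delta since then, and a query returns their sum. Sensor names and counts here are invented for illustration.

```python
# Toy Lambda serving-layer merge (illustrative data, not a real system).
batch_view = {"sensor-a": 86_400, "sensor-b": 86_150}  # exact, hours old
speed_view = {"sensor-a": 312, "sensor-c": 45}         # approximate, seconds old

def query(sensor_id: str) -> int:
    """Best-available answer: exact batch history plus real-time delta."""
    return batch_view.get(sensor_id, 0) + speed_view.get(sensor_id, 0)

print(query("sensor-a"))  # prints: 86712 (batch total + stream delta)
```

When the next batch run completes, its exact results replace the speed layer's approximations, which is why the table lists the batch layer as "Exact" and the merged answer as "Best available".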
26.5 Chapter Series
This topic has been organized into focused chapters for easier learning:
26.5.1 Big Data Fundamentals
Understanding the scale challenge: the 5 V’s of big data, why traditional databases fail at IoT scale, and the economics of distributed systems. Includes the Sensor Squad introduction for beginners.
26.5.2 Edge Processing for Big Data
The 90/10 rule: how edge computing reduces data volume by 99%, making impossible big data problems manageable. Cost comparisons and real-world traffic camera examples.
26.5.3 Big Data Technologies
Technology deep-dive: Apache Hadoop ecosystem (HDFS, Spark, Hive, Kafka), time-series databases (InfluxDB, TimescaleDB), and when to use each technology.
26.5.4 Big Data Pipelines
Architecture patterns: Lambda architecture combining batch and stream processing, ETL vs ELT, data lakes vs data warehouses, and windowing strategies for stream processing.
26.5.5 Big Data Operations
Production readiness: monitoring metrics, debugging common issues (Kafka lag, OOM errors, late arrivals), and avoiding the seven deadly pitfalls of IoT big data.
26.5.6 Big Data Case Studies
Real-world applications: Barcelona Smart City deployment with concrete numbers, worked examples for data lake architecture, and decision frameworks for technology selection.
Common Pitfalls in IoT Big Data
1. Sending everything to the cloud. The most expensive mistake in IoT. A traffic camera generating 30 frames/second at 2 MB/frame produces 5 TB/day. Cloud ingestion at $0.02/GB means $100/day per camera. Edge processing that sends only vehicle counts reduces this to pennies.
2. Using a relational database for time-series data. PostgreSQL can handle small IoT deployments, but query performance degrades dramatically as tables grow beyond a few hundred million rows. Time-series databases (InfluxDB, TimescaleDB) automatically partition by time and provide 10-100x faster range queries.
3. Ignoring data quality at the source. Bad sensor data (stuck readings, calibration drift, noise) propagates through the entire pipeline: it poisons ML models, triggers false alerts, and leads to wrong business decisions. Always validate at the edge before ingestion.
4. Building batch-only pipelines for real-time needs. If your application needs sub-second alerts (equipment failure, security breach), a nightly Spark job is too late. Match your architecture to your latency requirements: use stream processing for real-time, batch for historical analysis.
5. No data lifecycle management. IoT data grows forever if you do not plan for it. Without retention policies and tiered storage (hot/warm/cold), storage costs grow linearly and eventually consume your entire budget. Implement automatic archival from day one.
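Pitfall 3 (validate at the edge) can be sketched as a simple gate that rejects out-of-range values and "stuck" sensors before ingestion. The valid range and stuck-sample limit below are assumed values for illustration, not from any real datasheet.

```python
# Toy edge-validation gate: reject readings that are physically
# impossible or identical for too many consecutive samples.
VALID_RANGE = (-40.0, 125.0)  # assumed sensor datasheet limits
STUCK_LIMIT = 10              # N identical consecutive samples => fault

def validate(history, reading):
    """Return True if a reading should be forwarded upstream.

    history: recent accepted readings from the same sensor.
    """
    lo, hi = VALID_RANGE
    if not (lo <= reading <= hi):
        return False  # physically impossible value: drop it
    recent = history[-(STUCK_LIMIT - 1):]
    if len(recent) == STUCK_LIMIT - 1 and all(r == reading for r in recent):
        return False  # Nth identical sample in a row: sensor likely stuck
    return True

history = [22.0] * 9
print(validate(history, 22.0))   # False: 10th identical sample, stuck
print(validate(history, 23.5))   # True: plausible new value
print(validate(history, 400.0))  # False: outside physical range
```

Rejected readings can still be counted and reported as a sensor-health metric, so a stuck sensor triggers a maintenance ticket instead of a false alert downstream.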
26.6 Worked Example: Smart Factory Data Architecture
Scenario: A manufacturing plant has 2,000 sensors monitoring vibration, temperature, pressure, and power consumption across 50 production machines. Each sensor reports every second. The plant needs real-time anomaly detection (sub-second alerts for equipment failure) and historical trend analysis (monthly efficiency reports).
Step 1 – Calculate the data volume
| Parameter | Value |
|---|---|
| Sensors | 2,000 |
| Readings per second per sensor | 1 |
| Bytes per reading (timestamp + value + metadata) | ~100 bytes |
| Raw data rate | 2,000 x 1 x 100 = 200 KB/sec |
| Daily raw volume | 200 KB/sec x 86,400 sec = ~17 GB/day |
| Monthly raw volume | 17 GB x 30 = ~510 GB/month |
| Annual raw volume | 510 GB x 12 = ~6.1 TB/year |
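The table's arithmetic can be reproduced in a few lines (the table rounds intermediate values, so the final digits differ slightly):

```python
# Reproducing the Step 1 volume estimates from the table above.
SENSORS = 2_000
READS_PER_SEC = 1
BYTES_PER_READING = 100  # timestamp + value + metadata (approximate)

rate_kb_s = SENSORS * READS_PER_SEC * BYTES_PER_READING / 1_000
daily_gb = rate_kb_s * 86_400 / 1_000_000
print(f"{rate_kb_s:.0f} KB/sec, {daily_gb:.2f} GB/day, "
      f"{daily_gb * 30:.0f} GB/month, {daily_gb * 30 * 12 / 1000:.2f} TB/year")
# prints: 200 KB/sec, 17.28 GB/day, 518 GB/month, 6.22 TB/year
```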
Putting Numbers to It
Edge Processing ROI: Factory generates \(200 \text{ KB/sec} = 17 \text{ GB/day}\) raw data. Cloud data transfer at \(\$0.09/\text{GB}\) costs \(17 \times 30 \times 0.09 = \$45.90/\text{month}\) bandwidth. Storage at \(\$0.023/\text{GB/month}\): \(510 \times 0.023 = \$11.73/\text{month}\). Total: \(\$57.63/\text{month}\) or \(\$692/\text{year}\).
With edge aggregation (1-minute averages): reduce 60 readings to 1 per minute. New rate: \(200/60 = 3.33 \text{ KB/sec} = 280 \text{ MB/day}\). Cloud costs: bandwidth \((0.28 \times 30 \times 0.09) = \$0.76/\text{month}\), storage \((8.4 \times 0.023) = \$0.19/\text{month}\). Total: \(\$0.95/\text{month}\) or \(\$11.40/\text{year}\).
Savings: \(\$692 - \$11.40 = \$680.60/\text{year}\). Edge gateway costs \(\$500\) with 5-year lifetime (amortized \(\$100/\text{year}\)). Net benefit: \(\$580/\text{year}\), or 580% ROI. Edge processing achieves 98.4% data reduction, paying for itself in under 9 months.
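The ROI arithmetic above can be reproduced directly. The sketch below uses the same assumed rates as the text ($0.09/GB transfer, $0.023/GB-month storage) and, like the text, counts only one month's storage for the data sent that month.

```python
# Reproducing the edge-processing ROI arithmetic from the text.
TRANSFER = 0.09   # assumed $/GB data transfer
STORAGE = 0.023   # assumed $/GB-month storage

def monthly_cost(gb_per_day: float) -> float:
    """Bandwidth plus one month's storage for the data sent that month."""
    monthly_gb = gb_per_day * 30
    return monthly_gb * TRANSFER + monthly_gb * STORAGE

raw = monthly_cost(17.0)    # all raw data to the cloud
edge = monthly_cost(0.28)   # 1-minute averages only
print(f"raw ${raw:.2f}/mo, edge ${edge:.2f}/mo, "
      f"saving ${(raw - edge) * 12:.0f}/yr")
# prints: raw $57.63/mo, edge $0.95/mo, saving $680/yr
```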
Step 2 – Design the edge processing
At the edge, each gateway aggregates 40 sensors (50 gateways total):
- Real-time forwarding: vibration readings above threshold immediately sent to stream processor (estimated 2% of readings = 0.34 GB/day)
- Local aggregation: 1-minute averages of temperature, pressure, power (reduces 60 readings to 1 per minute = ~280 MB/day)
- Result: edge processing reduces 17 GB/day to approximately 630 MB/day forwarded to cloud (96% reduction)
Step 3 – Select the technology stack
Ingestion: Apache Kafka (handles burst traffic, decouples producers/consumers)
Stream: Apache Flink (stateful anomaly detection with <100ms latency)
Time-series: InfluxDB (optimized for sensor telemetry queries)
Batch: Apache Spark (monthly efficiency reports from data lake)
Data lake: S3 (raw data archive for reprocessing, $0.023/GB/month)
Dashboard: Grafana (real-time visualization connected to InfluxDB)
Step 4 – Estimate monthly costs
| Component | Specification | Monthly Cost |
|---|---|---|
| Kafka cluster (3 brokers) | m5.large instances | ~$300 |
| Flink cluster (2 workers) | c5.xlarge instances | ~$250 |
| InfluxDB (90-day retention) | 54 GB hot storage | ~$150 |
| S3 data lake (raw archive) | ~6 TB/year growing | ~$12 (≈$140/year) |
| Spark (on-demand, monthly) | 4-hour job | ~$20 |
| Total | | ~$740/month |
Comparison: sending all raw data (17 GB/day) to a cloud SQL database would cost approximately $2,500/month in compute and storage, and queries would be 10-100x slower. The edge-first architecture saves over 70%.
26.7 Knowledge Checks
Test your understanding of big data concepts in IoT:
26.8 Tiered Storage Cost Explorer
Use this calculator to see how tiered storage policies reduce costs for IoT data lakes. Adjust the total data volume and the proportion stored in each tier.
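The same tiered-cost calculation can be sketched in a few lines. Only the hot-tier rate ($0.023/GB-month) appears elsewhere in this chapter; the warm and cold rates below are assumed illustrative prices.

```python
# Toy tiered-storage cost model. Hot rate matches the chapter's S3
# figure; warm and cold rates are assumed for illustration.
PRICES = {"hot": 0.023, "warm": 0.0125, "cold": 0.004}  # $/GB-month

def monthly_storage_cost(total_gb: float, fractions: dict) -> float:
    """Cost of total_gb split across tiers; fractions must sum to 1."""
    assert abs(sum(fractions.values()) - 1.0) < 1e-9
    return sum(total_gb * frac * PRICES[tier]
               for tier, frac in fractions.items())

all_hot = monthly_storage_cost(6_000, {"hot": 1.0, "warm": 0.0, "cold": 0.0})
tiered = monthly_storage_cost(6_000, {"hot": 0.1, "warm": 0.2, "cold": 0.7})
print(f"all-hot ${all_hot:.0f}/mo vs tiered ${tiered:.0f}/mo")
# prints: all-hot $138/mo vs tiered $46/mo
```

Even with most of the archive in the cold tier, recent data stays in hot storage for fast queries, which is the hot/warm/cold lifecycle the pitfalls section recommends implementing from day one.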
26.9 Summary
Big data in IoT is not simply “more data” – it is a fundamentally different challenge that requires purpose-built architectures and technologies. This chapter series equips you with the knowledge to design, build, and operate big data systems at IoT scale.
Key takeaways:
- The 5 V’s define the challenge: Volume, Velocity, Variety, Veracity, and Value each demand specific architectural responses. No single technology addresses all five.
- Edge processing is the single most impactful optimization: Filtering and aggregating data at the source reduces cloud costs by 90-99% and eliminates bandwidth bottlenecks before they occur.
- Lambda architecture unifies real-time and batch: Stream processing (Flink/Kafka Streams) delivers sub-second alerts while batch processing (Spark) provides complete historical accuracy. The serving layer merges both views.
- Technology selection follows data patterns: Use Kafka for ingestion, Flink for streaming, Spark for batch analytics, InfluxDB/TimescaleDB for time-series queries, and S3/HDFS for data lake storage. Match the tool to the access pattern.
- Tiered storage prevents runaway costs: Hot storage for recent data, warm for intermediate, and cold/glacier for archives. Without lifecycle policies, storage costs grow linearly and eventually dominate your budget.
- Data quality must be enforced at the source: A bad sensor reading at the edge becomes a false alert in the stream processor, a corrupted record in the data lake, and a biased model in the ML pipeline. Validate early, validate often.
Concept Relationships
Big Data Overview connects to:
- Upstream Dependencies: Edge computing provides the first filtering tier that reduces big data volume by 90-99% before cloud ingestion (Edge Processing)
- Downstream Applications: Big data technologies enable real-time analytics (Stream Processing), machine learning (ML for IoT), and business intelligence
- Parallel Concepts: Data lakes vs data warehouses (Cloud Data Architecture) represent different approaches to storing and querying big data
- Cross-Module Links: Distributed system architectures (Edge-Fog-Cloud) provide the infrastructure where big data processing occurs
Key Concept Map:
Edge Filtering (90-99% reduction)
↓
Big Data Ingestion (Kafka)
↓
├─ Stream Processing (real-time) → Alerts, Dashboards
└─ Batch Processing (historical) → Reports, ML Training
The 5 V’s provide the decision framework for which technology to use at each stage.
See Also
Within This Module:
- Big Data Technologies - Apache Hadoop, Spark, Kafka deep dive
- Big Data Pipelines - Lambda architecture and windowing strategies
- Edge Compute Patterns - Processing placement decisions
- Multi-Sensor Data Fusion - Combining data streams
Related Modules:
- Data Storage and Databases - Storage foundations
- Stream Processing Fundamentals - Real-time processing
- Edge, Fog and Cloud Overview - Three-tier architecture
- Fog Architecture - Fog computing details
Learning Hubs:
- Quiz Navigator - Data analytics quizzes
- Simulation Playground - Edge latency explorer, sensor fusion tools
- Knowledge Gaps Tracker - Track your progress
External Resources:
- Apache Spark Documentation - Official Spark guide
- Confluent Kafka Guide - Kafka best practices
- Martin Kleppmann, “Designing Data-Intensive Applications” (2017) - Big data architecture patterns
26.10 What’s Next
| If you want to… | Read this |
|---|---|
| Learn the core concepts in depth | Big Data Fundamentals |
| Explore specific processing technologies | Big Data Technologies |
| Understand end-to-end pipeline design | Big Data Pipelines |
| Apply edge processing to reduce data volume | Big Data Edge Processing |
| See deployments in real IoT systems | Big Data Case Studies |