27  Big Data Fundamentals

In 60 Seconds

Big data in IoT is characterized by the 5 V’s: Volume (petabytes from billions of devices), Velocity (real-time streaming), Variety (structured, unstructured, time-series), Veracity (data quality challenges), and Value (actionable insights). Traditional databases fail at IoT scale because single servers hit physical limits, making horizontal scaling with distributed systems essential.

Learning Objectives

After completing this chapter, you will be able to:

  • Explain the characteristics of big data in IoT contexts using the 5 V’s framework
  • Distinguish between the 5 V’s of big data (Volume, Velocity, Variety, Veracity, Value) with IoT-specific examples
  • Analyze why traditional databases cannot handle IoT scale by calculating throughput limits
  • Calculate the economics of distributed versus centralized processing to justify architecture decisions

Key Concepts

  • Volume: The sheer quantity of data generated by IoT deployments — billions of sensor readings per day from millions of devices — that exceeds the capacity of traditional database systems.
  • Velocity: The speed at which IoT data arrives, often in real-time streams requiring processing within milliseconds to seconds for time-sensitive applications like anomaly detection.
  • Variety: The diversity of IoT data types — structured sensor readings, unstructured video streams, semi-structured JSON payloads — requiring flexible storage and processing frameworks.
  • Veracity: The trustworthiness of IoT data, challenged by sensor noise, calibration drift, transmission errors, and missing readings that must be detected and corrected before analysis.
  • Value: The business benefit extracted from IoT data through analytics, the ultimate justification for the cost of collection, storage, and processing infrastructure.
  • Distributed file system: A storage system (e.g., HDFS) that partitions large datasets across many commodity servers, enabling parallel processing and fault tolerance through replication.
  • MapReduce: A parallel processing paradigm where the Map phase distributes and transforms data across nodes and the Reduce phase aggregates results, enabling petabyte-scale batch analytics.
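
The Map/Reduce split described above can be sketched in a few lines of Python. This is a single-process illustration with invented sample readings, not an actual Hadoop job, but the map, shuffle, and reduce stages mirror what a cluster does across many nodes:

```python
from collections import defaultdict

# Sample sensor readings: (sensor_id, temperature_celsius) -- invented data
readings = [("s1", 21.0), ("s2", 19.5), ("s1", 23.0), ("s2", 20.5), ("s1", 22.0)]

# Map phase: emit (key, value) pairs; on a real cluster each node maps its shard
mapped = [(sensor_id, temp) for sensor_id, temp in readings]

# Shuffle: group values by key (the framework does this between phases)
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce phase: aggregate each key's values -- here, the mean temperature
averages = {key: sum(vals) / len(vals) for key, vals in groups.items()}
print(averages)  # {'s1': 22.0, 's2': 20.0}
```

The same three-stage structure scales from five readings on a laptop to petabytes across thousands of nodes, which is the point of the paradigm.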

27.1 Getting Started (For Beginners)

Big Data is like having a SUPER memory that never forgets anything!

27.1.1 The Sensor Squad Adventure: The Mountain of Messages

One sunny morning, Signal Sam woke up to find something amazing - and a little scary! During the night, all the sensors in the smart house had been sending messages, and now there was a HUGE pile of data sitting in Signal Sam’s inbox.

“Whoa!” gasped Sam, looking at the mountain of messages. “Sunny sent 86,400 light readings yesterday. Thermo sent 86,400 temperature readings. Motion Mo detected movement 2,347 times. And Power Pete tracked every single watt of electricity - that’s over 300,000 data points!”

Sunny the Light Sensor flew over. “Is that… too much data?”

“Not if we’re smart about it!” Sam grinned. “This is what grown-ups call Big Data. Let me teach you the 5 V’s - they’re like our five superpowers for handling all this information!”

V #1 - Volume (The GIANT pile): “See this mountain? That’s Volume - how MUCH data we have. It’s like trying to count every grain of sand on a beach!”

V #2 - Velocity (The SPEED demon): Motion Mo zoomed by. “That’s me! I send 10 readings every SECOND. Velocity means how FAST new data arrives - like trying to drink from a fire hose!”

V #3 - Variety (The MIX master): “Look at all the different types!” said Thermo. “Numbers from me, pictures from cameras, sounds from microphones. Variety means different KINDS of data mixed together.”

V #4 - Veracity (The TRUTH teller): Power Pete spoke up seriously. “Sometimes sensors make mistakes. Veracity means checking if the data is TRUE and ACCURATE.”

V #5 - Value (The TREASURE finder): “And the best part,” Sam announced, “is finding the gold nuggets in all this data! Value means turning mountains of numbers into useful answers - like knowing exactly when to water the plants or when someone might slip on the stairs.”

The Sensor Squad cheered! They had learned that Big Data isn’t scary - it’s just lots of little pieces of information that tell an amazing story when you put them together!

27.1.2 Key Words for Kids

| Word | What It Means |
|------|---------------|
| Big Data | So much information that regular computers can’t handle it - like trying to fit an ocean in a bathtub! |
| Volume | How MUCH data there is - measured in terabytes (that’s a TRILLION bytes!) |
| Velocity | How FAST data is coming in - some sensors send thousands of readings per second |
| Variety | Different TYPES of data mixed together - numbers, pictures, sounds, and more |
| Veracity | Making sure the data is TRUE and correct - like double-checking your homework |
| Value | The useful ANSWERS we find hidden in all that data - the treasure! |

27.1.3 Try This at Home!

Be a Data Detective for One Day:

  1. Pick ONE thing to track for a whole day (like how many times you open the fridge)
  2. Write down the TIME and what happened each time (8:00am - got milk, 12:30pm - got cheese…)
  3. At the end of the day, count your entries. That’s your Volume!
  4. Look at how often you made entries. That’s your Velocity!
  5. Think about patterns: “I open the fridge most around mealtimes!” That’s finding Value!

Now imagine if EVERYTHING in your house was keeping track like you did - the lights, the doors, the TV, the thermostat. THAT’S why IoT creates Big Data!

Challenge: Can you calculate how many data points your house might create in one day if every device sent just ONE message per minute? (Hint: 1 device x 60 minutes x 24 hours = 1,440. Now multiply by how many devices you have!)

New to Big Data? Start Here!

If you have not worked with distributed data systems before, the sections below introduce the 5 V’s of big data, explain why traditional databases fail at IoT scale, and walk through the economics of horizontal versus vertical scaling. If you already understand why PostgreSQL cannot handle 1 million writes per second, skip ahead to The Economics of Scale.

27.1.4 What is Big Data? (Simple Explanation)

The Refrigerator Analogy

Imagine your refrigerator could talk. Every second, it tells you:

  • Current temperature: 37.2 degrees F
  • Door status: Closed
  • Compressor: Running
  • Ice maker: Full

That’s 4 pieces of information per second, or 345,600 data points per day - just from your fridge!

Now imagine EVERY appliance in your home doing this:

  • Refrigerator: 345,600/day
  • Thermostat: 86,400/day
  • Smart lights (10): 864,000/day
  • Security cameras (4): 10 TB/day of video

A single smart home generates more data in one day than a traditional business did in an entire year in 1990.

That’s why we need “big data” technology - not because we want to, but because traditional tools literally cannot keep up.

Why IoT creates Big Data:

  • 50 billion IoT devices worldwide (2025)
  • Each device sends 100+ readings per day
  • 50 billion x 100 = 5 trillion data points every single day
  • That’s like every person on Earth posting 600 social media updates per day!
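
The multiplication behind these bullets, as a quick sanity check (the device and reading counts are the rough estimates used above):

```python
devices = 50_000_000_000        # ~50 billion IoT devices (2025 estimate)
readings_per_day = 100          # readings per device per day
world_population = 8_000_000_000

total = devices * readings_per_day
per_person = total / world_population
print(f"{total:,} data points/day")      # 5 trillion
print(f"{per_person:,.0f} per person")   # ~625, roughly the "600 updates" figure
```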

27.1.5 The 5 V’s of Big Data (Made Simple)

| V | Meaning | IoT Example |
|---|---------|-------------|
| Volume | HOW MUCH data | 50 billion devices x 100 readings/day = HUGE |
| Velocity | HOW FAST data arrives | Self-driving car: 1 GB per SECOND |
| Variety | HOW MANY types | Temp, video, GPS, audio, text - all different formats |
| Veracity | HOW ACCURATE | Is that sensor reading correct or is it broken? |
| Value | WHAT IT’S WORTH | Can we actually use this data to make decisions? |

27.1.6 Why Does IoT Create So Much Data?

Flowchart showing how billions of IoT devices generating frequent sensor readings multiply into zettabytes of annual data through five compounding factors
Figure 27.1: IoT Data Explosion from Devices to Zettabytes

IoT Data Explosion Factors: Five multiplying factors create exponential data growth from billions of constantly-sensing devices to zettabytes of annual data generation.

Comparison diagram showing 5 Vs of big data by domain
Figure 27.2: 5 V’s by IoT Domain: Different applications have different big data profiles requiring tailored architectures
Key Takeaway

In one sentence: Big data for IoT is not about storing everything - it is about building distributed systems that can process continuous streams faster than data arrives.

Remember this: When IoT data overwhelms your system, scale horizontally (add more nodes) rather than vertically (bigger server) - a single server hits a hard ceiling, while a distributed system keeps scaling as you add nodes.

27.1.7 The Problem: Can’t Use Regular Databases!

| Challenge | Regular Database | Big Data System |
|-----------|------------------|-----------------|
| Data Size | GB (gigabytes) | PB (petabytes = 1M GB) |
| Processing | One computer | Thousands of computers |
| Speed | Minutes to query | Seconds to query |
| Data Types | Tables only | Tables, images, video, JSON |
| Cost | Expensive per GB | Cheap at scale |

27.2 Why Traditional Databases Can’t Handle IoT

Think about how a traditional database works - it’s like a librarian organizing books. One librarian can shelve about 100 books per hour. What happens when trucks start delivering 10,000 books per hour?

27.2.1 The Single Server Problem

A typical database server can process about 10,000-50,000 simple operations per second. That sounds like a lot, until you do the math for IoT:

| Scenario | Data Generation Rate | Traditional DB Capacity | Gap |
|----------|----------------------|-------------------------|-----|
| Smart home (20 sensors) | 20 readings/second | Easily handled | None |
| Small factory (500 sensors) | 5,000 readings/second | Near limit | Small |
| Smart city (100,000 sensors) | 1,000,000 readings/second | 20x over capacity | Massive |

Try it yourself – adjust the sensor count and sampling rate to see when your IoT deployment would overwhelm a traditional database:
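
In place of an interactive calculator, here is a minimal Python sketch; the default capacity figures are illustrative, spanning the single-node production numbers quoted below:

```python
def capacity_gap(sensors: int, readings_per_sec: float,
                 db_capacity: int = 50_000) -> str:
    """Estimate whether a single-server database can absorb an IoT write load.

    db_capacity is the server's sustained insert rate; 10,000-50,000/sec
    spans typical single-node figures (illustrative, not benchmarks).
    """
    rate = sensors * readings_per_sec
    if rate < 0.5 * db_capacity:
        verdict = "easily handled"
    elif rate <= db_capacity:
        verdict = "near limit"
    else:
        verdict = f"{rate / db_capacity:.0f}x over capacity"
    return f"{rate:,.0f} writes/sec: {verdict}"

print(capacity_gap(20, 1, db_capacity=10_000))        # smart home
print(capacity_gap(500, 10, db_capacity=10_000))      # small factory
print(capacity_gap(100_000, 10, db_capacity=50_000))  # smart city
```

Adjusting the sensor count and sampling rate shows how quickly even a high-end single server is overwhelmed.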

Real Numbers from Production Systems:

  • MySQL (single server): ~5,000 inserts/second sustained
  • PostgreSQL (single server): ~10,000 inserts/second sustained
  • Oracle (high-end server): ~50,000 inserts/second sustained
  • IoT Smart City: 1,000,000+ sensor readings/second required

Even the most expensive database server hits a physical limit - one CPU can only process so many operations per second, and one hard drive can only write so fast.

27.2.2 The Bottleneck Visualized

Imagine a single door to a stadium. Even if 50,000 people need to enter, only one person can go through at a time. Traditional databases have this same bottleneck - all data must pass through one processing point.

Architecture diagram comparing a single server bottleneck that drops data versus a distributed system with multiple parallel processing nodes handling the full data load
Figure 27.3: Single Server Bottleneck versus Distributed Processing Architecture
Diagram illustrating scaling strategies
Figure 27.4: Comparison of vertical vs horizontal scaling strategies

Single Server Bottleneck vs Distributed Processing: Traditional databases force all data through one processing point creating a bottleneck that loses 99% of IoT data; distributed big data systems spread the load across 100+ servers to handle full data volume.

27.2.3 Why Can’t We Just Buy a Bigger Server?

The Law of Diminishing Returns:

| Server Upgrade | Cost | Performance Gain | Cost Per Operation |
|----------------|------|------------------|--------------------|
| Basic server | $1,000 | 1,000 ops/sec (baseline) | $1.00 |
| Mid-tier server | $10,000 | 5,000 ops/sec (5x faster) | $2.00 (worse!) |
| High-end server | $100,000 | 20,000 ops/sec (20x faster) | $5.00 (much worse!) |
| 100 basic servers | $100,000 | 100,000 ops/sec (100x faster) | $1.00 (same as baseline!) |

The Physics Problem:

  • Single server speed is limited by the speed of light (signals travel ~1 foot per nanosecond)
  • Larger servers with more RAM and CPUs hit coordination overhead
  • Hard drives have maximum rotation speeds (7,200-15,000 RPM)
  • Network cards max out at 10-100 Gbps

You can’t just “buy faster” - you hit physical limits. The solution is horizontal scaling (many cheap servers) not vertical scaling (one expensive server).

Explore the cost tradeoff – see how vertical scaling costs grow faster than horizontal as you increase throughput:
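
A small Python sketch of this tradeoff, using the power-law cost model developed in the next section (the 1.5 exponent is an empirical modelling assumption, not a measured constant):

```python
def vertical_cost(t: float, c0: float = 50_000, t0: float = 50_000) -> float:
    """C_v(T) = C0 * (T/T0)**1.5 -- superlinear cost of bigger single servers."""
    return c0 * (t / t0) ** 1.5

def horizontal_cost(t: float, c0: float = 50_000, t0: float = 50_000) -> float:
    """C_h(T) = C0 * (T/T0) -- linear cost of adding commodity nodes."""
    return c0 * (t / t0)

# Compare the two curves at increasing throughput targets
for throughput in (50_000, 200_000, 1_000_000, 5_000_000):
    print(f"{throughput:>9,} ops/sec: "
          f"vertical ${vertical_cost(throughput):>13,.0f}, "
          f"horizontal ${horizontal_cost(throughput):>12,.0f}")
```

At 20x the base throughput the vertical curve reaches roughly $4.5M against $1.0M for horizontal, and the gap widens as throughput grows.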

27.3 The Economics of Scale

This seems backwards - more data should cost more, right? Here’s why it doesn’t:

27.3.1 Traditional Database Costs

Example: Processing 1 Million Sensor Readings Per Second

| Component | Cost | Capacity | Total Cost |
|-----------|------|----------|------------|
| High-performance server | $50,000 | 50,000 ops/sec | $50,000 |
| 20 servers to handle load | x20 | 1M ops/sec | $1,000,000 |
| Oracle Enterprise licenses | $47,500/CPU x 4 CPUs x 20 servers | - | $3,800,000 |
| 5-year total (incl. $200K/year maintenance) | - | - | $5,800,000 |

Cost per million operations: $5.80

The economics of the two approaches diverge sharply: vertical scaling follows a power law while horizontal scaling stays linear. For throughput \(T\) operations/second, the cost comparison is:

Vertical scaling: \(C_v(T) = C_0 \times \left(\frac{T}{T_0}\right)^{1.5}\)

Horizontal scaling: \(C_h(T) = C_0 \times \left(\frac{T}{T_0}\right)\)

where \(C_0\) is base cost and \(T_0\) is base throughput.

Example: Starting at \(T_0 = 50{,}000\) ops/sec for \(C_0 = \$50{,}000\), at \(T = 1{,}000{,}000\) ops/sec (a 20x increase):

  • Vertical: \(50{,}000 \times (20)^{1.5} \approx \$4.5M\)
  • Horizontal: \(50{,}000 \times 20 = \$1.0M\)

The \(1.5\) exponent reflects diminishing returns: doubling performance costs \(2^{1.5} = 2.8\times\). At 100x scale, vertical costs \((100)^{1.5} = 1,000\times\) base while horizontal costs only \(100\times\) base.

27.3.2 Big Data Approach Costs

Same Workload: 1 Million Sensor Readings Per Second

| Component | Cost | Capacity | Total Cost |
|-----------|------|----------|------------|
| 100 commodity servers | $500 each | 10,000 ops/sec each | $50,000 |
| Open-source software (Hadoop, Cassandra) | Free | 1M ops/sec total | $0 |
| 5-year total (incl. $10K/year maintenance) | - | - | $100,000 |

Cost per million operations: $0.10

Cost reduction: 58x cheaper!

27.3.3 The Cloud Multiplier

Cloud providers like AWS, Google, and Azure buy millions of servers. Their cost per server is much lower than buying one yourself. When you use their big data services, you benefit from:

  • Volume discounts on hardware: They pay $200 per server (vs your $500)
  • Shared infrastructure costs: One admin manages 10,000 servers (vs your 1 admin per 10 servers)
  • Pay only for what you use: Process 1M readings/sec during peak hours, 10K/sec at night (pay for average, not peak)

Real Example: Smart City Traffic Analysis

| Approach | Upfront Cost | Annual Cost | 10-Year Total |
|----------|--------------|-------------|---------------|
| Traditional DB | $500,000 (servers + licenses) | $100,000/year (maintenance) | $1,500,000 |
| Self-hosted Big Data | $50,000 (servers) | $20,000/year (maintenance) | $250,000 |
| Cloud Big Data (AWS EMR) | $0 (no upfront) | $50,000/year (pay-per-use) | $500,000 |

Key Insight: Cloud is 3x cheaper than traditional databases, even though it “looks” more expensive per hour ($10/hour sounds expensive, but you only run it when needed).

27.4 The Five Vs of Big Data in IoT

Understanding IoT Data Scale
Diagram showing the five characteristics of big data: Volume (scale of data), Velocity (speed of data), Variety (different forms of data), Veracity (uncertainty of data), and Value (actionable insights)
Figure 27.5: The Five Vs of Big Data: Volume, Velocity, Variety, Veracity, Value

27.4.1 Volume: The Scale Challenge

| IoT Domain | Data Generated | Storage Challenge |
|------------|----------------|-------------------|
| Smart city (1M sensors) | ~1 PB/year | Need distributed storage |
| Autonomous vehicle | ~1 TB/day per car | Edge processing essential |
| Industrial IoT factory | ~1 GB/hour per line | Time-series databases |
| SKA Telescope | 12 Exabytes/year | Largest scientific data source |

27.4.2 Case Study: Square Kilometre Array (SKA)

Artist rendering of the Square Kilometre Array radio telescope installation with thousands of antenna dishes spread across the landscape
Figure 27.6: Square Kilometre Array radio telescope

The SKA is the world’s largest IoT project:

| Metric | Value | Comparison |
|--------|-------|------------|
| Antennas | 130,000+ | More sensors than most smart cities |
| Raw data rate | 12 Exabytes/year | ~33 PB/day of raw sensor data |
| After processing | 300 PB/year | Still massive |
| Network bandwidth | 100 Gbps | Equivalent to 1M home connections |

Key insight: Even with extreme compression and filtering, SKA generates 300 PB/year. IoT data volume grows faster than storage capacity!

27.4.3 Velocity: Real-Time Requirements

| Application | Data Rate | Latency Requirement |
|-------------|-----------|---------------------|
| Smart meter | 1 reading/15 min | Minutes OK |
| Traffic sensor | 10 readings/sec | Seconds |
| Industrial vibration | 10,000 samples/sec | Milliseconds |
| Autonomous vehicle | 1 GB/sec | <100ms or crash |

27.4.4 Variety: Heterogeneous Data

IoT generates diverse data types that must be integrated:

  • Structured: Sensor readings (temperature: 22.5 degrees C)
  • Semi-structured: JSON logs, MQTT messages
  • Unstructured: Camera images, audio streams
  • Time-series: Continuous sensor streams
  • Geospatial: GPS coordinates, location data

27.4.5 Veracity: Data Quality Issues

| Issue | IoT Example | Mitigation |
|-------|-------------|------------|
| Sensor drift | Temperature offset over time | Regular calibration |
| Missing data | Network packet loss | Interpolation, redundancy |
| Outliers | Spike from electrical noise | Statistical filtering |
| Duplicates | Retry on timeout | Deduplication at ingestion |
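
Two of these mitigations, deduplication and statistical outlier filtering, can be sketched together in Python. This toy version uses a median-absolute-deviation filter; production pipelines would use per-sensor baselines, calibration models, and windowed statistics instead of one global filter:

```python
from statistics import median

def clean_readings(readings, mad_max=3.5):
    """Deduplicate, then drop outliers by median absolute deviation (MAD).

    A toy veracity pipeline: mad_max is the cutoff ratio |v - median| / MAD
    above which a reading is treated as noise.
    """
    # Deduplication at ingestion: drop exact repeats (e.g., retries on timeout)
    seen, unique = set(), []
    for reading in readings:
        if reading not in seen:
            seen.add(reading)
            unique.append(reading)
    # Statistical filtering: flag values far from the robust centre
    values = [value for _, value in unique]
    centre = median(values)
    mad = median(abs(v - centre) for v in values)
    if mad == 0:
        return unique
    return [(t, v) for t, v in unique if abs(v - centre) / mad <= mad_max]

raw = [(1, 22.1), (2, 22.3), (2, 22.3),   # (2, 22.3) repeated: retry on timeout
       (3, 22.0), (4, 98.7),              # 98.7: spike from electrical noise
       (5, 22.2)]
print(clean_readings(raw))  # duplicate and spike removed
```

A median-based filter is used rather than a mean/standard-deviation z-score because a single large spike inflates the standard deviation enough to hide itself.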

27.4.6 Value: Extracting Actionable Insights

The ultimate goal of collecting IoT data is extracting value – turning raw sensor streams into decisions. Without value extraction, the other four V’s are just expensive storage problems.

| IoT Domain | Raw Data | Extracted Value |
|------------|----------|-----------------|
| Smart agriculture | Soil moisture readings every 10 min | Automated irrigation schedules saving 30% water |
| Predictive maintenance | Vibration sensor streams at 10 kHz | Machine failure prediction 48 hours in advance |
| Smart grid | 100M smart meter readings/day | Load balancing reducing peak demand by 15% |
| Connected health | Continuous heart rate monitoring | Early cardiac event detection saving lives |

Comparison diagram showing 5 Vs big data framework
Figure 27.7: Five Vs of Big Data: Volume Velocity Variety Veracity Value

The 5 V’s Framework: Volume, Velocity, Variety, and Veracity characteristics of IoT data must be processed effectively to extract Value through big data technologies.

27.5 Self-Check Questions

Before continuing, make sure you understand:

  1. What makes data “big”? (Answer: Volume, Velocity, Variety–too much, too fast, too varied for normal databases)
  2. Why can’t we just use a bigger regular database? (Answer: Even the biggest single computer can’t handle petabytes and real-time streaming)
  3. What’s the most important ‘V’ for IoT? (Answer: Velocity–IoT data arrives constantly and needs real-time processing)
Common Misconception: “More Storage Solves Big Data Problems”

The Myth: “Our IoT system generates 10 TB/day. We just need bigger hard drives and more powerful servers to handle it.”

Why It’s Wrong: Big data isn’t primarily a storage problem–it’s a velocity and processing problem. Here’s the reality:

Real-World Example: Smart City Traffic Cameras

A city installs 10,000 traffic cameras, each generating 30 fps x 2 MB/frame x 86,400 seconds/day ≈ 5.2 TB/day per camera.

| Approach | Storage Cost | Network Bandwidth | Processing Time | Feasibility |
|----------|--------------|-------------------|-----------------|-------------|
| “Just store it all” | ~$36M/month (accumulates 1.56 EB/month at $0.023/GB) | ~4.8 Tbps continuous | Years per query | Impossible |
| Big data approach (edge processing) | $84/year (10 GB/day x $0.023/GB x 365) | 1 Mbps average | Seconds per query | Production reality |

The Real Problem: Without big data architecture:

  • Velocity: 52 PB/day arrives too fast for any single system to process (~600 GB/sec sustained)
  • Query latency: finding “show all red light violations yesterday” would scan 52 PB (over 16 years at 100 MB/s single-disk read speed)
  • Network impossibility: transferring 52 PB/day requires ~4.8 Tbps sustained bandwidth - far beyond any city network
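
The headline numbers in this example can be reproduced in a few lines (MB, GB, and TB are treated as decimal powers of ten throughout):

```python
fps, frame_mb, cameras = 30, 2, 10_000

per_camera_gb_day = fps * frame_mb * 86_400 / 1_000    # MB/day -> GB/day
city_pb_day = per_camera_gb_day * cameras / 1_000_000  # GB/day -> PB/day
print(f"{per_camera_gb_day / 1_000:.1f} TB/day per camera")  # ~5.2
print(f"{city_pb_day:.0f} PB/day citywide")                  # ~52

bytes_per_day = fps * frame_mb * 1e6 * 86_400 * cameras
tbps = bytes_per_day * 8 / 86_400 / 1e12                # sustained bit rate
print(f"{tbps:.1f} Tbps sustained")                          # ~4.8

scan_years = bytes_per_day / 100e6 / 86_400 / 365       # one disk at 100 MB/s
print(f"{scan_years:.0f} years to scan one day of footage")  # ~16
```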

Remember: Big data = distributed processing + smart filtering, not just big storage.

Worked Example: Scaling a Smart Building Platform

A smart building company monitors 50,000 sensors across 500 buildings. The current system, a PostgreSQL database on AWS RDS (db.r5.4xlarge, $3,456/month), handles 5,000 inserts/second. After winning 3 new contracts, the sensor count will grow to 200,000 (20,000 inserts/second). Should the company scale vertically (a bigger PostgreSQL instance) or horizontally (a distributed big data stack)?

Option 1: Vertical Scaling (Bigger PostgreSQL)

| Server Size | vCPUs | RAM | Storage IOPS | Write Capacity | Monthly Cost |
|-------------|-------|-----|--------------|----------------|--------------|
| db.r5.4xlarge (current) | 16 | 128 GB | 10,000 | ~5,000 writes/s | $3,456 |
| db.r5.8xlarge | 32 | 256 GB | 20,000 | ~10,000 writes/s | $6,912 |
| db.r5.16xlarge | 64 | 512 GB | 40,000 | ~18,000 writes/s | $13,824 |
| db.r5.24xlarge | 96 | 768 GB | 80,000 | ~22,000 writes/s | $20,736 |

For 20,000 writes/s, need db.r5.24xlarge: $20,736/month

Challenges:

  • Single point of failure (all data processing on one instance)
  • Limited future scalability (24xlarge is the maximum size)
  • Backup window grows from 1 hour to 6+ hours (768 GB database)
  • Query latency increases (scanning 2 years of data = 126 billion rows)

Option 2: Horizontal Scaling (Kafka + Spark + Cassandra)

Architecture:

  • Kafka (ingestion): 3 brokers (m5.2xlarge) = $1,382/month
  • Spark (processing): 10 workers (r5.xlarge) = $1,920/month
  • Cassandra (storage): 6 nodes (i3.2xlarge) = $4,992/month
  • Total: $8,294/month

Capacity:

  • Kafka: 100,000+ writes/second (20x headroom)
  • Spark: Processes 1M events/second (50x headroom)
  • Cassandra: Linear scalability (add nodes for more writes)

Comparison:

| Metric | PostgreSQL (Vertical) | Kafka+Spark+Cassandra (Horizontal) |
|--------|-----------------------|-------------------------------------|
| Monthly Cost | $20,736 | $8,294 (60% cheaper) |
| Write Capacity | 22,000/s (barely sufficient) | 100,000/s (5x headroom) |
| Scalability | Maxed out (no larger instance) | Add nodes as load grows |
| Fault Tolerance | Single point of failure | Multi-node replication |
| Query Latency | Seconds (full table scans) | Milliseconds (partitioned) |
| Backup Window | 6+ hours | Continuous (no downtime) |

ROI Calculation:

  • 3-year cost savings: ($20,736 - $8,294) × 36 = $447,912
  • Migration cost (one-time): ~$150,000 (engineering + testing)
  • Net savings: $297,912 over 3 years
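
The ROI arithmetic in a few lines, including the implied payback period (a derived figure):

```python
postgres_monthly = 20_736   # db.r5.24xlarge (vertical option)
bigdata_monthly = 8_294     # Kafka + Spark + Cassandra (horizontal option)
migration_cost = 150_000    # one-time engineering + testing
months = 36                 # 3-year horizon

monthly_delta = postgres_monthly - bigdata_monthly
savings = monthly_delta * months
net = savings - migration_cost
payback_months = migration_cost / monthly_delta
print(f"3-year savings ${savings:,}, net ${net:,}, "
      f"payback in ~{payback_months:.0f} months")
```

The migration pays for itself in roughly a year, after which the monthly delta is pure savings.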

Key Insight: Beyond 10,000 writes/second or 1 TB database size, distributed big data systems become both cheaper and more capable than vertical scaling. The laptop test fails - workload requires distributed architecture.

| Factor | Traditional Database (PostgreSQL/MySQL) | Big Data Stack (Kafka/Spark/Cassandra) | Threshold to Switch |
|--------|------------------------------------------|-----------------------------------------|---------------------|
| Write Throughput | 10,000-50,000 writes/sec (single node) | 100,000-1M+ writes/sec (cluster) | Switch when >50,000 writes/sec sustained |
| Data Volume | 100 GB - 10 TB (on largest instances) | 10 TB - Petabytes (add nodes) | Switch when >5 TB or growing >1 TB/year |
| Query Patterns | OLTP (transactions), complex JOINs | OLAP (analytics), time-series scans | Switch when <10% queries use JOINs |
| Growth Rate | 2-5x over 5 years | 10-100x over 5 years | Switch when growth >10x in 3 years |
| Team Skills | Standard SQL, most developers know it | Distributed systems expertise needed | Can delay switch if no big data team |
| Cost at Scale | Exponential (larger instances 2-3x $/GB) | Linear (add commodity nodes at same $/GB) | Switch when cost >$10K/month and growing |
| Latency Tolerance | Sub-second (OLTP requirements) | Seconds to minutes (batch analytics) | Traditional DB if <100ms latency required |

The “Laptop Test” for Decision Making:

  1. Can this workload run on a laptop with 16 GB RAM?
    • Yes: Traditional database is fine (PostgreSQL/MySQL)
    • No: Consider big data stack
  2. Will this workload fit on a laptop in 2 years?
    • Yes: Traditional database
    • No: Invest in big data now (avoid painful migration later)
  3. Does your team have distributed systems expertise?
    • No: Delay big data until you hit clear limits (>50K writes/s, >5 TB, >$15K/month)
    • Yes: Adopt big data proactively at lower thresholds

Migration Trigger Events:

  • Database costs >$10K/month and growing >20%/year → Economics favor distributed
  • Write latency >500ms during peak hours → Single server saturated
  • Backup windows exceeding 4 hours → Data volume too large
  • Query timeouts during analytics → Need parallel processing
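
These thresholds and trigger events can be collected into a simple decision helper (the cutoffs are the chapter's heuristics, not hard rules):

```python
def should_migrate(writes_per_sec: float, data_tb: float,
                   monthly_db_cost: float, has_distributed_team: bool) -> str:
    """Apply the rule-of-thumb migration thresholds as one checklist.

    Teams with distributed-systems expertise can justify switching at a
    lower cost threshold; teams without it should wait for clear limits.
    """
    cost_threshold = 10_000 if has_distributed_team else 15_000
    triggers = []
    if writes_per_sec > 50_000:
        triggers.append("sustained writes > 50K/sec")
    if data_tb > 5:
        triggers.append("data volume > 5 TB")
    if monthly_db_cost > cost_threshold:
        triggers.append(f"database cost > ${cost_threshold:,}/month")
    if not triggers:
        return "Stay on a traditional database (PostgreSQL/MySQL)"
    return "Consider a big data stack: " + "; ".join(triggers)

print(should_migrate(5_000, 0.5, 3_456, has_distributed_team=False))
print(should_migrate(60_000, 8, 20_000, has_distributed_team=True))
```

The first call (the smart building company's current load) recommends staying put; the second trips all three triggers.
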
Common Mistake: Premature Big Data Adoption

The Error: A startup with 500 IoT devices (<500 sensor readings/second, 50 GB database) invests $200K in a “big data platform” (Kafka + Spark + Cassandra cluster) because “we might scale to millions of devices someday.”

Reality Check:

  • Current workload: 500 writes/second, 50 GB data
  • PostgreSQL on m5.xlarge: Handles 10,000 writes/sec, 2 TB data for $280/month
  • Big data cluster: 10 nodes totalling $5,000/month + $150K setup = $210,000 first year

Consequences of Premature Adoption:

  1. Operational complexity: Team spends 60% of time managing Kafka partitions, Spark jobs, Cassandra repairs instead of building product features
  2. Debugging nightmare: Simple “count devices by city” query takes 3 hours to debug across 3 systems vs 5 minutes in PostgreSQL
  3. Hiring bottleneck: Can’t find engineers with Kafka/Spark expertise; takes 6 months to hire, onboarding 3+ months
  4. Opportunity cost: $200K could have funded 18 months of product development

Two Years Later:

  • Company grows to 2,000 devices (2,000 writes/sec, 200 GB data)
  • PostgreSQL on m5.2xlarge still works fine ($560/month)
  • Big data cluster: Underutilized, running at 2% capacity, costs $60K/year

The “You Aren’t Google” Rule: Big data tools solve Google-scale problems:

  • Kafka: handles 1M+ messages/second (you have 500/sec)
  • Spark: processes petabyte datasets (you have 50 GB)
  • Cassandra: scales to 1000+ nodes (you need 1 PostgreSQL instance)

Correct Approach - Defer Until Proven Need:

  1. Start with PostgreSQL (or MySQL) - handles 99% of use cases up to $10K/month spend
  2. Add read replicas when queries slow down (still <$2K/month)
  3. Partition tables by time when you hit 1 TB (still PostgreSQL)
  4. Migrate to big data only when:
    • Write throughput consistently exceeds 50K/sec
    • Data volume exceeds 5 TB with >1 TB/year growth
    • Database costs exceed $15K/month and growing

Key Lesson: Big data tools have massive overhead (complexity, cost, expertise). Use traditional databases until you have measurable evidence they fail at your scale. “We might need it someday” costs $200K+ in wasted engineering and infrastructure. Wait for “we need it now” based on actual load tests showing database saturation.

Common Pitfalls

A real-time safety monitoring system prioritises velocity and veracity above volume. An annual energy audit prioritises volume and value. Identify which Vs dominate your use case before designing the architecture.

For IoT deployments under 1 TB with well-defined schemas, a well-tuned PostgreSQL or TimescaleDB instance outperforms a Hadoop cluster in both performance and operational simplicity. Use big data tools only when the scale genuinely requires them.

Collecting more data from unreliable sensors does not improve analytics accuracy — it amplifies noise. Invest in sensor calibration, data validation pipelines, and missing-data detection before scaling collection infrastructure.

A Spark cluster running 24/7 to process hourly sensor batches is wasteful. Use serverless or auto-scaling architectures (AWS EMR, Google Dataproc) that scale to zero between jobs.

27.6 Summary

  • Big data in IoT is characterized by the 5 V’s: Volume (massive scale from billions of devices), Velocity (high-speed streaming data), Variety (structured, semi-structured, and unstructured formats), Veracity (data quality and trustworthiness), and Value (extracting actionable insights).
  • Traditional databases fail at IoT scale because single servers have physical limits–a PostgreSQL server maxes out at ~10,000 writes/second while smart cities need 1,000,000+ writes/second.
  • Horizontal scaling beats vertical scaling at cost and capability: 100 commodity servers cost $50,000 and handle 1M ops/sec, while equivalent vertical scaling would cost $1M+ and still hit physical limits.
  • Cloud big data services provide 3-10x cost reduction over self-hosted options through volume discounts, shared infrastructure, and pay-per-use pricing models.

27.7 What’s Next

If you want to… Read this
Explore specific big data processing technologies Big Data Technologies
Understand end-to-end pipeline design Big Data Pipelines
Apply edge processing to reduce data volume Big Data Edge Processing
See big data in real IoT deployments Big Data Case Studies
Return to the module overview Big Data Overview