32  Big Data Case Studies

In 60 Seconds

Big data case studies demonstrate how real-world IoT deployments like smart cities generate hundreds of gigabytes daily from thousands of sensors, requiring data lake architectures with tiered storage, multi-protocol ingestion, and cross-domain analytics to extract actionable value at manageable cost.

Learning Objectives

After completing this chapter, you will be able to:

  • Apply big data concepts to real-world smart city deployments
  • Design data lake architectures for multi-domain IoT systems
  • Calculate storage, bandwidth, and cost requirements for production systems
  • Use decision frameworks to select appropriate technologies

Key Concepts

  • Lambda architecture: A data processing architecture combining batch processing for accuracy with stream processing for speed, producing results by merging outputs from both layers.
  • Smart city data platform: An integrated IoT data infrastructure aggregating real-time feeds from traffic, utilities, and environmental sensors to enable city-wide optimisation and incident response.
  • Predictive maintenance: Using historical sensor data and ML models to forecast equipment failures before they occur, replacing time-based maintenance schedules with condition-based ones.
  • Data gravity: The tendency of data to attract applications, services, and other data towards it; large IoT data stores generate gravity that shapes where processing and analytics are deployed.
  • Polyglot persistence: Using multiple specialised database types (time-series, document, graph) within a single system, each optimised for a specific data access pattern.
  • Digital twin: A real-time virtual replica of a physical asset or system that ingests live sensor data and is used for simulation, monitoring, and optimisation.

If you have ever wondered how a city with 12,000 sensors manages 222 GB of data every single day, this chapter answers that question with real numbers. We walk through Barcelona’s smart city deployment, calculate exactly how much storage and processing power these systems need, and design a data lake architecture that organizes terabytes of sensor data into hot, warm, and cold storage tiers. No prior experience with big data tools is required – just follow the math and the architecture diagrams.

32.1 Real-World Example: Barcelona Smart City

Barcelona Smart City: Concrete Numbers

Deployment Scale (based on Barcelona’s actual smart city initiative):

  • 1,000 smart parking sensors (detect available parking spaces)
  • 500 smart bins (measure fill level to optimize collection routes)
  • 200 air quality stations (monitor NO2, PM2.5, O3, temperature, humidity)
  • 300 traffic cameras (monitor congestion, count vehicles)
  • 10,000 smart streetlights (adaptive lighting, motion sensors)

Daily Data Generation:

| System | Sensors | Frequency | Size/Reading | Daily Data |
|---|---|---|---|---|
| Parking | 1,000 | 1/min | 50 bytes | 72 MB |
| Smart Bins | 500 | 1/hour | 100 bytes | 1.2 MB |
| Air Quality | 200 x 5 readings | 1/5 min | 80 bytes | 23 MB |
| Traffic Cameras | 300 | 1/sec (metadata) | 200 bytes | 5.2 GB |
| Traffic Cameras | 300 | 1/min (images) | 500 KB | 216 GB |
| Streetlights | 10,000 | 1/min | 60 bytes | 864 MB |
| TOTAL | | | | ~222 GB/day |

Annual Scale: 222 GB/day x 365 days = 81 TB/year

Storage Costs (AWS S3 Standard pricing ~$0.023/GB/month):

  • Raw storage: 81 TB x $23/TB/month = $1,863/month, or roughly $22,356/year once a full year of data has accumulated
  • With 10:1 compression: roughly $2,236/year (more realistic)
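The arithmetic behind these figures is simple enough to script. The sketch below, a minimal Python example using the illustrative sensor counts and payload sizes from the table above (not an official Barcelona dataset), reproduces the daily total, the annual volume, and the S3 Standard storage estimate:

# Reproduce the Barcelona daily-volume and storage-cost estimates.
# Sensor counts and payload sizes are the illustrative figures from the table above.
SECONDS_PER_DAY = 86_400

fleet = [
    # (name, devices, readings per device per day, bytes per reading)
    ("parking",            1_000, SECONDS_PER_DAY // 60,   50),
    ("smart_bins",           500, 24,                      100),
    ("air_quality",      200 * 5, SECONDS_PER_DAY // 300,   80),
    ("camera_metadata",      300, SECONDS_PER_DAY,         200),
    ("camera_images",        300, SECONDS_PER_DAY // 60,   500_000),
    ("streetlights",      10_000, SECONDS_PER_DAY // 60,    60),
]

daily_bytes = 0
for name, devices, readings, size in fleet:
    volume = devices * readings * size
    daily_bytes += volume
    print(f"{name:16s} {volume / 1e9:8.2f} GB/day")

daily_gb = daily_bytes / 1e9
annual_tb = daily_gb * 365 / 1_000
print(f"TOTAL            {daily_gb:8.1f} GB/day  (~{annual_tb:.0f} TB/year)")

# S3 Standard estimate (~$0.023/GB/month) once a full year of raw data is retained
monthly_cost = annual_tb * 1_000 * 0.023
print(f"Storage cost at steady state: ${monthly_cost:,.0f}/month")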

Smart City Data Pipeline Economics: Barcelona’s 10,000 streetlights generate \(10,000 \times 1440 \text{ readings/day} \times 60 \text{ bytes} = 864\text{ MB/day}\). Over 5 years: \(864 \times 365 \times 5 / 1024 = 1.54\text{ TB}\).

Storage cost at \(\$0.023/\text{GB/month}\): \(1540 \times 0.023 = \$35.42/\text{month}\) or \(\$425/\text{year}\). But energy savings from adaptive lighting: 30% reduction on 10,000 lamps × 100W × 12 hours/day × \(\$0.12/\text{kWh}\) = \((10000 \times 0.1 \times 12 \times 0.12 \times 0.3 \times 365) = \$157,680\) annual savings.

ROI calculation: Annual cloud cost (storage + compute) ≈ \(\$15,000\). Annual savings: \(\$1M+\) (energy + maintenance optimization). ROI = \((1,000,000 - 15,000) / 15,000 = 65.67\), meaning every dollar spent on cloud infrastructure returns \(\$66\) in operational savings. The data platform pays for itself in 5.5 days.
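The same back-of-the-envelope economics can be scripted. A minimal sketch using the figures above (30% savings on 10,000 lamps at 100 W, 12 hours/day, $0.12/kWh, and the assumed $15,000/year cloud bill and $1M/year total savings):

# Streetlight energy savings and platform ROI, using the assumptions stated above.
lamps, lamp_kw, hours_per_day = 10_000, 0.100, 12
price_per_kwh, savings_fraction = 0.12, 0.30

annual_energy_savings = (lamps * lamp_kw * hours_per_day
                         * price_per_kwh * savings_fraction * 365)
print(f"Energy savings: ${annual_energy_savings:,.0f}/year")    # ~$157,680

annual_cloud_cost = 15_000          # assumed storage + compute
annual_total_savings = 1_000_000    # assumed energy + maintenance optimization

roi = (annual_total_savings - annual_cloud_cost) / annual_cloud_cost
payback_days = annual_cloud_cost / (annual_total_savings / 365)
print(f"ROI: {roi:.1f}x   payback: {payback_days:.1f} days")     # ~65.7x, ~5.5 days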

Processing Challenges:

  • Real-time traffic analysis: Must process 300 camera feeds simultaneously
  • Anomaly detection: Identify broken sensors among 12,000+ devices
  • Predictive maintenance: Forecast when streetlights or bins need service
  • Data integration: Combine parking + traffic data to optimize flow

Business Value:

  • Parking optimization saved citizens 230,000+ hours/year searching for parking
  • Waste collection efficiency reduced fuel costs by 30% through optimized routes
  • Air quality monitoring enabled targeted traffic restrictions on high-pollution days
  • Smart lighting reduced energy consumption by 30% ($1M+ annual savings)

The Big Data Components in Action:

  1. Volume: 81 TB/year requires distributed storage (HDFS or cloud object storage)
  2. Velocity: Traffic cameras generate 5.2 GB/day of metadata requiring stream processing
  3. Variety: Numeric (sensors), images (cameras), logs (system events), geographic (GPS)
  4. Veracity: ~5% of sensors have occasional failures requiring automated validation
  5. Value: $1M+ annual savings from energy alone, plus citizen time savings and better air quality

This demonstrates how even a mid-sized smart city creates big data challenges requiring specialized technologies beyond traditional databases.

32.2 Interactive: IoT Data Volume Calculator

Adjust the sensor parameters below to see how data volume scales with different IoT deployments. Compare your custom deployment against Barcelona’s 222 GB/day benchmark.

32.3 Understanding Big Data Scale

Imagine 1 Billion Sensors

The Challenge: Picture this scenario to understand the scale of IoT big data:

  • 1 billion sensors deployed worldwide (smart meters, traffic cameras, air quality monitors)
  • Each sensor sends 1 reading per second (temperature, humidity, location, etc.)
  • Each reading is just 100 bytes (timestamp + sensor ID + value)

The Math:

1,000,000,000 sensors x 1 reading/second x 100 bytes = 100 GB/second
100 GB/second x 60 seconds x 60 minutes x 24 hours = 8,640 TB per day
8,640 TB per day x 365 days = 3,153,600 TB per year (~3,154 PB or ~3.15 EB/year)

That is over 3,000 PETABYTES (3+ exabytes) per year from a single simple IoT network!
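For readers who want to check the arithmetic, the same calculation as a three-line Python sketch:

# 1 billion sensors, 1 reading/second, 100 bytes per reading
sensors, bytes_per_reading = 1_000_000_000, 100
per_second_gb = sensors * bytes_per_reading / 1e9      # 100 GB/s
per_day_tb = per_second_gb * 86_400 / 1_000            # 8,640 TB/day
per_year_pb = per_day_tb * 365 / 1_000                 # ~3,154 PB/year (~3.15 EB)
print(per_second_gb, per_day_tb, per_year_pb)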

Why This Matters:

  • Traditional single-node databases like MySQL can’t handle this scale (analytical workloads typically max out at a few TB)
  • You can’t store this on a single hard drive (the largest commodity drives are on the order of 20 TB)
  • Processing requires distributed systems like Hadoop or Spark running on hundreds of machines
  • Network bandwidth becomes critical: 100 GB/second = 800 Gbps, enough to saturate 800 gigabit internet connections

The “Big Data” Part:

  • Volume: 3+ EB/year is too big for traditional systems
  • Velocity: 100 GB/second incoming data rate requires real-time processing
  • Variety: Sensors produce temperature (numeric), GPS (coordinates), images (binary), logs (text)
  • Veracity: Some sensors malfunction - how do you detect and filter bad data at this scale?
  • Value: The goal is actionable insights, not just hoarding data

This is why IoT needs big data technologies - regular tools simply break at this scale.

32.4 What Would Happen If: Disaster Scenarios

What If You Try Processing 1 TB on a Single Laptop?

Scenario: You work for a smart building company. Your boss asks you to analyze 1 TB of temperature sensor data from 1,000 buildings over the past year using your laptop.

What Happens:

Figure 32.1: Single laptop versus distributed system processing comparison — a 16 GB laptop attempting to process 1 TB of data versus a 10-node Spark cluster, contrasting memory limits, processing times, and the performance advantage of distributed systems.

32.4.1 Attempt 1: Load Everything Into Memory (RAM)

Result: CRASH

  • Your laptop has 16 GB RAM; 1 TB = 1,024 GB, so you are 64x over capacity
  • Python crashes with MemoryError
  • Time wasted: ~10 minutes trying to load the data before the crash

32.4.2 Attempt 2: Process Chunk by Chunk

Result: SLOW BUT WORKS (BARELY)

  • A laptop SSD reads at ~200 MB/second
  • 1 TB / 0.2 GB/s = 5,000 seconds = 83 minutes just to read the data
  • Processing adds another 2-3x on top of the read time: 4-6 hours total
  • Your laptop is unusable for other tasks while the job runs
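Chunked processing looks roughly like the sketch below: pandas reads a fixed-size slice, aggregates it, and discards it before loading the next one, so only one chunk is ever resident in RAM. The file and column names are hypothetical.

import pandas as pd

# Hypothetical 1 TB CSV of building temperature readings, processed in
# 1M-row chunks so memory use stays bounded.
running_sum, running_count = 0.0, 0

for chunk in pd.read_csv("building_temperatures.csv", chunksize=1_000_000):
    running_sum += chunk["temperature_c"].sum()
    running_count += len(chunk)

print(f"Mean temperature across all readings: {running_sum / running_count:.2f} C")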

If you had used a distributed system instead:

  • 10-node Spark cluster (each with 32 GB RAM, 16 cores)
  • Reads 1 TB in parallel: 1024 GB / 10 nodes = 102.4 GB per node
  • Read time: 102.4 GB / 0.2 GB/s = ~8.5 minutes
  • Processing time with parallelism: 15-20 minutes total
  • 18x faster than laptop, and you can keep working!
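The equivalent distributed job is only a few lines of PySpark; each of the 10 executors reads and aggregates its own slice of the dataset in parallel. The path and column names here are hypothetical.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("building-temps").getOrCreate()

# Each executor reads and aggregates its own partition of the 1 TB dataset.
df = spark.read.parquet("s3://smart-buildings/temperature-readings/")

summary = (df.groupBy("building_id")
             .agg(F.avg("temperature_c").alias("avg_temp"),
                  F.max("temperature_c").alias("max_temp")))

summary.show()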

32.4.3 The Lesson: Scale Matters

| Approach | Time | Cost | When to Use |
|---|---|---|---|
| Laptop (chunks) | 4-6 hours | $0 (your time) | < 10 GB datasets, one-off analysis |
| Distributed System (Spark) | 15-20 min | $5-10/run | > 100 GB datasets, frequent analysis |
| Time-Series DB (InfluxDB) | 5-10 min | $20-50/month hosting | IoT sensor data, real-time queries |
| MySQL (traditional DB) | Hours/timeout | $0-100/month | < 1 GB transactional data, not analytics |

The Big Data Rule: “If processing your data takes longer than drinking a cup of coffee (>20 minutes), you need distributed or specialized systems.”

When Do You Actually Need Big Data Tools?

  • Volume: > 100 GB of data (laptop RAM limitations)
  • Velocity: > 1 MB/second incoming data (real-time processing needed)
  • Variety: > 3 different data types (structured + unstructured + images)
  • Query Frequency: > 10 queries/day (need optimized storage)

If you meet 2+ of these criteria, traditional tools will cause pain. Time to learn Spark, Hadoop, or time-series databases!
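One hedged way to encode this rule of thumb is a small checklist function; the thresholds are the ones listed above, not universal constants.

def needs_big_data_tools(data_gb, ingest_mb_per_s, data_types, queries_per_day):
    """Return True if 2+ of the rule-of-thumb criteria above are met."""
    criteria = [
        data_gb > 100,          # Volume
        ingest_mb_per_s > 1,    # Velocity
        data_types > 3,         # Variety
        queries_per_day > 10,   # Query frequency
    ]
    return sum(criteria) >= 2

# Barcelona-scale example: ~81 TB accumulated, ~2.6 MB/s average ingest, 4 data types
print(needs_big_data_tools(data_gb=81_000, ingest_mb_per_s=2.6,
                           data_types=4, queries_per_day=100))   # True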

32.5 Worked Example: IoT Data Lake Architecture

32.6 Worked Example: Smart City Data Lake for 50,000 Sensors

Scenario: A metropolitan city is deploying a smart city initiative with 50,000 IoT sensors across multiple domains: traffic monitoring (15,000 sensors), air quality (5,000 sensors), smart lighting (20,000 lights), water management (3,000 sensors), and public safety cameras (7,000 devices). The city needs a unified data platform that enables cross-domain analytics while managing costs and supporting future growth.

Goal: Design a data lake architecture that ingests 2 TB/day of sensor data, supports both real-time dashboards and historical analytics, maintains data governance for sensitive information, and keeps monthly costs under $15,000.

What we do: Design ingestion pipelines tailored to each sensor type’s data characteristics.

Data Volume Analysis:

traffic_sensors:
  count: 15000
  frequency: every_5_seconds
  payload_size: 200_bytes
  daily_volume: 51.8 GB

air_quality:
  count: 5000
  frequency: every_minute
  payload_size: 500_bytes
  daily_volume: 3.6 GB

smart_lighting:
  count: 20000
  frequency: every_15_minutes
  payload_size: 650_bytes
  daily_volume: 1.2 GB

water_management:
  count: 3000
  frequency: every_30_seconds
  payload_size: 300_bytes
  daily_volume: 2.6 GB

cameras:
  count: 7000
  metadata_only: true
  events_per_hour: 50000
  daily_volume: 8.5 GB

total_daily_ingestion: ~67.7 GB structured data
# Plus 1.8 TB camera footage (separate cold storage)

Ingestion Architecture:

  • MQTT broker for low-power sensors (air quality, water) - batch to Kafka
  • Kafka direct for high-frequency sensors (traffic) - 64 partitions, 3x replication
  • HTTP API for smart lighting controllers - rate limited 10K/minute
  • gRPC for camera event streaming

Why: Different sensor types have vastly different requirements. Traffic sensors need low-latency Kafka ingestion for real-time congestion alerts. Battery-powered air quality sensors use MQTT for power efficiency.
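As an illustration of the MQTT-to-Kafka path for low-power sensors, a minimal bridge might look like the sketch below, using the paho-mqtt and kafka-python libraries; the broker addresses and topic names are assumptions, not part of the city's actual deployment.

import json
from paho.mqtt import client as mqtt
from kafka import KafkaProducer

# Kafka producer feeding the downstream stream-processing layer
producer = KafkaProducer(
    bootstrap_servers="kafka.smartcity.internal:9092",   # assumed address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def on_message(client, userdata, msg):
    # Forward each MQTT reading to the matching Kafka topic,
    # e.g. sensors/air_quality/station-42 -> iot.air_quality
    domain = msg.topic.split("/")[1]
    reading = json.loads(msg.payload)
    producer.send(f"iot.{domain}", value=reading)

mqtt_client = mqtt.Client()
mqtt_client.on_message = on_message
mqtt_client.connect("mqtt.smartcity.internal", 1883)      # assumed broker
mqtt_client.subscribe("sensors/#")                         # all low-power sensors
mqtt_client.loop_forever()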

What we do: Implement a three-tier storage architecture (hot/warm/cold) optimized for access patterns.

Tiered Storage Design:

hot_tier:
  storage_class: S3_STANDARD
  data_age: 0-7_days
  access_pattern: frequent_queries
  format: Delta Lake (Parquet with ACID)
  partitioning: date/sensor_type/region
  expected_size: 500_GB
  monthly_cost: $11.50

warm_tier:
  storage_class: S3_INTELLIGENT_TIERING
  data_age: 7-90_days
  access_pattern: weekly_reports
  format: Parquet (compressed)
  expected_size: 6_TB
  monthly_cost: $72

cold_tier:
  storage_class: S3_GLACIER_INSTANT
  data_age: 90_days+
  access_pattern: compliance_audits
  retention: 7_years
  expected_size: 50_TB (growing)
  monthly_cost: $200

Why: 90% of queries access data from the last 7 days. By automatically tiering older data to cheaper storage classes, we reduce storage costs by 80% while maintaining instant access to recent data.
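On AWS, the hot/warm/cold transitions described above can be expressed as an S3 lifecycle policy. A minimal boto3 sketch follows; the bucket name and prefix are assumptions, and GLACIER_IR is the storage-class identifier for Glacier Instant Retrieval.

import boto3

s3 = boto3.client("s3")

# Lifecycle rule implementing the hot (0-7d) / warm (7-90d) / cold (90d+) tiers
# described above, with a ~7-year retention limit for compliance data.
s3.put_bucket_lifecycle_configuration(
    Bucket="smartcity-sensor-lake",          # assumed bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tiered-sensor-data",
                "Status": "Enabled",
                "Filter": {"Prefix": "sensor_readings/"},
                "Transitions": [
                    {"Days": 7,  "StorageClass": "INTELLIGENT_TIERING"},
                    {"Days": 90, "StorageClass": "GLACIER_IR"},
                ],
                "Expiration": {"Days": 7 * 365},   # delete after ~7 years
            }
        ]
    },
)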

What we do: Configure query engines and optimize for common access patterns.

Query Engine Selection:

real_time_dashboards:
  engine: Apache Druid
  latency_target: <1_second
  data: last_24_hours
  use_cases:
    - Traffic congestion heatmap
    - Air quality alerts
    - Lighting status dashboard

historical_analytics:
  engine: Spark SQL / Databricks
  latency_target: <30_seconds
  data: any_time_range
  use_cases:
    - Weekly traffic pattern analysis
    - Seasonal air quality trends
    - Energy consumption reports

Materialized Views for Common Queries:

-- Pre-aggregated hourly summaries (reduce query cost 50x)
CREATE MATERIALIZED VIEW hourly_traffic_summary AS
SELECT
    date_trunc('hour', timestamp) as hour,
    region,
    COUNT(*) as reading_count,
    AVG(measurements['vehicle_count']) as avg_vehicles,
    MAX(measurements['vehicle_count']) as peak_vehicles
FROM smart_city.sensor_readings
WHERE sensor_type = 'traffic'
GROUP BY 1, 2;

Why: Smart city dashboards are accessed continuously. Pre-computing hourly aggregates reduces query latency from 30 seconds to under 1 second while cutting compute costs by 99%.

What we do: Implement access controls, data classification, and lineage tracking.

Data Classification Schema:

public:
  description: Anonymized aggregate statistics
  examples:
    - Hourly traffic counts by region
    - Daily air quality index
  access: Anyone (open data portal)

internal:
  description: Operational sensor data
  examples:
    - Individual sensor readings
    - Equipment status
  access: City employees with data access training

sensitive:
  description: Data that could identify individuals
  examples:
    - Camera footage
    - License plate readings
  access: Authorized personnel only
  encryption: At rest and in transit

Why: Smart city data includes sensitive information (camera footage, location patterns). Column-level access controls ensure traffic analysts cannot accidentally access camera data.

What we do: Design schema management that accommodates new sensor types.

Schema Registry Configuration:

# Base sensor schema with evolution support
sensor_schema_v1 = {
    "type": "record",
    "name": "SensorReading",
    "fields": [
        {"name": "sensor_id", "type": "string"},
        {"name": "timestamp", "type": "long"},
        {"name": "measurements", "type": {"type": "map", "values": "double"}},
        {"name": "metadata", "type": ["null", {"type": "map", "values": "string"}],
         "default": None}
    ]
}

# Evolution rules
compatibility_config = {
    "compatibility": "BACKWARD",
    "allowed_changes": [
        "add_optional_field",
        "add_field_with_default"
    ]
}

Why: Cities constantly add new sensor types (flood sensors, noise monitors, EV chargers). Using a flexible measurements map allows adding sensor types without modifying table schemas.
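For example, a backward-compatible v2 that adds an optional firmware-version field (the field name is purely illustrative) only needs a default value so that records written with v1 can still be read:

# Backward-compatible evolution: a new optional field with a default,
# so consumers using v2 can still read records written with v1.
sensor_schema_v2 = {
    "type": "record",
    "name": "SensorReading",
    "fields": [
        {"name": "sensor_id", "type": "string"},
        {"name": "timestamp", "type": "long"},
        {"name": "measurements", "type": {"type": "map", "values": "double"}},
        {"name": "metadata", "type": ["null", {"type": "map", "values": "string"}],
         "default": None},
        # New in v2 (illustrative): optional firmware version, defaults to null
        {"name": "firmware_version", "type": ["null", "string"], "default": None},
    ]
}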

Outcome: A production data lake processing 2 TB/day from 50,000 sensors with sub-second dashboard queries and monthly costs of $12,400 (under budget).

Architecture Summary:

50,000 Sensors
     |
     +-- MQTT Broker (low-power) --+
     +-- Kafka (high-frequency) ---+---> Spark Streaming
     +-- HTTP API (controllers) ---+      |
     +-- gRPC (camera events) -----+      v
                                   Delta Lake (S3)
                                        |
                     +------------------+------------------+
                     |                  |                  |
                  Hot Tier          Warm Tier         Cold Tier
                  (7 days)          (90 days)         (7 years)
                  500 GB            6 TB              50+ TB
                     |                  |                  |
                     v                  v                  v
                  Druid            Spark SQL          Glacier
               (Real-time)        (Analytics)       (Compliance)

Cost Breakdown:

| Component | Monthly Cost |
|---|---|
| S3 Storage (all tiers) | $283.50 |
| Kafka (MSK) | $2,400 |
| Spark (Databricks) | $4,800 |
| Druid (real-time) | $3,200 |
| Schema Registry | $200 |
| Data Transfer | $1,500 |
| Total | $12,383.50 |

Key Decisions Made:

  1. Multi-protocol ingestion: Match ingestion to sensor constraints
  2. Delta Lake format: ACID transactions prevent corruption
  3. Three-tier storage: 80% cost reduction vs single-tier
  4. Materialized views: 99% query cost reduction
  5. Schema-flexible measurements map: Add sensor types without migrations
  6. Column-level access controls: Compliance with privacy requirements

32.7 Interactive: Tiered Storage Cost Comparison

Compare the cost of storing all IoT data in a single tier versus using the hot/warm/cold tiered approach described in the data lake architecture above. Adjust the daily ingestion rate and see how tiering reduces costs over time.
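If the interactive widget is not available, the comparison can be approximated with a short script. This is a sketch under the pricing assumptions used earlier in the chapter (~$0.023/GB/month for S3 Standard, ~$0.012 for Intelligent-Tiering, ~$0.004 for Glacier Instant Retrieval):

# Single-tier vs hot/warm/cold storage cost after N days of ingestion.
# Per-GB-month prices are the approximate figures used earlier in this chapter.
HOT, WARM, COLD = 0.023, 0.012, 0.004

def monthly_cost(daily_gb, days, tiered=True):
    total_gb = daily_gb * days
    if not tiered:
        return total_gb * HOT
    hot_gb = daily_gb * min(days, 7)
    warm_gb = daily_gb * max(min(days, 90) - 7, 0)
    cold_gb = daily_gb * max(days - 90, 0)
    return hot_gb * HOT + warm_gb * WARM + cold_gb * COLD

for days in (30, 180, 365):
    single = monthly_cost(67.7, days, tiered=False)
    tiered = monthly_cost(67.7, days, tiered=True)
    print(f"day {days:3d}: single-tier ${single:8,.0f}/mo   tiered ${tiered:8,.0f}/mo")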

32.9 Videos

Figure: Four-layer IoT data architecture — Sensing Layer (RFID tags, intelligent sensors, RFID readers, BLE devices, WSNs), Network Layer (WSN, Internet, mobile network, database, WLAN), Service Layer (service division, service integration, service bus, business logic, service repository), and Interface Layer (application frontend, contract, interface, application API), with security spanning all layers.

Source: NPTEL Internet of Things Course, IIT Kharagpur - Illustrates the complete IoT data pipeline from physical sensors through network transport and cloud services to end-user applications, emphasizing how big data flows through each architectural layer.

The Sensor Squad went on a field trip to visit a real smart city!

“Welcome to Barcelona!” said Max the Microcontroller. “This city has over 12,000 sensors – parking sensors, trash bin sensors, air quality monitors, traffic cameras, and smart streetlights!”

Sammy the Sensor was amazed. “How much data is that?”

“About 222 gigabytes every single day,” Max replied. “That is like taking 44,000 photos on your phone – EVERY DAY!”

Lila the LED pointed at a smart parking sensor in the ground. “What does that one do?”

“It detects if a car is parked on top of it,” explained Max. “There are 1,000 of them! When you drive into the city, an app tells you exactly where to find an empty parking spot. It saves people over 230,000 hours a year of driving around looking for parking!”

Bella the Battery pointed at a smart trash bin. “And that one?”

“It tells the garbage trucks how full it is! Instead of driving to every bin every day, the trucks only visit the full ones. They save 30 percent on fuel!”

“But where does all this data GO?” Sammy asked.

“Great question! The city uses something called a DATA LAKE,” Max said. “Think of it like a giant library with three rooms: a HOT room for this week’s data that people look at all the time, a WARM room for last month’s data, and a COLD room for really old data stored in the basement. The hot room costs the most to run, so old data moves to cheaper rooms automatically.”

“Smart cities are like having thousands of us working together!” Bella said proudly.

Key lesson: Smart cities use thousands of IoT sensors to save money, time, and energy. They organize their massive data into tiers – like hot, warm, and cold storage – to keep costs manageable!

Key Takeaway

Smart city deployments demonstrate that IoT big data is not just a technology challenge but an architecture and economics problem. The combination of multi-protocol ingestion, tiered storage, materialized views, and schema-flexible design enables cities to process terabytes daily while keeping costs under control. Start with clear business value targets (like Barcelona’s parking optimization saving 230,000 hours/year) and design the architecture to deliver that value efficiently.

Common Pitfalls

  • Storing all sensor readings ‘just in case’ generates petabyte-scale datasets with no query plan to justify the cost. Define the decisions the data must support before designing the collection pipeline.
  • Running analytics jobs that move terabytes of IoT data from storage to compute clusters wastes bandwidth and increases latency. Use compute-near-storage patterns (Spark on HDFS, BigQuery, Redshift Spectrum) or edge pre-aggregation.
  • A single ingestion pipeline for temperature, video, and vibration data will be brittle and hard to scale. Design separate ingestion paths for each data modality, joining them only at the analytics layer.
  • Production incidents often require tracing a faulty reading back to its source sensor, calibration date, and transmission path. Without provenance metadata, root-cause analysis becomes guesswork.

32.10 Summary

  • Barcelona Smart City generates 222 GB/day from 12,000+ sensors, demonstrating how mid-sized deployments create big data challenges requiring 81 TB/year storage with tiered architectures.
  • Scale matters: 1 billion IoT sensors at 1 reading/second generates 3+ EB/year (over 8,600 TB/day), requiring distributed processing across hundreds of machines to handle 100 GB/second incoming data rates.
  • Data lake architecture for 50,000 sensors achieves sub-second queries and $12,400/month costs through multi-protocol ingestion, three-tier storage, materialized views, and schema-flexible design.
  • Decision framework: Use big data tools when data exceeds 100 GB, velocity exceeds 1 MB/second, or you have 3+ data types and frequent queries - otherwise traditional tools work fine.

32.11 Resources

32.11.1 Big Data Frameworks

32.11.2 Time-Series Databases

32.11.3 Cloud Platforms

32.11.4 Books and Papers

  • “Big Data: Principles and Best Practices of Scalable Real-Time Data Systems” by Nathan Marz
  • “Streaming Systems” by Tyler Akidau, Slava Chernyak, Reuven Lax
  • “Designing Data-Intensive Applications” by Martin Kleppmann

32.12 What’s Next

| If you want to… | Read this |
|---|---|
| Understand big data pipeline fundamentals | Big Data Fundamentals |
| Explore processing technologies used in case studies | Big Data Technologies |
| Learn edge processing for distributed deployments | Big Data Edge Processing |
| Apply analytics to real IoT data | Modeling and Inferencing |
| Return to the module overview | Big Data Overview |