32  Big Data Case Studies

In 60 Seconds

Big data case studies demonstrate how real-world IoT deployments like smart cities generate hundreds of gigabytes daily from thousands of sensors, requiring data lake architectures with tiered storage, multi-protocol ingestion, and cross-domain analytics to extract actionable value at manageable cost.

Learning Objectives

After completing this chapter, you will be able to:

  • Apply big data concepts to real-world smart city deployments
  • Design data lake architectures for multi-domain IoT systems
  • Calculate storage, bandwidth, and cost requirements for production systems
  • Use decision frameworks to select appropriate technologies

Key Concepts

  • Lambda architecture: A data processing architecture combining batch processing for accuracy with stream processing for speed, producing results by merging outputs from both layers.
  • Smart city data platform: An integrated IoT data infrastructure aggregating real-time feeds from traffic, utilities, and environmental sensors to enable city-wide optimisation and incident response.
  • Predictive maintenance: Using historical sensor data and ML models to forecast equipment failures before they occur, replacing time-based maintenance schedules with condition-based ones.
  • Data gravity: The tendency of data to attract applications, services, and other data towards it; large IoT data stores generate gravity that shapes where processing and analytics are deployed.
  • Polyglot persistence: Using multiple specialised database types (time-series, document, graph) within a single system, each optimised for a specific data access pattern.
  • Digital twin: A real-time virtual replica of a physical asset or system that ingests live sensor data and is used for simulation, monitoring, and optimisation.

If you have ever wondered how a city with 12,000 sensors manages 222 GB of data every single day, this chapter answers that question with real numbers. We walk through Barcelona’s smart city deployment, calculate exactly how much storage and processing power these systems need, and design a data lake architecture that organizes terabytes of sensor data into hot, warm, and cold storage tiers. No prior experience with big data tools is required – just follow the math and the architecture diagrams.

32.1 Real-World Example: Barcelona Smart City

Barcelona Smart City: Concrete Numbers

Deployment Scale (based on Barcelona’s actual smart city initiative):

  • 1,000 smart parking sensors (detect available parking spaces)
  • 500 smart bins (measure fill level to optimize collection routes)
  • 200 air quality stations (monitor NO2, PM2.5, O3, temperature, humidity)
  • 300 traffic cameras (monitor congestion, count vehicles)
  • 10,000 smart streetlights (adaptive lighting, motion sensors)

Daily Data Generation:

| System | Sensors | Frequency | Size/Reading | Daily Data |
|---|---|---|---|---|
| Parking | 1,000 | 1/min | 50 bytes | 72 MB |
| Smart Bins | 500 | 1/hour | 100 bytes | 1.2 MB |
| Air Quality | 200 x 5 readings | 1/5 min | 80 bytes | 23 MB |
| Traffic Cameras | 300 | 1/sec (metadata) | 200 bytes | 5.2 GB |
| Traffic Cameras | 300 | 1/min (images) | 500 KB | 216 GB |
| Streetlights | 10,000 | 1/min | 60 bytes | 864 MB |
| TOTAL | | | | ~222 GB/day |

Annual Scale: 222 GB/day x 365 days = 81 TB/year

Storage Costs (AWS S3 Standard pricing ~$0.023/GB/month):

  • Raw storage: 81 TB x $23/TB/month = $1,863/month, or roughly $22,356/year once a full year of data has accumulated
  • With 10:1 compression: roughly $2,236/year (more realistic)
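The arithmetic behind these figures is simple enough to script. The sketch below, a minimal Python example using the illustrative sensor counts and payload sizes from the table above (not an official Barcelona dataset), reproduces the daily total, the annual volume, and the S3 Standard storage estimate:

# Reproduce the Barcelona daily-volume and storage-cost estimates.
# Sensor counts and payload sizes are the illustrative figures from the table above.
SECONDS_PER_DAY = 86_400

fleet = [
    # (name, devices, readings per device per day, bytes per reading)
    ("parking",            1_000, SECONDS_PER_DAY // 60,   50),
    ("smart_bins",           500, 24,                      100),
    ("air_quality",      200 * 5, SECONDS_PER_DAY // 300,   80),
    ("camera_metadata",      300, SECONDS_PER_DAY,         200),
    ("camera_images",        300, SECONDS_PER_DAY // 60,   500_000),
    ("streetlights",      10_000, SECONDS_PER_DAY // 60,    60),
]

daily_bytes = 0
for name, devices, readings, size in fleet:
    volume = devices * readings * size
    daily_bytes += volume
    print(f"{name:16s} {volume / 1e9:8.2f} GB/day")

daily_gb = daily_bytes / 1e9
annual_tb = daily_gb * 365 / 1_000
print(f"TOTAL            {daily_gb:8.1f} GB/day  (~{annual_tb:.0f} TB/year)")

# S3 Standard estimate (~$0.023/GB/month) once a full year of raw data is retained
monthly_cost = annual_tb * 1_000 * 0.023
print(f"Storage cost at steady state: ${monthly_cost:,.0f}/month")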

Smart City Data Pipeline Economics: Barcelona’s 10,000 streetlights generate \(10,000 \times 1440 \text{ readings/day} \times 60 \text{ bytes} = 864\text{ MB/day}\). Over 5 years: \(864 \times 365 \times 5 / 1024 = 1.54\text{ TB}\).

Storage cost at \(\$0.023/\text{GB/month}\): \(1540 \times 0.023 = \$35.42/\text{month}\) or \(\$425/\text{year}\). But energy savings from adaptive lighting: 30% reduction on 10,000 lamps × 100W × 12 hours/day × \(\$0.12/\text{kWh}\) = \((10000 \times 0.1 \times 12 \times 0.12 \times 0.3 \times 365) = \$157,680\) annual savings.

ROI calculation: Annual cloud cost (storage + compute) ≈ \(\$15,000\). Annual savings: \(\$1M+\) (energy + maintenance optimization). ROI = \((1,000,000 - 15,000) / 15,000 = 65.67\), meaning every dollar spent on cloud infrastructure returns \(\$66\) in operational savings. The data platform pays for itself in 5.5 days.
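The same back-of-the-envelope economics can be scripted. A minimal sketch using the figures above (30% savings on 10,000 lamps at 100 W, 12 hours/day, $0.12/kWh, and the assumed $15,000/year cloud bill and $1M/year total savings):

# Streetlight energy savings and platform ROI, using the assumptions stated above.
lamps, lamp_kw, hours_per_day = 10_000, 0.100, 12
price_per_kwh, savings_fraction = 0.12, 0.30

annual_energy_savings = (lamps * lamp_kw * hours_per_day
                         * price_per_kwh * savings_fraction * 365)
print(f"Energy savings: ${annual_energy_savings:,.0f}/year")    # ~$157,680

annual_cloud_cost = 15_000          # assumed storage + compute
annual_total_savings = 1_000_000    # assumed energy + maintenance optimization

roi = (annual_total_savings - annual_cloud_cost) / annual_cloud_cost
payback_days = annual_cloud_cost / (annual_total_savings / 365)
print(f"ROI: {roi:.1f}x   payback: {payback_days:.1f} days")     # ~65.7x, ~5.5 days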

Processing Challenges:

  • Real-time traffic analysis: Must process 300 camera feeds simultaneously
  • Anomaly detection: Identify broken sensors among 12,000+ devices
  • Predictive maintenance: Forecast when streetlights or bins need service
  • Data integration: Combine parking + traffic data to optimize flow

Business Value:

  • Parking optimization saved citizens 230,000+ hours/year searching for parking
  • Waste collection efficiency reduced fuel costs by 30% through optimized routes
  • Air quality monitoring enabled targeted traffic restrictions on high-pollution days
  • Smart lighting reduced energy consumption by 30% ($1M+ annual savings)

The Big Data Components in Action:

  1. Volume: 81 TB/year requires distributed storage (HDFS or cloud object storage)
  2. Velocity: Traffic cameras generate 5.2 GB/day of metadata requiring stream processing
  3. Variety: Numeric (sensors), images (cameras), logs (system events), geographic (GPS)
  4. Veracity: ~5% of sensors have occasional failures requiring automated validation
  5. Value: $1M+ annual savings from energy alone, plus citizen time savings and better air quality

This demonstrates how even a mid-sized smart city creates big data challenges requiring specialized technologies beyond traditional databases.

32.2 Interactive: IoT Data Volume Calculator

Adjust the sensor parameters below to see how data volume scales with different IoT deployments. Compare your custom deployment against Barcelona’s 222 GB/day benchmark.

32.3 Understanding Big Data Scale

Imagine 1 Billion Sensors

The Challenge: Picture this scenario to understand the scale of IoT big data:

  • 1 billion sensors deployed worldwide (smart meters, traffic cameras, air quality monitors)
  • Each sensor sends 1 reading per second (temperature, humidity, location, etc.)
  • Each reading is just 100 bytes (timestamp + sensor ID + value)

The Math:

1,000,000,000 sensors x 1 reading/second x 100 bytes = 100 GB/second
100 GB/second x 60 seconds x 60 minutes x 24 hours = 8,640 TB per day
8,640 TB per day x 365 days = 3,153,600 TB per year (~3,154 PB or ~3.15 EB/year)

That is over 3,000 PETABYTES (3+ exabytes) per year from a single simple IoT network!
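For readers who want to check the arithmetic, the same calculation as a three-line Python sketch:

# 1 billion sensors, 1 reading/second, 100 bytes per reading
sensors, bytes_per_reading = 1_000_000_000, 100
per_second_gb = sensors * bytes_per_reading / 1e9      # 100 GB/s
per_day_tb = per_second_gb * 86_400 / 1_000            # 8,640 TB/day
per_year_pb = per_day_tb * 365 / 1_000                 # ~3,154 PB/year (~3.15 EB)
print(per_second_gb, per_day_tb, per_year_pb)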

Why This Matters:

  • Traditional single-node databases like MySQL can’t handle this scale (analytical workloads typically max out at a few TB)
  • You can’t store this on a single hard drive (the largest commodity drives are on the order of 20 TB)
  • Processing requires distributed systems like Hadoop or Spark running on hundreds of machines
  • Network bandwidth becomes critical: 100 GB/second = 800 Gbps, enough to saturate 800 gigabit internet connections

The “Big Data” Part:

  • Volume: 3+ EB/year is too big for traditional systems
  • Velocity: 100 GB/second incoming data rate requires real-time processing
  • Variety: Sensors produce temperature (numeric), GPS (coordinates), images (binary), logs (text)
  • Veracity: Some sensors malfunction - how do you detect and filter bad data at this scale?
  • Value: The goal is actionable insights, not just hoarding data

This is why IoT needs big data technologies - regular tools simply break at this scale.

32.4 What Would Happen If: Disaster Scenarios

What If You Try Processing 1 TB on a Single Laptop?

Scenario: You work for a smart building company. Your boss asks you to analyze 1 TB of temperature sensor data from 1,000 buildings over the past year using your laptop.

What Happens:

Figure 32.1: Single laptop versus distributed system processing comparison — a 16 GB laptop attempting to process 1 TB of data versus a 10-node Spark cluster, contrasting memory limits, processing times, and the performance advantage of distributed systems.

32.4.1 Attempt 1: Load Everything Into Memory (RAM)

Result: CRASH

  • Your laptop has 16 GB RAM; 1 TB = 1,024 GB, so you are 64x over capacity
  • Python crashes with MemoryError
  • Time wasted: ~10 minutes trying to load the data before the crash

32.4.2 Attempt 2: Process Chunk by Chunk

Result: SLOW BUT WORKS (BARELY)

  • A laptop SSD reads at ~200 MB/second
  • 1 TB / 0.2 GB/s = 5,000 seconds = 83 minutes just to read the data
  • Processing adds another 2-3x on top of the read time: 4-6 hours total
  • Your laptop is unusable for other tasks while the job runs
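Chunked processing looks roughly like the sketch below: pandas reads a fixed-size slice, aggregates it, and discards it before loading the next one, so only one chunk is ever resident in RAM. The file and column names are hypothetical.

import pandas as pd

# Hypothetical 1 TB CSV of building temperature readings, processed in
# 1M-row chunks so memory use stays bounded.
running_sum, running_count = 0.0, 0

for chunk in pd.read_csv("building_temperatures.csv", chunksize=1_000_000):
    running_sum += chunk["temperature_c"].sum()
    running_count += len(chunk)

print(f"Mean temperature across all readings: {running_sum / running_count:.2f} C")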

If you had used a distributed system instead:

  • 10-node Spark cluster (each with 32 GB RAM, 16 cores)
  • Reads 1 TB in parallel: 1024 GB / 10 nodes = 102.4 GB per node
  • Read time: 102.4 GB / 0.2 GB/s = ~8.5 minutes
  • Processing time with parallelism: 15-20 minutes total
  • 18x faster than laptop, and you can keep working!
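The equivalent distributed job is only a few lines of PySpark; each of the 10 executors reads and aggregates its own slice of the dataset in parallel. The path and column names here are hypothetical.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("building-temps").getOrCreate()

# Each executor reads and aggregates its own partition of the 1 TB dataset.
df = spark.read.parquet("s3://smart-buildings/temperature-readings/")

summary = (df.groupBy("building_id")
             .agg(F.avg("temperature_c").alias("avg_temp"),
                  F.max("temperature_c").alias("max_temp")))

summary.show()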

32.4.3 The Lesson: Scale Matters

| Approach | Time | Cost | When to Use |
|---|---|---|---|
| Laptop (chunks) | 4-6 hours | $0 (your time) | < 10 GB datasets, one-off analysis |
| Distributed System (Spark) | 15-20 min | $5-10/run | > 100 GB datasets, frequent analysis |
| Time-Series DB (InfluxDB) | 5-10 min | $20-50/month hosting | IoT sensor data, real-time queries |
| MySQL (traditional DB) | Hours/timeout | $0-100/month | < 1 GB transactional data, not analytics |

The Big Data Rule: “If processing your data takes longer than drinking a cup of coffee (>20 minutes), you need distributed or specialized systems.”

When Do You Actually Need Big Data Tools?

  • Volume: > 100 GB of data (laptop RAM limitations)
  • Velocity: > 1 MB/second incoming data (real-time processing needed)
  • Variety: > 3 different data types (structured + unstructured + images)
  • Query Frequency: > 10 queries/day (need optimized storage)

If you meet 2+ of these criteria, traditional tools will cause pain. Time to learn Spark, Hadoop, or time-series databases!
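One hedged way to encode this rule of thumb is a small checklist function; the thresholds are the ones listed above, not universal constants.

def needs_big_data_tools(data_gb, ingest_mb_per_s, data_types, queries_per_day):
    """Return True if 2+ of the rule-of-thumb criteria above are met."""
    criteria = [
        data_gb > 100,          # Volume
        ingest_mb_per_s > 1,    # Velocity
        data_types > 3,         # Variety
        queries_per_day > 10,   # Query frequency
    ]
    return sum(criteria) >= 2

# Barcelona-scale example: ~81 TB accumulated, ~2.6 MB/s average ingest, 4 data types
print(needs_big_data_tools(data_gb=81_000, ingest_mb_per_s=2.6,
                           data_types=4, queries_per_day=100))   # True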

32.5 Worked Example: IoT Data Lake Architecture

32.6 Worked Example: Smart City Data Lake for 50,000 Sensors

Scenario: A metropolitan city is deploying a smart city initiative with 50,000 IoT sensors across multiple domains: traffic monitoring (15,000 sensors), air quality (5,000 sensors), smart lighting (20,000 lights), water management (3,000 sensors), and public safety cameras (7,000 devices). The city needs a unified data platform that enables cross-domain analytics while managing costs and supporting future growth.

Goal: Design a data lake architecture that ingests 2 TB/day of sensor data, supports both real-time dashboards and historical analytics, maintains data governance for sensitive information, and keeps monthly costs under $15,000.

What we do: Design ingestion pipelines tailored to each sensor type’s data characteristics.

Data Volume Analysis:

traffic_sensors:
  count: 15000
  frequency: every_5_seconds
  payload_size: 200_bytes
  daily_volume: 51.8 GB

air_quality:
  count: 5000
  frequency: every_minute
  payload_size: 500_bytes
  daily_volume: 3.6 GB

smart_lighting:
  count: 20000
  frequency: every_15_minutes
  payload_size: 650_bytes
  daily_volume: 1.2 GB

water_management:
  count: 3000
  frequency: every_30_seconds
  payload_size: 300_bytes
  daily_volume: 2.6 GB

cameras:
  count: 7000
  metadata_only: true
  events_per_hour: 50000
  daily_volume: 8.5 GB

total_daily_ingestion: ~67.7 GB structured data
# Plus 1.8 TB camera footage (separate cold storage)

Ingestion Architecture:

  • MQTT broker for low-power sensors (air quality, water) - batch to Kafka
  • Kafka direct for high-frequency sensors (traffic) - 64 partitions, 3x replication
  • HTTP API for smart lighting controllers - rate limited 10K/minute
  • gRPC for camera event streaming

Why: Different sensor types have vastly different requirements. Traffic sensors need low-latency Kafka ingestion for real-time congestion alerts. Battery-powered air quality sensors use MQTT for power efficiency.
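As an illustration of the MQTT-to-Kafka path for low-power sensors, a minimal bridge might look like the sketch below, using the paho-mqtt and kafka-python libraries; the broker addresses and topic names are assumptions, not part of the city's actual deployment.

import json
from paho.mqtt import client as mqtt
from kafka import KafkaProducer

# Kafka producer feeding the downstream stream-processing layer
producer = KafkaProducer(
    bootstrap_servers="kafka.smartcity.internal:9092",   # assumed address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def on_message(client, userdata, msg):
    # Forward each MQTT reading to the matching Kafka topic,
    # e.g. sensors/air_quality/station-42 -> iot.air_quality
    domain = msg.topic.split("/")[1]
    reading = json.loads(msg.payload)
    producer.send(f"iot.{domain}", value=reading)

mqtt_client = mqtt.Client()
mqtt_client.on_message = on_message
mqtt_client.connect("mqtt.smartcity.internal", 1883)      # assumed broker
mqtt_client.subscribe("sensors/#")                         # all low-power sensors
mqtt_client.loop_forever()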

What we do: Implement a three-tier storage architecture (hot/warm/cold) optimized for access patterns.

Tiered Storage Design:

hot_tier:
  storage_class: S3_STANDARD
  data_age: 0-7_days
  access_pattern: frequent_queries
  format: Delta Lake (Parquet with ACID)
  partitioning: date/sensor_type/region
  expected_size: 500_GB
  monthly_cost: $11.50

warm_tier:
  storage_class: S3_INTELLIGENT_TIERING
  data_age: 7-90_days
  access_pattern: weekly_reports
  format: Parquet (compressed)
  expected_size: 6_TB
  monthly_cost: $72

cold_tier:
  storage_class: S3_GLACIER_INSTANT
  data_age: 90_days+
  access_pattern: compliance_audits
  retention: 7_years
  expected_size: 50_TB (growing)
  monthly_cost: $200

Why: 90% of queries access data from the last 7 days. By automatically tiering older data to cheaper storage classes, we reduce storage costs by 80% while maintaining instant access to recent data.
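On AWS, the hot/warm/cold transitions described above can be expressed as an S3 lifecycle policy. A minimal boto3 sketch follows; the bucket name and prefix are assumptions, and GLACIER_IR is the storage-class identifier for Glacier Instant Retrieval.

import boto3

s3 = boto3.client("s3")

# Lifecycle rule implementing the hot (0-7d) / warm (7-90d) / cold (90d+) tiers
# described above, with a ~7-year retention limit for compliance data.
s3.put_bucket_lifecycle_configuration(
    Bucket="smartcity-sensor-lake",          # assumed bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tiered-sensor-data",
                "Status": "Enabled",
                "Filter": {"Prefix": "sensor_readings/"},
                "Transitions": [
                    {"Days": 7,  "StorageClass": "INTELLIGENT_TIERING"},
                    {"Days": 90, "StorageClass": "GLACIER_IR"},
                ],
                "Expiration": {"Days": 7 * 365},   # delete after ~7 years
            }
        ]
    },
)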

What we do: Configure query engines and optimize for common access patterns.

Query Engine Selection:

real_time_dashboards:
  engine: Apache Druid
  latency_target: <1_second
  data: last_24_hours
  use_cases:
    - Traffic congestion heatmap
    - Air quality alerts
    - Lighting status dashboard

historical_analytics:
  engine: Spark SQL / Databricks
  latency_target: <30_seconds
  data: any_time_range
  use_cases:
    - Weekly traffic pattern analysis
    - Seasonal air quality trends
    - Energy consumption reports

Materialized Views for Common Queries:

-- Pre-aggregated hourly summaries (reduce query cost 50x)
CREATE MATERIALIZED VIEW hourly_traffic_summary AS
SELECT
    date_trunc('hour', timestamp) as hour,
    region,
    COUNT(*) as reading_count,
    AVG(measurements['vehicle_count']) as avg_vehicles,
    MAX(measurements['vehicle_count']) as peak_vehicles
FROM smart_city.sensor_readings
WHERE sensor_type = 'traffic'
GROUP BY 1, 2;

Why: Smart city dashboards are accessed continuously. Pre-computing hourly aggregates reduces query latency from 30 seconds to under 1 second while cutting compute costs by 99%.

What we do: Implement access controls, data classification, and lineage tracking.

Data Classification Schema:

public:
  description: Anonymized aggregate statistics
  examples:
    - Hourly traffic counts by region
    - Daily air quality index
  access: Anyone (open data portal)

internal:
  description: Operational sensor data
  examples:
    - Individual sensor readings
    - Equipment status
  access: City employees with data access training

sensitive:
  description: Data that could identify individuals
  examples:
    - Camera footage
    - License plate readings
  access: Authorized personnel only
  encryption: At rest and in transit

Why: Smart city data includes sensitive information (camera footage, location patterns). Column-level access controls ensure traffic analysts cannot accidentally access camera data.

What we do: Design schema management that accommodates new sensor types.

Schema Registry Configuration:

# Base sensor schema with evolution support
sensor_schema_v1 = {
    "type": "record",
    "name": "SensorReading",
    "fields": [
        {"name": "sensor_id", "type": "string"},
        {"name": "timestamp", "type": "long"},
        {"name": "measurements", "type": {"type": "map", "values": "double"}},
        {"name": "metadata", "type": ["null", {"type": "map", "values": "string"}],
         "default": None}
    ]
}

# Evolution rules
compatibility_config = {
    "compatibility": "BACKWARD",
    "allowed_changes": [
        "add_optional_field",
        "add_field_with_default"
    ]
}

Why: Cities constantly add new sensor types (flood sensors, noise monitors, EV chargers). Using a flexible measurements map allows adding sensor types without modifying table schemas.
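For example, a backward-compatible v2 that adds an optional firmware-version field (the field name is purely illustrative) only needs a default value so that records written with v1 can still be read:

# Backward-compatible evolution: a new optional field with a default,
# so consumers using v2 can still read records written with v1.
sensor_schema_v2 = {
    "type": "record",
    "name": "SensorReading",
    "fields": [
        {"name": "sensor_id", "type": "string"},
        {"name": "timestamp", "type": "long"},
        {"name": "measurements", "type": {"type": "map", "values": "double"}},
        {"name": "metadata", "type": ["null", {"type": "map", "values": "string"}],
         "default": None},
        # New in v2 (illustrative): optional firmware version, defaults to null
        {"name": "firmware_version", "type": ["null", "string"], "default": None},
    ]
}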

Outcome: A production data lake processing 2 TB/day from 50,000 sensors with sub-second dashboard queries and monthly costs of $12,400 (under budget).

Architecture Summary:

50,000 Sensors
     |
     +-- MQTT Broker (low-power) --+
     +-- Kafka (high-frequency) ---+---> Spark Streaming
     +-- HTTP API (controllers) ---+      |
     +-- gRPC (camera events) -----+      v
                                   Delta Lake (S3)
                                        |
                     +------------------+------------------+
                     |                  |                  |
                  Hot Tier          Warm Tier         Cold Tier
                  (7 days)          (90 days)         (7 years)
                  500 GB            6 TB              50+ TB
                     |                  |                  |
                     v                  v                  v
                  Druid            Spark SQL          Glacier
               (Real-time)        (Analytics)       (Compliance)

Cost Breakdown:

| Component | Monthly Cost |
|---|---|
| S3 Storage (all tiers) | $283.50 |
| Kafka (MSK) | $2,400 |
| Spark (Databricks) | $4,800 |
| Druid (real-time) | $3,200 |
| Schema Registry | $200 |
| Data Transfer | $1,500 |
| Total | $12,383.50 |

Key Decisions Made:

  1. Multi-protocol ingestion: Match ingestion to sensor constraints
  2. Delta Lake format: ACID transactions prevent corruption
  3. Three-tier storage: 80% cost reduction vs single-tier
  4. Materialized views: 99% query cost reduction
  5. Schema-flexible measurements map: Add sensor types without migrations
  6. Column-level access controls: Compliance with privacy requirements

32.7 Interactive: Tiered Storage Cost Comparison

Compare the cost of storing all IoT data in a single tier versus using the hot/warm/cold tiered approach described in the data lake architecture above. Adjust the daily ingestion rate and see how tiering reduces costs over time.
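If the interactive widget is not available, the comparison can be approximated with a short script. This is a sketch under the pricing assumptions used earlier in the chapter (~$0.023/GB/month for S3 Standard, ~$0.012 for Intelligent-Tiering, ~$0.004 for Glacier Instant Retrieval):

# Single-tier vs hot/warm/cold storage cost after N days of ingestion.
# Per-GB-month prices are the approximate figures used earlier in this chapter.
HOT, WARM, COLD = 0.023, 0.012, 0.004

def monthly_cost(daily_gb, days, tiered=True):
    total_gb = daily_gb * days
    if not tiered:
        return total_gb * HOT
    hot_gb = daily_gb * min(days, 7)
    warm_gb = daily_gb * max(min(days, 90) - 7, 0)
    cold_gb = daily_gb * max(days - 90, 0)
    return hot_gb * HOT + warm_gb * WARM + cold_gb * COLD

for days in (30, 180, 365):
    single = monthly_cost(67.7, days, tiered=False)
    tiered = monthly_cost(67.7, days, tiered=True)
    print(f"day {days:3d}: single-tier ${single:8,.0f}/mo   tiered ${tiered:8,.0f}/mo")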

32.9 Videos

Figure: Four-layer IoT data architecture — Sensing Layer (RFID tags, intelligent sensors, RFID readers, BLE devices, WSNs), Network Layer (WSN, Internet, mobile network, database, WLAN), Service Layer (service division, service integration, service bus, business logic, service repository), and Interface Layer (application frontend, contract, interface, application API), with security spanning all layers.

Source: NPTEL Internet of Things Course, IIT Kharagpur - Illustrates the complete IoT data pipeline from physical sensors through network transport and cloud services to end-user applications, emphasizing how big data flows through each architectural layer.

The Sensor Squad went on a field trip to visit a real smart city!

“Welcome to Barcelona!” said Max the Microcontroller. “This city has over 12,000 sensors – parking sensors, trash bin sensors, air quality monitors, traffic cameras, and smart streetlights!”

Sammy the Sensor was amazed. “How much data is that?”

“About 222 gigabytes every single day,” Max replied. “That is like taking 44,000 photos on your phone – EVERY DAY!”

Lila the LED pointed at a smart parking sensor in the ground. “What does that one do?”

“It detects if a car is parked on top of it,” explained Max. “There are 1,000 of them! When you drive into the city, an app tells you exactly where to find an empty parking spot. It saves people over 230,000 hours a year of driving around looking for parking!”

Bella the Battery pointed at a smart trash bin. “And that one?”

“It tells the garbage trucks how full it is! Instead of driving to every bin every day, the trucks only visit the full ones. They save 30 percent on fuel!”

“But where does all this data GO?” Sammy asked.

“Great question! The city uses something called a DATA LAKE,” Max said. “Think of it like a giant library with three rooms: a HOT room for this week’s data that people look at all the time, a WARM room for last month’s data, and a COLD room for really old data stored in the basement. The hot room costs the most to run, so old data moves to cheaper rooms automatically.”

“Smart cities are like having thousands of us working together!” Bella said proudly.

Key lesson: Smart cities use thousands of IoT sensors to save money, time, and energy. They organize their massive data into tiers – like hot, warm, and cold storage – to keep costs manageable!

Key Takeaway

Smart city deployments demonstrate that IoT big data is not just a technology challenge but an architecture and economics problem. The combination of multi-protocol ingestion, tiered storage, materialized views, and schema-flexible design enables cities to process terabytes daily while keeping costs under control. Start with clear business value targets (like Barcelona’s parking optimization saving 230,000 hours/year) and design the architecture to deliver that value efficiently.

Common Pitfalls

  • Storing all sensor readings ‘just in case’ generates petabyte-scale datasets with no query plan to justify the cost. Define the decisions the data must support before designing the collection pipeline.
  • Running analytics jobs that move terabytes of IoT data from storage to compute clusters wastes bandwidth and increases latency. Use compute-near-storage patterns (Spark on HDFS, BigQuery, Redshift Spectrum) or edge pre-aggregation.
  • A single ingestion pipeline for temperature, video, and vibration data will be brittle and hard to scale. Design separate ingestion paths for each data modality, joining them only at the analytics layer.
  • Production incidents often require tracing a faulty reading back to its source sensor, calibration date, and transmission path. Without provenance metadata, root-cause analysis becomes guesswork.

32.10 Summary

  • Barcelona Smart City generates 222 GB/day from 12,000+ sensors, demonstrating how mid-sized deployments create big data challenges requiring 81 TB/year storage with tiered architectures.
  • Scale matters: 1 billion IoT sensors at 1 reading/second generates 3+ EB/year (over 8,600 TB/day), requiring distributed processing across hundreds of machines to handle 100 GB/second incoming data rates.
  • Data lake architecture for 50,000 sensors achieves sub-second queries and $12,400/month costs through multi-protocol ingestion, three-tier storage, materialized views, and schema-flexible design.
  • Decision framework: Use big data tools when data exceeds 100 GB, velocity exceeds 1 MB/second, or you have 3+ data types and frequent queries - otherwise traditional tools work fine.

32.11 Resources

32.11.1 Big Data Frameworks

32.11.2 Time-Series Databases

32.11.3 Cloud Platforms

32.11.4 Books and Papers

  • “Big Data: Principles and Best Practices of Scalable Real-Time Data Systems” by Nathan Marz
  • “Streaming Systems” by Tyler Akidau, Slava Chernyak, Reuven Lax
  • “Designing Data-Intensive Applications” by Martin Kleppmann

32.12 What’s Next

| If you want to… | Read this |
|---|---|
| Understand big data pipeline fundamentals | Big Data Fundamentals |
| Explore processing technologies used in case studies | Big Data Technologies |
| Learn edge processing for distributed deployments | Big Data Edge Processing |
| Apply analytics to real IoT data | Modeling and Inferencing |
| Return to the module overview | Big Data Overview |