2  Data Storage Overview

In 60 Seconds

IoT devices generate massive volumes of timestamped data that require specialized storage strategies – time-series databases for sensor readings, relational databases for device metadata, and document stores for flexible event logs. The key to managing cost and performance at scale is tiered storage (hot/warm/cold), where recent data lives in fast databases for real-time queries while historical data is compressed and archived in cheaper object storage, reducing costs by 90% or more.

Learning Objectives

After completing this chapter series, you will be able to:

  • Select appropriate database types for different IoT use cases
  • Analyze CAP theorem trade-offs and apply them to IoT system design
  • Implement relational, NoSQL, and time-series databases for IoT data
  • Design data partitioning and sharding strategies
  • Configure data quality monitoring and retention policies
  • Compare tiered storage architectures to optimize costs
MVU: Minimum Viable Understanding

Core concept: Database selection depends on your data shape (relational vs. document vs. time-series) and the CAP theorem trade-offs between consistency, availability, and partition tolerance. Why it matters: The wrong database choice leads to 10x higher costs, query timeouts, and scaling nightmares; the right choice enables real-time analytics at IoT scale. Key takeaway: Use relational databases for transactional data with relationships, time-series databases for sensor readings, and document stores for flexible device metadata–then implement multi-tier retention (hot/warm/cold) to control storage costs.

2.1 Prerequisites

Before diving into this chapter series, you should be familiar with:

  • Basic programming concepts: Understanding of data structures, variables, and simple queries
  • IoT Fundamentals: Basic understanding of IoT devices and data flow
  • Edge, Fog, and Cloud Architecture: Where data is generated and processed in IoT systems

Imagine your bedroom is full of LEGOs!

  • Without organization: LEGOs are scattered everywhere. Finding a specific piece takes forever, and you might step on them!
  • With a database: It’s like having a special LEGO organizer with labeled drawers. Red bricks go here, wheels go there, and special pieces have their own spot.

A database is like a super-smart organizer for computer information. When your smart home collects data from sensors (like how hot it is or when someone opened the door), it needs a place to store all that information so it can find it again quickly.

Think of it this way:

| Real Life | Database World |
|---|---|
| Photo album | Stores pictures from security cameras |
| Diary with dates | Stores temperature readings over time |
| Address book | Stores information about your smart devices |
| Toy box with sections | Different databases for different types of data |

Why does IoT need good databases? Your smart home might record the temperature every minute. That’s 1,440 readings per day, or over 525,000 per year–just from ONE sensor! Without a good database, finding “what was the temperature last Tuesday at 3pm?” would be like finding a specific LEGO in a huge pile.

Sammy the Sensor says: “Hey friends! Let me tell you about the coolest library in the world–one that stores data from sensors!”

2.1.1 The Sensor Squad Builds a Weather Station

Sammy and the squad were building a weather station for their school. Soon they had a problem: too much data!

Lila checked the thermometer: “It takes the temperature every minute. That’s 1,440 readings per day!”

Max looked at the rain gauge: “And we have humidity, wind speed, and rainfall too. That’s thousands of numbers!”

Bella asked the big question: “Where do we put all this data so we can find it later?”

Sammy had an idea: “We need a database! It’s like a special library for numbers!”

2.1.2 Three Types of Data Libraries

The squad learned that different data needs different storage:

| Data Type | Best Storage | Real-World Example |
|---|---|---|
| Who owns which sensor | Spreadsheet-style (Relational) | Like a class roster with names and seat numbers |
| Sensor readings over time | Time-diary (Time-Series) | Like a captain's log on a ship |
| Flexible notes and configs | Folder system (Document) | Like a folder with different papers inside |

Max figured it out: “So temperature readings go in the time-diary, but information about the sensor goes in the spreadsheet!”

Bella nodded: “And if we want to add new types of information later, the folder system is flexible!”

The Sensor Squad learned: Different data needs different homes, just like different toys need different organizers!

Try This at Home: Think about data in your house. Your family photos are like documents (flexible). Your calendar is like a time-diary (ordered by date). Your contact list is like a spreadsheet (organized rows and columns). What other examples can you find?

2.2 Introduction

IoT systems generate diverse data types requiring different storage strategies. Sensor readings demand time-series optimization, device metadata needs relational structure, and multimedia content requires object storage. Choosing the right database technology is crucial for performance, scalability, and cost-effectiveness.

This chapter series covers all aspects of IoT data storage:

2.2.1 Chapter Guide

| Chapter | Focus | Best For |
|---|---|---|
| Database Selection Framework | Choosing the right database type | Starting a new IoT project |
| CAP Theorem and Database Categories | Distributed systems trade-offs | Designing for scale and reliability |
| Time-Series Databases | TimescaleDB, InfluxDB optimization | Sensor data storage |
| Data Quality Monitoring | Quality metrics and validation | Production systems |
| Sharding Strategies | Horizontal scaling patterns | Large-scale deployments |
| Worked Examples | Fleet management, data lake design | Learning by example |

2.3 Key Terminology

Before exploring IoT data storage, familiarize yourself with these essential terms:

| Term | Definition | IoT Example |
|---|---|---|
| Database | Organized collection of data that can be easily accessed, managed, and updated | Storing all temperature readings from a smart thermostat |
| Query | A request for data from a database | "Show me all temperatures above 25°C from last week" |
| Schema | Structure that defines how data is organized | Columns: timestamp, device_id, temperature, humidity |
| CRUD | Create, Read, Update, Delete–the four basic database operations | Adding a new sensor reading (Create), viewing historical data (Read) |
| ACID | Atomicity, Consistency, Isolation, Durability–properties ensuring reliable transactions | Ensuring a payment transaction completes fully or not at all |
| Partition | Dividing data into smaller, manageable pieces | Storing each month's sensor data in separate sections |
| Replication | Copying data to multiple locations for redundancy | Keeping backups in different data centers |
| Time-series | Data points indexed in time order | Temperature readings every 5 minutes for a year |

2.3.1 IoT Storage Challenges

IoT systems face unique storage challenges compared to traditional applications:

  • Scale: Billions of devices generating petabytes of data annually
  • Velocity: High-frequency writes (thousands of writes/second aggregated across a fleet)
  • Variety: Structured metadata, semi-structured logs, unstructured video
  • Retention: Long-term compliance storage (years) vs real-time buffers (hours)
  • Cost: Storage costs can exceed compute costs at IoT scale
  • Access patterns: Write-heavy ingestion, time-range analytical queries
The 3 Vs of IoT Data

IoT data is often characterized by Volume (massive amounts), Velocity (high speed), and Variety (many formats). Traditional databases designed for business applications often struggle with these characteristics, which is why specialized IoT databases have emerged.

2.3.2 IoT Data Flow and Storage Architecture

Figure 2.1: IoT Data Storage Architecture showing the flow from devices through ingestion to tiered storage with hot, warm, and cold tiers for cost optimization.

2.3.3 The Three Main Database Types for IoT

Figure 2.2: IoT Database Type Selection: Choose relational for structured metadata, NoSQL for flexible documents, or time-series for sensor readings based on your data characteristics.
| Type | Analogy | Best For | Examples |
|---|---|---|---|
| Relational | Spreadsheet with strict columns | Device metadata, user accounts | PostgreSQL, MySQL |
| NoSQL (Document) | Flexible JSON file | Event logs, config data | MongoDB, DynamoDB |
| Key-Value / Cache | Dictionary for instant lookup | Latest sensor values, session data | Redis, Memcached |
| Time-Series | Optimized data log | Sensor readings, metrics | InfluxDB, TimescaleDB |

2.3.4 Quick Reference: Database Comparison

| Database Type | Best For | Scalability | Query Complexity | Write Speed |
|---|---|---|---|---|
| Relational (SQL) | Structured data, ACID | Vertical | High (SQL) | Medium |
| Document (NoSQL) | Semi-structured | Horizontal | Medium | High |
| Key-Value | Simple lookups | Horizontal | Low | Very High |
| Time-Series | Time-stamped data | Horizontal | Medium | Very High |
| Graph | Relationships | Horizontal | High (traversals) | Medium |

2.4 Where to Start

Choose your learning path based on your current needs:

Start here: Database Selection Framework

Learn the fundamentals of choosing the right database for your IoT use case. Covers SQL vs NoSQL, time-series databases, and decision frameworks.

Start here: CAP Theorem and Database Categories

Understand distributed systems trade-offs and how consistency, availability, and partition tolerance affect your architecture decisions.

Start here: Time-Series Databases

Deep dive into InfluxDB, TimescaleDB, and optimization techniques for high-frequency sensor data.

Start here: Worked Examples

Complete case studies including fleet management systems and smart city data lake architectures.

Scenario: A smart agriculture company deploys 2,000 soil sensors (moisture, temperature, pH) reporting every 10 minutes across 500 hectares.

Note: The self-assessment above used 50 bytes for moisture-only readings. This expanded scenario includes moisture, temperature, and pH fields (80 bytes per record).

Step 1: Calculate Data Volume

To estimate storage needs, multiply sensors by readings per day by record size. For this agriculture deployment, \(\text{Daily volume} = 2{,}000 \times 144 \times 80 = 23{,}040{,}000\) bytes. Worked example: 2,000 sensors x 144 readings/day x 80 bytes = 23 MB/day, scaling to 8.4 GB/year before compression or aggregation.
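As a sanity check, the arithmetic can be scripted (a minimal Python sketch using the scenario's figures):

```python
# Estimate raw storage for the soil-sensor deployment described above
sensors = 2_000
readings_per_day = 24 * 60 // 10   # one reading every 10 minutes -> 144
record_bytes = 80                  # moisture + temperature + pH record

daily_bytes = sensors * readings_per_day * record_bytes
yearly_gb = daily_bytes * 365 / 1e9

print(f"{daily_bytes / 1e6:.1f} MB/day")   # 23.0 MB/day
print(f"{yearly_gb:.1f} GB/year")          # 8.4 GB/year
```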


Step 2: Identify Data Types and Access Patterns

| Data Type | Volume | Query Pattern | Retention |
|---|---|---|---|
| Sensor readings | 8.4 GB/year | Time-range, by field/sensor | 3 years |
| Sensor metadata | 500 KB | Lookup by ID, join with readings | Permanent |
| Irrigation events | 50 MB/year | Time-range, by field | 3 years |
| Weather data | 100 MB/year | Time-range correlation | 10 years |
| User accounts | 10 KB | Authentication, billing | Permanent |

Step 3: Select Databases (Polyglot Approach)

| Data Type | Database | Reasoning |
|---|---|---|
| Sensor readings | TimescaleDB | Time-series optimized, SQL for analytics, 10:1 compression |
| Sensor metadata | PostgreSQL | Relational (sensor → field → farm), ACID for config changes |
| Irrigation events | TimescaleDB | Same as readings for easy correlation |
| Weather data | TimescaleDB | Join with sensor data for ML models |
| User accounts | PostgreSQL | ACID transactions for billing |

Step 4: Design Tiered Storage

| Tier | Age | Raw Data | Compression | Stored Size | Unit Cost ($/GB-month) | Monthly Cost |
|---|---|---|---|---|---|---|
| Hot | 0-30 days | 690 MB | 1:1 | 690 MB | $0.10 | $0.07 |
| Warm | 30-365 days | 7.7 GB | 10:1 | 770 MB | $0.02 | $0.02 |
| Cold | 1-3 years | 16.8 GB | 10:1 + downsample 6:1 | 280 MB | $0.004 | $0.001 |
| Total | 3 years | ~25 GB | - | ~1.7 GB | - | $0.09/month |

Note: The $0.09/month above covers storage costs only. The cost comparison in Step 6 includes managed database compute overhead ($1.08/month total).
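The tier sizes and the storage-only total can be reproduced in a few lines of Python (unit costs are the illustrative per-GB-month figures from the table, not any specific provider's pricing):

```python
# Reproduce the Step 4 tier sizes and storage-only cost
daily_mb = 23.0                            # ~23 MB/day from Step 1

hot_mb  = daily_mb * 30                    # 0-30 days, uncompressed
warm_mb = daily_mb * 335 / 10              # 30-365 days, 10:1 compression
cold_mb = daily_mb * 365 * 2 / (10 * 6)    # years 1-3, 10:1 plus 6:1 downsampling

cost = hot_mb/1000*0.10 + warm_mb/1000*0.02 + cold_mb/1000*0.004
print(f"hot={hot_mb:.0f} MB warm={warm_mb:.0f} MB cold={cold_mb:.0f} MB")
print(f"storage-only cost ≈ ${cost:.2f}/month")   # ≈ $0.09/month
```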

Step 5: Implement Schema

-- TimescaleDB hypertable for sensor readings
CREATE TABLE sensor_readings (
    time TIMESTAMPTZ NOT NULL,
    sensor_id TEXT NOT NULL,  field_id TEXT NOT NULL,
    moisture NUMERIC(5,2),    temperature NUMERIC(4,1),
    ph NUMERIC(3,1),          battery_pct SMALLINT
);
SELECT create_hypertable('sensor_readings', 'time',
       chunk_time_interval => INTERVAL '1 week');

-- Compress old data automatically
ALTER TABLE sensor_readings SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'sensor_id, field_id');
SELECT add_compression_policy('sensor_readings', INTERVAL '30 days');

-- Metadata tables (standard PostgreSQL)
CREATE TABLE fields (
    field_id TEXT PRIMARY KEY, farm_id TEXT NOT NULL,
    name TEXT, area_hectares NUMERIC(6,2));
CREATE TABLE sensors (
    sensor_id TEXT PRIMARY KEY,
    field_id TEXT NOT NULL REFERENCES fields(field_id),
    location POINT, sensor_model TEXT);

-- Continuous aggregate for warm-tier daily summaries
CREATE MATERIALIZED VIEW daily_soil_summary
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 day', time) AS day, field_id,
       AVG(moisture) as avg_moisture,
       MIN(temperature) as min_temp, MAX(temperature) as max_temp
FROM sensor_readings GROUP BY day, field_id;

Step 6: Cost Comparison

| Approach | Storage Cost | Query Performance | Complexity | Total Annual Cost |
|---|---|---|---|---|
| All PostgreSQL (no optimization) | $30/month | Slow after 6 months | Low | $360 |
| TimescaleDB + compression | $5/month | Fast (time partitions) | Medium | $60 |
| TimescaleDB + tiered storage | $1.08/month | Fast hot, slower cold | High | $13 |

Costs include managed database compute + storage. Storage-only costs are lower (see Step 4 above).

Result: Tiered TimescaleDB architecture saves $347/year (96%) vs. naive PostgreSQL while maintaining fast queries on recent data.

Key Insight: Even for modest IoT deployments (2,000 sensors), proper database selection and tiering reduce costs by 95%+ without sacrificing performance.

Start here when designing IoT storage. Answer these questions to select your database types:

Question 1: Is your data timestamped and queried by time ranges?

  • Yes → Time-series database (TimescaleDB, InfluxDB)
  • No → Continue to Q2

Question 2: Does your data have fixed structure with relationships (foreign keys)?

  • Yes → Relational database (PostgreSQL, MySQL)
  • No → Continue to Q3

Question 3: Is your schema flexible or rapidly changing?

  • Yes → Document database (MongoDB, DynamoDB)
  • No → Continue to Q4

Question 4: Do you need sub-millisecond read access to latest values?

  • Yes → Key-value cache (Redis, Memcached)
  • No → Continue to Q5

Question 5: Are you storing large binary files (images, video)?

  • Yes → Object storage (S3, Azure Blob, MinIO)
  • No → Reconsider Q1-Q4

Real-World Mapping Example (smart building):

| Data | Q1 Time? | Q2 Relational? | Q3 Flexible? | Q4 Cache? | Q5 Binary? | Database |
|---|---|---|---|---|---|---|
| Sensor readings | Yes | - | - | - | - | TimescaleDB |
| Device registry | No | Yes | - | - | - | PostgreSQL |
| Event logs | No¹ | No | Yes | - | - | MongoDB |
| Latest temps | No | No | No | Yes | - | Redis |
| Security footage | No | No | No | No | Yes | S3 |

1 Event logs carry timestamps but are primarily queried by event type, severity, or content rather than time-range aggregation. Schema flexibility (Q3) is the dominant factor since log structures vary widely across device types.
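The Q1-Q5 flow can be encoded as a small helper function (a sketch: the function name and return strings are illustrative, and per the footnote above, Q1 asks about time-range querying, not merely the presence of timestamps):

```python
def pick_database(time_range_queries, relational, flexible_schema,
                  needs_cache, large_binaries):
    """Walk the Q1-Q5 decision framework and return a database category."""
    if time_range_queries:                 # Q1: timestamped AND queried by time range
        return "time-series (TimescaleDB, InfluxDB)"
    if relational:                         # Q2: fixed structure with foreign keys
        return "relational (PostgreSQL, MySQL)"
    if flexible_schema:                    # Q3: flexible or rapidly changing schema
        return "document (MongoDB, DynamoDB)"
    if needs_cache:                        # Q4: sub-millisecond reads of latest values
        return "key-value cache (Redis, Memcached)"
    if large_binaries:                     # Q5: images, video, large blobs
        return "object storage (S3, Azure Blob, MinIO)"
    return "reconsider Q1-Q4"

# Two rows from the smart-building table
print(pick_database(True, False, False, False, False))   # sensor readings
print(pick_database(False, True, False, False, False))   # device registry
```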

Complexity Guide:

  • 1 database type: OK for hobby projects (<100 devices)
  • 2-3 database types: Standard for production IoT (1K-100K devices)
  • 4-5 database types: Large-scale platforms (100K+ devices)
  • 6+ database types: Complex ecosystems (avoid unless necessary)

2.4.1 Tiered Storage Cost Estimator

Explore how tiered storage reduces costs compared to storing all data in a single high-performance tier. Adjust the parameters to match your deployment:
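As a stand-in for the interactive estimator, here is a minimal Python sketch; the default tier boundaries, compression ratios, and per-GB-month prices are illustrative assumptions you should replace with your own figures:

```python
def monthly_storage_cost(daily_gb, hot_days=30, warm_days=335, cold_days=730,
                         warm_ratio=10, cold_ratio=60,
                         prices=(0.10, 0.02, 0.004)):
    """Storage-only monthly cost (USD) for a hot/warm/cold layout."""
    hot = daily_gb * hot_days                    # uncompressed recent data
    warm = daily_gb * warm_days / warm_ratio     # compressed
    cold = daily_gb * cold_days / cold_ratio     # compressed + downsampled
    return hot * prices[0] + warm * prices[1] + cold * prices[2]

def single_tier_cost(daily_gb, total_days, price=0.10):
    """Everything kept uncompressed in the hot tier."""
    return daily_gb * total_days * price

daily_gb = 0.023  # ~23 MB/day, as in the agriculture example
print(f"tiered: ${monthly_storage_cost(daily_gb):.2f}/month")
print(f"single tier: ${single_tier_cost(daily_gb, 1095):.2f}/month")
```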

2.5 Try It Yourself

Exercise: Design Storage for Your Smart Home

Scenario: You have a smart home with 15 sensors (temperature, humidity, motion, door/window contacts, energy meters). Design a storage architecture:

  1. Calculate daily data volume: Each sensor reports every 60 seconds, each reading is 40 bytes. How much data per day? Per year?
  2. Identify your data types: Which sensors produce time-series data (continuous readings)? Which produce event data (state changes only)?
  3. Select databases: Use the Q1-Q5 Decision Framework above to pick database types for each data category. Would you use one database or multiple?
  4. Design retention: How long do you need each type of data? What goes in hot storage vs cold storage? Does motion sensor data need the same retention as energy meter data?

Challenge: Calculate the monthly storage cost using the Tiered Storage Cost Estimator above. How much would you save with tiered storage vs storing everything in a single PostgreSQL database?

Bonus: Motion sensors and door contacts generate events (not continuous readings). How does this affect your database choice compared to temperature sensors that report every minute?
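To check your answer to step 1, a quick calculation (this assumes all 15 sensors report continuously; as the bonus question hints, event-driven sensors produce far less):

```python
# Smart home exercise, step 1: daily and yearly raw volume
sensors = 15
readings_per_day = 24 * 60   # one reading per sensor per minute
record_bytes = 40

daily_kb = sensors * readings_per_day * record_bytes / 1e3
yearly_mb = daily_kb * 365 / 1e3
print(f"{daily_kb:.0f} KB/day, {yearly_mb:.0f} MB/year")   # 864 KB/day, 315 MB/year
```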

Common Mistake: Ignoring Time Zone Complexity

The Error: Storing sensor timestamps in local time without timezone offset, leading to daylight saving time bugs and aggregation errors.

What Goes Wrong:

Scenario: Smart home sensors across the US report temperature in local time without timezone info:

Sensor A (New York):  2024-11-03 01:30:00 (first occurrence, before 2 AM "fall back")
Sensor B (New York):  2024-11-03 01:30:00 (second occurrence, after 2 AM "falls back" to 1 AM)
Sensor C (Chicago):   2024-11-03 01:30:00 (first occurrence, DST transition hasn't happened yet)

Aggregation Query Problem:

SELECT AVG(temperature)
FROM sensor_readings
WHERE timestamp >= '2024-11-03 01:00:00'
  AND timestamp < '2024-11-03 02:00:00'

What happens:

  • The 1:00-2:00 AM hour occurs TWICE on DST “fall back” night (2 AM resets to 1 AM)
  • Without timezone offsets, both occurrences look identical in the database
  • Sensors report twice as many readings for the “ambiguous hour,” skewing averages
  • Dashboard shows an inexplicable “spike” during DST transition

Real-World Impact:

  • HVAC system sees false temperature spike, triggers cooling → wasted energy
  • Anomaly detection flags DST hours as “data quality issues” → 2x false alarms/year
  • Historical analysis comparing Oct vs Nov has discontinuity → broken ML models

The Fix: ALWAYS store UTC, convert to local time only at presentation layer:

-- Correct: Store in UTC (TIMESTAMPTZ in PostgreSQL)
CREATE TABLE sensor_readings (
    time TIMESTAMPTZ NOT NULL,  -- Stores UTC internally
    sensor_id TEXT,
    temperature NUMERIC(4,1)
);

INSERT INTO sensor_readings VALUES
    ('2024-11-03 06:30:00+00'::TIMESTAMPTZ, 'sensor_A', 23.5);  -- 01:30 AM EST = 06:30 UTC

-- Display in user's timezone
SELECT
    time AT TIME ZONE 'America/New_York' as local_time,
    temperature
FROM sensor_readings
WHERE time >= '2024-11-03 06:00:00+00'::TIMESTAMPTZ;  -- 01:00 AM EST = 06:00 UTC

Best Practices:

  1. Sensors send UTC: Configure devices to use NTP and report in UTC (append ‘Z’)
  2. Database stores UTC: Use TIMESTAMPTZ (PostgreSQL) or store Unix epoch + UTC
  3. Application converts: Only convert to local time at the UI layer for display
  4. Never do timezone math manually: Use database functions (AT TIME ZONE) or libraries (Python zoneinfo/dateutil, JavaScript Intl.DateTimeFormat or luxon)
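The ambiguity is easy to demonstrate with Python's standard zoneinfo module: the same local wall-clock time maps to two different UTC instants on fall-back night, distinguished only by the fold attribute:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

ny = ZoneInfo("America/New_York")

# The "ambiguous hour": 01:30 local occurs twice on 2024-11-03
first = datetime(2024, 11, 3, 1, 30, fold=0, tzinfo=ny)   # before fall back (EDT)
second = datetime(2024, 11, 3, 1, 30, fold=1, tzinfo=ny)  # after fall back (EST)

print(first.astimezone(timezone.utc))   # 2024-11-03 05:30:00+00:00
print(second.astimezone(timezone.utc))  # 2024-11-03 06:30:00+00:00
```

Storing the UTC instant preserves the distinction; storing the naive local string "2024-11-03 01:30:00" loses it forever.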

Test Your System:

  • Insert test data spanning DST transitions (first Sunday of November, second Sunday of March in the US; e.g., Nov 3 and Mar 10 in 2024)
  • Run aggregation queries across the transition
  • Verify no duplicate/missing hours

Key Insight: UTC has no DST transitions. Store UTC, convert late.

2.6 Self-Assessment: Check Your Understanding

Common Pitfalls

PostgreSQL or MySQL can handle early prototypes, but inserting thousands of sensor readings per second quickly saturates write throughput due to index maintenance and WAL flushing. Plan for time-series storage from day one by using TimescaleDB (PostgreSQL-compatible) or InfluxDB, which are designed for this exact workload pattern.

Key Concepts
  • Polyglot Persistence: Using multiple database types within one IoT system – relational for device metadata, time-series for sensor telemetry, document stores for event logs – to match each data type to its optimal storage engine
  • Time-Series Database: A database engine optimized for high-velocity timestamped data, supporting automatic chunk partitioning, columnar compression, and continuous aggregates for IoT sensor streams
  • Storage Tiering: A cost-optimization architecture dividing data into hot (SSD, recent days), warm (HDD, weeks), and cold (object storage, months/years) tiers based on access frequency
  • CAP Theorem: The distributed systems constraint that limits databases to two of Consistency, Availability, and Partition tolerance – partition tolerance is mandatory in IoT, requiring a CP vs AP choice per data type
  • Data Sharding: Horizontal partitioning of data across multiple database nodes using device ID, time range, or geography as the partition key to scale write throughput beyond single-node limits
  • Retention Policy: An automated rule that compresses, downsamples, or deletes data after a configured age, preventing unbounded storage growth while preserving aggregate summaries of historical data
  • Data Quality Dimension: A measurable attribute of data reliability – including completeness, accuracy, consistency, timeliness, and validity – used to score incoming sensor readings before storage
  • Hypertable: A TimescaleDB abstraction that automatically partitions a table into time-based chunks, enabling efficient parallel queries and background compression without changing the SQL interface

IoT storage costs grow faster than expected. 100 sensors at 1-second intervals generate roughly 8.6 million rows per day, on the order of 1 TB per year at typical row sizes. Without storage tiering and retention policies in place from day one, storage costs compound rapidly. Calculate your data volume before choosing a storage tier strategy.
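A quick back-of-the-envelope check (the 320-byte row size is an assumption for illustration):

```python
# Rows and raw bytes for 100 sensors reporting once per second
sensors = 100
rows_per_day = sensors * 86_400          # one row per sensor per second
rows_per_year = rows_per_day * 365
tb_per_year = rows_per_year * 320 / 1e12  # assumed 320 bytes/row

print(f"{rows_per_day / 1e6:.1f} M rows/day")   # 8.6 M rows/day
print(f"{tb_per_year:.1f} TB/year")             # 1.0 TB/year
```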

Control commands for actuators (turn off motor) require strong consistency to prevent conflicting state; sensor telemetry (temperature readings) tolerates eventual consistency. Applying CP everywhere causes unnecessary availability loss; applying AP everywhere risks dangerous conflicting commands. Classify data by business impact of inconsistency.

2.7 Summary

This chapter introduced the fundamental concepts of IoT data storage:

  • IoT data challenges include massive scale (billions of devices), high velocity (thousands of writes/second), data variety (structured, semi-structured, unstructured), and cost optimization (storage can exceed compute costs)
  • Three main database types serve different IoT needs:
    • Relational (SQL): Device metadata, user accounts, relationships (PostgreSQL, MySQL)
    • NoSQL (Document/Key-Value): Flexible schemas, event logs, configuration (MongoDB, Redis)
    • Time-Series: Timestamped sensor readings, metrics (InfluxDB, TimescaleDB)
  • Database selection depends on data characteristics, write patterns, query patterns, and consistency requirements (CAP theorem trade-offs)
  • Multi-tier storage (hot/warm/cold) balances performance with cost-effectiveness–keeping recent data in fast storage while archiving historical data to cheaper object storage
  • Polyglot persistence uses multiple database technologies together, each optimized for specific data types
Quick Decision Framework

Use the Q1-Q5 Decision Framework presented earlier in this chapter to select your database type. The key questions address timestamped data, relational structure, schema flexibility, caching needs, and binary storage.

2.8 Concept Relationships

Data Storage Overview provides foundational context for all IoT data management.

2.9 What’s Next

Your next step depends on what aspect of data storage you want to explore:

| Topic | Chapter | What You'll Learn |
|---|---|---|
| Database selection | Database Selection Framework | How to choose SQL vs NoSQL vs time-series databases for specific IoT workloads |
| Distributed trade-offs | CAP Theorem and Database Categories | Consistency, availability, and partition tolerance trade-offs in distributed IoT systems |
| Time-series deep dive | Time-Series Databases | InfluxDB, TimescaleDB, and Prometheus for high-velocity sensor data |
| Data integrity | Data Quality Monitoring | Detecting and handling missing, duplicate, and out-of-range sensor readings |
| Horizontal scaling | Sharding Strategies | Partitioning data across nodes for millions of IoT devices |
| End-to-end design | Worked Examples | Fleet management and smart city data lake architecture case studies |

External Resources: