3  Data Storage and Databases

Learning Objectives

After completing this chapter and its companion chapters, you will be able to:

  • Select the appropriate database type (SQL, NoSQL, time-series) for different IoT data patterns
  • Apply CAP theorem trade-offs to design distributed IoT storage architectures
  • Implement time-series databases with compression and retention policies for sensor telemetry
  • Design multi-tier storage strategies (hot/warm/cold) for cost-effective data lifecycle management
  • Evaluate sharding strategies for horizontally scaling IoT data storage
  • Validate and assess data quality in production IoT systems

In 60 Seconds

IoT systems cannot rely on a single database type because sensor telemetry, device metadata, and event logs have fundamentally different shapes and query patterns. Time-series databases handle high-velocity sensor streams, relational databases manage structured device registries, and document stores accommodate flexible event schemas – this combination is called polyglot persistence. The key architectural decision is matching each data type to its optimal store, then connecting them through a unified data pipeline.

3.1 Key Concepts

  • Polyglot Persistence: Using multiple database types within one system, each optimized for a specific data shape – relational for metadata, time-series for telemetry, document for events
  • Hot/Warm/Cold Storage Tiers: A cost-optimization strategy that keeps recent data on fast SSD (hot), older data on HDD (warm), and archival data on object storage (cold), reducing storage costs by 80-95%
  • Time-Series Database: A database optimized for timestamped sequential data, supporting automatic compression, retention policies, and time-windowed aggregations common in IoT telemetry
  • CAP Theorem: States that distributed databases can only guarantee two of three properties – Consistency, Availability, and Partition tolerance – simultaneously; IoT systems must choose per data type
  • Hypertable: A TimescaleDB abstraction that automatically partitions a table into chunks by time range, enabling efficient time-windowed queries across billions of rows
  • Sharding: Horizontal partitioning that distributes data across multiple database nodes by device ID, time range, or geographic region to scale write throughput beyond a single machine
  • Data Lifecycle Management: Automated policies that migrate, compress, or delete data based on age and access frequency, preventing unbounded storage growth in long-running IoT deployments
  • Write-Ahead Log (WAL): A durability mechanism where changes are first written to a sequential log before being applied to the main data store, enabling crash recovery and replication
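The WAL mechanism in the last bullet can be sketched in a few lines of Python. This is a toy illustration of the log-first, apply-second discipline, not a real storage engine; the `TinyKV` class and its file layout are invented for this example.

```python
import json
import os
import tempfile

class TinyKV:
    """Toy key-value store with a write-ahead log (illustrative sketch)."""

    def __init__(self, wal_path):
        self.wal_path = wal_path
        self.data = {}
        self._recover()

    def _recover(self):
        # Replay the log on startup: any write that reached the WAL
        # survives a crash of the in-memory store.
        if os.path.exists(self.wal_path):
            with open(self.wal_path) as f:
                for line in f:
                    record = json.loads(line)
                    self.data[record["key"]] = record["value"]

    def put(self, key, value):
        # 1. Append to the sequential log and force it to disk first...
        with open(self.wal_path, "a") as f:
            f.write(json.dumps({"key": key, "value": value}) + "\n")
            f.flush()
            os.fsync(f.fileno())
        # 2. ...only then apply the change to the main store.
        self.data[key] = value

wal = os.path.join(tempfile.mkdtemp(), "demo.wal")
store = TinyKV(wal)
store.put("sensor_001/last_value", 23.5)
restarted = TinyKV(wal)  # simulates recovery after a crash
print(restarted.data["sensor_001/last_value"])  # 23.5
```

Because the append-only log is sequential, writes stay fast even under heavy ingestion; replication works the same way, by shipping the log to replicas.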

3.2 Minimum Viable Understanding

  • Match the database to the data: Use relational databases for device metadata, time-series databases for sensor telemetry, and NoSQL for flexible schemas – most IoT systems need all three (“polyglot persistence”).
  • Storage tiers save money: Keeping recent data in fast “hot” storage and moving older data to cheaper “cold” storage reduces costs by 80-95% without losing access to historical records.
  • CAP theorem forces trade-offs: In distributed IoT systems, you cannot have perfect consistency, availability, and partition tolerance simultaneously – choose based on whether stale data or downtime is more dangerous for your use case.

Sammy the Sensor collects a temperature reading every 5 minutes. That is 288 readings per day, over 100,000 per year – and that is just ONE sensor!

Lila the Lab Technician explains: “Think of databases like different types of notebooks. A lined notebook (relational database) is great when you need neat rows and columns – like keeping a list of all your sensors and where they are. A lab journal (time-series database) is perfect when you write down measurements with timestamps – like recording the temperature every few minutes. And a scrapbook (NoSQL database) is best when every page can look different – like storing photos, notes, and drawings all together.”

Max the Maker asks: “But what happens when you run out of pages?” Lila smiles: “That is where sharding comes in – you get more notebooks and split the work! Some notebooks handle sensors A through M, others handle N through Z. Together, they can store everything!”

Bella the Builder adds: “And old notebooks you do not look at much? You put them on a high shelf (cold storage) to save space on your desk (hot storage). You can still get them – it just takes a bit longer!”

3.3 Overview

IoT systems generate diverse data types requiring different storage strategies. This comprehensive chapter series covers database selection, distributed systems trade-offs, time-series optimization, data quality monitoring, sharding strategies, and complete worked examples.

This chapter has been split into focused topics for better learning:

3.3.1 Chapter Series

| Chapter | Topics Covered | Approx. Words |
|---|---|---|
| Database Selection Framework | Choosing SQL vs NoSQL vs time-series, decision framework, real-world examples | ~4,500 |
| CAP Theorem and Database Categories | Distributed systems, consistency vs availability, database trade-offs | ~3,800 |
| Time-Series Databases | TimescaleDB, InfluxDB, hypertables, compression, retention policies | ~3,200 |
| Data Quality Monitoring | Quality dimensions, validation, monitoring dashboards, handling bad data | ~3,500 |
| Sharding Strategies | Horizontal scaling, time vs device vs hybrid sharding, implementation | ~2,800 |
| Worked Examples | Fleet management (10K vehicles), smart city data lake (50K sensors) | ~4,200 |

3.3.2 Quick Start Guide

New to databases? Start with Database Selection Framework

Building for scale? Read CAP Theorem for distributed systems

Working with sensors? Jump to Time-Series Databases

Production systems? See Data Quality Monitoring

Massive scale? Learn Sharding Strategies

Need examples? Check Worked Examples

A database is an organized system for storing, retrieving, and managing data – think of it as a very smart filing cabinet that can instantly find any piece of information you ask for.

Why can’t IoT just use one regular database?

Regular databases (like MySQL or PostgreSQL) are designed for things like user accounts, product catalogs, and orders – data that changes occasionally and has a well-defined structure. IoT is different because:

  1. Volume: A single factory with 1,000 sensors reporting every second produces 86.4 million data points per day. Regular databases struggle with this write speed.
  2. Velocity: Data arrives continuously and must be stored immediately – you cannot tell a sensor to “wait” while the database catches up.
  3. Variety: Some data is structured (temperature = 23.5C), some is semi-structured (a JSON event with variable fields), and some is unstructured (camera images).
  4. Time-centricity: Almost all IoT queries are time-based (“show me the last hour of readings”), which regular databases handle inefficiently.

That is why IoT systems typically use multiple database types together – a strategy called “polyglot persistence.” Each database type handles what it does best.

3.3.3 How Much Data Does Your IoT System Generate?

Use this calculator to understand how quickly IoT data accumulates. Try different sensor counts and frequencies to see why specialized storage is essential.
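The same back-of-envelope arithmetic can be done in a few lines of Python. The 100-byte reading size and 30-day months are illustrative assumptions that the rest of this chapter also uses:

```python
def storage_estimate(sensors, readings_per_min, bytes_per_reading=100, days=720):
    """Back-of-envelope raw storage estimate (assumes 30-day months)."""
    readings_per_day = sensors * readings_per_min * 60 * 24
    gb_per_day = readings_per_day * bytes_per_reading / 1e9
    return readings_per_day, gb_per_day, gb_per_day * days

# The chapter's running example: 500 sensors, one reading per minute, 2 years.
per_day, gb_day, gb_total = storage_estimate(500, 1)
print(f"{per_day:,} readings/day, {gb_day * 1000:.0f} MB/day, {gb_total:.2f} GB over 2 years")

# 1,000 sensors at 1 reading/sec for one month: the write volume that
# overwhelms a naive single-database setup.
per_day, gb_day, gb_total = storage_estimate(1000, 60, days=30)
print(f"{per_day:,} readings/day, {gb_day:.1f} GB/day, {gb_total:.0f} GB over 1 month")
```

Raw volume alone understates the problem: every one of those readings is also a write the database must index, which is why write throughput matters as much as disk space.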

3.4 IoT Data Storage Landscape

The IoT data storage landscape spans multiple database technologies, each optimized for different data shapes and access patterns. Understanding how these fit together is the first step toward designing a robust storage architecture.

Flowchart showing IoT data flowing from sensors and devices through an ingestion layer into three database categories: relational databases for device metadata, time-series databases for sensor telemetry, and NoSQL databases for event logs and configurations. Each category connects to appropriate storage tiers and query patterns.
Figure 3.1: IoT data storage landscape

3.5 Database Selection Decision Framework

Choosing the right database for each data type is one of the most impactful architectural decisions in an IoT project. The wrong choice can lead to 10x higher costs, unacceptable query latency, or system failures under load.

Decision tree diagram for selecting an IoT database type. Starting from data characteristics, the tree branches based on whether data is time-stamped telemetry, structured with relationships, or flexible schema, leading to recommendations for time-series, relational, or NoSQL databases respectively. Additional decision points cover write frequency and consistency requirements.
Figure 3.2: Database selection decision tree

3.6 Multi-Tier Storage Architecture

IoT data has a natural lifecycle: recent data is queried frequently and must be fast, while older data is accessed rarely but must remain available. A tiered approach balances performance with cost.

Diagram showing three storage tiers for IoT data lifecycle. The hot tier uses SSDs and in-memory caches for the last 7 days of data with sub-second query latency. The warm tier uses standard disks for 7-90 day data with second-level latency. The cold tier uses object storage like S3 for data older than 90 days with minute-level latency. Arrows show data flowing from hot to warm to cold over time, with cost decreasing at each tier.
Figure 3.3: Multi-tier storage architecture for IoT data

3.7 CAP Theorem for IoT

When IoT databases are distributed across multiple nodes (essential for reliability and scale), the CAP theorem describes the fundamental trade-off you must make:

Diagram representing the CAP theorem showing three properties: Consistency, Availability, and Partition Tolerance. Each pair is labeled with example database technologies. CP systems like MongoDB and HBase prioritize consistency, becoming unavailable during partitions. AP systems like Cassandra and DynamoDB prioritize availability, accepting temporary inconsistency. CA behavior applies only to non-distributed setups such as single-node PostgreSQL, where network partitions do not occur.
Figure 3.4: CAP theorem trade-offs for IoT databases

3.8 IoT Database Comparison at a Glance

| Feature | Relational (PostgreSQL) | Time-Series (InfluxDB) | Document (MongoDB) | Key-Value (Redis) |
|---|---|---|---|---|
| Best for | Device metadata, accounts | Sensor telemetry | Event logs, configs | Caching, latest values |
| Write speed | ~10K rows/sec | ~500K points/sec | ~50K docs/sec | ~100K ops/sec |
| Query strength | Complex joins, aggregations | Time-range, downsampling | Flexible filters | Key lookup |
| Schema | Fixed (ALTER TABLE) | Tag-based, flexible | Schema-free JSON | Key-value pairs |
| Scaling | Vertical (bigger server) | Horizontal (clustering) | Horizontal (sharding) | Horizontal (clustering) |
| IoT data share | 5-10% of data | 70-80% of data | 10-15% of data | Cache layer (overlaps) |

| Data Characteristic | Recommended Database | Alternative | Reasoning |
|---|---|---|---|
| Timestamped sensor readings | TimescaleDB or InfluxDB | PostgreSQL (small scale) | Time-series optimization essential at scale |
| Device metadata (ID, location, owner) | PostgreSQL | TimescaleDB | Relational structure, ACID transactions |
| Event logs (variable schema) | MongoDB | Elasticsearch | Flexible schema, document-oriented |
| Real-time latest values | Redis | Memcached | In-memory, sub-millisecond reads |
| Video/images | Object storage (S3) | MinIO (self-hosted) | Cost-effective blob storage |
| User session data | Redis | DynamoDB | Fast key-value access |
| Compliance archives (7+ years) | S3 Glacier | Tape backup | Lowest cost per GB |

How to use this table:

  1. Identify your data types: List all data your IoT system generates
  2. Match to characteristics: Find the row that best describes each data type
  3. Choose primary database: Use “Recommended” column for production
  4. Consider alternatives: Use “Alternative” if you have specific constraints
  5. Implement polyglot: Most systems need 3-5 different databases working together

Example Mapping (fleet management):

  • GPS telemetry -> TimescaleDB (time-series)
  • Vehicle metadata -> PostgreSQL (relational)
  • Driver behavior events -> MongoDB (flexible)
  • Real-time map positions -> Redis (cache)
  • Dashcam footage -> S3 (object storage)
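This mapping can be expressed as a small routing table in the ingestion layer. The store names and the `route` helper below are illustrative sketches, not a real client API:

```python
# Hypothetical routing table for the fleet-management example: each data
# category is sent to the store that handles its shape best.
ROUTES = {
    "gps_telemetry": "timescaledb",    # timestamped, high volume -> time-series
    "vehicle_metadata": "postgresql",  # structured, relational -> SQL
    "driver_events": "mongodb",        # variable schema -> document store
    "live_positions": "redis",         # latest-value lookups -> key-value cache
    "dashcam_footage": "s3",           # large blobs -> object storage
}

def route(record_type):
    """Return the backing store for a record type (sketch only)."""
    try:
        return ROUTES[record_type]
    except KeyError:
        raise ValueError(f"No storage route defined for {record_type!r}")

print(route("gps_telemetry"))   # timescaledb
print(route("dashcam_footage")) # s3
```

Centralizing the mapping in one place keeps the polyglot architecture auditable: adding a new data type means adding one route, not touching every producer.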

3.9 Worked Example: Smart Building Storage Architecture

Worked Example: Designing Storage for a 500-Sensor Smart Building

Scenario: You are architecting the data storage layer for a commercial smart building with 500 sensors (temperature, humidity, occupancy, energy, air quality) across 10 floors. The system must support real-time dashboards, monthly energy reports, and 2 years of historical data for compliance.

Step 1: Characterize the Data

| Data Type | Source | Write Rate | Schema | Query Pattern |
|---|---|---|---|---|
| Sensor telemetry | 500 sensors @ 1/min | 500 writes/min | Fixed (timestamp, value, unit) | Time-range, aggregation |
| Device metadata | 500 devices | Rare updates | Structured, relational | Lookup by ID, joins |
| Occupancy events | 50 motion sensors | ~2,000/day | Semi-structured JSON | Time-range, floor filter |
| Energy reports | Aggregated | Monthly batch | Structured | Full table scans |
| Maintenance logs | Technicians | ~10/day | Free-form text + metadata | Full-text search |

Step 2: Select Databases

  • TimescaleDB for sensor telemetry: 500 writes/min is well within capacity, and SQL compatibility means the team already knows the query language. Time-based partitioning handles the 2-year retention requirement.
  • PostgreSQL for device metadata and energy reports: ACID transactions ensure device registry consistency. Foreign keys link devices to floors and zones.
  • MongoDB for maintenance logs and occupancy events: Flexible schema accommodates varying event structures. Full-text search index for maintenance log queries.

Step 3: Design Storage Tiers

| Tier | Data Age | Storage | Query Latency |
|---|---|---|---|
| Hot | 0-7 days | TimescaleDB on SSD | Sub-second |
| Warm | 7-90 days | TimescaleDB on HDD (compressed) | Seconds |
| Cold | 90 days - 2 years | AWS S3 Glacier | Minutes |

See the cost calculations later in this section for the detailed numbers.
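The tier table above amounts to a lifecycle policy that can be sketched as a simple classifier. The 7- and 90-day cut-offs are this example's choices, not universal defaults:

```python
from datetime import datetime, timedelta, timezone

def storage_tier(record_time, now=None):
    """Classify a record as 'hot', 'warm', or 'cold' by age (sketch)."""
    now = now or datetime.now(timezone.utc)
    age = now - record_time
    if age <= timedelta(days=7):
        return "hot"    # SSD-backed, sub-second queries
    if age <= timedelta(days=90):
        return "warm"   # compressed chunks, second-level queries
    return "cold"       # archive storage, minute-level retrieval

now = datetime(2025, 6, 1, tzinfo=timezone.utc)
print(storage_tier(now - timedelta(days=3), now))    # hot
print(storage_tier(now - timedelta(days=30), now))   # warm
print(storage_tier(now - timedelta(days=400), now))  # cold
```

In production this classification runs as a scheduled job (or a built-in retention policy in TimescaleDB/InfluxDB) that moves or compresses chunks rather than individual rows.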

Step 4: Calculate Storage Requirements

  • Sensor telemetry: 500 sensors x 1 reading/min x 60 min x 24 hr = 720,000 readings/day
  • Each reading: ~100 bytes (timestamp + device_id + value + unit + quality_flag)
  • Daily raw storage: 720,000 x 100 bytes = 72 MB/day = ~2.16 GB/month
  • With 10:1 compression on warm/cold: 2-year total = ~5.2 GB compressed
  • 2-year all-hot storage cost: 51.84 GB x $0.15/GB = $7.78/month at peak (~$93 total over 2 years)
  • 2-year tiered storage cost: ~$2.40 total over 2 years (hot + warm + compressed cold)

Result: A polyglot architecture using three databases, tiered storage, and automated data lifecycle policies cuts storage costs by roughly 97% in this example while meeting all performance and compliance requirements.

Tiered storage dramatically reduces costs by moving older data from expensive hot storage (SSD) to cheaper cold storage (S3 Glacier).

Monthly storage cost = hot tier cost + warm tier cost + cold tier cost

Worked example: Our 500-sensor smart building generates 2.16 GB/month. After 2 years (51.84 GB total), calculate steady-state storage costs for tiered vs all-hot storage.

All-hot storage (all 51.84 GB on SSD at $0.15/GB/month):

  • Monthly cost at 2-year mark: 51.84 GB x $0.15 = $7.78/month
  • Average monthly cost over 2 years (data grows linearly): ~$3.89/month
  • 2-year total: ~$93

Tiered storage (steady state at 2-year mark):

  • Hot (last 7 days): 0.50 GB x $0.15/GB = $0.08/month
  • Warm (7-90 days, 10:1 compressed): 5.98 GB raw / 10 = 0.60 GB x $0.03/GB = $0.02/month
  • Cold (90+ days, 10:1 compressed): 45.36 GB raw / 10 = 4.54 GB x $0.004/GB = $0.02/month
  • Monthly at steady state: ~$0.12/month
  • 2-year total (accounting for growth): ~$2.40, most of it the constant hot tier, since warm and cold data are heavily compressed and cheap

Savings: ($93 - $2.40) / $93 = ~97% cost reduction with tiered storage and compression in this example; pure tiering without compression yields less, which is why 80-95% is the commonly quoted range.

Note: These are storage costs only. Production systems also incur compute, retrieval, and data transfer costs, which vary by cloud provider and access patterns.
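The steady-state monthly figures above can be reproduced directly. The per-GB prices ($0.15 SSD, $0.03 HDD, $0.004 archive) and the 10:1 compression ratio are the chapter's illustrative assumptions:

```python
# Reproduce the steady-state monthly comparison at the 2-year mark.
GB_PER_DAY = 0.072   # 500 sensors x 1 reading/min x 100 bytes/reading
COMPRESSION = 10
TOTAL_DAYS = 720     # 2 years of 30-day months

hot_gb = 7 * GB_PER_DAY                               # last 7 days, raw
warm_gb = (90 - 7) * GB_PER_DAY / COMPRESSION         # 7-90 days, compressed
cold_gb = (TOTAL_DAYS - 90) * GB_PER_DAY / COMPRESSION

tiered = hot_gb * 0.15 + warm_gb * 0.03 + cold_gb * 0.004
all_hot = TOTAL_DAYS * GB_PER_DAY * 0.15

print(f"All-hot: ${all_hot:.2f}/month at the 2-year mark")
print(f"Tiered:  ${tiered:.2f}/month at the 2-year mark")
```

The tiered figure comes out at $0.11 rather than $0.12 only because the text rounds each tier up to the nearest cent before summing.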

Objective: Illustrate the conceptual difference between row-oriented and column-oriented storage layouts for IoT queries.

Note: This Python simulation demonstrates the data layout differences, not true storage-engine performance. In Python, both dict and list access are O(1), so timing differences are minimal. The real advantage of columnar storage (sequential disk I/O, CPU cache locality, vectorized compression) manifests at the storage engine level with millions of rows on disk – typically yielding 10-100x speedups over row-oriented scans for analytical queries.

import time
import random
from collections import defaultdict

# Simulate IoT sensor data (10 devices, 1000 readings each)
random.seed(42)
NUM_DEVICES = 10
NUM_READINGS = 1000

# Row-oriented storage (like PostgreSQL)
row_storage = []
for device_id in range(NUM_DEVICES):
    for i in range(NUM_READINGS):
        row_storage.append({
            "device_id": f"sensor_{device_id:03d}",
            "timestamp": 1700000000 + i * 30,  # Every 30 seconds
            "temperature": round(20 + random.gauss(0, 2) + device_id * 0.5, 2),
            "humidity": round(50 + random.gauss(0, 5), 1),
        })

# Column-oriented storage (like InfluxDB/TimescaleDB)
col_storage = {
    "device_id": [r["device_id"] for r in row_storage],
    "timestamp": [r["timestamp"] for r in row_storage],
    "temperature": [r["temperature"] for r in row_storage],
    "humidity": [r["humidity"] for r in row_storage],
}

print(f"Total records: {len(row_storage)}")
print(f"Row storage size estimate: ~{len(row_storage) * 100} bytes (dict overhead per row)")
print(f"Column storage size estimate: ~{len(row_storage) * 20} bytes (arrays, no per-row overhead)\n")

# Query 1: Average temperature for one device (last ~50 minutes)
target_device = "sensor_005"
time_start = 1700000000 + 900 * 30  # Last 100 readings (~50 min at 30s intervals)

# Row scan: must read every field of every row
start = time.time()
for _ in range(100):
    total, count = 0, 0
    for row in row_storage:
        if row["device_id"] == target_device and row["timestamp"] >= time_start:
            total += row["temperature"]
            count += 1
    avg_row = total / count if count else 0
row_time = (time.time() - start) / 100

# Column scan: accesses arrays (conceptually reads only needed columns)
start = time.time()
for _ in range(100):
    total, count = 0, 0
    for i in range(len(col_storage["device_id"])):
        if col_storage["device_id"][i] == target_device and \
           col_storage["timestamp"][i] >= time_start:
            total += col_storage["temperature"][i]
            count += 1
    avg_col = total / count if count else 0
col_time = (time.time() - start) / 100

print("Query: AVG(temperature) WHERE device='sensor_005' AND time > T")
print(f"  Row-oriented: {row_time*1000:.2f} ms (reads ALL columns per row)")
print(f"  Column-oriented: {col_time*1000:.2f} ms (reads only needed columns)")
print(f"  Results match: {abs(avg_row - avg_col) < 0.001}")

# Query 2: Downsampling (5-minute averages)
start = time.time()
for _ in range(50):
    buckets = defaultdict(list)
    for row in row_storage:
        if row["device_id"] == target_device:
            bucket = row["timestamp"] // 300 * 300
            buckets[bucket].append(row["temperature"])
    avgs = {k: sum(v)/len(v) for k, v in buckets.items()}
row_ds_time = (time.time() - start) / 50

print(f"\nQuery: 5-minute downsampled averages for sensor_005")
print(f"  Row-oriented: {row_ds_time*1000:.2f} ms")
print(f"  Buckets: {len(avgs)} (5-min windows)")
print(f"\nKey insight: Time-series DBs optimize these patterns with")
print(f"columnar storage, time-based indexing, and built-in downsampling.")

What to Observe:

  1. Row storage reads all columns even when the query only needs temperature
  2. Column storage reads only the needed columns, reducing I/O
  3. Time-range queries benefit from sorted timestamp indexes (not shown but critical at scale)
  4. Downsampling (aggregating to lower resolution) is a core time-series operation
  5. At scale (millions of rows), the difference is 10-100x, not just 2-3x

3.10 IoT Storage Cost Calculator

Use this interactive calculator to estimate storage requirements and compare all-hot versus tiered storage costs for your IoT deployment.

3.11 Common Pitfalls

Common Pitfalls in IoT Data Storage

1. Using a Single Database for Everything Many teams start with PostgreSQL for all data because the team already knows SQL. This works initially but fails at scale: 10,000 sensors at 1 reading/sec produce 864M rows/day, and after 3 months the B-tree indexes exceed RAM, degrading every query. Storing video as BLOBs in SQL wastes money ($0.15/GB vs $0.02/GB in object storage) and prevents CDN delivery. Flexible event logs require constant ALTER TABLE migrations. Plan for polyglot persistence from the start – use each database for what it does best.

2. Ignoring Storage Costs Until the Bill Arrives Storing all sensor data at full resolution in hot storage seems reasonable in development (100 sensors, 1 week of data = trivial). In production (10,000 sensors, 2 years = terabytes), the monthly bill can exceed $10,000. Design tiered storage and retention policies before launch.

3. Skipping Data Quality Validation “Garbage in, garbage out” is amplified in IoT. A temperature sensor reporting -999C or a GPS sensor stuck at (0, 0) will corrupt aggregations and trigger false alerts. Implement validation at ingestion: range checks, rate-of-change limits, and missing data detection.
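The three checks above can be sketched as a single ingestion-time validator. The thresholds below are illustrative values for an indoor temperature sensor, not recommended limits:

```python
def validate_reading(value, prev_value=None, prev_age_s=None,
                     lo=-40.0, hi=85.0, max_delta_per_min=5.0, max_gap_s=120):
    """Return a list of quality flags for one temperature reading (sketch)."""
    flags = []
    if not (lo <= value <= hi):
        flags.append("out_of_range")        # catches sentinel values like -999
    if prev_value is not None and prev_age_s:
        rate = abs(value - prev_value) / (prev_age_s / 60)
        if rate > max_delta_per_min:
            flags.append("rate_of_change")  # physically implausible jump
    if prev_age_s is not None and prev_age_s > max_gap_s:
        flags.append("gap_before_reading")  # missing-data detection
    return flags

print(validate_reading(-999.0))                                # ['out_of_range']
print(validate_reading(23.5, prev_value=22.9, prev_age_s=60))  # []
print(validate_reading(40.0, prev_value=20.0, prev_age_s=60))  # ['rate_of_change']
```

Flagged readings are typically stored anyway with their quality flags attached, so aggregations can exclude them without losing the raw evidence.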

4. Choosing AP When You Need CP (or Vice Versa) Using an eventually-consistent database (like Cassandra) to track firmware versions can leave replicas temporarily disagreeing about which version each device should run, so rollouts and audits act on stale information. Conversely, using a strongly-consistent setup (like MongoDB with majority write concern) for non-critical sensor telemetry adds write latency for no benefit. Map each data type to its consistency requirement.

5. Not Planning for Schema Evolution IoT devices get firmware updates that add new sensor fields. If your schema is rigid (e.g., a fixed SQL table), every new field requires an ALTER TABLE migration and application redeployment. Consider semi-structured storage (JSONB columns in PostgreSQL, or MongoDB documents) for data types that evolve frequently.

6. Underestimating Time-Zone Complexity Sensors in different time zones reporting in local time (instead of UTC) create aggregation nightmares. Always store timestamps in UTC and convert to local time only at the presentation layer.
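A quick illustration of why: around a daylight-saving transition, local wall-clock time jumps while UTC stays monotonic. This sketch assumes Python 3.9+ with the system time-zone database (or the `tzdata` package) available:

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

berlin = ZoneInfo("Europe/Berlin")

# Two readings exactly one hour apart in UTC, straddling the 2025 spring
# DST transition (clocks jump forward at 01:00 UTC on 30 March).
t1 = datetime(2025, 3, 30, 0, 30, tzinfo=timezone.utc)
t2 = t1 + timedelta(hours=1)

print(t1.astimezone(berlin).isoformat())  # 2025-03-30T01:30:00+01:00
print(t2.astimezone(berlin).isoformat())  # 2025-03-30T03:30:00+02:00
# In local time the readings appear two hours apart; aggregating by local
# wall-clock time would mis-bucket them. Store UTC, convert for display.
```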

3.12 Knowledge Checks

Test your understanding of IoT data storage concepts:

3.13 Summary and Key Takeaways

This chapter series equips you to design production-grade IoT data storage architectures. Here are the essential principles:

Database Selection:

  • Use relational databases (PostgreSQL, MySQL) for device metadata, user accounts, and data requiring ACID transactions and complex joins
  • Use time-series databases (TimescaleDB, InfluxDB) for sensor telemetry – they provide 10-100x faster queries than generic SQL for time-range operations
  • Use NoSQL databases (MongoDB, Cassandra) for flexible schemas, high write throughput, and geographic distribution
  • Adopt polyglot persistence – most IoT systems need at least 2-3 database types working together

Cost Optimization:

  • Implement multi-tier storage (hot/warm/cold) for 80-95% cost reduction compared to all-hot storage
  • Apply retention policies to automatically downsample or delete old data
  • Use compression (10:1 or better is common for time-series data) to reduce storage footprint

Distributed Systems:

  • Apply CAP theorem to choose consistency vs availability for each data type individually
  • Use CP databases for safety-critical data (firmware versions, billing) and AP databases for telemetry
  • Design hybrid sharding (device + time) for balanced write/query performance at scale
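The hybrid (device + time) scheme in the last bullet can be sketched as a shard-key function: a hash of the device ID picks the node, and a time bucket names the partition within it. The shard count, monthly bucket granularity, and naming scheme are illustrative choices:

```python
import hashlib
from datetime import datetime, timezone

NUM_SHARDS = 8  # illustrative; real clusters size this to write throughput

def shard_for(device_id, ts):
    """Return a 'shardNN.telemetry_YYYY_MM' target for one reading (sketch)."""
    digest = hashlib.sha256(device_id.encode()).hexdigest()
    shard = int(digest, 16) % NUM_SHARDS  # hash spreads devices evenly
    bucket = ts.strftime("%Y_%m")         # time bucket keeps range scans local
    return f"shard{shard:02d}.telemetry_{bucket}"

ts = datetime(2025, 6, 15, tzinfo=timezone.utc)
print(shard_for("sensor_001", ts))
print(shard_for("sensor_002", ts))
```

Hashing the device ID balances write load across nodes (avoiding the hot-shard problem of pure time sharding), while the time bucket means a "last month for device X" query touches one partition on one shard.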

Data Quality:

  • Validate data at ingestion with range checks, rate-of-change limits, and completeness checks
  • Store timestamps in UTC always – convert to local time only at the presentation layer
  • Monitor data quality continuously with automated dashboards and alerts

3.14 What’s Next

After completing the data storage series, explore these related topics based on your needs:

| If you want to… | Read this next |
|---|---|
| Clean and validate data before storing it | Data Quality and Preprocessing |
| Build real-time data pipelines that feed into storage | Stream Processing |
| Process and filter data at the edge before sending to databases | Edge Compute Patterns |
| Learn about distributed systems trade-offs in depth | CAP Theorem and Database Categories |
| Scale your storage to handle millions of devices | Sharding Strategies |

3.15 Concept Relationships

Data Storage and Databases underpins all IoT data-driven systems.

The critical insight is polyglot persistence—using specialized databases for different data types (relational for metadata, time-series for telemetry, object storage for media) rather than forcing everything into one technology.

3.16 See Also
