3 Data Storage and Databases
3.1 Learning Objectives
After completing this chapter and its companion chapters, you will be able to:
- Select the appropriate database type (SQL, NoSQL, time-series) for different IoT data patterns
- Apply CAP theorem trade-offs to design distributed IoT storage architectures
- Implement time-series databases with compression and retention policies for sensor telemetry
- Design multi-tier storage strategies (hot/warm/cold) for cost-effective data lifecycle management
- Evaluate sharding strategies for horizontally scaling IoT data storage
- Validate and assess data quality in production IoT systems
3.2 Minimum Viable Understanding
- Match the database to the data: Use relational databases for device metadata, time-series databases for sensor telemetry, and NoSQL for flexible schemas – most IoT systems need all three (“polyglot persistence”).
- Storage tiers save money: Keeping recent data in fast “hot” storage and moving older data to cheaper “cold” storage reduces costs by 80-95% without losing access to historical records.
- CAP theorem forces trade-offs: In distributed IoT systems, you cannot have perfect consistency, availability, and partition tolerance simultaneously – choose based on whether stale data or downtime is more dangerous for your use case.
Sensor Squad: Where Do All the Sensor Readings Go?
Sammy the Sensor collects a temperature reading every 5 minutes. That is 288 readings per day, over 100,000 per year – and that is just ONE sensor!
Lila the Lab Technician explains: “Think of databases like different types of notebooks. A lined notebook (relational database) is great when you need neat rows and columns – like keeping a list of all your sensors and where they are. A lab journal (time-series database) is perfect when you write down measurements with timestamps – like recording the temperature every few minutes. And a scrapbook (NoSQL database) is best when every page can look different – like storing photos, notes, and drawings all together.”
Max the Maker asks: “But what happens when you run out of pages?” Lila smiles: “That is where sharding comes in – you get more notebooks and split the work! Some notebooks handle sensors A through M, others handle N through Z. Together, they can store everything!”
Bella the Builder adds: “And old notebooks you do not look at much? You put them on a high shelf (cold storage) to save space on your desk (hot storage). You can still get them – it just takes a bit longer!”
3.3 Overview
IoT systems generate diverse data types requiring different storage strategies. This comprehensive chapter series covers database selection, distributed systems trade-offs, time-series optimization, data quality monitoring, sharding strategies, and complete worked examples.
This chapter has been split into focused topics for better learning:
3.3.1 Chapter Series
| Chapter | Topics Covered | Approx Words |
|---|---|---|
| Database Selection Framework | Choosing SQL vs NoSQL vs time-series, decision framework, real-world examples | ~4,500 |
| CAP Theorem and Database Categories | Distributed systems, consistency vs availability, database trade-offs | ~3,800 |
| Time-Series Databases | TimescaleDB, InfluxDB, hypertables, compression, retention policies | ~3,200 |
| Data Quality Monitoring | Quality dimensions, validation, monitoring dashboards, handling bad data | ~3,500 |
| Sharding Strategies | Horizontal scaling, time vs device vs hybrid sharding, implementation | ~2,800 |
| Worked Examples | Fleet management (10K vehicles), smart city data lake (50K sensors) | ~4,200 |
3.3.2 Quick Start Guide
New to databases? Start with Database Selection Framework
Building for scale? Read CAP Theorem for distributed systems
Working with sensors? Jump to Time-Series Databases
Production systems? See Data Quality Monitoring
Massive scale? Learn Sharding Strategies
Need examples? Check Worked Examples
For Beginners: What Is a Database and Why Does IoT Need Special Ones?
A database is an organized system for storing, retrieving, and managing data – think of it as a very smart filing cabinet that can instantly find any piece of information you ask for.
Why can’t IoT just use one regular database?
Regular databases (like MySQL or PostgreSQL) are designed for things like user accounts, product catalogs, and orders – data that changes occasionally and has a well-defined structure. IoT is different because:
- Volume: A single factory with 1,000 sensors reporting every second produces 86.4 million data points per day. Regular databases struggle with this write speed.
- Velocity: Data arrives continuously and must be stored immediately – you cannot tell a sensor to “wait” while the database catches up.
- Variety: Some data is structured (temperature = 23.5C), some is semi-structured (a JSON event with variable fields), and some is unstructured (camera images).
- Time-centricity: Almost all IoT queries are time-based (“show me the last hour of readings”), which regular databases handle inefficiently.
That is why IoT systems typically use multiple database types together – a strategy called “polyglot persistence.” Each database type handles what it does best.
3.3.3 How Much Data Does Your IoT System Generate?
Use this calculator to understand how quickly IoT data accumulates. Try different sensor counts and frequencies to see why specialized storage is essential.
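The same arithmetic can be done with a short script. This is a minimal sketch, assuming ~100 bytes per reading (the figure used later in this chapter); the function name and defaults are illustrative, not part of any library:

```python
def daily_data_volume(sensors: int, readings_per_sec: float,
                      bytes_per_reading: int = 100) -> float:
    """Return raw data volume in megabytes per day (assumes ~100 bytes/reading)."""
    return sensors * readings_per_sec * 86_400 * bytes_per_reading / 1e6

# One sensor every 5 minutes (Sammy) vs a factory of 1,000 sensors at 1 Hz
print(f"{daily_data_volume(1, 1/300):.3f} MB/day")    # a single slow sensor
print(f"{daily_data_volume(1000, 1.0):,.0f} MB/day")  # 86.4M readings/day
```

Try varying the sensor count and frequency: the jump from kilobytes to gigabytes per day is what makes specialized storage essential.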
3.4 IoT Data Storage Landscape
The IoT data storage landscape spans multiple database technologies, each optimized for different data shapes and access patterns. Understanding how these fit together is the first step toward designing a robust storage architecture.
3.5 Database Selection Decision Framework
Choosing the right database for each data type is one of the most impactful architectural decisions in an IoT project. The wrong choice can lead to 10x higher costs, unacceptable query latency, or system failures under load.
3.6 Multi-Tier Storage Architecture
IoT data has a natural lifecycle: recent data is queried frequently and must be fast, while older data is accessed rarely but must remain available. A tiered approach balances performance with cost.
3.7 CAP Theorem for IoT
When IoT databases are distributed across multiple nodes (essential for reliability and scale), the CAP theorem describes the fundamental trade-off you must make:
3.8 IoT Database Comparison at a Glance
| Feature | Relational (PostgreSQL) | Time-Series (InfluxDB) | Document (MongoDB) | Key-Value (Redis) |
|---|---|---|---|---|
| Best for | Device metadata, accounts | Sensor telemetry | Event logs, configs | Caching, latest values |
| Write speed | ~10K rows/sec | ~500K points/sec | ~50K docs/sec | ~100K ops/sec |
| Query strength | Complex joins, aggregations | Time-range, downsampling | Flexible filters | Key lookup |
| Schema | Fixed (ALTER TABLE) | Tag-based, flexible | Schema-free JSON | Key-value pairs |
| Scaling | Vertical (bigger server) | Horizontal (clustering) | Horizontal (sharding) | Horizontal (clustering) |
| IoT data share | 5-10% of data | 70-80% of data | 10-15% of data | Cache layer (overlaps) |
Decision Framework: Database Selection Table
| Data Characteristic | Recommended Database | Alternative | Reasoning |
|---|---|---|---|
| Timestamped sensor readings | TimescaleDB or InfluxDB | PostgreSQL (small scale) | Time-series optimization essential at scale |
| Device metadata (ID, location, owner) | PostgreSQL | TimescaleDB | Relational structure, ACID transactions |
| Event logs (variable schema) | MongoDB | Elasticsearch | Flexible schema, document-oriented |
| Real-time latest values | Redis | Memcached | In-memory, sub-millisecond reads |
| Video/images | Object storage (S3) | MinIO (self-hosted) | Cost-effective blob storage |
| User session data | Redis | DynamoDB | Fast key-value access |
| Compliance archives (7+ years) | S3 Glacier | Tape backup | Lowest cost per GB |
How to use this table:
- Identify your data types: List all data your IoT system generates
- Match to characteristics: Find the row that best describes each data type
- Choose primary database: Use “Recommended” column for production
- Consider alternatives: Use “Alternative” if you have specific constraints
- Implement polyglot: Most systems need 3-5 different databases working together
Example Mapping (fleet management):
- GPS telemetry -> TimescaleDB (time-series)
- Vehicle metadata -> PostgreSQL (relational)
- Driver behavior events -> MongoDB (flexible)
- Real-time map positions -> Redis (cache)
- Dashcam footage -> S3 (object storage)
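In code, this mapping often lives as a simple routing table in the ingestion layer. The sketch below is illustrative only – the data-type keys and store names are assumptions for the fleet-management example, not a real API:

```python
# Illustrative routing table for the fleet-management example.
DATASTORE_FOR = {
    "gps_telemetry": "timescaledb",    # time-series
    "vehicle_metadata": "postgresql",  # relational
    "driver_events": "mongodb",        # flexible schema
    "live_positions": "redis",         # cache
    "dashcam_footage": "s3",           # object storage
}

def route(data_type: str) -> str:
    """Pick a backing store for a data type (defaults to the relational DB)."""
    return DATASTORE_FOR.get(data_type, "postgresql")

print(route("gps_telemetry"))  # -> timescaledb
print(route("unknown_type"))   # -> postgresql (safe default)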
3.9 Worked Example: Smart Building Storage Architecture
Worked Example: Designing Storage for a 500-Sensor Smart Building
Scenario: You are architecting the data storage layer for a commercial smart building with 500 sensors (temperature, humidity, occupancy, energy, air quality) across 10 floors. The system must support real-time dashboards, monthly energy reports, and 2 years of historical data for compliance.
Step 1: Characterize the Data
| Data Type | Source | Write Rate | Schema | Query Pattern |
|---|---|---|---|---|
| Sensor telemetry | 500 sensors @ 1/min | 500 writes/min | Fixed (timestamp, value, unit) | Time-range, aggregation |
| Device metadata | 500 devices | Rare updates | Structured, relational | Lookup by ID, joins |
| Occupancy events | 50 motion sensors | ~2,000/day | Semi-structured JSON | Time-range, floor filter |
| Energy reports | Aggregated | Monthly batch | Structured | Full table scans |
| Maintenance logs | Technicians | ~10/day | Free-form text + metadata | Full-text search |
Step 2: Select Databases
- TimescaleDB for sensor telemetry: 500 writes/min is well within capacity, and SQL compatibility means the team already knows the query language. Time-based partitioning handles the 2-year retention requirement.
- PostgreSQL for device metadata and energy reports: ACID transactions ensure device registry consistency. Foreign keys link devices to floors and zones.
- MongoDB for maintenance logs and occupancy events: Flexible schema accommodates varying event structures. Full-text search index for maintenance log queries.
Step 3: Design Storage Tiers
| Tier | Data Age | Storage | Query Latency |
|---|---|---|---|
| Hot | 0-7 days | TimescaleDB on SSD | Sub-second |
| Warm | 7-90 days | TimescaleDB on HDD (compressed) | Seconds |
| Cold | 90 days - 2 years | AWS S3 Glacier | Minutes |
See the “Putting Numbers to It” section below for detailed cost calculations.
Step 4: Calculate Storage Requirements
- Sensor telemetry: 500 sensors x 1 reading/min x 1,440 min/day = 720,000 readings/day
- Each reading: ~100 bytes (timestamp + device_id + value + unit + quality_flag)
- Daily raw storage: 720,000 x 100 bytes = 72 MB/day = ~2.16 GB/month
- With 10:1 compression on warm/cold tiers: 2-year footprint ~5.6 GB (0.5 GB uncompressed hot + ~5.1 GB compressed)
- 2-year all-hot storage cost: 51.84 GB x $0.15/GB = $7.78/month at peak (~$93 total over 2 years)
- 2-year tiered storage cost: ~$18 total over 2 years (hot + warm + compressed cold)
Result: A polyglot architecture using three databases, tiered storage, and automated data lifecycle policies delivers approximately 80% cost savings while meeting all performance and compliance requirements.
Putting Numbers to It
Tiered storage dramatically reduces costs by moving older data from expensive hot storage (SSD) to cheaper cold storage (S3 Glacier).
Monthly storage cost = (hot GB x $/GB hot) + (warm GB x $/GB warm) + (cold GB x $/GB cold)
Worked example: Our 500-sensor smart building generates 2.16 GB/month. After 2 years (51.84 GB total), calculate steady-state storage costs for tiered vs all-hot storage.
All-hot storage (all 51.84 GB on SSD at $0.15/GB/month):
- Monthly cost at 2-year mark: 51.84 GB x $0.15 = $7.78/month
- Average monthly cost over 2 years (data grows linearly): ~$3.89/month
- 2-year total: ~$93
Tiered storage (steady state at 2-year mark):
- Hot (last 7 days): 0.50 GB x $0.15/GB = $0.08/month
- Warm (7-90 days, 10:1 compressed): 5.98 GB raw / 10 = 0.60 GB x $0.03/GB = $0.02/month
- Cold (90+ days, 10:1 compressed): 45.36 GB raw / 10 = 4.54 GB x $0.004/GB = $0.02/month
- Monthly at steady state: $0.12/month
- 2-year total (accounting for growth): ~$18
Savings: ($93 - $18) / $93 = ~81% cost reduction with tiered storage and compression.
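The steady-state arithmetic above can be reproduced in a few lines. Tier prices are the assumed figures from the example ($0.15/$0.03/$0.004 per GB-month); note the script gives ~$0.11/month because the text rounds each tier up to the nearest cent:

```python
GB_PER_DAY = 0.072                         # 500 sensors x 1/min x ~100 bytes
HOT_P, WARM_P, COLD_P = 0.15, 0.03, 0.004  # $/GB-month (assumed tier prices)
COMPRESSION = 10                           # 10:1 on warm and cold tiers

total = 24 * 2.16                  # GB accumulated after 2 years
hot = 7 * GB_PER_DAY               # last 7 days, uncompressed
warm = (90 - 7) * GB_PER_DAY       # days 7-90, raw size before compression
cold = total - hot - warm          # everything older, raw size

tiered = hot * HOT_P + (warm / COMPRESSION) * WARM_P + (cold / COMPRESSION) * COLD_P
all_hot = total * HOT_P

print(f"Tiered:  ${tiered:.2f}/month")
print(f"All-hot: ${all_hot:.2f}/month")
```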
Note: These are storage costs only. Production systems also incur compute, retrieval, and data transfer costs, which vary by cloud provider and access patterns.
Try It: Time-Series vs Relational Storage Performance
Objective: Illustrate the conceptual difference between row-oriented and column-oriented storage layouts for IoT queries.
Note: This Python simulation demonstrates the data layout differences, not true storage-engine performance. In Python, both dict and list access are O(1), so timing differences are minimal. The real advantage of columnar storage (sequential disk I/O, CPU cache locality, vectorized compression) manifests at the storage engine level with millions of rows on disk – typically yielding 10-100x speedups over row-oriented scans for analytical queries.
```python
import time
import random
from collections import defaultdict

# Simulate IoT sensor data (10 devices, 1000 readings each)
random.seed(42)
NUM_DEVICES = 10
NUM_READINGS = 1000

# Row-oriented storage (like PostgreSQL)
row_storage = []
for device_id in range(NUM_DEVICES):
    for i in range(NUM_READINGS):
        row_storage.append({
            "device_id": f"sensor_{device_id:03d}",
            "timestamp": 1700000000 + i * 30,  # Every 30 seconds
            "temperature": round(20 + random.gauss(0, 2) + device_id * 0.5, 2),
            "humidity": round(50 + random.gauss(0, 5), 1),
        })

# Column-oriented storage (like InfluxDB/TimescaleDB)
col_storage = {
    "device_id": [r["device_id"] for r in row_storage],
    "timestamp": [r["timestamp"] for r in row_storage],
    "temperature": [r["temperature"] for r in row_storage],
    "humidity": [r["humidity"] for r in row_storage],
}

print(f"Total records: {len(row_storage)}")
print(f"Row storage size estimate: ~{len(row_storage) * 100} bytes (dict overhead per row)")
print(f"Column storage size estimate: ~{len(row_storage) * 20} bytes (arrays, no per-row overhead)\n")

# Query 1: Average temperature for one device (last ~50 minutes)
target_device = "sensor_005"
time_start = 1700000000 + 900 * 30  # Last 100 readings (~50 min at 30s intervals)

# Row scan: must read every field of every row
start = time.time()
for _ in range(100):
    total, count = 0, 0
    for row in row_storage:
        if row["device_id"] == target_device and row["timestamp"] >= time_start:
            total += row["temperature"]
            count += 1
    avg_row = total / count if count else 0
row_time = (time.time() - start) / 100

# Column scan: accesses arrays (conceptually reads only needed columns)
start = time.time()
for _ in range(100):
    total, count = 0, 0
    for i in range(len(col_storage["device_id"])):
        if col_storage["device_id"][i] == target_device and \
           col_storage["timestamp"][i] >= time_start:
            total += col_storage["temperature"][i]
            count += 1
    avg_col = total / count if count else 0
col_time = (time.time() - start) / 100

print("Query: AVG(temperature) WHERE device='sensor_005' AND time > T")
print(f"  Row-oriented:    {row_time*1000:.2f} ms (reads ALL columns per row)")
print(f"  Column-oriented: {col_time*1000:.2f} ms (reads only needed columns)")
print(f"  Results match: {abs(avg_row - avg_col) < 0.001}")

# Query 2: Downsampling (5-minute averages)
start = time.time()
for _ in range(50):
    buckets = defaultdict(list)
    for row in row_storage:
        if row["device_id"] == target_device:
            bucket = row["timestamp"] // 300 * 300
            buckets[bucket].append(row["temperature"])
    avgs = {k: sum(v) / len(v) for k, v in buckets.items()}
row_ds_time = (time.time() - start) / 50

print("\nQuery: 5-minute downsampled averages for sensor_005")
print(f"  Row-oriented: {row_ds_time*1000:.2f} ms")
print(f"  Buckets: {len(avgs)} (5-min windows)")
print("\nKey insight: Time-series DBs optimize these patterns with")
print("columnar storage, time-based indexing, and built-in downsampling.")
```
What to Observe:
- Row storage reads all columns even when the query only needs temperature
- Column storage reads only the needed columns, reducing I/O
- Time-range queries benefit from sorted timestamp indexes (not shown but critical at scale)
- Downsampling (aggregating to lower resolution) is a core time-series operation
- At scale (millions of rows), the difference is 10-100x, not just 2-3x
3.10 IoT Storage Cost Calculator
Use this interactive calculator to estimate storage requirements and compare all-hot versus tiered storage costs for your IoT deployment.
3.11 Common Pitfalls
Common Pitfalls in IoT Data Storage
1. Using a Single Database for Everything Many teams start with PostgreSQL for all data because the team already knows SQL. This works initially but fails at scale: 10,000 sensors at 1 reading/sec produce 864M rows/day, and after 3 months the B-tree indexes exceed RAM, degrading every query. Storing video as BLOBs in SQL wastes money ($0.15/GB vs $0.02/GB in object storage) and prevents CDN delivery. Flexible event logs require constant ALTER TABLE migrations. Plan for polyglot persistence from the start – use each database for what it does best.
2. Ignoring Storage Costs Until the Bill Arrives Storing all sensor data at full resolution in hot storage seems reasonable in development (100 sensors, 1 week of data = trivial). In production (10,000 sensors, 2 years = terabytes), the monthly bill can exceed $10,000. Design tiered storage and retention policies before launch.
3. Skipping Data Quality Validation “Garbage in, garbage out” is amplified in IoT. A temperature sensor reporting -999C or a GPS sensor stuck at (0, 0) will corrupt aggregations and trigger false alerts. Implement validation at ingestion: range checks, rate-of-change limits, and missing data detection.
4. Choosing AP When You Need CP (or Vice Versa) Using an eventually-consistent database (like Cassandra) for firmware version tracking can lead to devices running different firmware versions. Conversely, using a strongly-consistent database (like MongoDB with majority write concern) for non-critical sensor data adds unnecessary latency. Map each data type to its consistency requirement.
5. Not Planning for Schema Evolution IoT devices get firmware updates that add new sensor fields. If your schema is rigid (e.g., a fixed SQL table), every new field requires an ALTER TABLE migration and application redeployment. Consider semi-structured storage (JSONB columns in PostgreSQL, or MongoDB documents) for data types that evolve frequently.
6. Underestimating Time-Zone Complexity Sensors in different time zones reporting in local time (instead of UTC) create aggregation nightmares. Always store timestamps in UTC and convert to local time only at the presentation layer.
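This rule takes only a few lines with the standard library – store UTC, convert at display time. The America/Chicago zone below is just an example:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Store in UTC: this is what goes into the database
reading_ts = datetime.now(timezone.utc)
stored = reading_ts.isoformat()           # e.g. '2024-...+00:00'

# Convert to local time only at the presentation layer
local = datetime.fromisoformat(stored).astimezone(ZoneInfo("America/Chicago"))
print(f"stored (UTC):    {stored}")
print(f"displayed local: {local.isoformat()}")
```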
3.12 Knowledge Checks
Test your understanding of IoT data storage concepts:
3.13 Summary and Key Takeaways
This chapter series equips you to design production-grade IoT data storage architectures. Here are the essential principles:
Database Selection:
- Use relational databases (PostgreSQL, MySQL) for device metadata, user accounts, and data requiring ACID transactions and complex joins
- Use time-series databases (TimescaleDB, InfluxDB) for sensor telemetry – they provide 10-100x faster queries than generic SQL for time-range operations
- Use NoSQL databases (MongoDB, Cassandra) for flexible schemas, high write throughput, and geographic distribution
- Adopt polyglot persistence – most IoT systems need at least 2-3 database types working together
Cost Optimization:
- Implement multi-tier storage (hot/warm/cold) for 80-95% cost reduction compared to all-hot storage
- Apply retention policies to automatically downsample or delete old data
- Use compression (10:1 or better is common for time-series data) to reduce storage footprint
Distributed Systems:
- Apply CAP theorem to choose consistency vs availability for each data type individually
- Use CP databases for safety-critical data (firmware versions, billing) and AP databases for telemetry
- Design hybrid sharding (device + time) for balanced write/query performance at scale
Data Quality:
- Validate data at ingestion with range checks, rate-of-change limits, and completeness checks
- Store timestamps in UTC always – convert to local time only at the presentation layer
- Monitor data quality continuously with automated dashboards and alerts
3.14 What’s Next
After completing the data storage series, explore these related topics based on your needs:
| If you want to… | Read this next |
|---|---|
| Clean and validate data before storing it | Data Quality and Preprocessing |
| Build real-time data pipelines that feed into storage | Stream Processing |
| Process and filter data at the edge before sending to databases | Edge Compute Patterns |
| Learn about distributed systems trade-offs in depth | CAP Theorem and Database Categories |
| Scale your storage to handle millions of devices | Sharding Strategies |
3.15 Concept Relationships
Data Storage and Databases underpins all IoT data-driven systems:
- Foundation For: Analytics & ML requires stored training data and historical context
- Enables: Stream Processing writes real-time data to persistent storage
- Prerequisite: Edge/Fog/Cloud Architecture determines where databases are deployed
- Informs: Data Quality Monitoring validates data integrity at storage layer
- Connects To: Visualization queries stored data for real-time dashboards
The critical insight is polyglot persistence – using specialized databases for different data types (relational for metadata, time-series for telemetry, object storage for media) rather than forcing everything into one technology.
3.16 See Also
Related Chapters:
- Big Data Overview - Scaling storage beyond single-node databases
- CAP Theorem - Consistency vs availability trade-offs for distributed systems
- Sharding Strategies - Horizontal partitioning for massive data volumes
- Data Retention Policies - Automated lifecycle management
External Resources:
- Database Comparison: db-engines.com - Rankings and feature comparisons
- TimescaleDB Documentation: docs.timescale.com
- InfluxDB Best Practices: docs.influxdata.com/influxdb/
- MongoDB IoT Guide: mongodb.com/use-cases/internet-of-things