Selecting the right database for IoT depends on your data shape and access patterns: use relational databases for structured metadata with ACID needs, time-series databases for timestamped sensor readings, document stores for flexible schemas, and key-value stores for fast lookups. Most IoT systems need polyglot persistence, combining multiple database types for different workloads.
Learning Objectives
After completing this chapter, you will be able to:
Apply a decision framework for selecting IoT database technologies
Evaluate trade-offs between SQL, NoSQL, and time-series databases for specific IoT workloads
Classify IoT data characteristics and map them to appropriate database types
Design hybrid polyglot persistence architectures for complex IoT systems
Calculate required write throughput and storage capacity for IoT deployments
MVU: Database Type Selection
Core Concept: Match database type to data shape–relational (SQL) for structured data with relationships and ACID needs, document stores for flexible schemas and horizontal scaling, time-series for timestamped sensor readings, and key-value for simple high-speed lookups. Why It Matters: Using a relational database for 100K sensor writes/second will fail; using a key-value store for complex joins will require application-level gymnastics – the right fit can eliminate hundreds of lines of application-level data transformation code. Key Takeaway: Most IoT systems use polyglot persistence: a relational database for device registry and user accounts, a time-series database for sensor telemetry, a document store for device configurations and event logs, and a cache for latest readings.
4.1 Prerequisites
Before diving into this chapter, you should be familiar with:
Edge, Fog, and Cloud Overview: Understanding the three-tier architecture helps determine where data should be stored
Networking Basics: Knowledge of network protocols and data transmission is essential for understanding how databases synchronize
Basic programming concepts: Familiarity with data structures, queries, and basic database operations (CRUD)
Common Pitfalls
1. Using a relational database for all IoT data
Inserting 10,000 sensor readings per second into a PostgreSQL table with full ACID transactions causes write amplification from index maintenance, WAL flushing, and lock contention. Time-series databases like TimescaleDB and InfluxDB are designed for this workload and achieve 10-100x higher ingest rates through batch compression and minimal indexing.
Key Concepts
Relational Database: A structured database using tables with fixed schemas and SQL queries, best for IoT device registries and configuration data where relationships and ACID transactions matter
Document Store: A NoSQL database storing flexible JSON/BSON documents, suited for IoT event logs and device shadow states where schema varies across device types
Time-Series Database: A database purpose-built for timestamped sequential data with automatic chunking, compression, and retention – the optimal choice for sensor telemetry at scale
Key-Value Store: An ultra-fast in-memory or disk-based store mapping keys to values, used for IoT device session state, real-time counters, and pub/sub message brokering
Wide-Column Store: A NoSQL database organizing data in column families, enabling efficient reads of specific attributes across millions of devices – used by Cassandra for high-write IoT workloads
Schema-on-Read: A NoSQL approach that defers schema enforcement to query time, allowing flexible ingestion of heterogeneous IoT device data before structure is fully known
ACID Transactions: Atomicity, Consistency, Isolation, Durability – the four properties guaranteeing reliable database operations, required for billing and configuration changes but often sacrificed in time-series databases for write speed
Query Pattern Analysis: The process of identifying the most frequent read and write access patterns before selecting a database, because different databases optimize for fundamentally different access shapes
2. Selecting a database before analyzing query patterns
Teams often choose ‘the database they know’ before understanding how data will be queried. A fleet management system that needs per-device 30-day history queries requires a completely different schema design than one that primarily needs cross-fleet hourly aggregations. Analyze top-5 query patterns first; let them drive the schema and database choice.
3. Over-engineering with multiple databases from day one
Introducing five database technologies simultaneously multiplies operational complexity. Start with one or two databases that cover 80% of use cases. Add specialized stores only when a clear performance bottleneck or capability gap emerges. Premature polyglot persistence causes maintenance burden that outweighs any theoretical performance benefit.
4.2 What is a Database? (For Beginners)
Analogy: Think of a database as a super-organized filing cabinet for your IoT data.
Without a database, your IoT data is scattered across flat files, CSV logs, and ad-hoc scripts. Searching means scanning entire files sequentially. Multiple applications cannot safely write simultaneously. A power failure mid-write can corrupt everything. There is no standardized way to query, filter, or aggregate.
With a database, your data is indexed for fast retrieval. Concurrent applications read and write safely through transaction isolation. Built-in integrity constraints prevent invalid data. Standard query languages (SQL, InfluxQL) let you ask complex questions in a single statement.
4.2.1 Why Do IoT Systems Need Databases?
Your smart home generates data constantly:
Device
Data Generated
Per Day
Smart thermostat
Temperature every 5 min
288 readings
Motion sensor
Events when triggered
~50 events
Smart meter
Energy usage every 15 min
96 readings
Security camera
Video clips on motion
~2 GB
1 year x 10 devices = over half a million data points (not counting video). With a realistic mix – say 4 thermostats (288/day), 3 motion sensors (50/day), and 3 smart meters (96/day) – that is roughly (1,152 + 150 + 288) x 365 = 580,000 readings per year. You need a database to store, organize, and query this data efficiently.
Interactive: Database Selection Tool
Interactive: Database Detective Game
4.3 Database Selection Decision Tree
Selecting the right database technology for IoT data is crucial for performance, scalability, and cost-effectiveness. This framework guides you through key decision points based on your data characteristics and access patterns.
4.3.1 How It Works: Database Selection Process
Database selection for IoT follows a systematic evaluation of data characteristics and access patterns:
Step 1: Identify Data Type
Time-stamped sensor readings (temperature every 5 min) → Time-series database
Outcome: Most IoT systems require polyglot persistence—TimescaleDB for telemetry + PostgreSQL for metadata + Redis for caching + S3 for media.
Figure 4.1: Database Selection Decision Tree: Choose database technology based on data type, write frequency, and consistency requirements to optimize performance and cost.
4.4 Decision Criteria Explained
The five steps above give you a quick path to a database recommendation. The sections below explore each criterion in depth, with guidelines on where the boundaries lie and why.
4.4.1 1. Primary Data Type
Your data structure fundamentally determines database choice:
High velocity (>1000 writes/sec): Requires write-optimized storage. InfluxDB excels with in-memory buffering and batch writes. Example: Factory with 1000 sensors at 1Hz = 1000 writes/sec.
Moderate (100-1000 writes/sec): TimescaleDB provides good balance of write performance and SQL queryability. Example: Building with 200 sensors at 5Hz = 1000 writes/sec.
Low velocity (<100 writes/sec): Any time-series DB works. Even standard PostgreSQL with partitioning is viable. Example: Home with 20 sensors at 1/min (~0.3 writes/sec).
4.4.3 3. Data Retention and Tiering
Storage duration determines architecture:
Hot storage (<30 days): Keep in high-performance database for real-time queries. Fast access required. Examples: Real-time dashboards, alerting.
Warm storage (30-365 days): Balance performance and cost. Use compression and continuous aggregates. Examples: Monthly reports, trend analysis.
Implementation: TimescaleDB supports automatic tiering with continuous aggregates and data retention policies. Raw data deleted after 30 days, but hourly aggregates retained for years.
4.4.4 4. Consistency Requirements
Transaction guarantees impact database choice:
ACID critical: Financial transactions, inventory updates, user accounts. Requires relational database (PostgreSQL, MySQL) for strong consistency and transactions.
Eventually consistent acceptable: Sensor readings, logs, analytics. NoSQL options viable (MongoDB, Cassandra) for higher scalability and availability.
4.4.5 5. Query Patterns
How you access data determines optimal database:
Time-range queries (last hour, last week): Time-series databases optimize for this with automatic indexing by time.
Complex aggregations (avg, percentiles, correlations): TimescaleDB continuous aggregates or InfluxDB queries (SQL in v3, Flux in v2).
Full-text search (log analysis): Elasticsearch or MongoDB with text indexes.
Graph traversals (find all connected devices): Neo4j or JanusGraph with native graph storage.
4.4.6 Worked Example: Polyglot Database Architecture for Smart Building
Scenario: A 50-floor commercial building with 12,000 sensors (temperature, humidity, occupancy, energy meters, access control, elevator status).
Data streams and their database assignments:
Data Type
Volume
Access Pattern
Database
Monthly Cost
Sensor telemetry (temp, humidity, energy)
10,000 sensors x 1/min = 14.4M writes/day
Time-range queries, downsampling
TimescaleDB (3-node cluster)
$450
Device registry (12,000 devices)
~100 updates/day
CRUD + relationships
PostgreSQL (single node)
$80
Access control events (with badge photos)
~50,000 events/day (~10 KB each incl. thumbnail)
Time-range + full-text search
Elasticsearch
$200
Real-time dashboard cache
Latest reading per sensor
Key-value lookups by device ID
Redis (single node, 2GB)
$30
Elevator maintenance logs
~200 documents/day
Flexible schema, nested JSON
MongoDB (single node)
$60
Building network topology
12,000 nodes, 25,000 edges
Graph traversal (find affected devices)
Neo4j (single node)
$120
Total storage architecture cost: $940/month for a building generating $2.1M in annual energy costs. The database layer is a critical enabler of occupancy-based HVAC optimization that can save 15-22% on energy costs ($315K-$462K/year). At $940/month for database infrastructure, the storage cost is a small fraction of the overall system value.
Storage sizing over 1 year:
Database
Hot (30 days)
Warm (1 year)
Cold (archive)
Total
TimescaleDB
43 GB (raw)
15 GB (hourly aggregates)
Parquet on S3
58 GB active
PostgreSQL
500 MB
N/A
N/A
500 MB
Elasticsearch
15 GB
50 GB (compressed)
Snapshots on S3
65 GB
Redis
50 MB (volatile)
N/A
N/A
50 MB
Key insight: No single database handles all these workloads well. PostgreSQL could technically store everything, but it would require 4x the hardware and still underperform on time-range queries by 10-50x compared to TimescaleDB.
4.5 Tradeoff: SQL vs NoSQL Databases for IoT
SQL (Relational) vs NoSQL Databases for IoT
Decision context: When choosing a database architecture for your IoT data storage
Factor
SQL (Relational)
NoSQL (Document/Key-Value)
Latency
Medium (joins overhead)
Low (direct lookups)
Cost
Moderate (licensing, vertical scaling)
Variable (horizontal scaling)
Complexity
Higher (schema design, migrations)
Lower (flexible schema)
Scalability
Vertical (limited)
Horizontal (distributed)
Data integrity
Strong (ACID, foreign keys)
Eventual (varies by DB)
Query flexibility
High (SQL joins, aggregations)
Limited (document queries)
Choose SQL when:
Device metadata with relationships (users own devices, devices have sensors)
ACID compliance required (billing, inventory, user accounts)
Complex reporting with joins across multiple tables
Default recommendation: SQL for device registry and user data; NoSQL for event logs and configurations unless write volume exceeds SQL capacity.
4.6 Tradeoff: Time-Series vs General-Purpose Database
Time-Series Database vs General-Purpose Database
Decision context: When storing IoT sensor readings and telemetry data
Factor
Time-Series DB (InfluxDB, TimescaleDB)
General DB (PostgreSQL, MongoDB)
Latency
Optimized for time-range queries
Slower for time-based aggregations
Cost
Lower storage (compression)
Higher storage (no time optimization)
Complexity
Purpose-built (simpler for time data)
Requires manual optimization
Scalability
Built-in retention policies
Manual data lifecycle management
Write throughput
Very high (100K+ writes/sec)
Moderate (10K writes/sec)
Query patterns
Time-range, downsampling native
Time queries require indexes
Choose Time-Series DB when:
Primary access pattern is time-range queries (last hour, last week)
High-frequency sensor data (>100 readings/minute per device)
Automatic data retention and downsampling needed
Storage cost optimization is critical (compression)
Choose General-Purpose DB when:
Time-series is only a small part of the data model
Complex joins between sensor data and metadata needed
Team expertise is in traditional databases
Write volume is low (<1,000 writes/sec total)
Default recommendation: Time-series DB for sensor telemetry when >1,000 devices or >10 readings/minute; general DB acceptable for small deployments with simple queries.
4.7 Tradeoff: Embedded vs Client-Server Database
Embedded vs Client-Server Database
Decision context: When choosing database deployment architecture for IoT edge devices or gateways
Factor
Embedded Database (SQLite, LevelDB)
Client-Server Database (PostgreSQL, MySQL)
Deployment complexity
Simple - single file, no daemon
Complex - requires server process, networking
Resource usage
Minimal (1-16 MB RAM with default caches)
Higher (500 MB+ RAM for server)
Concurrent access
Limited (file locking)
High (connection pooling, MVCC)
Network dependency
None - runs locally
Requires network connectivity
Scalability
Single device only
Multi-client, replication, clustering
Backup/Recovery
File copy
Point-in-time recovery, streaming replication
Query capability
Full SQL (SQLite) or limited (K-V)
Full SQL with advanced features
Choose Embedded Database when:
Running on constrained edge devices (gateways, Raspberry Pi, microcontrollers)
Local data buffering during network outages is required
Single application access (no concurrent clients)
Deployment simplicity is critical (no DBA required)
Device operates in disconnected or intermittent connectivity scenarios
Choose Client-Server Database when:
Multiple applications or services access the same data
High concurrency is required (>10 simultaneous connections)
Advanced features needed (stored procedures, triggers, complex transactions)
Data must be shared across network (centralized IoT data store)
Professional database administration and monitoring is available
Default recommendation: Embedded databases (SQLite, LevelDB, RocksDB) for edge devices and local buffering; client-server databases for fog/cloud tiers and shared data access. Many IoT architectures use embedded at the edge syncing to client-server in the cloud.
4.8 Real-World Application Examples
4.8.1 Example 1: Smart Factory Predictive Maintenance
Data: Vibration, temperature, current sensors from 500 machines
Write rate: 5000 samples/sec (500 machines x 10 Hz)
Retention: 90 days hot, 2 years warm, 7 years cold
Consistency: Strong consistency for alerts, eventual for analytics
Database Choice: PostgreSQL with TimescaleDB + Redis - TimescaleDB hypertables for vitals time-series - PostgreSQL for patient records (ACID for compliance) - Redis for latest readings cache (real-time dashboard) - HIPAA-compliant encryption at rest and in transit
4.8.4 Example 4: Agricultural IoT Sensor Network
Data: Soil moisture, temperature, rainfall from 5000 sensors
Write rate: ~8 writes/sec (5000 sensors reporting once every 10 min = 5000/600)
Database Choice: TimescaleDB with PostGIS - Low write rate allows simple deployment - PostGIS extension for geospatial queries - Continuous aggregates for daily/weekly summaries - Automated data retention (1 year) - Integration with weather APIs via PostgreSQL foreign data wrappers
4.9 Tradeoff: Local vs Cloud Storage
Local vs Cloud Storage
Decision context: When designing IoT data architecture, you must decide where to store data - locally on edge devices/gateways or in cloud databases.
Factor
Local Storage
Cloud Storage
Power consumption
Low (no transmission)
High (radio active for uploads)
Latency
Sub-millisecond
50-500ms (network dependent)
Complexity
Lower (no cloud setup)
Higher (auth, APIs, sync)
Cost
Fixed (hardware only)
Variable (scales with data volume)
Reliability
Works offline
Requires connectivity
Scalability
Limited by device
Virtually unlimited
Data durability
Risk of device loss
Redundant, backed up
Choose Local when:
Real-time response is critical (< 10ms decisions)
Connectivity is unreliable or expensive (rural, industrial)
Privacy regulations require data to stay on-premises
Most data is processed and discarded (only anomalies uploaded)
Choose Cloud when:
Long-term retention and analytics are required
Data must be accessed from multiple locations
Machine learning training requires large historical datasets
Compliance requires centralized audit logs
Default recommendation: Hybrid approach - store raw data locally with time-based retention (24-72 hours), aggregate/compress before cloud upload, and send anomalies immediately to cloud.
Operational complexity: Low - one system, one backup strategy
Data consistency: Strong ACID guarantees across all data
Team expertise: Deep PostgreSQL skills sufficient
Decision Factors:
Choose Polyglot when: Write volume exceeds 200K/sec, distinct query patterns require specialized indexes, budget allows for operational complexity, team has multi-database expertise
Choose Single Database when: Total write volume under 100K/sec, data model fits relational + time-series (TimescaleDB), operational simplicity is paramount, small team without dedicated DBA
Real-world example: 10K sensors at 1Hz = 10K writes/sec - easily handled by single TimescaleDB instance. 100K sensors at 10Hz = 1M writes/sec - requires polyglot with InfluxDB for metrics, PostgreSQL for metadata
4.11 Choosing the Right IoT Database: Product Comparison
Once you have decided on a database type (time-series, relational, document, key-value, or graph), you need to choose a specific product. The table below maps common IoT requirements to concrete database products:
If your project needs…
Choose…
Because…
High-speed time-series writes with native TSDB engine
Zero-config, single file, runs on Raspberry Pi (or even ESP32)
Sub-millisecond latest-value lookups
Redis
In-memory, pub/sub, perfect for dashboard “current reading”
Graph relationships (device topology)
Neo4j
Cypher query language, network path analysis
Managed cloud with auto-scaling
Amazon Timestream / Azure Data Explorer
Serverless, pay-per-query, integrated with cloud IoT services
Quick Decision Flowchart:
Is this time-stamped sensor telemetry (>90% of queries are time-range)? Yes –> InfluxDB (simple) or TimescaleDB (need SQL JOINs)
Is this device metadata (registry, users, configs) needing ACID? Yes –> PostgreSQL
Is this on an edge device with <1 GB storage? Yes –> SQLite
Need real-time cache for dashboards? Yes –> Redis (complement, not replace)
Default: TimescaleDB for most IoT startups (SQL + time-series in one)
Key Takeaway
There is no single “best database” for IoT – the right choice depends on your data type (time-series vs. structured vs. flexible), write volume, query patterns, and consistency needs. Most production IoT systems use polyglot persistence: a time-series database for sensor telemetry, a relational database for device metadata, and a cache for real-time dashboards.
For Kids: Meet the Sensor Squad!
Choosing a database is like choosing the right container for different things you want to store!
4.11.1 The Sensor Squad Adventure: The Great Storage Challenge
The Sensor Squad had a big problem. They were collecting SO much information from sensors all over the city that their one big filing cabinet was overflowing!
“We need better storage!” declared Max the Microcontroller. “But what kind?”
Sammy the Sensor had temperature readings arriving EVERY SECOND. “I need something super fast that’s organized by TIME, like a diary with a page for every second!”
Lila the LED was keeping track of which sensors belonged to which buildings. “I need something organized like a family tree, showing how everything connects!”
Bella the Battery was saving device settings that looked different for every type of sensor. “My data doesn’t fit in neat rows and columns – each sensor has different settings!”
The Squad visited the Database Store and found four perfect containers:
Time-Series Database (for Sammy): “Like a super-fast diary that automatically organizes everything by when it happened!” Perfect for temperature readings arriving every second.
Relational Database (for Lila): “Like a perfectly organized spreadsheet where everything connects!” Perfect for knowing which sensor belongs to which building.
Document Database (for Bella): “Like a flexible folder where each page can look completely different!” Perfect for storing different settings for different sensors.
Key-Value Cache (for quick lookups): “Like a speed-reader who remembers the LATEST reading instantly!” Perfect for showing the current temperature on a dashboard.
“Wait,” said Max, “we’re using FOUR different databases?”
“Of course!” said Sammy. “Just like you use a lunchbox for food, a backpack for books, and a water bottle for drinks. The RIGHT container for each thing!”
4.11.2 Key Words for Kids
Word
What It Means
Database
An organized place to store information – like a super-smart filing cabinet
Time-Series
Data organized by time – like a diary that writes a new entry every second
Relational
Data organized in connected tables – like a family tree showing how things relate
Cache
A super-fast memory for the most recent data – like remembering the last thing someone told you
4.12 Try It Yourself
Hands-On Challenge: Choose databases for a smart agriculture system
Scenario: Farm with 500 soil sensors (moisture, temperature, pH) reporting every 10 minutes
Task: Apply the selection framework to choose databases for these data types:
Sensor readings (500 sensors x 6 readings/hr x 24 hr = 72K readings/day):
Data type: Timestamped telemetry
Query pattern: Time-range, aggregations
Your choice: _________
Why: _________
Sensor metadata (500 sensors with location, calibration date, owner):
Write Throughput Formula: \(W = S \times F \times P\) where \(S\) = sensors, \(F\) = frequency (Hz), \(P\) = combined overhead multiplier (encoding overhead x indexing overhead, typically 1.2-2.0).