4  Database Selection Framework

In 60 Seconds

Selecting the right database for IoT depends on your data shape and access patterns: use relational databases for structured metadata with ACID needs, time-series databases for timestamped sensor readings, document stores for flexible schemas, and key-value stores for fast lookups. Most IoT systems need polyglot persistence, combining multiple database types for different workloads.

Learning Objectives

After completing this chapter, you will be able to:

  • Apply a decision framework for selecting IoT database technologies
  • Evaluate trade-offs between SQL, NoSQL, and time-series databases for specific IoT workloads
  • Classify IoT data characteristics and map them to appropriate database types
  • Design hybrid polyglot persistence architectures for complex IoT systems
  • Calculate required write throughput and storage capacity for IoT deployments
MVU: Database Type Selection

Core Concept: Match database type to data shape–relational (SQL) for structured data with relationships and ACID needs, document stores for flexible schemas and horizontal scaling, time-series for timestamped sensor readings, and key-value for simple high-speed lookups. Why It Matters: Using a relational database for 100K sensor writes/second will fail; using a key-value store for complex joins will require application-level gymnastics – the right fit can eliminate hundreds of lines of application-level data transformation code. Key Takeaway: Most IoT systems use polyglot persistence: a relational database for device registry and user accounts, a time-series database for sensor telemetry, a document store for device configurations and event logs, and a cache for latest readings.

4.1 Prerequisites

Before diving into this chapter, you should be familiar with:

  • Edge, Fog, and Cloud Overview: Understanding the three-tier architecture helps determine where data should be stored
  • Networking Basics: Knowledge of network protocols and data transmission is essential for understanding how databases synchronize
  • Basic programming concepts: Familiarity with data structures, queries, and basic database operations (CRUD)

Common Pitfalls

Inserting 10,000 sensor readings per second into a PostgreSQL table with full ACID transactions causes write amplification from index maintenance, WAL flushing, and lock contention. Time-series databases like TimescaleDB and InfluxDB are designed for this workload and achieve 10-100x higher ingest rates through batch compression and minimal indexing.

Key Concepts
  • Relational Database: A structured database using tables with fixed schemas and SQL queries, best for IoT device registries and configuration data where relationships and ACID transactions matter
  • Document Store: A NoSQL database storing flexible JSON/BSON documents, suited for IoT event logs and device shadow states where schema varies across device types
  • Time-Series Database: A database purpose-built for timestamped sequential data with automatic chunking, compression, and retention – the optimal choice for sensor telemetry at scale
  • Key-Value Store: An ultra-fast in-memory or disk-based store mapping keys to values, used for IoT device session state, real-time counters, and pub/sub message brokering
  • Wide-Column Store: A NoSQL database organizing data in column families, enabling efficient reads of specific attributes across millions of devices – used by Cassandra for high-write IoT workloads
  • Schema-on-Read: A NoSQL approach that defers schema enforcement to query time, allowing flexible ingestion of heterogeneous IoT device data before structure is fully known
  • ACID Transactions: Atomicity, Consistency, Isolation, Durability – the four properties guaranteeing reliable database operations, required for billing and configuration changes but often sacrificed in time-series databases for write speed
  • Query Pattern Analysis: The process of identifying the most frequent read and write access patterns before selecting a database, because different databases optimize for fundamentally different access shapes

Teams often choose ‘the database they know’ before understanding how data will be queried. A fleet management system that needs per-device 30-day history queries requires a completely different schema design than one that primarily needs cross-fleet hourly aggregations. Analyze top-5 query patterns first; let them drive the schema and database choice.

Introducing five database technologies simultaneously multiplies operational complexity. Start with one or two databases that cover 80% of use cases. Add specialized stores only when a clear performance bottleneck or capability gap emerges. Premature polyglot persistence causes maintenance burden that outweighs any theoretical performance benefit.

4.2 What is a Database? (For Beginners)

Analogy: Think of a database as a super-organized filing cabinet for your IoT data.

Without a database, your IoT data is scattered across flat files, CSV logs, and ad-hoc scripts. Searching means scanning entire files sequentially. Multiple applications cannot safely write simultaneously. A power failure mid-write can corrupt everything. There is no standardized way to query, filter, or aggregate.

With a database, your data is indexed for fast retrieval. Concurrent applications read and write safely through transaction isolation. Built-in integrity constraints prevent invalid data. Standard query languages (SQL, InfluxQL) let you ask complex questions in a single statement.

4.2.1 Why Do IoT Systems Need Databases?

Your smart home generates data constantly:

Device Data Generated Per Day
Smart thermostat Temperature every 5 min 288 readings
Motion sensor Events when triggered ~50 events
Smart meter Energy usage every 15 min 96 readings
Security camera Video clips on motion ~2 GB

1 year x 10 devices = over half a million data points (not counting video). With a realistic mix – say 4 thermostats (288/day), 3 motion sensors (50/day), and 3 smart meters (96/day) – that is roughly (1,152 + 150 + 288) x 365 = 580,000 readings per year. You need a database to store, organize, and query this data efficiently.

4.3 Database Selection Decision Tree

Selecting the right database technology for IoT data is crucial for performance, scalability, and cost-effectiveness. This framework guides you through key decision points based on your data characteristics and access patterns.

4.3.1 How It Works: Database Selection Process

Database selection for IoT follows a systematic evaluation of data characteristics and access patterns:

Step 1: Identify Data Type

  • Time-stamped sensor readings (temperature every 5 min) → Time-series database
  • Structured metadata with relationships (devices → buildings → owners) → Relational database
  • Flexible documents (device configs, event logs) → Document store
  • Simple key-value lookups (latest reading per sensor) → Key-value cache

Step 2: Evaluate Write Volume (thresholds indicate when write rate should influence your database type choice, not absolute capacity limits)

  • High velocity (>1000 writes/sec) → InfluxDB or Cassandra (optimized for writes)
  • Moderate (100-1000/sec) → TimescaleDB or MongoDB
  • Low (<100/sec) → Any database works; choose for query needs

Step 3: Assess Query Patterns

  • Time-range queries (“last 24 hours”) → Time-series database with automatic time indexing
  • Complex joins across tables → Relational database
  • Full-text search → Elasticsearch or MongoDB with text indexes
  • Graph traversals → Neo4j or graph database

Step 4: Consider Consistency Requirements

  • ACID critical (billing, control commands) → Relational database (PostgreSQL, MySQL)
  • Eventually consistent acceptable (sensor telemetry, logs) → NoSQL (Cassandra, DynamoDB)

Step 5: Plan for Scale

  • Vertical scaling only (limited budget) → PostgreSQL, MySQL
  • Horizontal scaling required (millions of devices) → Cassandra, MongoDB, TimescaleDB clustering

Outcome: Most IoT systems require polyglot persistence—TimescaleDB for telemetry + PostgreSQL for metadata + Redis for caching + S3 for media.

Flowchart decision tree showing database selection based on data type (time-series, structured, flexible, key-value), branching to recommended database technologies for each category
Figure 4.1: Database Selection Decision Tree: Choose database technology based on data type, write frequency, and consistency requirements to optimize performance and cost.

4.4 Decision Criteria Explained

The five steps above give you a quick path to a database recommendation. The sections below explore each criterion in depth, with guidelines on where the boundaries lie and why.

4.4.1 1. Primary Data Type

Your data structure fundamentally determines database choice:

  • Time-stamped sensor readings: Temperature, pressure, GPS coordinates with timestamps. Characteristics: Regular intervals, immutable, time-range queries. Choose time-series database.

  • Structured metadata: Device registry, user accounts, configuration. Characteristics: Fixed schema, relationships, ACID transactions. Choose relational database.

  • Semi-structured events: Device logs, error messages, telemetry. Characteristics: Flexible schema, nested data, varied formats. Choose document database.

  • Key-value state: Latest sensor reading, session data, cache. Characteristics: Simple lookups by ID, high performance. Choose key-value store.

  • Relationships/topology: Device network topology, dependency graphs. Characteristics: Complex relationships, traversal queries. Choose graph database.

4.4.2 2. Write Frequency and Scale

Ingestion rate impacts database selection:

  • High velocity (>1000 writes/sec): Requires write-optimized storage. InfluxDB excels with in-memory buffering and batch writes. Example: Factory with 1000 sensors at 1Hz = 1000 writes/sec.

  • Moderate (100-1000 writes/sec): TimescaleDB provides good balance of write performance and SQL queryability. Example: Building with 200 sensors at 5Hz = 1000 writes/sec.

  • Low velocity (<100 writes/sec): Any time-series DB works. Even standard PostgreSQL with partitioning is viable. Example: Home with 20 sensors at 1/min (~0.3 writes/sec).

4.4.3 3. Data Retention and Tiering

Storage duration determines architecture:

  • Hot storage (<30 days): Keep in high-performance database for real-time queries. Fast access required. Examples: Real-time dashboards, alerting.

  • Warm storage (30-365 days): Balance performance and cost. Use compression and continuous aggregates. Examples: Monthly reports, trend analysis.

  • Cold storage (>1 year): Archive to object storage (S3, MinIO). Infrequent access acceptable. Examples: Compliance records, historical analysis.

Implementation: TimescaleDB supports automatic tiering with continuous aggregates and data retention policies. Raw data deleted after 30 days, but hourly aggregates retained for years.

4.4.4 4. Consistency Requirements

Transaction guarantees impact database choice:

  • ACID critical: Financial transactions, inventory updates, user accounts. Requires relational database (PostgreSQL, MySQL) for strong consistency and transactions.

  • Eventually consistent acceptable: Sensor readings, logs, analytics. NoSQL options viable (MongoDB, Cassandra) for higher scalability and availability.

4.4.5 5. Query Patterns

How you access data determines optimal database:

  • Time-range queries (last hour, last week): Time-series databases optimize for this with automatic indexing by time.

  • Complex aggregations (avg, percentiles, correlations): TimescaleDB continuous aggregates or InfluxDB queries (SQL in v3, Flux in v2).

  • Full-text search (log analysis): Elasticsearch or MongoDB with text indexes.

  • Graph traversals (find all connected devices): Neo4j or JanusGraph with native graph storage.

4.4.6 Worked Example: Polyglot Database Architecture for Smart Building

Scenario: A 50-floor commercial building with 12,000 sensors (temperature, humidity, occupancy, energy meters, access control, elevator status).

Data streams and their database assignments:

Data Type Volume Access Pattern Database Monthly Cost
Sensor telemetry (temp, humidity, energy) 10,000 sensors x 1/min = 14.4M writes/day Time-range queries, downsampling TimescaleDB (3-node cluster) $450
Device registry (12,000 devices) ~100 updates/day CRUD + relationships PostgreSQL (single node) $80
Access control events (with badge photos) ~50,000 events/day (~10 KB each incl. thumbnail) Time-range + full-text search Elasticsearch $200
Real-time dashboard cache Latest reading per sensor Key-value lookups by device ID Redis (single node, 2GB) $30
Elevator maintenance logs ~200 documents/day Flexible schema, nested JSON MongoDB (single node) $60
Building network topology 12,000 nodes, 25,000 edges Graph traversal (find affected devices) Neo4j (single node) $120

Total storage architecture cost: $940/month for a building generating $2.1M in annual energy costs. The database layer is a critical enabler of occupancy-based HVAC optimization that can save 15-22% on energy costs ($315K-$462K/year). At $940/month for database infrastructure, the storage cost is a small fraction of the overall system value.

Storage sizing over 1 year:

Database Hot (30 days) Warm (1 year) Cold (archive) Total
TimescaleDB 43 GB (raw) 15 GB (hourly aggregates) Parquet on S3 58 GB active
PostgreSQL 500 MB N/A N/A 500 MB
Elasticsearch 15 GB 50 GB (compressed) Snapshots on S3 65 GB
Redis 50 MB (volatile) N/A N/A 50 MB

Key insight: No single database handles all these workloads well. PostgreSQL could technically store everything, but it would require 4x the hardware and still underperform on time-range queries by 10-50x compared to TimescaleDB.

4.5 Tradeoff: SQL vs NoSQL Databases for IoT

SQL (Relational) vs NoSQL Databases for IoT

Decision context: When choosing a database architecture for your IoT data storage

Factor SQL (Relational) NoSQL (Document/Key-Value)
Latency Medium (joins overhead) Low (direct lookups)
Cost Moderate (licensing, vertical scaling) Variable (horizontal scaling)
Complexity Higher (schema design, migrations) Lower (flexible schema)
Scalability Vertical (limited) Horizontal (distributed)
Data integrity Strong (ACID, foreign keys) Eventual (varies by DB)
Query flexibility High (SQL joins, aggregations) Limited (document queries)

Choose SQL when:

  • Device metadata with relationships (users own devices, devices have sensors)
  • ACID compliance required (billing, inventory, user accounts)
  • Complex reporting with joins across multiple tables
  • Data schema is well-defined and stable

Choose NoSQL when:

  • High write throughput needed (>10,000 writes/sec)
  • Schema evolves frequently (device configurations, event logs)
  • Horizontal scaling is critical for growth
  • Simple key-value or document lookups dominate

Default recommendation: SQL for device registry and user data; NoSQL for event logs and configurations unless write volume exceeds SQL capacity.

4.6 Tradeoff: Time-Series vs General-Purpose Database

Time-Series Database vs General-Purpose Database

Decision context: When storing IoT sensor readings and telemetry data

Factor Time-Series DB (InfluxDB, TimescaleDB) General DB (PostgreSQL, MongoDB)
Latency Optimized for time-range queries Slower for time-based aggregations
Cost Lower storage (compression) Higher storage (no time optimization)
Complexity Purpose-built (simpler for time data) Requires manual optimization
Scalability Built-in retention policies Manual data lifecycle management
Write throughput Very high (100K+ writes/sec) Moderate (10K writes/sec)
Query patterns Time-range, downsampling native Time queries require indexes

Choose Time-Series DB when:

  • Primary access pattern is time-range queries (last hour, last week)
  • High-frequency sensor data (>100 readings/minute per device)
  • Automatic data retention and downsampling needed
  • Storage cost optimization is critical (compression)

Choose General-Purpose DB when:

  • Time-series is only a small part of the data model
  • Complex joins between sensor data and metadata needed
  • Team expertise is in traditional databases
  • Write volume is low (<1,000 writes/sec total)

Default recommendation: Time-series DB for sensor telemetry when >1,000 devices or >10 readings/minute; general DB acceptable for small deployments with simple queries.

4.7 Tradeoff: Embedded vs Client-Server Database

Embedded vs Client-Server Database

Decision context: When choosing database deployment architecture for IoT edge devices or gateways

Factor Embedded Database (SQLite, LevelDB) Client-Server Database (PostgreSQL, MySQL)
Deployment complexity Simple - single file, no daemon Complex - requires server process, networking
Resource usage Minimal (1-16 MB RAM with default caches) Higher (500 MB+ RAM for server)
Concurrent access Limited (file locking) High (connection pooling, MVCC)
Network dependency None - runs locally Requires network connectivity
Scalability Single device only Multi-client, replication, clustering
Backup/Recovery File copy Point-in-time recovery, streaming replication
Query capability Full SQL (SQLite) or limited (K-V) Full SQL with advanced features

Choose Embedded Database when:

  • Running on constrained edge devices (gateways, Raspberry Pi, microcontrollers)
  • Local data buffering during network outages is required
  • Single application access (no concurrent clients)
  • Deployment simplicity is critical (no DBA required)
  • Device operates in disconnected or intermittent connectivity scenarios

Choose Client-Server Database when:

  • Multiple applications or services access the same data
  • High concurrency is required (>10 simultaneous connections)
  • Advanced features needed (stored procedures, triggers, complex transactions)
  • Data must be shared across network (centralized IoT data store)
  • Professional database administration and monitoring is available

Default recommendation: Embedded databases (SQLite, LevelDB, RocksDB) for edge devices and local buffering; client-server databases for fog/cloud tiers and shared data access. Many IoT architectures use embedded at the edge syncing to client-server in the cloud.

4.8 Real-World Application Examples

4.8.1 Example 1: Smart Factory Predictive Maintenance

  • Data: Vibration, temperature, current sensors from 500 machines
  • Write rate: 5000 samples/sec (500 machines x 10 Hz)
  • Retention: 90 days hot, 2 years warm, 7 years cold
  • Queries: Real-time anomaly detection, trend analysis
  • Consistency: Eventually consistent acceptable

Database Choice: TimescaleDB + S3 - Time-series hypertables for efficient writes - Continuous aggregates for hourly/daily summaries - Automated retention policies (90 days raw data) - Compressed older data to S3 for long-term storage - PostgreSQL compatibility enables complex joins with device metadata

4.8.2 Example 2: Smart City Traffic Monitoring

  • Data: Vehicle counts, speed, camera images from 10,000 intersections
  • Write rate: 20,000 updates/sec (10K intersections x 2 Hz)
  • Retention: 24 hours images, 30 days metrics, 10 years aggregates
  • Queries: Real-time traffic flow, pattern analysis, image search
  • Consistency: Eventually consistent

Database Architecture: Multi-database - InfluxDB: Real-time traffic metrics (high write rate) - MongoDB: Intersection metadata, configurations - S3 + Elasticsearch: Image storage with metadata indexing - Redis: Latest traffic status cache for API

4.8.3 Example 3: Healthcare Patient Monitoring

  • Data: Heart rate, blood pressure, activity from 1000 patients
  • Write rate: 100 samples/sec (1000 patients x 0.1 Hz)
  • Retention: 30 days detailed, 7 years summaries
  • Queries: Real-time alerts, patient history, trend analysis
  • Consistency: Strong consistency for alerts, eventual for analytics

Database Choice: PostgreSQL with TimescaleDB + Redis - TimescaleDB hypertables for vitals time-series - PostgreSQL for patient records (ACID for compliance) - Redis for latest readings cache (real-time dashboard) - HIPAA-compliant encryption at rest and in transit

4.8.4 Example 4: Agricultural IoT Sensor Network

  • Data: Soil moisture, temperature, rainfall from 5000 sensors
  • Write rate: ~8 writes/sec (5000 sensors reporting once every 10 min = 5000/600)
  • Retention: 1 year detailed, 5 years aggregates
  • Queries: Time-range queries, spatial analysis, alerts
  • Consistency: Eventually consistent

Database Choice: TimescaleDB with PostGIS - Low write rate allows simple deployment - PostGIS extension for geospatial queries - Continuous aggregates for daily/weekly summaries - Automated data retention (1 year) - Integration with weather APIs via PostgreSQL foreign data wrappers

4.9 Tradeoff: Local vs Cloud Storage

Local vs Cloud Storage

Decision context: When designing IoT data architecture, you must decide where to store data - locally on edge devices/gateways or in cloud databases.

Factor Local Storage Cloud Storage
Power consumption Low (no transmission) High (radio active for uploads)
Latency Sub-millisecond 50-500ms (network dependent)
Complexity Lower (no cloud setup) Higher (auth, APIs, sync)
Cost Fixed (hardware only) Variable (scales with data volume)
Reliability Works offline Requires connectivity
Scalability Limited by device Virtually unlimited
Data durability Risk of device loss Redundant, backed up

Choose Local when:

  • Real-time response is critical (< 10ms decisions)
  • Connectivity is unreliable or expensive (rural, industrial)
  • Privacy regulations require data to stay on-premises
  • Most data is processed and discarded (only anomalies uploaded)

Choose Cloud when:

  • Long-term retention and analytics are required
  • Data must be accessed from multiple locations
  • Machine learning training requires large historical datasets
  • Compliance requires centralized audit logs

Default recommendation: Hybrid approach - store raw data locally with time-based retention (24-72 hours), aggregate/compress before cloud upload, and send anomalies immediately to cloud.

4.10 Polyglot Persistence Architecture

Tradeoff: Polyglot Persistence vs Single Database

Option A: Polyglot Persistence (Multiple Specialized Databases)

  • Write throughput: Optimized per workload (500K+ sensor writes, 10K metadata updates)
  • Query performance: Each database optimized for its query patterns
  • Infrastructure cost: $500-2,000/month (multiple managed services)
  • Operational complexity: High - multiple systems to monitor, backup, secure
  • Data consistency: Eventual (requires application-level coordination)
  • Team expertise: Requires knowledge of 3-4 database technologies

Option B: Single Database (PostgreSQL/TimescaleDB)

  • Write throughput: 100K-200K writes/sec (adequate for many IoT deployments)
  • Query performance: Good for time-series, excellent for joins
  • Infrastructure cost: $300-800/month (single managed instance)
  • Operational complexity: Low - one system, one backup strategy
  • Data consistency: Strong ACID guarantees across all data
  • Team expertise: Deep PostgreSQL skills sufficient

Decision Factors:

  • Choose Polyglot when: Write volume exceeds 200K/sec, distinct query patterns require specialized indexes, budget allows for operational complexity, team has multi-database expertise
  • Choose Single Database when: Total write volume under 100K/sec, data model fits relational + time-series (TimescaleDB), operational simplicity is paramount, small team without dedicated DBA
  • Real-world example: 10K sensors at 1Hz = 10K writes/sec - easily handled by single TimescaleDB instance. 100K sensors at 10Hz = 1M writes/sec - requires polyglot with InfluxDB for metrics, PostgreSQL for metadata

4.11 Choosing the Right IoT Database: Product Comparison

Once you have decided on a database type (time-series, relational, document, key-value, or graph), you need to choose a specific product. The table below maps common IoT requirements to concrete database products:

If your project needs… Choose… Because…
High-speed time-series writes with native TSDB engine InfluxDB Purpose-built TSDB, 300K+ writes/sec, built-in retention policies (v3 uses SQL; v2 uses Flux/InfluxQL)
SQL queries on time-series data TimescaleDB PostgreSQL extension, full SQL, JOINs with metadata tables
Flexible schemas, nested device configs MongoDB Document model, horizontal sharding, change streams
Lightweight edge storage on MCU/SBC SQLite Zero-config, single file, runs on Raspberry Pi (or even ESP32)
Sub-millisecond latest-value lookups Redis In-memory, pub/sub, perfect for dashboard “current reading”
Graph relationships (device topology) Neo4j Cypher query language, network path analysis
Managed cloud with auto-scaling Amazon Timestream / Azure Data Explorer Serverless, pay-per-query, integrated with cloud IoT services

Quick Decision Flowchart:

  1. Is this time-stamped sensor telemetry (>90% of queries are time-range)? Yes –> InfluxDB (simple) or TimescaleDB (need SQL JOINs)
  2. Is this device metadata (registry, users, configs) needing ACID? Yes –> PostgreSQL
  3. Is this on an edge device with <1 GB storage? Yes –> SQLite
  4. Need real-time cache for dashboards? Yes –> Redis (complement, not replace)
  5. Default: TimescaleDB for most IoT startups (SQL + time-series in one)
Key Takeaway

There is no single “best database” for IoT – the right choice depends on your data type (time-series vs. structured vs. flexible), write volume, query patterns, and consistency needs. Most production IoT systems use polyglot persistence: a time-series database for sensor telemetry, a relational database for device metadata, and a cache for real-time dashboards.

Choosing a database is like choosing the right container for different things you want to store!

4.11.1 The Sensor Squad Adventure: The Great Storage Challenge

The Sensor Squad had a big problem. They were collecting SO much information from sensors all over the city that their one big filing cabinet was overflowing!

“We need better storage!” declared Max the Microcontroller. “But what kind?”

Sammy the Sensor had temperature readings arriving EVERY SECOND. “I need something super fast that’s organized by TIME, like a diary with a page for every second!”

Lila the LED was keeping track of which sensors belonged to which buildings. “I need something organized like a family tree, showing how everything connects!”

Bella the Battery was saving device settings that looked different for every type of sensor. “My data doesn’t fit in neat rows and columns – each sensor has different settings!”

The Squad visited the Database Store and found four perfect containers:

  • Time-Series Database (for Sammy): “Like a super-fast diary that automatically organizes everything by when it happened!” Perfect for temperature readings arriving every second.

  • Relational Database (for Lila): “Like a perfectly organized spreadsheet where everything connects!” Perfect for knowing which sensor belongs to which building.

  • Document Database (for Bella): “Like a flexible folder where each page can look completely different!” Perfect for storing different settings for different sensors.

  • Key-Value Cache (for quick lookups): “Like a speed-reader who remembers the LATEST reading instantly!” Perfect for showing the current temperature on a dashboard.

“Wait,” said Max, “we’re using FOUR different databases?”

“Of course!” said Sammy. “Just like you use a lunchbox for food, a backpack for books, and a water bottle for drinks. The RIGHT container for each thing!”

4.11.2 Key Words for Kids

Word What It Means
Database An organized place to store information – like a super-smart filing cabinet
Time-Series Data organized by time – like a diary that writes a new entry every second
Relational Data organized in connected tables – like a family tree showing how things relate
Cache A super-fast memory for the most recent data – like remembering the last thing someone told you

4.12 Try It Yourself

Hands-On Challenge: Choose databases for a smart agriculture system

Scenario: Farm with 500 soil sensors (moisture, temperature, pH) reporting every 10 minutes

Task: Apply the selection framework to choose databases for these data types:

  1. Sensor readings (500 sensors x 6 readings/hr x 24 hr = 72K readings/day):
    • Data type: Timestamped telemetry
    • Query pattern: Time-range, aggregations
    • Your choice: _________
    • Why: _________
  2. Sensor metadata (500 sensors with location, calibration date, owner):
    • Data type: Structured with relationships
    • Query pattern: Lookups, joins with readings
    • Your choice: _________
    • Why: _________
  3. Irrigation events (variable schema, 50-200 events/day):
    • Data type: Flexible documents (different irrigation systems have different fields)
    • Query pattern: Time-range, text search
    • Your choice: _________
    • Why: _________
  4. Latest reading cache (500 current values for dashboard):
    • Data type: Key-value (sensor_id → latest_value)
    • Query pattern: Sub-millisecond lookups
    • Your choice: _________
    • Why: _________

Answers:

  1. TimescaleDB (time-series optimized, SQL queries)
  2. PostgreSQL (relational, ACID, can join with TimescaleDB)
  3. MongoDB (flexible schema, text indexes)
  4. Redis (in-memory, <1ms access)

What to Observe:

  • Each data type maps to a different optimal database
  • Forcing everything into PostgreSQL would work but underperform
  • Polyglot persistence is standard for production IoT

4.13 Summary

  • Database selection framework guides choosing technologies based on data type, write frequency, retention requirements, and query patterns
  • SQL databases excel for structured metadata with relationships and ACID requirements
  • NoSQL databases shine for flexible schemas, high write throughput, and horizontal scaling
  • Time-series databases are optimal for sensor telemetry with time-range queries
  • Hybrid architectures combine specialized databases for optimal performance across different workloads
  • Local vs cloud storage trade-offs depend on latency, connectivity, and retention requirements

Database write throughput determines feasibility for high-velocity IoT sensor data. Calculate required capacity:

Write Throughput Formula: \(W = S \times F \times P\) where \(S\) = sensors, \(F\) = frequency (Hz), \(P\) = combined overhead multiplier (encoding overhead x indexing overhead, typically 1.2-2.0).

Worked example (industrial monitoring system): - 500 machines × 10 sensors each = 5,000 sensors - Sampling at 1 Hz = 5,000 writes/sec raw - Combined overhead: JSON encoding (1.5x) × indexing (1.2x) → \(P = 1.5 \times 1.2 = 1.8\) - Required throughput: \(5000 \times 1 \times 1.8 = 9,000\) writes/sec

Database capacity: PostgreSQL (~10K writes/sec on tuned hardware), TimescaleDB (~50K writes/sec, comfortable headroom), InfluxDB (~300K writes/sec, over-provisioned). At 9,000 writes/sec, PostgreSQL technically works but leaves minimal headroom for traffic spikes. TimescaleDB is the safer choice, offering 5x headroom plus native time-series optimizations (compression, retention policies, continuous aggregates).

4.13.1 Interactive: Write Throughput Calculator

Use this calculator to estimate the required database write throughput for your IoT deployment and see which databases can handle the load.

4.13.2 Interactive: Storage Sizing Estimator

Estimate how much storage your IoT deployment requires across hot, warm, and cold tiers.

4.14 See Also

Related Chapters:

External Resources:

4.15 What’s Next

Now that you can select the right database type for different IoT workloads, explore these related topics to deepen your data storage knowledge:

Topic Chapter What You’ll Learn
Distributed trade-offs CAP Theorem and Database Categories Evaluate consistency, availability, and partition tolerance trade-offs when choosing between CP and AP database configurations
Time-series deep dive Time-Series Databases Configure InfluxDB and TimescaleDB for sensor telemetry, including retention policies, continuous aggregates, and compression
Horizontal scaling Sharding Strategies Design sharding schemes (time-based, device-based, geographic) when a single database node cannot handle your write volume
Data quality Data Quality Monitoring Implement validation pipelines to detect missing readings, outliers, and schema drift across polyglot database architectures
Complete architectures Worked Examples Examine end-to-end database architecture case studies for smart city, industrial, and healthcare IoT deployments