Real-world IoT database design requires matching storage architecture to specific workloads. A fleet management system with geospatial queries benefits from TimescaleDB with PostGIS, while a smart city data lake needs tiered storage (hot/warm/cold) to reduce costs by over 99.9%. The key is quantifying your data volume, query patterns, and retention needs before selecting technologies.
Learning Objectives
After completing this chapter, you will be able to:
Apply database selection frameworks to real-world IoT scenarios
Design complete storage architectures with appropriate tiering
Calculate storage requirements and costs for large-scale deployments
Implement schema designs optimized for specific query patterns
For Beginners: Data Storage Design Examples
These worked examples walk you through the process of designing data storage for real IoT scenarios, step by step. Think of them as cooking demonstrations where an experienced chef explains every decision while preparing a complex dish. By following along, you learn not just what to do, but why, building the judgment you need for your own projects.
8.1 Worked Example: Fleet Management System
8.2 Choosing a Database for 10,000-Vehicle Fleet Tracking
Scenario: A logistics company is building a fleet management system for 10,000 delivery vehicles. Each vehicle reports GPS location, speed, fuel level, engine diagnostics, and driver behavior every 10 seconds. The system must support real-time vehicle tracking on a map, historical route analysis, geofencing alerts, and monthly compliance reports. The data must be retained for 3 years for regulatory compliance.
Goal: Select and design the optimal database architecture, comparing InfluxDB, TimescaleDB, and MongoDB for this use case, considering query patterns, data volumes, and operational costs.
What we do: Calculate data ingestion rate, storage requirements, and classify query patterns.
Why: Database selection depends heavily on data characteristics and access patterns.
Data ingestion analysis:
| Metric | Value | Calculation |
|---|---|---|
| Vehicles | 10,000 | Fixed fleet size |
| Report interval | 10 seconds | GPS + telemetry |
| Messages per second | 1,000 | 10,000 / 10 |
| Payload size (avg) | 250 bytes | GPS + 8 sensor values + metadata |
| Bytes per second | 250 KB/s | 1,000 x 250 bytes |
| Daily volume (raw) | 21.6 GB | 250 KB/s x 86,400 |
| Monthly volume | 648 GB | 21.6 GB x 30 |
| 3-year retention | 23.3 TB | 648 GB x 36 |
Query pattern classification:
| Query Type | Frequency | Latency Requirement | Data Range |
|---|---|---|---|
| Real-time map update | 100/sec | < 100ms | Last reading per vehicle |
| Vehicle history (24h) | 50/day | < 2s | 8,640 points per vehicle |
| Geofence check | 1,000/sec | < 50ms | Current position vs. polygons |
| Monthly fuel report | 10/month | < 30s | 30 days, all vehicles |
| Compliance audit | 5/year | < 5min | 1 year, full resolution |
Key insight: This workload combines time-series telemetry (95% of writes), real-time point queries (current state), and analytical aggregations (reports). A hybrid approach may be optimal.
Try It: IoT Fleet Data Volume Calculator
Adjust the parameters below to see how fleet size, reporting interval, and payload size affect data volumes and storage requirements.
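The calculator's core arithmetic can be sketched in a few lines of Python (function and parameter names are illustrative, not part of the chapter's tooling):

```python
def fleet_data_volume(vehicles, interval_s, payload_bytes, retention_years=3):
    """Estimate ingest rate and storage volumes for a vehicle fleet."""
    msgs_per_sec = vehicles / interval_s           # each vehicle reports once per interval
    bytes_per_sec = msgs_per_sec * payload_bytes
    daily_gb = bytes_per_sec * 86_400 / 1e9        # 86,400 seconds per day
    monthly_gb = daily_gb * 30
    retention_tb = monthly_gb * retention_years * 12 / 1_000
    return msgs_per_sec, daily_gb, monthly_gb, retention_tb

# The chapter's scenario: 10,000 vehicles, 10 s interval, 250-byte payload
msgs, daily, monthly, total = fleet_data_volume(10_000, 10, 250)
print(msgs, daily, monthly, total)  # 1000 msg/s, 21.6 GB/day, 648 GB/month, ~23.3 TB over 3 years
```

Doubling the fleet or halving the report interval both double every figure, which is why quantifying these parameters comes before any technology choice.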
What we do: Compare InfluxDB, TimescaleDB, and MongoDB against our requirements.
Why: Each database has strengths for different aspects of fleet management.
InfluxDB (Purpose-built time-series):
| Aspect | Rating | Notes |
|---|---|---|
| Write throughput | Excellent | Built for high-frequency metrics |
| Time-range queries | Excellent | Native time-series optimizations |
| Point queries (latest) | Good | Requires careful schema design |
| Geospatial queries | Poor | No native geo support |
| Compression | Excellent | 10-20x for numeric data |
| SQL compatibility | Limited | Flux (v2) or SQL (v3); ecosystem less mature than PostgreSQL |
TimescaleDB (PostgreSQL extension):
| Aspect | Rating | Notes |
|---|---|---|
| Write throughput | Very Good | 100K+ rows/sec with hypertables |
| Time-range queries | Excellent | Automatic time partitioning |
| Point queries (latest) | Good | Continuous aggregates help |
| Geospatial queries | Excellent | PostGIS integration |
| Compression | Very Good | 90-95% for time-series |
| SQL compatibility | Full | Standard PostgreSQL SQL |
MongoDB (Document database):
| Aspect | Rating | Notes |
|---|---|---|
| Write throughput | Very Good | Sharding scales horizontally |
| Time-range queries | Good | Requires proper indexing |
| Point queries (latest) | Excellent | Flexible document model |
| Geospatial queries | Very Good | Native geospatial indexes |
| Compression | Good | WiredTiger compression |
| SQL compatibility | None | MongoDB Query Language |
What we do: Design schemas for the recommended database (TimescaleDB) with alternatives.
Why: Schema design significantly impacts query performance and storage efficiency.
Recommendation: TimescaleDB - Best balance of time-series performance, geospatial capabilities, and SQL familiarity.
TimescaleDB schema design:
```sql
-- Hypertable: time-partitioned telemetry
CREATE TABLE vehicle_telemetry (
    time        TIMESTAMPTZ NOT NULL,
    vehicle_id  VARCHAR(20) NOT NULL,
    location    GEOGRAPHY(POINT, 4326),  -- PostGIS
    speed_kmh   SMALLINT,
    heading     SMALLINT,
    fuel_pct    SMALLINT,
    engine_rpm  SMALLINT,
    engine_temp SMALLINT,
    odometer_km INTEGER,
    driver_id   VARCHAR(20)
);
SELECT create_hypertable('vehicle_telemetry', 'time',
                         chunk_time_interval => INTERVAL '1 day');

-- Compress after 7 days, segment by vehicle for efficient queries
ALTER TABLE vehicle_telemetry SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'vehicle_id'
);
SELECT add_compression_policy('vehicle_telemetry', INTERVAL '7 days');

CREATE INDEX ON vehicle_telemetry (vehicle_id, time DESC);
CREATE INDEX ON vehicle_telemetry USING GIST (location);

-- Near-real-time position view (refreshed every 10s)
CREATE MATERIALIZED VIEW vehicle_latest
WITH (timescaledb.continuous) AS
SELECT
    time_bucket('10 seconds', time) AS bucket,
    vehicle_id,
    last(time, time)      AS last_seen,
    last(location, time)  AS location,
    last(speed_kmh, time) AS speed_kmh
FROM vehicle_telemetry
GROUP BY bucket, vehicle_id;
```
Storage tier design (3-year retention with cost optimization):
| Tier | Age | Resolution | Storage | Monthly Cost |
|---|---|---|---|---|
| Hot | 0-7 days | 10-second | SSD | $50/TB |
| Warm | 7-90 days | 10-second (compressed) | HDD | $10/TB |
| Cold | 90 days - 3 years | 1-minute aggregates | Object storage | $2/TB |
Storage cost calculation:
| Tier | Raw Volume | Stored Volume | Storage Cost | Notes |
|---|---|---|---|---|
| Hot (7 days) | 151 GB | 151 GB | $7.55/month | Uncompressed, fast SSD |
| Warm (83 days) | 1.79 TB | 179 GB | $1.79/month | 90% compression |
| Cold (2.75 years) | 3.62 TB | 362 GB | $0.72/month | Downsampled 6x + 90% compression |
| Total | ~5.6 TB raw | ~692 GB stored | ~$10.06/month | vs. $1,165/mo all-SSD uncompressed |
The all-SSD cost assumes full-resolution 3-year retention. Tiered storage achieves lower cost partly through downsampling older data to 1-minute resolution.
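The tiered figures above can be checked with a short script (a sketch; tier windows, compression ratios, and per-TB prices are taken from the tables above):

```python
DAILY_GB = 21.6  # raw ingest from the data ingestion analysis

def tier_cost(stored_gb, price_per_tb):
    """Monthly cost of one tier given stored volume and $/TB/month."""
    return stored_gb / 1_000 * price_per_tb

hot_gb  = 7 * DAILY_GB                           # uncompressed SSD window
warm_gb = 83 * DAILY_GB * 0.10                   # 90% compression
cold_gb = (3 * 365 - 90) * DAILY_GB / 6 * 0.10   # downsample 6x, then compress
tiered  = tier_cost(hot_gb, 50) + tier_cost(warm_gb, 10) + tier_cost(cold_gb, 2)

all_ssd = tier_cost(648 * 36, 50)  # 3-year full-resolution volume on SSD

print(round(tiered, 2), round(all_ssd))  # ~$10 vs ~$1,166 per month
```

Small rounding differences aside, this reproduces the ~$10/month tiered total against the ~$1,165/month all-SSD baseline.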
What we do: Write optimized queries for each use case.
Why: Database choice is only half the solution; query design determines actual performance.
Real-time map update (< 100ms requirement):
```sql
-- Query the continuous aggregate for latest positions
-- Use the most recent bucket to get near-real-time data
SELECT vehicle_id, location, speed_kmh, last_seen
FROM vehicle_latest
WHERE bucket > NOW() - INTERVAL '2 minutes'
ORDER BY vehicle_id, bucket DESC;
-- Performance: < 10ms for 10,000 vehicles (indexed materialized view)
```
Vehicle 24-hour history (< 2s requirement):
```sql
-- Time-bounded query with vehicle filter
SELECT time, location, speed_kmh, fuel_pct
FROM vehicle_telemetry
WHERE vehicle_id = 'VEH-001'
  AND time > NOW() - INTERVAL '24 hours'
ORDER BY time;
-- Performance: < 500ms (8,640 rows from single day chunk)
```
Geofence check (< 50ms requirement):
```sql
-- Check if vehicle is within geofenced area
-- Use a subquery to get the latest bucket per vehicle
SELECT v.vehicle_id, g.zone_name
FROM (
    SELECT DISTINCT ON (vehicle_id) vehicle_id, location
    FROM vehicle_latest
    WHERE bucket > NOW() - INTERVAL '1 minute'
    ORDER BY vehicle_id, bucket DESC
) v
JOIN geofence_zones g
  ON ST_Within(v.location::geometry, g.boundary);
-- Performance: < 30ms with spatial index
```
Monthly fuel report (< 30s requirement):

```sql
-- Aggregate fuel consumption per vehicle
SELECT vehicle_id,
       MIN(fuel_pct) AS min_fuel,
       MAX(fuel_pct) AS max_fuel,
       SUM(CASE WHEN fuel_pct_diff > 0 THEN fuel_pct_diff ELSE 0 END) AS refuel_total
FROM (
    SELECT vehicle_id, fuel_pct,
           fuel_pct - LAG(fuel_pct) OVER (PARTITION BY vehicle_id ORDER BY time) AS fuel_pct_diff
    FROM vehicle_telemetry
    WHERE time > NOW() - INTERVAL '30 days'
) sub
GROUP BY vehicle_id;
-- Performance: < 15s (parallelized across 30 day-chunks)
```
Outcome: TimescaleDB (PostgreSQL with PostGIS) selected as the optimal database for fleet management, with a tiered storage strategy reducing costs by 99.1%.
Key decisions made and why:
| Decision | Choice | Rationale |
|---|---|---|
| Primary database | TimescaleDB | Best combination of time-series + geospatial + SQL |
| Why not InfluxDB | No geospatial | Geofencing is critical; would need secondary database |
| Why not MongoDB | Not ideal for time-series | Time-range aggregations less efficient |
| Storage strategy | 3-tier (hot/warm/cold) | 99.1% cost reduction while meeting latency SLAs |
| Real-time queries | Continuous aggregates | Pre-computed latest position; < 10ms response |
Quantified outcomes:
Write throughput: 1,000 messages/second sustained (headroom to 10K/s)
How does compression and tiering achieve such dramatic savings? Consider a smaller fleet of 500 trucks with GPS (1 Hz) and diagnostics (0.2 Hz) as a simplified example:
\(\text{rate} = (500 \times 1\,\text{Hz} \times 100\,\text{B}) + (500 \times 0.2\,\text{Hz} \times 50\,\text{B}) = 55{,}000\text{ B/s} = 55\text{ KB/s}\) (payload sizes of ~100 B for GPS and ~50 B for diagnostics are assumed, consistent with the totals below)
That is 1.73 TB/year raw. With TimescaleDB’s compression + tiered storage: hot (7 days, uncompressed) = 33 GB; warm (83 days, compressed at 90%) = 39 GB; cold (275 days, downsampled 6x + compressed at 90%) = 22 GB. That gives approximately 94 GB total first year.
Over 3 years: \(94 + 94 + 94 = 282\text{ GB} \approx 0.28\text{ TB}\) (~95% reduction vs. 5.19 TB raw). The same compression principles apply to the full 10,000-vehicle fleet, but at larger scale.
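The 500-truck arithmetic can be verified directly (a sketch; the 100-byte GPS and 50-byte diagnostic payloads are assumed values consistent with the 55 KB/s figure):

```python
rate_bps = 500 * 1.0 * 100 + 500 * 0.2 * 50  # GPS @ 1 Hz + diagnostics @ 0.2 Hz = 55,000 B/s
daily_gb = rate_bps * 86_400 / 1e9           # ~4.75 GB/day
yearly_tb = daily_gb * 365 / 1_000           # ~1.73 TB/year raw

hot  = 7 * daily_gb                # ~33 GB, uncompressed
warm = 83 * daily_gb * 0.10        # ~39 GB after 90% compression
cold = 275 * daily_gb / 6 * 0.10   # ~22 GB, downsampled 6x then compressed
print(round(hot + warm + cold))    # ~94 GB stored in the first year
```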
The fleet management example illustrates how geospatial requirements can be the deciding factor in database selection. The next example tackles a different challenge: what happens when data volume, not query complexity, is the primary driver.
8.3 Worked Example: Smart City Data Lake
8.4 Designing a Data Lake for a Smart City with 49,000 Sensors
Scenario: A city government is deploying a smart city platform with 49,000 sensors generating diverse data types: traffic flow (video + counts), air quality (particulate matter, CO2, ozone), noise levels, weather stations, smart streetlights, and water quality monitors. The platform must support real-time dashboards for city operations, historical analysis for urban planning, ML model training for predictive services, and compliance with data retention regulations (7 years for environmental data).
Goal: Design a cost-optimized data lake architecture with appropriate storage tiers, partitioning strategy, query patterns, and schema evolution approach that balances performance, cost, and regulatory compliance.
What we do: Catalog all data sources, estimate volumes, and classify by access patterns and retention requirements.
Why: Storage tier selection and partitioning strategy depend heavily on data characteristics.
Data source inventory:
| Source | Count | Frequency | Payload | Daily Volume | Data Type |
|---|---|---|---|---|---|
| Traffic cameras | 2,000 | 30 fps video | 50 KB/frame | 259 TB | Unstructured |
| Traffic counters | 5,000 | 1 min | 100 bytes | 720 MB | Time-series |
| Air quality | 500 | 5 min | 200 bytes | 29 MB | Time-series |
| Noise sensors | 1,000 | 10 sec | 50 bytes | 432 MB | Time-series |
| Weather stations | 100 | 15 min | 500 bytes | 4.8 MB | Time-series |
| Smart streetlights | 40,000 | 1 hour | 150 bytes | 144 MB | Time-series |
| Water quality | 400 | 30 min | 300 bytes | 5.8 MB | Time-series |
| Total sensors | 49,000 | - | - | ~260 TB/day | Mixed |
Key insight: Video data (over 99.9% of daily volume) dominates storage costs but is accessed rarely (incident investigation only). Time-series sensor data (a fraction of a percent of volume) is accessed frequently for dashboards and analytics. This split demands a tiered approach.
What we do: Define hot, warm, and cold storage tiers with appropriate technologies for each data type.
Why: A single storage tier cannot cost-effectively serve all access patterns.
The following diagram shows the flow from 49,000 sensors through Kafka to two storage tracks (time-series and object storage), each with hot, warm, and cold tiers.
Three-tier architecture:
[49,000 Sensors]
|
v
[Ingestion Layer: Kafka]
|
+---+---+
| |
v v
[Time-Series] [Object Storage]
| |
v v
+--+--+ +---+---+
| | | | |
v v v v v
[Hot] [Warm] [Hot][Warm][Cold]
7d 90d 24h 30d Archive
Tier specifications:
| Tier | Technology | Data | Retention | Storage Cost | Query Latency |
|---|---|---|---|---|---|
| Hot (TS) | TimescaleDB SSD | Sensor readings | 7 days | $0.10/GB/mo | < 50ms |
| Warm (TS) | TimescaleDB HDD | Sensor readings | 90 days | $0.02/GB/mo | < 500ms |
| Cold (TS) | S3 + Parquet | Aggregated sensors | 7 years | $0.004/GB/mo | < 10s |
| Hot (Video) | MinIO SSD | Recent clips | 24 hours | $0.10/GB/mo | < 1s |
| Warm (Video) | MinIO HDD | Investigation window | 30 days | $0.02/GB/mo | < 5s |
| Cold (Video) | S3 Glacier | Archive | 90 days | $0.004/GB/mo | 3-5 hours |
What we do: Design partition keys for optimal query performance across different access patterns.
Why: Poor partitioning causes full-table scans (slow, expensive) or hot partitions (write bottlenecks).
Schema evolution rules:
Adding fields: Always optional with defaults (NULL or computed)
Changing units: Store original + converted, never overwrite historical
Renaming fields: Create view alias, deprecate old name over 6 months
Removing fields: Stop populating, retain historical for query compatibility
What we do: Calculate storage costs and implement aggressive cost reduction through compression, aggregation, and tiering.
Why: At 260 TB/day, naive storage costs would exceed $1M/month. Smart optimization reduces this by 95%+.
Storage cost analysis (before optimization):
Assuming a blended storage cost of $0.02/GB/month and full data retention (using 365-day year):
| Data Type | Daily Volume | Steady-State (1 yr) | Monthly Storage Cost |
|---|---|---|---|
| Video raw | 259 TB | 94,535 TB | $1,890,700 |
| Sensors | 1.4 GB | 511 GB | $10 |
| Total | 259 TB | 94,535 TB | $1,890,710/month |
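The headline number falls out of two lines of arithmetic (a sketch, using the $0.02/GB/month blended rate stated above):

```python
video_steady_state_tb = 259 * 365                     # one year of raw video retained: 94,535 TB
monthly_cost = video_steady_state_tb * 1_000 * 0.02   # TB -> GB at $0.02/GB/month
print(round(monthly_cost))  # 1,890,700 -> ~$1.89M/month before any optimization
```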
Optimization strategies applied:
| Strategy | Target | Reduction | Implementation |
|---|---|---|---|
| Video motion detection | Keep only motion clips | 99% | Edge AI filters static footage; motion clips average ~1% of continuous recording duration |
| Video resolution tiering | 4K to 1080p to 480p by age | 75% | Transcode at 24h, 7d boundaries |
| Video retention policy | Delete after 90 days | 92% | Lifecycle policy (compliance exempt) |
| Sensor compression | TimescaleDB native | 92% | Segment by sensor_id |
| Sensor downsampling | 1-min to 1-hour to 1-day | 99% | Continuous aggregates for cold tier |
Cost after optimization:
| Tier | Data Type | Retention | Steady-State Volume | $/GB/month | Monthly Cost |
|---|---|---|---|---|---|
| Hot TS | Sensors | 7 days | 10 GB | $0.10 | $1.00 |
| Warm TS | Sensors | 90 days | 10 GB (compressed) | $0.02 | $0.20 |
| Cold TS | Aggregates | 7 years | 500 GB | $0.004 | $2.00 |
| Hot Video | Motion clips | 24 hours | 2.6 TB | $0.10 | $260 |
| Warm Video | Motion clips | 30 days | 78 TB | $0.02 | $1,560 |
| Cold Video | Compliance* | 90 days | 78 TB | $0.004 | $312 |
| Total | - | - | ~159 TB | - | ~$2,135/month |
*Environmental video exempt from 90-day deletion (compliance requirement)
Savings: $1,890,710/month to ~$2,135/month = 99.9% reduction (~$22.7M/year saved)
Try It: Smart City Storage Cost Calculator
Explore how different optimization strategies affect storage costs for a smart city deployment. Adjust video filtering and retention to see the impact.
Note: This calculator models only video motion filtering and retention. The full optimization in the worked example includes additional strategies (resolution tiering, sensor compression, downsampling) not shown here. As a result, the calculator’s optimized cost will be higher than the $2,135/month in the static table above.
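A simplified version of the video-cost model is sketched below (motion filtering plus retention, as in the calculator; the 50% factor standing in for resolution tiering on the cold window is an assumption chosen to match the table's 78 TB cold-tier volume):

```python
def video_cost_per_month(cameras=2_000, fps=30, frame_kb=50, motion_keep=0.01):
    """Monthly video storage cost under motion filtering and tiered retention."""
    daily_tb = cameras * fps * frame_kb * 86_400 / 1e9  # raw recording: ~259 TB/day
    kept_tb = daily_tb * motion_keep                    # edge AI keeps ~1% motion clips
    hot  = kept_tb * 1  * 1_000 * 0.10                  # 24h window on SSD
    warm = kept_tb * 30 * 1_000 * 0.02                  # 30-day investigation window
    cold = kept_tb * 60 * 0.5 * 1_000 * 0.004           # 60-day archive, halved by tiering (assumption)
    return daily_tb, hot + warm + cold

raw, cost = video_cost_per_month()
print(round(raw, 1), round(cost))  # 259.2 TB/day raw; ~$2,125/month for video
```

Adding the ~$3/month of sensor tiers from the table brings the total to roughly the ~$2,135/month figure above.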
Outcome: A production-ready data lake architecture serving 49,000 sensors with sub-100ms dashboard latency, 7-year compliance retention, and ~$2,135/month storage cost (vs. $1.89M/month unoptimized).
Key decisions made and why:
| Decision | Choice | Rationale |
|---|---|---|
| Time-series database | TimescaleDB | SQL familiarity for city analysts, native compression, continuous aggregates |
| Object storage | MinIO + S3 | On-prem for high-bandwidth video, S3 for archive |
| Video processing | Edge motion detection | 99% volume reduction at source, keeps only relevant footage |
| Partitioning | Time-first, then sensor | Matches 95% of query patterns (time-range dashboards) |
| Schema evolution | JSONB + version table | Maximum flexibility without migration downtime |
| Cold storage format | Parquet | Columnar compression, Spark/Presto compatible for ML |
Quantified outcomes:
Video clip retrieval: 8s hot, 25s warm (target: < 30s)
Monthly storage cost: ~$2,135 (99.9% reduction from naive approach)
Compliance coverage: 7 years environmental sensor data, 90 days video with compliance exemptions
Schema migrations: Zero downtime (additive changes only)
Key Takeaway
Successful IoT database architecture starts with quantifying your data: calculate ingestion rates, classify query patterns, estimate storage volumes, and define retention requirements. Then match technologies to workloads – a single “best” database rarely exists. Tiered storage (hot/warm/cold) combined with compression, aggregation, and edge processing can reduce storage costs by 95-99%+ while meeting all latency requirements.
For Kids: Meet the Sensor Squad!
Designing a database architecture is like planning the perfect kitchen for a restaurant – you need the right tools in the right places!
8.4.1 The Sensor Squad Adventure: Building the Perfect Data Kitchen
The Sensor Squad was hired to build data storage for two big projects: tracking 10,000 delivery trucks AND monitoring 49,000 city sensors!
“That’s a LOT of data!” said Sammy the Sensor. “How do we store it all without spending a fortune?”
Max the Microcontroller became the architect. “First, let’s figure out WHAT data we have and HOW MUCH.”
For the delivery trucks: - “Each truck reports its GPS location every 10 seconds. That’s 1,000 messages every second!” - “We need to show trucks on a MAP, so we need a database that understands geography!”
“That’s like needing a kitchen with BOTH an oven AND a grill!” said Lila the LED. “We picked TimescaleDB because it handles time-data AND maps!”
For the city sensors: - “49,000 sensors send data all day. Plus 2,000 cameras recording video!” - “The video takes up 99% of the storage space but is only needed when something goes wrong.”
Bella the Battery had a brilliant cost-saving idea: “Let’s organize storage like a CLOSET!”
Top shelf (HOT storage): This week’s data – fast to reach, but expensive space
Middle shelf (WARM storage): This month’s data – a bit slower, much cheaper
Bottom shelf (COLD storage): Old data in boxes – slow to unpack, but super cheap!
“And for video, let’s use AI to keep ONLY the interesting clips!” added Sammy. “That cuts storage by 99%!”
The result? Instead of spending $1.9 MILLION per month on storage, they spent only about $2,100 per month. The Squad saved 99.9%!
“The secret,” said Max, “is matching the RIGHT storage to each type of data, just like using the right container for each food in your fridge!”
8.4.2 Key Words for Kids
| Word | What It Means |
|---|---|
| Architecture | The plan for how a system is built – like the blueprint for a house |
| Tiered Storage | Storing data in different places based on how often you need it – like keeping snacks nearby but bulk food in the basement |
| Cost Optimization | Finding the cheapest way to do something well – like buying in bulk to save money |
8.5 Deep Dives
The following sections explore specific topics that arose in the worked examples above: how indexing transforms query performance, when to use continuous aggregates, and how to avoid costly database migrations.
Worked Example: Calculating Query Performance Impact of Indexing
Scenario: Fleet management system queries “show all vehicles in geofence polygon X in the last hour” 100 times/minute.
Without Spatial Index:
```sql
SELECT vehicle_id, location, speed_kmh, time
FROM vehicle_telemetry
WHERE time > NOW() - INTERVAL '1 hour'
  AND ST_Within(location::geometry, 'POLYGON((...))'::geometry);
-- Query plan: Sequential scan of 3.6M rows (10K vehicles x 360 readings/hour at 10s intervals)
-- Execution time: 8.5 seconds
```
With Spatial Index (GIST):
```sql
CREATE INDEX idx_location ON vehicle_telemetry USING GIST (location);
-- Query plan: Index scan of ~50 rows (vehicles in polygon)
-- Execution time: 12 milliseconds
```
Performance Calculation:
Queries per minute: 100
Without index: 100 queries x 8.5s = 850 seconds CPU (14 minutes)
With index: 100 queries x 0.012s = 1.2 seconds CPU
Speedup: 8.5s / 0.012s = 708x faster
CPU savings: 850s - 1.2s = 848.8 seconds/minute saved
Database load: 99.86% reduction
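The calculation above as a quick sketch:

```python
queries_per_min = 100
t_seq, t_idx = 8.5, 0.012               # seconds/query: sequential scan vs. GIST index scan
cpu_without = queries_per_min * t_seq   # 850 s of CPU per minute of wall clock (oversubscribed)
cpu_with    = queries_per_min * t_idx   # 1.2 s of CPU per minute

print(round(t_seq / t_idx))                          # ~708x speedup
print(round((1 - cpu_with / cpu_without) * 100, 2))  # 99.86% load reduction
```

Note that 850 CPU-seconds per minute means the unindexed workload cannot keep up at all on a modest server; the index is not an optimization but a prerequisite.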
Cost Analysis:
Index creation time: 45 seconds (one-time)
Index storage: 120 MB (0.02% of table size)
Index maintenance: 2% write overhead
Query improvement: 708x faster
Key Insight: Spatial indexes transform IoT geofencing from unusable (8.5s) to real-time (12ms) at negligible cost.
Decision Framework: When to Use Continuous Aggregates vs Materialized Views
| Characteristic | Continuous Aggregate | Materialized View | Which to Choose? |
|---|---|---|---|
| Data freshness | Auto-refresh (5 min) | Manual REFRESH | Continuous if need recent data |
| Compute cost | Incremental (only new data) | Full recompute | Continuous for large datasets |
| Setup complexity | Medium (policies) | Low (simple SQL) | Materialized for prototypes |
| Use case | Real-time dashboards | Daily/weekly reports | Continuous for production |
| Storage | Stores aggregate only | Stores full result | Continuous saves space |
Example Decision:
Scenario: Fleet dashboard showing “Average speed by vehicle in last 24 hours”
Option A: Real-time query (no pre-aggregation):
```sql
SELECT vehicle_id, AVG(speed_kmh)
FROM vehicle_telemetry
WHERE time > NOW() - INTERVAL '24 hours'
GROUP BY vehicle_id;
-- Scans 86.4M rows (10K vehicles x 8,640 readings/day at 10s intervals)
-- Query time: 4.2 seconds
```
Migration sketch (InfluxDB to TimescaleDB):

```bash
# Step 1: Export InfluxDB to CSV (chunked by time)
# Note: InfluxDB v2 uses Flux; v3 uses SQL. Adjust query syntax to your version.
for month in {1..12}; do
  start=$(printf "2024-%02d-01T00:00:00Z" $month)
  if [ $month -eq 12 ]; then
    stop="2025-01-01T00:00:00Z"
  else
    stop=$(printf "2024-%02d-01T00:00:00Z" $((month + 1)))
  fi
  influx query \
    "from(bucket: \"sensors\") |> range(start: ${start}, stop: ${stop})" \
    --raw > export_2024_$(printf "%02d" $month).csv
done

# Step 2: Import to TimescaleDB (transform CSV headers as needed)
for file in export_*.csv; do
  psql -c "\\COPY sensor_readings FROM '${file}' CSV HEADER"
done

# Step 3: Validate row counts match between source and target
influx query 'from(bucket: "sensors") |> count()' --raw
psql -c "SELECT COUNT(*) FROM sensor_readings;"
```
Time estimate: 5 TB at 10 MB/s = 500K seconds = 6 days export + 6 days import = 12 days
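Sanity-checking that estimate (a sketch; the 10 MB/s sustained throughput is the example's assumption):

```python
data_tb, throughput_mb_s = 5, 10
one_pass_s = data_tb * 1e6 / throughput_mb_s   # 5 TB at 10 MB/s = 500,000 s (~5.8 days)
total_days = 2 * one_pass_s / 86_400           # export + import passes
print(one_pass_s, round(total_days, 1))        # 500000.0 s, ~11.6 days end to end
```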
Key Insight: Migrating databases costs 10-100x more than choosing correctly upfront. Plan for growth.
Key Concepts
Fleet Telemetry Schema: A time-series table design for vehicle IoT data using device ID + timestamp as the composite key, with partitioning by vehicle group to balance write distribution and query co-location
Smart City Data Lake: A multi-tier storage architecture combining real-time streaming ingestion (Kafka), hot time-series storage (TimescaleDB), and cold archival (S3/Parquet) for city-scale sensor networks
Ingestion Pipeline: The data path from IoT device through broker (MQTT/Kafka) to database, including validation, enrichment, and routing stages that determine data quality and storage destination
Continuous Aggregate Materialization: Pre-computing hourly and daily rollups in TimescaleDB that auto-refresh as new sensor data arrives, enabling fleet-wide dashboards to query aggregates rather than billions of raw rows
Polyglot Persistence Pattern: Using PostgreSQL for device registry (relational), TimescaleDB for telemetry (time-series), Redis for real-time state (key-value), and S3/Parquet for cold archival (object) in a single production system
Partition Pruning: The query optimizer technique of skipping irrelevant time chunks based on WHERE clause time predicates, essential for achieving sub-second query times on multi-year IoT datasets
Write-Ahead Log Sizing: Configuring PostgreSQL/TimescaleDB WAL buffer and checkpoint settings to absorb bursty IoT write patterns without excessive I/O stalls during checkpoint operations
Tiered Compression Ratio: The actual storage reduction measured after applying columnar compression to IoT telemetry – typically 10-20x for float sensor values, 5-8x for mixed types – used to size storage budgets
Common Pitfalls
1. Building a single monolithic schema for all device types
Storing telemetry for 50 different device types (vehicles, HVAC, water meters) in a single wide table with nullable columns for type-specific fields creates sparse, inefficient storage and slow queries. Use inheritance or separate hypertables per device category, joined through a common device registry.
2. Skipping load testing before production deployment
An IoT worked example that performs perfectly with 100 simulated devices may collapse with 10,000 real devices due to lock contention, connection pool exhaustion, or disk I/O saturation. Always load test with realistic device counts, message rates, and concurrent query loads before cutover.
3. Not pre-computing aggregations for dashboard queries
Running real-time SUM/AVG across all raw sensor data for a fleet dashboard query causes full table scans on billions of rows. Create continuous aggregates (hourly, daily) during initial schema design – retrofitting them into a running production system requires careful migration to avoid query downtime.
8.6 Summary
Fleet management example demonstrates database selection trade-offs (TimescaleDB vs InfluxDB vs MongoDB) with geospatial requirements as the deciding factor
Smart city data lake example shows tiered storage architecture reducing costs by 99.9% through edge processing, compression, and lifecycle policies
Schema evolution strategies enable gradual changes without breaking existing devices or queries
Cost optimization requires understanding access patterns to place data in appropriate storage tiers
Query optimization through continuous aggregates and proper indexing achieves sub-100ms latency
You have completed the Data Storage and Databases chapter series. The worked examples here demonstrated how to translate requirements into concrete architectures. Your next step depends on where you want to go from here: