After completing this chapter, you will be able to:
- Apply database selection frameworks to real-world IoT scenarios
- Design complete storage architectures with appropriate tiering
- Calculate storage requirements and costs for large-scale deployments
- Implement schema designs optimized for specific query patterns
1279.1 Worked Example: Fleet Management System
1279.2 Choosing a Database for 10,000-Vehicle Fleet Tracking
Scenario: A logistics company is building a fleet management system for 10,000 delivery vehicles. Each vehicle reports GPS location, speed, fuel level, engine diagnostics, and driver behavior every 10 seconds. The system must support real-time vehicle tracking on a map, historical route analysis, geofencing alerts, and monthly compliance reports. The data must be retained for 3 years for regulatory compliance.
Goal: Select and design the optimal database architecture, comparing InfluxDB, TimescaleDB, and MongoDB for this use case, considering query patterns, data volumes, and operational costs.
What we do: Calculate data ingestion rate, storage requirements, and classify query patterns.
Why: Database selection depends heavily on data characteristics and access patterns.
Data ingestion analysis:
| Metric | Value | Calculation |
|---|---|---|
| Vehicles | 10,000 | Fixed fleet size |
| Report interval | 10 seconds | GPS + telemetry |
| Messages per second | 1,000 | 10,000 / 10 |
| Payload size (avg) | 250 bytes | GPS + 8 sensor values + metadata |
| Bytes per second | 250 KB/s | 1,000 × 250 bytes |
| Daily volume (raw) | 21.6 GB | 250 KB/s × 86,400 |
| Monthly volume | 648 GB | 21.6 GB × 30 |
| 3-year retention | 23.3 TB | 648 GB × 36 |
Query pattern classification:
| Query Type | Frequency | Latency Requirement | Data Range |
|---|---|---|---|
| Real-time map update | 100/sec | < 100ms | Last reading per vehicle |
| Vehicle history (24h) | 50/day | < 2s | 8,640 points per vehicle |
| Geofence check | 1,000/sec | < 50ms | Current position vs. polygons |
| Monthly fuel report | 10/month | < 30s | 30 days, all vehicles |
| Compliance audit | 5/year | < 5min | 1 year, full resolution |
Key insight: This workload combines time-series telemetry (95% of writes), real-time point queries (current state), and analytical aggregations (reports). A hybrid approach may be optimal.
What we do: Compare InfluxDB, TimescaleDB, and MongoDB against our requirements.
Why: Each database has strengths for different aspects of fleet management.
InfluxDB (Purpose-built time-series):
| Aspect | Rating | Notes |
|---|---|---|
| Write throughput | Excellent | Built for high-frequency metrics |
| Time-range queries | Excellent | Native time-series optimizations |
| Point queries (latest) | Good | Requires careful schema design |
| Geospatial queries | Poor | No native geo support |
| Compression | Excellent | 10-20x for numeric data |
| SQL compatibility | Limited | Flux query language (learning curve) |
TimescaleDB (PostgreSQL extension):
| Aspect | Rating | Notes |
|---|---|---|
| Write throughput | Very Good | 100K+ rows/sec with hypertables |
| Time-range queries | Excellent | Automatic time partitioning |
| Point queries (latest) | Good | Continuous aggregates help |
| Geospatial queries | Excellent | PostGIS integration |
| Compression | Very Good | 90-95% for time-series |
| SQL compatibility | Full | Standard PostgreSQL SQL |
MongoDB (Document database):
| Aspect | Rating | Notes |
|---|---|---|
| Write throughput | Very Good | Sharding scales horizontally |
| Time-range queries | Good | Requires proper indexing |
| Point queries (latest) | Excellent | Flexible document model |
| Geospatial queries | Very Good | Native geospatial indexes |
| Compression | Good | WiredTiger compression |
| SQL compatibility | None | MongoDB Query Language |
What we do: Design schemas for the recommended database (TimescaleDB) with alternatives.
Why: Schema design significantly impacts query performance and storage efficiency.
Recommendation: TimescaleDB - Best balance of time-series performance, geospatial capabilities, and SQL familiarity.
TimescaleDB schema design:
```sql
-- Main telemetry hypertable (time-partitioned)
CREATE TABLE vehicle_telemetry (
    time        TIMESTAMPTZ  NOT NULL,
    vehicle_id  VARCHAR(20)  NOT NULL,
    location    GEOGRAPHY(POINT, 4326),  -- PostGIS
    speed_kmh   SMALLINT,
    heading     SMALLINT,
    fuel_pct    SMALLINT,
    engine_rpm  SMALLINT,
    engine_temp SMALLINT,
    odometer_km INTEGER,
    driver_id   VARCHAR(20)
);

-- Convert to hypertable with 1-day chunks
SELECT create_hypertable('vehicle_telemetry', 'time',
    chunk_time_interval => INTERVAL '1 day');

-- Compression policy: compress after 7 days
ALTER TABLE vehicle_telemetry SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'vehicle_id'
);
SELECT add_compression_policy('vehicle_telemetry', INTERVAL '7 days');

-- Indexes for common queries
CREATE INDEX ON vehicle_telemetry (vehicle_id, time DESC);
CREATE INDEX ON vehicle_telemetry USING GIST (location);

-- Continuous aggregate for latest position (real-time map).
-- Continuous aggregates must GROUP BY a time_bucket, so a 1-minute bucket
-- is included; the map queries below pick the newest row per vehicle.
CREATE MATERIALIZED VIEW vehicle_latest
WITH (timescaledb.continuous) AS
SELECT
    time_bucket(INTERVAL '1 minute', time) AS bucket,
    vehicle_id,
    last(time, time)      AS last_seen,
    last(location, time)  AS location,
    last(speed_kmh, time) AS speed_kmh
FROM vehicle_telemetry
GROUP BY bucket, vehicle_id;

-- Refresh every 10 seconds for near-real-time
SELECT add_continuous_aggregate_policy('vehicle_latest',
    start_offset      => INTERVAL '1 hour',
    end_offset        => INTERVAL '10 seconds',
    schedule_interval => INTERVAL '10 seconds');
```
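For reference, each 10-second report from a vehicle maps to one row in the hypertable. A minimal insert, with purely illustrative values, would look like this:

```sql
-- One 10-second telemetry report (all values illustrative)
INSERT INTO vehicle_telemetry
    (time, vehicle_id, location, speed_kmh, heading, fuel_pct,
     engine_rpm, engine_temp, odometer_km, driver_id)
VALUES
    (NOW(), 'VEH-001',
     ST_SetSRID(ST_MakePoint(-73.9857, 40.7484), 4326)::geography,  -- lon, lat
     54, 270, 63, 2100, 88, 182345, 'DRV-042');
```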
Storage tier design (3-year retention with cost optimization):
| Tier | Age | Resolution | Storage | Monthly Cost |
|---|---|---|---|---|
| Hot | 0-7 days | 10-second | SSD | $50/TB |
| Warm | 7-90 days | 10-second (compressed) | HDD | $10/TB |
| Cold | 90 days - 3 years | 1-minute aggregates | Object storage | $2/TB |
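The warm and cold tiers can be driven from TimescaleDB itself. The sketch below is one way to implement the table above, assuming an illustrative aggregate name (`vehicle_telemetry_1min`) and treating 90 days as the point where raw 10-second data is dropped; moving compressed or aggregated data onto HDD tablespaces or object storage would be handled by separate tooling.

```sql
-- 1-minute rollup that feeds the cold tier (name and policy intervals are illustrative)
CREATE MATERIALIZED VIEW vehicle_telemetry_1min
WITH (timescaledb.continuous) AS
SELECT
    time_bucket(INTERVAL '1 minute', time) AS bucket,
    vehicle_id,
    last(location, time) AS last_location,
    avg(speed_kmh)       AS avg_speed_kmh,
    min(fuel_pct)        AS min_fuel_pct,
    max(fuel_pct)        AS max_fuel_pct
FROM vehicle_telemetry
GROUP BY bucket, vehicle_id;

SELECT add_continuous_aggregate_policy('vehicle_telemetry_1min',
    start_offset      => INTERVAL '3 days',
    end_offset        => INTERVAL '1 hour',
    schedule_interval => INTERVAL '1 hour');

-- Drop raw 10-second chunks once they age out of the warm tier;
-- the 1-minute rollup is kept for the full 3-year retention period
SELECT add_retention_policy('vehicle_telemetry', INTERVAL '90 days');
```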
Storage cost calculation:
| Tier | Data Volume | Storage Cost | Notes |
|---|---|---|---|
| Hot (7 days) | 150 GB | $7.50/month | Uncompressed, fast SSD |
| Warm (83 days) | 1.5 TB (compressed) | $15/month | 90% compression |
| Cold (2.75 years) | 2 TB | $4/month | Downsampled 6x |
| Total | ~3.7 TB effective | ~$26.50/month | vs. $1,165/mo uncompressed |
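For reference, the uncompressed baseline in the Total row is simply the full 3-year retention volume priced at the hot-tier rate: 23.3 TB × $50/TB per month ≈ $1,165 per month.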
What we do: Write optimized queries for each use case.
Why: Database choice is only half the solution; query design determines actual performance.
Real-time map update (< 100ms requirement):
```sql
-- Query the continuous aggregate for the latest position per vehicle
SELECT DISTINCT ON (vehicle_id)
       vehicle_id, location, speed_kmh, last_seen
FROM vehicle_latest
WHERE last_seen > NOW() - INTERVAL '2 minutes'
ORDER BY vehicle_id, last_seen DESC;
-- Performance: < 10ms for 10,000 vehicles (indexed materialized view)
```
Vehicle 24-hour history (< 2s requirement):
```sql
-- Time-bounded query with vehicle filter
SELECT time, location, speed_kmh, fuel_pct
FROM vehicle_telemetry
WHERE vehicle_id = 'VEH-001'
  AND time > NOW() - INTERVAL '24 hours'
ORDER BY time;
-- Performance: < 500ms (8,640 rows spanning at most two day chunks)
```
Geofence check (< 50ms requirement):
```sql
-- Check whether each vehicle's latest position falls inside a geofenced area
WITH latest AS (
    SELECT DISTINCT ON (vehicle_id) vehicle_id, location, last_seen
    FROM vehicle_latest
    WHERE last_seen > NOW() - INTERVAL '1 minute'
    ORDER BY vehicle_id, last_seen DESC
)
SELECT v.vehicle_id, g.zone_name
FROM latest v
JOIN geofence_zones g
  ON ST_Within(v.location::geometry, g.boundary);
-- Performance: < 30ms with spatial index
```
Monthly fuel report (< 30s requirement):
```sql
-- Aggregate fuel consumption per vehicle
SELECT
    vehicle_id,
    MIN(fuel_pct) AS min_fuel,
    MAX(fuel_pct) AS max_fuel,
    SUM(CASE WHEN fuel_pct_diff > 0 THEN fuel_pct_diff ELSE 0 END) AS refuel_total
FROM (
    SELECT
        vehicle_id,
        fuel_pct,
        fuel_pct - LAG(fuel_pct) OVER (PARTITION BY vehicle_id ORDER BY time) AS fuel_pct_diff
    FROM vehicle_telemetry
    WHERE time > NOW() - INTERVAL '30 days'
) sub
GROUP BY vehicle_id;
-- Performance: < 15s (parallelized across 30 day-chunks)
```
Outcome: TimescaleDB (PostgreSQL with the PostGIS extension) is selected as the database for fleet management, with a tiered storage strategy reducing storage costs by roughly 97%.
Key decisions made and why:
| Decision | Choice | Rationale |
|---|---|---|
| Primary database | TimescaleDB | Best combination of time-series + geospatial + SQL |
| Why not InfluxDB | No geospatial | Geofencing is critical; would need secondary database |
| Why not MongoDB | Not ideal for time-series | Time-range aggregations less efficient |
| Storage strategy | 3-tier (hot/warm/cold) | 97% cost reduction while meeting latency SLAs |
| Real-time queries | Continuous aggregates | Pre-computed latest position; < 10ms response |
Quantified outcomes:
- Write throughput: 1,000 messages/second sustained (headroom to 10K/s)
Knowledge check: Based on the fleet management worked example, why was TimescaleDB chosen over InfluxDB despite InfluxDB being a "purpose-built" time-series database? Answer: because TimescaleDB provides PostGIS integration for geofencing and spatial queries. Fleet management needs geospatial queries (geofencing, route analysis); InfluxDB lacks native geo support and would require a secondary database, whereas TimescaleDB + PostGIS handles both time-series and spatial data in one system. The decision was not driven by raw write speed (InfluxDB is often slightly faster), compression (InfluxDB compresses numeric data 10-20x), or licensing (both have open-source versions).
1279.3 Worked Example: Smart City Data Lake
1279.4 Designing a Data Lake for a Smart City with 50,000 Sensors
Scenario: A city government is deploying a smart city platform with 50,000 sensors generating diverse data types: traffic flow (video + counts), air quality (particulate matter, CO2, ozone), noise levels, weather stations, smart streetlights, and water quality monitors. The platform must support real-time dashboards for city operations, historical analysis for urban planning, ML model training for predictive services, and compliance with data retention regulations (7 years for environmental data).
Goal: Design a cost-optimized data lake architecture with appropriate storage tiers, partitioning strategy, query patterns, and schema evolution approach that balances performance, cost, and regulatory compliance.
What we do: Catalog all data sources, estimate volumes, and classify by access patterns and retention requirements.
Why: Storage tier selection and partitioning strategy depend heavily on data characteristics.
Data source inventory:
| Source | Count | Frequency | Payload | Daily Volume | Data Type |
|---|---|---|---|---|---|
| Traffic cameras | 2,000 | 30 fps video | 50 KB/frame | 259 TB | Unstructured |
| Traffic counters | 5,000 | 1 min | 100 bytes | 720 MB | Time-series |
| Air quality | 500 | 5 min | 200 bytes | 29 MB | Time-series |
| Noise sensors | 1,000 | 10 sec | 50 bytes | 432 MB | Time-series |
| Weather stations | 100 | 15 min | 500 bytes | 4.8 MB | Time-series |
| Smart streetlights | 40,000 | 1 hour | 150 bytes | 144 MB | Time-series |
| Water quality | 400 | 30 min | 300 bytes | 5.8 MB | Time-series |
| Total sensors | 49,000 | - | - | ~260 TB/day | Mixed |
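To see why the camera row dominates: 2,000 cameras × 30 frames/s × 50 KB/frame ≈ 3 GB/s, which over 86,400 seconds is roughly 259 TB per day, while the six time-series sensor classes together produce only about 1.4 GB per day.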
Key insight: Video data (over 99.9% of daily volume) dominates storage costs but is accessed rarely (incident investigation only). Time-series sensor data (well under 1% of volume, about 1.4 GB/day) is accessed frequently for dashboards and analytics. This split demands a tiered approach.
What we do: Define hot, warm, and cold storage tiers with appropriate technologies for each data type.
Why: A single storage tier cannot cost-effectively serve all access patterns.
Three-tier architecture:
```
[50,000 Sensors]
       |
       v
[Ingestion Layer: Kafka]
       |
   +---+---+
   |       |
   v       v
[Time-Series]      [Object Storage]
   |                     |
   v                     v
+--+--+            +-----+-----+
|     |            |     |     |
v     v            v     v     v
[Hot] [Warm]     [Hot] [Warm] [Cold]
 7d    90d        24h   30d   Archive
```
Tier specifications:
| Tier | Technology | Data | Retention | Storage Cost | Query Latency |
|---|---|---|---|---|---|
| Hot (TS) | TimescaleDB SSD | Sensor readings | 7 days | $0.10/GB/mo | < 50ms |
| Warm (TS) | TimescaleDB HDD | Sensor readings | 90 days | $0.02/GB/mo | < 500ms |
| Cold (TS) | S3 + Parquet | Aggregated sensors | 7 years | $0.004/GB/mo | < 10s |
| Hot (Video) | MinIO SSD | Recent clips | 24 hours | $0.10/GB/mo | < 1s |
| Warm (Video) | MinIO HDD | Investigation window | 30 days | $0.02/GB/mo | < 5s |
| Cold (Video) | S3 Glacier | Archive | 90 days | $0.004/GB/mo | 3-5 hours |
What we do: Design partition keys for optimal query performance across different access patterns.
Why: Poor partitioning causes full-table scans (slow, expensive) or hot partitions (write bottlenecks).
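One way to realize the time-first, sensor-second layout chosen in the key-decisions table later in this section is sketched here for the time-series portion of the lake. The table name, column names, and partition count are illustrative assumptions, not part of the original design:

```sql
-- Readings hypertable: partitioned by time first (1-day chunks),
-- then hash-partitioned on sensor_id to spread writes and per-sensor scans.
-- All names and the partition count are illustrative.
CREATE TABLE sensor_readings (
    time        TIMESTAMPTZ NOT NULL,
    sensor_id   VARCHAR(32) NOT NULL,
    sensor_type VARCHAR(16) NOT NULL,   -- 'air_quality', 'noise', 'traffic', ...
    value       DOUBLE PRECISION,
    metadata    JSONB                   -- flexible per-sensor-type fields
);

SELECT create_hypertable('sensor_readings', 'time',
    chunk_time_interval => INTERVAL '1 day');

-- Secondary (space) dimension on sensor_id
SELECT add_dimension('sensor_readings', 'sensor_id', number_partitions => 8);

-- Dashboards filter by sensor type and time range; per-sensor history by id and time
CREATE INDEX ON sensor_readings (sensor_type, time DESC);
CREATE INDEX ON sensor_readings (sensor_id, time DESC);
```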
Schema evolution rules (sketched in SQL below):
- Adding fields: always optional, with defaults (NULL or computed)
- Changing units: store original + converted, never overwrite historical data
- Renaming fields: create a view alias, deprecate the old name over 6 months
- Removing fields: stop populating, retain historical data for query compatibility
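Against the illustrative `sensor_readings` table above, these rules translate into additive SQL changes only; a brief sketch (column and view names hypothetical):

```sql
-- Adding a field: always optional, with a NULL (or computed) default
ALTER TABLE sensor_readings
    ADD COLUMN IF NOT EXISTS humidity_pct SMALLINT DEFAULT NULL;

-- Changing units: store the converted value alongside the original;
-- never rewrite historical rows
ALTER TABLE sensor_readings
    ADD COLUMN IF NOT EXISTS value_si DOUBLE PRECISION DEFAULT NULL;

-- Renaming a field: expose the new name through a view and deprecate
-- the old name over the 6-month window
CREATE OR REPLACE VIEW sensor_readings_current AS
SELECT time, sensor_id, sensor_type,
       value AS reading_value,   -- new public name for the old column
       metadata
FROM sensor_readings;

-- Removing a field: stop populating it; the column remains queryable
-- so historical dashboards and reports keep working
```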
What we do: Calculate storage costs and implement aggressive cost reduction through compression, aggregation, and tiering.
Why: At roughly 260 TB/day of raw data, naive storage costs would climb past $1M/month within the first year. Smart optimization reduces this by more than 99%.
Storage cost analysis (before optimization):
| Data Type | Daily Volume | Monthly (30d) | Yearly | Unit Cost | Annual Cost |
|---|---|---|---|---|---|
| Video raw | 259 TB | 7,770 TB | 93,240 TB | $0.02/GB | $1,864,800 |
| Sensors | 1.4 GB | 42 GB | 504 GB | $0.02/GB | $10 |
| Total | 259 TB | 7,770 TB | 93,240 TB | - | $1,864,810 |
Optimization strategies applied:
| Strategy | Target | Reduction | Implementation |
|---|---|---|---|
| Video motion detection | Keep only motion clips | 90% | Edge AI filters static footage |
| Video resolution tiering | 4K → 1080p → 480p by age | 75% | Transcode at 24h and 7d boundaries |
| Video retention policy | Delete after 90 days | 92% | Lifecycle policy (compliance exempt) |
| Sensor compression | TimescaleDB native | 92% | Segment by sensor_id |
| Sensor downsampling | 1-min → 1-hour → 1-day | 99% | Continuous aggregates for cold tier |
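The two sensor-row strategies map directly onto the TimescaleDB features already used in the fleet example. A sketch against the illustrative `sensor_readings` hypertable above (the policy intervals are assumptions, not values from the original design):

```sql
-- Native compression, segmented by sensor_id so per-sensor scans stay cheap
ALTER TABLE sensor_readings SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'sensor_id'
);
SELECT add_compression_policy('sensor_readings', INTERVAL '7 days');

-- Hourly rollup feeding the cold tier (a daily rollup follows the same pattern)
CREATE MATERIALIZED VIEW sensor_readings_hourly
WITH (timescaledb.continuous) AS
SELECT
    time_bucket(INTERVAL '1 hour', time) AS bucket,
    sensor_id,
    sensor_type,
    avg(value) AS avg_value,
    min(value) AS min_value,
    max(value) AS max_value,
    count(*)   AS sample_count
FROM sensor_readings
GROUP BY bucket, sensor_id, sensor_type;

SELECT add_continuous_aggregate_policy('sensor_readings_hourly',
    start_offset      => INTERVAL '3 days',
    end_offset        => INTERVAL '1 hour',
    schedule_interval => INTERVAL '1 hour');
```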
Cost after optimization:
| Tier | Data Type | Retention | Volume/Year | Unit Cost | Annual Cost |
|---|---|---|---|---|---|
| Hot TS | Sensors | 7 days | 10 GB | $0.10 | $1 |
| Warm TS | Sensors | 90 days | 100 GB | $0.02 | $2 |
| Cold TS | Aggregates | 7 years | 500 GB | $0.004 | $14 |
| Hot Video | Motion clips | 24 hours | 5 TB | $0.10 | $500 |
| Warm Video | Motion clips | 30 days | 150 TB | $0.02 | $3,000 |
| Cold Video | Compliance | 90 days* | 200 TB | $0.004 | $800 |
| Total | - | - | ~350 TB | - | $4,317 |
*Environmental video exempt from 90-day deletion (compliance requirement)
Savings: $1,864,810 to $4,317 = 99.8% reduction ($1.86M/year saved)
Outcome: A production-ready data lake architecture serving 50,000 sensors with sub-100ms dashboard latency, 7-year compliance retention, and $4,317/year storage cost (vs. $1.86M unoptimized).
Key decisions made and why:
| Decision | Choice | Rationale |
|---|---|---|
| Time-series database | TimescaleDB | SQL familiarity for city analysts, native compression, continuous aggregates |
| Object storage | MinIO + S3 | On-prem for high-bandwidth video, S3 for archive |
| Video processing | Edge motion detection | 90% volume reduction at source, keeps only relevant footage |
| Partitioning | Time-first, then sensor | Matches 95% of query patterns (time-range dashboards) |
| Schema evolution | JSONB + version table | Maximum flexibility without migration downtime |
| Cold storage format | Parquet | Columnar compression, Spark/Presto compatible for ML |
Quantified outcomes:
- Video clip retrieval: 8s hot, 45s warm (target: < 30s)
- Annual storage cost: $4,317 (99.8% reduction from naive approach)
- Compliance coverage: 7 years environmental data, 90 days video with compliance exemptions
- Schema migrations: zero downtime (additive changes only)
1279.5 Summary
- Fleet management example demonstrates database selection trade-offs (TimescaleDB vs InfluxDB vs MongoDB) with geospatial requirements as the deciding factor
- Smart city data lake example shows tiered storage architecture reducing costs by 99.8% through edge processing, compression, and lifecycle policies
- Schema evolution strategies enable gradual changes without breaking existing devices or queries
- Cost optimization requires understanding access patterns to place data in appropriate storage tiers
- Query optimization through continuous aggregates and proper indexing achieves sub-100ms latency
1279.6 What's Next
You've completed the Data Storage and Databases chapter series. Continue exploring: