8  Data Storage Worked Examples

In 60 Seconds

Real-world IoT database design requires matching storage architecture to specific workloads. A fleet management system with geospatial queries benefits from TimescaleDB with PostGIS, while a smart city data lake needs tiered storage (hot/warm/cold) to cut costs by as much as 99.9%. The key is quantifying your data volume, query patterns, and retention needs before selecting technologies.

Learning Objectives

After completing this chapter, you will be able to:

  • Apply database selection frameworks to real-world IoT scenarios
  • Design complete storage architectures with appropriate tiering
  • Calculate storage requirements and costs for large-scale deployments
  • Implement schema designs optimized for specific query patterns

These worked examples walk you through the process of designing data storage for real IoT scenarios, step by step. Think of them as cooking demonstrations where an experienced chef explains every decision while preparing a complex dish. By following along, you learn not just what to do, but why, building the judgment you need for your own projects.

8.1 Worked Example: Fleet Management System

8.2 Choosing a Database for 10,000-Vehicle Fleet Tracking

Scenario: A logistics company is building a fleet management system for 10,000 delivery vehicles. Each vehicle reports GPS location, speed, fuel level, engine diagnostics, and driver behavior every 10 seconds. The system must support real-time vehicle tracking on a map, historical route analysis, geofencing alerts, and monthly compliance reports. The data must be retained for 3 years for regulatory compliance.

Goal: Select and design the optimal database architecture, comparing InfluxDB, TimescaleDB, and MongoDB for this use case, considering query patterns, data volumes, and operational costs.

What we do: Calculate data ingestion rate, storage requirements, and classify query patterns.

Why: Database selection depends heavily on data characteristics and access patterns.

Data ingestion analysis:

| Metric | Value | Calculation |
|---|---|---|
| Vehicles | 10,000 | Fixed fleet size |
| Report interval | 10 seconds | GPS + telemetry |
| Messages per second | 1,000 | 10,000 / 10 |
| Payload size (avg) | 250 bytes | GPS + 8 sensor values + metadata |
| Bytes per second | 250 KB/s | 1,000 x 250 bytes |
| Daily volume (raw) | 21.6 GB | 250 KB/s x 86,400 |
| Monthly volume | 648 GB | 21.6 GB x 30 |
| 3-year retention | 23.3 TB | 648 GB x 36 |
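The figures in this table can be reproduced with a quick back-of-the-envelope script (a sketch, not production code; the constants mirror the assumptions above):

```python
# Back-of-the-envelope sizing for the 10,000-vehicle fleet.
VEHICLES = 10_000
REPORT_INTERVAL_S = 10
PAYLOAD_BYTES = 250  # GPS + 8 sensor values + metadata (average)

msgs_per_sec = VEHICLES / REPORT_INTERVAL_S       # 1,000 msg/s
bytes_per_sec = msgs_per_sec * PAYLOAD_BYTES      # 250 KB/s
daily_gb = bytes_per_sec * 86_400 / 1e9           # 21.6 GB/day
monthly_gb = daily_gb * 30                        # 648 GB/month
three_year_tb = monthly_gb * 36 / 1e3             # ~23.3 TB over 3 years

print(f"{msgs_per_sec:.0f} msg/s, {daily_gb:.1f} GB/day, "
      f"{three_year_tb:.1f} TB over 3 years")
```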

Query pattern classification:

| Query Type | Frequency | Latency Requirement | Data Range |
|---|---|---|---|
| Real-time map update | 100/sec | < 100ms | Last reading per vehicle |
| Vehicle history (24h) | 50/day | < 2s | 8,640 points per vehicle |
| Geofence check | 1,000/sec | < 50ms | Current position vs. polygons |
| Monthly fuel report | 10/month | < 30s | 30 days, all vehicles |
| Compliance audit | 5/year | < 5min | 1 year, full resolution |

Key insight: This workload combines time-series telemetry (95% of writes), real-time point queries (current state), and analytical aggregations (reports). A hybrid approach may be optimal.


What we do: Compare InfluxDB, TimescaleDB, and MongoDB against our requirements.

Why: Each database has strengths for different aspects of fleet management.

InfluxDB (Purpose-built time-series):

| Aspect | Rating | Notes |
|---|---|---|
| Write throughput | Excellent | Built for high-frequency metrics |
| Time-range queries | Excellent | Native time-series optimizations |
| Point queries (latest) | Good | Requires careful schema design |
| Geospatial queries | Poor | No native geo support |
| Compression | Excellent | 10-20x for numeric data |
| SQL compatibility | Limited | Flux (v2) or SQL (v3); ecosystem less mature than PostgreSQL |

TimescaleDB (PostgreSQL extension):

| Aspect | Rating | Notes |
|---|---|---|
| Write throughput | Very Good | 100K+ rows/sec with hypertables |
| Time-range queries | Excellent | Automatic time partitioning |
| Point queries (latest) | Good | Continuous aggregates help |
| Geospatial queries | Excellent | PostGIS integration |
| Compression | Very Good | 90-95% for time-series |
| SQL compatibility | Full | Standard PostgreSQL SQL |

MongoDB (Document database):

| Aspect | Rating | Notes |
|---|---|---|
| Write throughput | Very Good | Sharding scales horizontally |
| Time-range queries | Good | Requires proper indexing |
| Point queries (latest) | Excellent | Flexible document model |
| Geospatial queries | Very Good | Native geospatial indexes |
| Compression | Good | WiredTiger compression |
| SQL compatibility | None | MongoDB Query Language |

What we do: Design schemas for the recommended database (TimescaleDB) with alternatives.

Why: Schema design significantly impacts query performance and storage efficiency.

Recommendation: TimescaleDB - Best balance of time-series performance, geospatial capabilities, and SQL familiarity.

TimescaleDB schema design:

-- Hypertable: time-partitioned telemetry
CREATE TABLE vehicle_telemetry (
    time TIMESTAMPTZ NOT NULL, vehicle_id VARCHAR(20) NOT NULL,
    location GEOGRAPHY(POINT, 4326),  -- PostGIS
    speed_kmh SMALLINT, heading SMALLINT, fuel_pct SMALLINT,
    engine_rpm SMALLINT, engine_temp SMALLINT,
    odometer_km INTEGER, driver_id VARCHAR(20)
);
SELECT create_hypertable('vehicle_telemetry', 'time',
       chunk_time_interval => INTERVAL '1 day');

-- Compress after 7 days, segment by vehicle for efficient queries
ALTER TABLE vehicle_telemetry SET (
    timescaledb.compress, timescaledb.compress_segmentby = 'vehicle_id');
SELECT add_compression_policy('vehicle_telemetry', INTERVAL '7 days');

CREATE INDEX ON vehicle_telemetry (vehicle_id, time DESC);
CREATE INDEX ON vehicle_telemetry USING GIST (location);

-- Near-real-time position view (10-second buckets)
CREATE MATERIALIZED VIEW vehicle_latest
WITH (timescaledb.continuous) AS
SELECT time_bucket('10 seconds', time) AS bucket, vehicle_id,
       last(time, time) AS last_seen,
       last(location, time) AS location,
       last(speed_kmh, time) AS speed_kmh
FROM vehicle_telemetry GROUP BY bucket, vehicle_id;

-- Refresh policy keeps the view within ~10 seconds of live data
SELECT add_continuous_aggregate_policy('vehicle_latest',
       start_offset => INTERVAL '10 minutes',
       end_offset => INTERVAL '10 seconds',
       schedule_interval => INTERVAL '10 seconds');

Storage tier design (3-year retention with cost optimization):

| Tier | Age | Resolution | Storage | Monthly Cost |
|---|---|---|---|---|
| Hot | 0-7 days | 10-second | SSD | $50/TB |
| Warm | 7-90 days | 10-second (compressed) | HDD | $10/TB |
| Cold | 90 days - 3 years | 1-minute aggregates | Object storage | $2/TB |

Storage cost calculation:

| Tier | Raw Volume | Stored Volume | Storage Cost | Notes |
|---|---|---|---|---|
| Hot (7 days) | 151 GB | 151 GB | $7.55/month | Uncompressed, fast SSD |
| Warm (83 days) | 1.79 TB | 179 GB | $1.79/month | 90% compression |
| Cold (2.75 years) | 3.62 TB | 362 GB | $0.72/month | Downsampled 6x + 90% compression |
| Total | ~5.6 TB raw | ~692 GB stored | ~$10.06/month | vs. $1,165/mo all-SSD uncompressed |

The all-SSD cost assumes full-resolution 3-year retention. Tiered storage achieves lower cost partly through downsampling older data to 1-minute resolution.
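The tier arithmetic can be sanity-checked in a few lines (a sketch using the per-TB rates and compression ratios from the tables above):

```python
# Tiered storage cost for 3-year fleet retention.
DAILY_GB = 21.6
hot_gb  = DAILY_GB * 7                           # 7 days uncompressed
warm_gb = DAILY_GB * 83 * 0.10                   # 90% compression -> ~179 GB
cold_gb = DAILY_GB * (3 * 365 - 90) / 6 * 0.10   # 6x downsample + 90% compression

# Rates: $50/TB hot SSD, $10/TB warm HDD, $2/TB cold object storage.
cost = hot_gb / 1e3 * 50 + warm_gb / 1e3 * 10 + cold_gb / 1e3 * 2
all_ssd = 23.3 * 50                              # full-resolution SSD baseline
print(f"tiered ${cost:.2f}/mo vs all-SSD ${all_ssd:.0f}/mo "
      f"({(1 - cost / all_ssd) * 100:.1f}% saved)")
```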

What we do: Write optimized queries for each use case.

Why: Database choice is only half the solution; query design determines actual performance.

Real-time map update (< 100ms requirement):

-- Query the continuous aggregate for latest positions
-- Use the most recent bucket to get near-real-time data
SELECT vehicle_id, location, speed_kmh, last_seen
FROM vehicle_latest
WHERE bucket > NOW() - INTERVAL '2 minutes'
ORDER BY vehicle_id, bucket DESC;

-- Performance: < 10ms for 10,000 vehicles (indexed materialized view)

Vehicle 24-hour history (< 2s requirement):

-- Time-bounded query with vehicle filter
SELECT time, location, speed_kmh, fuel_pct
FROM vehicle_telemetry
WHERE vehicle_id = 'VEH-001'
  AND time > NOW() - INTERVAL '24 hours'
ORDER BY time;

-- Performance: < 500ms (8,640 rows from single day chunk)

Geofence check (< 50ms requirement):

-- Check if vehicle is within geofenced area
-- Use a subquery to get the latest bucket per vehicle
SELECT v.vehicle_id, g.zone_name
FROM (
    SELECT DISTINCT ON (vehicle_id) vehicle_id, location
    FROM vehicle_latest
    WHERE bucket > NOW() - INTERVAL '1 minute'
    ORDER BY vehicle_id, bucket DESC
) v
JOIN geofence_zones g ON ST_Within(v.location::geometry, g.boundary);

-- Performance: < 30ms with spatial index

Monthly fuel consumption report (< 30s requirement):

-- Aggregate fuel consumption per vehicle
SELECT vehicle_id,
       MIN(fuel_pct) AS min_fuel,
       MAX(fuel_pct) AS max_fuel,
       SUM(CASE WHEN fuel_pct_diff > 0 THEN fuel_pct_diff ELSE 0 END) AS refuel_total
FROM (
    SELECT vehicle_id, fuel_pct,
           fuel_pct - LAG(fuel_pct) OVER (PARTITION BY vehicle_id ORDER BY time) AS fuel_pct_diff
    FROM vehicle_telemetry
    WHERE time > NOW() - INTERVAL '30 days'
) sub
GROUP BY vehicle_id;

-- Performance: < 15s (parallelized across 30 day-chunks)

Outcome: TimescaleDB (a PostgreSQL extension) with PostGIS selected as the optimal database for fleet management, with a tiered storage strategy reducing costs by 99.1%.

Key decisions made and why:

| Decision | Choice | Rationale |
|---|---|---|
| Primary database | TimescaleDB | Best combination of time-series + geospatial + SQL |
| Why not InfluxDB | No geospatial | Geofencing is critical; would need secondary database |
| Why not MongoDB | Not ideal for time-series | Time-range aggregations less efficient |
| Storage strategy | 3-tier (hot/warm/cold) | 99.1% cost reduction while meeting latency SLAs |
| Real-time queries | Continuous aggregates | Pre-computed latest position; < 10ms response |

Quantified outcomes:

  • Write throughput: 1,000 messages/second sustained (headroom to 10K/s)
  • Real-time map latency: < 10ms (continuous aggregate)
  • 24-hour history query: < 500ms (single chunk access)
  • Monthly report generation: < 15s (parallel chunk processing)
  • Storage cost: ~$10/month (vs. $1,165/month all-SSD uncompressed)
  • 3-year retention: ~692 GB effective storage (99.1% cost reduction)

How does compression and tiering achieve such dramatic savings? Consider a smaller fleet of 500 trucks with GPS (1 Hz) and diagnostics (0.2 Hz) as a simplified example:

Assuming roughly 100-byte GPS messages and 50-byte diagnostic messages:

\((500 \times 1\,\text{Hz} \times 100\,\text{B}) + (500 \times 0.2\,\text{Hz} \times 50\,\text{B}) = 50 + 5 = 55\ \text{KB/s}\)

That is 1.73 TB/year raw. With TimescaleDB’s compression and tiered storage: hot (7 days, uncompressed) = 33 GB; warm (83 days, compressed at 90%) = 39 GB; cold (275 days, downsampled 6x and compressed at 90%) = 22 GB, for approximately 94 GB total in the first year.

Over 3 years: \(94 + 94 + 94 = 282\text{ GB} \approx 0.28\text{ TB}\) (~95% reduction vs. 5.19 TB raw). The same compression principles apply to the full 10,000-vehicle fleet, but at larger scale.
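A short script makes the 500-truck arithmetic explicit (the 100-byte GPS and 50-byte diagnostic payloads are assumptions chosen to match the 55 KB/s rate):

```python
# 500-truck simplified example: GPS at 1 Hz, diagnostics at 0.2 Hz.
# Payload sizes (100 B GPS, 50 B diagnostics) are assumptions.
rate_bps = 500 * 1 * 100 + 500 * 0.2 * 50    # 55,000 B/s = 55 KB/s
daily_gb = rate_bps * 86_400 / 1e9           # ~4.75 GB/day
yearly_tb = daily_gb * 365 / 1e3             # ~1.73 TB/year raw

hot  = daily_gb * 7                          # 7 days uncompressed  -> ~33 GB
warm = daily_gb * 83 * 0.10                  # 90% compression      -> ~39 GB
cold = daily_gb * 275 / 6 * 0.10             # 6x downsample + 90%  -> ~22 GB
print(f"{yearly_tb:.2f} TB/year raw, ~{hot + warm + cold:.0f} GB stored")
```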

The fleet management example illustrates how geospatial requirements can be the deciding factor in database selection. The next example tackles a different challenge: what happens when data volume, not query complexity, is the primary driver.


8.3 Worked Example: Smart City Data Lake

8.4 Designing a Data Lake for a Smart City with 49,000 Sensors

Scenario: A city government is deploying a smart city platform with 49,000 sensors generating diverse data types: traffic flow (video + counts), air quality (particulate matter, CO2, ozone), noise levels, weather stations, smart streetlights, and water quality monitors. The platform must support real-time dashboards for city operations, historical analysis for urban planning, ML model training for predictive services, and compliance with data retention regulations (7 years for environmental data).

Goal: Design a cost-optimized data lake architecture with appropriate storage tiers, partitioning strategy, query patterns, and schema evolution approach that balances performance, cost, and regulatory compliance.

What we do: Catalog all data sources, estimate volumes, and classify by access patterns and retention requirements.

Why: Storage tier selection and partitioning strategy depend heavily on data characteristics.

Data source inventory:

| Source | Count | Frequency | Payload | Daily Volume | Data Type |
|---|---|---|---|---|---|
| Traffic cameras | 2,000 | 30 fps video | 50 KB/frame | 259 TB | Unstructured |
| Traffic counters | 5,000 | 1 min | 100 bytes | 720 MB | Time-series |
| Air quality | 500 | 5 min | 200 bytes | 29 MB | Time-series |
| Noise sensors | 1,000 | 10 sec | 50 bytes | 432 MB | Time-series |
| Weather stations | 100 | 15 min | 500 bytes | 4.8 MB | Time-series |
| Smart streetlights | 40,000 | 1 hour | 150 bytes | 144 MB | Time-series |
| Water quality | 400 | 30 min | 300 bytes | 5.8 MB | Time-series |
| Total sensors | 49,000 | - | - | ~260 TB/day | Mixed |
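The inventory reduces to count x messages/day x payload; a sketch with the values copied from the table:

```python
# Daily volume per source: (name, device count, messages per day, payload bytes)
sources = [
    ("traffic_cameras", 2_000, 30 * 86_400, 50_000),  # 30 fps, 50 KB/frame
    ("traffic_counters", 5_000, 1_440, 100),
    ("air_quality", 500, 288, 200),
    ("noise", 1_000, 8_640, 50),
    ("weather", 100, 96, 500),
    ("streetlights", 40_000, 24, 150),
    ("water_quality", 400, 48, 300),
]
daily_gb = {name: c * m * p / 1e9 for name, c, m, p in sources}
total = sum(daily_gb.values())
non_video = total - daily_gb["traffic_cameras"]
print(f"total ~{total / 1e3:.0f} TB/day; non-video {non_video:.2f} GB/day")
```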

Key insight: Video data (99.5% of volume) dominates storage costs but is accessed rarely (incident investigation only). Time-series sensor data (0.5% of volume) is accessed frequently for dashboards and analytics. This split demands a tiered approach.

What we do: Define hot, warm, and cold storage tiers with appropriate technologies for each data type.

Why: A single storage tier cannot cost-effectively serve all access patterns.

The following diagram shows the flow from 49,000 sensors through Kafka to two storage tracks (time-series and object storage), each with hot, warm, and cold tiers.

Three-tier architecture:

[49,000 Sensors]
       |
       v
[Ingestion Layer: Kafka]
       |
   +---+---+
   |       |
   v       v
[Time-Series]  [Object Storage]
   |               |
   v               v
+--+--+       +---+---+
|     |       |   |   |
v     v       v   v   v
[Hot] [Warm]  [Hot][Warm][Cold]
 7d   90d     24h  30d  Archive

Tier specifications:

| Tier | Technology | Data | Retention | Storage Cost | Query Latency |
|---|---|---|---|---|---|
| Hot (TS) | TimescaleDB SSD | Sensor readings | 7 days | $0.10/GB/mo | < 50ms |
| Warm (TS) | TimescaleDB HDD | Sensor readings | 90 days | $0.02/GB/mo | < 500ms |
| Cold (TS) | S3 + Parquet | Aggregated sensors | 7 years | $0.004/GB/mo | < 10s |
| Hot (Video) | MinIO SSD | Recent clips | 24 hours | $0.10/GB/mo | < 1s |
| Warm (Video) | MinIO HDD | Investigation window | 30 days | $0.02/GB/mo | < 5s |
| Cold (Video) | S3 Glacier | Archive | 90 days | $0.004/GB/mo | 3-5 hours |

What we do: Design partition keys for optimal query performance across different access patterns.

Why: Poor partitioning causes full-table scans (slow, expensive) or hot partitions (write bottlenecks).

Time-series partitioning (TimescaleDB hypertables):

-- Main sensor readings hypertable
CREATE TABLE sensor_readings (
    time        TIMESTAMPTZ NOT NULL,
    sensor_id   VARCHAR(32) NOT NULL,
    sensor_type VARCHAR(20) NOT NULL,
    location_id VARCHAR(20) NOT NULL,
    value       DOUBLE PRECISION,
    unit        VARCHAR(10),
    quality     SMALLINT DEFAULT 100
);

-- Create hypertable with 1-hour chunks (optimized for high-frequency data)
SELECT create_hypertable('sensor_readings', 'time',
    chunk_time_interval => INTERVAL '1 hour',
    create_default_indexes => FALSE);

-- Compound index for dashboard queries: "last hour by sensor type in district"
CREATE INDEX idx_type_location_time ON sensor_readings
    (sensor_type, location_id, time DESC);

-- Compression policy: compress after 24 hours
ALTER TABLE sensor_readings SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'sensor_id, sensor_type',
    timescaledb.compress_orderby = 'time DESC'
);
SELECT add_compression_policy('sensor_readings', INTERVAL '24 hours');

-- Retention policy: drop raw chunks after 90 days
-- (export/aggregation to S3 runs separately, before chunks are dropped)
SELECT add_retention_policy('sensor_readings', INTERVAL '90 days');

Object storage partitioning (video):

s3://smart-city-video/
├── raw/
│   └── year=2024/
│       └── month=01/
│           └── day=15/
│               └── camera_id=CAM-001/
│                   └── hour=00/
│                       ├── 00-00-00_00-05-00.mp4
│                       └── 00-05-00_00-10-00.mp4
├── processed/
│   ├── thumbnails/
│   ├── motion_clips/
│   └── analytics/
└── archive/ (Glacier, 90+ days)

What we do: Plan for schema changes as new sensor types are added and data requirements evolve.

Why: IoT deployments evolve constantly: new sensor types, additional fields, changed units.

Schema versioning approach:

-- Metadata table tracking schema versions
CREATE TABLE schema_versions (
    sensor_type     VARCHAR(20) NOT NULL,
    current_version INTEGER NOT NULL,
    schema_json     JSONB NOT NULL,
    effective_from  TIMESTAMPTZ NOT NULL,
    changelog       TEXT,
    PRIMARY KEY (sensor_type, current_version)
);

-- Example: Air quality schema evolution
INSERT INTO schema_versions VALUES
('air_quality', 1, '{
    "fields": ["pm25", "pm10", "co2", "temperature", "humidity"],
    "units": {"pm25": "ug/m3", "pm10": "ug/m3", "co2": "ppm"}
}', '2024-01-01', 'Initial deployment'),

('air_quality', 2, '{
    "fields": ["pm25", "pm10", "pm1", "co2", "no2", "o3", "temperature", "humidity", "pressure"],
    "units": {"pm25": "ug/m3", "pm1": "ug/m3", "no2": "ppb", "o3": "ppb", "pressure": "hPa"}
}', '2024-06-01', 'Added PM1, NO2, O3, pressure from new sensor model');

Forward compatibility rules:

  1. Adding fields: Always optional with defaults (NULL or computed)
  2. Changing units: Store original + converted, never overwrite historical
  3. Renaming fields: Create view alias, deprecate old name over 6 months
  4. Removing fields: Stop populating, retain historical for query compatibility
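One way to honor rule 1 when reading records written under different schema versions is to project every record onto the latest schema, padding missing fields with NULL. A minimal sketch (the `normalize` helper and the field lists are illustrative, not part of any library):

```python
# Sketch: normalizing readings written under different schema versions.
# Rule 1 above: fields added in later versions are optional with defaults,
# so a v1 record is padded with None rather than rejected.
SCHEMAS = {
    1: ["pm25", "pm10", "co2", "temperature", "humidity"],
    2: ["pm25", "pm10", "pm1", "co2", "no2", "o3",
        "temperature", "humidity", "pressure"],
}
LATEST = max(SCHEMAS)

def normalize(record: dict, version: int) -> dict:
    """Project a record onto the latest schema, defaulting missing fields."""
    known = set(SCHEMAS[version])
    return {f: record.get(f) if f in known else None for f in SCHEMAS[LATEST]}

v1_reading = {"pm25": 12.0, "pm10": 20.5, "co2": 410,
              "temperature": 21.3, "humidity": 55}
row = normalize(v1_reading, version=1)
print(row["pm25"], row["no2"])   # 12.0 None
```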

What we do: Calculate storage costs and implement aggressive cost reduction through compression, aggregation, and tiering.

Why: At 260 TB/day, naive storage costs would exceed $1M/month. Smart optimization reduces this by 95%+.

Storage cost analysis (before optimization):

Assuming a blended storage cost of $0.02/GB/month and full data retention (using 365-day year):

| Data Type | Daily Volume | Steady-State (1 yr) | Monthly Storage Cost |
|---|---|---|---|
| Video raw | 259 TB | 94,535 TB | $1,890,700 |
| Sensors | 1.4 GB | 511 GB | $10 |
| Total | 259 TB | 94,535 TB | $1,890,710/month |

Optimization strategies applied:

| Strategy | Target | Reduction | Implementation |
|---|---|---|---|
| Video motion detection | Keep only motion clips | 99% | Edge AI filters static footage; motion clips average ~1% of continuous recording duration |
| Video resolution tiering | 4K to 1080p to 480p by age | 75% | Transcode at 24h, 7d boundaries |
| Video retention policy | Delete after 90 days | 92% | Lifecycle policy (compliance exempt) |
| Sensor compression | TimescaleDB native | 92% | Segment by sensor_id |
| Sensor downsampling | 1-min to 1-hour to 1-day | 99% | Continuous aggregates for cold tier |

Cost after optimization:

| Tier | Data Type | Retention | Steady-State Volume | $/GB/month | Monthly Cost |
|---|---|---|---|---|---|
| Hot TS | Sensors | 7 days | 10 GB | $0.10 | $1.00 |
| Warm TS | Sensors | 90 days | 10 GB (compressed) | $0.02 | $0.20 |
| Cold TS | Aggregates | 7 years | 500 GB | $0.004 | $2.00 |
| Hot Video | Motion clips | 24 hours | 2.6 TB | $0.10 | $260 |
| Warm Video | Motion clips | 30 days | 78 TB | $0.02 | $1,560 |
| Cold Video | Compliance* | 90 days | 78 TB | $0.004 | $312 |
| Total | - | - | ~159 TB | - | ~$2,135/month |

*Environmental video exempt from 90-day deletion (compliance requirement)

Savings: $1,890,710/month to ~$2,135/month = 99.9% reduction (~$22.7M/year saved)
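The before/after bill follows directly from the two cost tables; a sketch reproducing the arithmetic:

```python
# Before/after storage bill for the smart city data lake.
BLENDED = 0.02  # $/GB/month, unoptimized blended rate
# 259 TB/day video + 1.4 GB/day sensors, held for a 365-day year.
unoptimized = 259e3 * 365 * BLENDED + 511 * BLENDED   # ~$1.89M/month

# Optimized tiers: (steady-state GB, $/GB/month) from the cost table.
tiers = [
    (10, 0.10), (10, 0.02), (500, 0.004),            # sensor hot/warm/cold
    (2_600, 0.10), (78_000, 0.02), (78_000, 0.004),  # video hot/warm/cold
]
optimized = sum(gb * rate for gb, rate in tiers)
print(f"${unoptimized:,.0f} -> ${optimized:,.0f}/month "
      f"({1 - optimized / unoptimized:.1%} reduction)")
```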


Outcome: A production-ready data lake architecture serving 49,000 sensors with sub-100ms dashboard latency, 7-year compliance retention, and ~$2,135/month storage cost (vs. $1.89M/month unoptimized).

Key decisions made and why:

| Decision | Choice | Rationale |
|---|---|---|
| Time-series database | TimescaleDB | SQL familiarity for city analysts, native compression, continuous aggregates |
| Object storage | MinIO + S3 | On-prem for high-bandwidth video, S3 for archive |
| Video processing | Edge motion detection | 99% volume reduction at source, keeps only relevant footage |
| Partitioning | Time-first, then sensor | Matches 95% of query patterns (time-range dashboards) |
| Schema evolution | JSONB + version table | Maximum flexibility without migration downtime |
| Cold storage format | Parquet | Columnar compression, Spark/Presto compatible for ML |

Figure 8.1: Smart city data lake architecture. Data flows from 49,000 sensors through a Kafka ingestion layer into tiered storage: TimescaleDB for time-series data (hot, warm, and cold tiers) and MinIO/S3 for video data (hot, warm, and Glacier archive).

Quantified outcomes:

  • Real-time dashboard latency: 45ms P99 (target: < 100ms)
  • Historical query (1-year range): 2.3s (target: < 5s)
  • Video clip retrieval: 8s hot, 25s warm (target: < 30s)
  • Monthly storage cost: ~$2,135 (99.9% reduction from naive approach)
  • Compliance coverage: 7 years environmental sensor data, 90 days video with compliance exemptions
  • Schema migrations: Zero downtime (additive changes only)

Key Takeaway

Successful IoT database architecture starts with quantifying your data: calculate ingestion rates, classify query patterns, estimate storage volumes, and define retention requirements. Then match technologies to workloads – a single “best” database rarely exists. Tiered storage (hot/warm/cold) combined with compression, aggregation, and edge processing can reduce storage costs by 95-99%+ while meeting all latency requirements.

Designing a database architecture is like planning the perfect kitchen for a restaurant – you need the right tools in the right places!

8.4.1 The Sensor Squad Adventure: Building the Perfect Data Kitchen

The Sensor Squad was hired to build data storage for two big projects: tracking 10,000 delivery trucks AND monitoring 49,000 city sensors!

“That’s a LOT of data!” said Sammy the Sensor. “How do we store it all without spending a fortune?”

Max the Microcontroller became the architect. “First, let’s figure out WHAT data we have and HOW MUCH.”

For the delivery trucks:

  • “Each truck reports its GPS location every 10 seconds. That’s 1,000 messages every second!”
  • “We need to show trucks on a MAP, so we need a database that understands geography!”

“That’s like needing a kitchen with BOTH an oven AND a grill!” said Lila the LED. “We picked TimescaleDB because it handles time-data AND maps!”

For the city sensors:

  • “49,000 sensors send data all day. Plus 2,000 cameras recording video!”
  • “The video takes up 99% of the storage space but is only needed when something goes wrong.”

Bella the Battery had a brilliant cost-saving idea: “Let’s organize storage like a CLOSET!”

  • Top shelf (HOT storage): This week’s data – fast to reach, but expensive space
  • Middle shelf (WARM storage): This month’s data – a bit slower, much cheaper
  • Bottom shelf (COLD storage): Old data in boxes – slow to unpack, but super cheap!

“And for video, let’s use AI to keep ONLY the interesting clips!” added Sammy. “That cuts storage by 99%!”

The result? Instead of spending $1.9 MILLION per month on storage, they spent only about $2,100 per month. The Squad saved 99.9%!

“The secret,” said Max, “is matching the RIGHT storage to each type of data, just like using the right container for each food in your fridge!”

8.4.2 Key Words for Kids

| Word | What It Means |
|---|---|
| Architecture | The plan for how a system is built – like the blueprint for a house |
| Tiered Storage | Storing data in different places based on how often you need it – like keeping snacks nearby but bulk food in the basement |
| Cost Optimization | Finding the cheapest way to do something well – like buying in bulk to save money |

8.5 Deep Dives

The following sections explore specific topics that arose in the worked examples above: how indexing transforms query performance, when to use continuous aggregates, and how to avoid costly database migrations.

Scenario: Fleet management system queries “show all vehicles in geofence polygon X in the last hour” 100 times/minute.

Without Spatial Index:

SELECT vehicle_id, location, speed_kmh, time
FROM vehicle_telemetry
WHERE time > NOW() - INTERVAL '1 hour'
  AND ST_Within(location::geometry, 'POLYGON((...))'::geometry);

-- Query plan: Sequential scan of 3.6M rows (10K vehicles x 360 readings/hour at 10s intervals)
-- Execution time: 8.5 seconds

With Spatial Index (GIST):

CREATE INDEX idx_location ON vehicle_telemetry USING GIST (location);

-- Query plan: Index scan of ~50 rows (vehicles in polygon)
-- Execution time: 12 milliseconds

Performance Calculation:

Queries per minute: 100
Without index: 100 queries x 8.5s = 850 seconds CPU (14 minutes)
With index: 100 queries x 0.012s = 1.2 seconds CPU

Speedup: 8.5s / 0.012s = 708x faster
CPU savings: 850s - 1.2s = 848.8 seconds/minute saved
Database load: 99.86% reduction
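The speedup and load-reduction figures follow from the per-query times; as a quick check:

```python
# CPU-time arithmetic behind the spatial-index comparison above.
qpm = 100                  # geofence queries per minute
without = qpm * 8.5        # 850 s of query CPU per minute -- oversubscribed
with_idx = qpm * 0.012     # 1.2 s of query CPU per minute
print(f"speedup {8.5 / 0.012:.0f}x, "
      f"load reduction {1 - with_idx / without:.2%}")
```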

Cost Analysis:

  • Index creation time: 45 seconds (one-time)
  • Index storage: 120 MB (0.02% of table size)
  • Index maintenance: 2% write overhead
  • Query improvement: 708x faster

Key Insight: Spatial indexes transform IoT geofencing from unusable (8.5s) to real-time (12ms) at negligible cost.

| Characteristic | Continuous Aggregate | Materialized View | Which to Choose? |
|---|---|---|---|
| Data freshness | Auto-refresh (5 min) | Manual REFRESH | Continuous if need recent data |
| Compute cost | Incremental (only new data) | Full recompute | Continuous for large datasets |
| Setup complexity | Medium (policies) | Low (simple SQL) | Materialized for prototypes |
| Use case | Real-time dashboards | Daily/weekly reports | Continuous for production |
| Storage | Stores aggregate only | Stores full result | Continuous saves space |

Example Decision:

Scenario: Fleet dashboard showing “Average speed by vehicle in last 24 hours”

Option A: Real-time query (no pre-aggregation):

SELECT vehicle_id, AVG(speed_kmh) FROM vehicle_telemetry
WHERE time > NOW() - INTERVAL '24 hours'
GROUP BY vehicle_id;
-- Scans 86.4M rows (10K vehicles x 8,640 readings/day at 10s intervals)
-- Query time: 4.2 seconds

Option B: Continuous aggregate (pre-computed 5-min buckets):

CREATE MATERIALIZED VIEW vehicle_speed_5min
WITH (timescaledb.continuous) AS
SELECT time_bucket('5 minutes', time) AS bucket, vehicle_id, AVG(speed_kmh) as avg_speed
FROM vehicle_telemetry GROUP BY bucket, vehicle_id;

-- Dashboard queries 2.88M pre-aggregated rows (10K vehicles x 288 buckets per day)
-- Query time: 18 milliseconds
-- Refresh cost: 1 second every 5 minutes (incremental)

Choice: Use continuous aggregate (30x fewer rows to scan, significantly faster query times, always fresh).

Key Insight: For frequently-run IoT queries, continuous aggregates pay for themselves after ~10 queries.
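The row-count arithmetic behind this choice, as a quick check:

```python
# Rows scanned by the 24-hour dashboard query, raw vs. pre-aggregated.
raw_rows = 10_000 * 8_640   # 86.4M raw readings per 24 h (10 s interval)
agg_rows = 10_000 * 288     # 2.88M 5-minute buckets per 24 h
print(f"{raw_rows // agg_rows}x fewer rows; "
      f"query {4.2 / 0.018:.0f}x faster")
```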

Common Mistake: Not Planning for Data Migration

The Error: Starting with InfluxDB for simplicity, then realizing you need SQL joins after storing 5 TB of data.

Migration Nightmare:

Month 0: Deploy InfluxDB (fast, works great)
Month 6: Need to join sensor data with customer billing
Month 7: Realize InfluxDB cannot join with PostgreSQL easily
Month 8: Proposal: Migrate 5 TB to TimescaleDB
Month 9: Export attempt: 2 weeks continuous export (InfluxDB chokes)
Month 10: Incremental export + live dual-write (complex, error-prone)
Month 12: Migration complete, 3 months of dev time, $50K cost

Prevention: Start with flexible architecture using message queues:

[Sensors] -> [Kafka/MQTT] -> [Stream Processor] +-> [InfluxDB] (time-series)
                                                +-> [PostgreSQL] (metadata)
                                                +-> [MongoDB] (events)

Benefits:

  • Add/remove databases without sensor changes
  • Replay history to new database if needed
  • Easy A/B testing of database options

Migration Example (if you must):

# Step 1: Export InfluxDB to CSV (chunked by time)
# Note: InfluxDB v2 uses Flux; v3 uses SQL. Adjust query syntax to your version.
for month in {1..12}; do
    start=$(printf "2024-%02d-01T00:00:00Z" $month)
    if [ $month -eq 12 ]; then
        stop="2025-01-01T00:00:00Z"
    else
        stop=$(printf "2024-%02d-01T00:00:00Z" $((month+1)))
    fi
    influx query \
      "from(bucket: \"sensors\") |> range(start: ${start}, stop: ${stop})" \
      --raw > export_2024_$(printf "%02d" $month).csv
done

# Step 2: Import to TimescaleDB (transform CSV headers as needed)
for file in export_*.csv; do
    psql -c "\\COPY sensor_readings FROM '${file}' CSV HEADER"
done

# Step 3: Validate row counts match between source and target
influx query 'from(bucket: "sensors") |> count()' --raw
psql -c "SELECT COUNT(*) FROM sensor_readings;"

Time estimate: 5 TB at 10 MB/s = 500K seconds = 6 days export + 6 days import = 12 days
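The time estimate, as a quick check:

```python
# Export-time arithmetic for the 5 TB migration above.
tb = 5
throughput_mb_s = 10
seconds = tb * 1e6 / throughput_mb_s   # 500,000 s
days = seconds / 86_400                # ~5.8 days each way
print(f"{seconds:,.0f} s ≈ {days:.1f} days export + {days:.1f} days import")
```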

Key Insight: Migrating databases costs 10-100x more than choosing correctly upfront. Plan for growth.

Key Concepts
  • Fleet Telemetry Schema: A time-series table design for vehicle IoT data using device ID + timestamp as the composite key, with partitioning by vehicle group to balance write distribution and query co-location
  • Smart City Data Lake: A multi-tier storage architecture combining real-time streaming ingestion (Kafka), hot time-series storage (TimescaleDB), and cold archival (S3/Parquet) for city-scale sensor networks
  • Ingestion Pipeline: The data path from IoT device through broker (MQTT/Kafka) to database, including validation, enrichment, and routing stages that determine data quality and storage destination
  • Continuous Aggregate Materialization: Pre-computing hourly and daily rollups in TimescaleDB that auto-refresh as new sensor data arrives, enabling fleet-wide dashboards to query aggregates rather than billions of raw rows
  • Polyglot Persistence Pattern: Using PostgreSQL for device registry (relational), TimescaleDB for telemetry (time-series), Redis for real-time state (key-value), and S3/Parquet for cold archival (object) in a single production system
  • Partition Pruning: The query optimizer technique of skipping irrelevant time chunks based on WHERE clause time predicates, essential for achieving sub-second query times on multi-year IoT datasets
  • Write-Ahead Log Sizing: Configuring PostgreSQL/TimescaleDB WAL buffer and checkpoint settings to absorb bursty IoT write patterns without excessive I/O stalls during checkpoint operations
  • Tiered Compression Ratio: The actual storage reduction measured after applying columnar compression to IoT telemetry – typically 10-20x for float sensor values, 5-8x for mixed types – used to size storage budgets

Common Pitfalls

Storing telemetry for 50 different device types (vehicles, HVAC, water meters) in a single wide table with nullable columns for type-specific fields creates sparse, inefficient storage and slow queries. Use inheritance or separate hypertables per device category, joined through a common device registry.

An IoT worked example that performs perfectly with 100 simulated devices may collapse with 10,000 real devices due to lock contention, connection pool exhaustion, or disk I/O saturation. Always load test with realistic device counts, message rates, and concurrent query loads before cutover.

Running real-time SUM/AVG across all raw sensor data for a fleet dashboard query causes full table scans on billions of rows. Create continuous aggregates (hourly, daily) during initial schema design – retrofitting them into a running production system requires careful migration to avoid query downtime.

8.6 Summary

  • Fleet management example demonstrates database selection trade-offs (TimescaleDB vs InfluxDB vs MongoDB) with geospatial requirements as the deciding factor
  • Smart city data lake example shows tiered storage architecture reducing costs by 99.9% through edge processing, compression, and lifecycle policies
  • Schema evolution strategies enable gradual changes without breaking existing devices or queries
  • Cost optimization requires understanding access patterns to place data in appropriate storage tiers
  • Query optimization through continuous aggregates and proper indexing achieves sub-100ms latency

8.7 Concept Relationships

Prerequisites - Complete these first:

Related Concepts - Connect with:

Practical Applications:

8.8 What’s Next

You have completed the Data Storage and Databases chapter series. The worked examples here demonstrated how to translate requirements into concrete architectures. Your next step depends on where you want to go from here:

| If you want to… | Read this next |
|---|---|
| Scale storage for massive IoT datasets | Big Data Overview |
| Process data at the edge before storing it | Edge Compute Patterns |
| Explore cloud-native storage architectures | Data in the Cloud |
| Connect heterogeneous data sources and formats | Interoperability |
| Build real-time data pipelines with stream processing | Stream Processing |

8.9 Resources

8.9.1 Databases

8.9.2 Tools and Libraries

8.9.3 Books

  • “Designing Data-Intensive Applications” by Martin Kleppmann
  • “Database Internals” by Alex Petrov
  • “NoSQL Distilled” by Pramod Sadalage and Martin Fowler