1279  Data Storage Worked Examples

Learning Objectives

After completing this chapter, you will be able to:

  • Apply database selection frameworks to real-world IoT scenarios
  • Design complete storage architectures with appropriate tiering
  • Calculate storage requirements and costs for large-scale deployments
  • Implement schema designs optimized for specific query patterns

1279.1 Worked Example: Fleet Management System

1279.2 Choosing a Database for 10,000-Vehicle Fleet Tracking

Scenario: A logistics company is building a fleet management system for 10,000 delivery vehicles. Each vehicle reports GPS location, speed, fuel level, engine diagnostics, and driver behavior every 10 seconds. The system must support real-time vehicle tracking on a map, historical route analysis, geofencing alerts, and monthly compliance reports. The data must be retained for 3 years for regulatory compliance.

Goal: Select and design the optimal database architecture, comparing InfluxDB, TimescaleDB, and MongoDB for this use case, considering query patterns, data volumes, and operational costs.

What we do: Calculate data ingestion rate, storage requirements, and classify query patterns.

Why: Database selection depends heavily on data characteristics and access patterns.

Data ingestion analysis:

| Metric | Value | Calculation |
| --- | --- | --- |
| Vehicles | 10,000 | Fixed fleet size |
| Report interval | 10 seconds | GPS + telemetry |
| Messages per second | 1,000 | 10,000 / 10 |
| Payload size (avg) | 250 bytes | GPS + 8 sensor values + metadata |
| Bytes per second | 250 KB/s | 1,000 x 250 bytes |
| Daily volume (raw) | 21.6 GB | 250 KB/s x 86,400 s |
| Monthly volume | 648 GB | 21.6 GB x 30 |
| 3-year retention | 23.3 TB | 648 GB x 36 |
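
These figures follow directly from the stated fleet size, report interval, and payload size. As a quick sanity check, here is a minimal back-of-envelope script (illustrative only, decimal units):

# Back-of-envelope check of the ingestion and retention figures above.
VEHICLES = 10_000
REPORT_INTERVAL_S = 10        # one report per vehicle every 10 seconds
PAYLOAD_BYTES = 250           # GPS + 8 sensor values + metadata

messages_per_s = VEHICLES / REPORT_INTERVAL_S        # 1,000 msg/s
bytes_per_s = messages_per_s * PAYLOAD_BYTES         # 250 KB/s
daily_gb = bytes_per_s * 86_400 / 1e9                # ~21.6 GB/day
monthly_gb = daily_gb * 30                           # ~648 GB/month
retention_tb = monthly_gb * 36 / 1e3                 # ~23.3 TB over 3 years

print(f"{messages_per_s:,.0f} msg/s, {daily_gb:.1f} GB/day, "
      f"{monthly_gb:.0f} GB/month, {retention_tb:.1f} TB over 3 years")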

Query pattern classification:

| Query Type | Frequency | Latency Requirement | Data Range |
| --- | --- | --- | --- |
| Real-time map update | 100/sec | < 100 ms | Last reading per vehicle |
| Vehicle history (24 h) | 50/day | < 2 s | 8,640 points per vehicle |
| Geofence check | 1,000/sec | < 50 ms | Current position vs. polygons |
| Monthly fuel report | 10/month | < 30 s | 30 days, all vehicles |
| Compliance audit | 5/year | < 5 min | 1 year, full resolution |

Key insight: This workload combines time-series telemetry (95% of writes), real-time point queries (current state), and analytical aggregations (reports). A hybrid approach may be optimal.

What we do: Compare InfluxDB, TimescaleDB, and MongoDB against our requirements.

Why: Each database has strengths for different aspects of fleet management.

InfluxDB (Purpose-built time-series):

| Aspect | Rating | Notes |
| --- | --- | --- |
| Write throughput | Excellent | Built for high-frequency metrics |
| Time-range queries | Excellent | Native time-series optimizations |
| Point queries (latest) | Good | Requires careful schema design |
| Geospatial queries | Poor | No native geo support |
| Compression | Excellent | 10-20x for numeric data |
| SQL compatibility | Limited | Flux query language (learning curve) |

TimescaleDB (PostgreSQL extension):

| Aspect | Rating | Notes |
| --- | --- | --- |
| Write throughput | Very Good | 100K+ rows/sec with hypertables |
| Time-range queries | Excellent | Automatic time partitioning |
| Point queries (latest) | Good | Continuous aggregates help |
| Geospatial queries | Excellent | PostGIS integration |
| Compression | Very Good | 90-95% for time-series |
| SQL compatibility | Full | Standard PostgreSQL SQL |

MongoDB (Document database):

| Aspect | Rating | Notes |
| --- | --- | --- |
| Write throughput | Very Good | Sharding scales horizontally |
| Time-range queries | Good | Requires proper indexing |
| Point queries (latest) | Excellent | Flexible document model |
| Geospatial queries | Very Good | Native geospatial indexes |
| Compression | Good | WiredTiger compression |
| SQL compatibility | None | MongoDB Query Language |

What we do: Design schemas for the recommended database (TimescaleDB) with alternatives.

Why: Schema design significantly impacts query performance and storage efficiency.

Recommendation: TimescaleDB - Best balance of time-series performance, geospatial capabilities, and SQL familiarity.

TimescaleDB schema design:

-- Requires the TimescaleDB and PostGIS extensions
CREATE EXTENSION IF NOT EXISTS timescaledb;
CREATE EXTENSION IF NOT EXISTS postgis;

-- Main telemetry hypertable (time-partitioned)
CREATE TABLE vehicle_telemetry (
    time        TIMESTAMPTZ NOT NULL,
    vehicle_id  VARCHAR(20) NOT NULL,
    location    GEOGRAPHY(POINT, 4326),  -- PostGIS
    speed_kmh   SMALLINT,
    heading     SMALLINT,
    fuel_pct    SMALLINT,
    engine_rpm  SMALLINT,
    engine_temp SMALLINT,
    odometer_km INTEGER,
    driver_id   VARCHAR(20)
);

-- Convert to hypertable with 1-day chunks
SELECT create_hypertable('vehicle_telemetry', 'time',
    chunk_time_interval => INTERVAL '1 day');

-- Compression policy: compress after 7 days
ALTER TABLE vehicle_telemetry SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'vehicle_id'
);
SELECT add_compression_policy('vehicle_telemetry', INTERVAL '7 days');

-- Indexes for common queries
CREATE INDEX ON vehicle_telemetry (vehicle_id, time DESC);
CREATE INDEX ON vehicle_telemetry USING GIST (location);

-- Continuous aggregate for the latest position per vehicle (real-time map).
-- A continuous aggregate must group by time_bucket() on the hypertable's
-- time column, so readings are bucketed to 1 minute and the last reading
-- in each bucket is kept.
CREATE MATERIALIZED VIEW vehicle_latest
WITH (timescaledb.continuous) AS
SELECT vehicle_id,
       time_bucket('1 minute', time) AS bucket,
       last(time, time) AS last_seen,
       last(location, time) AS location,
       last(speed_kmh, time) AS speed_kmh
FROM vehicle_telemetry
GROUP BY vehicle_id, time_bucket('1 minute', time);

-- Refresh every 10 seconds for near-real-time
SELECT add_continuous_aggregate_policy('vehicle_latest',
    start_offset => INTERVAL '1 hour',
    end_offset => INTERVAL '10 seconds',
    schedule_interval => INTERVAL '10 seconds');
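
To give a feel for the write path against this schema, here is a minimal ingestion sketch using psycopg2 batch inserts. The connection string and sample row are placeholders; a production collector would pull batches from the message broker and add retries.

import psycopg2
from psycopg2.extras import execute_values

# Minimal batch-insert sketch against the vehicle_telemetry hypertable above.
# The DSN and the sample row are placeholders.
conn = psycopg2.connect("dbname=fleet user=ingest host=localhost")

rows = [
    # (time, vehicle_id, lon, lat, speed_kmh, heading, fuel_pct,
    #  engine_rpm, engine_temp, odometer_km, driver_id)
    ("2024-01-15T08:00:00Z", "VEH-001", 13.4050, 52.5200, 54, 270, 81,
     2100, 92, 123456, "DRV-042"),
]

with conn, conn.cursor() as cur:
    execute_values(
        cur,
        """
        INSERT INTO vehicle_telemetry
            (time, vehicle_id, location, speed_kmh, heading, fuel_pct,
             engine_rpm, engine_temp, odometer_km, driver_id)
        VALUES %s
        """,
        rows,
        # Build the PostGIS point from longitude/latitude for each row.
        template="(%s, %s, ST_SetSRID(ST_MakePoint(%s, %s), 4326)::geography, "
                 "%s, %s, %s, %s, %s, %s, %s)",
    )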

Storage tier design (3-year retention with cost optimization):

| Tier | Age | Resolution | Storage | Monthly Cost |
| --- | --- | --- | --- | --- |
| Hot | 0-7 days | 10-second | SSD | $50/TB |
| Warm | 7-90 days | 10-second (compressed) | HDD | $10/TB |
| Cold | 90 days - 3 years | 1-minute aggregates | Object storage | $2/TB |

Storage cost calculation:

| Tier | Data Volume | Storage Cost | Notes |
| --- | --- | --- | --- |
| Hot (7 days) | 150 GB | $7.50/month | Uncompressed, fast SSD |
| Warm (83 days) | 1.5 TB (compressed) | $15/month | 90% compression |
| Cold (2.75 years) | 2 TB | $4/month | Downsampled 6x |
| Total | ~3.7 TB effective | ~$26.50/month | vs. $1,165/mo uncompressed |
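
The tier totals can be cross-checked in a couple of lines, using the stored volumes and the $/TB rates from the two tables above (illustrative arithmetic only):

# Cross-check of the tiered storage cost using the volumes and rates above.
tiers = {
    # tier: (stored volume in TB, price in $ per TB per month)
    "hot":  (0.15, 50.0),
    "warm": (1.5, 10.0),
    "cold": (2.0, 2.0),
}

monthly = {name: vol * rate for name, (vol, rate) in tiers.items()}
total = sum(monthly.values())                  # ~$26.50/month
uncompressed = 23.3 * 50.0                     # 3 years of raw data on SSD: $1,165/month

print(monthly, f"total ${total:.2f}/mo vs ${uncompressed:.0f}/mo uncompressed")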

What we do: Write optimized queries for each use case.

Why: Database choice is only half the solution; query design determines actual performance.

Real-time map update (< 100ms requirement):

-- Latest position per vehicle from the continuous aggregate
SELECT DISTINCT ON (vehicle_id)
       vehicle_id, location, speed_kmh, last_seen
FROM vehicle_latest
WHERE last_seen > NOW() - INTERVAL '2 minutes'
ORDER BY vehicle_id, bucket DESC;

-- Performance: < 10ms for 10,000 vehicles (indexed materialized view)

Vehicle 24-hour history (< 2s requirement):

-- Time-bounded query with vehicle filter
SELECT time, location, speed_kmh, fuel_pct
FROM vehicle_telemetry
WHERE vehicle_id = 'VEH-001'
  AND time > NOW() - INTERVAL '24 hours'
ORDER BY time;

-- Performance: < 500ms (8,640 rows from single day chunk)

Geofence check (< 50ms requirement):

-- Check whether each recently seen vehicle is inside a geofenced area
-- (assumes a geofence_zones table with zone_name and a GiST-indexed
--  boundary geometry column)
SELECT v.vehicle_id, g.zone_name
FROM vehicle_latest v
JOIN geofence_zones g ON ST_Within(v.location::geometry, g.boundary)
WHERE v.last_seen > NOW() - INTERVAL '1 minute';

-- Performance: < 30ms with spatial index

Monthly fuel consumption report (< 30s requirement):

-- Aggregate fuel consumption per vehicle
SELECT vehicle_id,
       MIN(fuel_pct) AS min_fuel,
       MAX(fuel_pct) AS max_fuel,
       SUM(CASE WHEN fuel_pct_diff > 0 THEN fuel_pct_diff ELSE 0 END) AS refuel_total
FROM (
    SELECT vehicle_id, fuel_pct,
           fuel_pct - LAG(fuel_pct) OVER (PARTITION BY vehicle_id ORDER BY time) AS fuel_pct_diff
    FROM vehicle_telemetry
    WHERE time > NOW() - INTERVAL '30 days'
) sub
GROUP BY vehicle_id;

-- Performance: < 15s (parallelized across 30 day-chunks)

Outcome: TimescaleDB (PostgreSQL with the PostGIS extension) selected as the optimal database for fleet management, with a tiered storage strategy reducing storage costs by 97%.

Key decisions made and why:

| Decision | Choice | Rationale |
| --- | --- | --- |
| Primary database | TimescaleDB | Best combination of time-series + geospatial + SQL |
| Why not InfluxDB | No geospatial support | Geofencing is critical; would need a secondary database |
| Why not MongoDB | Not ideal for time-series | Time-range aggregations less efficient |
| Storage strategy | 3-tier (hot/warm/cold) | 97% cost reduction while meeting latency SLAs |
| Real-time queries | Continuous aggregates | Pre-computed latest position; < 10 ms response |

Quantified outcomes:

  • Write throughput: 1,000 messages/second sustained (headroom to 10K/s)
  • Real-time map latency: < 10ms (continuous aggregate)
  • 24-hour history query: < 500ms (single chunk access)
  • Monthly report generation: < 15s (parallel chunk processing)
  • Storage cost: $26.50/month (vs. $1,165/month uncompressed)
  • 3-year retention: ~3.7 TB effective storage (down from 23.3 TB raw)

1279.3 Worked Example: Smart City Data Lake

1279.4 Designing a Data Lake for a Smart City with 50,000 Sensors

Scenario: A city government is deploying a smart city platform with 50,000 sensors generating diverse data types: traffic flow (video + counts), air quality (particulate matter, CO2, ozone), noise levels, weather stations, smart streetlights, and water quality monitors. The platform must support real-time dashboards for city operations, historical analysis for urban planning, ML model training for predictive services, and compliance with data retention regulations (7 years for environmental data).

Goal: Design a cost-optimized data lake architecture with appropriate storage tiers, partitioning strategy, query patterns, and schema evolution approach that balances performance, cost, and regulatory compliance.

What we do: Catalog all data sources, estimate volumes, and classify by access patterns and retention requirements.

Why: Storage tier selection and partitioning strategy depend heavily on data characteristics.

Data source inventory:

| Source | Count | Frequency | Payload | Daily Volume | Data Type |
| --- | --- | --- | --- | --- | --- |
| Traffic cameras | 2,000 | 30 fps video | 50 KB/frame | 259 TB | Unstructured |
| Traffic counters | 5,000 | 1 min | 100 bytes | 720 MB | Time-series |
| Air quality | 500 | 5 min | 200 bytes | 29 MB | Time-series |
| Noise sensors | 1,000 | 10 sec | 50 bytes | 432 MB | Time-series |
| Weather stations | 100 | 15 min | 500 bytes | 4.8 MB | Time-series |
| Smart streetlights | 40,000 | 1 hour | 150 bytes | 144 MB | Time-series |
| Water quality | 400 | 30 min | 300 bytes | 5.8 MB | Time-series |
| Total sensors | 49,000 | - | - | ~260 TB/day | Mixed |

Key insight: Video data (99.5% of volume) dominates storage costs but is accessed rarely (incident investigation only). Time-series sensor data (0.5% of volume) is accessed frequently for dashboards and analytics. This split demands a tiered approach.
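
The daily-volume column can be reproduced with a short script; the counts, rates, and payload sizes below come straight from the inventory, and small rounding differences against the table are expected (decimal units):

# Reproduce the daily-volume column of the data source inventory.
sources = {
    # name: (device count, messages or frames per device per day, payload bytes)
    "traffic_cameras":  (2_000, 30 * 86_400, 50_000),   # 30 fps, 50 KB/frame
    "traffic_counters": (5_000, 1_440, 100),
    "air_quality":      (500, 288, 200),
    "noise_sensors":    (1_000, 8_640, 50),
    "weather_stations": (100, 96, 500),
    "streetlights":     (40_000, 24, 150),
    "water_quality":    (400, 48, 300),
}

daily_bytes = {n: count * msgs * size for n, (count, msgs, size) in sources.items()}
total_tb = sum(daily_bytes.values()) / 1e12
non_video_gb = (sum(daily_bytes.values()) - daily_bytes["traffic_cameras"]) / 1e9

print(f"total ~{total_tb:.0f} TB/day, non-video sensor data ~{non_video_gb:.1f} GB/day")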

What we do: Define hot, warm, and cold storage tiers with appropriate technologies for each data type.

Why: A single storage tier cannot cost-effectively serve all access patterns.

Three-tier architecture:

[50,000 Sensors]
       |
       v
[Ingestion Layer: Kafka]
       |
   +---+---+
   |       |
   v       v
[Time-Series]  [Object Storage]
   |               |
   v               v
+--+--+       +---+---+
|     |       |   |   |
v     v       v   v   v
[Hot] [Warm]  [Hot][Warm][Cold]
 7d   90d     24h  30d  Archive

Tier specifications:

| Tier | Technology | Data | Retention | Storage Cost | Query Latency |
| --- | --- | --- | --- | --- | --- |
| Hot (TS) | TimescaleDB (SSD) | Sensor readings | 7 days | $0.10/GB/mo | < 50 ms |
| Warm (TS) | TimescaleDB (HDD) | Sensor readings | 90 days | $0.02/GB/mo | < 500 ms |
| Cold (TS) | S3 + Parquet | Aggregated sensors | 7 years | $0.004/GB/mo | < 10 s |
| Hot (Video) | MinIO (SSD) | Recent clips | 24 hours | $0.10/GB/mo | < 1 s |
| Warm (Video) | MinIO (HDD) | Investigation window | 30 days | $0.02/GB/mo | < 5 s |
| Cold (Video) | S3 Glacier | Archive | 90 days | $0.004/GB/mo | 3-5 hours |
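
One way an ingestion or lifecycle service might encode this table is a simple routing function from data category and age to a target tier. The category names and tier labels below are illustrative; they mirror the table, not any product API:

from datetime import timedelta

# Map a record's category and age to a storage tier, mirroring the tier table.
def storage_tier(category: str, age: timedelta) -> str:
    if category == "timeseries":
        if age <= timedelta(days=7):
            return "timescaledb-ssd (hot)"
        if age <= timedelta(days=90):
            return "timescaledb-hdd (warm)"
        return "s3-parquet (cold, aggregated)"
    if category == "video":
        if age <= timedelta(hours=24):
            return "minio-ssd (hot)"
        if age <= timedelta(days=30):
            return "minio-hdd (warm)"
        return "s3-glacier (cold archive)"
    raise ValueError(f"unknown category: {category}")

print(storage_tier("video", timedelta(days=3)))   # minio-hdd (warm)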

What we do: Design partition keys for optimal query performance across different access patterns.

Why: Poor partitioning causes full-table scans (slow, expensive) or hot partitions (write bottlenecks).

Time-series partitioning (TimescaleDB hypertables):

-- Main sensor readings hypertable
CREATE TABLE sensor_readings (
    time        TIMESTAMPTZ NOT NULL,
    sensor_id   VARCHAR(32) NOT NULL,
    sensor_type VARCHAR(20) NOT NULL,
    location_id VARCHAR(20) NOT NULL,
    value       DOUBLE PRECISION,
    unit        VARCHAR(10),
    quality     SMALLINT DEFAULT 100
);

-- Create hypertable with 1-hour chunks (optimized for high-frequency data)
SELECT create_hypertable('sensor_readings', 'time',
    chunk_time_interval => INTERVAL '1 hour',
    create_default_indexes => FALSE);

-- Compound index for dashboard queries: "last hour by sensor type in district"
CREATE INDEX idx_type_location_time ON sensor_readings
    (sensor_type, location_id, time DESC);

-- Compression policy: compress after 24 hours
ALTER TABLE sensor_readings SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'sensor_id, sensor_type',
    timescaledb.compress_orderby = 'time DESC'
);
SELECT add_compression_policy('sensor_readings', INTERVAL '24 hours');

-- Retention policy: drop raw chunks after 90 days
-- (add_retention_policy only drops data; aging chunks are exported to the
--  S3 Parquet cold tier by a separate job before they expire)
SELECT add_retention_policy('sensor_readings', INTERVAL '90 days');
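
Because add_retention_policy only drops data, the "move to S3" step needs a separate export job that writes aging data to Parquet before the chunks expire. A minimal sketch is shown below; the DSN, bucket name, and daily window are placeholders, and it assumes psycopg2, pandas with pyarrow, and boto3 are available.

import boto3
import pandas as pd
import psycopg2

# Export one day of sensor readings to Parquet and upload it to the cold tier.
# Run for data approaching the 90-day retention limit, before it is dropped.
DAY = "2024-01-15"                                   # placeholder export window
conn = psycopg2.connect("dbname=smartcity user=archiver host=localhost")

df = pd.read_sql(
    """
    SELECT time, sensor_id, sensor_type, location_id, value, unit, quality
    FROM sensor_readings
    WHERE time >= %(day)s::date
      AND time < %(day)s::date + INTERVAL '1 day'
    """,
    conn,
    params={"day": DAY},
)

local_path = f"/tmp/sensor_readings_{DAY}.parquet"
df.to_parquet(local_path, index=False)               # columnar, Spark/Presto friendly

s3 = boto3.client("s3")
s3.upload_file(                                      # hypothetical bucket/prefix
    local_path,
    "smart-city-timeseries",
    f"cold/year={DAY[:4]}/month={DAY[5:7]}/day={DAY[8:]}/readings.parquet",
)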

Object storage partitioning (video):

s3://smart-city-video/
β”œβ”€β”€ raw/
β”‚   └── year=2024/
β”‚       └── month=01/
β”‚           └── day=15/
β”‚               └── camera_id=CAM-001/
β”‚                   └── hour=00/
β”‚                       β”œβ”€β”€ 00-00-00_00-05-00.mp4
β”‚                       └── 00-05-00_00-10-00.mp4
β”œβ”€β”€ processed/
β”‚   β”œβ”€β”€ thumbnails/
β”‚   β”œβ”€β”€ motion_clips/
β”‚   └── analytics/
└── archive/ (Glacier, 90+ days)
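
Clip uploaders can derive object keys from this layout with a small helper; the function below is just a sketch of the naming convention shown above:

from datetime import datetime, timezone

# Build the raw-video object key for a clip, following the partition layout above.
def raw_video_key(camera_id: str, clip_start: datetime, clip_end: datetime) -> str:
    return (
        f"raw/year={clip_start:%Y}/month={clip_start:%m}/day={clip_start:%d}/"
        f"camera_id={camera_id}/hour={clip_start:%H}/"
        f"{clip_start:%H-%M-%S}_{clip_end:%H-%M-%S}.mp4"
    )

start = datetime(2024, 1, 15, 0, 0, 0, tzinfo=timezone.utc)
end = datetime(2024, 1, 15, 0, 5, 0, tzinfo=timezone.utc)
print(raw_video_key("CAM-001", start, end))
# raw/year=2024/month=01/day=15/camera_id=CAM-001/hour=00/00-00-00_00-05-00.mp4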

What we do: Plan for schema changes as new sensor types are added and data requirements evolve.

Why: IoT deployments evolve constantly: new sensor types appear, fields are added, and units change over time.

Schema versioning approach:

-- Metadata table tracking schema versions (one row per sensor type and version)
CREATE TABLE schema_versions (
    sensor_type     VARCHAR(20) NOT NULL,
    version         INTEGER NOT NULL,
    schema_json     JSONB NOT NULL,
    effective_from  TIMESTAMPTZ NOT NULL,
    changelog       TEXT,
    PRIMARY KEY (sensor_type, version)
);

-- Example: Air quality schema evolution
INSERT INTO schema_versions VALUES
('air_quality', 1, '{
    "fields": ["pm25", "pm10", "co2", "temperature", "humidity"],
    "units": {"pm25": "ug/m3", "pm10": "ug/m3", "co2": "ppm"}
}', '2024-01-01', 'Initial deployment'),

('air_quality', 2, '{
    "fields": ["pm25", "pm10", "pm1", "co2", "no2", "o3", "temperature", "humidity", "pressure"],
    "units": {"pm25": "ug/m3", "pm1": "ug/m3", "no2": "ppb", "o3": "ppb", "pressure": "hPa"}
}', '2024-06-01', 'Added PM1, NO2, O3, pressure from new sensor model');

Forward compatibility rules:

  1. Adding fields: Always optional with defaults (NULL or computed)
  2. Changing units: Store original + converted, never overwrite historical
  3. Renaming fields: Create view alias, deprecate old name over 6 months
  4. Removing fields: Stop populating, retain historical for query compatibility
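
As a sketch of rule 1 in practice, a reader can normalize incoming payloads against the field list declared for their schema version, filling fields a device does not send with NULL/None. The field lists mirror the schema_versions rows above; the helper itself is hypothetical:

# Normalize a sensor payload against its declared schema version.
# Added fields are optional (rule 1), so older devices simply produce None.
SCHEMAS = {
    ("air_quality", 1): ["pm25", "pm10", "co2", "temperature", "humidity"],
    ("air_quality", 2): ["pm25", "pm10", "pm1", "co2", "no2", "o3",
                         "temperature", "humidity", "pressure"],
}

def normalize(sensor_type: str, version: int, payload: dict) -> dict:
    fields = SCHEMAS[(sensor_type, version)]
    unknown = set(payload) - set(fields)
    if unknown:
        raise ValueError(f"unexpected fields for v{version}: {unknown}")
    return {field: payload.get(field) for field in fields}

# An older air-quality device reporting v1 fields still validates against v2;
# the new fields (pm1, no2, o3, pressure) come back as None.
print(normalize("air_quality", 2,
                {"pm25": 12.1, "pm10": 20.4, "co2": 415,
                 "temperature": 21.5, "humidity": 48}))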

What we do: Calculate storage costs and implement aggressive cost reduction through compression, aggregation, and tiering.

Why: At 260 TB/day, naive storage costs would exceed $1M/month. Smart optimization reduces this by 95%+.

Storage cost analysis (before optimization):

| Data Type | Daily Volume | Monthly (30 d) | Yearly | Rate ($/GB) | Annual Cost |
| --- | --- | --- | --- | --- | --- |
| Video (raw) | 259 TB | 7,770 TB | 93,240 TB | $0.02 | $1,864,800 |
| Sensors | 1.4 GB | 42 GB | 504 GB | $0.02 | $10 |
| Total | 259 TB | 7,770 TB | 93,240 TB | - | $1,864,810 |

Optimization strategies applied:

| Strategy | Target | Reduction | Implementation |
| --- | --- | --- | --- |
| Video motion detection | Keep only motion clips | 90% | Edge AI filters out static footage |
| Video resolution tiering | 4K to 1080p to 480p by age | 75% | Transcode at 24 h and 7 d boundaries |
| Video retention policy | Delete after 90 days | 92% | Lifecycle policy (compliance footage exempt) |
| Sensor compression | TimescaleDB native | 92% | Segment by sensor_id |
| Sensor downsampling | 1-min to 1-hour to 1-day | 99% | Continuous aggregates for cold tier |

Cost after optimization:

| Tier | Data Type | Retention | Volume/Year | Unit Cost | Annual Cost |
| --- | --- | --- | --- | --- | --- |
| Hot TS | Sensors | 7 days | 10 GB | $0.10 | $1 |
| Warm TS | Sensors | 90 days | 100 GB | $0.02 | $2 |
| Cold TS | Aggregates | 7 years | 500 GB | $0.004 | $14 |
| Hot Video | Motion clips | 24 hours | 5 TB | $0.10 | $500 |
| Warm Video | Motion clips | 30 days | 150 TB | $0.02 | $3,000 |
| Cold Video | Compliance | 90 days* | 200 TB | $0.004 | $800 |
| Total | - | - | ~350 TB | - | $4,317 |

*Environmental video exempt from 90-day deletion (compliance requirement)

Savings: $1,864,810 to $4,317 = 99.8% reduction ($1.86M/year saved)
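
The headline figure follows directly from the two annual totals:

# Savings check using the before/after annual totals above.
before, after = 1_864_810, 4_317
print(f"${before - after:,} saved per year ({1 - after / before:.1%} reduction)")
# $1,860,493 saved per year (99.8% reduction)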

Outcome: A production-ready data lake architecture serving 50,000 sensors with sub-100ms dashboard latency, 7-year compliance retention, and $4,317/year storage cost (vs. $1.86M unoptimized).

Key decisions made and why:

| Decision | Choice | Rationale |
| --- | --- | --- |
| Time-series database | TimescaleDB | SQL familiarity for city analysts, native compression, continuous aggregates |
| Object storage | MinIO + S3 | On-prem for high-bandwidth video, S3 for archive |
| Video processing | Edge motion detection | 90% volume reduction at source; keeps only relevant footage |
| Partitioning | Time-first, then sensor | Matches 95% of query patterns (time-range dashboards) |
| Schema evolution | JSONB + version table | Maximum flexibility without migration downtime |
| Cold storage format | Parquet | Columnar compression; Spark/Presto-compatible for ML |

Architecture diagram:

%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#7F8C8D', 'secondaryColor': '#16A085', 'tertiaryColor': '#E67E22', 'fontSize': '12px'}}}%%
flowchart TB
    subgraph Sources["Data Sources"]
        SENSORS["50,000 Sensors"]
        CAMERAS["Traffic Cameras"]
        EDGE["Edge AI:<br/>Motion Detection"]
    end

    KAFKA["Apache Kafka"]

    subgraph TSPath["Time-Series Path"]
        TSHOT["TimescaleDB Hot<br/>(7 days)"]
        TSWARM["TimescaleDB Warm<br/>(90 days)"]
        TSCOLD["S3 Parquet<br/>(7 years)"]
    end

    subgraph VideoPath["Video Path"]
        VHOT["MinIO Hot Video<br/>(24 hours)"]
        VWARM["MinIO Warm Video<br/>(30 days)"]
        VCOLD["S3 Glacier<br/>(90 days archive)"]
    end

    subgraph MLPath["Analytics Path"]
        SPARK["Spark Streaming"]
        ML["ML Training<br/>(Weekly batch)"]
    end

    SENSORS --> KAFKA
    CAMERAS --> EDGE
    EDGE --> KAFKA

    KAFKA --> TSHOT
    KAFKA --> VHOT
    KAFKA --> SPARK

    TSHOT --> TSWARM
    TSWARM --> TSCOLD

    VHOT --> VWARM
    VWARM --> VCOLD

    SPARK --> ML

    style Sources fill:#2C3E50,stroke:#16A085,color:#fff
    style TSPath fill:#16A085,stroke:#2C3E50,color:#fff
    style VideoPath fill:#E67E22,stroke:#2C3E50,color:#fff
    style MLPath fill:#7F8C8D,stroke:#2C3E50,color:#fff
    style KAFKA fill:#2C3E50,stroke:#E67E22,color:#fff
    style EDGE fill:#E67E22,stroke:#2C3E50,color:#fff

Quantified outcomes:

  • Real-time dashboard latency: 45ms P99 (target: < 100ms)
  • Historical query (1-year range): 2.3s (target: < 5s)
  • Video clip retrieval: 8s hot, 45s warm (target: < 30s)
  • Annual storage cost: $4,317 (99.8% reduction from naive approach)
  • Compliance coverage: 7 years environmental data, 90 days video with compliance exemptions
  • Schema migrations: Zero downtime (additive changes only)

1279.5 Summary

  • Fleet management example demonstrates database selection trade-offs (TimescaleDB vs InfluxDB vs MongoDB) with geospatial requirements as the deciding factor
  • Smart city data lake example shows tiered storage architecture reducing costs by 99.8% through edge processing, compression, and lifecycle policies
  • Schema evolution strategies enable gradual changes without breaking existing devices or queries
  • Cost optimization requires understanding access patterns to place data in appropriate storage tiers
  • Query optimization through continuous aggregates and proper indexing achieves sub-100ms latency

1279.6 What’s Next

You’ve completed the Data Storage and Databases chapter series. Continue exploring:

1279.7 Resources

1279.7.1 Databases

1279.7.2 Tools and Libraries

1279.7.3 Books

  • β€œDesigning Data-Intensive Applications” by Martin Kleppmann
  • β€œDatabase Internals” by Alex Petrov
  • β€œNoSQL Distilled” by Pramod Sadalage and Martin Fowler