1279  Data Storage Worked Examples

Learning Objectives

After completing this chapter, you will be able to:

  • Apply database selection frameworks to real-world IoT scenarios
  • Design complete storage architectures with appropriate tiering
  • Calculate storage requirements and costs for large-scale deployments
  • Implement schema designs optimized for specific query patterns

1279.1 Worked Example: Fleet Management System

1279.2 Choosing a Database for 10,000-Vehicle Fleet Tracking

Scenario: A logistics company is building a fleet management system for 10,000 delivery vehicles. Each vehicle reports GPS location, speed, fuel level, engine diagnostics, and driver behavior every 10 seconds. The system must support real-time vehicle tracking on a map, historical route analysis, geofencing alerts, and monthly compliance reports. The data must be retained for 3 years for regulatory compliance.

Goal: Select and design the optimal database architecture, comparing InfluxDB, TimescaleDB, and MongoDB for this use case, considering query patterns, data volumes, and operational costs.

What we do: Calculate data ingestion rate, storage requirements, and classify query patterns.

Why: Database selection depends heavily on data characteristics and access patterns.

Data ingestion analysis:

| Metric | Value | Calculation |
| --- | --- | --- |
| Vehicles | 10,000 | Fixed fleet size |
| Report interval | 10 seconds | GPS + telemetry |
| Messages per second | 1,000 | 10,000 / 10 |
| Payload size (avg) | 250 bytes | GPS + 8 sensor values + metadata |
| Bytes per second | 250 KB/s | 1,000 x 250 bytes |
| Daily volume (raw) | 21.6 GB | 250 KB/s x 86,400 s |
| Monthly volume | 648 GB | 21.6 GB x 30 |
| 3-year retention | 23.3 TB | 648 GB x 36 |
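
These figures follow directly from the stated fleet size, report interval, and payload size. As a quick sanity check, here is a minimal back-of-envelope script (illustrative only, decimal units):

# Back-of-envelope check of the ingestion and retention figures above.
VEHICLES = 10_000
REPORT_INTERVAL_S = 10        # one report per vehicle every 10 seconds
PAYLOAD_BYTES = 250           # GPS + 8 sensor values + metadata

messages_per_s = VEHICLES / REPORT_INTERVAL_S        # 1,000 msg/s
bytes_per_s = messages_per_s * PAYLOAD_BYTES         # 250 KB/s
daily_gb = bytes_per_s * 86_400 / 1e9                # ~21.6 GB/day
monthly_gb = daily_gb * 30                           # ~648 GB/month
retention_tb = monthly_gb * 36 / 1e3                 # ~23.3 TB over 3 years

print(f"{messages_per_s:,.0f} msg/s, {daily_gb:.1f} GB/day, "
      f"{monthly_gb:.0f} GB/month, {retention_tb:.1f} TB over 3 years")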

Query pattern classification:

| Query Type | Frequency | Latency Requirement | Data Range |
| --- | --- | --- | --- |
| Real-time map update | 100/sec | < 100 ms | Last reading per vehicle |
| Vehicle history (24 h) | 50/day | < 2 s | 8,640 points per vehicle |
| Geofence check | 1,000/sec | < 50 ms | Current position vs. polygons |
| Monthly fuel report | 10/month | < 30 s | 30 days, all vehicles |
| Compliance audit | 5/year | < 5 min | 1 year, full resolution |

Key insight: This workload combines time-series telemetry (95% of writes), real-time point queries (current state), and analytical aggregations (reports). A hybrid approach may be optimal.

What we do: Compare InfluxDB, TimescaleDB, and MongoDB against our requirements.

Why: Each database has strengths for different aspects of fleet management.

InfluxDB (Purpose-built time-series):

| Aspect | Rating | Notes |
| --- | --- | --- |
| Write throughput | Excellent | Built for high-frequency metrics |
| Time-range queries | Excellent | Native time-series optimizations |
| Point queries (latest) | Good | Requires careful schema design |
| Geospatial queries | Poor | No native geo support |
| Compression | Excellent | 10-20x for numeric data |
| SQL compatibility | Limited | Flux query language (learning curve) |

TimescaleDB (PostgreSQL extension):

| Aspect | Rating | Notes |
| --- | --- | --- |
| Write throughput | Very Good | 100K+ rows/sec with hypertables |
| Time-range queries | Excellent | Automatic time partitioning |
| Point queries (latest) | Good | Continuous aggregates help |
| Geospatial queries | Excellent | PostGIS integration |
| Compression | Very Good | 90-95% for time-series |
| SQL compatibility | Full | Standard PostgreSQL SQL |

MongoDB (Document database):

| Aspect | Rating | Notes |
| --- | --- | --- |
| Write throughput | Very Good | Sharding scales horizontally |
| Time-range queries | Good | Requires proper indexing |
| Point queries (latest) | Excellent | Flexible document model |
| Geospatial queries | Very Good | Native geospatial indexes |
| Compression | Good | WiredTiger compression |
| SQL compatibility | None | MongoDB Query Language |

What we do: Design schemas for the recommended database (TimescaleDB) with alternatives.

Why: Schema design significantly impacts query performance and storage efficiency.

Recommendation: TimescaleDB - Best balance of time-series performance, geospatial capabilities, and SQL familiarity.

TimescaleDB schema design:

-- Requires the TimescaleDB and PostGIS extensions
CREATE EXTENSION IF NOT EXISTS timescaledb;
CREATE EXTENSION IF NOT EXISTS postgis;

-- Main telemetry hypertable (time-partitioned)
CREATE TABLE vehicle_telemetry (
    time        TIMESTAMPTZ NOT NULL,
    vehicle_id  VARCHAR(20) NOT NULL,
    location    GEOGRAPHY(POINT, 4326),  -- PostGIS
    speed_kmh   SMALLINT,
    heading     SMALLINT,
    fuel_pct    SMALLINT,
    engine_rpm  SMALLINT,
    engine_temp SMALLINT,
    odometer_km INTEGER,
    driver_id   VARCHAR(20)
);

-- Convert to hypertable with 1-day chunks
SELECT create_hypertable('vehicle_telemetry', 'time',
    chunk_time_interval => INTERVAL '1 day');

-- Compression policy: compress after 7 days
ALTER TABLE vehicle_telemetry SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'vehicle_id'
);
SELECT add_compression_policy('vehicle_telemetry', INTERVAL '7 days');

-- Indexes for common queries
CREATE INDEX ON vehicle_telemetry (vehicle_id, time DESC);
CREATE INDEX ON vehicle_telemetry USING GIST (location);

-- Continuous aggregate for the latest position per vehicle (real-time map).
-- A continuous aggregate must group by time_bucket() on the hypertable's
-- time column, so readings are bucketed to 1 minute and the last reading
-- in each bucket is kept.
CREATE MATERIALIZED VIEW vehicle_latest
WITH (timescaledb.continuous) AS
SELECT vehicle_id,
       time_bucket('1 minute', time) AS bucket,
       last(time, time) AS last_seen,
       last(location, time) AS location,
       last(speed_kmh, time) AS speed_kmh
FROM vehicle_telemetry
GROUP BY vehicle_id, time_bucket('1 minute', time);

-- Refresh every 10 seconds for near-real-time
SELECT add_continuous_aggregate_policy('vehicle_latest',
    start_offset => INTERVAL '1 hour',
    end_offset => INTERVAL '10 seconds',
    schedule_interval => INTERVAL '10 seconds');
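
To give a feel for the write path against this schema, here is a minimal ingestion sketch using psycopg2 batch inserts. The connection string and sample row are placeholders; a production collector would pull batches from the message broker and add retries.

import psycopg2
from psycopg2.extras import execute_values

# Minimal batch-insert sketch against the vehicle_telemetry hypertable above.
# The DSN and the sample row are placeholders.
conn = psycopg2.connect("dbname=fleet user=ingest host=localhost")

rows = [
    # (time, vehicle_id, lon, lat, speed_kmh, heading, fuel_pct,
    #  engine_rpm, engine_temp, odometer_km, driver_id)
    ("2024-01-15T08:00:00Z", "VEH-001", 13.4050, 52.5200, 54, 270, 81,
     2100, 92, 123456, "DRV-042"),
]

with conn, conn.cursor() as cur:
    execute_values(
        cur,
        """
        INSERT INTO vehicle_telemetry
            (time, vehicle_id, location, speed_kmh, heading, fuel_pct,
             engine_rpm, engine_temp, odometer_km, driver_id)
        VALUES %s
        """,
        rows,
        # Build the PostGIS point from longitude/latitude for each row.
        template="(%s, %s, ST_SetSRID(ST_MakePoint(%s, %s), 4326)::geography, "
                 "%s, %s, %s, %s, %s, %s, %s)",
    )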

Storage tier design (3-year retention with cost optimization):

| Tier | Age | Resolution | Storage | Monthly Cost |
| --- | --- | --- | --- | --- |
| Hot | 0-7 days | 10-second | SSD | $50/TB |
| Warm | 7-90 days | 10-second (compressed) | HDD | $10/TB |
| Cold | 90 days - 3 years | 1-minute aggregates | Object storage | $2/TB |

Storage cost calculation:

| Tier | Data Volume | Storage Cost | Notes |
| --- | --- | --- | --- |
| Hot (7 days) | 150 GB | $7.50/month | Uncompressed, fast SSD |
| Warm (83 days) | 1.5 TB (compressed) | $15/month | 90% compression |
| Cold (2.75 years) | 2 TB | $4/month | Downsampled 6x |
| Total | ~3.7 TB effective | ~$26.50/month | vs. $1,165/mo uncompressed |
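
The tier totals can be cross-checked in a couple of lines, using the stored volumes and the $/TB rates from the two tables above (illustrative arithmetic only):

# Cross-check of the tiered storage cost using the volumes and rates above.
tiers = {
    # tier: (stored volume in TB, price in $ per TB per month)
    "hot":  (0.15, 50.0),
    "warm": (1.5, 10.0),
    "cold": (2.0, 2.0),
}

monthly = {name: vol * rate for name, (vol, rate) in tiers.items()}
total = sum(monthly.values())                  # ~$26.50/month
uncompressed = 23.3 * 50.0                     # 3 years of raw data on SSD: $1,165/month

print(monthly, f"total ${total:.2f}/mo vs ${uncompressed:.0f}/mo uncompressed")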

What we do: Write optimized queries for each use case.

Why: Database choice is only half the solution; query design determines actual performance.

Real-time map update (< 100ms requirement):

-- Latest position per vehicle from the continuous aggregate
SELECT DISTINCT ON (vehicle_id)
       vehicle_id, location, speed_kmh, last_seen
FROM vehicle_latest
WHERE last_seen > NOW() - INTERVAL '2 minutes'
ORDER BY vehicle_id, bucket DESC;

-- Performance: < 10ms for 10,000 vehicles (indexed materialized view)

Vehicle 24-hour history (< 2s requirement):

-- Time-bounded query with vehicle filter
SELECT time, location, speed_kmh, fuel_pct
FROM vehicle_telemetry
WHERE vehicle_id = 'VEH-001'
  AND time > NOW() - INTERVAL '24 hours'
ORDER BY time;

-- Performance: < 500ms (8,640 rows from single day chunk)

Geofence check (< 50ms requirement):

-- Check whether each recently seen vehicle is inside a geofenced area
-- (assumes a geofence_zones table with zone_name and a GiST-indexed
--  boundary geometry column)
SELECT v.vehicle_id, g.zone_name
FROM vehicle_latest v
JOIN geofence_zones g ON ST_Within(v.location::geometry, g.boundary)
WHERE v.last_seen > NOW() - INTERVAL '1 minute';

-- Performance: < 30ms with spatial index

Monthly fuel consumption report (< 30s requirement):

-- Aggregate fuel consumption per vehicle
SELECT vehicle_id,
       MIN(fuel_pct) AS min_fuel,
       MAX(fuel_pct) AS max_fuel,
       SUM(CASE WHEN fuel_pct_diff > 0 THEN fuel_pct_diff ELSE 0 END) AS refuel_total
FROM (
    SELECT vehicle_id, fuel_pct,
           fuel_pct - LAG(fuel_pct) OVER (PARTITION BY vehicle_id ORDER BY time) AS fuel_pct_diff
    FROM vehicle_telemetry
    WHERE time > NOW() - INTERVAL '30 days'
) sub
GROUP BY vehicle_id;

-- Performance: < 15s (parallelized across 30 day-chunks)

Outcome: TimescaleDB (PostgreSQL with the PostGIS extension) selected as the optimal database for fleet management, with a tiered storage strategy reducing storage costs by 97%.

Key decisions made and why:

| Decision | Choice | Rationale |
| --- | --- | --- |
| Primary database | TimescaleDB | Best combination of time-series + geospatial + SQL |
| Why not InfluxDB | No geospatial support | Geofencing is critical; would need a secondary database |
| Why not MongoDB | Not ideal for time-series | Time-range aggregations less efficient |
| Storage strategy | 3-tier (hot/warm/cold) | 97% cost reduction while meeting latency SLAs |
| Real-time queries | Continuous aggregates | Pre-computed latest position; < 10 ms response |

Quantified outcomes:

  • Write throughput: 1,000 messages/second sustained (headroom to 10K/s)
  • Real-time map latency: < 10ms (continuous aggregate)
  • 24-hour history query: < 500ms (single chunk access)
  • Monthly report generation: < 15s (parallel chunk processing)
  • Storage cost: $26.50/month (vs. $1,165/month uncompressed)
  • 3-year retention: ~3.7 TB effective storage (down from 23.3 TB raw)

1279.3 Worked Example: Smart City Data Lake

1279.4 Designing a Data Lake for a Smart City with 50,000 Sensors

Scenario: A city government is deploying a smart city platform with 50,000 sensors generating diverse data types: traffic flow (video + counts), air quality (particulate matter, CO2, ozone), noise levels, weather stations, smart streetlights, and water quality monitors. The platform must support real-time dashboards for city operations, historical analysis for urban planning, ML model training for predictive services, and compliance with data retention regulations (7 years for environmental data).

Goal: Design a cost-optimized data lake architecture with appropriate storage tiers, partitioning strategy, query patterns, and schema evolution approach that balances performance, cost, and regulatory compliance.

What we do: Catalog all data sources, estimate volumes, and classify by access patterns and retention requirements.

Why: Storage tier selection and partitioning strategy depend heavily on data characteristics.

Data source inventory:

| Source | Count | Frequency | Payload | Daily Volume | Data Type |
| --- | --- | --- | --- | --- | --- |
| Traffic cameras | 2,000 | 30 fps video | 50 KB/frame | 259 TB | Unstructured |
| Traffic counters | 5,000 | 1 min | 100 bytes | 720 MB | Time-series |
| Air quality | 500 | 5 min | 200 bytes | 29 MB | Time-series |
| Noise sensors | 1,000 | 10 sec | 50 bytes | 432 MB | Time-series |
| Weather stations | 100 | 15 min | 500 bytes | 4.8 MB | Time-series |
| Smart streetlights | 40,000 | 1 hour | 150 bytes | 144 MB | Time-series |
| Water quality | 400 | 30 min | 300 bytes | 5.8 MB | Time-series |
| Total sensors | 49,000 | - | - | ~260 TB/day | Mixed |

Key insight: Video data (99.5% of volume) dominates storage costs but is accessed rarely (incident investigation only). Time-series sensor data (0.5% of volume) is accessed frequently for dashboards and analytics. This split demands a tiered approach.
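
The daily-volume column can be reproduced with a short script; the counts, rates, and payload sizes below come straight from the inventory, and small rounding differences against the table are expected (decimal units):

# Reproduce the daily-volume column of the data source inventory.
sources = {
    # name: (device count, messages or frames per device per day, payload bytes)
    "traffic_cameras":  (2_000, 30 * 86_400, 50_000),   # 30 fps, 50 KB/frame
    "traffic_counters": (5_000, 1_440, 100),
    "air_quality":      (500, 288, 200),
    "noise_sensors":    (1_000, 8_640, 50),
    "weather_stations": (100, 96, 500),
    "streetlights":     (40_000, 24, 150),
    "water_quality":    (400, 48, 300),
}

daily_bytes = {n: count * msgs * size for n, (count, msgs, size) in sources.items()}
total_tb = sum(daily_bytes.values()) / 1e12
non_video_gb = (sum(daily_bytes.values()) - daily_bytes["traffic_cameras"]) / 1e9

print(f"total ~{total_tb:.0f} TB/day, non-video sensor data ~{non_video_gb:.1f} GB/day")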

What we do: Define hot, warm, and cold storage tiers with appropriate technologies for each data type.

Why: A single storage tier cannot cost-effectively serve all access patterns.

Three-tier architecture:

[50,000 Sensors]
       |
       v
[Ingestion Layer: Kafka]
       |
   +---+---+
   |       |
   v       v
[Time-Series]  [Object Storage]
   |               |
   v               v
+--+--+       +---+---+
|     |       |   |   |
v     v       v   v   v
[Hot] [Warm]  [Hot][Warm][Cold]
 7d   90d     24h  30d  Archive

Tier specifications:

| Tier | Technology | Data | Retention | Storage Cost | Query Latency |
| --- | --- | --- | --- | --- | --- |
| Hot (TS) | TimescaleDB (SSD) | Sensor readings | 7 days | $0.10/GB/mo | < 50 ms |
| Warm (TS) | TimescaleDB (HDD) | Sensor readings | 90 days | $0.02/GB/mo | < 500 ms |
| Cold (TS) | S3 + Parquet | Aggregated sensors | 7 years | $0.004/GB/mo | < 10 s |
| Hot (Video) | MinIO (SSD) | Recent clips | 24 hours | $0.10/GB/mo | < 1 s |
| Warm (Video) | MinIO (HDD) | Investigation window | 30 days | $0.02/GB/mo | < 5 s |
| Cold (Video) | S3 Glacier | Archive | 90 days | $0.004/GB/mo | 3-5 hours |
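
One way an ingestion or lifecycle service might encode this table is a simple routing function from data category and age to a target tier. The category names and tier labels below are illustrative; they mirror the table, not any product API:

from datetime import timedelta

# Map a record's category and age to a storage tier, mirroring the tier table.
def storage_tier(category: str, age: timedelta) -> str:
    if category == "timeseries":
        if age <= timedelta(days=7):
            return "timescaledb-ssd (hot)"
        if age <= timedelta(days=90):
            return "timescaledb-hdd (warm)"
        return "s3-parquet (cold, aggregated)"
    if category == "video":
        if age <= timedelta(hours=24):
            return "minio-ssd (hot)"
        if age <= timedelta(days=30):
            return "minio-hdd (warm)"
        return "s3-glacier (cold archive)"
    raise ValueError(f"unknown category: {category}")

print(storage_tier("video", timedelta(days=3)))   # minio-hdd (warm)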

What we do: Design partition keys for optimal query performance across different access patterns.

Why: Poor partitioning causes full-table scans (slow, expensive) or hot partitions (write bottlenecks).

Time-series partitioning (TimescaleDB hypertables):

-- Main sensor readings hypertable
CREATE TABLE sensor_readings (
    time        TIMESTAMPTZ NOT NULL,
    sensor_id   VARCHAR(32) NOT NULL,
    sensor_type VARCHAR(20) NOT NULL,
    location_id VARCHAR(20) NOT NULL,
    value       DOUBLE PRECISION,
    unit        VARCHAR(10),
    quality     SMALLINT DEFAULT 100
);

-- Create hypertable with 1-hour chunks (optimized for high-frequency data)
SELECT create_hypertable('sensor_readings', 'time',
    chunk_time_interval => INTERVAL '1 hour',
    create_default_indexes => FALSE);

-- Compound index for dashboard queries: "last hour by sensor type in district"
CREATE INDEX idx_type_location_time ON sensor_readings
    (sensor_type, location_id, time DESC);

-- Compression policy: compress after 24 hours
ALTER TABLE sensor_readings SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'sensor_id, sensor_type',
    timescaledb.compress_orderby = 'time DESC'
);
SELECT add_compression_policy('sensor_readings', INTERVAL '24 hours');

-- Retention policy: drop raw chunks after 90 days
-- (add_retention_policy only drops data; aging chunks are exported to the
--  S3 Parquet cold tier by a separate job before they expire)
SELECT add_retention_policy('sensor_readings', INTERVAL '90 days');
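
Because add_retention_policy only drops data, the "move to S3" step needs a separate export job that writes aging data to Parquet before the chunks expire. A minimal sketch is shown below; the DSN, bucket name, and daily window are placeholders, and it assumes psycopg2, pandas with pyarrow, and boto3 are available.

import boto3
import pandas as pd
import psycopg2

# Export one day of sensor readings to Parquet and upload it to the cold tier.
# Run for data approaching the 90-day retention limit, before it is dropped.
DAY = "2024-01-15"                                   # placeholder export window
conn = psycopg2.connect("dbname=smartcity user=archiver host=localhost")

df = pd.read_sql(
    """
    SELECT time, sensor_id, sensor_type, location_id, value, unit, quality
    FROM sensor_readings
    WHERE time >= %(day)s::date
      AND time < %(day)s::date + INTERVAL '1 day'
    """,
    conn,
    params={"day": DAY},
)

local_path = f"/tmp/sensor_readings_{DAY}.parquet"
df.to_parquet(local_path, index=False)               # columnar, Spark/Presto friendly

s3 = boto3.client("s3")
s3.upload_file(                                      # hypothetical bucket/prefix
    local_path,
    "smart-city-timeseries",
    f"cold/year={DAY[:4]}/month={DAY[5:7]}/day={DAY[8:]}/readings.parquet",
)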

Object storage partitioning (video):

s3://smart-city-video/
β”œβ”€β”€ raw/
β”‚   └── year=2024/
β”‚       └── month=01/
β”‚           └── day=15/
β”‚               └── camera_id=CAM-001/
β”‚                   └── hour=00/
β”‚                       β”œβ”€β”€ 00-00-00_00-05-00.mp4
β”‚                       └── 00-05-00_00-10-00.mp4
β”œβ”€β”€ processed/
β”‚   β”œβ”€β”€ thumbnails/
β”‚   β”œβ”€β”€ motion_clips/
β”‚   └── analytics/
└── archive/ (Glacier, 90+ days)
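
Clip uploaders can derive object keys from this layout with a small helper; the function below is just a sketch of the naming convention shown above:

from datetime import datetime, timezone

# Build the raw-video object key for a clip, following the partition layout above.
def raw_video_key(camera_id: str, clip_start: datetime, clip_end: datetime) -> str:
    return (
        f"raw/year={clip_start:%Y}/month={clip_start:%m}/day={clip_start:%d}/"
        f"camera_id={camera_id}/hour={clip_start:%H}/"
        f"{clip_start:%H-%M-%S}_{clip_end:%H-%M-%S}.mp4"
    )

start = datetime(2024, 1, 15, 0, 0, 0, tzinfo=timezone.utc)
end = datetime(2024, 1, 15, 0, 5, 0, tzinfo=timezone.utc)
print(raw_video_key("CAM-001", start, end))
# raw/year=2024/month=01/day=15/camera_id=CAM-001/hour=00/00-00-00_00-05-00.mp4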

What we do: Plan for schema changes as new sensor types are added and data requirements evolve.

Why: IoT deployments evolve constantly: new sensor types appear, fields are added, and units change over time.

Schema versioning approach:

-- Metadata table tracking schema versions (one row per sensor type and version)
CREATE TABLE schema_versions (
    sensor_type     VARCHAR(20) NOT NULL,
    version         INTEGER NOT NULL,
    schema_json     JSONB NOT NULL,
    effective_from  TIMESTAMPTZ NOT NULL,
    changelog       TEXT,
    PRIMARY KEY (sensor_type, version)
);

-- Example: Air quality schema evolution
INSERT INTO schema_versions VALUES
('air_quality', 1, '{
    "fields": ["pm25", "pm10", "co2", "temperature", "humidity"],
    "units": {"pm25": "ug/m3", "pm10": "ug/m3", "co2": "ppm"}
}', '2024-01-01', 'Initial deployment'),

('air_quality', 2, '{
    "fields": ["pm25", "pm10", "pm1", "co2", "no2", "o3", "temperature", "humidity", "pressure"],
    "units": {"pm25": "ug/m3", "pm1": "ug/m3", "no2": "ppb", "o3": "ppb", "pressure": "hPa"}
}', '2024-06-01', 'Added PM1, NO2, O3, pressure from new sensor model');

Forward compatibility rules:

  1. Adding fields: Always optional with defaults (NULL or computed)
  2. Changing units: Store original + converted, never overwrite historical
  3. Renaming fields: Create view alias, deprecate old name over 6 months
  4. Removing fields: Stop populating, retain historical for query compatibility
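
As a sketch of rule 1 in practice, a reader can normalize incoming payloads against the field list declared for their schema version, filling fields a device does not send with NULL/None. The field lists mirror the schema_versions rows above; the helper itself is hypothetical:

# Normalize a sensor payload against its declared schema version.
# Added fields are optional (rule 1), so older devices simply produce None.
SCHEMAS = {
    ("air_quality", 1): ["pm25", "pm10", "co2", "temperature", "humidity"],
    ("air_quality", 2): ["pm25", "pm10", "pm1", "co2", "no2", "o3",
                         "temperature", "humidity", "pressure"],
}

def normalize(sensor_type: str, version: int, payload: dict) -> dict:
    fields = SCHEMAS[(sensor_type, version)]
    unknown = set(payload) - set(fields)
    if unknown:
        raise ValueError(f"unexpected fields for v{version}: {unknown}")
    return {field: payload.get(field) for field in fields}

# An older air-quality device reporting v1 fields still validates against v2;
# the new fields (pm1, no2, o3, pressure) come back as None.
print(normalize("air_quality", 2,
                {"pm25": 12.1, "pm10": 20.4, "co2": 415,
                 "temperature": 21.5, "humidity": 48}))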

What we do: Calculate storage costs and implement aggressive cost reduction through compression, aggregation, and tiering.

Why: At 260 TB/day, naive storage costs would exceed $1M/month. Smart optimization reduces this by 95%+.

Storage cost analysis (before optimization):

| Data Type | Daily Volume | Monthly (30 d) | Yearly | Rate ($/GB) | Annual Cost |
| --- | --- | --- | --- | --- | --- |
| Video (raw) | 259 TB | 7,770 TB | 93,240 TB | $0.02 | $1,864,800 |
| Sensors | 1.4 GB | 42 GB | 504 GB | $0.02 | $10 |
| Total | 259 TB | 7,770 TB | 93,240 TB | - | $1,864,810 |

Optimization strategies applied:

| Strategy | Target | Reduction | Implementation |
| --- | --- | --- | --- |
| Video motion detection | Keep only motion clips | 90% | Edge AI filters out static footage |
| Video resolution tiering | 4K to 1080p to 480p by age | 75% | Transcode at 24 h and 7 d boundaries |
| Video retention policy | Delete after 90 days | 92% | Lifecycle policy (compliance footage exempt) |
| Sensor compression | TimescaleDB native | 92% | Segment by sensor_id |
| Sensor downsampling | 1-min to 1-hour to 1-day | 99% | Continuous aggregates for cold tier |

Cost after optimization:

| Tier | Data Type | Retention | Volume/Year | Unit Cost | Annual Cost |
| --- | --- | --- | --- | --- | --- |
| Hot TS | Sensors | 7 days | 10 GB | $0.10 | $1 |
| Warm TS | Sensors | 90 days | 100 GB | $0.02 | $2 |
| Cold TS | Aggregates | 7 years | 500 GB | $0.004 | $14 |
| Hot Video | Motion clips | 24 hours | 5 TB | $0.10 | $500 |
| Warm Video | Motion clips | 30 days | 150 TB | $0.02 | $3,000 |
| Cold Video | Compliance | 90 days* | 200 TB | $0.004 | $800 |
| Total | - | - | ~350 TB | - | $4,317 |

*Environmental video exempt from 90-day deletion (compliance requirement)

Savings: $1,864,810 to $4,317 = 99.8% reduction ($1.86M/year saved)
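
The headline figure follows directly from the two annual totals:

# Savings check using the before/after annual totals above.
before, after = 1_864_810, 4_317
print(f"${before - after:,} saved per year ({1 - after / before:.1%} reduction)")
# $1,860,493 saved per year (99.8% reduction)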

Outcome: A production-ready data lake architecture serving 50,000 sensors with sub-100ms dashboard latency, 7-year compliance retention, and $4,317/year storage cost (vs. $1.86M unoptimized).

Key decisions made and why:

| Decision | Choice | Rationale |
| --- | --- | --- |
| Time-series database | TimescaleDB | SQL familiarity for city analysts, native compression, continuous aggregates |
| Object storage | MinIO + S3 | On-prem for high-bandwidth video, S3 for archive |
| Video processing | Edge motion detection | 90% volume reduction at source; keeps only relevant footage |
| Partitioning | Time-first, then sensor | Matches 95% of query patterns (time-range dashboards) |
| Schema evolution | JSONB + version table | Maximum flexibility without migration downtime |
| Cold storage format | Parquet | Columnar compression; Spark/Presto-compatible for ML |

Architecture diagram:

%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#7F8C8D', 'secondaryColor': '#16A085', 'tertiaryColor': '#E67E22', 'fontSize': '12px'}}}%%
flowchart TB
    subgraph Sources["Data Sources"]
        SENSORS["50,000 Sensors"]
        CAMERAS["Traffic Cameras"]
        EDGE["Edge AI:<br/>Motion Detection"]
    end

    KAFKA["Apache Kafka"]

    subgraph TSPath["Time-Series Path"]
        TSHOT["TimescaleDB Hot<br/>(7 days)"]
        TSWARM["TimescaleDB Warm<br/>(90 days)"]
        TSCOLD["S3 Parquet<br/>(7 years)"]
    end

    subgraph VideoPath["Video Path"]
        VHOT["MinIO Hot Video<br/>(24 hours)"]
        VWARM["MinIO Warm Video<br/>(30 days)"]
        VCOLD["S3 Glacier<br/>(90 days archive)"]
    end

    subgraph MLPath["Analytics Path"]
        SPARK["Spark Streaming"]
        ML["ML Training<br/>(Weekly batch)"]
    end

    SENSORS --> KAFKA
    CAMERAS --> EDGE
    EDGE --> KAFKA

    KAFKA --> TSHOT
    KAFKA --> VHOT
    KAFKA --> SPARK

    TSHOT --> TSWARM
    TSWARM --> TSCOLD

    VHOT --> VWARM
    VWARM --> VCOLD

    SPARK --> ML

    style Sources fill:#2C3E50,stroke:#16A085,color:#fff
    style TSPath fill:#16A085,stroke:#2C3E50,color:#fff
    style VideoPath fill:#E67E22,stroke:#2C3E50,color:#fff
    style MLPath fill:#7F8C8D,stroke:#2C3E50,color:#fff
    style KAFKA fill:#2C3E50,stroke:#E67E22,color:#fff
    style EDGE fill:#E67E22,stroke:#2C3E50,color:#fff

Quantified outcomes:

  • Real-time dashboard latency: 45ms P99 (target: < 100ms)
  • Historical query (1-year range): 2.3s (target: < 5s)
  • Video clip retrieval: 8s hot, 45s warm (target: < 30s)
  • Annual storage cost: $4,317 (99.8% reduction from naive approach)
  • Compliance coverage: 7 years environmental data, 90 days video with compliance exemptions
  • Schema migrations: Zero downtime (additive changes only)

1279.5 Summary

  • Fleet management example demonstrates database selection trade-offs (TimescaleDB vs InfluxDB vs MongoDB) with geospatial requirements as the deciding factor
  • Smart city data lake example shows tiered storage architecture reducing costs by 99.8% through edge processing, compression, and lifecycle policies
  • Schema evolution strategies enable gradual changes without breaking existing devices or queries
  • Cost optimization requires understanding access patterns to place data in appropriate storage tiers
  • Query optimization through continuous aggregates and proper indexing achieves sub-100ms latency

1279.6 What’s Next

You’ve completed the Data Storage and Databases chapter series. Continue exploring:

1279.7 Resources

1279.7.1 Databases

1279.7.2 Tools and Libraries

1279.7.3 Books

  • β€œDesigning Data-Intensive Applications” by Martin Kleppmann
  • β€œDatabase Internals” by Alex Petrov
  • β€œNoSQL Distilled” by Pramod Sadalage and Martin Fowler