1278  Time-Series Databases

Learning Objectives

After completing this chapter, you will be able to:

  • Understand time-series database architectures (TimescaleDB, InfluxDB)
  • Implement efficient schemas for sensor data storage
  • Optimize queries with continuous aggregates and indexing
  • Configure retention policies and compression for cost optimization
  • Avoid common pitfalls in time-series database design

1278.1 Introduction

Time-series databases are purpose-built for storing and querying timestamped data–the dominant pattern in IoT sensor telemetry. They achieve 10-100x better performance than generic SQL databases for time-series workloads through specialized optimizations.

1278.2 Time-Series Database Benefits

| Feature               | Time-Series DB                | Generic SQL        |
|-----------------------|-------------------------------|--------------------|
| Write throughput      | 100K-1M/sec                   | 10K-50K/sec        |
| Time-range queries    | Optimized (partition pruning) | Full table scan    |
| Compression           | 90-95%, automatic             | Manual, limited    |
| Retention policies    | Built-in                      | Custom scripts     |
| Continuous aggregates | Native                        | Materialized views |

1278.3 IoT Storage Estimation

To estimate storage requirements for an IoT deployment, multiply device count, sampling rate, and bytes per reading, then apply the expected compression ratio.

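The estimate reduces to simple SQL arithmetic; the device count, payload size, and compression ratio below are assumed figures for illustration, not measurements:

-- Back-of-the-envelope storage estimate (all inputs are illustrative:
-- 10,000 devices at 1 Hz, ~40 bytes per uncompressed row, 95% compression)
WITH params AS (
    SELECT 10000::bigint AS devices,
           86400::bigint AS readings_per_device_per_day,
           40::numeric   AS bytes_per_row,
           0.05::numeric AS compressed_fraction
)
SELECT
    devices * readings_per_device_per_day                        AS rows_per_day,
    devices * readings_per_device_per_day * bytes_per_row / 1e9  AS raw_gb_per_day,
    devices * readings_per_device_per_day * bytes_per_row * 365
        / 1e9 * compressed_fraction                              AS compressed_gb_per_year
FROM params;

For these inputs the deployment generates roughly 35 GB of raw data per day, or on the order of 630 GB per year once compression is applied.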

1278.4 TimescaleDB: PostgreSQL for Time-Series

TimescaleDB extends PostgreSQL with hypertables–automatically partitioned tables optimized for time-series data.

1278.4.1 Hypertable Benefits

| Feature             | Standard Table  | Hypertable             |
|---------------------|-----------------|------------------------|
| Query recent data   | Full table scan | Scan 1 chunk           |
| Delete old data     | Slow DELETE     | Instant DROP chunk     |
| Parallel processing | Limited         | Multi-chunk concurrent |
| Index management    | One large index | Per-chunk indexes      |

1278.4.2 Creating a Hypertable

-- Main sensor readings table
CREATE TABLE sensor_readings (
    time        TIMESTAMPTZ NOT NULL,
    device_id   VARCHAR(50) NOT NULL,
    value       DOUBLE PRECISION,
    quality     SMALLINT DEFAULT 100
);

-- Convert to hypertable with 1-day chunks
SELECT create_hypertable('sensor_readings', 'time',
    chunk_time_interval => INTERVAL '1 day');

-- Index for common queries
CREATE INDEX ON sensor_readings (device_id, time DESC);

1278.4.3 Compression Policy

-- Enable compression
ALTER TABLE sensor_readings SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'device_id'
);

-- Compress data older than 7 days
SELECT add_compression_policy('sensor_readings', INTERVAL '7 days');

Compression results: 90-95% storage reduction for typical sensor data.
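
To check the actual savings on a running system (assuming TimescaleDB 2.x, where the hypertable_compression_stats helper is available), compare sizes before and after compression:

-- Report compressed vs uncompressed size for the hypertable
SELECT
    number_compressed_chunks,
    pg_size_pretty(before_compression_total_bytes) AS before,
    pg_size_pretty(after_compression_total_bytes)  AS after
FROM hypertable_compression_stats('sensor_readings');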

1278.4.4 Retention Policy

-- Automatically drop chunks older than 90 days
SELECT add_retention_policy('sensor_readings', INTERVAL '90 days');

Why it’s fast: Dropping entire chunks is a metadata operation (milliseconds), not row-by-row deletion (hours for millions of rows).
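
Compression and retention policies run as background jobs; assuming TimescaleDB 2.x, the informational views show what is scheduled for a given hypertable:

-- List policy jobs and their schedules
SELECT job_id, application_name, schedule_interval, config
FROM timescaledb_information.jobs
WHERE hypertable_name = 'sensor_readings';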

1278.4.5 Continuous Aggregates

Pre-compute common aggregations for dashboard queries:

-- Create hourly summary
CREATE MATERIALIZED VIEW hourly_summary
WITH (timescaledb.continuous) AS
SELECT
    device_id,
    time_bucket('1 hour', time) AS hour,
    AVG(value) AS avg_value,
    MIN(value) AS min_value,
    MAX(value) AS max_value,
    COUNT(*) AS reading_count
FROM sensor_readings
GROUP BY device_id, hour;

-- Auto-refresh every 30 minutes
SELECT add_continuous_aggregate_policy('hourly_summary',
    start_offset => INTERVAL '1 day',
    end_offset => INTERVAL '1 hour',
    schedule_interval => INTERVAL '30 minutes');

Performance impact: Dashboard queries go from 45 seconds (500M rows) to 10ms (hourly aggregates).
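
Dashboards then read from the aggregate instead of the raw hypertable; the device ID and time window below are placeholders:

-- Millisecond-scale dashboard query against the hourly rollup
SELECT hour, avg_value, min_value, max_value
FROM hourly_summary
WHERE device_id = 'device-001'                 -- placeholder device ID
  AND hour > NOW() - INTERVAL '7 days'
ORDER BY hour;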

1278.5 Common Pitfalls

CautionPitfall: Underestimating Cardinality in Time-Series Indexes

The Mistake: Creating a single global composite index on (device_id, timestamp) without considering that every reading from every device adds an index entry, so index bloat can rival or exceed the data itself.

Why It Happens: Developers apply relational (OLTP) indexing habits to time-series data. In a traditional table with one row per user, indexing a user_id column for 100,000 users creates roughly 100,000 index entries. In time-series, indexing (device_id, timestamp) for 100,000 devices at 1-second resolution creates 100,000 x 86,400 = 8.64 billion new entries every day.

The Fix: Use time-partitioned tables with partition-local indexes:

-- PostgreSQL/TimescaleDB: Partition by time, index within partitions
CREATE TABLE sensor_readings (
    time        TIMESTAMPTZ NOT NULL,
    device_id   TEXT NOT NULL,
    value       DOUBLE PRECISION
);

-- TimescaleDB automatically partitions by time
SELECT create_hypertable('sensor_readings', 'time',
    chunk_time_interval => INTERVAL '1 day');

-- Index is created per-chunk, not globally
CREATE INDEX ON sensor_readings (device_id, time DESC);

-- For 100K devices at 1 Hz, each daily chunk holds ~8.64 billion rows.
-- Queries for recent data touch only that chunk's index instead of a
-- global index spanning roughly 3.15 trillion rows per year.

Cardinality calculation for 100,000 devices at 1 Hz:

  • Rows per day: 100,000 x 86,400 = 8.64 billion
  • Rows per year: ~3.15 trillion
  • A single global B-tree at ~8 bytes/entry grows to roughly 25 TB over a year, and every insert and query pays for it
  • With 365 daily chunks the total index volume is similar, but queries scan only the chunks in their time range, inserts touch only the current chunk's index, and old indexes go away when their chunk is compressed or dropped
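
Actual per-chunk sizes can be checked rather than estimated (assuming TimescaleDB 2.x, which provides the chunks_detailed_size helper):

-- Inspect table and index size for each chunk of the hypertable
SELECT chunk_name,
       pg_size_pretty(index_bytes) AS index_size,
       pg_size_pretty(total_bytes) AS total_size
FROM chunks_detailed_size('sensor_readings')
ORDER BY chunk_name;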

CautionPitfall: Storing Timezone-Naive Timestamps

The Mistake: Using TIMESTAMP instead of TIMESTAMPTZ (timestamp with timezone), causing silent data corruption when devices or servers change timezones, or when DST transitions occur.

Why It Happens: TIMESTAMP (without timezone) is shorter to type and appears simpler. Developers assume “we’ll always use UTC” but don’t enforce it at the schema level.

The Fix: Always use timezone-aware timestamp types and enforce UTC:

-- WRONG: Timezone-naive timestamp
CREATE TABLE sensor_data (
    time        TIMESTAMP NOT NULL,  -- Ambiguous!
    device_id   TEXT,
    value       DOUBLE PRECISION
);

-- CORRECT: Timezone-aware timestamp
CREATE TABLE sensor_data (
    time        TIMESTAMPTZ NOT NULL,  -- Stores UTC, displays in session TZ
    device_id   TEXT,
    value       DOUBLE PRECISION
);

-- Make UTC the default session timezone for this database (PostgreSQL);
-- naive literals are then interpreted and displayed as UTC
ALTER DATABASE iot_db SET timezone TO 'UTC';

Storage difference: Both TIMESTAMP and TIMESTAMPTZ use 8 bytes in PostgreSQL–there’s no storage penalty for timezone awareness.
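
A quick illustration of the difference (the literal and timezone are arbitrary examples): the session timezone decides how a naive literal is interpreted, while an explicit offset is unambiguous.

-- The same naive literal means different instants in different sessions
SET timezone = 'UTC';
SELECT '2024-06-01 12:00:00'::timestamptz;     -- 2024-06-01 12:00:00+00
SET timezone = 'America/New_York';
SELECT '2024-06-01 12:00:00'::timestamptz;     -- 2024-06-01 12:00:00-04, i.e. 16:00 UTC
-- An explicit offset removes the ambiguity
SELECT '2024-06-01 12:00:00+00'::timestamptz;  -- shown as 2024-06-01 08:00:00-04, same instant as the first SELECT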

1278.6 Knowledge Check Quizzes

Question 1: Which database type is MOST appropriate for storing time-series sensor data with high write throughput requirements?

Time-series databases are purpose-built for sensor data: optimized for timestamp-ordered inserts, automatic downsampling, efficient time-range queries, and data retention policies. They achieve 10-20x better write performance than generic SQL for time-series workloads.

Question 2: An IoT application queries sensor data primarily by device_id and timestamp. Which indexing strategy provides the best query performance?

A composite index on (device_id, timestamp) optimizes queries like WHERE device_id = 'sensor1' AND timestamp > '2024-01-01'. The database can use the index for both filters in a single lookup. Order matters: the leading column (device_id) supports the equality match, and timestamp then supports the range scan.
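
As a sketch against the sensor_readings table and index from Section 1278.4.2, a query of this shape is served by a single index range scan:

-- Equality on the leading column, range on the second: one index scan
SELECT time, value
FROM sensor_readings
WHERE device_id = 'sensor1'
  AND time > '2024-01-01'
ORDER BY time DESC;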

Question 3: A smart city deploys 10,000 sensors generating 1 reading/sec each. After 1 year, the database has 315 billion records. Which retention strategy is most practical?

Multi-tier retention balances detail and cost by keeping raw data only when it is most useful (debugging and model training) and downsampling older data for trends.

Record counts (10,000 sensors at 1 Hz):

  • Raw forever: ~315B rows/year
  • Raw for 30 days: ~25.9B rows
  • Hourly aggregates for 1 year: ~87.6M rows
  • Daily aggregates: ~3.65M rows/year

At a 10-year horizon, the policy stores roughly 26B rows versus ~3.15T rows if you retained all raw data (~120x more).
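
A minimal TimescaleDB sketch of this tiering for the quiz scenario (intervals are illustrative, and it reuses the hourly_summary aggregate from Section 1278.4.5) keeps raw data briefly and rollups much longer:

-- Tier 1: keep raw readings for 30 days
SELECT add_retention_policy('sensor_readings', INTERVAL '30 days');

-- Tier 2: keep the hourly rollup for two years
-- (retention policies can also be attached to continuous aggregates)
SELECT add_retention_policy('hourly_summary', INTERVAL '2 years');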

Question 4: What is the primary advantage of TimescaleDB hypertables over standard PostgreSQL tables for IoT data?

Hypertables automatically partition data into time-based chunks (e.g., daily or weekly). Benefits:

  1. Query optimization: time-range queries only scan relevant chunks
  2. Efficient retention: old chunks are dropped instantly instead of a slow DELETE
  3. Parallel processing: multiple chunks are processed concurrently

Example: 1 billion records over 365 days. Standard table: Full scan for recent data (slow). Hypertable with daily chunks: Scan 1 chunk (2.7M records) = 365x faster!

Question 5: A temperature sensor table has 100 million records. A query for data from the last 24 hours scans the entire table. What’s the most likely problem?

Symptom: Query WHERE timestamp > NOW() - INTERVAL '1 day' performs full table scan.

Diagnosis:

  1. No timestamp index: the database must read all 100M rows. Solution: CREATE INDEX ON sensor_data (timestamp)
  2. Outdated statistics: the query planner doesn't know the data distribution. Solution: ANALYZE sensor_data

After fixing: Index seeks directly to relevant records. 1 day of data (~100k records) vs 100M total = 1000x less data scanned.
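
To confirm which case applies, inspect the plan before and after the fix (names follow the question's sensor_data table; the comments describe the expected plan shape, not literal output):

-- Before: expect a sequential scan over all 100M rows
EXPLAIN ANALYZE
SELECT * FROM sensor_data
WHERE timestamp > NOW() - INTERVAL '1 day';

-- Apply the fixes, then re-run EXPLAIN and expect an index scan
CREATE INDEX ON sensor_data (timestamp);
ANALYZE sensor_data;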

1278.7 Summary

  • Time-series databases provide 10-100x performance improvement over generic SQL for sensor data
  • TimescaleDB hypertables automatically partition data by time, enabling efficient queries and retention
  • Compression reduces storage by 90-95% for older data
  • Retention policies implement multi-tier storage (raw -> aggregated -> archived)
  • Continuous aggregates pre-compute dashboards for millisecond response times
  • Proper indexing with time-partitioned tables avoids cardinality explosion
  • Timezone-aware timestamps prevent silent data corruption

1278.8 What’s Next

Continue with Data Quality Monitoring to learn how to validate and monitor data quality in IoT database systems.

Related Chapters:

  • Database Selection Framework - Choosing the right database
  • CAP Theorem - Distributed systems trade-offs
  • Sharding Strategies - Horizontal scaling