1278  Time-Series Databases

Learning Objectives

After completing this chapter, you will be able to:

  • Understand time-series database architectures (TimescaleDB, InfluxDB)
  • Implement efficient schemas for sensor data storage
  • Optimize queries with continuous aggregates and indexing
  • Configure retention policies and compression for cost optimization
  • Avoid common pitfalls in time-series database design

1278.1 Introduction

Time-series databases are purpose-built for storing and querying timestamped data–the dominant pattern in IoT sensor telemetry. They achieve 10-100x better performance than generic SQL databases for time-series workloads through specialized optimizations.

1278.2 Time-Series Database Benefits

| Feature               | Time-Series DB                | Generic SQL        |
|-----------------------|-------------------------------|--------------------|
| Write throughput      | 100K-1M/sec                   | 10K-50K/sec        |
| Time-range queries    | Optimized (partition pruning) | Full table scan    |
| Compression           | 90-95%, automatic             | Manual, limited    |
| Retention policies    | Built-in                      | Custom scripts     |
| Continuous aggregates | Native                        | Materialized views |

1278.3 IoT Storage Estimation

To estimate storage requirements for an IoT deployment, multiply device count, sampling rate, and bytes per reading, then apply the expected compression ratio.

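The estimate reduces to simple SQL arithmetic; the device count, payload size, and compression ratio below are assumed figures for illustration, not measurements:

-- Back-of-the-envelope storage estimate (all inputs are illustrative:
-- 10,000 devices at 1 Hz, ~40 bytes per uncompressed row, 95% compression)
WITH params AS (
    SELECT 10000::bigint AS devices,
           86400::bigint AS readings_per_device_per_day,
           40::numeric   AS bytes_per_row,
           0.05::numeric AS compressed_fraction
)
SELECT
    devices * readings_per_device_per_day                        AS rows_per_day,
    devices * readings_per_device_per_day * bytes_per_row / 1e9  AS raw_gb_per_day,
    devices * readings_per_device_per_day * bytes_per_row * 365
        / 1e9 * compressed_fraction                              AS compressed_gb_per_year
FROM params;

For these inputs the deployment generates roughly 35 GB of raw data per day, or on the order of 630 GB per year once compression is applied.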

1278.4 TimescaleDB: PostgreSQL for Time-Series

TimescaleDB extends PostgreSQL with hypertables–automatically partitioned tables optimized for time-series data.

1278.4.1 Hypertable Benefits

| Feature             | Standard Table  | Hypertable             |
|---------------------|-----------------|------------------------|
| Query recent data   | Full table scan | Scan 1 chunk           |
| Delete old data     | Slow DELETE     | Instant DROP chunk     |
| Parallel processing | Limited         | Multi-chunk concurrent |
| Index management    | One large index | Per-chunk indexes      |

1278.4.2 Creating a Hypertable

-- Main sensor readings table
CREATE TABLE sensor_readings (
    time        TIMESTAMPTZ NOT NULL,
    device_id   VARCHAR(50) NOT NULL,
    value       DOUBLE PRECISION,
    quality     SMALLINT DEFAULT 100
);

-- Convert to hypertable with 1-day chunks
SELECT create_hypertable('sensor_readings', 'time',
    chunk_time_interval => INTERVAL '1 day');

-- Index for common queries
CREATE INDEX ON sensor_readings (device_id, time DESC);

1278.4.3 Compression Policy

-- Enable compression
ALTER TABLE sensor_readings SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'device_id'
);

-- Compress data older than 7 days
SELECT add_compression_policy('sensor_readings', INTERVAL '7 days');

Compression results: 90-95% storage reduction for typical sensor data.
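
To check the actual savings on a running system (assuming TimescaleDB 2.x, where the hypertable_compression_stats helper is available), compare sizes before and after compression:

-- Report compressed vs uncompressed size for the hypertable
SELECT
    number_compressed_chunks,
    pg_size_pretty(before_compression_total_bytes) AS before,
    pg_size_pretty(after_compression_total_bytes)  AS after
FROM hypertable_compression_stats('sensor_readings');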

1278.4.4 Retention Policy

-- Automatically drop chunks older than 90 days
SELECT add_retention_policy('sensor_readings', INTERVAL '90 days');

Why it’s fast: Dropping entire chunks is a metadata operation (milliseconds), not row-by-row deletion (hours for millions of rows).
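
Compression and retention policies run as background jobs; assuming TimescaleDB 2.x, the informational views show what is scheduled for a given hypertable:

-- List policy jobs and their schedules
SELECT job_id, application_name, schedule_interval, config
FROM timescaledb_information.jobs
WHERE hypertable_name = 'sensor_readings';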

1278.4.5 Continuous Aggregates

Pre-compute common aggregations for dashboard queries:

-- Create hourly summary
CREATE MATERIALIZED VIEW hourly_summary
WITH (timescaledb.continuous) AS
SELECT
    device_id,
    time_bucket('1 hour', time) AS hour,
    AVG(value) AS avg_value,
    MIN(value) AS min_value,
    MAX(value) AS max_value,
    COUNT(*) AS reading_count
FROM sensor_readings
GROUP BY device_id, hour;

-- Auto-refresh every 30 minutes
SELECT add_continuous_aggregate_policy('hourly_summary',
    start_offset => INTERVAL '1 day',
    end_offset => INTERVAL '1 hour',
    schedule_interval => INTERVAL '30 minutes');

Performance impact: Dashboard queries go from 45 seconds (500M rows) to 10ms (hourly aggregates).
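
Dashboards then read from the aggregate instead of the raw hypertable; the device ID and time window below are placeholders:

-- Millisecond-scale dashboard query against the hourly rollup
SELECT hour, avg_value, min_value, max_value
FROM hourly_summary
WHERE device_id = 'device-001'                 -- placeholder device ID
  AND hour > NOW() - INTERVAL '7 days'
ORDER BY hour;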

1278.5 Common Pitfalls

CautionPitfall: Underestimating Cardinality in Time-Series Indexes

The Mistake: Creating a single global composite index on (device_id, timestamp) without considering that every reading from every device adds an index entry, so index bloat can rival or exceed the data itself.

Why It Happens: Developers apply relational (OLTP) indexing habits to time-series data. In a traditional table with one row per user, indexing a user_id column for 100,000 users creates roughly 100,000 index entries. In time-series, indexing (device_id, timestamp) for 100,000 devices at 1-second resolution creates 100,000 x 86,400 = 8.64 billion new entries every day.

The Fix: Use time-partitioned tables with partition-local indexes:

-- PostgreSQL/TimescaleDB: Partition by time, index within partitions
CREATE TABLE sensor_readings (
    time        TIMESTAMPTZ NOT NULL,
    device_id   TEXT NOT NULL,
    value       DOUBLE PRECISION
);

-- TimescaleDB automatically partitions by time
SELECT create_hypertable('sensor_readings', 'time',
    chunk_time_interval => INTERVAL '1 day');

-- Index is created per-chunk, not globally
CREATE INDEX ON sensor_readings (device_id, time DESC);

-- For 100K devices at 1 Hz, each daily chunk holds ~8.64 billion rows.
-- Queries for recent data touch only that chunk's index instead of a
-- global index spanning roughly 3.15 trillion rows per year.

Cardinality calculation for 100,000 devices at 1 Hz:

  • Rows per day: 100,000 x 86,400 = 8.64 billion
  • Rows per year: ~3.15 trillion
  • A single global B-tree at ~8 bytes/entry grows to roughly 25 TB over a year, and every insert and query pays for it
  • With 365 daily chunks the total index volume is similar, but queries scan only the chunks in their time range, inserts touch only the current chunk's index, and old indexes go away when their chunk is compressed or dropped
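
Actual per-chunk sizes can be checked rather than estimated (assuming TimescaleDB 2.x, which provides the chunks_detailed_size helper):

-- Inspect table and index size for each chunk of the hypertable
SELECT chunk_name,
       pg_size_pretty(index_bytes) AS index_size,
       pg_size_pretty(total_bytes) AS total_size
FROM chunks_detailed_size('sensor_readings')
ORDER BY chunk_name;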

CautionPitfall: Storing Timezone-Naive Timestamps

The Mistake: Using TIMESTAMP instead of TIMESTAMPTZ (timestamp with timezone), causing silent data corruption when devices or servers change timezones, or when DST transitions occur.

Why It Happens: TIMESTAMP (without timezone) is shorter to type and appears simpler. Developers assume “we’ll always use UTC” but don’t enforce it at the schema level.

The Fix: Always use timezone-aware timestamp types and enforce UTC:

-- WRONG: Timezone-naive timestamp
CREATE TABLE sensor_data (
    time        TIMESTAMP NOT NULL,  -- Ambiguous!
    device_id   TEXT,
    value       DOUBLE PRECISION
);

-- CORRECT: Timezone-aware timestamp
CREATE TABLE sensor_data (
    time        TIMESTAMPTZ NOT NULL,  -- Stores UTC, displays in session TZ
    device_id   TEXT,
    value       DOUBLE PRECISION
);

-- Make UTC the default session timezone for this database (PostgreSQL);
-- naive literals are then interpreted and displayed as UTC
ALTER DATABASE iot_db SET timezone TO 'UTC';

Storage difference: Both TIMESTAMP and TIMESTAMPTZ use 8 bytes in PostgreSQL–there’s no storage penalty for timezone awareness.
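
A quick illustration of the difference (the literal and timezone are arbitrary examples): the session timezone decides how a naive literal is interpreted, while an explicit offset is unambiguous.

-- The same naive literal means different instants in different sessions
SET timezone = 'UTC';
SELECT '2024-06-01 12:00:00'::timestamptz;     -- 2024-06-01 12:00:00+00
SET timezone = 'America/New_York';
SELECT '2024-06-01 12:00:00'::timestamptz;     -- 2024-06-01 12:00:00-04, i.e. 16:00 UTC
-- An explicit offset removes the ambiguity
SELECT '2024-06-01 12:00:00+00'::timestamptz;  -- shown as 2024-06-01 08:00:00-04, same instant as the first SELECT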

1278.6 Knowledge Check Quizzes

Question 1: Which database type is MOST appropriate for storing time-series sensor data with high write throughput requirements?

Time-series databases are purpose-built for sensor data: optimized for timestamp-ordered inserts, automatic downsampling, efficient time-range queries, and data retention policies. They achieve 10-20x better write performance than generic SQL for time-series workloads.

Question 2: An IoT application queries sensor data primarily by device_id and timestamp. Which indexing strategy provides the best query performance?

A composite index on (device_id, timestamp) optimizes queries like WHERE device_id = 'sensor1' AND timestamp > '2024-01-01'. The database can use the index for both filters in a single lookup. Order matters: the leading column (device_id) supports the equality match, and timestamp then supports the range scan.
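
As a sketch against the sensor_readings table and index from Section 1278.4.2, a query of this shape is served by a single index range scan:

-- Equality on the leading column, range on the second: one index scan
SELECT time, value
FROM sensor_readings
WHERE device_id = 'sensor1'
  AND time > '2024-01-01'
ORDER BY time DESC;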

Question 3: A smart city deploys 10,000 sensors generating 1 reading/sec each. After 1 year, the database has 315 billion records. Which retention strategy is most practical?

Multi-tier retention balances detail and cost by keeping raw data only when it is most useful (debugging and model training) and downsampling older data for trends.

Record counts (10,000 sensors at 1 Hz):

  • Raw forever: ~315B rows/year
  • Raw for 30 days: ~25.9B rows
  • Hourly aggregates for 1 year: ~87.6M rows
  • Daily aggregates: ~3.65M rows/year

At a 10-year horizon, the policy stores roughly 26B rows versus ~3.15T rows if you retained all raw data (~120x more).
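
A minimal TimescaleDB sketch of this tiering for the quiz scenario (intervals are illustrative, and it reuses the hourly_summary aggregate from Section 1278.4.5) keeps raw data briefly and rollups much longer:

-- Tier 1: keep raw readings for 30 days
SELECT add_retention_policy('sensor_readings', INTERVAL '30 days');

-- Tier 2: keep the hourly rollup for two years
-- (retention policies can also be attached to continuous aggregates)
SELECT add_retention_policy('hourly_summary', INTERVAL '2 years');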

Question 4: What is the primary advantage of TimescaleDB hypertables over standard PostgreSQL tables for IoT data?

Hypertables automatically partition data into time-based chunks (e.g., daily or weekly). Benefits:

  1. Query optimization: time-range queries only scan relevant chunks
  2. Efficient retention: old chunks are dropped instantly instead of a slow DELETE
  3. Parallel processing: multiple chunks are processed concurrently

Example: 1 billion records over 365 days. Standard table: Full scan for recent data (slow). Hypertable with daily chunks: Scan 1 chunk (2.7M records) = 365x faster!

Question 5: A temperature sensor table has 100 million records. A query for data from the last 24 hours scans the entire table. What’s the most likely problem?

Symptom: Query WHERE timestamp > NOW() - INTERVAL '1 day' performs full table scan.

Diagnosis:

  1. No timestamp index: the database must read all 100M rows. Solution: CREATE INDEX ON sensor_data (timestamp)
  2. Outdated statistics: the query planner doesn't know the data distribution. Solution: ANALYZE sensor_data

After fixing: Index seeks directly to relevant records. 1 day of data (~100k records) vs 100M total = 1000x less data scanned.
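
To confirm which case applies, inspect the plan before and after the fix (names follow the question's sensor_data table; the comments describe the expected plan shape, not literal output):

-- Before: expect a sequential scan over all 100M rows
EXPLAIN ANALYZE
SELECT * FROM sensor_data
WHERE timestamp > NOW() - INTERVAL '1 day';

-- Apply the fixes, then re-run EXPLAIN and expect an index scan
CREATE INDEX ON sensor_data (timestamp);
ANALYZE sensor_data;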

1278.7 Summary

  • Time-series databases provide 10-100x performance improvement over generic SQL for sensor data
  • TimescaleDB hypertables automatically partition data by time, enabling efficient queries and retention
  • Compression reduces storage by 90-95% for older data
  • Retention policies implement multi-tier storage (raw -> aggregated -> archived)
  • Continuous aggregates pre-compute dashboards for millisecond response times
  • Proper indexing with time-partitioned tables avoids cardinality explosion
  • Timezone-aware timestamps prevent silent data corruption

1278.8 What’s Next

Continue with Data Quality Monitoring to learn how to validate and monitor data quality in IoT database systems.

Related Chapters:

  • Database Selection Framework - Choosing the right database
  • CAP Theorem - Distributed systems trade-offs
  • Sharding Strategies - Horizontal scaling