Explain why traditional databases fail for high-velocity IoT time-series data
Describe the core design principles of time-series database architecture
Differentiate LSM trees, columnar storage, and time-based partitioning as TSDB design strategies
Calculate write amplification and compression ratios for different storage approaches
Identify the key characteristics that make time-series workloads unique
In 60 Seconds
Time-series data is a sequence of values indexed by timestamp – the fundamental data structure of IoT systems where sensors report readings continuously over time. Unlike transactional data, time-series data is append-only, rarely updated, and queries almost always include a time range filter. Understanding time-series characteristics (monotonic insertion, high compression ratios on repeated floats, time-range-dominant queries) is essential for designing IoT storage systems that scale to millions of readings per day.
10.2 MVU: Minimum Viable Understanding
Core concept: Time-series databases are purpose-built storage systems optimized for timestamped data with high write throughput, time-based queries, and efficient compression.
Why it matters: IoT sensors generate millions of readings per hour; traditional databases collapse under this load while TSDBs handle it with 10-100x better performance and 90%+ storage savings.
Key takeaway: Time-series databases use fundamentally different architectures (LSM trees, columnar storage) that eliminate the write amplification problem plaguing traditional databases with IoT workloads.
10.3 Introduction
Your smart factory generates about 18 million sensor readings per hour (5,000 sensors x 3,600 s/hour at 1 Hz) from temperature, vibration, and pressure sensors across production lines. Traditional relational databases struggle: sustained writes and time-range queries slow to a crawl, and your operations team misses critical anomalies because dashboards can’t refresh fast enough.
This is the time-series data challenge: extremely high write volumes, time-based queries, and the need for real-time analytics. Standard databases like MySQL or PostgreSQL weren’t designed for this workload. They optimize for transactional consistency and complex relationships, not the append-only, time-stamped nature of sensor data.
Time-series databases (TSDBs) solve this problem with specialized architectures: write-optimized storage engines, time-based indexing, built-in downsampling, and query languages designed for temporal analysis. Compared to a generic time-series schema in a traditional database, they often sustain much higher ingest rates and reduce long-term storage via compression and tiered retention.
This chapter first quantifies why traditional databases fail for IoT workloads–with concrete write throughput, storage, and query benchmarks–then examines the TSDB architecture (LSM trees, columnar storage, time-based partitioning) that solves these problems, and closes with a worked sizing example for an offshore wind farm.
For Beginners: What is Time-Series Data?
Time-series data is a sequence of data points indexed by time. Think of your fitness tracker: it records your heart rate every second throughout your run. Each measurement has two parts:
Timestamp: When the measurement was taken (2025-12-15 09:30:15)
Value: What was measured (heart rate = 145 bpm)
Your tracker might collect 3,600 heart rate readings during a 1-hour run. If it records continuously at 1 Hz, that’s 86,400 readings per day and about 31.5 million readings per year for heart rate alone–and modern trackers measure dozens of metrics simultaneously.
IoT systems face this same challenge at massive scale: smart buildings with thousands of sensors, industrial facilities with hundreds of thousands of measurement points, or connected vehicles transmitting hundreds of parameters every second.
Why regular databases struggle:
They’re designed for data that gets updated frequently (like your account balance)
IoT data is almost never updated–only inserted (append-only pattern)
Queries like “show me the average temperature over the last hour” are inefficient in row-based databases
Storage explodes because general-purpose databases don’t compress time-series data well
Time-series databases are specifically engineered for this pattern: massive write volumes, rare updates, time-based queries, and excellent compression.
For Kids: Meet the Sensor Squad!
Time-series databases are like magical diaries that remember everything sensors see, hear, and feel - and can flip back through thousands of pages in the blink of an eye!
10.3.1 The Sensor Squad Adventure: The Mystery of the Missing Cookies
One morning, the Sensor Squad discovered a mystery in the Smart Kitchen: someone had been sneaking cookies from the cookie jar! The kitchen manager asked Thermo the temperature sensor, Lumi the light sensor, Speedy the motion detector, and Droplet the humidity sensor to solve the case.
“We need to check our diaries!” said Thermo excitedly. You see, every sensor in the Sensor Squad keeps a special time-diary where they write down what they observe every single second - that is 86,400 entries every day! Thermo’s diary had entries like “7:00:01 AM - 72 degrees… 7:00:02 AM - 72 degrees… 11:43:15 PM - 72 degrees… 11:43:16 PM - 74 degrees!” Interesting - the temperature went up slightly at 11:43 PM!
Lumi checked her light diary: “11:43:10 PM - darkness… 11:43:12 PM - BRIGHT!” The refrigerator door had opened! Speedy found motion at 11:43:11 PM near the counter. But the real clue came from Droplet: “11:43:20 PM - humidity spike near the cookie jar!” Someone had breathed right over it while grabbing cookies. By putting all their time-stamped diaries together, they discovered the cookie thief struck at exactly 11:43 PM - and the tiny wet nose print Droplet detected meant it was Whiskers the cat! The Sensor Squad solved the mystery because their time-diaries remembered EVERYTHING.
10.3.2 Key Words for Kids
| Word | What It Means |
| --- | --- |
| Time-series | A list of measurements with timestamps, like a diary that writes the time for every entry |
| Timestamp | The exact date and time when something was measured - like writing "Tuesday, 3:15 PM" in your diary |
| Query | Asking the database a question, like "What happened at 11:43 PM?" |
| Compression | Squishing data to take up less space, like writing "same as before" instead of repeating the same thing 1000 times |
| Retention | How long to keep old diary entries before throwing them away to make room for new ones |
10.3.3 Try This at Home!
Make Your Own Time-Series Diary!
Get a notebook and pick something to measure - like the temperature outside, how bright it is, or how many cars pass by your window
Every 15 minutes for one afternoon, write down the time and your measurement (example: “2:00 PM - 68 degrees, sunny”)
At the end of the day, look for patterns! Did it get warmer or cooler? Was there more traffic at certain times?
Try to answer questions from your data: “What time was the warmest?” or “When did the most cars go by?”
You just created time-series data - the same kind of information that sensors collect millions of times per day! Real time-series databases can search through billions of these entries faster than you can say “Sensor Squad!”
Common Misconception: “We Can Just Use PostgreSQL with Time Indexes”
The Myth: Many teams assume adding time-based indexes to PostgreSQL will solve IoT data problems.
The Reality: A production smart factory learned this the hard way:
Initial approach: PostgreSQL with time indexes on 2,000 sensors at 1 Hz
Result after 3 months:
Database size: 487 GB (should have been ~50 GB with proper compression)
Query time for 24-hour dashboard: 45 seconds (users expect <2 seconds)
Write queue backlog during peak hours: 12-minute lag
SSD replacement after 8 months due to write amplification (expected 5-year lifespan)
After migrating to TimescaleDB:
Database size: 52 GB (89% reduction)
Query time: 1.2 seconds (97% faster)
Zero write backlog
SSD write load reduced by 85%
Key lesson: Generic databases lack critical time-series optimizations–columnar storage, adaptive compression, automatic time-based partitioning, and specialized query engines. Adding indexes doesn’t compensate for fundamental architectural differences.
Cross-Hub Connections
Learning Resources:
Knowledge Gaps Hub - Addresses common time-series database misunderstandings and when to use which platform
Simulations Hub - Interactive storage calculator and retention policy simulator
Videos Hub - Video tutorials on InfluxDB, TimescaleDB, and Prometheus setup
Practice Your Skills:
Quizzes Hub - Test your understanding of compression ratios, query optimization, and platform selection
10.4 Why Traditional Databases Fail
Estimated time: ~10 min | Difficulty: Intermediate | Unit: P10.C15.U01
To understand why time-series databases exist, let’s examine concrete performance numbers from real-world IoT deployments.
10.4.1 Write Performance Comparison
| Database Type | Writes/Second | Notes |
| --- | --- | --- |
| MySQL (InnoDB) | 5,000 - 8,000 | With tuning, single instance |
| PostgreSQL | 8,000 - 12,000 | With tuning, single instance |
| MongoDB | 15,000 - 25,000 | Document model, single instance |
| InfluxDB | 500,000+ | Optimized for time-series |
| TimescaleDB | 200,000 - 500,000 | PostgreSQL with time-series extension |
| Prometheus | 1,000,000+ | Samples ingested/sec (pull-based scraping model) |
The Problem: A medium-sized smart building with 2,000 sensors reporting every second generates 2,000 writes per second–and at peak, alarm conditions can spike this to 10,000+ writes/second. Even this modest sustained load stresses a traditional database, causing:
Write queue backups (data loss during network hiccups)
Traditional databases suffer from write amplification: writing one sensor reading triggers multiple disk operations.
Figure 10.1: Traditional Database Write Amplification for IoT Data
Alternative View: TSDB vs Traditional DB Write Comparison
This side-by-side comparison shows how time-series databases minimize write amplification:
Time-series databases eliminate random I/O by batching writes in memory and flushing sequentially, reducing write amplification from 600x to ~1.25x.
Result: Writing 32 bytes of sensor data triggers ~19 KB of actual disk writes due to index maintenance, transaction logging, and page updates. This roughly 600x write amplification destroys SSD endurance and limits throughput.
Putting Numbers to It
How does write amplification affect SSD lifespan? A typical enterprise SSD rated for 10 DWPD (drive writes per day) on a 1TB drive:
Daily sensor data (the smart factory's 5,000 sensors at 1 Hz, 32 bytes/reading): \(5{,}000 \times 86{,}400 \times 32 \approx 13.8\text{ GB/day}\)
Actual writes with 600x amplification: \(13.8 \times 600 = 8{,}280\text{ GB/day}\)
SSD write budget: \(1\text{ TB} \times 10\text{ DWPD} = 10{,}000\text{ GB/day}\)
Result: Traditional database consumes 83% of SSD write budget for just 13.8GB of sensor data! TSDBs with ~1.25x amplification use only \(13.8 \times 1.25 = 17\text{ GB/day}\) actual writes—480x less SSD wear, extending drive life from 1.2 years to decades.
Use this calculator to explore how write amplification affects SSD lifespan for different IoT deployments. Adjust the sensor count, sampling rate, and write amplification factor to see the impact.
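The same calculation can be sketched in a few lines of Python. The function name and parameters here are illustrative (they mirror the calculator's inputs, not any published API):

```python
def ssd_wear(sensors, hz, bytes_per_reading, amplification,
             drive_tb=1.0, dwpd=10):
    """Estimate daily SSD writes and the share of the drive's write budget consumed."""
    data_gb_per_day = sensors * hz * 86_400 * bytes_per_reading / 1e9
    actual_gb_per_day = data_gb_per_day * amplification
    budget_gb_per_day = drive_tb * 1000 * dwpd  # DWPD = drive writes per day
    return data_gb_per_day, actual_gb_per_day, actual_gb_per_day / budget_gb_per_day

# The smart factory from the text: 5,000 sensors at 1 Hz, 32-byte readings
data, actual, share = ssd_wear(5_000, 1, 32, amplification=600)
print(f"{data:.1f} GB/day raw -> {actual:.0f} GB/day written ({share:.0%} of budget)")
```

Swapping `amplification=600` for `1.25` reproduces the TSDB side of the comparison.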
Traditional databases also waste storage on time-series data:
| Database | Compression Ratio | 1 Year for 1,000 Sensors (1 reading/sec, 32 bytes) |
| --- | --- | --- |
| PostgreSQL (uncompressed) | 1:1 | 1,008 GB (~1 TB) |
| PostgreSQL (with compression) | 3:1 | 336 GB |
| TimescaleDB | 10:1 to 15:1 | 67-100 GB |
| InfluxDB | 20:1 to 30:1 | 33-50 GB |
Time-series databases achieve 10x to 30x better compression through:
Columnar storage: Group same-type values together (all timestamps, then all temperature values)
Delta encoding: Store differences between consecutive values (temperature changes by 0.1C, not absolute 23.7C each time)
Run-length encoding: Compress repeated values (sensor offline = same value for hours)
Dictionary encoding: Replace repeated strings with integer references
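Two of these encodings are simple enough to demonstrate in a few lines. This is a toy illustration of the idea, not how any production engine stores bytes:

```python
def delta_encode(values):
    """Keep the first value, then store only differences between neighbours."""
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]

def run_length_encode(values):
    """Collapse runs of identical values into [value, count] pairs."""
    out = []
    for v in values:
        if out and out[-1][0] == v:
            out[-1][1] += 1
        else:
            out.append([v, 1])
    return out

temps = [23.7, 23.8, 23.8, 23.8, 23.9]   # slowly drifting sensor
print(delta_encode(temps))                # small deltas need far fewer bits
offline = [0.0] * 3600                    # sensor offline for an hour
print(run_length_encode(offline))         # one pair instead of 3,600 values
```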
10.4.5 Interactive: TSDB Storage Sizing
Calculate how much storage your IoT deployment needs with different database approaches. This calculator applies the compression ratios from the table above.
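The sizing arithmetic behind such a calculator is straightforward; here is a minimal sketch (the function name is ours, and the compression ratios are the assumptions from the table):

```python
def storage_gb_per_year(sensors, hz, bytes_per_reading, compression_ratio):
    """Raw yearly footprint divided by the assumed compression ratio."""
    raw_gb = sensors * hz * 86_400 * 365 * bytes_per_reading / 1e9
    return raw_gb / compression_ratio

# 1,000 sensors at 1 Hz, 32 bytes/reading - the table's scenario
print(f"uncompressed: {storage_gb_per_year(1_000, 1, 32, 1):.0f} GB")    # ~1,008 GB
print(f"InfluxDB 30:1: {storage_gb_per_year(1_000, 1, 32, 30):.0f} GB")  # low tens of GB
```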
The difference comes from data organization: time-series databases automatically partition data by time, so queries only scan relevant chunks. Traditional databases require manual partitioning strategies that are hard to maintain.
10.5 Time-Series Database Architecture
Estimated time: ~15 min | Difficulty: Advanced | Unit: P10.C15.U02
Time-series databases use fundamentally different architectures optimized for append-only, time-indexed data.
10.5.1 Core Design Principles
Write-Optimized Storage: LSM (Log-Structured Merge) trees instead of B-trees
Time-Based Partitioning: Automatic chunking by time intervals
Columnar Storage: Group data by column, not row
Built-in Downsampling: Automatic aggregation and retention policies
Specialized Query Languages: Time-aware functions and operators
10.5.2 High-Level Data Flow
Figure 10.2: LSM Tree Write Path with SST File Compaction
Alternative View: LSM Tree Time-Based Organization
This view emphasizes how time-series data naturally fits the LSM tree model with time-based organization:
Data automatically flows through time-based tiers: recent data stays hot for fast queries while older data gets compressed and downsampled for cost-efficient storage.
How It Works:
Write Buffer: Incoming data accumulates in memory (extremely fast)
Write-Ahead Log (WAL): Durability guarantee without immediate disk writes
Memtable: When buffer fills, data is sorted and prepared for disk
SST Files (Sorted String Tables): Immutable files written to disk in levels
Compaction: Background process merges and compresses files, removing old data
This architecture enables:
Fast writes: Only sequential I/O, no random seeks
Efficient reads: Data sorted by time, only scan relevant files
Automatic compression: Each compaction level applies stronger compression
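The write path above can be made concrete with a toy sketch. This is a deliberately simplified model (a real engine such as RocksDB adds levels, bloom filters, and background compaction); class and method names are ours:

```python
import bisect

class TinyLSM:
    """Toy LSM write path: WAL append, in-memory buffer, sorted flush to 'SSTs'."""
    def __init__(self, memtable_limit=4):
        self.wal = []        # stands in for the write-ahead log file
        self.memtable = {}   # timestamp -> value, sorted only at flush time
        self.ssts = []       # each flush produces one immutable sorted run
        self.limit = memtable_limit

    def write(self, ts, value):
        self.wal.append((ts, value))   # durability first (sequential append)
        self.memtable[ts] = value      # then the fast in-memory buffer
        if len(self.memtable) >= self.limit:
            self.flush()

    def flush(self):
        # One sequential write per flush - no random seeks
        self.ssts.append(sorted(self.memtable.items()))
        self.memtable.clear()
        self.wal.clear()               # WAL not needed once data is flushed

    def read_range(self, t0, t1):
        hits = [(t, v) for t, v in self.memtable.items() if t0 <= t <= t1]
        for sst in self.ssts:          # each SST is sorted: binary-search the start
            i = bisect.bisect_left(sst, (t0,))
            while i < len(sst) and sst[i][0] <= t1:
                hits.append(sst[i])
                i += 1
        return sorted(hits)

db = TinyLSM()
for t in range(10):
    db.write(t, 20.0 + t)   # ten readings trigger two flushes
print(db.read_range(3, 6))  # range spans an SST boundary
```

Note how a time-range read only touches the memtable plus the sorted runs that can contain the interval, which is why time-ordered ingestion suits this structure so well.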
Units column: 1 byte -> ~0.01 bytes amortized (run-length encoding over thousands of identical values)
Total: ~29 bytes/row (plus row overhead) -> ~4-6 bytes/row, a 5-7x compression from columnar encoding alone, before general-purpose compression algorithms (LZ4, Zstd) that can roughly double the ratio to 10-15x.
Tradeoff: Lossless vs Lossy Compression for Time-Series Data
The figures below characterize the lossy option:
Compression ratio: 20:1 to 50:1 (precision reduced from float64 to float16)
Storage for 10K sensors at 1Hz/year: 25-60 GB compressed
CPU overhead: 2-5% of write throughput
Query latency impact: +1-2ms for decompression
Data fidelity: 95-99% - values within 0.1-1% of original
Use cases: Dashboards, trending, capacity planning where precision is not critical
Decision Factors:
Choose Lossless when: Regulatory requirements mandate exact data retention, ML models are sensitive to precision, post-incident forensics requires original sensor values, compliance audits verify data integrity
Choose Lossy when: Storage cost is primary concern, data is used for visualization and trending only, sensor accuracy is already lower than storage precision (e.g., 0.5C sensor stored as float64), long-term archives where trends matter more than exact values
Hybrid approach: Lossless for hot tier (last 7 days), lossy for cold tier archival - achieves 30:1+ effective compression while preserving recent precision
10.5.5 Putting It All Together
Figure 10.4 shows how these components–write-optimized ingestion, time-partitioned storage, columnar compression, and time-aware query execution–combine into a complete TSDB architecture.
Figure 10.4: Time-series database architecture optimized for IoT workloads
The storage layer (Figure 10.5) implements the LSM tree and time-based partitioning concepts we discussed above, with each level applying progressively stronger compression.
Figure 10.5: Time-series storage with LSM tree and time-based partitioning
Beyond storage, modern TSDB deployments include an analytics pipeline (Figure 10.6) that performs real-time aggregation using tumbling and sliding windows–a topic we explore further in the Stream Processing chapter.
Figure 10.6: Time-series analytics pipeline for IoT monitoring
10.6 Worked Example: Selecting a TSDB for a Wind Farm
Worked Example: Time-Series Database Sizing for Offshore Wind Farm SCADA
Scenario: Orsted operates a 100-turbine offshore wind farm in the North Sea (Hornsea One). Each turbine has a SCADA system reporting 350 sensor channels (blade pitch, yaw angle, generator temperature, vibration, wind speed, power output). The operations team needs a time-series database for real-time monitoring, fault detection, and regulatory performance reporting.
Given:
100 turbines x 350 channels = 35,000 time-series
Sampling rates: High-speed vibration at 20 Hz (10 channels/turbine), standard SCADA at 1 Hz (200 channels/turbine), slow environmental at 0.1 Hz (140 channels/turbine)
Data types: Float64 for analog values (8 bytes), Int16 for status codes (2 bytes)
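The aggregate ingest rate implied by those sampling tiers can be checked directly:

```python
# Per-turbine point rate from the three sampling tiers in the Given list
vibration = 10 * 20      # 10 channels at 20 Hz
scada     = 200 * 1      # 200 channels at 1 Hz
environ   = 140 * 0.1    # 140 channels at 0.1 Hz
per_turbine = vibration + scada + environ   # 414 points/second
farm_rate = 100 * per_turbine               # 100 turbines
print(f"{farm_rate:,.0f} points/second")    # the 41K pts/s used in Step 3
```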
Key insight: Time-series compression is extremely effective for SCADA data because consecutive readings from a steady-state turbine are nearly identical (delta-of-delta encoding). A turbine generating 6 MW at constant wind speed produces power readings like 6.012, 6.013, 6.011, 6.014 – delta-of-delta encodes each value in 1-2 bits instead of 64 bits.
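A minimal illustration of why this works (the bit-width claim is simplified; Gorilla-style encoders use variable-length bit prefixes, and the helper below is ours, not a library function):

```python
def delta_of_delta(values):
    """Return the first value, the first delta, and the deltas-of-deltas."""
    deltas = [b - a for a, b in zip(values, values[1:])]
    dods = [b - a for a, b in zip(deltas, deltas[1:])]
    return values[:1], deltas[:1], dods

# Power output from a near-steady-state turbine, in integer kilowatts
# (integers keep the deltas exact for the demonstration).
power_kw = [6012, 6013, 6011, 6014, 6013]
head, first_delta, dods = delta_of_delta(power_kw)
print(dods)  # [-3, 5, -4] - each fits in a few bits instead of 64
```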
Step 3 – Select TSDB platform:
| Criterion | InfluxDB OSS | TimescaleDB | Prometheus |
| --- | --- | --- | --- |
| Write throughput (41K pts/s) | Handles easily (>500K pts/s) | Handles easily (>300K pts/s) | Pull model, less suitable for push |
| SQL support | InfluxQL/Flux (custom) | Full PostgreSQL SQL | PromQL (custom) |
| Downsampling (warm/cold tiers) | Continuous queries built-in | Requires TimescaleDB Toolkit | Recording rules |
| Regulatory reporting | Flux export to CSV | SQL joins with metadata tables | Not designed for reporting |
| Grafana integration | Native | Native | Native |
| Operational complexity | Single binary, simple | Requires PostgreSQL expertise | Designed for metrics, not SCADA |
| Verdict for this use case | Good for monitoring | Best fit (SQL for reports) | Poor fit |
Selected: TimescaleDB on PostgreSQL 15
Why: The monthly regulatory report requires joining turbine performance data with maintenance logs, weather records, and contractual availability targets. Standard SQL makes this straightforward. InfluxDB would require exporting data to a relational database for this join logic.
Step 4 – Hardware specification:
| Component | Specification | Cost (Monthly, AWS) |
| --- | --- | --- |
| TimescaleDB instance | r6g.xlarge (4 vCPU, 32 GB RAM) | GBP 380 |
| Storage (GP3 SSD) | 200 GB, 3,000 IOPS | GBP 20 |
| Backup (S3) | 96 GB + daily snapshots | GBP 8 |
| Total | | GBP 408/month |
Per-turbine cost: GBP 408 / 100 = GBP 4.08/turbine/month for full SCADA data storage.
Result: TimescaleDB stores 41,400 data points/second from 100 turbines using only 96 GB of compressed disk. Dashboard queries on the hot tier (last 1 hour) return in <100ms. Monthly regulatory reports run full SQL with JOINs in <30 seconds. The 15:1 compression ratio keeps 10 years of historical data in 12 GB.
Key Insight: For IoT TSDB selection, the query pattern matters more than the ingestion rate. Most modern TSDBs handle 40K points/second without issues. The differentiator is downstream: if you need SQL JOINs for business reporting, choose TimescaleDB. If you need the simplest operational model for pure monitoring, choose InfluxDB. Prometheus is purpose-built for infrastructure metrics, not industrial SCADA.
Common Pitfalls
1. Treating time-series data like transactional records
Time-series data is append-only and almost never needs UPDATE or DELETE operations. Designing schemas with row-level update capability (e.g., using UPSERT to correct sensor readings) forces expensive index maintenance that kills write throughput. Instead, append corrections as new readings with a quality flag, and handle outlier exclusion at query time.
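One way to sketch the append-plus-quality-flag pattern in Python (the schema, flag values, and conflict rule here are illustrative choices, not a standard):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Reading:
    device_id: int
    ts: int          # epoch seconds, server-side ingestion time
    value: float
    quality: str     # e.g. "good", "suspect", "correction"

readings = [
    Reading(123, 1700000000, 23.7, "good"),
    Reading(123, 1700000001, 999.9, "suspect"),    # obvious outlier, kept as-is
    Reading(123, 1700000001, 23.8, "correction"),  # appended fix - no UPDATE issued
]

# Query-time view: drop suspect rows and let later appends win for the
# same timestamp. The raw append-only history is never mutated.
latest = {}
for r in readings:
    if r.quality != "suspect":
        latest[(r.device_id, r.ts)] = r
clean = sorted(latest.values(), key=lambda r: r.ts)
print([r.value for r in clean])  # [23.7, 23.8]
```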
2. Using local device time as the authoritative timestamp
IoT devices have clocks that drift, may be unset after power loss, or may report in local time without timezone. A temperature sensor reporting timestamps from January 1, 2000 (RTC reset) will corrupt time-range queries across your entire dataset. Always use server-side ingestion time as the primary timestamp, storing device-reported time as a separate field for reference.
3. Querying raw data for dashboard aggregations
Running SELECT AVG(temperature) FROM readings WHERE device_id = 123 AND time > NOW() - INTERVAL '30 days' on 30 days of 1-second readings (2.6 million rows) causes full chunk scans on every dashboard refresh. Pre-compute continuous aggregates at 1-minute, 1-hour, and 1-day granularities during initial schema design – retrofitting this later requires a disruptive migration.
10.7 Summary
Time-series databases address the fundamental mismatch between IoT workloads and traditional database architectures.
Key Takeaways:
Traditional databases fail for IoT: Write amplification (600x) and poor compression make them unsuitable for high-velocity time-series data.
LSM trees enable high write throughput: By batching writes in memory and flushing sequentially, TSDBs achieve 100x better write performance.
Columnar storage enables compression: Grouping similar data types together and applying delta/dictionary/RLE encoding achieves 10-30x compression.
Time-based partitioning accelerates queries: Automatic chunking by time means queries only scan relevant data.
Architecture determines performance: The fundamental design choices (LSM vs B-tree, columnar vs row-based) create 10-100x differences in IoT workload performance.
Now that you understand the architectural foundations of time-series databases, you can explore specific platforms, query techniques, or retention strategies depending on your needs.
| If you want to… | Read this next |
| --- | --- |
| Compare InfluxDB, TimescaleDB, and Prometheus for your IoT use case | |