14  Data Retention and Downsampling

14.1 Learning Objectives

By the end of this chapter, you will be able to:

  • Design appropriate retention policies and downsampling strategies for IoT deployments
  • Calculate storage requirements for multi-tier retention architectures
  • Implement retention policies in InfluxDB and TimescaleDB
  • Identify and troubleshoot common time synchronization pitfalls in distributed IoT systems
  • Select appropriate aggregation functions for different data preservation requirements

In 60 Seconds

Retention policies automate the lifecycle of IoT data: raw high-resolution readings are kept for days or weeks in hot storage, then replaced with downsampled aggregates for months, and finally archived to cold object storage for years. Without retention policies, IoT storage grows without bound – a 1,000-sensor system at 1-second intervals generates 86.4 million rows per day. The key design decision is matching retention tier duration to the minimum resolution needed for each business function: real-time alerting needs seconds, trend analysis needs hours, compliance reporting needs years.
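The ingest figure above is simple arithmetic; a quick check of the 1,000-sensor example:

```python
# Rows per day for N sensors each sampling every `interval_s` seconds
sensors, interval_s = 1_000, 1
rows_per_day = sensors * (86_400 // interval_s)

print(rows_per_day)  # 86400000 – the 86.4 million rows/day quoted above
```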

14.2 Key Concepts

  • Retention Policy: An automated database rule that drops time chunks older than a configured threshold, preventing unbounded storage growth while optionally triggering data migration to cheaper storage tiers
  • Downsampling: Replacing high-resolution raw readings (1-second) with lower-resolution aggregates (1-minute average, 1-hour max/min) for long-term retention, preserving trend visibility at 60x storage reduction
  • Data Tiering: Automatically migrating time chunks from hot (SSD, fast queries) to warm (HDD, moderate cost) to cold (object storage, archived) storage based on data age and access frequency
  • Retention Period: The duration for which raw sensor data is retained at full resolution before being downsampled or deleted – typically 7-30 days for operational monitoring, 90-365 days for compliance, years for billing records
  • Continuous Aggregate Policy: A TimescaleDB feature that automatically refreshes pre-computed summaries (hourly averages, daily min/max) as raw data arrives, used in conjunction with retention to preserve aggregated history after raw data expires
  • Storage Cost Projection: The calculation of total storage cost over the retention period for each tier, used to design retention policy parameters that meet budget constraints while satisfying data access requirements
  • Compliance Retention: Legal or regulatory requirements mandating that specific data categories (energy usage, medical telemetry, financial transactions) be retained for defined periods – these override performance-driven retention decisions
  • Tiered Storage Automation: Using TimescaleDB’s multi-node or AWS S3 tiering, or scheduled export jobs, to automatically move aged chunks from hot databases to cold object storage without manual intervention

14.3 MVU: Minimum Viable Understanding

Core Concept: Multi-tier retention keeps full-resolution data for recent time periods (hours/days) while progressively aggregating older data to coarser resolutions (minutes, hours, days), achieving 90-98% storage reduction. Why It Matters: Without retention policies, a 5,000-sensor deployment generates ~5 TB/year raw; with proper tiering, the same data fits in ~10-15 GB (with compression) while preserving all analytical capabilities for trending and compliance. Key Takeaway: Implement retention from day one: keep 7 days raw for anomaly detection, 30 days at 1-minute resolution for operations, 1 year at hourly for compliance, and daily aggregates forever for trends. Always preserve min/max alongside averages to catch spikes.

Data retention is about deciding how long to keep IoT data and at what level of detail. Think of it like photo storage: you keep recent photos in full resolution, older ones as thumbnails, and eventually delete duplicates. Similarly, you keep recent sensor data in full detail, summarize older data into averages, and archive or delete the oldest data to manage storage costs.

14.4 Why Retention Matters

Estimated time: ~15 min | Difficulty: Advanced | Unit: P10.C15.U04

IoT systems generate massive data volumes, but not all data is equally valuable over time. Retention policies and downsampling strategies balance storage costs with analytical requirements.

14.4.1 The Retention Challenge

Example: Smart Building with 5,000 Sensors

  • Sampling rate: 1 reading per second
  • Data size: 32 bytes per reading (timestamp + sensor_id + value + metadata)
  • Daily data: 5,000 sensors x 86,400 seconds x 32 bytes = 13.8 GB/day raw
  • Annual data: 13.8 GB x 365 = 5.04 TB/year raw

With compression (10:1):

  • Daily: 1.38 GB/day
  • Annual: 504 GB/year
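The daily and annual figures above can be reproduced in a few lines, assuming the chapter's 32-byte readings and 10:1 compression ratio:

```python
# Smart-building example: 5,000 sensors at 1 Hz, 32 bytes per reading
sensors, rate_hz, bytes_per_reading = 5_000, 1, 32

daily_gb = sensors * rate_hz * 86_400 * bytes_per_reading / 1e9  # ≈ 13.8 GB/day raw
annual_tb = daily_gb * 365 / 1_000                               # ≈ 5.05 TB/year raw
compressed_annual_gb = annual_tb * 1_000 / 10                    # ≈ 504 GB/year at 10:1
```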

Pitfall: Delaying Retention Policy Implementation

The Mistake: Deploying a time-series database without retention policies, planning to “add them later when storage becomes a problem,” resulting in expensive migrations and data loss.

Why It Happens: During development, data volumes are small and storage seems cheap. Teams focus on features, not data lifecycle. By the time storage costs spike, the database contains months of uncompressed data that cannot be easily migrated to a tiered retention scheme. Dropping old chunks loses data that should have been downsampled.

The Fix: Implement retention policies on day one, even in development:

-- TimescaleDB: Set up retention BEFORE inserting data
-- 1. Create continuous aggregate for downsampling
CREATE MATERIALIZED VIEW sensor_hourly
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 hour', time) AS bucket,
       sensor_id,
       AVG(value) as avg_val,
       MIN(value) as min_val,
       MAX(value) as max_val,
       COUNT(*) as sample_count
FROM sensor_data
GROUP BY bucket, sensor_id;

-- 2. Add retention policies IMMEDIATELY
SELECT add_retention_policy('sensor_data', INTERVAL '7 days');
SELECT add_retention_policy('sensor_hourly', INTERVAL '1 year');

-- 3. Schedule downsampling refresh
SELECT add_continuous_aggregate_policy('sensor_hourly',
  start_offset => INTERVAL '2 hours',
  end_offset => INTERVAL '1 hour',
  schedule_interval => INTERVAL '1 hour');

Storage impact calculation:

  • Without retention: 5,000 sensors x 1 Hz x 32 bytes x 365 days = 5.04 TB/year
  • With 7-day raw (rolling) + 1-year hourly: 96.8 GB raw + 1.4 GB hourly = ~98 GB rolling (98% reduction vs annual raw)

Even with compression, multi-year storage becomes expensive and queries slow down as data volumes grow. The solution is downsampling – retaining full resolution for recent data while progressively aggregating older data into coarser summaries.

14.4.2 Multi-Tier Retention Strategy

The following diagrams illustrate how multi-tier retention works in practice. Figure 14.1 shows the data flow between tiers, Figure 14.2 shows the lifecycle over time, and Figure 14.3 compares different strategies by cost.

Figure 14.1: Multi-Tier Data Retention with Progressive Downsampling

Figure 14.2: Timeline showing data lifecycle through retention tiers. Data starts at full resolution and progressively downsamples as it ages, with each tier reducing storage requirements while preserving trend analysis capability.

Figure 14.3: Retention Strategy Cost Analysis: Comparing no strategy, aggressive downsampling, and balanced multi-tier approaches

Storage Calculation:

| Tier | Resolution | Retention | Points/Sensor | Total Storage (5,000 sensors) |
|------|------------|-----------|---------------|-------------------------------|
| Tier 1 | 1 second | 7 days | 604,800 | 96.8 GB (raw) → 9.7 GB (compressed) |
| Tier 2 | 1 minute | 30 days | 43,200 | 6.9 GB (raw) → 0.7 GB (compressed) |
| Tier 3 | 1 hour | 1 year | 8,760 | 1.4 GB (raw) → 0.14 GB (compressed) |
| Tier 4 | 1 day | Forever | 365/year | 0.06 GB/year |
| Total | – | – | – | ~10.6 GB (vs. 504 GB without downsampling) |

Savings: 98% reduction in storage while preserving analytical capabilities.

How do the tiers mathematically combine? For 5,000 sensors at 1 Hz with 32-byte readings:

\[ \begin{aligned} \text{Tier 1 (7d raw):} &\quad 5{,}000 \times 604{,}800 \times 32\text{ bytes} = 96.8\text{ GB raw} \\ &\quad \div 10\text{ compression} = 9.7\text{ GB} \\ \text{Tier 2 (30d, 1-min):} &\quad 5{,}000 \times 43{,}200 \times 32 = 6.9\text{ GB raw} \\ &\quad \div 10 = 0.7\text{ GB} \\ \text{Tier 3 (1y hourly):} &\quad 5{,}000 \times 8{,}760 \times 32 = 1.4\text{ GB raw} \\ &\quad \div 10 = 0.14\text{ GB} \\ \text{Tier 4 (daily forever):} &\quad 5{,}000 \times 365 \times 32 = 58\text{ MB/year} \end{aligned} \]

Total storage: 10.6 GB vs. 504 GB without downsampling = 98% reduction. This isn’t magic—it’s aggressive time-based aggregation: 86,400 raw readings per day become 1,440 minute-averages (60x reduction), then 24 hourly (60x again), then 1 daily (24x again).

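As a stand-in for an interactive calculator, this sketch lets you plug in your own deployment parameters; the tier durations, 32-byte readings, and 10:1 compression mirror the worked example above:

```python
def tier_gb(sensors, points_per_sensor, bytes_per_point=32, compression=10):
    """Compressed storage in GB for one retention tier."""
    return sensors * points_per_sensor * bytes_per_point / 1e9 / compression

tiers = {                       # points retained per sensor in each tier
    "7 d raw @ 1 s":  7 * 86_400,
    "30 d @ 1 min":   30 * 1_440,
    "1 y @ 1 h":      365 * 24,
    "daily forever":  365,       # first year of daily rollups
}
total = sum(tier_gb(5_000, pts) for pts in tiers.values())
print(f"{total:.1f} GB")        # 10.5 GB (vs ~504 GB for a year of compressed raw)
```

The table above reports ~10.6 GB because it leaves the small daily tier uncompressed; compressing every tier uniformly, as here, gives ~10.5 GB – the difference is negligible either way.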

Tradeoff: Write-Optimized vs Read-Optimized Storage Layout

Option A: Write-Optimized (LSM Tree, Append-Only)

  • Write throughput: 500K-1M writes/sec (sequential appends)
  • Write latency: 0.5-2ms (memory buffer only)
  • Read latency for recent data: 1-5ms (in-memory memtable)
  • Read latency for historical data: 10-50ms (multiple SST file scans)
  • Space amplification: 10-30% overhead from compaction tombstones
  • Best for: High-velocity sensor ingestion where writes dominate (>10:1 write:read ratio)

Option B: Read-Optimized (B-Tree, In-Place Updates)

  • Write throughput: 10K-50K writes/sec (random I/O for updates)
  • Write latency: 5-20ms (index maintenance on each write)
  • Read latency for recent data: 2-10ms (B-tree traversal)
  • Read latency for historical data: 5-15ms (consistent B-tree performance)
  • Space amplification: Minimal (in-place updates)
  • Best for: Dashboard-heavy workloads where reads dominate (>5:1 read:write ratio)

Decision Factors:

  • Choose Write-Optimized when: Ingestion rate exceeds 50K writes/sec, real-time alerting uses only recent data (last hour), batch analytics can tolerate 50ms query latency, storage cost is more important than query speed
  • Choose Read-Optimized when: Dashboards require sub-10ms response for any time range, historical queries are frequent (compliance, trend analysis), write volume is moderate (<50K/sec), consistent query latency is critical for SLAs
  • Hybrid approach: InfluxDB uses an LSM-style (TSM) engine for fast ingestion, while TimescaleDB pairs PostgreSQL’s B-tree-based write path with read-optimized continuous aggregates for frequently accessed historical summaries

14.4.3 Implementing Retention in InfluxDB

InfluxDB 1.x (InfluxQL):

-- Retention Policy: 7 days for raw data (default write target)
CREATE RETENTION POLICY "raw_data" ON "iot_sensors"
  DURATION 7d
  REPLICATION 1
  DEFAULT

-- Destination retention policies for the downsampled data
-- (these must exist before the continuous queries below can write into them)
CREATE RETENTION POLICY "30_days" ON "iot_sensors" DURATION 30d REPLICATION 1
CREATE RETENTION POLICY "1_year" ON "iot_sensors" DURATION 52w REPLICATION 1

-- Continuous Query: Downsample to 1-minute averages
CREATE CONTINUOUS QUERY "downsample_1m" ON "iot_sensors"
BEGIN
  SELECT mean(value) AS value, max(value) AS max_value, min(value) AS min_value
  INTO "30_days"."sensor_data_1m"
  FROM "raw_data"."sensor_data"
  GROUP BY time(1m), sensor_id, location
END

-- Continuous Query: Downsample to 1-hour averages
CREATE CONTINUOUS QUERY "downsample_1h" ON "iot_sensors"
BEGIN
  SELECT mean(value) AS value, max(value) AS max_value, min(value) AS min_value
  INTO "1_year"."sensor_data_1h"
  FROM "30_days"."sensor_data_1m"
  GROUP BY time(1h), sensor_id, location
END

InfluxDB 2.x+ (Flux): In InfluxDB 2.x, retention is configured per-bucket rather than through named retention policies, and downsampling uses Flux tasks instead of continuous queries:

// Flux task: Downsample raw data to 1-minute aggregates every minute
// Note: Run separate tasks for mean, min, and max to preserve spike detection
option task = {name: "downsample_1m_mean", every: 1m}

from(bucket: "raw_7d")
  |> range(start: -task.every)
  |> filter(fn: (r) => r._measurement == "sensor_data")
  |> aggregateWindow(every: 1m, fn: mean, createEmpty: false)
  |> set(key: "_field", value: "value_mean")
  |> to(bucket: "downsampled_30d")

// Separate task for min/max (always preserve these for anomaly detection)
// option task = {name: "downsample_1m_max", every: 1m}
// ... same pattern with fn: max and _field: "value_max"

14.4.4 Implementing Retention in TimescaleDB

-- Enable compression on hypertable
ALTER TABLE sensor_data SET (
  timescaledb.compress,
  timescaledb.compress_segmentby = 'sensor_id, location',
  timescaledb.compress_orderby = 'time DESC'
);

-- Compression policy: Compress data older than 2 days (saves space while retained)
SELECT add_compression_policy('sensor_data', INTERVAL '2 days');

-- Retention policy: Drop raw data older than 7 days
-- Note: compression interval must be shorter than retention interval
SELECT add_retention_policy('sensor_data', INTERVAL '7 days');

-- Continuous aggregate: 1-minute averages
CREATE MATERIALIZED VIEW sensor_data_1m
WITH (timescaledb.continuous) AS
SELECT
  time_bucket('1 minute', time) AS bucket,
  sensor_id,
  location,
  AVG(temperature) as avg_temp,
  MAX(temperature) as max_temp,
  MIN(temperature) as min_temp
FROM sensor_data
GROUP BY bucket, sensor_id, location;

-- Refresh policy: Update continuous aggregate every 10 minutes
SELECT add_continuous_aggregate_policy('sensor_data_1m',
  start_offset => INTERVAL '1 hour',
  end_offset => INTERVAL '1 minute',
  schedule_interval => INTERVAL '10 minutes');

Tradeoff: Per-Device Partitioning vs Time-Only Partitioning

Option A: Time-Only Partitioning (Chunks by time intervals)

  • Storage overhead: Minimal - single partition per time window
  • Write throughput: Maximum - all devices write to same active chunk
  • Query “all devices last hour”: Very fast - scan single chunk (5-20ms)
  • Query “one device last year”: Slow - scan all 365 daily chunks (200-500ms)
  • Index size: Small - one time index per chunk
  • Best for: Dashboards showing aggregate metrics across all devices

Option B: Time + Device Partitioning (Chunks by time AND device_id)

  • Storage overhead: Higher - N partitions per time window (N = device groups)
  • Write throughput: Good - devices distributed across chunks
  • Query “all devices last hour”: Slower - scan N chunks (20-100ms)
  • Query “one device last year”: Fast - scan only device’s chunks (10-50ms)
  • Index size: Larger - indexes per device partition
  • Best for: Device-specific historical analysis, fleet management, per-device SLAs

Decision Factors:

  • Choose Time-Only when: Queries aggregate across all devices (avg temperature across building), device count is small (<1,000), storage budget is tight, write throughput is critical
  • Choose Time + Device when: Queries focus on individual device history (maintenance logs), device count is large (>10,000), per-device analytics are frequent, compliance requires device-level audit trails
  • TimescaleDB approach: Use space_partitioning on device_id with 8-64 partitions to balance both patterns; InfluxDB uses tag-based series for similar effect

14.4.5 Downsampling Strategy Guidelines

When to downsample:

  • Anomaly detection requires high resolution (keep longer)
  • Compliance regulations mandate raw data (check industry standards)
  • Historical trending doesn’t need full resolution (aggregate aggressively)

What aggregations to keep:

  • Mean: General trending (temperature, humidity)
  • Max/Min: Detect threshold violations (peak power usage)
  • Median: Robust to outliers (network latency)
  • Percentiles (p95, p99): Tail behavior analysis (response times)
  • Count: Event frequency (motion detection triggers)
  • Sum: Cumulative metrics (energy consumption)

Common mistake: Only storing averages. If a sensor spiked to 95°C for 5 seconds in a 1-minute window, averaging to ~25°C hides the critical event. Always keep max/min for anomaly detection.
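The averaging pitfall is easy to demonstrate; this sketch assumes a 20°C baseline around the 5-second spike:

```python
# One minute of 1 Hz readings: 55 s at a 20 °C baseline, 5 s spiking to 95 °C
window = [20.0] * 55 + [95.0] * 5

avg_only = sum(window) / len(window)   # 26.25 °C – looks unremarkable
with_max = max(window)                 # 95.0 °C – the spike survives downsampling
```

Downsampled to a lone minute-average, the event is invisible; stored alongside max (and min), it remains detectable forever.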

Tradeoff: Aggressive vs Conservative Downsampling

Option A: Aggressive Downsampling – Maximize storage savings

  • Raw data retention: 24-48 hours only
  • Downsample to 1-minute averages after 48 hours
  • Downsample to hourly after 7 days
  • Storage for 10K sensors at 1 Hz: ~50 GB/year (99% reduction)
  • Query latency on historical data: 10-50ms (pre-aggregated)
  • Risk: May miss brief anomalies in historical analysis

Option B: Conservative Downsampling – Preserve detail for analysis

  • Raw data retention: 30 days
  • Downsample to 1-minute after 30 days (keep min/max/avg/count)
  • Downsample to hourly after 1 year
  • Storage for 10K sensors at 1 Hz: ~500 GB/year (95% reduction)
  • Query latency on historical data: 50-200ms (more data to scan)
  • Benefit: Can detect patterns that span multiple data points

Decision Factors:

  • Choose Aggressive when: Anomaly detection uses only real-time data (last 24h), regulatory requirements don’t mandate raw data retention, storage budget is severely constrained, queries are primarily dashboard/trend focused
  • Choose Conservative when: ML model training requires historical raw data, compliance requires 30+ day raw retention (HIPAA, GDPR audit), post-incident analysis needs second-level detail, pattern detection spans hours/days (not minutes)
  • Critical: Always preserve min/max alongside averages to catch spikes hidden by averaging

14.5 Time Synchronization Pitfalls

Time-series databases depend on accurate timestamps. These common pitfalls cause data corruption and query failures:

Common Pitfall: Ignoring Clock Drift in Distributed Systems

The mistake: Using device local time without synchronization, assuming clocks stay accurate.

Symptoms:

  • Events appear out of order in time-series queries
  • Time-based queries return wrong or incomplete results
  • Aggregation windows miss or duplicate data points
  • Data from different sensors doesn’t correlate properly

Why it happens: Device clocks drift over time, especially without network synchronization. A typical RTC can drift seconds per day. After 30 days, a device could be off by minutes. Deep sleep modes can accelerate drift.
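The "seconds per day" claim is easy to quantify. Assuming a crystal with a 50 ppm tolerance (a common RTC spec, used here purely as an illustrative figure):

```python
# Worst-case drift for a free-running RTC crystal
ppm = 50                                   # assumed frequency tolerance
drift_per_day_s = ppm * 1e-6 * 86_400      # 4.32 seconds/day
drift_30d_min = drift_per_day_s * 30 / 60  # 2.16 minutes after 30 days unsynced
```

Two minutes of skew is enough to misplace readings across aggregation windows and break cross-sensor correlation.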

How to diagnose:

  1. Compare device timestamps against a known-accurate NTP server
  2. Check for gaps or overlaps in time-series data
  3. Look for events that should correlate but have timestamp mismatches
  4. Monitor time sync status in device telemetry

The fix:

# Synchronize periodically with NTP.
# (ntp_client and device are placeholders for your platform's NTP
# client and clock driver – e.g. ntplib on Linux, SNTP on an RTOS.)
def sync_time():
    ntp_time = ntp_client.get_time()
    device.set_time(ntp_time)

# Re-sync after deep sleep, when the RTC has been free-running
def wake_from_sleep():
    sync_time()

# Include a sync-quality indicator so downstream systems can
# weigh how much to trust each timestamp
reading = {
    "value": sensor.read(),
    "timestamp": device.get_time(),
    "time_quality": "ntp_synced"  # or "free_running"
}

Prevention: Implement NTP or PTP synchronization. Re-sync after sleep cycles. Use GPS time for outdoor deployments. Include time quality indicators in data so downstream systems can handle uncertainty.

Common Pitfall: Wrong Timestamp Units (Seconds vs Milliseconds)

The mistake: Mixing timestamp units across systems without explicit conversion.

Symptoms:

  • Dates appear in 1970 or far in the future (year 57000+)
  • Time calculations off by factor of 1000
  • Database errors on timestamp insertion
  • Queries return no results despite data existing

Why it happens: Unix timestamps can be in seconds, milliseconds, microseconds, or nanoseconds. Python’s time.time() returns seconds. JavaScript’s Date.now() returns milliseconds. Databases vary in their expectations.

How to diagnose:

  1. Check timestamp magnitude: seconds (~1.7B), milliseconds (~1.7T)
  2. Look for dates in 1970 (timestamp interpreted as 0 or near-zero)
  3. Verify database schema timestamp precision settings
  4. Test with known timestamps across system boundaries

The fix:

import time
from datetime import datetime, timezone

# WRONG: Python gives seconds, but the database expects milliseconds
timestamp = time.time()  # e.g. 1736424000 (seconds)
db.insert(timestamp)     # Interpreted as 1970-01-21 if read as milliseconds!

# CORRECT: Be explicit about units
timestamp_seconds = time.time()
timestamp_ms = int(time.time() * 1000)

# BEST: Use ISO 8601 format to avoid ambiguity
timestamp_iso = datetime.now(timezone.utc).isoformat()
# "2026-01-09T12:00:00+00:00"

Prevention: Document timestamp units in API specifications. Standardize on milliseconds for IoT systems. Validate timestamp ranges on receipt (reject values outside reasonable bounds). Use ISO 8601 format when human readability matters.
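One way to implement the "validate timestamp ranges" advice is a magnitude heuristic. This is a sketch – the function name and band thresholds are my own, and they assume timestamps from roughly the current era:

```python
def guess_unit(ts: float) -> str:
    """Guess a Unix timestamp's unit from its magnitude.

    Present-day timestamps are ~1.7e9 s, ~1.7e12 ms, ~1.7e15 us,
    ~1.7e18 ns, so the decades are unambiguous. Values falling in
    the gaps between bands are rejected rather than guessed.
    """
    if 1e9 <= ts < 1e11:
        return "seconds"
    if 1e12 <= ts < 1e14:
        return "milliseconds"
    if 1e15 <= ts < 1e17:
        return "microseconds"
    if 1e18 <= ts < 1e20:
        return "nanoseconds"
    raise ValueError(f"timestamp {ts} outside plausible ranges")
```

Failing loudly on the ambiguous gaps is deliberate: a validator that silently rescales is just a subtler version of the original bug.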

Common Pitfall: Timezone Handling Confusion

The mistake: Storing local time without timezone information, or mixing timezones from different devices.

Symptoms:

  • Events appear at wrong times in dashboards
  • DST transitions cause gaps or duplicate data
  • Cross-region data correlation fails completely
  • Reports show times that don’t match user expectations

Why it happens: Developers often use datetime.now() which returns local time without timezone info. Devices in different timezones appear to have the same timestamp but are actually hours apart.

How to diagnose:

  1. Check if timestamps include timezone offset
  2. Look for gaps during DST transitions (spring) or duplicates (fall)
  3. Compare events from devices in different timezones
  4. Verify dashboard times match actual event times

The fix:

from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+ (use pytz for older versions)

# WRONG: Local time without timezone info
timestamp = datetime.now()  # No TZ info - ambiguous!

# CORRECT: Always use UTC internally
timestamp = datetime.now(timezone.utc)

# For display: Convert to user's local timezone
user_tz = ZoneInfo('America/New_York')
local_time = timestamp.astimezone(user_tz)

Prevention: Use UTC for all storage and transmission. Store timezone-aware timestamps only. Convert to local time exclusively for display. Test DST transitions explicitly (March and November in US).

A fourth critical pitfall – using processing time instead of event time – directly impacts retention policies. See the detailed Processing Time vs Event Time callout in the reference material below for implementation guidance.

14.6 Case Study: Tesla’s Vehicle Telemetry Retention Strategy

Tesla collects telemetry from over 4 million vehicles, each generating approximately 100 sensor channels (speed, battery voltage, motor temperature, GPS, brake pressure, steering angle, camera triggers, etc.). Understanding their retention approach illustrates enterprise-scale decision-making.

Data generation per vehicle:

| Sensor Category | Channels | Sample Rate | Daily Volume |
|-----------------|----------|-------------|--------------|
| Drivetrain (motor, battery, inverter) | 45 | 10 Hz | 1.2 GB |
| Chassis (speed, steering, brakes) | 30 | 50 Hz | 3.8 GB |
| Climate (cabin temp, HVAC) | 12 | 1 Hz | 33 MB |
| GPS + navigation | 5 | 1 Hz | 14 MB |
| Camera triggers (event snapshots) | 8 | Event-driven | 200 MB avg |
| Total per vehicle per day | 100 | Mixed | ~5.2 GB |

At 4 million vehicles, the fleet generates approximately 20 petabytes per day in raw telemetry. Even storing just one month of raw data (600 PB) at $0.004/GB/month (S3 Glacier) would cost $2.4 million per month – and that is just cold storage, not accounting for retrieval costs or managed database pricing which would be orders of magnitude higher. Clearly, aggressive retention tiering is required.

Tesla’s reported approach (based on patent filings and engineering talks):

| Tier | Retention Period | Resolution | Storage Per Vehicle | Fleet Storage |
|------|------------------|------------|---------------------|---------------|
| Hot (edge, in-car) | 1 hour | Full (10-50 Hz) | 5.2 GB (circular buffer) | N/A (local) |
| Event capture | 30 days (cloud) | Full, 30-sec clips | ~500 MB/month | 2 PB/month |
| Aggregated telemetry | 1 year | 1-minute averages | ~15 GB/year | 60 PB/year |
| Fleet statistics | Indefinite | Hourly per model/region | Negligible | ~1 TB total |

Key design decisions:

  1. Edge-first processing: The vehicle’s onboard computer runs anomaly detection locally. Only anomalies trigger a 30-second data upload (15 seconds before and after the event). This reduces cloud ingestion by approximately 99.99% compared to continuous upload.

  2. Min/max preservation for safety: Battery voltage and motor temperature retention always includes min/max values, not just averages. A 1-minute average battery voltage of 395V looks normal, but a min of 340V within that minute indicates a cell balancing fault that could lead to thermal runaway.

  3. Compliance-driven retention: Autonomous driving event data (Autopilot activations, disengagements, near-misses) is retained for 3 years at full resolution per NHTSA investigation requirements – even though the storage cost is 50x higher per event than the aggregated tier.

Lesson for IoT architects: Tesla’s approach demonstrates that retention strategy is driven by three factors in order of priority: (1) regulatory requirements (what MUST you keep), (2) safety value (what COULD save lives if analyzed), (3) cost optimization (how cheaply can you store what remains). Start with regulatory, then safety, and optimize cost last.
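The edge-first pattern in decision 1 is essentially a ring buffer plus post-event capture. A minimal sketch of that mechanism – the class name, rates, and buffer sizes are illustrative assumptions, not Tesla's actual code:

```python
from collections import deque

class EventCapture:
    """Keep the last pre_s seconds in RAM; when an anomaly fires,
    snapshot that buffer and record post_s more seconds, then hand
    the combined clip to an uploader. Everything else is discarded."""

    def __init__(self, rate_hz=10, pre_s=15, post_s=15):
        self.pre = deque(maxlen=rate_hz * pre_s)  # rolling pre-event window
        self.post_len = rate_hz * post_s
        self.active = None                        # clip currently being finished
        self.post_left = 0
        self.clips = []                           # completed clips (would upload)

    def push(self, reading, anomaly=False):
        if self.active is not None:               # finishing a clip's post-window
            self.active.append(reading)
            self.post_left -= 1
            if self.post_left == 0:
                self.clips.append(self.active)    # upload point
                self.active = None
        elif anomaly:                             # start clip: pre-buffer + trigger
            self.active = list(self.pre) + [reading]
            self.post_left = self.post_len
        self.pre.append(reading)

cap = EventCapture()
for i in range(1_000):                            # 100 s of 10 Hz readings
    cap.push(i, anomaly=(i == 500))
# One 30-second clip captured: 150 pre + trigger + 150 post = 301 readings
```

Only the 301-reading clip ever leaves the device; the other ~700 readings in this run are dropped locally, which is the source of the ~99.99% ingestion reduction.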

14.7 Summary

Retention policies and time synchronization are critical for sustainable IoT time-series systems.

Key Takeaways:

  1. Implement retention from day one: Multi-tier policies reduce storage by 95-98%. Don’t wait until storage becomes a problem.

  2. Always preserve min/max: When downsampling, averages hide critical spikes. Store min, max, and count alongside mean.

  3. Choose the right partition strategy: Time-only for aggregate dashboards, time+device for per-device analytics.

  4. Synchronize clocks: Use NTP, re-sync after sleep, and include time quality indicators in telemetry.

  5. Use UTC everywhere: Store and transmit UTC, convert to local time only for display.

  6. Use event time, not processing time: Timestamp data at the source, not when it arrives at the server.

The Sensor Squad’s diary is getting TOO BIG! How do they keep the important stuff?

Sammy the Sensor has been writing down the temperature every single second for a whole year. That is 31 MILLION entries! His diary is SO heavy he cannot even lift it!

“We need to clean up!” says Max the Microcontroller. But they cannot just throw away old data – what if they need it later?

So Max invents a clever system called data retention tiers:

Tier 1 - This Week (Full Detail) “For the last 7 days, keep EVERY single reading. We might need to zoom into exactly what happened at 3:47:12 PM last Tuesday!”

Tier 2 - This Month (1-Minute Summaries) “For the last 30 days, keep one summary per minute. Instead of 60 readings, we keep the average, minimum, and maximum. That is 57 readings we can delete!”

Tier 3 - This Year (Hourly Summaries) “For the last year, keep one summary per hour. We can still see daily patterns, but we save 3,599 out of every 3,600 readings!”

Tier 4 - Forever (Daily Summaries) “For anything older than a year, keep just one number per day. Was January 15, 2024 hot or cold? That is all we need!”

Lila the LED does the math: “Sammy’s 31 million readings shrink to about 660,000! That is a 98% reduction!”

Bella the Battery adds an important warning: “But ALWAYS keep the minimum and maximum along with the average. If you only keep averages, you will miss the 5-second spike when the oven almost caught fire!”

“Think of it like photos,” says Max. “Today’s photos: keep them all. Last month: keep the best ones. Last year: keep one per trip. 10 years ago: keep one per year. You still remember your life, but your photo album fits on one shelf instead of filling a whole room!”

14.7.1 Try This at Home!

Write down the temperature every hour for 3 days (or check a weather app). Now summarize each day with just 3 numbers: the average, the coldest, and the hottest temperature. See how much space you saved? You went from 72 numbers to just 9, but you still know the important stuff about each day!

Worked Example: HIPAA-Compliant Retention for Hospital Telemetry

A hospital IoT platform monitors 500 patient beds with continuous vital sign sensors (heart rate, SpO2, blood pressure, temperature). HIPAA requires 6-year retention of health records, but storing raw 1 Hz sensor data for 6 years is cost-prohibitive. Design a compliant multi-tier retention strategy.

Regulatory requirements (HIPAA § 164.316):

  • Retain “documentation of actions, activities, or assessments” for 6 years
  • Must prove no gaps in monitoring during patient admission
  • Audit trail required for all clinical decisions

Data generation (per bed):

  • 4 vital signs × 1 Hz × 16 bytes/reading = 64 bytes/second
  • Per bed daily: 64 bytes/s × 86,400 = 5.53 MB/day
  • 500 beds: 5.53 MB × 500 = 2.76 GB/day raw
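The per-bed arithmetic above, checked in code:

```python
# Hospital example: 500 beds, 4 vitals at 1 Hz, 16 bytes per reading
beds, vitals, rate_hz, bytes_per_reading = 500, 4, 1, 16

per_bed_mb_day = vitals * rate_hz * bytes_per_reading * 86_400 / 1e6  # ≈ 5.53 MB/day
fleet_gb_day = per_bed_mb_day * beds / 1_000                          # ≈ 2.76 GB/day
raw_6y_tb = fleet_gb_day * 365 * 6 / 1_000                            # ≈ 6.05 TB raw
```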

Compliant tiered retention strategy:

| Tier | Retention Period | Resolution | Regulatory Justification | Storage |
|------|------------------|------------|--------------------------|---------|
| Tier 1: Critical Events | 6 years | Full resolution (1 Hz) for ±5 min around alerts | Proves clinical decision-making context | 500 beds × 5 alerts/day × 600 sec × 64 bytes = 96 MB/day → 210 GB/6 years |
| Tier 2: Raw Monitoring | 72 hours | Full 1 Hz for all patients | Recent data for diagnosis, troubleshooting | 2.76 GB/day × 3 days = 8.3 GB rolling |
| Tier 3: 1-min Summaries | 90 days | 1-minute avg/min/max | Operational review, trend analysis | 2.76 GB/day ÷ 60 × 90 days = 4.1 GB |
| Tier 4: Hourly Summaries | 6 years | Hourly avg/min/max + event counts | Compliance retention, proves no gaps | 2.76 GB/day ÷ 3600 × 365 × 6 = 1.7 GB |
| Total 6-year storage | – | – | – | ~224 GB (vs 6 TB raw for 6 years) |

Retention policy implementation (TimescaleDB):

-- Tier 1: Critical events (6-year retention)
CREATE TABLE critical_events (
    time TIMESTAMPTZ NOT NULL, patient_id INT,
    vital_type TEXT, alert_type TEXT,
    context_data JSONB);  -- ±5 min raw data around alert
SELECT create_hypertable('critical_events', 'time');
SELECT add_retention_policy('critical_events', INTERVAL '6 years');

-- Tier 2: Raw vitals at 1 Hz (72-hour rolling window)
CREATE TABLE vitals_raw (
    time TIMESTAMPTZ NOT NULL, patient_id INT,
    heart_rate INT, spo2 INT,
    bp_systolic INT, bp_diastolic INT, temperature REAL);
SELECT create_hypertable('vitals_raw', 'time');
SELECT add_retention_policy('vitals_raw', INTERVAL '72 hours');

-- Tier 3: 1-min continuous aggregate (90 days)
CREATE MATERIALIZED VIEW vitals_1min
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 minute', time) AS bucket, patient_id,
       AVG(heart_rate) as hr_avg, MIN(heart_rate) as hr_min,
       MAX(heart_rate) as hr_max, COUNT(*) as sample_count
FROM vitals_raw GROUP BY bucket, patient_id;
SELECT add_retention_policy('vitals_1min', INTERVAL '90 days');

-- Tier 4: Hourly aggregate (6-year compliance)
-- Note: building a continuous aggregate on top of another continuous
-- aggregate (hierarchical) requires TimescaleDB 2.9 or later
CREATE MATERIALIZED VIEW vitals_hourly
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 hour', bucket) AS hour, patient_id,
       AVG(hr_avg) as hr_avg, MIN(hr_min) as hr_min,
       MAX(hr_max) as hr_max, SUM(sample_count) as total_samples
FROM vitals_1min GROUP BY hour, patient_id;
SELECT add_retention_policy('vitals_hourly', INTERVAL '6 years');

Compliance verification query:

-- Prove continuous monitoring for patient admission (no gaps >5 minutes)
SELECT hour, patient_id, total_samples,
       CASE WHEN total_samples < 3300 THEN 'GAP DETECTED' ELSE 'OK' END as status
FROM vitals_hourly
WHERE patient_id = 12345
  AND hour BETWEEN '2023-01-15 08:00' AND '2023-01-17 18:00'
ORDER BY hour;
-- Expected: 3600 samples/hour at 1 Hz. <3300 means >5 min gap in that hour.

Storage cost analysis (AWS S3 for compliance tier):

  • Tier 1 (critical events): 210 GB × $0.023/GB/month (S3 Standard) = $4.83/month
  • Tier 4 (hourly summaries): 1.7 GB × $0.023/GB/month = $0.04/month
  • 6-year compliance storage: ~$5/month on S3 vs ~$138/month for raw 6 TB on S3 Standard (or far more on a managed database)

This design achieves HIPAA compliance at 98% cost reduction by preserving full-resolution data only for critical events and using hourly summaries with sample counts to prove continuous monitoring.

Choosing aggregation functions by data type:

| Data Type | Best Aggregation | Avoid | Rationale | Example |
|-----------|------------------|-------|-----------|---------|
| Temperature | Mean, Min, Max | Median (lossy), Last (ignores history) | Temperature changes gradually; mean preserves trend, min/max catch spikes | HVAC monitoring: mean=22.5°C, max=25°C (brief spike) |
| Pressure | Mean, Min, Max, StdDev | Mode (rare spikes dominate) | Pressure fluctuates; stddev indicates stability | Industrial boiler: mean=150 PSI, stddev=5 (stable) vs 20 (oscillating) |
| Binary (motion, door) | Sum (event count), Duration in state | Mean (meaningless for 0/1) | Binary sensors: count state changes, measure time in each state | Motion sensor: 15 triggers/hour, 18 min active |
| Energy (kWh) | Sum, Max (peak demand) | Mean (loses total consumption) | Energy is cumulative; sum = total consumed, max = peak load | Building: sum=450 kWh/day, max=75 kW (peak) |
| Network latency | Mean, p50, p95, p99 | Min (irrelevant), Max alone (outlier-sensitive) | Latency is long-tailed; percentiles capture user experience | API: mean=50ms (good), p99=2s (some users suffer) |
| Count/rate | Sum, Max rate | Mean (double-counts) | Counts are cumulative; sum preserves total | Requests/sec: sum=8.6M requests/day, max=150 req/s |
| GPS coordinates | Median, Centroid | Mean (drifts with outliers) | Location has outliers (GPS glitches); median or centroid robust | Vehicle route: median filters GPS errors |

General principles:

  1. Always include min/max for anomaly detection (even if mean is primary metric)
  2. Use percentiles for long-tailed distributions (latency, response time)
  3. Preserve counts for rate calculations (events/sec requires sum of events + time window)
  4. Use sum for cumulative metrics (energy, data transfer, requests)
  5. Include sample count to detect data gaps (if count < expected, monitoring failed)
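
These principles can be illustrated with a short standard-library sketch (the data values are invented for illustration):

```python
# Applying the aggregation guidance above to three illustrative streams.
import statistics

# Long-tailed latency (ms): mean is skewed by one outlier; median and
# tail values describe user experience better.
latencies = [45, 50, 48, 52, 47, 49, 51, 46, 2000, 50]
mean = statistics.mean(latencies)   # pulled upward by the 2000 ms outlier
p50 = statistics.median(latencies)  # typical request
worst = max(latencies)              # tail that the mean hides

# Cumulative energy (kWh per interval): sum preserves the total,
# max preserves peak demand; mean would lose both.
energy_kwh = [1.2, 0.8, 3.5, 2.1]
total_kwh, peak_kwh = sum(energy_kwh), max(energy_kwh)

# Binary motion sensor: count rising edges and time-in-state;
# the mean of a 0/1 stream is meaningless.
states = [0, 0, 1, 1, 0, 1, 0, 0, 1, 1]
triggers = sum(1 for prev, cur in zip(states, states[1:])
               if prev == 0 and cur == 1)
active_intervals = sum(states)

print(f"latency: mean={mean:.0f}ms p50={p50}ms max={worst}ms")
print(f"energy: total={total_kwh:.1f} kWh, peak interval={peak_kwh} kWh")
print(f"motion: {triggers} triggers, {active_intervals} intervals active")
```

Note how the single 2-second outlier drags the latency mean to nearly 5x the median, which is exactly why percentiles belong in the aggregate.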
Common Mistake: Using Processing Time Instead of Event Time for Retention

Teams configure retention policies based on when data arrives at the database (processing time) instead of when the event actually occurred (event time), causing regulatory compliance failures and incorrect data expiration.

Symptoms: Late-arriving data lands in the wrong aggregation windows, real-time dashboards disagree with batch reports, and data expires relative to when it arrived rather than when it was measured – too late for delayed uploads, and dangerously early when a device clock is wrong.

What goes wrong: A manufacturing plant uses TimescaleDB with its hypertable partitioned on created_at – the time the database received each record – so its 30-day retention policy also expires data by created_at. Edge gateways buffer sensor data locally during network outages and transmit in batches when connectivity resumes. After a 5-day outage, the gateway uploads 5 days of buffered readings, and every row is stamped with the current created_at. Reports that bucket by created_at now show zero production for the outage window and an impossible spike on the reconnection day, and the backfilled rows will expire 30 days after upload rather than 30 days after measurement. Result: 5 days of production history is unusable for analysis, and the table's actual retention no longer matches the compliance requirement, which is defined in event time.

Why it fails: Retention policies and aggregation windows must operate on event time (when the measurement was taken), not processing time (when the database received it). Late-arriving data is valid historical data that belongs in its original event-time window, not fresh data stamped with its arrival time.

The correct approach:

  1. Always use event timestamp for retention policies:

    -- BAD: Hypertable partitioned by processing time
    CREATE TABLE sensor_data (
        created_at TIMESTAMPTZ DEFAULT NOW(),  -- Processing time
        event_time TIMESTAMPTZ,                -- Event time
        value REAL
    );
    SELECT create_hypertable('sensor_data', 'created_at');  -- Partitioned by processing time!
    SELECT add_retention_policy('sensor_data', INTERVAL '30 days');  -- Drops based on created_at!
    
    -- GOOD: Retention based on event time
    SELECT create_hypertable('sensor_data', 'event_time');  -- Partition by event_time
    SELECT add_retention_policy('sensor_data', INTERVAL '30 days');  -- Now uses event_time
  2. Handle late-arriving data gracefully:

    # Validate event_time on ingestion and reject implausible timestamps
    from datetime import datetime, timedelta, timezone
    
    MAX_LATE_ARRIVAL = timedelta(days=7)
    MAX_FUTURE_DRIFT = timedelta(minutes=5)
    
    def validate_event_time(event_time: datetime) -> bool:
        """Accept events up to 7 days old, reject future timestamps."""
        now = datetime.now(timezone.utc)
        if event_time > now + MAX_FUTURE_DRIFT:
            return False  # Reject future timestamps (likely clock error)
        if event_time < now - MAX_LATE_ARRIVAL:
            return False  # Too old - route to dead-letter queue for review
        return True
  3. Separate processing timestamp for audit:

    -- Track both event time and processing time
    CREATE TABLE sensor_data (
        sensor_id TEXT NOT NULL,                 -- Needed for the primary key below
        event_time TIMESTAMPTZ NOT NULL,         -- When measurement was taken
        processed_at TIMESTAMPTZ DEFAULT NOW(),  -- When DB received it
        value REAL,
        PRIMARY KEY (event_time, sensor_id)
    );
    
    -- Audit late arrivals
    SELECT sensor_id, event_time, processed_at,
           (processed_at - event_time) as latency
    FROM sensor_data
    WHERE (processed_at - event_time) > INTERVAL '1 hour'
    ORDER BY latency DESC
    LIMIT 100;
  4. Always use event time in application code:

    # WRONG: Using server arrival time
    def process_message(msg):
        timestamp = datetime.now()  # Processing time - wrong!
        store(timestamp, msg.value)
    
    # CORRECT: Use event time from message
    def process_message(msg):
        timestamp = msg.event_time  # Time event actually occurred
        store(timestamp, msg.value)
  5. Use watermarks for stream processing:

    • In Apache Flink/Kafka Streams: configure a watermark delay (e.g., 5-10 minutes) to accommodate late events
    • Data arriving later than the watermark delay goes to a side output for manual review, not the main pipeline
    # Spark Structured Streaming watermark example
    stream.withWatermark("event_time", "5 minutes")
    # In Flink: use WatermarkStrategy.forBoundedOutOfOrderness(Duration.ofMinutes(5))
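
Tying the steps above together, a small simulation shows why batched uploads must be bucketed (and expired) by event time; the list-of-dicts "store" here is purely illustrative, not a database API:

```python
# Sketch: a gateway uploads 5 days of buffered hourly readings at once.
# Bucketing by processing time collapses them onto the upload date.
from collections import Counter
from datetime import datetime, timedelta, timezone

upload_time = datetime(2023, 6, 30, 12, 0, tzinfo=timezone.utc)

# One reading per hour for the 5-day outage, all received at upload_time.
readings = [
    {"event_time": upload_time - timedelta(hours=h),
     "processed_at": upload_time}
    for h in range(1, 5 * 24 + 1)
]

by_event = Counter(r["event_time"].date() for r in readings)
by_processing = Counter(r["processed_at"].date() for r in readings)

print(len(by_event))       # 6 calendar days - the outage window is visible
print(len(by_processing))  # 1 day - all 120 hours collapse onto the upload date
```

Any daily aggregate, billing query, or retention sweep keyed on processed_at sees the right-hand picture: a gap during the outage and an artificial spike on reconnection day.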

Real consequence: A smart grid utility stored power consumption data with a 90-day retention policy. Their edge gateways incorrectly set _time to the upload timestamp rather than the measurement timestamp. When a substation's network connection failed for 3 weeks, the gateway buffered data locally; upon reconnection, all 3 weeks of backlogged readings were written with the reconnection timestamp. Billing queries for the outage window found no consumption data – three weeks of usage was collapsed onto a single day – and the retention policy later expired those rows relative to the wrong timestamp rather than the actual measurement time. The utility faced regulatory fines for incomplete billing records. The fix: ensure edge gateways write the actual measurement timestamp as _time, not the upload time, and add server-side validation to reject timestamps that are implausibly far from the current time. The lesson: event time must be captured correctly at the source and validated on ingestion.

14.8 Concept Relationships

Prerequisites - Read these first:

  • Time-Series Fundamentals – Compression mechanisms (delta encoding, columnar)
  • Time-Series Platforms – Retention implementation in InfluxDB/TimescaleDB


14.9 What’s Next

| Topic | Chapter | What You'll Learn |
|---|---|---|
| Query Optimization | Query Optimization for IoT | Write efficient queries that span multiple retention tiers, using time-range filters and continuous aggregates to avoid full table scans |
| Stream Processing | Stream Processing | Apply real-time transformations and filtering before data reaches storage, reducing ingestion volume and enabling event-driven architectures |
| Data Quality | Data Quality Monitoring | Detect missing data, outliers, and sensor drift before downsampling locks in corrupted aggregates that cannot be corrected later |
| Sharding Strategies | Sharding Strategies | Distribute retention policies across clustered database nodes, configure per-shard retention, and manage rebalancing when tiers expire |
| Hands-On Practice | Time-Series Practice | Apply retention concepts through guided exercises including multi-tier policy design, storage cost estimation, and compliance verification queries |