14  Data Retention and Downsampling

14.1 Learning Objectives

By the end of this chapter, you will be able to:

  • Design appropriate retention policies and downsampling strategies for IoT deployments
  • Calculate storage requirements for multi-tier retention architectures
  • Implement retention policies in InfluxDB and TimescaleDB
  • Identify and troubleshoot common time synchronization pitfalls in distributed IoT systems
  • Select appropriate aggregation functions for different data preservation requirements

In 60 Seconds

Retention policies automate the lifecycle of IoT data: raw high-resolution readings are kept for days or weeks in hot storage, then replaced with downsampled aggregates for months, and finally archived to cold object storage for years. Without retention policies, IoT storage grows without bound – a 1,000-sensor system at 1-second intervals generates 86.4 million rows per day. The key design decision is matching retention tier duration to the minimum resolution needed for each business function: real-time alerting needs seconds, trend analysis needs hours, compliance reporting needs years.
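The ingest figure above is simple arithmetic; a quick check of the 1,000-sensor example:

```python
# Rows per day for N sensors each sampling every `interval_s` seconds
sensors, interval_s = 1_000, 1
rows_per_day = sensors * (86_400 // interval_s)

print(rows_per_day)  # 86400000 – the 86.4 million rows/day quoted above
```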

14.2 Key Concepts

  • Retention Policy: An automated database rule that drops time chunks older than a configured threshold, preventing unbounded storage growth while optionally triggering data migration to cheaper storage tiers
  • Downsampling: Replacing high-resolution raw readings (1-second) with lower-resolution aggregates (1-minute average, 1-hour max/min) for long-term retention, preserving trend visibility at 60x storage reduction
  • Data Tiering: Automatically migrating time chunks from hot (SSD, fast queries) to warm (HDD, moderate cost) to cold (object storage, archived) storage based on data age and access frequency
  • Retention Period: The duration for which raw sensor data is retained at full resolution before being downsampled or deleted – typically 7-30 days for operational monitoring, 90-365 days for compliance, years for billing records
  • Continuous Aggregate Policy: A TimescaleDB feature that automatically refreshes pre-computed summaries (hourly averages, daily min/max) as raw data arrives, used in conjunction with retention to preserve aggregated history after raw data expires
  • Storage Cost Projection: The calculation of total storage cost over the retention period for each tier, used to design retention policy parameters that meet budget constraints while satisfying data access requirements
  • Compliance Retention: Legal or regulatory requirements mandating that specific data categories (energy usage, medical telemetry, financial transactions) be retained for defined periods – these override performance-driven retention decisions
  • Tiered Storage Automation: Using TimescaleDB’s multi-node or AWS S3 tiering, or scheduled export jobs, to automatically move aged chunks from hot databases to cold object storage without manual intervention

14.3 MVU: Minimum Viable Understanding

Core Concept: Multi-tier retention keeps full-resolution data for recent time periods (hours/days) while progressively aggregating older data to coarser resolutions (minutes, hours, days), achieving 90-98% storage reduction. Why It Matters: Without retention policies, a 5,000-sensor deployment generates ~5 TB/year raw; with proper tiering, the same data fits in ~10-15 GB (with compression) while preserving all analytical capabilities for trending and compliance. Key Takeaway: Implement retention from day one: keep 7 days raw for anomaly detection, 30 days at 1-minute resolution for operations, 1 year at hourly for compliance, and daily aggregates forever for trends. Always preserve min/max alongside averages to catch spikes.

Data retention is about deciding how long to keep IoT data and at what level of detail. Think of it like photo storage: you keep recent photos in full resolution, older ones as thumbnails, and eventually delete duplicates. Similarly, you keep recent sensor data in full detail, summarize older data into averages, and archive or delete the oldest data to manage storage costs.

14.4 Why Retention Matters

Estimated time: ~15 min | Difficulty: Advanced | Unit: P10.C15.U04

IoT systems generate massive data volumes, but not all data is equally valuable over time. Retention policies and downsampling strategies balance storage costs with analytical requirements.

14.4.1 The Retention Challenge

Example: Smart Building with 5,000 Sensors

  • Sampling rate: 1 reading per second
  • Data size: 32 bytes per reading (timestamp + sensor_id + value + metadata)
  • Daily data: 5,000 sensors x 86,400 seconds x 32 bytes = 13.8 GB/day raw
  • Annual data: 13.8 GB x 365 = 5.04 TB/year raw

With compression (10:1):

  • Daily: 1.38 GB/day
  • Annual: 504 GB/year
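The daily and annual figures above can be reproduced in a few lines, assuming the chapter's 32-byte readings and 10:1 compression ratio:

```python
# Smart-building example: 5,000 sensors at 1 Hz, 32 bytes per reading
sensors, rate_hz, bytes_per_reading = 5_000, 1, 32

daily_gb = sensors * rate_hz * 86_400 * bytes_per_reading / 1e9  # ≈ 13.8 GB/day raw
annual_tb = daily_gb * 365 / 1_000                               # ≈ 5.05 TB/year raw
compressed_annual_gb = annual_tb * 1_000 / 10                    # ≈ 504 GB/year at 10:1
```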

Pitfall: Delaying Retention Policy Implementation

The Mistake: Deploying a time-series database without retention policies, planning to “add them later when storage becomes a problem,” resulting in expensive migrations and data loss.

Why It Happens: During development, data volumes are small and storage seems cheap. Teams focus on features, not data lifecycle. By the time storage costs spike, the database contains months of uncompressed data that cannot be easily migrated to a tiered retention scheme. Dropping old chunks loses data that should have been downsampled.

The Fix: Implement retention policies on day one, even in development:

-- TimescaleDB: Set up retention BEFORE inserting data
-- 1. Create continuous aggregate for downsampling
CREATE MATERIALIZED VIEW sensor_hourly
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 hour', time) AS bucket,
       sensor_id,
       AVG(value) as avg_val,
       MIN(value) as min_val,
       MAX(value) as max_val,
       COUNT(*) as sample_count
FROM sensor_data
GROUP BY bucket, sensor_id;

-- 2. Add retention policies IMMEDIATELY
SELECT add_retention_policy('sensor_data', INTERVAL '7 days');
SELECT add_retention_policy('sensor_hourly', INTERVAL '1 year');

-- 3. Schedule downsampling refresh
SELECT add_continuous_aggregate_policy('sensor_hourly',
  start_offset => INTERVAL '2 hours',
  end_offset => INTERVAL '1 hour',
  schedule_interval => INTERVAL '1 hour');

Storage impact calculation:

  • Without retention: 5,000 sensors x 1 Hz x 32 bytes x 365 days = 5.04 TB/year
  • With 7-day raw (rolling) + 1-year hourly: 96.8 GB raw + 1.4 GB hourly = ~98 GB rolling (98% reduction vs annual raw)

Even with compression, multi-year storage becomes expensive and queries slow down as data volumes grow. The solution is downsampling – retaining full resolution for recent data while progressively aggregating older data into coarser summaries.

14.4.2 Multi-Tier Retention Strategy

The following diagrams illustrate how multi-tier retention works in practice. Figure 14.1 shows the data flow between tiers, Figure 14.2 shows the lifecycle over time, and Figure 14.3 compares different strategies by cost.

Figure 14.1: Multi-Tier Data Retention with Progressive Downsampling

Figure 14.2: Timeline showing data lifecycle through retention tiers. Data starts at full resolution and progressively downsamples as it ages, with each tier reducing storage requirements while preserving trend analysis capability.

Figure 14.3: Retention Strategy Cost Analysis: Comparing no strategy, aggressive downsampling, and balanced multi-tier approaches

Storage Calculation:

| Tier | Resolution | Retention | Points/Sensor | Total Storage (5,000 sensors) |
|------|------------|-----------|---------------|-------------------------------|
| Tier 1 | 1 second | 7 days | 604,800 | 96.8 GB (raw) → 9.7 GB (compressed) |
| Tier 2 | 1 minute | 30 days | 43,200 | 6.9 GB (raw) → 0.7 GB (compressed) |
| Tier 3 | 1 hour | 1 year | 8,760 | 1.4 GB (raw) → 0.14 GB (compressed) |
| Tier 4 | 1 day | Forever | 365/year | 0.06 GB/year |
| Total | – | – | – | ~10.6 GB (vs. 504 GB without downsampling) |

Savings: 98% reduction in storage while preserving analytical capabilities.

How do the tiers mathematically combine? For 5,000 sensors at 1 Hz with 32-byte readings:

\[ \begin{aligned} \text{Tier 1 (7d raw):} &\quad 5{,}000 \times 604{,}800 \times 32\text{ bytes} = 96.8\text{ GB raw} \\ &\quad \div 10\text{ compression} = 9.7\text{ GB} \\ \text{Tier 2 (30d, 1-min):} &\quad 5{,}000 \times 43{,}200 \times 32 = 6.9\text{ GB raw} \\ &\quad \div 10 = 0.7\text{ GB} \\ \text{Tier 3 (1y hourly):} &\quad 5{,}000 \times 8{,}760 \times 32 = 1.4\text{ GB raw} \\ &\quad \div 10 = 0.14\text{ GB} \\ \text{Tier 4 (daily forever):} &\quad 5{,}000 \times 365 \times 32 = 58\text{ MB/year} \end{aligned} \]

Total storage: 10.6 GB vs. 504 GB without downsampling = 98% reduction. This isn’t magic—it’s aggressive time-based aggregation: 86,400 raw readings per day become 1,440 minute-averages (60x reduction), then 24 hourly (60x again), then 1 daily (24x again).

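As a stand-in for an interactive calculator, this sketch lets you plug in your own deployment parameters; the tier durations, 32-byte readings, and 10:1 compression mirror the worked example above:

```python
def tier_gb(sensors, points_per_sensor, bytes_per_point=32, compression=10):
    """Compressed storage in GB for one retention tier."""
    return sensors * points_per_sensor * bytes_per_point / 1e9 / compression

tiers = {                       # points retained per sensor in each tier
    "7 d raw @ 1 s":  7 * 86_400,
    "30 d @ 1 min":   30 * 1_440,
    "1 y @ 1 h":      365 * 24,
    "daily forever":  365,       # first year of daily rollups
}
total = sum(tier_gb(5_000, pts) for pts in tiers.values())
print(f"{total:.1f} GB")        # 10.5 GB (vs ~504 GB for a year of compressed raw)
```

The table above reports ~10.6 GB because it leaves the small daily tier uncompressed; compressing every tier uniformly, as here, gives ~10.5 GB – the difference is negligible either way.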

Tradeoff: Write-Optimized vs Read-Optimized Storage Layout

Option A: Write-Optimized (LSM Tree, Append-Only)

  • Write throughput: 500K-1M writes/sec (sequential appends)
  • Write latency: 0.5-2ms (memory buffer only)
  • Read latency for recent data: 1-5ms (in-memory memtable)
  • Read latency for historical data: 10-50ms (multiple SST file scans)
  • Space amplification: 10-30% overhead from compaction tombstones
  • Best for: High-velocity sensor ingestion where writes dominate (>10:1 write:read ratio)

Option B: Read-Optimized (B-Tree, In-Place Updates)

  • Write throughput: 10K-50K writes/sec (random I/O for updates)
  • Write latency: 5-20ms (index maintenance on each write)
  • Read latency for recent data: 2-10ms (B-tree traversal)
  • Read latency for historical data: 5-15ms (consistent B-tree performance)
  • Space amplification: Minimal (in-place updates)
  • Best for: Dashboard-heavy workloads where reads dominate (>5:1 read:write ratio)

Decision Factors:

  • Choose Write-Optimized when: Ingestion rate exceeds 50K writes/sec, real-time alerting uses only recent data (last hour), batch analytics can tolerate 50ms query latency, storage cost is more important than query speed
  • Choose Read-Optimized when: Dashboards require sub-10ms response for any time range, historical queries are frequent (compliance, trend analysis), write volume is moderate (<50K/sec), consistent query latency is critical for SLAs
  • Hybrid approach: InfluxDB uses an LSM-style (TSM) engine for fast ingestion, while TimescaleDB pairs PostgreSQL’s B-tree-based write path with read-optimized continuous aggregates for frequently accessed historical summaries

14.4.3 Implementing Retention in InfluxDB

InfluxDB 1.x (InfluxQL):

-- Retention Policy: 7 days for raw data (default write target)
CREATE RETENTION POLICY "raw_data" ON "iot_sensors"
  DURATION 7d
  REPLICATION 1
  DEFAULT

-- Destination retention policies for the downsampled data
-- (these must exist before the continuous queries below can write into them)
CREATE RETENTION POLICY "30_days" ON "iot_sensors" DURATION 30d REPLICATION 1
CREATE RETENTION POLICY "1_year" ON "iot_sensors" DURATION 52w REPLICATION 1

-- Continuous Query: Downsample to 1-minute averages
CREATE CONTINUOUS QUERY "downsample_1m" ON "iot_sensors"
BEGIN
  SELECT mean(value) AS value, max(value) AS max_value, min(value) AS min_value
  INTO "30_days"."sensor_data_1m"
  FROM "raw_data"."sensor_data"
  GROUP BY time(1m), sensor_id, location
END

-- Continuous Query: Downsample to 1-hour averages
CREATE CONTINUOUS QUERY "downsample_1h" ON "iot_sensors"
BEGIN
  SELECT mean(value) AS value, max(value) AS max_value, min(value) AS min_value
  INTO "1_year"."sensor_data_1h"
  FROM "30_days"."sensor_data_1m"
  GROUP BY time(1h), sensor_id, location
END

InfluxDB 2.x+ (Flux): In InfluxDB 2.x, retention is configured per-bucket rather than through named retention policies, and downsampling uses Flux tasks instead of continuous queries:

// Flux task: Downsample raw data to 1-minute aggregates every minute
// Note: Run separate tasks for mean, min, and max to preserve spike detection
option task = {name: "downsample_1m_mean", every: 1m}

from(bucket: "raw_7d")
  |> range(start: -task.every)
  |> filter(fn: (r) => r._measurement == "sensor_data")
  |> aggregateWindow(every: 1m, fn: mean, createEmpty: false)
  |> set(key: "_field", value: "value_mean")
  |> to(bucket: "downsampled_30d")

// Separate task for min/max (always preserve these for anomaly detection)
// option task = {name: "downsample_1m_max", every: 1m}
// ... same pattern with fn: max and _field: "value_max"

14.4.4 Implementing Retention in TimescaleDB

-- Enable compression on hypertable
ALTER TABLE sensor_data SET (
  timescaledb.compress,
  timescaledb.compress_segmentby = 'sensor_id, location',
  timescaledb.compress_orderby = 'time DESC'
);

-- Compression policy: Compress data older than 2 days (saves space while retained)
SELECT add_compression_policy('sensor_data', INTERVAL '2 days');

-- Retention policy: Drop raw data older than 7 days
-- Note: compression interval must be shorter than retention interval
SELECT add_retention_policy('sensor_data', INTERVAL '7 days');

-- Continuous aggregate: 1-minute averages
CREATE MATERIALIZED VIEW sensor_data_1m
WITH (timescaledb.continuous) AS
SELECT
  time_bucket('1 minute', time) AS bucket,
  sensor_id,
  location,
  AVG(temperature) as avg_temp,
  MAX(temperature) as max_temp,
  MIN(temperature) as min_temp
FROM sensor_data
GROUP BY bucket, sensor_id, location;

-- Refresh policy: Update continuous aggregate every 10 minutes
SELECT add_continuous_aggregate_policy('sensor_data_1m',
  start_offset => INTERVAL '1 hour',
  end_offset => INTERVAL '1 minute',
  schedule_interval => INTERVAL '10 minutes');

Tradeoff: Per-Device Partitioning vs Time-Only Partitioning

Option A: Time-Only Partitioning (Chunks by time intervals)

  • Storage overhead: Minimal - single partition per time window
  • Write throughput: Maximum - all devices write to same active chunk
  • Query “all devices last hour”: Very fast - scan single chunk (5-20ms)
  • Query “one device last year”: Slow - scan all 365 daily chunks (200-500ms)
  • Index size: Small - one time index per chunk
  • Best for: Dashboards showing aggregate metrics across all devices

Option B: Time + Device Partitioning (Chunks by time AND device_id)

  • Storage overhead: Higher - N partitions per time window (N = device groups)
  • Write throughput: Good - devices distributed across chunks
  • Query “all devices last hour”: Slower - scan N chunks (20-100ms)
  • Query “one device last year”: Fast - scan only device’s chunks (10-50ms)
  • Index size: Larger - indexes per device partition
  • Best for: Device-specific historical analysis, fleet management, per-device SLAs

Decision Factors:

  • Choose Time-Only when: Queries aggregate across all devices (avg temperature across building), device count is small (<1,000), storage budget is tight, write throughput is critical
  • Choose Time + Device when: Queries focus on individual device history (maintenance logs), device count is large (>10,000), per-device analytics are frequent, compliance requires device-level audit trails
  • TimescaleDB approach: Use space_partitioning on device_id with 8-64 partitions to balance both patterns; InfluxDB uses tag-based series for similar effect

14.4.5 Downsampling Strategy Guidelines

When to downsample:

  • Anomaly detection requires high resolution (keep longer)
  • Compliance regulations mandate raw data (check industry standards)
  • Historical trending doesn’t need full resolution (aggregate aggressively)

What aggregations to keep:

  • Mean: General trending (temperature, humidity)
  • Max/Min: Detect threshold violations (peak power usage)
  • Median: Robust to outliers (network latency)
  • Percentiles (p95, p99): Tail behavior analysis (response times)
  • Count: Event frequency (motion detection triggers)
  • Sum: Cumulative metrics (energy consumption)

Common mistake: Only storing averages. If a sensor spiked to 95°C for 5 seconds in a 1-minute window, averaging to ~25°C hides the critical event. Always keep max/min for anomaly detection.
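The averaging pitfall is easy to demonstrate; this sketch assumes a 20°C baseline around the 5-second spike:

```python
# One minute of 1 Hz readings: 55 s at a 20 °C baseline, 5 s spiking to 95 °C
window = [20.0] * 55 + [95.0] * 5

avg_only = sum(window) / len(window)   # 26.25 °C – looks unremarkable
with_max = max(window)                 # 95.0 °C – the spike survives downsampling
```

Downsampled to a lone minute-average, the event is invisible; stored alongside max (and min), it remains detectable forever.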

Tradeoff: Aggressive vs Conservative Downsampling

Option A: Aggressive Downsampling – Maximize storage savings

  • Raw data retention: 24-48 hours only
  • Downsample to 1-minute averages after 48 hours
  • Downsample to hourly after 7 days
  • Storage for 10K sensors at 1 Hz: ~50 GB/year (99% reduction)
  • Query latency on historical data: 10-50ms (pre-aggregated)
  • Risk: May miss brief anomalies in historical analysis

Option B: Conservative Downsampling – Preserve detail for analysis

  • Raw data retention: 30 days
  • Downsample to 1-minute after 30 days (keep min/max/avg/count)
  • Downsample to hourly after 1 year
  • Storage for 10K sensors at 1 Hz: ~500 GB/year (95% reduction)
  • Query latency on historical data: 50-200ms (more data to scan)
  • Benefit: Can detect patterns that span multiple data points

Decision Factors:

  • Choose Aggressive when: Anomaly detection uses only real-time data (last 24h), regulatory requirements don’t mandate raw data retention, storage budget is severely constrained, queries are primarily dashboard/trend focused
  • Choose Conservative when: ML model training requires historical raw data, compliance requires 30+ day raw retention (HIPAA, GDPR audit), post-incident analysis needs second-level detail, pattern detection spans hours/days (not minutes)
  • Critical: Always preserve min/max alongside averages to catch spikes hidden by averaging

14.5 Time Synchronization Pitfalls

Time-series databases depend on accurate timestamps. These common pitfalls cause data corruption and query failures:

Common Pitfall: Ignoring Clock Drift in Distributed Systems

The mistake: Using device local time without synchronization, assuming clocks stay accurate.

Symptoms:

  • Events appear out of order in time-series queries
  • Time-based queries return wrong or incomplete results
  • Aggregation windows miss or duplicate data points
  • Data from different sensors doesn’t correlate properly

Why it happens: Device clocks drift over time, especially without network synchronization. A typical RTC can drift seconds per day. After 30 days, a device could be off by minutes. Deep sleep modes can accelerate drift.
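The "seconds per day" claim is easy to quantify. Assuming a crystal with a 50 ppm tolerance (a common RTC spec, used here purely as an illustrative figure):

```python
# Worst-case drift for a free-running RTC crystal
ppm = 50                                   # assumed frequency tolerance
drift_per_day_s = ppm * 1e-6 * 86_400      # 4.32 seconds/day
drift_30d_min = drift_per_day_s * 30 / 60  # 2.16 minutes after 30 days unsynced
```

Two minutes of skew is enough to misplace readings across aggregation windows and break cross-sensor correlation.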

How to diagnose:

  1. Compare device timestamps against a known-accurate NTP server
  2. Check for gaps or overlaps in time-series data
  3. Look for events that should correlate but have timestamp mismatches
  4. Monitor time sync status in device telemetry

The fix:

# Synchronize periodically with NTP.
# (ntp_client and device are placeholders for your platform's NTP
# client and clock driver – e.g. ntplib on Linux, SNTP on an RTOS.)
def sync_time():
    ntp_time = ntp_client.get_time()
    device.set_time(ntp_time)

# Re-sync after deep sleep, when the RTC has been free-running
def wake_from_sleep():
    sync_time()

# Include a sync-quality indicator so downstream systems can
# weigh how much to trust each timestamp
reading = {
    "value": sensor.read(),
    "timestamp": device.get_time(),
    "time_quality": "ntp_synced"  # or "free_running"
}

Prevention: Implement NTP or PTP synchronization. Re-sync after sleep cycles. Use GPS time for outdoor deployments. Include time quality indicators in data so downstream systems can handle uncertainty.

Common Pitfall: Wrong Timestamp Units (Seconds vs Milliseconds)

The mistake: Mixing timestamp units across systems without explicit conversion.

Symptoms:

  • Dates appear in 1970 or far in the future (year 57000+)
  • Time calculations off by factor of 1000
  • Database errors on timestamp insertion
  • Queries return no results despite data existing

Why it happens: Unix timestamps can be in seconds, milliseconds, microseconds, or nanoseconds. Python’s time.time() returns seconds. JavaScript’s Date.now() returns milliseconds. Databases vary in their expectations.

How to diagnose:

  1. Check timestamp magnitude: seconds (~1.7B), milliseconds (~1.7T)
  2. Look for dates in 1970 (timestamp interpreted as 0 or near-zero)
  3. Verify database schema timestamp precision settings
  4. Test with known timestamps across system boundaries

The fix:

import time
from datetime import datetime, timezone

# WRONG: Python gives seconds, but the database expects milliseconds
timestamp = time.time()  # e.g. 1736424000 (seconds)
db.insert(timestamp)     # Interpreted as 1970-01-21 if read as milliseconds!

# CORRECT: Be explicit about units
timestamp_seconds = time.time()
timestamp_ms = int(time.time() * 1000)

# BEST: Use ISO 8601 format to avoid ambiguity
timestamp_iso = datetime.now(timezone.utc).isoformat()
# "2026-01-09T12:00:00+00:00"

Prevention: Document timestamp units in API specifications. Standardize on milliseconds for IoT systems. Validate timestamp ranges on receipt (reject values outside reasonable bounds). Use ISO 8601 format when human readability matters.
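One way to implement the "validate timestamp ranges" advice is a magnitude heuristic. This is a sketch – the function name and band thresholds are my own, and they assume timestamps from roughly the current era:

```python
def guess_unit(ts: float) -> str:
    """Guess a Unix timestamp's unit from its magnitude.

    Present-day timestamps are ~1.7e9 s, ~1.7e12 ms, ~1.7e15 us,
    ~1.7e18 ns, so the decades are unambiguous. Values falling in
    the gaps between bands are rejected rather than guessed.
    """
    if 1e9 <= ts < 1e11:
        return "seconds"
    if 1e12 <= ts < 1e14:
        return "milliseconds"
    if 1e15 <= ts < 1e17:
        return "microseconds"
    if 1e18 <= ts < 1e20:
        return "nanoseconds"
    raise ValueError(f"timestamp {ts} outside plausible ranges")
```

Failing loudly on the ambiguous gaps is deliberate: a validator that silently rescales is just a subtler version of the original bug.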

Common Pitfall: Timezone Handling Confusion

The mistake: Storing local time without timezone information, or mixing timezones from different devices.

Symptoms:

  • Events appear at wrong times in dashboards
  • DST transitions cause gaps or duplicate data
  • Cross-region data correlation fails completely
  • Reports show times that don’t match user expectations

Why it happens: Developers often use datetime.now() which returns local time without timezone info. Devices in different timezones appear to have the same timestamp but are actually hours apart.

How to diagnose:

  1. Check if timestamps include timezone offset
  2. Look for gaps during DST transitions (spring) or duplicates (fall)
  3. Compare events from devices in different timezones
  4. Verify dashboard times match actual event times

The fix:

from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+ (use pytz for older versions)

# WRONG: Local time without timezone info
timestamp = datetime.now()  # No TZ info - ambiguous!

# CORRECT: Always use UTC internally
timestamp = datetime.now(timezone.utc)

# For display: Convert to user's local timezone
user_tz = ZoneInfo('America/New_York')
local_time = timestamp.astimezone(user_tz)

Prevention: Use UTC for all storage and transmission. Store timezone-aware timestamps only. Convert to local time exclusively for display. Test DST transitions explicitly (March and November in US).

A fourth critical pitfall – using processing time instead of event time – directly impacts retention policies. See the detailed Processing Time vs Event Time callout in the reference material below for implementation guidance.

14.6 Case Study: Tesla’s Vehicle Telemetry Retention Strategy

Tesla collects telemetry from over 4 million vehicles, each generating approximately 100 sensor channels (speed, battery voltage, motor temperature, GPS, brake pressure, steering angle, camera triggers, etc.). Understanding their retention approach illustrates enterprise-scale decision-making.

Data generation per vehicle:

| Sensor Category | Channels | Sample Rate | Daily Volume |
|-----------------|----------|-------------|--------------|
| Drivetrain (motor, battery, inverter) | 45 | 10 Hz | 1.2 GB |
| Chassis (speed, steering, brakes) | 30 | 50 Hz | 3.8 GB |
| Climate (cabin temp, HVAC) | 12 | 1 Hz | 33 MB |
| GPS + navigation | 5 | 1 Hz | 14 MB |
| Camera triggers (event snapshots) | 8 | Event-driven | 200 MB avg |
| Total per vehicle per day | 100 | Mixed | ~5.2 GB |

At 4 million vehicles, the fleet generates approximately 20 petabytes per day in raw telemetry. Even storing just one month of raw data (600 PB) at $0.004/GB/month (S3 Glacier) would cost $2.4 million per month – and that is just cold storage, not accounting for retrieval costs or managed database pricing which would be orders of magnitude higher. Clearly, aggressive retention tiering is required.

Tesla’s reported approach (based on patent filings and engineering talks):

| Tier | Retention Period | Resolution | Storage Per Vehicle | Fleet Storage |
|------|------------------|------------|---------------------|---------------|
| Hot (edge, in-car) | 1 hour | Full (10-50 Hz) | 5.2 GB (circular buffer) | N/A (local) |
| Event capture | 30 days (cloud) | Full, 30-sec clips | ~500 MB/month | 2 PB/month |
| Aggregated telemetry | 1 year | 1-minute averages | ~15 GB/year | 60 PB/year |
| Fleet statistics | Indefinite | Hourly per model/region | Negligible | ~1 TB total |

Key design decisions:

  1. Edge-first processing: The vehicle’s onboard computer runs anomaly detection locally. Only anomalies trigger a 30-second data upload (15 seconds before and after the event). This reduces cloud ingestion by approximately 99.99% compared to continuous upload.

  2. Min/max preservation for safety: Battery voltage and motor temperature retention always includes min/max values, not just averages. A 1-minute average battery voltage of 395V looks normal, but a min of 340V within that minute indicates a cell balancing fault that could lead to thermal runaway.

  3. Compliance-driven retention: Autonomous driving event data (Autopilot activations, disengagements, near-misses) is retained for 3 years at full resolution per NHTSA investigation requirements – even though the storage cost is 50x higher per event than the aggregated tier.

Lesson for IoT architects: Tesla’s approach demonstrates that retention strategy is driven by three factors in order of priority: (1) regulatory requirements (what MUST you keep), (2) safety value (what COULD save lives if analyzed), (3) cost optimization (how cheaply can you store what remains). Start with regulatory, then safety, and optimize cost last.
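The edge-first pattern in decision 1 is essentially a ring buffer plus post-event capture. A minimal sketch of that mechanism – the class name, rates, and buffer sizes are illustrative assumptions, not Tesla's actual code:

```python
from collections import deque

class EventCapture:
    """Keep the last pre_s seconds in RAM; when an anomaly fires,
    snapshot that buffer and record post_s more seconds, then hand
    the combined clip to an uploader. Everything else is discarded."""

    def __init__(self, rate_hz=10, pre_s=15, post_s=15):
        self.pre = deque(maxlen=rate_hz * pre_s)  # rolling pre-event window
        self.post_len = rate_hz * post_s
        self.active = None                        # clip currently being finished
        self.post_left = 0
        self.clips = []                           # completed clips (would upload)

    def push(self, reading, anomaly=False):
        if self.active is not None:               # finishing a clip's post-window
            self.active.append(reading)
            self.post_left -= 1
            if self.post_left == 0:
                self.clips.append(self.active)    # upload point
                self.active = None
        elif anomaly:                             # start clip: pre-buffer + trigger
            self.active = list(self.pre) + [reading]
            self.post_left = self.post_len
        self.pre.append(reading)

cap = EventCapture()
for i in range(1_000):                            # 100 s of 10 Hz readings
    cap.push(i, anomaly=(i == 500))
# One 30-second clip captured: 150 pre + trigger + 150 post = 301 readings
```

Only the 301-reading clip ever leaves the device; the other ~700 readings in this run are dropped locally, which is the source of the ~99.99% ingestion reduction.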

14.7 Summary

Retention policies and time synchronization are critical for sustainable IoT time-series systems.

Key Takeaways:

  1. Implement retention from day one: Multi-tier policies reduce storage by 95-98%. Don’t wait until storage becomes a problem.

  2. Always preserve min/max: When downsampling, averages hide critical spikes. Store min, max, and count alongside mean.

  3. Choose the right partition strategy: Time-only for aggregate dashboards, time+device for per-device analytics.

  4. Synchronize clocks: Use NTP, re-sync after sleep, and include time quality indicators in telemetry.

  5. Use UTC everywhere: Store and transmit UTC, convert to local time only for display.

  6. Use event time, not processing time: Timestamp data at the source, not when it arrives at the server.

The Sensor Squad’s diary is getting TOO BIG! How do they keep the important stuff?

Sammy the Sensor has been writing down the temperature every single second for a whole year. That is 31 MILLION entries! His diary is SO heavy he cannot even lift it!

“We need to clean up!” says Max the Microcontroller. But they cannot just throw away old data – what if they need it later?

So Max invents a clever system called data retention tiers:

Tier 1 - This Week (Full Detail) “For the last 7 days, keep EVERY single reading. We might need to zoom into exactly what happened at 3:47:12 PM last Tuesday!”

Tier 2 - This Month (1-Minute Summaries) “For the last 30 days, keep one summary per minute. Instead of 60 readings, we keep the average, minimum, and maximum. That is 57 readings we can delete!”

Tier 3 - This Year (Hourly Summaries) “For the last year, keep one summary per hour. We can still see daily patterns, but we save 3,599 out of every 3,600 readings!”

Tier 4 - Forever (Daily Summaries) “For anything older than a year, keep just one number per day. Was January 15, 2024 hot or cold? That is all we need!”

Lila the LED does the math: “Sammy’s 31 million readings shrink to about 660,000! That is a 98% reduction!”

Bella the Battery adds an important warning: “But ALWAYS keep the minimum and maximum along with the average. If you only keep averages, you will miss the 5-second spike when the oven almost caught fire!”

“Think of it like photos,” says Max. “Today’s photos: keep them all. Last month: keep the best ones. Last year: keep one per trip. 10 years ago: keep one per year. You still remember your life, but your photo album fits on one shelf instead of filling a whole room!”

14.7.1 Try This at Home!

Write down the temperature every hour for 3 days (or check a weather app). Now summarize each day with just 3 numbers: the average, the coldest, and the hottest temperature. See how much space you saved? You went from 72 numbers to just 9, but you still know the important stuff about each day!

Worked Example: HIPAA-Compliant Retention for Hospital Telemetry

A hospital IoT platform monitors 500 patient beds with continuous vital sign sensors (heart rate, SpO2, blood pressure, temperature). HIPAA requires 6-year retention of health records, but storing raw 1 Hz sensor data for 6 years is cost-prohibitive. Design a compliant multi-tier retention strategy.

Regulatory requirements (HIPAA § 164.316):

  • Retain “documentation of actions, activities, or assessments” for 6 years
  • Must prove no gaps in monitoring during patient admission
  • Audit trail required for all clinical decisions

Data generation (per bed):

  • 4 vital signs × 1 Hz × 16 bytes/reading = 64 bytes/second
  • Per bed daily: 64 bytes/s × 86,400 = 5.53 MB/day
  • 500 beds: 5.53 MB × 500 = 2.76 GB/day raw
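The per-bed arithmetic above, checked in code:

```python
# Hospital example: 500 beds, 4 vitals at 1 Hz, 16 bytes per reading
beds, vitals, rate_hz, bytes_per_reading = 500, 4, 1, 16

per_bed_mb_day = vitals * rate_hz * bytes_per_reading * 86_400 / 1e6  # ≈ 5.53 MB/day
fleet_gb_day = per_bed_mb_day * beds / 1_000                          # ≈ 2.76 GB/day
raw_6y_tb = fleet_gb_day * 365 * 6 / 1_000                            # ≈ 6.05 TB raw
```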

Compliant tiered retention strategy:

| Tier | Retention Period | Resolution | Regulatory Justification | Storage |
|------|------------------|------------|--------------------------|---------|
| Tier 1: Critical Events | 6 years | Full resolution (1 Hz) for ±5 min around alerts | Proves clinical decision-making context | 500 beds × 5 alerts/day × 600 sec × 64 bytes = 96 MB/day → 210 GB/6 years |
| Tier 2: Raw Monitoring | 72 hours | Full 1 Hz for all patients | Recent data for diagnosis, troubleshooting | 2.76 GB/day × 3 days = 8.3 GB rolling |
| Tier 3: 1-min Summaries | 90 days | 1-minute avg/min/max | Operational review, trend analysis | 2.76 GB/day ÷ 60 × 90 days = 4.1 GB |
| Tier 4: Hourly Summaries | 6 years | Hourly avg/min/max + event counts | Compliance retention, proves no gaps | 2.76 GB/day ÷ 3600 × 365 × 6 = 1.7 GB |
| Total 6-year storage | – | – | – | ~224 GB (vs 6 TB raw for 6 years) |

Retention policy implementation (TimescaleDB):

-- Tier 1: Critical events (6-year retention)
CREATE TABLE critical_events (
    time TIMESTAMPTZ NOT NULL, patient_id INT,
    vital_type TEXT, alert_type TEXT,
    context_data JSONB);  -- ±5 min raw data around alert
SELECT create_hypertable('critical_events', 'time');
SELECT add_retention_policy('critical_events', INTERVAL '6 years');

-- Tier 2: Raw vitals at 1 Hz (72-hour rolling window)
CREATE TABLE vitals_raw (
    time TIMESTAMPTZ NOT NULL, patient_id INT,
    heart_rate INT, spo2 INT,
    bp_systolic INT, bp_diastolic INT, temperature REAL);
SELECT create_hypertable('vitals_raw', 'time');
SELECT add_retention_policy('vitals_raw', INTERVAL '72 hours');

-- Tier 3: 1-min continuous aggregate (90 days)
CREATE MATERIALIZED VIEW vitals_1min
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 minute', time) AS bucket, patient_id,
       AVG(heart_rate) as hr_avg, MIN(heart_rate) as hr_min,
       MAX(heart_rate) as hr_max, COUNT(*) as sample_count
FROM vitals_raw GROUP BY bucket, patient_id;
SELECT add_retention_policy('vitals_1min', INTERVAL '90 days');

-- Tier 4: Hourly aggregate (6-year compliance)
-- Note: building a continuous aggregate on top of another continuous
-- aggregate (hierarchical) requires TimescaleDB 2.9 or later
CREATE MATERIALIZED VIEW vitals_hourly
WITH (timescaledb.continuous) AS
SELECT time_bucket('1 hour', bucket) AS hour, patient_id,
       AVG(hr_avg) as hr_avg, MIN(hr_min) as hr_min,
       MAX(hr_max) as hr_max, SUM(sample_count) as total_samples
FROM vitals_1min GROUP BY hour, patient_id;
SELECT add_retention_policy('vitals_hourly', INTERVAL '6 years');

Compliance verification query:

-- Prove continuous monitoring for patient admission (no gaps >5 minutes)
SELECT hour, patient_id, total_samples,
       CASE WHEN total_samples < 3300 THEN 'GAP DETECTED' ELSE 'OK' END as status
FROM vitals_hourly
WHERE patient_id = 12345
  AND hour BETWEEN '2023-01-15 08:00' AND '2023-01-17 18:00'
ORDER BY hour;
-- Expected: 3600 samples/hour at 1 Hz. <3300 means >5 min gap in that hour.

Storage cost analysis (AWS S3 for compliance tier):

  • Tier 1 (critical events): 210 GB × $0.023/GB/month (S3 Standard) = $4.83/month
  • Tier 4 (hourly summaries): 1.7 GB × $0.023/GB/month = $0.04/month
  • 6-year compliance storage: ~$5/month on S3 vs ~$138/month for raw 6 TB on S3 Standard (or far more on a managed database)

This design achieves HIPAA compliance at 98% cost reduction by preserving full-resolution data only for critical events and using hourly summaries with sample counts to prove continuous monitoring.

Choosing aggregation functions by data type:

| Data Type | Best Aggregation | Avoid | Rationale | Example |
|-----------|------------------|-------|-----------|---------|
| Temperature | Mean, Min, Max | Median (lossy), Last (ignores history) | Temperature changes gradually; mean preserves trend, min/max catch spikes | HVAC monitoring: mean=22.5°C, max=25°C (brief spike) |
| Pressure | Mean, Min, Max, StdDev | Mode (rare spikes dominate) | Pressure fluctuates; stddev indicates stability | Industrial boiler: mean=150 PSI, stddev=5 (stable) vs 20 (oscillating) |
| Binary (motion, door) | Sum (event count), Duration in state | Mean (meaningless for 0/1) | Binary sensors: count state changes, measure time in each state | Motion sensor: 15 triggers/hour, 18 min active |
| Energy (kWh) | Sum, Max (peak demand) | Mean (loses total consumption) | Energy is cumulative; sum = total consumed, max = peak load | Building: sum=450 kWh/day, max=75 kW (peak) |
| Network latency | Mean, p50, p95, p99 | Min (irrelevant), Max alone (outlier-sensitive) | Latency is long-tailed; percentiles capture user experience | API: mean=50ms (good), p99=2s (some users suffer) |
| Count/rate | Sum, Max rate | Mean (double-counts) | Counts are cumulative; sum preserves total | Requests/sec: sum=8.6M requests/day, max=150 req/s |
| GPS coordinates | Median, Centroid | Mean (drifts with outliers) | Location has outliers (GPS glitches); median or centroid robust | Vehicle route: median filters GPS errors |

General principles:

  1. Always include min/max for anomaly detection (even if mean is primary metric)
  2. Use percentiles for long-tailed distributions (latency, response time)
  3. Preserve counts for rate calculations (events/sec requires sum of events + time window)
  4. Use sum for cumulative metrics (energy, data transfer, requests)
  5. Include sample count to detect data gaps (if count < expected, monitoring failed)
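
These principles can be illustrated with a short standard-library sketch (the data values are invented for illustration):

```python
# Applying the aggregation guidance above to three illustrative streams.
import statistics

# Long-tailed latency (ms): mean is skewed by one outlier; median and
# tail values describe user experience better.
latencies = [45, 50, 48, 52, 47, 49, 51, 46, 2000, 50]
mean = statistics.mean(latencies)   # pulled upward by the 2000 ms outlier
p50 = statistics.median(latencies)  # typical request
worst = max(latencies)              # tail that the mean hides

# Cumulative energy (kWh per interval): sum preserves the total,
# max preserves peak demand; mean would lose both.
energy_kwh = [1.2, 0.8, 3.5, 2.1]
total_kwh, peak_kwh = sum(energy_kwh), max(energy_kwh)

# Binary motion sensor: count rising edges and time-in-state;
# the mean of a 0/1 stream is meaningless.
states = [0, 0, 1, 1, 0, 1, 0, 0, 1, 1]
triggers = sum(1 for prev, cur in zip(states, states[1:])
               if prev == 0 and cur == 1)
active_intervals = sum(states)

print(f"latency: mean={mean:.0f}ms p50={p50}ms max={worst}ms")
print(f"energy: total={total_kwh:.1f} kWh, peak interval={peak_kwh} kWh")
print(f"motion: {triggers} triggers, {active_intervals} intervals active")
```

Note how the single 2-second outlier drags the latency mean to nearly 5x the median, which is exactly why percentiles belong in the aggregate.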
Common Mistake: Using Processing Time Instead of Event Time for Retention

Teams configure retention policies based on when data arrives at the database (processing time) instead of when the event actually occurred (event time), causing regulatory compliance failures and incorrect data expiration.

Symptoms: Late-arriving data lands in the wrong aggregation windows, real-time dashboards disagree with batch reports, and data expires relative to when it arrived rather than when it was measured – too late for delayed uploads, and dangerously early when a device clock is wrong.

What goes wrong: A manufacturing plant uses TimescaleDB with its hypertable partitioned on created_at – the time the database received each record – so its 30-day retention policy also expires data by created_at. Edge gateways buffer sensor data locally during network outages and transmit in batches when connectivity resumes. After a 5-day outage, the gateway uploads 5 days of buffered readings, and every row is stamped with the current created_at. Reports that bucket by created_at now show zero production for the outage window and an impossible spike on the reconnection day, and the backfilled rows will expire 30 days after upload rather than 30 days after measurement. Result: 5 days of production history is unusable for analysis, and the table's actual retention no longer matches the compliance requirement, which is defined in event time.

Why it fails: Retention policies and aggregation windows must operate on event time (when the measurement was taken), not processing time (when the database received it). Late-arriving data is valid historical data that belongs in its original event-time window, not fresh data stamped with its arrival time.

The correct approach:

  1. Always use event timestamp for retention policies:

    -- BAD: Hypertable partitioned by processing time
    CREATE TABLE sensor_data (
        created_at TIMESTAMPTZ DEFAULT NOW(),  -- Processing time
        event_time TIMESTAMPTZ,                -- Event time
        value REAL
    );
    SELECT create_hypertable('sensor_data', 'created_at');  -- Partitioned by processing time!
    SELECT add_retention_policy('sensor_data', INTERVAL '30 days');  -- Drops based on created_at!
    
    -- GOOD: Retention based on event time
    SELECT create_hypertable('sensor_data', 'event_time');  -- Partition by event_time
    SELECT add_retention_policy('sensor_data', INTERVAL '30 days');  -- Now uses event_time
  2. Handle late-arriving data gracefully:

    # Validate event_time on ingestion and reject implausible timestamps
    from datetime import datetime, timedelta, timezone
    
    MAX_LATE_ARRIVAL = timedelta(days=7)
    MAX_FUTURE_DRIFT = timedelta(minutes=5)
    
    def validate_event_time(event_time: datetime) -> bool:
        """Accept events up to 7 days old, reject future timestamps."""
        now = datetime.now(timezone.utc)
        if event_time > now + MAX_FUTURE_DRIFT:
            return False  # Reject future timestamps (likely clock error)
        if event_time < now - MAX_LATE_ARRIVAL:
            return False  # Too old - route to dead-letter queue for review
        return True
  3. Separate processing timestamp for audit:

    -- Track both event time and processing time
    CREATE TABLE sensor_data (
        sensor_id TEXT NOT NULL,                 -- Needed for the primary key below
        event_time TIMESTAMPTZ NOT NULL,         -- When measurement was taken
        processed_at TIMESTAMPTZ DEFAULT NOW(),  -- When DB received it
        value REAL,
        PRIMARY KEY (event_time, sensor_id)
    );
    
    -- Audit late arrivals
    SELECT sensor_id, event_time, processed_at,
           (processed_at - event_time) as latency
    FROM sensor_data
    WHERE (processed_at - event_time) > INTERVAL '1 hour'
    ORDER BY latency DESC
    LIMIT 100;
  4. Always use event time in application code:

    # WRONG: Using server arrival time
    def process_message(msg):
        timestamp = datetime.now()  # Processing time - wrong!
        store(timestamp, msg.value)
    
    # CORRECT: Use event time from message
    def process_message(msg):
        timestamp = msg.event_time  # Time event actually occurred
        store(timestamp, msg.value)
  5. Use watermarks for stream processing:

    • In Apache Flink/Kafka Streams: configure a watermark delay (e.g., 5-10 minutes) to accommodate late events
    • Data arriving later than the watermark delay goes to a side output for manual review, not the main pipeline
    # Spark Structured Streaming watermark example
    stream.withWatermark("event_time", "5 minutes")
    # In Flink: use WatermarkStrategy.forBoundedOutOfOrderness(Duration.ofMinutes(5))
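
Tying the steps above together, a small simulation shows why batched uploads must be bucketed (and expired) by event time; the list-of-dicts "store" here is purely illustrative, not a database API:

```python
# Sketch: a gateway uploads 5 days of buffered hourly readings at once.
# Bucketing by processing time collapses them onto the upload date.
from collections import Counter
from datetime import datetime, timedelta, timezone

upload_time = datetime(2023, 6, 30, 12, 0, tzinfo=timezone.utc)

# One reading per hour for the 5-day outage, all received at upload_time.
readings = [
    {"event_time": upload_time - timedelta(hours=h),
     "processed_at": upload_time}
    for h in range(1, 5 * 24 + 1)
]

by_event = Counter(r["event_time"].date() for r in readings)
by_processing = Counter(r["processed_at"].date() for r in readings)

print(len(by_event))       # 6 calendar days - the outage window is visible
print(len(by_processing))  # 1 day - all 120 hours collapse onto the upload date
```

Any daily aggregate, billing query, or retention sweep keyed on processed_at sees the right-hand picture: a gap during the outage and an artificial spike on reconnection day.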

Real consequence: A smart grid utility stored power consumption data with a 90-day retention policy. Their edge gateways incorrectly set _time to the upload timestamp rather than the measurement timestamp. When a substation's network connection failed for 3 weeks, the gateway buffered data locally; upon reconnection, all 3 weeks of backlogged readings were written with the reconnection timestamp. Billing queries for the outage window found no consumption data – three weeks of usage was collapsed onto a single day – and the retention policy later expired those rows relative to the wrong timestamp rather than the actual measurement time. The utility faced regulatory fines for incomplete billing records. The fix: ensure edge gateways write the actual measurement timestamp as _time, not the upload time, and add server-side validation to reject timestamps that are implausibly far from the current time. The lesson: event time must be captured correctly at the source and validated on ingestion.

14.8 Concept Relationships

Prerequisites - Read these first:

  • Time-Series Fundamentals – Compression mechanisms (delta encoding, columnar)
  • Time-Series Platforms – Retention implementation in InfluxDB/TimescaleDB


14.9 What’s Next

| Topic | Chapter | What You'll Learn |
|---|---|---|
| Query Optimization | Query Optimization for IoT | Write efficient queries that span multiple retention tiers, using time-range filters and continuous aggregates to avoid full table scans |
| Stream Processing | Stream Processing | Apply real-time transformations and filtering before data reaches storage, reducing ingestion volume and enabling event-driven architectures |
| Data Quality | Data Quality Monitoring | Detect missing data, outliers, and sensor drift before downsampling locks in corrupted aggregates that cannot be corrected later |
| Sharding Strategies | Sharding Strategies | Distribute retention policies across clustered database nodes, configure per-shard retention, and manage rebalancing when tiers expire |
| Hands-On Practice | Time-Series Practice | Apply retention concepts through guided exercises including multi-tier policy design, storage cost estimation, and compliance verification queries |