13  Time-Series Query Optimization

13.1 Learning Objectives

By the end of this chapter, you will be able to:

  • Construct efficient queries for common IoT sensor data patterns
  • Implement last-value, time-range, and anomaly detection queries
  • Apply query optimization best practices for time-series workloads
  • Select appropriate pagination strategies for large IoT datasets
  • Evaluate query caching trade-offs against real-time data freshness requirements

In 60 Seconds

Time-series query patterns differ fundamentally from transactional SQL: the most common operations are time-window aggregations, rolling statistics, gap detection, and cross-device comparisons – all requiring time-range predicates to trigger chunk exclusion. Mastering time_bucket(), window functions (LAG, LEAD), continuous aggregates, and gap-fill techniques transforms slow full-table scans into millisecond responses on billion-row datasets. The golden rule: always include a time range predicate to exploit partition pruning.

13.2 Key Concepts

  • Chunk Exclusion: The TimescaleDB query optimizer technique that skips time chunks entirely when the query’s time range predicate does not overlap with a chunk’s bounds, reducing I/O from terabytes to megabytes for recent-data queries
  • Window Function: A SQL function (LAG, LEAD, RANK, NTILE) that computes a value for each row using a sliding window of surrounding rows without collapsing the result, enabling running totals, delta calculations, and rank ordering in time-series queries
  • Rolling Average: A time-series statistic computed over a sliding time window (e.g., 1-hour moving average of temperature) that smooths short-term noise while preserving underlying trends – implemented with window functions or time_bucket
  • Interpolation: Estimating missing data point values between known readings using linear or cubic methods, commonly used to fill gaps caused by sensor downtime before visualization or ML feature extraction
  • Last-Value Carry Forward: A gap-fill strategy that propagates the most recent known value into periods with missing readings, appropriate for slow-changing sensors where the latest reading is the best estimate of missing values
  • Percentile Query: A time-series aggregation computing the Nth percentile (p50, p95, p99) of sensor readings within a time window, used for SLA monitoring and outlier detection without being skewed by extreme values
  • Cross-Device Comparison: A query joining or unioning time-series data from multiple devices aligned to common time buckets, enabling fleet-level health monitoring and anomaly detection by comparing device behavior against peer baselines
  • Approximate Distinct Count: A probabilistic algorithm (HyperLogLog) for counting unique values (e.g., unique devices reporting in the last hour) with sub-1% error at 1% of the memory cost of exact counting
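Last-value carry forward is easy to see in a few lines of plain Python. The sketch below (hypothetical in-memory data, no database involved) fills a regular one-minute grid from sparse readings, repeating the most recent known value into empty buckets:

```python
from datetime import datetime, timedelta

def carry_forward(readings, start, end, step):
    """Fill a regular time grid using last-value carry forward (LOCF).

    readings: list of (timestamp, value), sorted by timestamp.
    Buckets with no new reading repeat the most recent known value;
    buckets before the first reading are left as None.
    """
    filled = []
    last = None
    i = 0
    t = start
    while t < end:
        # Consume every reading that falls at or before this bucket
        while i < len(readings) and readings[i][0] <= t:
            last = readings[i][1]
            i += 1
        filled.append((t, last))
        t += step
    return filled

# Example: a sensor that went silent for two minutes
t0 = datetime(2025, 1, 15, 10, 0)
readings = [(t0, 21.0), (t0 + timedelta(minutes=3), 22.5)]
grid = carry_forward(readings, t0, t0 + timedelta(minutes=5),
                     timedelta(minutes=1))
# Minutes 0-2 carry 21.0; minutes 3-4 carry 22.5
```

The same scan-once structure underlies database gap-fill functions; the appropriateness caveat from the definition above still applies – LOCF only makes sense for slow-changing signals.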

13.3 MVU: Minimum Viable Understanding

Core Concept: Efficient time-series queries always specify time ranges, use appropriate data granularity, leverage pre-computed aggregates, and limit result sets to what the application can actually render.

Why It Matters: A poorly written dashboard query can scan millions of rows and take 30+ seconds; the same query optimized with time bounds, continuous aggregates, and result limits returns in under 100 ms.

Key Takeaway: Always constrain queries by time range, use downsampled data for historical analysis, leverage continuous aggregates for dashboard queries, and cache aggressively (most dashboards tolerate 10-60 seconds of staleness).

Querying time-series data is about asking the right questions of your sensor history. Instead of reading every single temperature measurement, you learn to ask smart questions like ‘what was the average this week?’ or ‘when did temperature spike above 30 degrees?’ Efficient queries let you get fast answers even from databases containing billions of readings.

13.4 Query Optimization for IoT

Estimated time: ~15 min | Difficulty: Advanced | Unit: P10.C15.U05

Efficient queries are essential for real-time dashboards, anomaly detection, and historical analysis. This section covers the four most common query patterns in IoT systems, then walks through best practices for keeping them fast as data volumes grow. Each pattern includes side-by-side examples in InfluxDB (Flux) and TimescaleDB (SQL) – the two most widely used time-series databases in IoT deployments.

13.4.1 Common Query Patterns

13.4.1.1 Last Value Queries

Scenario: Display current sensor readings on a dashboard.

InfluxDB:

from(bucket: "iot_sensors")
  |> range(start: -5m)
  |> filter(fn: (r) => r["_measurement"] == "temperature")
  |> last()

TimescaleDB:

SELECT DISTINCT ON (sensor_id)
  sensor_id,
  time,
  temperature
FROM sensor_data
WHERE time > NOW() - INTERVAL '5 minutes'
ORDER BY sensor_id, time DESC;

Optimization Tips:

  • Query recent time range only (last 5-10 minutes)
  • Use indexes on sensor_id + time
  • Cache results for 5-30 seconds (acceptable staleness)

13.4.1.2 Time-Range Aggregations

Scenario: Show hourly averages for the last 24 hours.

InfluxDB:

from(bucket: "iot_sensors")
  |> range(start: -24h)
  |> filter(fn: (r) => r["_measurement"] == "temperature")
  |> filter(fn: (r) => r["location"] == "building_a")
  |> aggregateWindow(every: 1h, fn: mean)

TimescaleDB:

SELECT
  time_bucket('1 hour', time) AS hour,
  AVG(temperature) as avg_temp
FROM sensor_data
WHERE time > NOW() - INTERVAL '24 hours'
  AND location = 'building_a'
GROUP BY hour
ORDER BY hour;
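For intuition, the bucket-then-average logic of time_bucket + AVG can be sketched in plain Python – a hypothetical in-memory version, not how the database actually executes it:

```python
from collections import defaultdict

def bucket_average(readings, bucket_seconds):
    """Group (epoch_seconds, value) readings into fixed-width buckets
    and average each bucket - mirroring time_bucket(...) + AVG(value).
    The bucket key is the timestamp floored to a multiple of bucket_seconds.
    """
    sums = defaultdict(lambda: [0.0, 0])
    for ts, value in readings:
        bucket = ts - (ts % bucket_seconds)  # floor to bucket start
        sums[bucket][0] += value
        sums[bucket][1] += 1
    return {b: total / count for b, (total, count) in sorted(sums.items())}

# Two readings in the first hour, one in the second
readings = [(0, 20.0), (1800, 22.0), (3600, 25.0)]
hourly = bucket_average(readings, 3600)
# → {0: 21.0, 3600: 25.0}
```

The database wins over this approach by pushing the bucketing into indexed, chunk-excluded scans – but the grouping semantics are the same.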

Optimization Tips:

  • Use continuous aggregates/downsampled data if available
  • Partition tables by time (automatic in TSDBs)
  • Limit result set size (a frontend cannot usefully render more than a few thousand points)

13.4.1.3 Anomaly Detection Queries

Scenario: Detect sensors with values 2 standard deviations above mean.

InfluxDB:

// Step 1: Compute per-sensor statistics over the last hour
stats = from(bucket: "iot_sensors")
  |> range(start: -1h)
  |> filter(fn: (r) => r["_measurement"] == "temperature")
  |> aggregateWindow(every: 1h, fn: mean, createEmpty: false)

stddevs = from(bucket: "iot_sensors")
  |> range(start: -1h)
  |> filter(fn: (r) => r["_measurement"] == "temperature")
  |> aggregateWindow(every: 1h, fn: stddev, createEmpty: false)

// Step 2: Get recent readings and compare against statistics
from(bucket: "iot_sensors")
  |> range(start: -1h)
  |> filter(fn: (r) => r["_measurement"] == "temperature")
  |> map(fn: (r) => ({r with z_score: (r._value - stats._value) / stddevs._value}))
  |> filter(fn: (r) => r.z_score > 2.0 or r.z_score < -2.0)

Note: Flux does not natively support z-score computation in a single pass. In practice, anomaly detection in InfluxDB is usually implemented with community/contrib anomaly-detection packages, Kapacitor tasks, or by exporting the statistics and comparing in application code. The query above illustrates the conceptual approach only.

TimescaleDB:

WITH stats AS (
  SELECT
    sensor_id,
    AVG(temperature) as mean_temp,
    STDDEV(temperature) as stddev_temp
  FROM sensor_data
  WHERE time > NOW() - INTERVAL '1 hour'
  GROUP BY sensor_id
  HAVING STDDEV(temperature) > 0  -- Exclude sensors with zero variance
)
SELECT
  s.sensor_id,
  s.time,
  s.temperature,
  st.mean_temp,
  (s.temperature - st.mean_temp) / st.stddev_temp as z_score
FROM sensor_data s
JOIN stats st ON s.sensor_id = st.sensor_id
WHERE s.time > NOW() - INTERVAL '1 hour'
  AND ABS((s.temperature - st.mean_temp) / st.stddev_temp) > 2.0
ORDER BY ABS((s.temperature - st.mean_temp) / st.stddev_temp) DESC;
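The same two-pass logic – compute per-sensor statistics, then flag readings more than 2 standard deviations from the mean – can be sketched in plain Python for intuition (hypothetical in-memory data; the SQL above is the production path):

```python
from collections import defaultdict
from statistics import mean, pstdev

def zscore_anomalies(rows, threshold=2.0):
    """Flag readings whose |z-score| exceeds `threshold` relative to
    their own sensor's mean, mirroring the CTE-plus-join SQL.

    rows: list of (sensor_id, value). Returns (sensor_id, value, z).
    Sensors with zero variance are skipped, like the HAVING clause.
    """
    by_sensor = defaultdict(list)
    for sensor_id, value in rows:
        by_sensor[sensor_id].append(value)

    anomalies = []
    for sensor_id, values in by_sensor.items():
        mu, sigma = mean(values), pstdev(values)
        if sigma == 0:
            continue  # z-score undefined without variance
        for v in values:
            z = (v - mu) / sigma
            if abs(z) > threshold:
                anomalies.append((sensor_id, v, z))
    return anomalies

# Seven normal readings and one spike
rows = [("s1", v) for v in [20, 21, 20, 21, 20, 21, 20, 35]]
flagged = zscore_anomalies(rows)  # only the 35-degree reading is flagged
```

Note this makes two passes over the data (statistics, then comparison) – exactly why the SQL version benefits from pre-computing statistics in a continuous aggregate.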

Optimization Tips:

  • Pre-compute statistics in continuous aggregates
  • Use window functions efficiently (avoid full table scans)
  • Consider streaming anomaly detection (outside database)

13.4.1.4 Multi-Sensor Correlations

Scenario: Find when temperature and humidity both exceed thresholds.

TimescaleDB (using time_bucket to handle sensors with slightly different timestamps):

SELECT
  time_bucket('1 minute', t.time) AS bucket,
  t.sensor_id as temp_sensor,
  h.sensor_id as humidity_sensor,
  AVG(t.temperature) AS avg_temp,
  AVG(h.humidity) AS avg_humidity
FROM sensor_data t
JOIN sensor_data h ON
  time_bucket('1 minute', t.time) = time_bucket('1 minute', h.time) AND
  t.location = h.location
WHERE t.time > NOW() - INTERVAL '1 hour'
  AND t.sensor_type = 'temperature'
  AND h.sensor_type = 'humidity'
GROUP BY bucket, t.sensor_id, h.sensor_id
HAVING AVG(t.temperature) > 30
  AND AVG(h.humidity) > 80;

Why time_bucket instead of exact time match? IoT sensors rarely report at precisely the same instant. A temperature sensor at 10:00:00.123 and a humidity sensor at 10:00:00.456 would miss an exact-time join. Using time_bucket('1 minute', ...) aligns readings within the same minute, making correlations reliable.
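The bucket-alignment idea can be sketched in plain Python: truncate each stream's timestamps to a common bucket, average within buckets, and keep only buckets present in both streams. This is a hypothetical in-memory analogue of the SQL join above:

```python
def align_streams(temps, hums, bucket_seconds=60):
    """Join two (epoch_seconds, value) streams on shared time buckets -
    the in-memory analogue of joining on time_bucket('1 minute', ...).

    Returns {bucket: (avg_temp, avg_humidity)} for buckets where BOTH
    streams have at least one reading.
    """
    def bucketize(stream):
        acc = {}
        for ts, v in stream:
            b = ts - (ts % bucket_seconds)  # floor to bucket start
            total, n = acc.get(b, (0.0, 0))
            acc[b] = (total + v, n + 1)
        return {b: total / n for b, (total, n) in acc.items()}

    t, h = bucketize(temps), bucketize(hums)
    return {b: (t[b], h[b]) for b in t.keys() & h.keys()}

# Sensors report at different offsets within the same minute
temps = [(0, 31.0), (30, 33.0), (120, 29.0)]  # buckets 0 and 120
hums = [(15, 85.0), (60, 82.0)]               # buckets 0 and 60
joined = align_streams(temps, hums)
# Only bucket 0 exists in both streams: temp avg 32.0, humidity 85.0
```

An exact-timestamp join on this data would return nothing at all – the bucketing is what makes the correlation possible.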

Optimization Tips:

  • Co-locate related sensors in same partition (by location)
  • Use time-bucketing to reduce join complexity
  • Consider denormalized schemas (store temp+humidity in same row)

13.4.1.5 Paginated Queries

Scenario: An API endpoint returns historical sensor data, but clients cannot handle millions of rows at once.

Understanding Pagination for IoT Data APIs

Core Concept: Pagination divides large query results into manageable chunks (pages), allowing clients to request data incrementally rather than retrieving millions of records in a single response.

Why It Matters: IoT systems accumulate massive datasets – a single sensor with 1-second readings generates 86,400 records daily. Without pagination, requesting “all temperature readings” for a week returns 600,000+ records, overwhelming client memory, saturating network bandwidth, and causing API timeouts. Well-designed pagination keeps response times consistent regardless of total dataset size.

Key Takeaway: For time-series IoT data, prefer cursor-based pagination using timestamps over offset-based pagination because offset pagination becomes increasingly slow as offsets grow large, while timestamp cursors leverage time-based indexes efficiently.

Offset-based (slow for large offsets):

-- Page 51 (offset 50,000): database must skip 50,000 rows before returning 1,000
SELECT time, sensor_id, temperature
FROM sensor_data
WHERE time > NOW() - INTERVAL '7 days'
ORDER BY time
LIMIT 1000 OFFSET 50000;  -- Becomes slower as offset grows

Cursor-based (fast at any position):

-- Use the last timestamp from previous page as cursor
SELECT time, sensor_id, temperature
FROM sensor_data
WHERE time > '2025-01-15T10:30:00Z'  -- Cursor: last timestamp from previous page
  AND time <= NOW()                    -- Upper bound of query range
ORDER BY time
LIMIT 1000;

Optimization Tips:

  • Always use cursor-based pagination for time-series data (leverages time index)
  • Include a compound cursor (time + sensor_id) when multiple rows share the same timestamp
  • Set maximum page sizes (e.g., 1,000-5,000 rows) to bound response time and memory
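Putting the tips together, here is a sketch of a client-side paging loop using a compound (time, sensor_id) cursor over an in-memory dataset. The helper names are hypothetical stand-ins for the real API/database calls:

```python
def fetch_page(rows, cursor, limit):
    """Keyset pagination over (time, sensor_id, value) rows sorted by
    (time, sensor_id). `cursor` is the (time, sensor_id) of the last
    row of the previous page, or None for the first page; the compound
    key keeps pagination stable when rows share a timestamp.
    """
    if cursor is not None:
        rows = [r for r in rows if (r[0], r[1]) > cursor]
    return rows[:limit]

def fetch_all(rows, limit):
    """Drain the dataset page by page, as an API client would."""
    out, cursor = [], None
    while True:
        page = fetch_page(rows, cursor, limit)
        if not page:
            return out
        out.extend(page)
        cursor = (page[-1][0], page[-1][1])  # cursor = last row's key

data = [(100, "a", 20.0), (100, "b", 21.0),
        (200, "a", 19.5), (200, "b", 22.0)]
pages = fetch_all(data, limit=3)  # two requests, every row exactly once
```

In a real API the filter becomes `WHERE (time, sensor_id) > (:t, :s)` pushed to the database, so each page costs the same regardless of how deep into the dataset it is.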

13.4.2 Query Performance Best Practices

Figure 13.1: Query Routing Based on Time Range and Aggregation Needs (flowchart: queries are routed to raw data tables, continuous aggregates, or cached results depending on the requested time range and aggregation level)

The accompanying decision tree walks through query optimization step by step: identify the query type, choose the time range, pick the aggregation level, verify index usage, and select a caching strategy.

Following this decision tree systematically can improve query performance by 10-100x.

Key Principles:

  1. Always specify time ranges: Never query unbounded time
  2. Use appropriate granularity: Don’t fetch 1-second data for a 1-year chart
  3. Leverage indexes: Filter on indexed tags/columns first
  4. Limit result sets: Return max 1,000-10,000 points (downsample if needed)
  5. Pre-aggregate when possible: Use continuous aggregates for dashboards
  6. Cache aggressively: Most dashboards tolerate 10-60 second staleness

Why limit result sets to 10,000 points? Consider a dashboard rendering 1 year of data at 1 Hz:

\[ \text{Data points for 1 year} = 1\text{ Hz} \times 31{,}536{,}000\text{ sec} = 31{,}536{,}000\text{ points} \]

Even if the database returns this instantly, the browser must:

  • Allocate \(31{,}536{,}000 \times 16\text{ bytes} \approx 480\text{ MB}\) of JavaScript memory
  • Render 31M SVG/Canvas points (browsers freeze above ~50K points)

Downsample to hourly aggregates: \(365 \times 24 = 8{,}760\text{ points}\) (3,600× reduction). This renders instantly and shows the same trends. Rule of thumb: match query granularity to display resolution – a 1920px-wide chart can only show ~1,920 distinct points anyway!
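The arithmetic above can be wrapped in a small helper for sizing your own charts. A sketch – the 16-bytes-per-point figure is the text's assumption, and memory is reported in MiB:

```python
def downsample_stats(duration_seconds, raw_hz, bucket_seconds,
                     bytes_per_point=16):
    """Compare raw vs downsampled point counts for a chart query.

    Returns (raw_points, bucketed_points, reduction_factor, raw_mib),
    matching the back-of-envelope numbers in the text.
    """
    raw = int(duration_seconds * raw_hz)
    bucketed = int(duration_seconds // bucket_seconds)
    raw_mib = raw * bytes_per_point / 2**20  # memory if sent raw
    return raw, bucketed, raw / bucketed, raw_mib

# One year of 1 Hz data rendered as hourly buckets
raw, bucketed, factor, mib = downsample_stats(365 * 24 * 3600, 1, 3600)
# raw = 31,536,000 points (~481 MiB at 16 bytes each),
# bucketed = 8,760 points, reduction factor = 3,600x
```

Plugging in your chart width as the target bucket count gives the coarsest granularity that still uses every pixel.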

Tradeoff: Query Caching vs Real-Time Data Freshness

Option A: Aggressive Caching (30-60 second TTL)

  • Query latency: 1-5ms (cache hit), 50-500ms (cache miss)
  • Database load: 10-50 QPS (reduced from 500+ QPS)
  • Infrastructure cost: $200-500/month for Redis cache layer
  • Data freshness: 30-60 seconds stale (acceptable for dashboards)
  • Cache hit rate: 80-95% for dashboard queries
  • Best for: Real-time dashboards with 5+ users, aggregation-heavy queries

Option B: Real-Time Queries (No Caching)

  • Query latency: 50-500ms consistent (depends on data volume)
  • Database load: 500+ QPS for active dashboards
  • Infrastructure cost: $1,000-3,000/month for larger database instance
  • Data freshness: Sub-second (true real-time)
  • Cache hit rate: N/A
  • Best for: Alerting systems, control loops, single-user ad-hoc analysis

Decision Factors:

  • Choose Aggressive Caching when: Dashboard refreshes every 10-60 seconds anyway, multiple users view same data (shared cache), query involves expensive aggregations (time_bucket, percentiles), database is a cost bottleneck
  • Choose Real-Time when: Sub-second freshness is critical (safety systems, trading), each user queries unique data (no cache benefit), query latency is already <50ms, system is event-driven (WebSocket push vs polling)
  • Hybrid approach: Cache aggregations (hourly stats) aggressively; serve latest readings directly from database – balances freshness for critical metrics with efficiency for historical views

See the Decision Framework in the Summary section below for a detailed decision matrix with specific query patterns and recommended solutions.

13.5 Hands-On: Exploring Query Granularity

Understanding how sampling interval, aggregation window, and downsampling interact is essential for writing efficient queries. The interactive tool below lets you experiment with these parameters on simulated 24-hour temperature data and immediately see the impact on data volume and storage.

Adjust the controls below to see how different query strategies affect data volume, storage requirements, and visual fidelity.

Try These Experiments:

  1. See the impact of sampling rate: Set interval to “1 second”, then “1 hour” - notice the 3,600x difference in points
  2. Compare aggregation types: With “1 hour” aggregation, toggle between mean/max/min to see how each preserves different signal characteristics
  3. Understand downsampling: Set aggregation to “15 minutes”, then increase downsample ratio - notice how storage drops but visual quality degrades
  4. Visualize retention policy: Enable the retention toggle to see how multi-tier storage reduces total footprint by 95%+

Use this calculator to estimate the database load for your IoT dashboard and see the impact of continuous aggregates and caching.

Try these scenarios:

  • Factory dashboard: 50 users, 5 queries/refresh, 10 sec refresh, 2847 ms raw query – see how caching brings utilization from overloaded to <1%
  • Small team monitoring: 5 users, 2 queries/refresh, 30 sec refresh – raw queries may already be feasible without caching
  • Public dashboard: 500 users, 1 query/refresh, 5 sec refresh – even with aggregates, caching is essential

Common Pitfalls

SELECT AVG(temperature) FROM readings WHERE device_id = 42 – without a time constraint – scans every chunk in the hypertable, even chunks from years ago. Always include WHERE time > NOW() - INTERVAL 'N days' or an equivalent time predicate. A missing time predicate on a 5-year, 1-billion-row hypertable turns millisecond queries into multi-minute scans.

Computing a 24-hour moving average by joining a table to itself with a date range subquery creates O(n²) complexity that becomes prohibitively slow at scale. Use SQL window functions (AVG(value) OVER (ORDER BY time ROWS BETWEEN 23 PRECEDING AND CURRENT ROW)) which process data in a single pass.
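The single-pass idea behind the window function can be sketched in Python with a bounded buffer – O(n) total work instead of the O(n²) self-join:

```python
from collections import deque

def rolling_average(values, window):
    """Single-pass moving average over the last `window` values - the
    in-memory analogue of AVG(value) OVER (ORDER BY time ROWS BETWEEN
    window-1 PRECEDING AND CURRENT ROW).
    """
    buf = deque()
    total = 0.0
    out = []
    for v in values:
        buf.append(v)
        total += v
        if len(buf) > window:
            total -= buf.popleft()  # evict the value leaving the window
        out.append(total / len(buf))
    return out

# A 3-point moving average spreads a spike across the window
smoothed = rolling_average([10, 10, 10, 40, 10], window=3)
# → [10.0, 10.0, 10.0, 20.0, 20.0]
```

Each element is added and removed exactly once, so the cost stays linear no matter how long the window is.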

Dashboard panels refreshing every 30 seconds that query raw 1-second data for 30-day charts rebuild expensive aggregations on every refresh. Create continuous aggregates at 1-minute and 1-hour granularities during schema setup. If aggregates were not created initially, adding them retroactively requires populating historical data and updating all dashboard queries simultaneously.

13.6 Summary

Efficient query design is critical for responsive IoT dashboards and analytics.

Key Takeaways:

  1. Always specify time ranges: Unbounded queries scan entire tables and can take minutes.

  2. Use appropriate data sources: Recent data from raw tables, historical data from continuous aggregates.

  3. Limit result sets: Frontends cannot render 100,000 points; downsample to 1,000-10,000.

  4. Cache strategically: Use hybrid caching–cache aggregates, serve latest readings directly.

  5. Pre-compute common aggregations: Continuous aggregates make dashboards 10-100x faster.

  6. Optimize for your query patterns: Last-value, time-range, anomaly detection, and correlation queries each have different optimization strategies.

Scenario: A factory operations dashboard serves 50 concurrent users monitoring 10,000 sensors. The “Line Status” page fires 5 queries per refresh (one per widget: line averages, alerts, throughput, quality, energy) every 10 seconds, generating 25 queries/second total (50 users × 5 queries / 10 sec). Database CPU hits 90%, queries timeout, and operators lose real-time visibility. Optimize the system to handle load with <100ms P95 latency.

Given:

  • Users: 50 concurrent operators
  • Dashboard refresh: Every 10 seconds, 5 queries per refresh per user
  • Total query rate: 50 × 5 / 10 = 25 queries/sec
  • Query pattern: SELECT AVG(temperature) FROM sensors WHERE time > NOW() - INTERVAL '5 minutes' GROUP BY line_id
  • Database: TimescaleDB on 4-core server
  • Query latency (current): 3-8 seconds P95, frequent timeouts
  • Data volume: 10,000 sensors × 1 reading/sec = 10,000 rows/sec = 864M rows/day

Step 1: Analyze the performance bottleneck

-- Explain query shows sequential scan across 3M rows for 5-minute window
EXPLAIN ANALYZE SELECT AVG(temperature), line_id
FROM sensor_readings
WHERE time > NOW() - INTERVAL '5 minutes'
GROUP BY line_id;

-- Output shows:
-- Seq Scan on sensor_readings (cost=0.00..189234.56 rows=3002345)
-- Planning time: 2.3 ms
-- Execution time: 2847 ms per query × 25 queries/sec = 71,175 ms/sec of CPU

Problem: Each query scans 3 million raw rows even though users only need 20 aggregated results (one per production line).

Step 2: Create continuous aggregate (pre-compute 1-minute averages)

-- Continuous aggregate materializes results incrementally
CREATE MATERIALIZED VIEW line_status_1min
WITH (timescaledb.continuous) AS
SELECT
    time_bucket('1 minute', time) AS bucket,
    line_id,
    AVG(temperature) as avg_temp,
    COUNT(*) as reading_count
FROM sensor_readings
GROUP BY bucket, line_id
WITH NO DATA;

-- Refresh policy: update every 30 seconds, lag 1 minute behind real-time
SELECT add_continuous_aggregate_policy('line_status_1min',
    start_offset => INTERVAL '2 hours',
    end_offset => INTERVAL '1 minute',
    schedule_interval => INTERVAL '30 seconds');

Step 3: Rewrite dashboard query to use continuous aggregate

-- OLD query (slow): Scans 3M raw rows
SELECT AVG(temperature), line_id
FROM sensor_readings
WHERE time > NOW() - INTERVAL '5 minutes'
GROUP BY line_id;

-- NEW query (fast): Scans 100 pre-aggregated rows (5 minutes ÷ 1-minute buckets × 20 lines)
SELECT AVG(avg_temp) as temperature, line_id
FROM line_status_1min
WHERE bucket > NOW() - INTERVAL '5 minutes'
GROUP BY line_id;

-- Query execution time: 12 ms (237x faster!)
-- CPU usage: 12 ms × 25 queries/sec = 300 ms/sec (7.5% of 4-core capacity)

Step 4: Add Redis cache for dashboard results

import json

import redis

redis_client = redis.Redis(host='localhost', port=6379)

def get_line_status():
    """Dashboard query with a 15-second cache."""
    cache_key = "dashboard:line_status"
    cached = redis_client.get(cache_key)

    if cached:
        return json.loads(cached)  # Cache hit: <1 ms

    # Cache miss: query the continuous aggregate (~12 ms)
    query = """
        SELECT line_id, AVG(avg_temp) as temperature
        FROM line_status_1min
        WHERE bucket > NOW() - INTERVAL '5 minutes'
        GROUP BY line_id
    """
    # db: an existing database connection (e.g., psycopg2 or SQLAlchemy)
    rows = db.execute(query).fetchall()
    results = [list(row) for row in rows]  # Plain lists are JSON-serializable

    # Cache for 15 seconds so concurrent users share the same result
    redis_client.setex(cache_key, 15, json.dumps(results))
    return results

Step 5: Calculate final performance

Before optimization:
- 50 users × 5 queries/refresh ÷ 10 sec refresh interval = 25 queries/sec total
- Average query time: 2,847 ms
- Total CPU demand: 25 × 2,847 ms = 71,175 ms/sec (impossible on 4-core = 4,000 ms/sec capacity)
- Result: Timeouts, slow dashboard, operators lose visibility

After optimization (continuous aggregate):
- Query time: 12 ms
- Total CPU demand: 25 × 12 ms = 300 ms/sec (7.5% utilization on 4-core)
- Result: Fast and stable, with room for growth

After optimization (continuous aggregate + cache):
- Cache hit rate: ~95% (50 users sharing 15-second cache, refresh every 10 sec)
- Actual database queries: 25 × 0.05 (5% miss rate) = 1.25 queries/sec
- Total CPU demand: 1.25 × 12 ms = 15 ms/sec (<1% utilization)
- Result: <20ms P99 latency, system has capacity for 100× user growth
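The capacity arithmetic in Steps 1-5 reduces to one small formula: queries per second times per-query cost, discounted by the cache hit rate. A sketch that reproduces the numbers above:

```python
def cpu_demand_ms_per_sec(users, queries_per_refresh, refresh_sec,
                          query_ms, cache_hit_rate=0.0):
    """CPU demand (ms of query work per wall-clock second) generated by
    a polling dashboard. Only cache misses reach the database, so demand
    is discounted by the hit rate.
    """
    qps = users * queries_per_refresh / refresh_sec
    return qps * (1 - cache_hit_rate) * query_ms

# The three configurations from the worked example
# (4-core server = 4,000 ms of CPU per wall-clock second)
raw = cpu_demand_ms_per_sec(50, 5, 10, 2847)         # 71,175 ms/sec: overloaded
agg = cpu_demand_ms_per_sec(50, 5, 10, 12)           # 300 ms/sec: 7.5% of capacity
cached = cpu_demand_ms_per_sec(50, 5, 10, 12, 0.95)  # ~15 ms/sec: <1% of capacity
```

Running your own numbers through this before deploying a dashboard tells you immediately which layer (aggregation, caching, or both) you need.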

Step 6: Monitor with alerting

-- Create monitoring view
CREATE VIEW dashboard_health AS
SELECT
    COUNT(*) FILTER (WHERE query_duration_ms > 100) as slow_queries_5min,
    AVG(query_duration_ms) as avg_latency_ms,
    (SELECT cache_hit_rate FROM redis_stats) as cache_hit_pct
FROM query_log
WHERE query_start > NOW() - INTERVAL '5 minutes';

-- Alert if P95 latency exceeds 100ms
SELECT * FROM dashboard_health WHERE avg_latency_ms > 100;

Result: Dashboard latency reduced from 2.8 seconds to <20ms P99. System CPU utilization dropped from overloaded (71 seconds of CPU work demanded per second against only 4 CPU-seconds of capacity) to <1%, providing capacity for 100x user growth.

Key Decisions:

  1. Continuous aggregate pre-computes 1-minute averages incrementally (not on-query)
  2. 15-second cache exploits temporal locality (50 users see same 10-second refresh)
  3. 1-minute granularity balances freshness with query efficiency (5-minute window = 100 rows not 3M)
  4. 30-second refresh policy keeps aggregate <90 seconds stale (acceptable for operations dashboards)

Key Insight: The combination of continuous aggregates (reducing query time 237x by scanning 100 rows instead of 3M) and Redis caching (reducing database queries 20x with 95% hit rate) achieves a ~4,750x total reduction in database CPU load. Always optimize in layers – pre-aggregation first, then caching, then horizontal scaling if needed.

Use continuous aggregates (pre-compute periodically) when:

  • Query runs frequently (>10 times/hour) on the same time range and grouping
  • Query scans large datasets (millions of rows) to produce small results (hundreds of rows)
  • Result freshness can tolerate 30-60 second staleness
  • Multiple users request identical or similar aggregations (dashboards, reports)
  • Example: Factory dashboard showing hourly production averages, refreshed by 50 users

Use on-demand queries (no pre-aggregation) when:

  • Query runs infrequently (<1 time/hour) or with unpredictable parameters
  • Ad-hoc analysis requiring flexibility (data scientists exploring)
  • Result must be real-time fresh (<5 seconds stale)
  • Query scans small datasets (thousands of rows) already
  • Example: One-time forensic query investigating equipment failure at specific timestamp

Cache aggressively (Redis, CDN) when:

  • Result changes slowly relative to query frequency
  • Many users request identical data within short time window
  • Computation cost >> cache lookup cost
  • Acceptable staleness: seconds to minutes
  • Example: Public dashboards with thousands of viewers

Hybrid approach decision matrix:

  • Dashboard “current status” – 50+ users, every 10 sec, needs <1 min freshness → continuous aggregate (1-min) + Redis cache (15 sec)
  • Historical trend report – 5 users, once/hour, freshness not critical → on-demand query with materialized view
  • Real-time anomaly alert – system-driven, continuous, needs <5 sec → streaming (Kafka + Flink), not database queries
  • Ad-hoc investigation – 1 user, rarely, freshness N/A → direct SQL on raw data

Trade-off visualization:

Continuous Aggregate Cost = (Storage for materialized view) + (Incremental refresh CPU)
  Storage: ~1% of raw data (aggregates are tiny)
  Refresh CPU: Amortized across all queries (refresh once, serve 1000s)

On-Demand Query Cost = (Scan cost × Query frequency)
  Scan cost: Proportional to rows scanned
  Multiplier: Every query pays full cost

Break-even point:
  If Query_Frequency × Scan_Cost > Refresh_CPU + Storage_Cost
    → Use continuous aggregate

Example calculation (factory dashboard: 25 queries/sec):

On-demand (raw data):     25 queries/sec × 2,847ms = 71,175 ms/sec CPU (overloaded)
Continuous aggregate:     25 queries/sec × 12ms    =    300 ms/sec CPU (99.6% reduction)
+ Redis cache (95% hit):  1.25 queries/sec × 12ms  =     15 ms/sec CPU (99.98% reduction)
Background refresh cost:  1 refresh/30sec × 200ms  =    6.7 ms/sec average CPU (negligible)
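The break-even rule above can be expressed as a tiny predicate. A sketch – all costs in the same ms-of-CPU-per-second units as the example, with aggregate storage treated as negligible (~1% of raw data):

```python
def prefer_continuous_aggregate(query_freq_per_sec, scan_cost_ms,
                                refresh_cost_ms_per_sec):
    """Break-even rule: pre-aggregate when the amortized on-demand scan
    cost exceeds the background refresh overhead.
    """
    return query_freq_per_sec * scan_cost_ms > refresh_cost_ms_per_sec

# Factory dashboard: 25 q/s at 2,847 ms each vs ~6.7 ms/sec refresh work
assert prefer_continuous_aggregate(25, 2847, 6.7)          # pre-aggregate
# One small forensic query per hour: stay on-demand
assert not prefer_continuous_aggregate(1 / 3600, 50, 6.7)
```

The asymmetry is stark: the dashboard query exceeds break-even by four orders of magnitude, while the rare ad-hoc query is nowhere close.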

Common Mistake: Caching Without Invalidation Strategy

The Mistake: A team implements aggressive 5-minute Redis caching for a sensor dashboard to reduce database load. During a factory emergency, operators see stale “normal” readings for 5 minutes while the actual sensors are screaming alerts – the cache serves obsolete data, delaying a critical response.

Why It Happens:

  • Teams set cache TTL arbitrarily (“5 minutes seems reasonable”)
  • Caching is treated as “set and forget” without considering invalidation scenarios
  • No distinction between “nice to have fresh” vs “must be fresh” data
  • Development/testing uses static data, so staleness isn’t noticed

The Consequences:

  • Operators make decisions based on stale data (safety risk in industrial settings)
  • Anomaly detection systems miss critical events (false sense of normalcy)
  • Users lose trust in dashboard (“why is this number stuck?”)
  • Expensive incident post-mortems discover cache was the culprit

The Fix:

  1. Classify data by freshness requirements:

    Critical (< 5 sec stale): Emergency stops, safety alarms → NO CACHE or 5-sec TTL
    Operational (< 30 sec stale): Production line status → 15-sec cache with pub/sub invalidation
    Analytical (< 5 min stale): Trend charts, efficiency metrics → 1-min cache
    Historical (< 1 hour stale): Weekly reports → 10-min cache
  2. Implement cache invalidation on event publish:

    def sensor_alert_triggered(sensor_id, alert_type):
        """When alert fires, invalidate related cached dashboards"""
        # Delete affected cache keys immediately
        cache_keys_to_invalidate = [
            "dashboard:line_status",          # Main operations dashboard
            f"dashboard:sensor:{sensor_id}",  # Specific sensor detail view
            "alerts:active",                  # Active alerts widget
        ]
        for key in cache_keys_to_invalidate:
            redis_client.delete(key)
        # Optionally: force background refresh
        refresh_dashboard_async()
  3. Use cache tags for bulk invalidation:

    # Tag cache entries by production line
    redis_client.setex(
        "dashboard:line_status:line_a",
        300,  # 5-minute TTL as fallback
        json.dumps(data),
    )
    redis_client.sadd("cache_tag:line_a", "dashboard:line_status:line_a")
    
    # Invalidate all line_a caches on emergency
    def invalidate_line(line_id):
        keys = redis_client.smembers(f"cache_tag:{line_id}")
        if keys:
            redis_client.delete(*keys)
            redis_client.delete(f"cache_tag:{line_id}")
  4. Add cache bypass for critical paths:

    def get_sensor_status(sensor_id, bypass_cache=False):
        if bypass_cache or is_critical_alert_active(sensor_id):
            return query_database_directly(sensor_id)  # Skip cache
        # Normal caching path
        return get_cached_or_query(sensor_id)
  5. Monitor cache staleness:

    # Store cache write timestamp alongside data
    cache_value = {
        "data": sensor_readings,
        "cached_at": datetime.now().isoformat(),
    }
    redis_client.setex("dashboard:line_status", 300, json.dumps(cache_value))  # key, TTL seconds, value
    
    # Alert if cache is served beyond staleness threshold
    def get_with_staleness_check(cache_key, max_age_seconds):
        cached = redis_client.get(cache_key)
        if cached:
            value = json.loads(cached)
            age_seconds = (datetime.now() - datetime.fromisoformat(value["cached_at"])).total_seconds()
            if age_seconds > max_age_seconds:
                logger.warning(f"Serving stale cache: {cache_key} age={age_seconds}s (threshold={max_age_seconds}s)")
            return value["data"]
        return None

How One Team Caught This: A factory implemented a “cache health dashboard” showing cache hit rates AND staleness distribution. During an emergency drill, they discovered 47% of dashboard views showed data >2 minutes old despite active alarms. Policy was changed to 10-second TTL for operational dashboards + event-driven invalidation for alerts.

Key Prevention: Never cache critical data without an invalidation strategy. Implement event-driven cache invalidation (publish alerts → invalidate cache), classify data by freshness requirements, and monitor actual staleness in production. A 5-minute cache saving $100/month in database costs isn’t worth a safety incident costing millions.

The Sensor Squad learns to ask smart questions instead of reading the whole diary!

Sammy the Sensor has a million temperature readings stored in his diary. Lila the LED wants to know: “What was the temperature at exactly 3:00 PM yesterday?”

The SLOW way: Read ALL million entries until you find the right one. That would take forever!

The FAST way: “Jump straight to yesterday’s page, then find the 3:00 PM entry!” says Max the Microcontroller. “That is called a TIME-RANGE query – I tell the diary WHEN to look, so it does not search everywhere.”

Max teaches the Sensor Squad four types of smart questions:

Question 1: “What is the temperature RIGHT NOW?” (Last Value Query) Instead of reading all the data, Max just grabs the very last entry. Super fast – like checking the time on a clock instead of reading the entire history of clocks!

Question 2: “What was the average temperature yesterday?” (Time-Range Aggregation) Max only looks at yesterday’s entries and calculates the average. He skips everything from last week, last month, and last year.

Question 3: “Was anything weird this week?” (Anomaly Detection) Max calculates the normal range and flags anything outside it. “If the average is 22C and the range is 20-24C, then a reading of 35C is DEFINITELY weird!”

Question 4: “When temperature goes up, does humidity go down?” (Correlation Query) Max compares two sensor streams side by side. Bella the Battery says: “It is like noticing that every time it rains, you eat more soup. The two things are connected!”

“The secret,” says Max, “is to ALWAYS tell the database WHAT TIME PERIOD you care about. Never ask ‘show me everything’ – that is like asking a library to read you every book!”

13.6.1 Try This at Home!

Open a weather app that shows hourly forecasts. Try asking yourself these four types of questions: (1) What is the temperature NOW? (last value), (2) What was the average temperature yesterday? (aggregation), (3) Was any hour unusually hot or cold? (anomaly), (4) When it was windy, was it also cold? (correlation). You are doing the same thing databases do!

13.7 Concept Relationships

Prerequisites – Read these first:

  • Time-Series Fundamentals – Understanding time-based partitioning
  • Time-Series Platforms – Query languages (Flux, SQL, PromQL)
  • Time-Series Retention – Querying across retention tiers


13.8 What’s Next

Now that you can construct and optimize time-series queries, the next chapter puts these skills into practice with real-world scenarios.

  • Hands-On Practice (chapter: Time-Series Practice and Labs) – Apply query optimization through a Tesla case study, a hands-on ESP32 lab, and smart grid worked examples
  • ML-Based Anomaly Detection (chapter: Anomaly Detection) – Extend the z-score anomaly detection queries from this chapter into full ML-based detection pipelines
  • Data Quality (chapter: Data Quality Monitoring) – Add quality filters to your queries to handle missing, duplicated, or corrupted sensor readings
  • Distributed Queries (chapter: Sharding Strategies) – Route queries across distributed shards when a single database can no longer handle your data volume
  • Real-Time Alternatives (chapter: Stream Processing) – Replace polling-based database queries with continuous stream processing for sub-second latency