%%{init: {'theme': 'base', 'themeVariables': {'primaryColor':'#E8F4F8','primaryTextColor':'#2C3E50','primaryBorderColor':'#16A085','lineColor':'#16A085','secondaryColor':'#FEF5E7','tertiaryColor':'#F4ECF7','edgeLabelBackground':'#ffffff','textColor':'#2C3E50','fontSize':'14px'}}}%%
flowchart TD
A[Vehicle Sensors<br/>300M points/sec] --> B{On-Vehicle Filtering}
B -->|Critical Safety| C[Immediate Upload<br/>100% data]
B -->|Anomaly Detected| D[Burst Upload<br/>High-res context]
B -->|Normal Operation| E[Aggregate On-Vehicle<br/>1-minute averages]
C --> F[Tesla Cloud<br/>Time-Series DB]
D --> F
E --> F
F --> G[Hot Storage<br/>7 days full resolution]
G --> H[Warm Storage<br/>30 days, 1-min aggregates]
H --> I[Cold Storage<br/>1 year, 1-hour aggregates]
I --> J[Archive<br/>S3 Glacier, daily aggregates]
style A fill:#E8F4F8
style C fill:#E74C3C,color:#fff
style D fill:#E67E22,color:#fff
style E fill:#16A085,color:#fff
style F fill:#2C3E50,color:#fff
1291 Time-Series Practice and Labs
1291.1 Learning Objectives
By the end of this chapter, you will be able to:
- Apply time-series concepts through hands-on ESP32 lab exercises
- Analyze real-world scale challenges from the Tesla fleet telemetry case study
- Design complete time-series storage strategies using worked examples
- Implement circular buffers and downsampling on resource-constrained devices
- Calculate NTP synchronization intervals for distributed IoT deployments
Core Concept: Real-world time-series systems combine edge processing, adaptive sampling, and multi-tier retention to handle extreme scale while maintaining analytical capability.

Why It Matters: Tesla's 300 million points/second demonstrates that naive approaches fail catastrophically; only through edge aggregation, adaptive sampling, and tiered retention can such systems become practical.

Key Takeaway: Start with standard tools (InfluxDB, TimescaleDB), implement retention from day one, and add edge processing when cloud ingestion becomes a bottleneck.
1291.2 Real-World Case Study: Tesla Fleet Telemetry
Tesla operates one of the world’s largest IoT time-series systems, collecting data from over 1.5 million vehicles worldwide.
1291.2.1 Scale Challenge
Data Generation:
- Vehicles: 1.5 million active vehicles
- Sensors per vehicle: ~200 sensors (battery, motors, HVAC, GPS, cameras, etc.)
- Sampling rate: 1 Hz average (some sensors faster, some slower)
- Data points per second: 1.5M vehicles x 200 sensors x 1 Hz = 300 million points/second
- Daily data: 300M x 86,400 seconds = 25.9 trillion points/day
Storage Requirements (without optimization):
- Raw: 25.9T points x 32 bytes = 829 TB/day
- Annual: 829 TB/day x 365 = 302 PB/year
Storing this volume raw is economically infeasible. Tesla's approach:
1291.2.2 Tesla’s Time-Series Strategy
Key Optimizations:
- Edge Aggregation: Vehicles pre-process 90% of data locally
  - Only upload anomalies and aggregates
  - Reduces cloud ingestion to ~30M points/second (10x reduction)
- Adaptive Sampling: Sample rates adjust based on context (see the sketch after this list)
  - Parked: Sample every 5 minutes
  - Driving: Sample every second
  - Hard braking: Sample at 100 Hz for 10 seconds
- Multi-Tier Retention:
  - Hot (7 days): Full resolution for recent analysis
  - Warm (30 days): Downsampled for trend analysis
  - Cold (1 year): Aggregates for long-term patterns
  - Archive: Compliance and model training
- Custom Time-Series Engine:
  - Tesla built custom infrastructure (not off-the-shelf)
  - Columnar storage with extreme compression (50:1 ratios)
  - Distributed across data centers globally
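To make the adaptive-sampling idea concrete, here is a minimal C++ sketch that maps a vehicle state to a sampling interval using the rates quoted above. The state names, rates, and function name are illustrative assumptions, not Tesla's actual implementation.

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical vehicle states used only for this illustration.
enum class VehicleState { Parked, Driving, HardBraking };

// Return the sampling interval in milliseconds for a given state,
// mirroring the rates above: every 5 minutes parked, every second
// driving, 100 Hz (10 ms) during a hard-braking burst.
uint32_t sampleIntervalMs(VehicleState state) {
    switch (state) {
        case VehicleState::Parked:      return 5u * 60u * 1000u;
        case VehicleState::Driving:     return 1000u;
        case VehicleState::HardBraking: return 10u;
    }
    return 1000u;  // defensive default
}

int main() {
    printf("Parked:       %u ms\n", (unsigned)sampleIntervalMs(VehicleState::Parked));
    printf("Driving:      %u ms\n", (unsigned)sampleIntervalMs(VehicleState::Driving));
    printf("Hard braking: %u ms\n", (unsigned)sampleIntervalMs(VehicleState::HardBraking));
}
```

The same pattern applies to any edge device: derive the sampling interval from current context instead of using one fixed rate.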
Results:
- Actual storage: ~15 PB/year (vs. 302 PB raw)
- Query latency: <100 ms for recent data analysis
- Powers Autopilot improvements, range predictions, and battery health monitoring
Lessons for IoT Architects:
- Edge processing is essential at scale: Don’t send everything to the cloud
- Adaptive strategies: Sample rates and retention policies should match data value
- Domain-specific compression: Tesla’s battery telemetry compresses 100:1 because voltage changes slowly
- Start with standard tools: Use InfluxDB or TimescaleDB initially, only build custom if you reach their limits
1291.3 Understanding Check
You’re deploying an IoT system for a 20-story office building with the following sensor network:
- Temperature sensors: 500 sensors (1 per room), report every 10 seconds
- Occupancy sensors: 500 sensors (motion detection), report on change (avg 1/minute)
- Energy meters: 50 meters (per floor + equipment), report every 30 seconds
- Air quality sensors: 100 sensors (CO2, VOC), report every 60 seconds
Each reading is 32 bytes (timestamp + sensor_id + value + metadata).
1291.3.1 Questions:
1. How many data points per day does this system generate?
2. What is the raw storage requirement per month (without compression)?
3. If you implement this retention policy, what is the storage after 1 year?
   - Tier 1: Raw data, 7 days retention
   - Tier 2: 1-minute averages, 30 days retention
   - Tier 3: 1-hour averages, 1 year retention
   - Tier 4: Daily aggregates, forever
4. Which database would you recommend and why?
1291.3.2 Solutions:
1. Data points per day:
- Temperature: 500 sensors x (86,400 sec/day / 10 sec) = 4,320,000 points/day
- Occupancy: 500 sensors x (1,440 min/day / 1 min) = 720,000 points/day
- Energy: 50 meters x (86,400 sec/day / 30 sec) = 144,000 points/day
- Air quality: 100 sensors x (86,400 sec/day / 60 sec) = 144,000 points/day
Total: 5,328,000 points/day
2. Raw storage per month:
- Daily: 5,328,000 points x 32 bytes = 170.5 MB/day
- Monthly: 170.5 MB x 30 = 5.1 GB/month raw
3. Storage after 1 year with retention policy:
Tier 1 (Raw, 7 days):
- 5,328,000 points/day x 7 days x 32 bytes = 1.19 GB
- With 10:1 compression: 119 MB

Tier 2 (1-minute averages, 30 days):
- Original: 5,328,000 points/day -> Downsampled: ~88,800 points/day (1-minute buckets)
- 88,800 x 30 days x 32 bytes = 85.2 MB
- With 10:1 compression: 8.5 MB

Tier 3 (1-hour averages, 1 year):
- 88,800 points/day -> 1,480 points/day (1-hour buckets)
- 1,480 x 365 days x 32 bytes = 17.3 MB
- With 10:1 compression: 1.7 MB

Tier 4 (Daily aggregates, forever):
- 1,480 points/day -> ~25 points/day (daily buckets, multiple aggregations)
- 25 x 365 days x 32 bytes = 0.3 MB/year
- Negligible growth
Total storage after 1 year: ~130 MB (a verification sketch follows these solutions)
vs. no retention policy: 5.1 GB/month x 12 = 61.2 GB raw (6.1 GB compressed)
Savings: 98% reduction
4. Database recommendation:
Recommended: TimescaleDB
Reasoning:
- Write throughput is manageable: 5.3M points/day / 86,400 seconds = ~62 writes/second (well within TimescaleDB capacity)
- Need for correlations: Building management systems need to join sensor data with:
  - Room assignments (which tenant, department)
  - Energy billing data
  - Maintenance schedules
  - Occupancy reservations
- SQL compatibility: Facilities team likely familiar with SQL, easier integration with existing building management software
- PostgreSQL ecosystem: Rich tooling for dashboards (Grafana), reporting, and analytics
Alternative: InfluxDB would work if:
- Write rates increased 10x (more sensors added)
- No need to correlate with relational business data
- Team willing to learn the Flux query language
Not recommended: Prometheus, which is designed for short-term infrastructure monitoring, not multi-year IoT data retention.
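If you want to double-check the arithmetic for questions 1-3, this short C++ sketch recomputes the tiered totals under the same assumptions used above (32-byte readings, 10:1 compression, and the downsampled point counts from the solution). It is only a verification aid.

```cpp
#include <cstdio>

int main() {
    const double bytesPerPoint = 32.0;
    const double compression   = 10.0;       // assumed 10:1 compression
    const double rawPerDay     = 5328000.0;  // points/day from Question 1
    const double minutePerDay  = 88800.0;    // 1-minute buckets, as used above
    const double hourPerDay    = 1480.0;     // 1-hour buckets
    const double dailyPerDay   = 25.0;       // daily aggregates

    const double MB = 1e6;
    double tier1 = rawPerDay    * 7   * bytesPerPoint / compression;  // raw, 7 days
    double tier2 = minutePerDay * 30  * bytesPerPoint / compression;  // 30 days
    double tier3 = hourPerDay   * 365 * bytesPerPoint / compression;  // 1 year
    double tier4 = dailyPerDay  * 365 * bytesPerPoint;                // tiny, uncompressed

    printf("Tier 1: %.1f MB\n", tier1 / MB);
    printf("Tier 2: %.1f MB\n", tier2 / MB);
    printf("Tier 3: %.1f MB\n", tier3 / MB);
    printf("Tier 4: %.1f MB\n", tier4 / MB);
    printf("Total : %.1f MB\n", (tier1 + tier2 + tier3 + tier4) / MB);
}
```

Running it reproduces the ~130 MB total claimed in solution 3.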
1291.5 Worked Example: Time-Series Query Optimization for Smart Grid
Scenario: A utility company operates a smart grid with 10,000 smart meters, each reporting energy consumption every 15 minutes. The operations team needs dashboards showing:
- Real-time consumption (last hour, per-meter detail)
- Daily peak demand (last 30 days, aggregated by region)
- Monthly trends (last 12 months, company-wide)
Goal: Design a query and storage strategy that keeps dashboard latency under 2 seconds while minimizing storage costs.
What we do: Estimate raw data ingestion and storage requirements.
Calculations:
Meters: 10,000
Readings per meter per day: 24 hours x 4 readings/hour = 96
Total readings per day: 10,000 x 96 = 960,000
Bytes per reading: ~50 bytes (timestamp + meter_id + kWh + voltage + metadata)
Daily raw data: 960,000 x 50 = 48 MB/day
Annual raw data: 48 MB x 365 = 17.5 GB/year (uncompressed)
Why: Data volume determines partition strategy, retention policies, and hardware requirements. At 960K writes/day (about 11 writes/second on average), even a basic TSDB handles ingestion easily, but dashboards querying across 350M+ annual rows need optimization.
What we do: Configure time-based partitioning to isolate recent vs historical data.
TimescaleDB Configuration:
-- Create hypertable with 1-day chunks
CREATE TABLE meter_readings (
time TIMESTAMPTZ NOT NULL,
meter_id INTEGER NOT NULL,
region_id INTEGER NOT NULL,
kwh DOUBLE PRECISION,
voltage DOUBLE PRECISION
);
SELECT create_hypertable('meter_readings', 'time',
chunk_time_interval => INTERVAL '1 day');
-- Add indexes for common query patterns
CREATE INDEX idx_meter_time ON meter_readings (meter_id, time DESC);
CREATE INDEX idx_region_time ON meter_readings (region_id, time DESC);

Why: Daily chunks mean queries for "last hour" scan only one partition (~40K rows) instead of the entire table. The meter_id and region_id indexes accelerate filtered queries without excessive write overhead.
What we do: Pre-compute hourly and daily aggregates to accelerate dashboard queries.
Continuous Aggregate Setup:
-- Hourly aggregates (for daily peak analysis)
CREATE MATERIALIZED VIEW meter_hourly
WITH (timescaledb.continuous) AS
SELECT
time_bucket('1 hour', time) AS hour,
region_id,
COUNT(*) as reading_count,
AVG(kwh) as avg_kwh,
MAX(kwh) as peak_kwh,
SUM(kwh) as total_kwh
FROM meter_readings
GROUP BY hour, region_id
WITH NO DATA;
-- Refresh policy: update every 15 minutes, cover last 2 hours
SELECT add_continuous_aggregate_policy('meter_hourly',
start_offset => INTERVAL '2 hours',
end_offset => INTERVAL '15 minutes',
schedule_interval => INTERVAL '15 minutes');
-- Daily aggregates (for monthly trend analysis)
CREATE MATERIALIZED VIEW meter_daily
WITH (timescaledb.continuous) AS
SELECT
time_bucket('1 day', time) AS day,
region_id,
AVG(kwh) as avg_kwh,
MAX(kwh) as daily_peak_kwh,
SUM(kwh) as total_kwh
FROM meter_readings
GROUP BY day, region_id
WITH NO DATA;
SELECT add_continuous_aggregate_policy('meter_daily',
start_offset => INTERVAL '3 days',
end_offset => INTERVAL '1 day',
schedule_interval => INTERVAL '1 day');

Why: Continuous aggregates pre-compute results incrementally. A 30-day peak demand query now scans 30 rows per region (720 total for 24 regions) instead of 28.8M raw readings, a 40,000x reduction.
What we do: Implement tiered retention to balance detail vs storage cost.
Retention Policy:
-- Compress data older than 7 days (10:1 compression typical)
ALTER TABLE meter_readings SET (
timescaledb.compress,
timescaledb.compress_segmentby = 'meter_id',
timescaledb.compress_orderby = 'time DESC'
);
SELECT add_compression_policy('meter_readings', INTERVAL '7 days');
-- Drop raw data older than 90 days (aggregates retained)
SELECT add_retention_policy('meter_readings', INTERVAL '90 days');
-- Keep hourly aggregates for 2 years
SELECT add_retention_policy('meter_hourly', INTERVAL '2 years');
-- Keep daily aggregates for 10 years
SELECT add_retention_policy('meter_daily', INTERVAL '10 years');

Why: This tiered approach provides:
- Last 7 days: Full resolution, uncompressed (fast per-meter queries)
- 7-90 days: Full resolution, compressed (10x storage reduction)
- 90+ days: Only aggregates retained (99% storage reduction)
What we do: Write queries that leverage partitions and aggregates.
Optimized Queries:
-- Dashboard 1: Real-time consumption (last hour)
-- Scans: ~40K rows from current chunk
SELECT meter_id, time, kwh
FROM meter_readings
WHERE time > NOW() - INTERVAL '1 hour'
AND region_id = 5
ORDER BY time DESC;
-- Latency: ~50ms
-- Dashboard 2: Daily peak demand (last 30 days)
-- Scans: ~720 rows from hourly aggregate
SELECT date_trunc('day', hour) AS day,
       MAX(peak_kwh) AS daily_peak
FROM meter_hourly
WHERE hour > NOW() - INTERVAL '30 days'
GROUP BY day
ORDER BY day;
-- Latency: ~20ms
-- Dashboard 3: Monthly trends (last 12 months)
-- Scans: ~288 rows from daily aggregate
SELECT date_trunc('month', day) as month,
SUM(total_kwh) as monthly_consumption
FROM meter_daily
WHERE day > NOW() - INTERVAL '12 months'
GROUP BY month
ORDER BY month;
-- Latency: ~15ms

Why: Each query hits the appropriate data tier: raw data for recent detail, hourly aggregates for medium-term analysis, daily aggregates for long-term trends. All queries complete in under 100ms.
Outcome: All three dashboard queries complete in under 100ms (well under the 2-second requirement).
Storage Summary:

| Data Tier | Retention | Size (Year 1) | Size (Year 5) |
|---|---|---|---|
| Raw (uncompressed) | 7 days | 336 MB | 336 MB |
| Raw (compressed) | 90 days | 430 MB | 430 MB |
| Hourly aggregates | 2 years | 15 MB | 30 MB |
| Daily aggregates | 10 years | 0.5 MB | 2.5 MB |
| Total | - | ~780 MB | ~800 MB |
Comparison without optimization: 87.5 GB after 5 years (109x more storage).
Key Decisions Made:
1. Daily chunks: Isolates recent data for fast queries
2. Continuous aggregates: Pre-computes common dashboard queries
3. Tiered retention: Keeps detail where needed, aggregates everywhere else
4. Compression after 7 days: Balances query speed vs storage
5. Index strategy: region_id + time indexes match query patterns
1291.6 Worked Examples: Time Synchronization
Scenario: A fleet management system tracks 5,000 vehicles using GPS sensors. Each vehicle reports position data to a central time-series database (InfluxDB). Due to cellular network variability, GPS timestamps from vehicles can drift from server time. The operations team needs to ensure data can be correctly ordered despite clock discrepancies.
Given: - Number of vehicles: 5,000 - Reporting interval: 10 seconds per vehicle - Vehicle GPS clock accuracy: +/-2 seconds (cellular NTP sync) - Server clock accuracy: +/-50 ms (GPS-disciplined NTP) - Data retention window for real-time view: 1 hour - Out-of-order arrival tolerance: Up to 30 seconds late
Steps:

1. Calculate timestamp uncertainty between any two vehicles:
   - Vehicle A clock: +/-2 seconds
   - Vehicle B clock: +/-2 seconds
   - Combined uncertainty = +/-4 seconds (worst case: A is +2s, B is -2s)
2. Determine minimum sampling interval for unambiguous ordering:
   - Events must be >4 seconds apart to guarantee correct ordering
   - Current interval (10 seconds) > 4 seconds: OK for ordering
3. Configure InfluxDB write buffer for late arrivals:
   - Maximum expected lateness: 30 seconds
   - Set `cache-max-memory-size` to handle 30 seconds of pending writes
   - Buffer size = 5,000 vehicles x 3 readings (30s / 10s) x 100 bytes = 1.5 MB
4. Design timestamp handling strategy:
   - Primary timestamp: Vehicle GPS timestamp (event time)
   - Secondary timestamp: Server receipt time (processing time)
   - Store both: `time` (GPS) and `received_at` (server)
   - Query by GPS time for correct route reconstruction
5. Configure retention policy for clock drift tolerance:
   - Hot tier: 1 hour at full resolution (for real-time dashboards)
   - Use 5-second `GROUP BY time()` windows to absorb +/-2 second drift variations
Result: Configure InfluxDB with 1.5 MB write cache, dual timestamps, and 5-second aggregation buckets. This absorbs +/-4 second clock drift while correctly ordering 98.5% of position updates.
Key Insight: In distributed IoT systems, clock drift is inevitable. Design your data model to store both event time (when it happened) and ingestion time (when you received it). Query by event time for analytics but use ingestion time for troubleshooting data pipeline issues.
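As a minimal sketch of this dual-timestamp approach, the C++ snippet below defines a position record that carries both the GPS event time and the server receipt time, and orders records by event time for route reconstruction. The struct and function names are illustrative, not part of any InfluxDB client API.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Illustrative record: both timestamps are stored, as recommended above.
struct PositionUpdate {
    uint32_t vehicleId;
    int64_t  eventTimeMs;   // GPS timestamp (when the fix was taken)
    int64_t  receivedAtMs;  // server receipt time (when it arrived)
    double   lat;
    double   lon;
};

// Order updates by event time for route reconstruction; ties are broken
// by receipt time so late duplicates stay deterministic.
void sortByEventTime(std::vector<PositionUpdate>& updates) {
    std::sort(updates.begin(), updates.end(),
              [](const PositionUpdate& a, const PositionUpdate& b) {
                  if (a.eventTimeMs != b.eventTimeMs)
                      return a.eventTimeMs < b.eventTimeMs;
                  return a.receivedAtMs < b.receivedAtMs;
              });
}

int main() {
    std::vector<PositionUpdate> updates = {
        {42, 1700000010000, 1700000012500, 37.7749, -122.4194},
        {42, 1700000000000, 1700000013000, 37.7740, -122.4190},  // arrived late
    };
    sortByEventTime(updates);  // route order now follows GPS event time
    return 0;
}
```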
Scenario: A smart agriculture company deploys 200 edge gateways across remote farms. Each gateway aggregates data from 50 soil sensors and forwards to the cloud every 5 minutes. The gateways have low-cost oscillators and intermittent cellular connectivity. The team must design NTP synchronization to ensure timestamp accuracy for cross-farm analysis.
Given: - Edge gateways: 200 devices - Gateway oscillator drift: +/-100 ppm (low-cost crystal) - Cellular connectivity: Available 80% of time (intermittent) - NTP round-trip time: 200-500 ms (cellular network) - Required timestamp accuracy: +/-500 ms for cross-farm comparison - Data upload interval: 5 minutes (300 seconds)
Steps:

1. Calculate maximum drift between NTP syncs:
   - Drift rate = 100 ppm = 100 microseconds per second = 0.1 ms/s
   - Over 5-minute upload interval: 0.1 ms/s x 300 s = 30 ms drift
   - Over 1 hour without sync: 0.1 ms/s x 3,600 s = 360 ms drift
2. Calculate NTP sync accuracy limits:
   - Network RTT asymmetry: Assume 10% asymmetry
   - Asymmetry error = RTT x asymmetry = 350 ms x 10% = 35 ms
   - Best achievable NTP accuracy: ~50-100 ms over cellular
3. Determine maximum sync interval:
   - Error budget: 500 ms target accuracy
   - NTP sync error: 100 ms (typical)
   - Available for drift: 500 - 100 = 400 ms
   - Max sync interval = 400 ms / 0.1 ms/s = 4,000 seconds (~66 minutes)
4. Account for connectivity gaps:
   - 20% downtime means potentially 20% longer gaps between syncs
   - Apply safety factor: 66 min x 0.8 = 53 minutes recommended
   - Round to 30 minutes for practical implementation
5. Configure NTP client for intermittent connectivity:
   - Primary: Cloud NTP server (time.google.com)
   - Secondary: GPS time from connected sensors (if available)
   - Retry interval on failure: 5 minutes
   - Maximum poll interval: 30 minutes
   - Store last-known offset for gap periods
Result: Configure edge gateways to sync NTP every 30 minutes. During connectivity gaps up to 66 minutes, clocks remain within 400 ms accuracy. Combined with NTP sync error (~100 ms), total accuracy stays within 500 ms target.
Key Insight: Low-cost IoT devices with 100 ppm oscillators need hourly NTP syncs for sub-second accuracy. For tighter requirements (<100 ms), either upgrade to TCXO oscillators (+/-2 ppm) or implement GPS-disciplined timing at edge gateways.
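The following C++ sketch recomputes the drift budget and maximum sync interval from the stated assumptions (100 ppm drift, 500 ms target accuracy, ~100 ms typical NTP error over cellular). It is a sanity check on the arithmetic above, not gateway firmware.

```cpp
#include <cstdio>

int main() {
    const double driftPpm         = 100.0;              // low-cost crystal
    const double driftMsPerSec    = driftPpm / 1000.0;  // 100 ppm = 0.1 ms/s
    const double targetAccuracyMs = 500.0;              // cross-farm requirement
    const double ntpErrorMs       = 100.0;              // typical over cellular

    double driftBudgetMs  = targetAccuracyMs - ntpErrorMs;  // 400 ms
    double maxSyncSeconds = driftBudgetMs / driftMsPerSec;  // 4,000 s
    double withMarginMin  = (maxSyncSeconds / 60.0) * 0.8;  // 20% downtime margin

    printf("Drift budget:       %.0f ms\n", driftBudgetMs);
    printf("Max sync interval:  %.0f s (%.0f min)\n", maxSyncSeconds, maxSyncSeconds / 60.0);
    printf("With safety factor: %.0f min (rounded to 30 min in practice)\n", withMarginMin);
}
```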
1291.7 Hands-On Lab: Time-Series Data Logger
In this hands-on lab, you will build a time-series data logger using an ESP32 microcontroller. You will learn how to collect timestamped sensor data, implement efficient storage using circular buffers, apply downsampling techniques, and query historical data through serial commands.
1291.7.1 Learning Objectives
By completing this lab, you will be able to:
- Collect timestamped sensor data from multiple sensors on an ESP32
- Implement a circular buffer for efficient memory management on resource-constrained devices
- Apply downsampling techniques to reduce storage requirements while preserving data trends
- Query historical data using custom serial commands
- Understand the trade-offs between data resolution and storage capacity
1291.7.2 Prerequisites
- Basic C/C++ programming knowledge
- Familiarity with Arduino IDE concepts
- Understanding of time-series data concepts (covered earlier in this chapter)
1291.7.3 Wokwi Simulator
Use the embedded simulator below to complete this lab. The ESP32 environment comes pre-configured with essential libraries.
- Click inside the simulator and press Ctrl+Shift+M (or Cmd+Shift+M on Mac) to open the Serial Monitor
- Use the temperature sensor on the virtual breadboard or simulate readings with the built-in random values
- You can save your project to Wokwi by creating a free account
1291.7.4 Step-by-Step Instructions
1291.7.4.1 Step 1: Set Up the Basic Data Structure
First, define the data structures for storing timestamped sensor readings. Copy this code into the simulator:
#include <Arduino.h>
#include <time.h>
// Configuration constants
#define BUFFER_SIZE 100 // Number of readings to store
#define SAMPLE_INTERVAL 1000 // Sample every 1 second (ms)
#define DOWNSAMPLE_FACTOR 5 // Average every 5 readings for storage
// Data point structure - 12 bytes per reading
struct DataPoint {
uint32_t timestamp; // Unix timestamp (seconds since epoch)
float temperature; // Temperature in Celsius
float humidity; // Relative humidity percentage
};
// Circular buffer for raw data (high resolution)
DataPoint rawBuffer[BUFFER_SIZE];
int rawHead = 0;
int rawCount = 0;
// Circular buffer for downsampled data (long-term storage)
DataPoint downsampledBuffer[BUFFER_SIZE];
int dsHead = 0;
int dsCount = 0;
// Accumulator for downsampling
float tempAccum = 0;
float humidAccum = 0;
int accumCount = 0;
// Timing
unsigned long lastSampleTime = 0;
uint32_t startTime = 0;

1291.7.4.2 Step 2: Implement the Circular Buffer Operations
Add these functions to manage the circular buffer efficiently:
// Add a data point to a circular buffer
void addToBuffer(DataPoint* buffer, int& head, int& count,
DataPoint point, int maxSize) {
buffer[head] = point;
head = (head + 1) % maxSize;
if (count < maxSize) {
count++;
}
}
// Get the index of a point at a given offset from newest
int getIndex(int head, int count, int offset, int maxSize) {
if (offset >= count) return -1;
return (head - 1 - offset + maxSize) % maxSize;
}
// Calculate storage usage
void printStorageStats() {
int rawBytes = rawCount * sizeof(DataPoint);
int dsBytes = dsCount * sizeof(DataPoint);
Serial.println("\n=== Storage Statistics ===");
Serial.printf("Raw buffer: %d/%d points (%d bytes)\n",
rawCount, BUFFER_SIZE, rawBytes);
Serial.printf("Downsampled: %d/%d points (%d bytes)\n",
dsCount, BUFFER_SIZE, dsBytes);
Serial.printf("Total memory: %d bytes\n", rawBytes + dsBytes);
Serial.printf("Compression ratio: %.1fx\n",
rawCount > 0 ? (float)(rawCount * sizeof(DataPoint)) /
(dsCount * sizeof(DataPoint) + 1) : 1.0);
}

1291.7.4.3 Step 3: Implement Sensor Reading and Downsampling
Add the sensor collection and downsampling logic:
// Simulate sensor readings (replace with real sensors in production)
DataPoint readSensors() {
DataPoint dp;
dp.timestamp = startTime + (millis() / 1000);
// Simulate temperature: 20-30C with some variation
dp.temperature = 25.0 + sin(millis() / 10000.0) * 5.0 +
random(-10, 10) / 10.0;
// Simulate humidity: 40-60% with some variation
dp.humidity = 50.0 + cos(millis() / 15000.0) * 10.0 +
random(-5, 5) / 10.0;
return dp;
}
// Process and store a new reading with downsampling
void processSensorReading() {
DataPoint reading = readSensors();
// Store in raw buffer (high resolution)
addToBuffer(rawBuffer, rawHead, rawCount, reading, BUFFER_SIZE);
// Accumulate for downsampling
tempAccum += reading.temperature;
humidAccum += reading.humidity;
accumCount++;
// When we have enough samples, create downsampled point
if (accumCount >= DOWNSAMPLE_FACTOR) {
DataPoint dsPoint;
dsPoint.timestamp = reading.timestamp;
dsPoint.temperature = tempAccum / accumCount;
dsPoint.humidity = humidAccum / accumCount;
addToBuffer(downsampledBuffer, dsHead, dsCount, dsPoint, BUFFER_SIZE);
// Reset accumulator
tempAccum = 0;
humidAccum = 0;
accumCount = 0;
}
// Print latest reading
Serial.printf("[%lu] Temp: %.2fC, Humidity: %.2f%%\n",
reading.timestamp, reading.temperature, reading.humidity);
}

1291.7.4.4 Step 4: Implement Query Commands
Add serial command processing for querying historical data:
// Query last N readings from specified buffer
void queryLastN(DataPoint* buffer, int head, int count,
int n, const char* bufferName) {
Serial.printf("\n=== Last %d readings from %s ===\n", n, bufferName);
Serial.println("Timestamp\tTemp(C)\tHumidity(%)");
Serial.println("----------------------------------------");
int toShow = min(n, count);
for (int i = 0; i < toShow; i++) {
int idx = getIndex(head, count, i, BUFFER_SIZE);
if (idx >= 0) {
Serial.printf("%lu\t\t%.2f\t\t%.2f\n",
buffer[idx].timestamp,
buffer[idx].temperature,
buffer[idx].humidity);
}
}
}
// Calculate statistics for a buffer
void calculateStats(DataPoint* buffer, int head, int count) {
if (count == 0) {
Serial.println("No data available.");
return;
}
float minTemp = 999, maxTemp = -999, sumTemp = 0;
float minHum = 999, maxHum = -999, sumHum = 0;
for (int i = 0; i < count; i++) {
int idx = getIndex(head, count, i, BUFFER_SIZE);
if (idx >= 0) {
sumTemp += buffer[idx].temperature;
sumHum += buffer[idx].humidity;
if (buffer[idx].temperature < minTemp)
minTemp = buffer[idx].temperature;
if (buffer[idx].temperature > maxTemp)
maxTemp = buffer[idx].temperature;
if (buffer[idx].humidity < minHum)
minHum = buffer[idx].humidity;
if (buffer[idx].humidity > maxHum)
maxHum = buffer[idx].humidity;
}
}
Serial.println("\n=== Statistics ===");
Serial.printf("Temperature - Min: %.2f, Max: %.2f, Avg: %.2f\n",
minTemp, maxTemp, sumTemp / count);
Serial.printf("Humidity - Min: %.2f, Max: %.2f, Avg: %.2f\n",
minHum, maxHum, sumHum / count);
}
// Process serial commands
void processCommand(String cmd) {
cmd.trim();
cmd.toUpperCase();
if (cmd == "HELP") {
Serial.println("\n=== Available Commands ===");
Serial.println("HELP - Show this help message");
Serial.println("RAW N - Show last N raw readings");
Serial.println("DS N - Show last N downsampled readings");
Serial.println("STATS - Show statistics for all data");
Serial.println("STORAGE - Show storage usage");
Serial.println("CLEAR - Clear all buffers");
}
else if (cmd.startsWith("RAW ")) {
int n = cmd.substring(4).toInt();
queryLastN(rawBuffer, rawHead, rawCount, n, "Raw Buffer");
}
else if (cmd.startsWith("DS ")) {
int n = cmd.substring(3).toInt();
queryLastN(downsampledBuffer, dsHead, dsCount, n, "Downsampled Buffer");
}
else if (cmd == "STATS") {
Serial.println("\n--- Raw Data Statistics ---");
calculateStats(rawBuffer, rawHead, rawCount);
Serial.println("\n--- Downsampled Data Statistics ---");
calculateStats(downsampledBuffer, dsHead, dsCount);
}
else if (cmd == "STORAGE") {
printStorageStats();
}
else if (cmd == "CLEAR") {
rawHead = rawCount = 0;
dsHead = dsCount = 0;
tempAccum = humidAccum = accumCount = 0;
Serial.println("All buffers cleared.");
}
else {
Serial.println("Unknown command. Type HELP for available commands.");
}
}

1291.7.4.5 Step 5: Setup and Main Loop
Complete the program with the setup and loop functions:
void setup() {
Serial.begin(115200);
delay(1000);
// Initialize pseudo-random time (in real deployment, use NTP)
startTime = 1704067200; // Jan 1, 2024 00:00:00 UTC
Serial.println("\n====================================");
Serial.println(" Time-Series Data Logger v1.0");
Serial.println("====================================");
Serial.println("Collecting sensor data...");
Serial.printf("Sample interval: %d ms\n", SAMPLE_INTERVAL);
Serial.printf("Downsample factor: %d\n", DOWNSAMPLE_FACTOR);
Serial.printf("Buffer size: %d readings\n", BUFFER_SIZE);
Serial.println("\nType HELP for available commands.\n");
}
void loop() {
// Collect sensor data at specified interval
unsigned long currentTime = millis();
if (currentTime - lastSampleTime >= SAMPLE_INTERVAL) {
lastSampleTime = currentTime;
processSensorReading();
}
// Process serial commands
if (Serial.available()) {
String command = Serial.readStringUntil('\n');
processCommand(command);
}
}

1291.7.5 Testing Your Implementation
- Start the simulator and open the Serial Monitor (Ctrl+Shift+M)
- Wait 10-15 seconds to collect some data
- Try these commands:
- `HELP` - View all available commands
- `RAW 10` - Show the last 10 raw readings
- `DS 5` - Show the last 5 downsampled readings
- `STATS` - View min/max/average statistics
- `STORAGE` - See memory usage and compression ratio
1291.7.6 Challenge Exercises
Modify the code to include a third sensor (e.g., pressure or light level). Update the DataPoint structure and all related functions.
Hint: You will need to update the struct, the readSensors() function, and all print statements.
Add a new command RANGE start end that queries data between two timestamps. For example, RANGE 1704067210 1704067220 should show all readings in that 10-second window.
Hint: Use the queryTimeRange() function pattern shown in the chapter.
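One possible shape for the RANGE query, reusing the lab's getIndex() helper and buffer layout; treat this as a starting sketch, and the command parsing shown in comments as just one option.

```cpp
// Challenge 2 sketch: print all readings whose timestamps fall in [startTs, endTs].
void queryTimeRange(DataPoint* buffer, int head, int count,
                    uint32_t startTs, uint32_t endTs) {
    Serial.printf("\n=== Readings between %lu and %lu ===\n", startTs, endTs);
    for (int i = count - 1; i >= 0; i--) {   // iterate oldest to newest
        int idx = getIndex(head, count, i, BUFFER_SIZE);
        if (idx < 0) continue;
        uint32_t ts = buffer[idx].timestamp;
        if (ts >= startTs && ts <= endTs) {
            Serial.printf("%lu\t%.2f\t%.2f\n",
                          ts, buffer[idx].temperature, buffer[idx].humidity);
        }
    }
}

// One way to hook it into processCommand():
// else if (cmd.startsWith("RANGE ")) {
//     int space = cmd.indexOf(' ', 6);                   // separates the two timestamps
//     uint32_t startTs = cmd.substring(6, space).toInt();
//     uint32_t endTs   = cmd.substring(space + 1).toInt();
//     queryTimeRange(rawBuffer, rawHead, rawCount, startTs, endTs);
// }
```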
Implement a three-tier storage system:
- Raw data: 1-second resolution (last 60 readings)
- Medium: 10-second averages (last 100 readings)
- Long-term: 1-minute averages (last 100 readings)
This mirrors how production time-series databases like InfluxDB implement retention policies.
Hint: Create three separate circular buffers and three downsampling accumulators.
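A possible skeleton for the three tiers, reusing the DataPoint struct and addToBuffer() helper from the lab. The buffer sizes, names, and cascade factors below are suggestions matching the resolutions listed in the challenge.

```cpp
// Challenge 3 skeleton: three tiers with cascading downsampling.
#define RAW_TIER_SIZE    60   // 1-second resolution, last 60 readings
#define MEDIUM_TIER_SIZE 100  // 10-second averages
#define LONG_TIER_SIZE   100  // 1-minute averages

DataPoint rawTier[RAW_TIER_SIZE], mediumTier[MEDIUM_TIER_SIZE], longTier[LONG_TIER_SIZE];
int rawTierHead = 0, rawTierCount = 0;
int medHead = 0, medCount = 0;
int longHead = 0, longCount = 0;

// Accumulators: raw -> medium every 10 samples, medium -> long every 6 averages.
float medTempAcc = 0, medHumAcc = 0;   int medAccN = 0;
float longTempAcc = 0, longHumAcc = 0; int longAccN = 0;

void storeTiered(DataPoint reading) {
    addToBuffer(rawTier, rawTierHead, rawTierCount, reading, RAW_TIER_SIZE);

    medTempAcc += reading.temperature; medHumAcc += reading.humidity; medAccN++;
    if (medAccN >= 10) {  // 10 x 1 s = 10-second average
        DataPoint m = { reading.timestamp, medTempAcc / medAccN, medHumAcc / medAccN };
        addToBuffer(mediumTier, medHead, medCount, m, MEDIUM_TIER_SIZE);
        medTempAcc = medHumAcc = 0; medAccN = 0;

        longTempAcc += m.temperature; longHumAcc += m.humidity; longAccN++;
        if (longAccN >= 6) {  // 6 x 10 s = 1-minute average
            DataPoint l = { m.timestamp, longTempAcc / longAccN, longHumAcc / longAccN };
            addToBuffer(longTier, longHead, longCount, l, LONG_TIER_SIZE);
            longTempAcc = longHumAcc = 0; longAccN = 0;
        }
    }
}
```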
Add automatic anomaly detection that prints a warning when temperature or humidity readings deviate more than 2 standard deviations from the running average.
Hint: Maintain running sum and sum-of-squares to calculate variance efficiently without storing all historical values.
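Here is one way to maintain a running mean and standard deviation from a sum and a sum of squares, as the hint suggests. The struct name, the 10-sample warm-up, and applying the 2-sigma rule only to temperature are illustrative choices.

```cpp
// Challenge 4 sketch: running mean/stddev anomaly check for one signal.
// Sum and sum-of-squares are enough; no history needs to be stored.
struct RunningStats {
    double sum = 0, sumSq = 0;
    long   n   = 0;

    void add(double x) { sum += x; sumSq += x * x; n++; }
    double mean() const { return n > 0 ? sum / n : 0.0; }
    double stddev() const {
        if (n < 2) return 0.0;
        double m   = mean();
        double var = (sumSq / n) - (m * m);  // population variance
        return var > 0 ? sqrt(var) : 0.0;
    }
};

RunningStats tempStats;

// Call after each reading; warns when the value deviates >2 standard deviations.
void checkTemperatureAnomaly(float temperature) {
    if (tempStats.n >= 10) {  // wait for a minimal baseline first
        double sd  = tempStats.stddev();
        double dev = fabs(temperature - tempStats.mean());
        if (sd > 0 && dev > 2.0 * sd) {
            Serial.printf("WARNING: temp %.2fC is %.1f sigma from mean %.2fC\n",
                          temperature, dev / sd, tempStats.mean());
        }
    }
    tempStats.add(temperature);
}
```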
1291.7.7 What You Learned
In this lab, you implemented core time-series database concepts on a microcontroller:
| Concept | Implementation |
|---|---|
| Timestamped data | Each reading includes a Unix timestamp |
| Circular buffer | Fixed-size buffer that overwrites oldest data |
| Downsampling | Averaging multiple readings to reduce storage |
| Query interface | Serial commands for data retrieval |
| Storage efficiency | Compression ratio tracking |
These same principles apply to production time-series databases like InfluxDB and TimescaleDB, which use similar techniques at much larger scale with additional optimizations like columnar compression and time-based partitioning.
Foundations:
- Big Data Overview - Data management fundamentals and IoT data challenges
- Data Storage and Databases - Storage concepts and database types
- Data in the Cloud - Cloud storage strategies and service models
Advanced Topics:
- Stream Processing - Real-time data pipelines with Kafka, Flink
- Edge Data Acquisition - Data collection patterns at the edge
- Interoperability - Data format standards and exchange protocols
Architecture:
- Edge Computing Patterns - Edge processing architectures
- Cloud Computing - Cloud infrastructure for IoT
1291.8 Summary
This chapter applied time-series concepts through practical examples and hands-on exercises.
Key Takeaways:
Edge processing is essential at scale: Tesla’s 98% data reduction comes from on-vehicle aggregation, not just cloud compression.
Adaptive strategies match data value: Sample faster during interesting events (hard braking), slower during routine operation (parked).
Standard tools handle most workloads: Start with InfluxDB or TimescaleDB; only build custom infrastructure when you exceed their limits.
Embedded systems can implement TSDB concepts: Circular buffers, downsampling, and time-based queries work on microcontrollers.
Time synchronization requires planning: Calculate drift rates, sync intervals, and design for connectivity gaps.
1291.9 What’s Next
In the next chapter on Stream Processing, we'll explore how to process IoT data in real time before it reaches the database. Learn how to use Apache Kafka, Apache Flink, and cloud streaming services to detect anomalies, trigger alerts, and perform complex event processing on sensor streams at millisecond latency, transforming raw time-series data into actionable insights.