7  Data Quality Monitoring

In 60 Seconds

Data quality monitoring for IoT databases tracks six critical dimensions: accuracy, completeness, consistency, timeliness, validity, and uniqueness. By scoring data quality at ingestion, creating continuous aggregate dashboards, and applying tiered handling strategies (reject, quarantine, or flag), you can ensure that only trustworthy data drives your analytics and decisions.

Key Concepts
  • Data Quality Score: A composite metric (0-100) calculated from completeness, accuracy, consistency, timeliness, and validity dimensions, attached to each ingested sensor reading to enable downstream filtering
  • Schema Validation: The process of checking incoming IoT records against an expected structure – verifying field presence, data types, and value ranges – before committing to storage
  • Completeness: A quality dimension measuring the percentage of expected sensor readings that were actually received; gaps indicate connectivity failures, device crashes, or power interruptions
  • Timeliness: A quality dimension measuring whether sensor data arrives within an acceptable latency window; late data may indicate network congestion, buffering failures, or clock drift
  • Quarantine Strategy: A data quality handling approach that writes low-quality records to a separate table for manual review rather than discarding or silently accepting them
  • Continuous Aggregate: A TimescaleDB materialized view that automatically refreshes pre-computed aggregations (hourly average, daily max) as new data arrives, enabling fast dashboard queries without full table scans
  • Reject-Quarantine-Flag Pattern: Three data quality handling strategies: reject (discard bad data immediately), quarantine (isolate for review), or flag (store with quality metadata for downstream filtering)
  • Quality Monitoring Dashboard: A real-time visualization tracking per-device quality scores, missing reading rates, and validation failure counts to detect sensor degradation before it impacts applications

Learning Objectives

After completing this chapter, you will be able to:

  • Define and measure the six dimensions of data quality for IoT
  • Implement data quality checks at ingestion time
  • Design quality monitoring dashboards and alerts
  • Classify low-quality data using reject, quarantine, and flag strategies
  • Evaluate data quality validation trade-offs against database performance

Data quality monitoring is like quality control on a factory line for IoT data. Before using sensor readings for important decisions, you need to check: Are the values accurate? Are any readings missing? Did they arrive on time? Just as a restaurant checks ingredient freshness before cooking, IoT systems need automatic checks to catch bad data before it leads to bad decisions.

7.1 Introduction

A single miscalibrated sensor can corrupt months of analytics, trigger false alarms, or cause an ML model to learn the wrong patterns. In large IoT deployments with thousands of devices, bad data is not an exception – it is an inevitability. Sensors drift, batteries die mid-transmission, firmware bugs corrupt payloads, and network retries create duplicates.

Data quality monitoring provides the systematic defenses against these failures. Rather than discovering corrupt data weeks later during analysis, quality monitoring catches problems at ingestion – before they propagate through dashboards, models, and automated decisions. This chapter covers the six dimensions of data quality, how to implement scoring and validation in time-series databases, and the strategies for handling data that fails quality checks.

7.2 The Six Dimensions of Data Quality

Data Quality Framework for IoT

IoT systems must monitor data quality across six critical dimensions:

| Dimension | Definition | IoT Example | Impact of Poor Quality |
|---|---|---|---|
| Accuracy | Data correctly represents reality | Sensor reports 25C when actual is 24C | Wrong decisions, false alerts |
| Completeness | All required data is present | Temperature readings but missing humidity | Incomplete analysis, ML failures |
| Consistency | Data is uniform across systems | Device A uses Celsius, Device B Fahrenheit | Aggregation errors, wrong comparisons |
| Timeliness | Data is up-to-date and available when needed | Sensor data arrives 10 minutes late | Missed real-time opportunities |
| Validity | Data conforms to defined formats/rules | GPS coordinates outside valid range | Database errors, query failures |
| Uniqueness | No duplicate records exist | Same reading stored twice | Inflated counts, wasted storage |
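These dimensions can be sketched as a single reading-level validator. The following Python sketch is illustrative only – the field names (`device_id`, `time`, `temperature`, `humidity`, `unit`) and the 5-minute freshness window are assumptions, and accuracy is omitted because it cannot be judged from a single reading (it requires ground truth or peer comparison, covered later in this chapter):

```python
from datetime import datetime, timedelta, timezone

def check_dimensions(reading, seen_keys, now=None):
    """Evaluate one reading against five of the six dimensions; accuracy
    needs ground truth or peer data and cannot be judged from one record."""
    now = now or datetime.now(timezone.utc)
    key = (reading.get("device_id"), reading.get("time"))
    results = {
        # Completeness: every expected field is present
        "completeness": all(k in reading
                            for k in ("device_id", "time", "temperature", "humidity")),
        # Consistency: unit tag matches the deployment-wide convention
        "consistency": reading.get("unit", "C") == "C",
        # Timeliness: arrived within a 5-minute window
        "timeliness": (now - reading.get("time", now)) <= timedelta(minutes=5),
        # Validity: value is numeric and physically plausible
        "validity": isinstance(reading.get("temperature"), (int, float))
                    and -50 <= reading["temperature"] <= 50,
        # Uniqueness: (device_id, time) not already ingested
        "uniqueness": key not in seen_keys,
    }
    seen_keys.add(key)
    return results
```

A production version would persist `seen_keys` (for example, rely on the database primary key) rather than hold it in memory.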

7.3 Implementing Data Quality Checks at Ingestion

Data quality should be validated as data enters the database. This involves defining a schema with quality metadata, writing a scoring function, and integrating scoring into the insert path.

7.3.1 Schema with Quality Metadata

-- TimescaleDB: Add quality score to sensor readings
CREATE TABLE sensor_readings (
  time TIMESTAMPTZ NOT NULL,
  device_id TEXT NOT NULL,
  temperature DOUBLE PRECISION,
  humidity DOUBLE PRECISION,
  -- Quality metadata
  quality_score INTEGER CHECK (quality_score BETWEEN 0 AND 100),
  quality_flags JSONB,
  PRIMARY KEY (device_id, time)
);

-- Create hypertable
SELECT create_hypertable('sensor_readings', 'time');

7.3.2 Quality Scoring Function

-- Function to calculate quality score
CREATE OR REPLACE FUNCTION calculate_quality_score(
  temp DOUBLE PRECISION,
  humidity DOUBLE PRECISION,
  reading_time TIMESTAMPTZ
) RETURNS INTEGER AS $$
DECLARE
  score INTEGER := 100;
  age_seconds DOUBLE PRECISION;
BEGIN
  -- Check temperature validity (-50 to +50C range)
  IF temp IS NULL OR temp < -50 OR temp > 50 THEN score := score - 40; END IF;

  -- Check humidity validity (0-100% range)
  IF humidity IS NULL OR humidity < 0 OR humidity > 100 THEN score := score - 40; END IF;

  -- Check data freshness (penalize if >5 minutes old)
  age_seconds := EXTRACT(EPOCH FROM (NOW() - reading_time));
  IF age_seconds > 300 THEN score := score - 20; END IF;

  RETURN GREATEST(score, 0);
END;
$$ LANGUAGE plpgsql;
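Gateways that want to pre-filter readings before the network round-trip can mirror the same deductions (40 for temperature, 40 for humidity, 20 for staleness) client-side. A minimal Python sketch, assuming the gateway clock is trustworthy:

```python
from datetime import datetime, timedelta, timezone

def quality_score(temp, humidity, reading_time, now=None):
    """Client-side mirror of calculate_quality_score: same penalties,
    same 5-minute freshness window."""
    now = now or datetime.now(timezone.utc)
    score = 100
    if temp is None or not (-50 <= temp <= 50):
        score -= 40  # temperature missing or outside plausible range
    if humidity is None or not (0 <= humidity <= 100):
        score -= 40  # humidity missing or outside 0-100%
    if (now - reading_time).total_seconds() > 300:
        score -= 20  # reading older than 5 minutes
    return max(score, 0)
```

Keeping the two implementations in sync is a maintenance cost; some teams generate both from a single declarative rule file instead.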

7.3.3 Inserting with Quality Scoring

-- Insert with automatic quality scoring
INSERT INTO sensor_readings (time, device_id, temperature, humidity, quality_score)
VALUES (
  NOW(),
  'sensor_001',
  23.5,
  55.0,
  calculate_quality_score(23.5, 55.0, NOW())
);

7.4 Quality Monitoring Queries

Continuous quality monitoring dashboards should track:

1. Quality score distribution – find devices with degraded readings:

SELECT device_id, AVG(quality_score) as avg_quality,
       COUNT(*) FILTER (WHERE quality_score < 70) as low_quality_count
FROM sensor_readings
WHERE time > NOW() - INTERVAL '24 hours'
GROUP BY device_id
HAVING AVG(quality_score) < 80  -- alert threshold
ORDER BY avg_quality ASC;

2. Missing data detection – gaps in the expected time-series:

SELECT device_id, time_bucket('1 hour', time) AS hour,
       COUNT(*) as readings_count,
       60 - COUNT(*) as missing_readings  -- expect 60/hour at 1/min
FROM sensor_readings
WHERE time > NOW() - INTERVAL '24 hours'
GROUP BY device_id, hour
HAVING COUNT(*) < 50
ORDER BY missing_readings DESC;

3. Near-duplicate detection – readings from the same device within 1 second (retry bugs or network duplication):

SELECT r1.device_id, r1.time AS time_1, r2.time AS time_2,
       r1.temperature AS temp_1, r2.temperature AS temp_2
FROM sensor_readings r1
JOIN sensor_readings r2
  ON r1.device_id = r2.device_id
  AND r2.time > r1.time
  AND r2.time <= r1.time + INTERVAL '1 second'
WHERE r1.time > NOW() - INTERVAL '24 hours';

4. Outlier detection – values beyond 3 standard deviations:

WITH stats AS (
  SELECT device_id, AVG(temperature) as mu, STDDEV(temperature) as sigma
  FROM sensor_readings WHERE time > NOW() - INTERVAL '7 days'
  GROUP BY device_id
)
SELECT r.time, r.device_id, r.temperature,
       ABS(r.temperature - s.mu) / NULLIF(s.sigma, 0) as z_score
FROM sensor_readings r JOIN stats s ON r.device_id = s.device_id
WHERE r.time > NOW() - INTERVAL '24 hours'
  AND ABS(r.temperature - s.mu) > 3 * s.sigma
ORDER BY z_score DESC;
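The same z-score logic can also run application-side over a window pulled from the database. A standard-library sketch using the sample standard deviation (matching PostgreSQL's STDDEV):

```python
from statistics import mean, stdev

def zscore_outliers(values, threshold=3.0):
    """Flag values more than `threshold` sample standard deviations
    from the mean; application-side mirror of the SQL query above."""
    mu = mean(values)
    sigma = stdev(values)  # sample stddev, like PostgreSQL's STDDEV
    if sigma == 0:
        return []  # constant series: nothing to flag
    return [(i, v, abs(v - mu) / sigma)
            for i, v in enumerate(values)
            if abs(v - mu) > threshold * sigma]
```

Like the SQL version, this assumes roughly normal data; heavily skewed sensors may need percentile-based bounds instead.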

7.5 Quality Monitoring Views and Alerts

Production Monitoring Best Practices

Alert Thresholds (tune based on your system):

  1. Quality Score < 70: Investigate device/sensor issues
  2. Missing Data > 10%: Network or device failure likely
  3. Duplicate Rate > 1%: Application or network retry issues
  4. Outlier Rate > 5%: Sensor calibration or environmental anomaly

Continuous aggregates for monitoring views:

-- Create materialized view for hourly quality metrics
CREATE MATERIALIZED VIEW hourly_quality_summary
WITH (timescaledb.continuous) AS
SELECT
  device_id,
  time_bucket('1 hour', time) AS hour,
  AVG(quality_score) as avg_quality,
  COUNT(*) as reading_count,
  COUNT(*) FILTER (WHERE quality_score < 70) as poor_quality_count,
  AVG(temperature) as avg_temperature,
  STDDEV(temperature) as stddev_temperature
FROM sensor_readings
GROUP BY device_id, hour;

-- Auto-refresh every 5 minutes
SELECT add_continuous_aggregate_policy('hourly_quality_summary',
  start_offset => INTERVAL '1 day',
  end_offset => INTERVAL '1 hour',
  schedule_interval => INTERVAL '5 minutes');

-- Query for quick dashboard updates
SELECT * FROM hourly_quality_summary
WHERE hour > NOW() - INTERVAL '24 hours'
  AND avg_quality < 80
ORDER BY avg_quality ASC;

Alert mechanism using PostgreSQL NOTIFY:

-- Trigger that fires an alert when quality drops below threshold
CREATE OR REPLACE FUNCTION notify_quality_alert()
RETURNS TRIGGER AS $$
BEGIN
  IF NEW.quality_score < 50 THEN
    PERFORM pg_notify(
      'quality_alert',
      json_build_object(
        'device_id', NEW.device_id,
        'score', NEW.quality_score,
        'time', NEW.time
      )::text
    );
  END IF;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER quality_alert_trigger
  AFTER INSERT ON sensor_readings
  FOR EACH ROW
  EXECUTE FUNCTION notify_quality_alert();

-- Application listens with: LISTEN quality_alert;
-- Integrate with Grafana alerts or PagerDuty webhooks for production use.

7.6 Data Quality vs. Database Performance Trade-offs

Quality validation has computational costs that must be weighed against ingestion throughput:

| Quality Check | Performance Impact | When to Apply |
|---|---|---|
| Range validation | Low (simple comparison) | Every insert (always) |
| Duplicate detection | Medium (index lookup) | Every insert (recommended) |
| Outlier detection | High (statistical aggregation) | Batch processing (hourly/daily) |
| Cross-sensor consistency | Very high (joins across datasets) | Offline validation (weekly) |

Optimization strategy:

  1. Ingestion time: Fast checks only (range, format, uniqueness)
  2. Real-time (1-5 min delay): Lightweight statistical checks (moving average validation)
  3. Batch (hourly/daily): Complex validations (cross-correlation, trend analysis, ML-based anomaly detection)

Quality-checked data also interacts with storage tiering – data quality scores can inform which tier data resides in:

Tradeoff: Hot/Warm/Cold Storage Tiers for IoT Data

Option A: Single-Tier (Hot Storage Only) – All data on fast SSD-backed database

  • Query latency: 1-10ms for any time range
  • Storage cost: $0.10-0.25/GB/month (AWS RDS gp3 SSD, varies by region)
  • 1TB for 1 year: $1,200-3,000/year
  • Operational simplicity: Single database to manage

Option B: Multi-Tier (Hot/Warm/Cold) – Data moves to cheaper storage as it ages

  • Hot tier (SSD, 0-7 days): $0.10-0.25/GB/month, 1-10ms latency
  • Warm tier (HDD/compressed, 7-90 days): $0.02-0.05/GB/month, 50-200ms latency
  • Cold tier (S3/Glacier, 90+ days): $0.004-0.01/GB/month, 100ms-12hr latency
  • 1TB for 1 year: $300-800/year (75-80% cost reduction)

Decision Factors:

  • Choose Single-Tier when: Dataset under 100GB, uniform query patterns across all time ranges, operational simplicity is paramount
  • Choose Multi-Tier when: Dataset exceeds 500GB, 90%+ queries target recent data (last 7 days), storage cost is a concern, compliance requires multi-year retention
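The cost figures above can be reproduced with a rough blended-rate model. The tier dwell times (7 days hot, 83 warm, 275 cold) and prices are the illustrative Option B values, not measured rates, and the model ignores replication and request costs, which is why it lands near the low end of the quoted range:

```python
GB_PER_TB = 1024

def yearly_cost_per_tb(price_per_gb_month):
    """Yearly USD cost to keep 1 TB at a flat $/GB/month rate."""
    return GB_PER_TB * price_per_gb_month * 12

def blended_yearly_cost_per_tb(tiers):
    """tiers: list of (days_in_tier, price_per_gb_month). Assumes each GB
    spends the given number of days in each tier over a 365-day year."""
    rate = sum(days / 365 * price for days, price in tiers)
    return GB_PER_TB * rate * 12

single = yearly_cost_per_tb(0.25)   # hot SSD at the high end: ~$3,072/year
multi = blended_yearly_cost_per_tb([(7, 0.25), (83, 0.05), (275, 0.01)])
```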


7.7 Handling Low-Quality Data

When quality issues are detected, choose the appropriate strategy: reject, quarantine, or flag.

Strategy 1: Reject

When to use: Critical applications where bad data is worse than no data

-- Reject invalid readings at ingestion
CREATE OR REPLACE FUNCTION reject_invalid_data()
RETURNS TRIGGER AS $$
BEGIN
  -- Calculate quality score
  NEW.quality_score := calculate_quality_score(
    NEW.temperature,
    NEW.humidity,
    NEW.time
  );

  -- Reject if quality too low
  IF NEW.quality_score < 50 THEN
    RAISE EXCEPTION 'Data quality too low: %', NEW.quality_score;
  END IF;

  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER check_data_quality
  BEFORE INSERT ON sensor_readings
  FOR EACH ROW
  EXECUTE FUNCTION reject_invalid_data();

Strategy 2: Quarantine

When to use: Need to investigate issues while keeping the pipeline flowing

-- Create quarantine table
CREATE TABLE sensor_readings_quarantine (
  LIKE sensor_readings INCLUDING ALL,
  quarantine_reason TEXT,
  quarantine_time TIMESTAMPTZ DEFAULT NOW()
);

-- Modified trigger: quarantine instead of reject
CREATE OR REPLACE FUNCTION quarantine_invalid_data()
RETURNS TRIGGER AS $$
BEGIN
  NEW.quality_score := calculate_quality_score(
    NEW.temperature, NEW.humidity, NEW.time
  );

  IF NEW.quality_score < 50 THEN
    -- Move to quarantine instead of rejecting
    INSERT INTO sensor_readings_quarantine
    SELECT NEW.*, 'Low quality score: ' || NEW.quality_score, NOW();
    RETURN NULL;  -- Skip insert into main table
  END IF;

  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

Strategy 3: Flag

When to use: Allow analysis while marking suspect data

-- Store all data but flag quality issues
CREATE OR REPLACE FUNCTION flag_data_quality()
RETURNS TRIGGER AS $$
BEGIN
  NEW.quality_score := calculate_quality_score(
    NEW.temperature, NEW.humidity, NEW.time
  );

  -- Build quality flags JSON
  NEW.quality_flags := jsonb_build_object(
    'temp_valid', NEW.temperature BETWEEN -50 AND 50,
    'humidity_valid', NEW.humidity BETWEEN 0 AND 100,
    'timestamp_fresh', EXTRACT(EPOCH FROM (NOW() - NEW.time)) < 300,
    'overall_quality', CASE
      WHEN NEW.quality_score >= 80 THEN 'good'
      WHEN NEW.quality_score >= 50 THEN 'acceptable'
      ELSE 'poor'
    END
  );

  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- Analytics can filter based on flags
SELECT * FROM sensor_readings
WHERE quality_flags->>'overall_quality' = 'good'
  AND time > NOW() - INTERVAL '1 day';

7.8 Data Quality Dashboards

Visualizing quality metrics helps identify patterns:

Key metrics to track:

  1. Quality trend over time (line chart)
    • X-axis: Time (hourly/daily)
    • Y-axis: Average quality score
    • Multiple lines: One per device or device group
  2. Device quality heatmap (2D grid)
    • X-axis: Devices
    • Y-axis: Time periods
    • Color: Quality score (green=good, yellow=acceptable, red=poor)
  3. Data completeness gauge (percentage)
    • Expected readings vs actual readings
    • Target: >95% completeness
  4. Outlier rate (stacked area chart)
    • Show percentage of readings that are outliers over time
    • Helps identify sensor drift or calibration issues

7.9 Common Pitfalls

Beyond monitoring data values, two common design mistakes cause systemic data quality failures at the protocol and schema level:

Common Pitfall: Payload Size Too Large for Protocol/Network

The mistake: Sending full JSON payloads over constrained networks like LoRaWAN or Sigfox, exceeding protocol limits and causing transmission failures.

Symptoms:

  • Message fragmentation across multiple transmissions
  • Transmission failures and dropped messages
  • High battery consumption from repeated send attempts
  • Gateway buffer overflows

Why it happens: Developers design payloads on desktop environments with ample bandwidth, then deploy to constrained networks without considering size limits. LoRaWAN allows ~50-250 bytes depending on spreading factor; Sigfox allows only 12 bytes.

The fix:

# Use binary encoding for LPWAN
import struct
# Pack data into 10 bytes instead of 150+ byte JSON
payload = struct.pack(
    ">HhHHH",  # Format: 2+2+2+2+2 = 10 bytes
    device_id_short,           # 2 bytes (unsigned)
    int(temp * 10),            # 2 bytes (signed), -3276.8 to 3276.7 C
    int(humidity * 10),        # 2 bytes, 0-100% with 0.1% precision
    int((pressure - 900) * 10),  # 2 bytes, 900-7453.5 hPa range
    int(battery * 100),        # 2 bytes, 0-655.35V
)

Prevention: Know your protocol’s payload limits before designing data schemas. Use CBOR instead of JSON for constrained networks.
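A matching decoder makes the encoding round-trippable and testable; the field scaling below simply inverts the encoder above:

```python
import struct

def decode_payload(payload: bytes) -> dict:
    """Unpack the 10-byte big-endian payload built by the encoder above,
    undoing each field's scaling and offset."""
    device_id, temp_x10, hum_x10, press_off_x10, batt_x100 = struct.unpack(
        ">HhHHH", payload)
    return {
        "device_id": device_id,
        "temperature": temp_x10 / 10,          # signed, 0.1 C resolution
        "humidity": hum_x10 / 10,              # 0.1% resolution
        "pressure": press_off_x10 / 10 + 900,  # 0.1 hPa, 900 hPa offset
        "battery": batt_x100 / 100,            # 0.01 V resolution
    }
```

Unit-testing the encode/decode pair together catches scaling mismatches before they reach the field.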

Common Pitfall: Missing Schema Versioning

The mistake: Changing data format without versioning, breaking all existing consumers when devices send incompatible payloads.

Symptoms:

  • Parsing errors after firmware updates
  • Old and new devices sending incompatible data
  • Breaking changes cascade through the system

Why it happens: Teams assume all devices will update simultaneously. In IoT, devices may run old firmware for years.

The fix:

# Include version in payload
payload = {
    "version": 2,
    "temperature": 22.5,
    "unit": "C"
}

# Consumer handles multiple versions
def parse(msg):
    version = msg.get("version", 1)
    if version == 1:
        return {"temperature": msg["temp"]}  # Old format
    else:
        return msg  # New format

Prevention: Include version field in all payloads from day one. Use schema registries (Apache Avro, Protobuf) for complex systems.

7.10 Common Misconception

Database choice directly affects what quality monitoring capabilities are available. A common myth pushes teams toward databases that lack the SQL query power needed for effective quality analysis:

Common Misconception: “NoSQL is Always Better for IoT Scale”

The Myth: “IoT generates massive data volumes, so we must use NoSQL databases like MongoDB or Cassandra. SQL databases can’t handle IoT scale.”

The Reality: SQL databases with time-series extensions (TimescaleDB) handle hundreds of thousands of writes per second per node, scale to petabytes with compression, and offer superior query capabilities for time-series workloads compared to general-purpose NoSQL databases.

Real-World Data:

| Database | Write Throughput | Storage Capacity | Query Performance (time-range) |
|---|---|---|---|
| TimescaleDB | 100K-400K rows/sec (single node) | Petabytes with compression | 10-100x faster than plain PostgreSQL |
| InfluxDB | 500K-1M points/sec | Hundreds of TB | Optimized for time queries |
| MongoDB | 100K-500K docs/sec | Petabytes with sharding | Slower for time-range queries |
| Cassandra | 1M+ writes/sec (distributed) | Petabytes | Poor for time-range aggregations |

When NoSQL Wins:

  • Flexible schemas that evolve rapidly (device metadata varying by type)
  • Distributed availability over consistency (global edge deployments)
  • Massive write distribution across geographic regions (Cassandra)

When SQL+TimescaleDB Wins:

  • Time-series sensor data with regular intervals (90% of IoT use cases)
  • Complex aggregations (hourly/daily summaries, percentiles)
  • SQL compatibility (existing tools, ORMs, data analysts)
  • Cost optimization (compression, continuous aggregates)

The Right Approach: Use polyglot persistence – TimescaleDB for sensor data, PostgreSQL for metadata, Redis for caching, MongoDB for flexible configs.

Data quality monitoring is like having a quality inspector checking every item on a factory assembly line!

7.10.1 The Sensor Squad Adventure: The Quality Patrol

The Sensor Squad had just set up 100 sensors around the school to monitor temperature, noise, and air quality. But something was going wrong!

“The cafeteria sensor says it’s -200 degrees!” shouted Lila the LED. “And the gym sensor says humidity is 500 percent!”

“That’s impossible!” said Sammy the Sensor. “We need a Quality Patrol!”

Max the Microcontroller created a checklist for every reading that came in:

Check 1 - Is it POSSIBLE? (Accuracy) “Can temperature be -200C? NOPE! That’s colder than outer space! RED FLAG!”

Check 2 - Did it ARRIVE? (Completeness) “The library sensor hasn’t sent anything in 2 hours. Missing data alert!”

Check 3 - Does it MATCH? (Consistency) “Sensor A says Celsius, Sensor B says Fahrenheit. That will confuse everything!”

Check 4 - Is it FRESH? (Timeliness) “This reading is from 3 hours ago. That’s too old to be useful right now!”

Check 5 - Is it the RIGHT FORMAT? (Validity) “This sensor sent ‘banana’ instead of a number. That’s not right!”

Check 6 - Is it UNIQUE? (Uniqueness) “We got the same reading twice! Only count it once!”

Bella the Battery set up a scoreboard: “Every reading gets a score from 0 to 100. Below 70? We investigate. Below 50? We quarantine it!”

Now the school’s sensor data was CLEAN and TRUSTWORTHY, and the principal could see exactly which sensors needed maintenance.

7.10.2 Key Words for Kids

| Word | What It Means |
|---|---|
| Quality Score | A grade for how trustworthy data is – like a food freshness rating |
| Quarantine | Setting aside suspicious data for investigation – like putting a bruised apple aside to check later |
| Dashboard | A screen showing all the important numbers at a glance – like a car dashboard |

Scenario: An industrial IoT system ingests 50,000 sensor readings/minute. Implement real-time quality scoring to filter bad data before storage.

Advanced Quality Scoring Function (extends the basic version from above with pressure validation and rate-of-change detection):

CREATE OR REPLACE FUNCTION calculate_quality_score_advanced(
    temp DOUBLE PRECISION,
    humidity DOUBLE PRECISION,
    pressure DOUBLE PRECISION,
    p_device_id TEXT,
    reading_time TIMESTAMPTZ,
    last_reading TIMESTAMPTZ
) RETURNS TABLE(score INTEGER, flags JSONB) AS $$
DECLARE
    s INTEGER := 100;
    f JSONB := '{}'::JSONB;
    age_seconds DOUBLE PRECISION;
    rate_of_change DOUBLE PRECISION;
BEGIN
    -- Check 1: Temperature range (-50 to 50C)
    IF temp < -50 OR temp > 50 THEN
        s := s - 30;
        f := f || '{"temp_range": false}'::JSONB;
    ELSE
        f := f || '{"temp_range": true}'::JSONB;
    END IF;

    -- Check 2: Humidity range (0-100%)
    IF humidity < 0 OR humidity > 100 THEN
        s := s - 30;
        f := f || '{"humidity_range": false}'::JSONB;
    ELSE
        f := f || '{"humidity_range": true}'::JSONB;
    END IF;

    -- Check 3: Pressure range (300-1100 hPa typical for Earth)
    IF pressure < 300 OR pressure > 1100 THEN
        s := s - 20;
        f := f || '{"pressure_range": false}'::JSONB;
    ELSE
        f := f || '{"pressure_range": true}'::JSONB;
    END IF;

    -- Check 4: Data freshness (<5 minutes)
    age_seconds := EXTRACT(EPOCH FROM (NOW() - reading_time));
    IF age_seconds > 300 THEN
        s := s - 20;
        f := f || '{"freshness": false}'::JSONB;
    ELSE
        f := f || '{"freshness": true}'::JSONB;
    END IF;

    -- Check 5: Rate of change (temp shouldn't jump >10C in 1 minute)
    IF last_reading IS NOT NULL AND reading_time > last_reading THEN
        rate_of_change := ABS(temp - (
            SELECT prev.temperature FROM sensor_readings prev
            WHERE prev.device_id = p_device_id
              AND prev.time = last_reading LIMIT 1
        )) / EXTRACT(EPOCH FROM (reading_time - last_reading));

        IF rate_of_change > 0.167 THEN  -- 10C/60sec = 0.167C/sec
            s := s - 20;
            f := f || '{"rate_of_change": false}'::JSONB;
        ELSE
            f := f || '{"rate_of_change": true}'::JSONB;
        END IF;
    END IF;

    RETURN QUERY SELECT GREATEST(s, 0), f;
END;
$$ LANGUAGE plpgsql;

Example Usage:

-- Insert with automatic quality scoring
WITH quality AS (
    SELECT * FROM calculate_quality_score_advanced(
        23.5,           -- temperature
        55.0,           -- humidity
        1013.2,         -- pressure
        'sensor_001',   -- device_id
        NOW(),          -- current timestamp
        NOW() - INTERVAL '1 minute'  -- last reading timestamp
    )
)
INSERT INTO sensor_readings (time, device_id, temperature, humidity, quality_score, quality_flags)
SELECT NOW(), 'sensor_001', 23.5, 55.0, score, flags FROM quality;

Performance Analysis:

| Validation Type | Execution Time | Notes |
|---|---|---|
| Range checks (temp, humidity, pressure) | ~0.01 ms | Simple comparisons, negligible |
| Timestamp freshness | ~0.02 ms | Single function call, negligible |
| Rate-of-change (1 indexed lookup) | ~0.5 ms | Queries previous reading by PK |
| Total per-row overhead | ~0.53 ms | See note below |

Throughput impact note: The 0.53 ms per-row overhead does not directly reduce throughput from a 20K rows/sec single-node baseline to ~1.7K rows/sec. In practice, the ingestion pipeline uses batched inserts and connection pooling – multiple rows are validated concurrently across database connections. With 10 concurrent connections, the system sustains ~18,800 rows/sec (a ~6% reduction from the 20K baseline), not the ~91% reduction that naive per-row serial calculation would suggest. Always benchmark with your actual pipeline architecture.
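The note's arithmetic can be reproduced with a toy model. The 20K rows/sec baseline, 0.53 ms overhead, and 10 connections are the assumed figures from the table and note; real pipelines need benchmarking:

```python
def serial_rows_per_sec(baseline_rows_per_sec, overhead_ms):
    """Naive model: validation runs inline, serially, on one connection."""
    return 1 / (1 / baseline_rows_per_sec + overhead_ms / 1000)

def concurrent_rows_per_sec(baseline_rows_per_sec, overhead_ms, connections):
    """Validation spread across connections, capped by baseline insert capacity."""
    return min(baseline_rows_per_sec, connections / (overhead_ms / 1000))

serial = serial_rows_per_sec(20_000, 0.53)            # ~1.7K rows/sec
parallel = concurrent_rows_per_sec(20_000, 0.53, 10)  # ~18.9K rows/sec
```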

Quality Score Distribution (after 1 week):

Score 100 (perfect): 85% of readings
Score 80-99 (good): 10%
Score 60-79 (acceptable): 3%
Score 40-59 (poor): 1.5%
Score 0-39 (reject): 0.5%

Decision Rules:

  • Score >= 80: Store normally in hot tier
  • Score 70-79: Store with monitoring flag, review if sustained over 1 hour
  • Score 50-69: Store with flag, alert if sustained
  • Score < 50: Quarantine table, investigate sensor

Outcome: Rejected 0.5% of bad data (~360,000 readings/day at 50K/min ingestion rate) that would have corrupted analytics and triggered false alarms.

Key Insight: Spending 0.5 ms per reading on quality checks saves hours of debugging bad data issues downstream.

Configure your quality scoring thresholds based on data criticality and cost of false positives vs false negatives:

| Data Criticality | Reject Threshold | Quarantine Threshold | Flag Threshold | Reasoning |
|---|---|---|---|---|
| Safety-critical (gas sensors) | <50 | 50-70 | 70-90 | Cannot tolerate bad data; err on side of caution |
| Operational (temperature) | <30 | 30-50 | 50-70 | Some tolerance for transient issues |
| Best-effort (traffic counts) | <20 | 20-40 | 40-60 | Data loss acceptable; avoid over-filtering |
| Compliance (billing meters) | <60 | 60-80 | 80-95 | Regulatory requirements demand high quality |

Cost-Benefit Analysis:

Example: Industrial pressure sensor monitoring (1M readings/day, automated triage with manual review of flagged batches)

| Threshold | False Positive Rate | False Negative Rate | Investigation Cost | Missed-Alert Risk Cost | Total Cost |
|---|---|---|---|---|---|
| Reject <30 | 0.1% (1K/day) | 2% (20K/day missed) | $100/month | $50K/month | $50,100 |
| Reject <50 | 0.5% (5K/day) | 0.5% (5K/day missed) | $500/month | $12.5K/month | $13,000 |
| Reject <70 | 2% (20K/day) | 0.1% (1K/day missed) | $2K/month | $2.5K/month | $4,500 |
| Reject <90 | 10% (100K/day) | 0.01% (100/day missed) | $10K/month | $250/month | $10,250 |

Investigation costs assume automated batch triage (grouping false positives by device/time window) with manual review of flagged batches, not individual per-reading human investigation.

Optimal: Reject <70 minimizes total cost ($4,500/month) by balancing false positive investigation cost vs missed alert risk cost.
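The table's totals follow from two unit costs implied by its rows – roughly $100/month of batch triage per 1K daily false positives and $2,500/month of risk per 1K daily missed alerts. These unit costs are illustrative assumptions, not industry constants; a sketch reproducing the totals:

```python
def total_monthly_cost(fp_rate, fn_rate, daily_readings,
                       triage_per_1k=100, risk_per_1k=2500):
    """Monthly cost of a reject threshold: batch-triage cost for false
    positives plus risk cost for false negatives (unit costs assumed)."""
    fp_per_day = fp_rate * daily_readings
    fn_per_day = fn_rate * daily_readings
    return fp_per_day / 1000 * triage_per_1k + fn_per_day / 1000 * risk_per_1k

# 1M readings/day; candidate thresholds as (label, FP rate, FN rate)
thresholds = [("<30", 0.001, 0.02), ("<50", 0.005, 0.005),
              ("<70", 0.02, 0.001), ("<90", 0.10, 0.0001)]
best = min(thresholds, key=lambda t: total_monthly_cost(t[1], t[2], 1_000_000))
```

Re-running the minimization with your own unit costs is the whole point: the optimal threshold moves as the cost of a missed alert changes.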

How to calibrate:

  1. Baseline normal operation (1 week):
    • Measure quality score distribution
    • Identify P95, P99 percentiles
    • Set reject threshold at P1 (99% of normal data passes)
  2. Inject known bad data:
    • Out-of-range values
    • Stale timestamps
    • Duplicate records
    • Measure detection rate (true positive rate)
  3. Tune thresholds iteratively:
    • Too many false positives? Lower reject threshold
    • Missing too many issues? Raise reject threshold
    • Validate with ROC curve analysis
  4. Separate thresholds by sensor type:
    • High-precision sensors: Stricter thresholds
    • Noisy sensors: Looser thresholds
    • Critical sensors: Err on side of rejection

Key Insight: One-size-fits-all thresholds waste effort investigating false positives OR miss real issues. Calibrate per deployment.

Common Mistake: Silent Data Quality Degradation

The Error: A temperature sensor’s calibration drifts, reporting 2C higher than actual over 6 months. No quality monitoring detects the gradual drift because it’s within range checks.

What Goes Wrong:

Timeline:

Month 0: Sensor accurate (reports 20C, actual 20C)
Month 1: +0.3C drift (reports 20.3C, actual 20C) -- Still within +/-5C tolerance
Month 2: +0.7C drift (reports 20.7C)
Month 3: +1.2C drift (reports 21.2C)
Month 6: +2.0C drift (reports 22.0C) -- Passes range checks but WRONG

Impact:

  • HVAC system overcools by 2C -> 15% energy waste ($1,200/year)
  • Historical trends show false “warming” pattern
  • Anomaly detection trains on drifted data -> broken model
  • Only discovered when technician manually checks calibration

The Fix: Cross-validation with peer sensors:

-- Detect drift by comparing with spatial neighbors
WITH sensor_stats AS (
    SELECT
        sensor_id,
        AVG(temperature) as avg_temp,
        STDDEV(temperature) as stddev_temp
    FROM sensor_readings
    WHERE time > NOW() - INTERVAL '7 days'
    GROUP BY sensor_id
),
peer_comparison AS (
    SELECT
        s1.sensor_id,
        s1.avg_temp,
        AVG(s2.avg_temp) as peer_avg,
        ABS(s1.avg_temp - AVG(s2.avg_temp)) as deviation
    FROM sensor_stats s1
    JOIN sensors loc1 ON s1.sensor_id = loc1.sensor_id
    JOIN sensors loc2 ON ST_Distance(loc1.location, loc2.location) < 50  -- 50m radius
    JOIN sensor_stats s2 ON loc2.sensor_id = s2.sensor_id
    WHERE s1.sensor_id != s2.sensor_id
    GROUP BY s1.sensor_id, s1.avg_temp
)
SELECT sensor_id, avg_temp, peer_avg, deviation
FROM peer_comparison
WHERE deviation > 3  -- 3C deviation from peers = likely drift
ORDER BY deviation DESC;

Early Detection Strategies:

  1. Peer validation: Compare with nearby sensors (same zone)
  2. Temporal consistency: Temperature shouldn’t change by 10C in 1 minute
  3. Physical constraints: Indoor temp 18-26C, outdoor -40 to +50C
  4. Statistical outliers: Z-score > 3 from historical baseline
  5. Rate-of-change limits: Max 2C/hour for building temp

Calibration Alert System:

-- Alert when sensor deviates from peers for 7+ days
-- (daily_peer_comparison: assumed daily rollup of the peer_comparison query above)
SELECT sensor_id, 'Calibration drift suspected' as alert
FROM (
    SELECT sensor_id, COUNT(*) as drift_days
    FROM daily_peer_comparison
    WHERE deviation > 2
      AND time > NOW() - INTERVAL '30 days'
    GROUP BY sensor_id
) sub
WHERE drift_days >= 7;

Result: Detected drift after 2 weeks instead of 6 months, preventing $1,000 in wasted energy.

Key Insight: Range checks catch broken sensors, but peer validation catches drifting sensors.

7.11 Try It Yourself: Design a Quality Scoring System

Scenario: You’re building a quality monitoring system for a smart agriculture deployment with 500 soil sensors. Each sensor reports every 5 minutes:

  • Soil moisture (0-100% volumetric water content)
  • Soil temperature (-10°C to 60°C, depending on climate)
  • Electrical conductivity (0-20 dS/m for salinity)
  • Battery voltage (2.0V - 4.2V for lithium cell)

Your tasks:

  1. Define quality checks for each sensor field:
    • What range is valid for soil moisture? (Hint: sand rarely exceeds 45%, clay can reach 60%)
    • What rate-of-change is suspicious for soil temperature? (Hint: soil changes slowly — 1-2°C/hour max)
    • At what battery voltage should you flag readings as unreliable? (Hint: below 2.8V, ADC readings become inaccurate)
  2. Assign weights to each dimension:
    • Which matters more for irrigation decisions: accuracy or timeliness?
    • Should a stale reading (20 minutes old) be rejected or flagged?
  3. Choose handling strategies by use case:
    • Irrigation control (automated): Should you reject or quarantine low-quality data?
    • Seasonal trend analysis (offline): Can you accept flagged data?
    • Frost alert system (safety): What’s your minimum quality threshold?

Bonus: Write the SQL calculate_quality_score function for soil sensors, adapting the pattern from this chapter. Consider: should a low battery voltage reduce the quality score for ALL fields, since the ADC may be unreliable?

Suggested Approach

Quality Check Definitions

  • Soil moisture: Valid range 0-70% (reject >100%, flag >70% as possibly waterlogged)
  • Soil temperature: Valid range -10°C to 60°C, rate-of-change max 2°C/hour
  • Electrical conductivity: Valid range 0-20 dS/m, flag >8 dS/m as high salinity
  • Battery: Flag below 3.0V (accuracy degrades), reject below 2.5V (readings unreliable)

Weight Assignment

For irrigation control: accuracy (0.4) > freshness (0.3) > validity (0.15) > rate-of-change (0.1) > peer consistency (0.05)

Handling Strategies

  • Irrigation control: Reject below 50 (bad data → wrong watering → crop damage)
  • Seasonal analysis: Flag below 70 (include but mark — data analysts can filter)
  • Frost alerts: Reject below 70 with fallback to nearest peer sensor reading

7.12 Summary

  • Six dimensions of data quality (accuracy, completeness, consistency, timeliness, validity, uniqueness) must be monitored
  • Quality scoring at ingestion enables filtering and alerting on suspect data
  • Continuous aggregates provide real-time quality dashboards without performance impact
  • Three handling strategies: Reject (critical apps), Quarantine (investigation), Flag (preserve for analysis)
  • Performance trade-offs require tiered validation: fast checks at ingestion, complex checks in batch
  • Schema versioning from day one prevents breaking changes during firmware updates

7.13 Concept Relationships

Prerequisites - Read these first:

  • Time-Series Databases – Foundation for understanding data quality metrics in TSDBs
  • Database Selection Framework – Choosing databases that support quality validation

Related Concepts - Explore these next:

  • Sharding Strategies – Quality checks impact performance in distributed systems
  • Stream Processing Fundamentals – Real-time quality validation before storage

Practical Applications:

Data quality scoring quantifies trustworthiness for automated filtering decisions. Weighted quality score:

Quality Score Formula: \(Q = \frac{\sum_{i=1}^{n} w_i \cdot q_i}{\sum_{i=1}^{n} w_i}\) where \(w_i\) = weight, \(q_i \in [0,1]\) = dimension score. When weights sum to 1, this simplifies to \(Q = \sum w_i \cdot q_i\).

Worked example (smart building temperature sensor with 5 quality dimensions):

  • Accuracy (range check): \(q_1 = 1.0\) (22.5C in [-10, 50] range), \(w_1 = 0.4\)
  • Freshness: \(q_2 = 0.8\) (4 min old, 5 min threshold), \(w_2 = 0.25\)
  • Validity: \(q_3 = 1.0\) (valid JSON format), \(w_3 = 0.15\)
  • Rate-of-change: \(q_4 = 0.6\) (3C jump in 1 min, suspicious), \(w_4 = 0.15\)
  • Peer consistency: \(q_5 = 0.9\) (0.5C from nearby sensors), \(w_5 = 0.05\)

Total: \(Q = 0.4(1.0) + 0.25(0.8) + 0.15(1.0) + 0.15(0.6) + 0.05(0.9) = 0.4 + 0.2 + 0.15 + 0.09 + 0.045 = 0.885\) (88.5% quality score).

Action thresholds:

  • Above 80%: Accept – store normally
  • 70-80%: Accept with monitoring – store with flag, review if pattern persists
  • 50-70%: Investigate – flag and alert operations team
  • Below 50%: Quarantine – divert to quarantine table, investigate sensor
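Putting the formula, worked example, and action thresholds together, a small sketch reproduces the 88.5% result (the weights and dimension scores are the example's values above):

```python
def weighted_quality(dimensions):
    """Q = sum(w_i * q_i) / sum(w_i), dimensions as (weight, score) pairs
    with score in [0, 1]."""
    total_w = sum(w for w, _ in dimensions)
    return sum(w * q for w, q in dimensions) / total_w

def action_for(q_pct):
    """Map a 0-100 score to the action thresholds above."""
    if q_pct > 80:
        return "accept"
    if q_pct >= 70:
        return "accept with monitoring"
    if q_pct >= 50:
        return "investigate"
    return "quarantine"

# accuracy, freshness, validity, rate-of-change, peer consistency
dims = [(0.40, 1.0), (0.25, 0.8), (0.15, 1.0), (0.15, 0.6), (0.05, 0.9)]
q = weighted_quality(dims) * 100  # 88.5
```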

7.13.1 Interactive: Weighted Quality Score Calculator

Adjust individual dimension scores and weights to see how they affect the overall quality score and the resulting action.

7.13.2 Interactive: Data Completeness Calculator

Estimate how many readings your system should expect and how many might be missing, based on device count, sampling interval, and observed completeness rate.
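The completeness arithmetic behind such a calculator is straightforward. This sketch uses the 7.11 soil-sensor scenario's numbers (500 devices, 5-minute interval) with an assumed 97% observed completeness:

```python
def expected_readings(devices, interval_seconds, window_hours):
    """Readings the fleet should produce if every device reports on schedule."""
    return devices * int(window_hours * 3600 / interval_seconds)

def missing_readings(expected, completeness_rate):
    """Estimated gap count for an observed completeness rate in [0, 1]."""
    return round(expected * (1 - completeness_rate))

exp = expected_readings(500, 300, 24)  # 144,000 expected per day
miss = missing_readings(exp, 0.97)     # 4,320 missing at 97% completeness
```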

7.14 What’s Next

| If you want to… | Read this next |
|---|---|
| Scale quality-monitored databases across multiple nodes | Sharding Strategies |
| Choose the right database for your quality requirements | Database Selection Framework |
| See quality monitoring applied to real deployments | Worked Examples |
| Set up retention policies based on quality scores | Data Retention |
| Process quality checks in real-time streams before storage | Stream Processing Fundamentals |