1276  Data Quality Monitoring

Learning Objectives

After completing this chapter, you will be able to:

  • Define and measure the six dimensions of data quality for IoT
  • Implement data quality checks at ingestion time
  • Design quality monitoring dashboards and alerts
  • Handle low-quality data with appropriate strategies
  • Balance data quality validation with database performance

1276.1 Introduction

Storing IoT data is only valuable if the data itself can be trusted. This chapter covers essential data quality metrics and monitoring strategies for IoT database systems.

1276.2 The Six Dimensions of Data Quality

Note: Data Quality Framework for IoT

IoT systems must monitor data quality across six critical dimensions:

Dimension | Definition | IoT Example | Impact of Poor Quality
Accuracy | Data correctly represents reality | Sensor reports 25°C when actual is 24°C | Wrong decisions, false alerts
Completeness | All required data is present | Temperature readings but missing humidity | Incomplete analysis, ML failures
Consistency | Data is uniform across systems | Device A uses Celsius, Device B Fahrenheit | Aggregation errors, wrong comparisons
Timeliness | Data is up-to-date and available when needed | Sensor data arrives 10 minutes late | Missed real-time opportunities
Validity | Data conforms to defined formats/rules | GPS coordinates outside valid range | Database errors, query failures
Uniqueness | No duplicate records exist | Same reading stored twice | Inflated counts, wasted storage
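
For instance, the consistency dimension often surfaces as unit mismatches across a device fleet. Below is a minimal sketch of normalizing units at read time, assuming the sensor_readings table defined in the next section and a hypothetical device_metadata table that records each device's reporting unit:

-- Illustrative metadata table mapping each device to its reporting unit
CREATE TABLE device_metadata (
  device_id TEXT PRIMARY KEY,
  temperature_unit TEXT CHECK (temperature_unit IN ('C', 'F'))
);

-- Normalize everything to Celsius before aggregating across devices
SELECT
  r.device_id,
  r.time,
  CASE m.temperature_unit
    WHEN 'F' THEN (r.temperature - 32) * 5.0 / 9.0
    ELSE r.temperature
  END AS temperature_celsius
FROM sensor_readings r
JOIN device_metadata m ON r.device_id = m.device_id;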

1276.3 Implementing Data Quality Checks at Ingestion

Data quality should be validated as data enters the database:

1276.3.1 Real-time Quality Scoring

-- TimescaleDB: Add quality score to sensor readings
CREATE TABLE sensor_readings (
  time TIMESTAMPTZ NOT NULL,
  device_id TEXT NOT NULL,
  temperature DOUBLE PRECISION,
  humidity DOUBLE PRECISION,
  -- Quality metadata
  quality_score INTEGER CHECK (quality_score BETWEEN 0 AND 100),
  quality_flags JSONB,
  PRIMARY KEY (device_id, time)
);

-- Create hypertable
SELECT create_hypertable('sensor_readings', 'time');

-- Function to calculate quality score
CREATE OR REPLACE FUNCTION calculate_quality_score(
  temp DOUBLE PRECISION,
  humidity DOUBLE PRECISION,
  reading_time TIMESTAMPTZ  -- avoid 'timestamp', which is a SQL type name
) RETURNS INTEGER AS $$
DECLARE
  score INTEGER := 100;
  age_seconds INTEGER;
BEGIN
  -- Check temperature validity (+/-50C range)
  IF temp < -50 OR temp > 50 THEN score := score - 40; END IF;

  -- Check humidity validity (0-100% range)
  IF humidity < 0 OR humidity > 100 THEN score := score - 40; END IF;

  -- Check data freshness (penalize if >5 minutes old)
  age_seconds := EXTRACT(EPOCH FROM (NOW() - reading_time));
  IF age_seconds > 300 THEN score := score - 20; END IF;

  RETURN GREATEST(score, 0);
END;
$$ LANGUAGE plpgsql;

-- Insert with automatic quality scoring
INSERT INTO sensor_readings (time, device_id, temperature, humidity, quality_score)
VALUES (
  NOW(),
  'sensor_001',
  23.5,
  55.0,
  calculate_quality_score(23.5, 55.0, NOW())
);

1276.4 Quality Monitoring Queries

Continuous quality monitoring dashboards should track:

-- 1. Quality score distribution (last 24 hours)
SELECT
  device_id,
  AVG(quality_score) as avg_quality,
  MIN(quality_score) as min_quality,
  COUNT(*) FILTER (WHERE quality_score < 70) as low_quality_count,
  COUNT(*) as total_readings
FROM sensor_readings
WHERE time > NOW() - INTERVAL '24 hours'
GROUP BY device_id
HAVING AVG(quality_score) < 80  -- Alert threshold
ORDER BY avg_quality ASC;

-- 2. Missing data detection (gaps in time-series)
SELECT
  device_id,
  time_bucket('1 hour', time) AS hour,
  COUNT(*) as readings_count,
  -- Expected: 60 readings/hour at 1/min sampling
  60 - COUNT(*) as missing_readings
FROM sensor_readings
WHERE time > NOW() - INTERVAL '24 hours'
GROUP BY device_id, hour
HAVING COUNT(*) < 50  -- >10 missing readings/hour
ORDER BY missing_readings DESC;
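
-- Note: a plain GROUP BY cannot report an hour in which a device sent
-- nothing at all, because there are no rows left to group. If
-- time_bucket_gapfill is available (TimescaleDB community license), it
-- generates the empty buckets; a sketch, same 1-reading/min assumption:
SELECT
  device_id,
  time_bucket_gapfill('1 hour', time) AS hour,
  COALESCE(COUNT(*), 0) AS readings_count  -- silent hours surface as 0
FROM sensor_readings
WHERE time > NOW() - INTERVAL '24 hours'
  AND time < NOW()  -- gapfill requires a fully bounded time range
GROUP BY device_id, hour
ORDER BY device_id, hour;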

-- 3. Duplicate detection (same timestamp, device)
SELECT
  device_id,
  time,
  COUNT(*) as duplicate_count
FROM sensor_readings
WHERE time > NOW() - INTERVAL '24 hours'
GROUP BY device_id, time
HAVING COUNT(*) > 1;
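
-- Note: the (device_id, time) primary key defined earlier already rejects
-- exact duplicates in the main table, so this check matters most for
-- staging tables without such a constraint. At insert time, device or
-- gateway retries can be absorbed silently (a sketch):
INSERT INTO sensor_readings (time, device_id, temperature, humidity, quality_score)
VALUES (NOW(), 'sensor_001', 23.5, 55.0, calculate_quality_score(23.5, 55.0, NOW()))
ON CONFLICT (device_id, time) DO NOTHING;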

-- 4. Outlier detection (values beyond 3 standard deviations)
WITH stats AS (
  SELECT
    device_id,
    AVG(temperature) as mean_temp,
    STDDEV(temperature) as stddev_temp
  FROM sensor_readings
  WHERE time > NOW() - INTERVAL '7 days'
  GROUP BY device_id
)
SELECT
  r.time,
  r.device_id,
  r.temperature,
  s.mean_temp,
  ABS(r.temperature - s.mean_temp) / NULLIF(s.stddev_temp, 0) as z_score
FROM sensor_readings r
JOIN stats s ON r.device_id = s.device_id
WHERE
  r.time > NOW() - INTERVAL '24 hours'
  AND s.stddev_temp > 0  -- skip devices with too few or constant readings
  AND ABS(r.temperature - s.mean_temp) > 3 * s.stddev_temp
ORDER BY z_score DESC;

1276.5 Automated Quality Alerts

Tip: Production Monitoring Best Practices

Alert Thresholds (tune based on your system):

  1. Quality Score < 70: Investigate device/sensor issues
  2. Missing Data > 10%: Network or device failure likely
  3. Duplicate Rate > 1%: Application or network retry issues
  4. Outlier Rate > 5%: Sensor calibration or environmental anomaly

Implementation with continuous aggregates:

-- Create materialized view for hourly quality metrics
CREATE MATERIALIZED VIEW hourly_quality_summary
WITH (timescaledb.continuous) AS
SELECT
  device_id,
  time_bucket('1 hour', time) AS hour,
  AVG(quality_score) as avg_quality,
  COUNT(*) as reading_count,
  COUNT(*) FILTER (WHERE quality_score < 70) as poor_quality_count,
  AVG(temperature) as avg_temperature,
  STDDEV(temperature) as stddev_temperature
FROM sensor_readings
GROUP BY device_id, hour;

-- Auto-refresh every 5 minutes
SELECT add_continuous_aggregate_policy('hourly_quality_summary',
  start_offset => INTERVAL '1 day',
  end_offset => INTERVAL '1 hour',
  schedule_interval => INTERVAL '5 minutes');

-- Query for quick dashboard updates
SELECT * FROM hourly_quality_summary
WHERE hour > NOW() - INTERVAL '24 hours'
  AND avg_quality < 80
ORDER BY avg_quality ASC;
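
The thresholds listed above can then be checked directly against the aggregate. A sketch for the missing-data threshold, assuming the 60-readings-per-hour cadence used earlier:

-- Alert when a device misses more than 10% of expected readings in an hour
SELECT
  device_id,
  hour,
  reading_count,
  ROUND(100.0 * (60 - reading_count) / 60, 1) AS missing_pct
FROM hourly_quality_summary
WHERE hour > NOW() - INTERVAL '24 hours'
  AND reading_count < 54  -- 60 expected; below 54 means >10% missing
ORDER BY missing_pct DESC;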

1276.6 Data Quality vs. Database Performance Trade-offs

Warning (Tradeoff): Hot/Warm/Cold Storage Tiers for IoT Data

Option A: Single-Tier (Hot Storage Only)

  • All data on fast SSD-backed database
  • Query latency: 1-10ms for any time range
  • Storage cost: $0.10-0.25/GB/month (AWS RDS gp3 SSD)
  • 1TB for 1 year: $1,200-3,000/year
  • Operational simplicity: single database to manage

Option B: Multi-Tier (Hot/Warm/Cold)

  • Data moves to cheaper storage as it ages
  • Hot tier (SSD, 0-7 days): $0.10-0.25/GB/month, 1-10ms latency
  • Warm tier (HDD/compressed, 7-90 days): $0.02-0.05/GB/month, 50-200ms latency
  • Cold tier (S3/Glacier, 90+ days): $0.004-0.01/GB/month, 100ms-12hr latency
  • 1TB for 1 year: $300-800/year (75-80% cost reduction)

Decision Factors:

  • Choose Single-Tier when: dataset is under 100GB, query patterns are uniform across all time ranges, and operational simplicity is paramount
  • Choose Multi-Tier when: dataset exceeds 500GB, 90%+ of queries target recent data (last 7 days), storage cost is a concern, or compliance requires multi-year retention
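
Within TimescaleDB, the warm tier can be approximated with native compression and the cold-tier cutoff with a retention policy. A minimal sketch, assuming TimescaleDB 2.x; the intervals are illustrative, not recommendations:

-- Compress chunks older than 7 days (a warm tier on the same node)
ALTER TABLE sensor_readings SET (
  timescaledb.compress,
  timescaledb.compress_segmentby = 'device_id'
);
SELECT add_compression_policy('sensor_readings', INTERVAL '7 days');

-- Drop chunks older than 90 days (after archiving them to object storage)
SELECT add_retention_policy('sensor_readings', INTERVAL '90 days');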

Quality validation has computational costs:

Quality Check | Performance Impact | When to Apply
Range validation | Low (simple comparison) | Every insert (always)
Duplicate detection | Medium (index lookup) | Every insert (recommended)
Outlier detection | High (statistical aggregation) | Batch processing (hourly/daily)
Cross-sensor consistency | Very high (joins across datasets) | Offline validation (weekly)

Optimization strategy:

  1. Ingestion time: Fast checks only (range, format, uniqueness)
  2. Real-time (1-5 min delay): Lightweight statistical checks (moving average validation)
  3. Batch (hourly/daily): Complex validations (cross-correlation, trend analysis, ML-based anomaly detection), as sketched below
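
For the batch tier, a minimal scheduling sketch using TimescaleDB 2.x user-defined actions; the procedure and outlier-table names are illustrative:

-- Illustrative table collecting flagged outliers
CREATE TABLE temperature_outliers (
  time TIMESTAMPTZ,
  device_id TEXT,
  temperature DOUBLE PRECISION,
  z_score DOUBLE PRECISION
);

-- Job procedure; the (job_id, config) signature is what add_job expects
CREATE OR REPLACE PROCEDURE flag_temperature_outliers(job_id INT, config JSONB)
LANGUAGE plpgsql AS $$
BEGIN
  INSERT INTO temperature_outliers
  SELECT r.time, r.device_id, r.temperature,
         ABS(r.temperature - s.mean_temp) / NULLIF(s.stddev_temp, 0)
  FROM sensor_readings r
  JOIN (
    SELECT device_id,
           AVG(temperature) AS mean_temp,
           STDDEV(temperature) AS stddev_temp
    FROM sensor_readings
    WHERE time > NOW() - INTERVAL '7 days'
    GROUP BY device_id
  ) s ON r.device_id = s.device_id
  WHERE r.time > NOW() - INTERVAL '1 hour'
    AND s.stddev_temp > 0
    AND ABS(r.temperature - s.mean_temp) > 3 * s.stddev_temp;
END;
$$;

-- Run the check hourly in the background
SELECT add_job('flag_temperature_outliers', INTERVAL '1 hour');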

1276.7 Handling Low-Quality Data

When quality issues are detected, choose the appropriate strategy:

Strategy 1: Reject

When to use: Critical applications where bad data is worse than no data

-- Reject invalid readings at ingestion
CREATE OR REPLACE FUNCTION reject_invalid_data()
RETURNS TRIGGER AS $$
BEGIN
  -- Calculate quality score
  NEW.quality_score := calculate_quality_score(
    NEW.temperature,
    NEW.humidity,
    NEW.time
  );

  -- Reject if quality too low
  IF NEW.quality_score < 50 THEN
    RAISE EXCEPTION 'Data quality too low: %', NEW.quality_score;
  END IF;

  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER check_data_quality
  BEFORE INSERT ON sensor_readings
  FOR EACH ROW
  EXECUTE FUNCTION reject_invalid_data();

Strategy 2: Quarantine

When to use: Need to investigate issues while keeping the pipeline flowing

-- Create quarantine table
CREATE TABLE sensor_readings_quarantine (
  LIKE sensor_readings INCLUDING DEFAULTS,  -- no PK copy: the same bad reading may be quarantined more than once
  quarantine_reason TEXT,
  quarantine_time TIMESTAMPTZ DEFAULT NOW()
);

-- Modified trigger: quarantine instead of reject
CREATE OR REPLACE FUNCTION quarantine_invalid_data()
RETURNS TRIGGER AS $$
BEGIN
  NEW.quality_score := calculate_quality_score(
    NEW.temperature, NEW.humidity, NEW.time
  );

  IF NEW.quality_score < 50 THEN
    -- Move to quarantine instead of rejecting
    INSERT INTO sensor_readings_quarantine
    VALUES (NEW.*, 'Low quality score: ' || NEW.quality_score, NOW());
    RETURN NULL;  -- Skip insert into main table
  END IF;

  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- Replace the reject trigger with the quarantine version
DROP TRIGGER IF EXISTS check_data_quality ON sensor_readings;
CREATE TRIGGER quarantine_data_quality
  BEFORE INSERT ON sensor_readings
  FOR EACH ROW
  EXECUTE FUNCTION quarantine_invalid_data();

Strategy 3: Flag

When to use: Allow analysis while marking suspect data

-- Store all data but flag quality issues
CREATE OR REPLACE FUNCTION flag_data_quality()
RETURNS TRIGGER AS $$
BEGIN
  NEW.quality_score := calculate_quality_score(
    NEW.temperature, NEW.humidity, NEW.time
  );

  -- Build quality flags JSON
  NEW.quality_flags := jsonb_build_object(
    'temp_valid', NEW.temperature BETWEEN -50 AND 50,
    'humidity_valid', NEW.humidity BETWEEN 0 AND 100,
    'timestamp_fresh', EXTRACT(EPOCH FROM (NOW() - NEW.time)) < 300,
    'overall_quality', CASE
      WHEN NEW.quality_score >= 80 THEN 'good'
      WHEN NEW.quality_score >= 50 THEN 'acceptable'
      ELSE 'poor'
    END
  );

  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- Swap in the flagging trigger
DROP TRIGGER IF EXISTS quarantine_data_quality ON sensor_readings;
CREATE TRIGGER flag_data_quality_trigger
  BEFORE INSERT ON sensor_readings
  FOR EACH ROW
  EXECUTE FUNCTION flag_data_quality();

-- Analytics can filter based on flags
SELECT * FROM sensor_readings
WHERE quality_flags->>'overall_quality' = 'good'
  AND time > NOW() - INTERVAL '1 day';

1276.8 Data Quality Dashboards

Visualizing quality metrics helps identify patterns:

Key metrics to track:

  1. Quality trend over time (line chart)
    • X-axis: Time (hourly/daily)
    • Y-axis: Average quality score
    • Multiple lines: One per device or device group
  2. Device quality heatmap (2D grid)
    • X-axis: Devices
    • Y-axis: Time periods
    • Color: Quality score (green=good, yellow=acceptable, red=poor)
  3. Data completeness gauge (percentage)
    • Expected readings vs actual readings
    • Target: >95% completeness (see the query after this list)
  4. Outlier rate (stacked area chart)
    • Show percentage of readings that are outliers over time
    • Helps identify sensor drift or calibration issues
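
As an example of feeding these panels, the completeness gauge reduces to a single query; a sketch, assuming the 1-reading-per-minute cadence used earlier:

-- Completeness per device over the last 24 hours (1440 expected at 1/min)
SELECT
  device_id,
  COUNT(*) AS actual_readings,
  ROUND(100.0 * COUNT(*) / 1440, 1) AS completeness_pct
FROM sensor_readings
WHERE time > NOW() - INTERVAL '24 hours'
GROUP BY device_id
ORDER BY completeness_pct ASC;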

1276.9 Common Pitfalls

Warning (Common Pitfall): Payload Size Too Large for Protocol/Network

The mistake: Sending full JSON payloads over constrained networks like LoRaWAN or Sigfox, exceeding protocol limits and causing transmission failures.

Symptoms:

  • Message fragmentation across multiple transmissions
  • Transmission failures and dropped messages
  • High battery consumption from repeated send attempts
  • Gateway buffer overflows

Why it happens: Developers design payloads on desktop environments with ample bandwidth, then deploy to constrained networks without considering size limits. LoRaWAN allows ~50-250 bytes depending on spreading factor; Sigfox allows only 12 bytes.

The fix:

# Use binary encoding for LPWAN
import struct

# Pack data into 10 bytes instead of 150+ byte JSON
# (fits comfortably within Sigfox's 12-byte limit)
payload = struct.pack(
    ">HhHHH",                    # big-endian: 2+2+2+2+2 = 10 bytes
    device_id_short,             # 2 bytes, unsigned device ID
    int(temp * 10),              # 2 bytes signed, 0.1 C resolution
    int(humidity * 10),          # 2 bytes, 0.1% resolution (0-100%)
    int((pressure - 900) * 10),  # 2 bytes, hPa offset from 900, 0.1 hPa steps
    int(battery * 100),          # 2 bytes, 0.01 V resolution
)

Prevention: Know your protocol’s payload limits before designing data schemas. Use CBOR instead of JSON for constrained networks.

Warning (Common Pitfall): Missing Schema Versioning

The mistake: Changing data format without versioning, breaking all existing consumers when devices send incompatible payloads.

Symptoms:

  • Parsing errors after firmware updates
  • Old and new devices sending incompatible data
  • Breaking changes cascade through the system

Why it happens: Teams assume all devices will update simultaneously. In IoT, devices may run old firmware for years.

The fix:

# Include version in payload
payload = {
    "version": 2,
    "temperature": 22.5,
    "unit": "C"
}

# Consumer handles multiple versions
def parse(msg):
    version = msg.get("version", 1)
    if version == 1:
        return {"temperature": msg["temp"]}  # Old format
    else:
        return msg  # New format

Prevention: Include version field in all payloads from day one. Use schema registries (Apache Avro, Protobuf) for complex systems.

1276.10 Common Misconception

Warning (Common Misconception): “NoSQL is Always Better for IoT Scale”

The Myth: “IoT generates massive data volumes, so we must use NoSQL databases like MongoDB or Cassandra. SQL databases can’t handle IoT scale.”

The Reality: SQL databases with time-series extensions (TimescaleDB) routinely handle millions of writes per second and petabytes of data, matching or exceeding NoSQL performance for time-series workloads.

Real-World Data:

Database | Write Throughput | Storage Capacity | Query Performance (time-range)
TimescaleDB | 1M+ rows/sec (single node) | Petabytes with compression | 10-100x faster than PostgreSQL
InfluxDB | 500K-1M points/sec | Hundreds of TB | Optimized for time queries
MongoDB | 100K-500K docs/sec | Petabytes with sharding | Slower for time-range queries
Cassandra | 1M+ writes/sec (distributed) | Petabytes | Poor for time-range aggregations

When NoSQL Wins:

  • Flexible schemas that evolve rapidly (device metadata varying by type)
  • Distributed availability over consistency (global edge deployments)
  • Massive write distribution across geographic regions (Cassandra)

When SQL+TimescaleDB Wins:

  • Time-series sensor data with regular intervals (90% of IoT use cases)
  • Complex aggregations (hourly/daily summaries, percentiles)
  • SQL compatibility (existing tools, ORMs, data analysts)
  • Cost optimization (compression, continuous aggregates)

The Right Approach: Use polyglot persistence, pairing TimescaleDB for sensor data with PostgreSQL for metadata, Redis for caching, and MongoDB for flexible configs.

1276.11 Summary

  • Six dimensions of data quality (accuracy, completeness, consistency, timeliness, validity, uniqueness) must be monitored
  • Quality scoring at ingestion enables filtering and alerting on suspect data
  • Continuous aggregates provide real-time quality dashboards without performance impact
  • Three handling strategies: Reject (critical apps), Quarantine (investigation), Flag (preserve for analysis)
  • Performance trade-offs require tiered validation: fast checks at ingestion, complex checks in batch
  • Schema versioning from day one prevents breaking changes during firmware updates

1276.12 What’s Next

Continue with Sharding Strategies to learn how to scale IoT databases horizontally for massive data volumes.

Related Chapters:

  • Time-Series Databases: database optimization
  • Worked Examples: complete architecture case studies