1276  Data Quality Monitoring

Learning Objectives

After completing this chapter, you will be able to:

  • Define and measure the six dimensions of data quality for IoT
  • Implement data quality checks at ingestion time
  • Design quality monitoring dashboards and alerts
  • Handle low-quality data with appropriate strategies
  • Balance data quality validation with database performance

1276.1 Introduction

Storing IoT data is only valuable if the data itself can be trusted. This chapter covers essential data quality metrics and monitoring strategies for IoT database systems.

1276.2 The Six Dimensions of Data Quality

Note: Data Quality Framework for IoT

IoT systems must monitor data quality across six critical dimensions:

Dimension | Definition | IoT Example | Impact of Poor Quality
Accuracy | Data correctly represents reality | Sensor reports 25°C when actual is 24°C | Wrong decisions, false alerts
Completeness | All required data is present | Temperature readings but missing humidity | Incomplete analysis, ML failures
Consistency | Data is uniform across systems | Device A uses Celsius, Device B Fahrenheit | Aggregation errors, wrong comparisons
Timeliness | Data is up-to-date and available when needed | Sensor data arrives 10 minutes late | Missed real-time opportunities
Validity | Data conforms to defined formats/rules | GPS coordinates outside valid range | Database errors, query failures
Uniqueness | No duplicate records exist | Same reading stored twice | Inflated counts, wasted storage
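
For instance, the consistency dimension often surfaces as unit mismatches across a device fleet. Below is a minimal sketch of normalizing units at read time, assuming the sensor_readings table defined in the next section and a hypothetical device_metadata table that records each device's reporting unit:

-- Illustrative metadata table mapping each device to its reporting unit
CREATE TABLE device_metadata (
  device_id TEXT PRIMARY KEY,
  temperature_unit TEXT CHECK (temperature_unit IN ('C', 'F'))
);

-- Normalize everything to Celsius before aggregating across devices
SELECT
  r.device_id,
  r.time,
  CASE m.temperature_unit
    WHEN 'F' THEN (r.temperature - 32) * 5.0 / 9.0
    ELSE r.temperature
  END AS temperature_celsius
FROM sensor_readings r
JOIN device_metadata m ON r.device_id = m.device_id;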

1276.3 Implementing Data Quality Checks at Ingestion

Data quality should be validated as data enters the database:

1276.3.1 Real-time Quality Scoring

-- TimescaleDB: Add quality score to sensor readings
CREATE TABLE sensor_readings (
  time TIMESTAMPTZ NOT NULL,
  device_id TEXT NOT NULL,
  temperature DOUBLE PRECISION,
  humidity DOUBLE PRECISION,
  -- Quality metadata
  quality_score INTEGER CHECK (quality_score BETWEEN 0 AND 100),
  quality_flags JSONB,
  PRIMARY KEY (device_id, time)
);

-- Create hypertable
SELECT create_hypertable('sensor_readings', 'time');

-- Function to calculate quality score
CREATE OR REPLACE FUNCTION calculate_quality_score(
  temp DOUBLE PRECISION,
  humidity DOUBLE PRECISION,
  reading_time TIMESTAMPTZ  -- avoid 'timestamp', which is a SQL type name
) RETURNS INTEGER AS $$
DECLARE
  score INTEGER := 100;
  age_seconds INTEGER;
BEGIN
  -- Check temperature validity (+/-50C range)
  IF temp < -50 OR temp > 50 THEN score := score - 40; END IF;

  -- Check humidity validity (0-100% range)
  IF humidity < 0 OR humidity > 100 THEN score := score - 40; END IF;

  -- Check data freshness (penalize if >5 minutes old)
  age_seconds := EXTRACT(EPOCH FROM (NOW() - reading_time));
  IF age_seconds > 300 THEN score := score - 20; END IF;

  RETURN GREATEST(score, 0);
END;
$$ LANGUAGE plpgsql;

-- Insert with automatic quality scoring
INSERT INTO sensor_readings (time, device_id, temperature, humidity, quality_score)
VALUES (
  NOW(),
  'sensor_001',
  23.5,
  55.0,
  calculate_quality_score(23.5, 55.0, NOW())
);

1276.4 Quality Monitoring Queries

Continuous quality monitoring dashboards should track:

-- 1. Quality score distribution (last 24 hours)
SELECT
  device_id,
  AVG(quality_score) as avg_quality,
  MIN(quality_score) as min_quality,
  COUNT(*) FILTER (WHERE quality_score < 70) as low_quality_count,
  COUNT(*) as total_readings
FROM sensor_readings
WHERE time > NOW() - INTERVAL '24 hours'
GROUP BY device_id
HAVING AVG(quality_score) < 80  -- Alert threshold
ORDER BY avg_quality ASC;

-- 2. Missing data detection (gaps in time-series)
SELECT
  device_id,
  time_bucket('1 hour', time) AS hour,
  COUNT(*) as readings_count,
  -- Expected: 60 readings/hour at 1/min sampling
  60 - COUNT(*) as missing_readings
FROM sensor_readings
WHERE time > NOW() - INTERVAL '24 hours'
GROUP BY device_id, hour
HAVING COUNT(*) < 50  -- >10 missing readings/hour
ORDER BY missing_readings DESC;
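
-- Note: a plain GROUP BY cannot report an hour in which a device sent
-- nothing at all, because there are no rows left to group. If
-- time_bucket_gapfill is available (TimescaleDB community license), it
-- generates the empty buckets; a sketch, same 1-reading/min assumption:
SELECT
  device_id,
  time_bucket_gapfill('1 hour', time) AS hour,
  COALESCE(COUNT(*), 0) AS readings_count  -- silent hours surface as 0
FROM sensor_readings
WHERE time > NOW() - INTERVAL '24 hours'
  AND time < NOW()  -- gapfill requires a fully bounded time range
GROUP BY device_id, hour
ORDER BY device_id, hour;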

-- 3. Duplicate detection (same timestamp, device)
SELECT
  device_id,
  time,
  COUNT(*) as duplicate_count
FROM sensor_readings
WHERE time > NOW() - INTERVAL '24 hours'
GROUP BY device_id, time
HAVING COUNT(*) > 1;
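
-- Note: the (device_id, time) primary key defined earlier already rejects
-- exact duplicates in the main table, so this check matters most for
-- staging tables without such a constraint. At insert time, device or
-- gateway retries can be absorbed silently (a sketch):
INSERT INTO sensor_readings (time, device_id, temperature, humidity, quality_score)
VALUES (NOW(), 'sensor_001', 23.5, 55.0, calculate_quality_score(23.5, 55.0, NOW()))
ON CONFLICT (device_id, time) DO NOTHING;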

-- 4. Outlier detection (values beyond 3 standard deviations)
WITH stats AS (
  SELECT
    device_id,
    AVG(temperature) as mean_temp,
    STDDEV(temperature) as stddev_temp
  FROM sensor_readings
  WHERE time > NOW() - INTERVAL '7 days'
  GROUP BY device_id
)
SELECT
  r.time,
  r.device_id,
  r.temperature,
  s.mean_temp,
  ABS(r.temperature - s.mean_temp) / NULLIF(s.stddev_temp, 0) as z_score
FROM sensor_readings r
JOIN stats s ON r.device_id = s.device_id
WHERE
  r.time > NOW() - INTERVAL '24 hours'
  AND s.stddev_temp > 0  -- skip devices with too few or constant readings
  AND ABS(r.temperature - s.mean_temp) > 3 * s.stddev_temp
ORDER BY z_score DESC;

1276.5 Automated Quality Alerts

Tip: Production Monitoring Best Practices

Alert Thresholds (tune based on your system):

  1. Quality Score < 70: Investigate device/sensor issues
  2. Missing Data > 10%: Network or device failure likely
  3. Duplicate Rate > 1%: Application or network retry issues
  4. Outlier Rate > 5%: Sensor calibration or environmental anomaly

Implementation with continuous aggregates:

-- Create materialized view for hourly quality metrics
CREATE MATERIALIZED VIEW hourly_quality_summary
WITH (timescaledb.continuous) AS
SELECT
  device_id,
  time_bucket('1 hour', time) AS hour,
  AVG(quality_score) as avg_quality,
  COUNT(*) as reading_count,
  COUNT(*) FILTER (WHERE quality_score < 70) as poor_quality_count,
  AVG(temperature) as avg_temperature,
  STDDEV(temperature) as stddev_temperature
FROM sensor_readings
GROUP BY device_id, hour;

-- Auto-refresh every 5 minutes
SELECT add_continuous_aggregate_policy('hourly_quality_summary',
  start_offset => INTERVAL '1 day',
  end_offset => INTERVAL '1 hour',
  schedule_interval => INTERVAL '5 minutes');

-- Query for quick dashboard updates
SELECT * FROM hourly_quality_summary
WHERE hour > NOW() - INTERVAL '24 hours'
  AND avg_quality < 80
ORDER BY avg_quality ASC;
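
The thresholds listed above can then be checked directly against the aggregate. A sketch for the missing-data threshold, assuming the 60-readings-per-hour cadence used earlier:

-- Alert when a device misses more than 10% of expected readings in an hour
SELECT
  device_id,
  hour,
  reading_count,
  ROUND(100.0 * (60 - reading_count) / 60, 1) AS missing_pct
FROM hourly_quality_summary
WHERE hour > NOW() - INTERVAL '24 hours'
  AND reading_count < 54  -- 60 expected; below 54 means >10% missing
ORDER BY missing_pct DESC;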

1276.6 Data Quality vs. Database Performance Trade-offs

Warning (Tradeoff): Hot/Warm/Cold Storage Tiers for IoT Data

Option A: Single-Tier (Hot Storage Only)

  • All data on fast SSD-backed database
  • Query latency: 1-10ms for any time range
  • Storage cost: $0.10-0.25/GB/month (AWS RDS gp3 SSD)
  • 1TB for 1 year: $1,200-3,000/year
  • Operational simplicity: single database to manage

Option B: Multi-Tier (Hot/Warm/Cold)

  • Data moves to cheaper storage as it ages
  • Hot tier (SSD, 0-7 days): $0.10-0.25/GB/month, 1-10ms latency
  • Warm tier (HDD/compressed, 7-90 days): $0.02-0.05/GB/month, 50-200ms latency
  • Cold tier (S3/Glacier, 90+ days): $0.004-0.01/GB/month, 100ms-12hr latency
  • 1TB for 1 year: $300-800/year (75-80% cost reduction)

Decision Factors:

  • Choose Single-Tier when: dataset is under 100GB, query patterns are uniform across all time ranges, and operational simplicity is paramount
  • Choose Multi-Tier when: dataset exceeds 500GB, 90%+ of queries target recent data (last 7 days), storage cost is a concern, or compliance requires multi-year retention
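
Within TimescaleDB, the warm tier can be approximated with native compression and the cold-tier cutoff with a retention policy. A minimal sketch, assuming TimescaleDB 2.x; the intervals are illustrative, not recommendations:

-- Compress chunks older than 7 days (a warm tier on the same node)
ALTER TABLE sensor_readings SET (
  timescaledb.compress,
  timescaledb.compress_segmentby = 'device_id'
);
SELECT add_compression_policy('sensor_readings', INTERVAL '7 days');

-- Drop chunks older than 90 days (after archiving them to object storage)
SELECT add_retention_policy('sensor_readings', INTERVAL '90 days');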

Quality validation has computational costs:

Quality Check | Performance Impact | When to Apply
Range validation | Low (simple comparison) | Every insert (always)
Duplicate detection | Medium (index lookup) | Every insert (recommended)
Outlier detection | High (statistical aggregation) | Batch processing (hourly/daily)
Cross-sensor consistency | Very high (joins across datasets) | Offline validation (weekly)

Optimization strategy:

  1. Ingestion time: Fast checks only (range, format, uniqueness)
  2. Real-time (1-5 min delay): Lightweight statistical checks (moving average validation)
  3. Batch (hourly/daily): Complex validations (cross-correlation, trend analysis, ML-based anomaly detection), as sketched below
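
For the batch tier, a minimal scheduling sketch using TimescaleDB 2.x user-defined actions; the procedure and outlier-table names are illustrative:

-- Illustrative table collecting flagged outliers
CREATE TABLE temperature_outliers (
  time TIMESTAMPTZ,
  device_id TEXT,
  temperature DOUBLE PRECISION,
  z_score DOUBLE PRECISION
);

-- Job procedure; the (job_id, config) signature is what add_job expects
CREATE OR REPLACE PROCEDURE flag_temperature_outliers(job_id INT, config JSONB)
LANGUAGE plpgsql AS $$
BEGIN
  INSERT INTO temperature_outliers
  SELECT r.time, r.device_id, r.temperature,
         ABS(r.temperature - s.mean_temp) / NULLIF(s.stddev_temp, 0)
  FROM sensor_readings r
  JOIN (
    SELECT device_id,
           AVG(temperature) AS mean_temp,
           STDDEV(temperature) AS stddev_temp
    FROM sensor_readings
    WHERE time > NOW() - INTERVAL '7 days'
    GROUP BY device_id
  ) s ON r.device_id = s.device_id
  WHERE r.time > NOW() - INTERVAL '1 hour'
    AND s.stddev_temp > 0
    AND ABS(r.temperature - s.mean_temp) > 3 * s.stddev_temp;
END;
$$;

-- Run the check hourly in the background
SELECT add_job('flag_temperature_outliers', INTERVAL '1 hour');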

1276.7 Handling Low-Quality Data

When quality issues are detected, choose the appropriate strategy:

Strategy 1: Reject

When to use: Critical applications where bad data is worse than no data

-- Reject invalid readings at ingestion
CREATE OR REPLACE FUNCTION reject_invalid_data()
RETURNS TRIGGER AS $$
BEGIN
  -- Calculate quality score
  NEW.quality_score := calculate_quality_score(
    NEW.temperature,
    NEW.humidity,
    NEW.time
  );

  -- Reject if quality too low
  IF NEW.quality_score < 50 THEN
    RAISE EXCEPTION 'Data quality too low: %', NEW.quality_score;
  END IF;

  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER check_data_quality
  BEFORE INSERT ON sensor_readings
  FOR EACH ROW
  EXECUTE FUNCTION reject_invalid_data();

Strategy 2: Quarantine

When to use: Need to investigate issues while keeping the pipeline flowing

-- Create quarantine table
CREATE TABLE sensor_readings_quarantine (
  LIKE sensor_readings INCLUDING DEFAULTS,  -- no PK copy: the same bad reading may be quarantined more than once
  quarantine_reason TEXT,
  quarantine_time TIMESTAMPTZ DEFAULT NOW()
);

-- Modified trigger: quarantine instead of reject
CREATE OR REPLACE FUNCTION quarantine_invalid_data()
RETURNS TRIGGER AS $$
BEGIN
  NEW.quality_score := calculate_quality_score(
    NEW.temperature, NEW.humidity, NEW.time
  );

  IF NEW.quality_score < 50 THEN
    -- Move to quarantine instead of rejecting
    INSERT INTO sensor_readings_quarantine
    VALUES (NEW.*, 'Low quality score: ' || NEW.quality_score, NOW());
    RETURN NULL;  -- Skip insert into main table
  END IF;

  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- Replace the reject trigger with the quarantine version
DROP TRIGGER IF EXISTS check_data_quality ON sensor_readings;
CREATE TRIGGER quarantine_data_quality
  BEFORE INSERT ON sensor_readings
  FOR EACH ROW
  EXECUTE FUNCTION quarantine_invalid_data();

Strategy 3: Flag

When to use: Allow analysis while marking suspect data

-- Store all data but flag quality issues
CREATE OR REPLACE FUNCTION flag_data_quality()
RETURNS TRIGGER AS $$
BEGIN
  NEW.quality_score := calculate_quality_score(
    NEW.temperature, NEW.humidity, NEW.time
  );

  -- Build quality flags JSON
  NEW.quality_flags := jsonb_build_object(
    'temp_valid', NEW.temperature BETWEEN -50 AND 50,
    'humidity_valid', NEW.humidity BETWEEN 0 AND 100,
    'timestamp_fresh', EXTRACT(EPOCH FROM (NOW() - NEW.time)) < 300,
    'overall_quality', CASE
      WHEN NEW.quality_score >= 80 THEN 'good'
      WHEN NEW.quality_score >= 50 THEN 'acceptable'
      ELSE 'poor'
    END
  );

  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- Swap in the flagging trigger
DROP TRIGGER IF EXISTS quarantine_data_quality ON sensor_readings;
CREATE TRIGGER flag_data_quality_trigger
  BEFORE INSERT ON sensor_readings
  FOR EACH ROW
  EXECUTE FUNCTION flag_data_quality();

-- Analytics can filter based on flags
SELECT * FROM sensor_readings
WHERE quality_flags->>'overall_quality' = 'good'
  AND time > NOW() - INTERVAL '1 day';

1276.8 Data Quality Dashboards

Visualizing quality metrics helps identify patterns:

Key metrics to track:

  1. Quality trend over time (line chart)
    • X-axis: Time (hourly/daily)
    • Y-axis: Average quality score
    • Multiple lines: One per device or device group
  2. Device quality heatmap (2D grid)
    • X-axis: Devices
    • Y-axis: Time periods
    • Color: Quality score (green=good, yellow=acceptable, red=poor)
  3. Data completeness gauge (percentage)
    • Expected readings vs actual readings
    • Target: >95% completeness (see the query after this list)
  4. Outlier rate (stacked area chart)
    • Show percentage of readings that are outliers over time
    • Helps identify sensor drift or calibration issues
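
As an example of feeding these panels, the completeness gauge reduces to a single query; a sketch, assuming the 1-reading-per-minute cadence used earlier:

-- Completeness per device over the last 24 hours (1440 expected at 1/min)
SELECT
  device_id,
  COUNT(*) AS actual_readings,
  ROUND(100.0 * COUNT(*) / 1440, 1) AS completeness_pct
FROM sensor_readings
WHERE time > NOW() - INTERVAL '24 hours'
GROUP BY device_id
ORDER BY completeness_pct ASC;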

1276.9 Common Pitfalls

Warning (Common Pitfall): Payload Size Too Large for Protocol/Network

The mistake: Sending full JSON payloads over constrained networks like LoRaWAN or Sigfox, exceeding protocol limits and causing transmission failures.

Symptoms:

  • Message fragmentation across multiple transmissions
  • Transmission failures and dropped messages
  • High battery consumption from repeated send attempts
  • Gateway buffer overflows

Why it happens: Developers design payloads on desktop environments with ample bandwidth, then deploy to constrained networks without considering size limits. LoRaWAN allows ~50-250 bytes depending on spreading factor; Sigfox allows only 12 bytes.

The fix:

# Use binary encoding for LPWAN
import struct

# Pack data into 10 bytes instead of 150+ byte JSON
# (fits comfortably within Sigfox's 12-byte limit)
payload = struct.pack(
    ">HhHHH",                    # big-endian: 2+2+2+2+2 = 10 bytes
    device_id_short,             # 2 bytes, unsigned device ID
    int(temp * 10),              # 2 bytes signed, 0.1 C resolution
    int(humidity * 10),          # 2 bytes, 0.1% resolution (0-100%)
    int((pressure - 900) * 10),  # 2 bytes, hPa offset from 900, 0.1 hPa steps
    int(battery * 100),          # 2 bytes, 0.01 V resolution
)

Prevention: Know your protocol’s payload limits before designing data schemas. Use CBOR instead of JSON for constrained networks.

Warning (Common Pitfall): Missing Schema Versioning

The mistake: Changing data format without versioning, breaking all existing consumers when devices send incompatible payloads.

Symptoms:

  • Parsing errors after firmware updates
  • Old and new devices sending incompatible data
  • Breaking changes cascade through the system

Why it happens: Teams assume all devices will update simultaneously. In IoT, devices may run old firmware for years.

The fix:

# Include version in payload
payload = {
    "version": 2,
    "temperature": 22.5,
    "unit": "C"
}

# Consumer handles multiple versions
def parse(msg):
    version = msg.get("version", 1)
    if version == 1:
        return {"temperature": msg["temp"]}  # Old format
    else:
        return msg  # New format

Prevention: Include version field in all payloads from day one. Use schema registries (Apache Avro, Protobuf) for complex systems.

1276.10 Common Misconception

Warning (Common Misconception): “NoSQL is Always Better for IoT Scale”

The Myth: “IoT generates massive data volumes, so we must use NoSQL databases like MongoDB or Cassandra. SQL databases can’t handle IoT scale.”

The Reality: SQL databases with time-series extensions (TimescaleDB) routinely handle millions of writes per second and petabytes of data, matching or exceeding NoSQL performance for time-series workloads.

Real-World Data:

Database | Write Throughput | Storage Capacity | Query Performance (time-range)
TimescaleDB | 1M+ rows/sec (single node) | Petabytes with compression | 10-100x faster than PostgreSQL
InfluxDB | 500K-1M points/sec | Hundreds of TB | Optimized for time queries
MongoDB | 100K-500K docs/sec | Petabytes with sharding | Slower for time-range queries
Cassandra | 1M+ writes/sec (distributed) | Petabytes | Poor for time-range aggregations

When NoSQL Wins:

  • Flexible schemas that evolve rapidly (device metadata varying by type)
  • Distributed availability over consistency (global edge deployments)
  • Massive write distribution across geographic regions (Cassandra)

When SQL+TimescaleDB Wins:

  • Time-series sensor data with regular intervals (90% of IoT use cases)
  • Complex aggregations (hourly/daily summaries, percentiles)
  • SQL compatibility (existing tools, ORMs, data analysts)
  • Cost optimization (compression, continuous aggregates)

The Right Approach: Use polyglot persistence, pairing TimescaleDB for sensor data with PostgreSQL for metadata, Redis for caching, and MongoDB for flexible configs.

1276.11 Summary

  • Six dimensions of data quality (accuracy, completeness, consistency, timeliness, validity, uniqueness) must be monitored
  • Quality scoring at ingestion enables filtering and alerting on suspect data
  • Continuous aggregates provide real-time quality dashboards without performance impact
  • Three handling strategies: Reject (critical apps), Quarantine (investigation), Flag (preserve for analysis)
  • Performance trade-offs require tiered validation: fast checks at ingestion, complex checks in batch
  • Schema versioning from day one prevents breaking changes during firmware updates

1276.12 What’s Next

Continue with Sharding Strategies to learn how to scale IoT databases horizontally for massive data volumes.

Related Chapters:

  • Time-Series Databases: database optimization
  • Worked Examples: complete architecture case studies