Data quality monitoring for IoT databases tracks six critical dimensions: accuracy, completeness, consistency, timeliness, validity, and uniqueness. By scoring data quality at ingestion, creating continuous aggregate dashboards, and applying tiered handling strategies (reject, quarantine, or flag), you can ensure that only trustworthy data drives your analytics and decisions.
Key Concepts
Data Quality Score: A composite metric (0-100) calculated from completeness, accuracy, consistency, timeliness, and validity dimensions, attached to each ingested sensor reading to enable downstream filtering
Schema Validation: The process of checking incoming IoT records against an expected structure – verifying field presence, data types, and value ranges – before committing to storage
Completeness: A quality dimension measuring the percentage of expected sensor readings that were actually received; gaps indicate connectivity failures, device crashes, or power interruptions
Timeliness: A quality dimension measuring whether sensor data arrives within an acceptable latency window; late data may indicate network congestion, buffering failures, or clock drift
Quarantine Strategy: A data quality handling approach that writes low-quality records to a separate table for manual review rather than discarding or silently accepting them
Continuous Aggregate: A TimescaleDB materialized view that automatically refreshes pre-computed aggregations (hourly average, daily max) as new data arrives, enabling fast dashboard queries without full table scans
Reject-Quarantine-Flag Pattern: Three data quality handling strategies: reject (discard bad data immediately), quarantine (isolate for review), or flag (store with quality metadata for downstream filtering)
Quality Monitoring Dashboard: A real-time visualization tracking per-device quality scores, missing reading rates, and validation failure counts to detect sensor degradation before it impacts applications
Learning Objectives
After completing this chapter, you will be able to:
Define and measure the six dimensions of data quality for IoT
Implement data quality checks at ingestion time
Design quality monitoring dashboards and alerts
Classify low-quality data using reject, quarantine, and flag strategies
Evaluate data quality validation trade-offs against database performance
For Beginners: Data Quality Monitoring
Data quality monitoring is like quality control on a factory line for IoT data. Before using sensor readings for important decisions, you need to check: Are the values accurate? Are any readings missing? Did they arrive on time? Just as a restaurant checks ingredient freshness before cooking, IoT systems need automatic checks to catch bad data before it leads to bad decisions.
7.1 Introduction
A single miscalibrated sensor can corrupt months of analytics, trigger false alarms, or cause an ML model to learn the wrong patterns. In large IoT deployments with thousands of devices, bad data is not an exception – it is an inevitability. Sensors drift, batteries die mid-transmission, firmware bugs corrupt payloads, and network retries create duplicates.
Data quality monitoring provides the systematic defenses against these failures. Rather than discovering corrupt data weeks later during analysis, quality monitoring catches problems at ingestion – before they propagate through dashboards, models, and automated decisions. This chapter covers the six dimensions of data quality, how to implement scoring and validation in time-series databases, and the strategies for handling data that fails quality checks.
7.2 The Six Dimensions of Data Quality
Data Quality Framework for IoT
IoT systems must monitor data quality across six critical dimensions:
| Dimension | Definition | IoT Example | Impact of Poor Quality |
|---|---|---|---|
| **Accuracy** | Data correctly represents reality | Sensor reports 25 °C when the actual temperature is 24 °C | Wrong decisions, false alerts |
| **Completeness** | All required data is present | Temperature readings arrive but humidity is missing | Incomplete analysis, ML failures |
| **Consistency** | Data is uniform across systems | Device A uses Celsius, Device B uses Fahrenheit | Aggregation errors, wrong comparisons |
| **Timeliness** | Data is up to date and available when needed | Sensor data arrives 10 minutes late | Missed real-time opportunities |
| **Validity** | Data conforms to defined formats and rules | GPS coordinates outside the valid range | Database errors, query failures |
| **Uniqueness** | No duplicate records exist | The same reading is stored twice | Inflated counts, wasted storage |
7.3 Implementing Data Quality Checks at Ingestion
Data quality should be validated as data enters the database. This involves defining a schema with quality metadata, writing a scoring function, and integrating scoring into the insert path.
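The triggers below call a `calculate_quality_score` function that is not shown in this excerpt. As a reference for the logic they rely on, here is a minimal Python sketch of a deduction-based scorer; the point values and ranges are assumptions chosen to match the thresholds used elsewhere in this chapter:

```python
def calculate_quality_score(temperature, humidity, age_seconds):
    """Start at 100 and deduct points per failed check; floor at 0 (illustrative)."""
    score = 100
    if not -50 <= temperature <= 50:   # plausible ambient range, in Celsius
        score -= 30
    if not 0 <= humidity <= 100:       # relative humidity percentage
        score -= 30
    if age_seconds > 300:              # stale if older than 5 minutes
        score -= 20
    return max(score, 0)
```

The SQL triggers implement the same idea in plpgsql so the score is computed inside the database, on the insert path, without a round trip to the application.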
**Reject.** When to use: critical applications where bad data is worse than no data.

```sql
-- Reject invalid readings at ingestion
CREATE OR REPLACE FUNCTION reject_invalid_data()
RETURNS TRIGGER AS $$
BEGIN
    -- Calculate quality score
    NEW.quality_score := calculate_quality_score(
        NEW.temperature, NEW.humidity, NEW.time
    );
    -- Reject if quality too low
    IF NEW.quality_score < 50 THEN
        RAISE EXCEPTION 'Data quality too low: %', NEW.quality_score;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER check_data_quality
    BEFORE INSERT ON sensor_readings
    FOR EACH ROW
    EXECUTE FUNCTION reject_invalid_data();
```
**Quarantine.** When to use: you need to investigate issues while keeping the pipeline flowing.

```sql
-- Create quarantine table
CREATE TABLE sensor_readings_quarantine (
    LIKE sensor_readings INCLUDING ALL,
    quarantine_reason TEXT,
    quarantine_time TIMESTAMPTZ DEFAULT NOW()
);

-- Modified trigger: quarantine instead of reject
CREATE OR REPLACE FUNCTION quarantine_invalid_data()
RETURNS TRIGGER AS $$
BEGIN
    NEW.quality_score := calculate_quality_score(
        NEW.temperature, NEW.humidity, NEW.time
    );
    IF NEW.quality_score < 50 THEN
        -- Move to quarantine instead of rejecting
        INSERT INTO sensor_readings_quarantine
        VALUES (NEW.*, 'Low quality score: ' || NEW.quality_score, NOW());
        RETURN NULL;  -- Skip insert into main table
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;
```
**Flag.** When to use: allow analysis while marking suspect data.

```sql
-- Store all data but flag quality issues
CREATE OR REPLACE FUNCTION flag_data_quality()
RETURNS TRIGGER AS $$
BEGIN
    NEW.quality_score := calculate_quality_score(
        NEW.temperature, NEW.humidity, NEW.time
    );
    -- Build quality flags JSON
    NEW.quality_flags := jsonb_build_object(
        'temp_valid', NEW.temperature BETWEEN -50 AND 50,
        'humidity_valid', NEW.humidity BETWEEN 0 AND 100,
        'timestamp_fresh', EXTRACT(EPOCH FROM (NOW() - NEW.time)) < 300,
        'overall_quality', CASE
            WHEN NEW.quality_score >= 80 THEN 'good'
            WHEN NEW.quality_score >= 50 THEN 'acceptable'
            ELSE 'poor'
        END
    );
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- Analytics can filter based on flags
SELECT * FROM sensor_readings
WHERE quality_flags->>'overall_quality' = 'good'
  AND time > NOW() - INTERVAL '1 day';
```
A useful dashboard panel built on these flags shows the percentage of readings that are outliers over time; a rising outlier rate helps identify sensor drift or calibration issues.
7.9 Common Pitfalls
Beyond monitoring data values, two common design mistakes cause systemic data quality failures at the protocol and schema level:
Common Pitfall: Payload Size Too Large for Protocol/Network
The mistake: Sending full JSON payloads over constrained networks like LoRaWAN or Sigfox, exceeding protocol limits and causing transmission failures.
Symptoms:
Message fragmentation across multiple transmissions
Transmission failures and dropped messages
High battery consumption from repeated send attempts
Gateway buffer overflows
Why it happens: Developers design payloads on desktop environments with ample bandwidth, then deploy to constrained networks without considering size limits. LoRaWAN allows ~50-250 bytes depending on spreading factor; Sigfox allows only 12 bytes.
The fix:
```python
# Use binary encoding for LPWAN
import struct

# Pack data into 10 bytes instead of a 150+ byte JSON payload
payload = struct.pack(
    ">HhHHH",                    # big-endian: 2+2+2+2+2 = 10 bytes
    device_id_short,             # 2 bytes (unsigned)
    int(temp * 10),              # 2 bytes (signed), -3276.8 to 3276.7 °C
    int(humidity * 10),          # 2 bytes, 0-100% with 0.1% precision
    int((pressure - 900) * 10),  # 2 bytes, 900-7453.5 hPa range
    int(battery * 100),          # 2 bytes, 0-655.35 V
)
```
Prevention: Know your protocol’s payload limits before designing data schemas. Use CBOR instead of JSON for constrained networks.
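The receiving side needs a decoder that reverses the scaling and offsets. This round-trip sketch assumes the same field layout as the packing example above; the `decode_payload` helper and the sample values are illustrative:

```python
import struct

def decode_payload(payload: bytes) -> dict:
    """Unpack the 10-byte LPWAN payload packed with struct.pack(">HhHHH", ...)."""
    device_id, temp_raw, hum_raw, pres_raw, batt_raw = struct.unpack(">HhHHH", payload)
    return {
        "device_id": device_id,
        "temperature": temp_raw / 10,     # undo the x10 scaling
        "humidity": hum_raw / 10,
        "pressure": pres_raw / 10 + 900,  # undo the offset and x10 scaling
        "battery": batt_raw / 100,
    }

# Round trip: pack a reading, then decode it back
packed = struct.pack(">HhHHH", 42, round(22.5 * 10), round(55.0 * 10),
                     round((1013.2 - 900) * 10), round(3.70 * 100))
decoded = decode_payload(packed)
```

Using `round()` rather than `int()` when packing avoids losing a count to floating-point truncation (e.g. `113.2 * 10` may evaluate to `1131.999…`).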
Common Pitfall: Missing Schema Versioning
The mistake: Changing data format without versioning, breaking all existing consumers when devices send incompatible payloads.
Symptoms:
Parsing errors after firmware updates
Old and new devices sending incompatible data
Breaking changes cascade through the system
Why it happens: Teams assume all devices will update simultaneously. In IoT, devices may run old firmware for years.
The fix:
```python
# Include version in payload
payload = {"version": 2, "temperature": 22.5, "unit": "C"}

# Consumer handles multiple versions
def parse(msg):
    version = msg.get("version", 1)
    if version == 1:
        return {"temperature": msg["temp"]}  # Old format
    else:
        return msg  # New format
```
Prevention: Include version field in all payloads from day one. Use schema registries (Apache Avro, Protobuf) for complex systems.
7.10 Common Misconception
Database choice directly affects what quality monitoring capabilities are available. A common myth pushes teams toward databases that lack the SQL query power needed for effective quality analysis:
Common Misconception: “NoSQL is Always Better for IoT Scale”
The Myth: “IoT generates massive data volumes, so we must use NoSQL databases like MongoDB or Cassandra. SQL databases can’t handle IoT scale.”
The Reality: SQL databases with time-series extensions (TimescaleDB) handle hundreds of thousands of writes per second per node, scale to petabytes with compression, and offer superior query capabilities for time-series workloads compared to general-purpose NoSQL databases.
Real-World Data:
| Database | Write Throughput | Storage Capacity | Query Performance (time-range) |
|---|---|---|---|
| TimescaleDB | 100K-400K rows/sec (single node) | Petabytes with compression | 10-100x faster than plain PostgreSQL |
| InfluxDB | 500K-1M points/sec | Hundreds of TB | Optimized for time queries |
| MongoDB | 100K-500K docs/sec | Petabytes with sharding | Slower for time-range queries |
| Cassandra | 1M+ writes/sec (distributed) | Petabytes | Poor for time-range aggregations |
When NoSQL Wins:
Flexible schemas that evolve rapidly (device metadata varying by type)
Distributed availability over consistency (global edge deployments)
Massive write distribution across geographic regions (Cassandra)
When SQL+TimescaleDB Wins:
Time-series sensor data with regular intervals (90% of IoT use cases)
The Right Approach: Use polyglot persistence: TimescaleDB for sensor data, PostgreSQL for metadata, Redis for caching, MongoDB for flexible configs.
For Kids: Meet the Sensor Squad!
Data quality monitoring is like having a quality inspector checking every item on a factory assembly line!
7.10.1 The Sensor Squad Adventure: The Quality Patrol
The Sensor Squad had just set up 100 sensors around the school to monitor temperature, noise, and air quality. But something was going wrong!
“The cafeteria sensor says it’s -200 degrees!” shouted Lila the LED. “And the gym sensor says humidity is 500 percent!”
“That’s impossible!” said Sammy the Sensor. “We need a Quality Patrol!”
Max the Microcontroller created a checklist for every reading that came in:
Check 1 - Is it POSSIBLE? (Accuracy) “Can temperature be -200C? NOPE! That’s colder than outer space! RED FLAG!”
Check 2 - Did it ARRIVE? (Completeness) “The library sensor hasn’t sent anything in 2 hours. Missing data alert!”
Check 3 - Does it MATCH? (Consistency) “Sensor A says Celsius, Sensor B says Fahrenheit. That will confuse everything!”
Check 4 - Is it FRESH? (Timeliness) “This reading is from 3 hours ago. That’s too old to be useful right now!”
Check 5 - Is it the RIGHT FORMAT? (Validity) “This sensor sent ‘banana’ instead of a number. That’s not right!”
Check 6 - Is it UNIQUE? (Uniqueness) “We got the same reading twice! Only count it once!”
Bella the Battery set up a scoreboard: “Every reading gets a score from 0 to 100. Below 70? We investigate. Below 50? We quarantine it!”
Now the school’s sensor data was CLEAN and TRUSTWORTHY, and the principal could see exactly which sensors needed maintenance.
7.10.2 Key Words for Kids
| Word | What It Means |
|---|---|
| Quality Score | A grade for how trustworthy data is – like a food freshness rating |
| Quarantine | Setting aside suspicious data for investigation – like putting a bruised apple aside to check later |
| Dashboard | A screen showing all the important numbers at a glance – like a car dashboard |
Worked Example: Implementing Quality Scoring at Ingestion
Scenario: An industrial IoT system ingests 50,000 sensor readings/minute. Implement real-time quality scoring to filter bad data before storage.
Advanced Quality Scoring Function (extends the basic version from above with pressure validation and rate-of-change detection):
```sql
CREATE OR REPLACE FUNCTION calculate_quality_score_advanced(
    temp DOUBLE PRECISION,
    humidity DOUBLE PRECISION,
    pressure DOUBLE PRECISION,
    p_device_id TEXT,
    reading_time TIMESTAMPTZ,
    last_reading TIMESTAMPTZ
) RETURNS TABLE(score INTEGER, flags JSONB) AS $$
DECLARE
    s INTEGER := 100;
    f JSONB := '{}'::JSONB;
    age_seconds DOUBLE PRECISION;
    rate_of_change DOUBLE PRECISION;
BEGIN
    -- Check 1: Temperature range (-50 to 50 °C)
    IF temp < -50 OR temp > 50 THEN
        s := s - 30;
        f := f || '{"temp_range": false}'::JSONB;
    ELSE
        f := f || '{"temp_range": true}'::JSONB;
    END IF;

    -- Check 2: Humidity range (0-100%)
    IF humidity < 0 OR humidity > 100 THEN
        s := s - 30;
        f := f || '{"humidity_range": false}'::JSONB;
    ELSE
        f := f || '{"humidity_range": true}'::JSONB;
    END IF;

    -- Check 3: Pressure range (300-1100 hPa, typical for Earth)
    IF pressure < 300 OR pressure > 1100 THEN
        s := s - 20;
        f := f || '{"pressure_range": false}'::JSONB;
    ELSE
        f := f || '{"pressure_range": true}'::JSONB;
    END IF;

    -- Check 4: Data freshness (< 5 minutes)
    age_seconds := EXTRACT(EPOCH FROM (NOW() - reading_time));
    IF age_seconds > 300 THEN
        s := s - 20;
        f := f || '{"freshness": false}'::JSONB;
    ELSE
        f := f || '{"freshness": true}'::JSONB;
    END IF;

    -- Check 5: Rate of change (temp shouldn't jump >10 °C in 1 minute)
    IF last_reading IS NOT NULL AND reading_time > last_reading THEN
        rate_of_change := ABS(temp - (
            SELECT prev.temperature
            FROM sensor_readings prev
            WHERE prev.device_id = p_device_id
              AND prev.time = last_reading
            LIMIT 1
        )) / EXTRACT(EPOCH FROM (reading_time - last_reading));
        IF rate_of_change > 0.167 THEN  -- 10 °C / 60 sec = 0.167 °C/sec
            s := s - 20;
            f := f || '{"rate_of_change": false}'::JSONB;
        ELSE
            f := f || '{"rate_of_change": true}'::JSONB;
        END IF;
    END IF;

    RETURN QUERY SELECT GREATEST(s, 0), f;
END;
$$ LANGUAGE plpgsql;
```
Example Usage:
```sql
-- Insert with automatic quality scoring
WITH quality AS (
    SELECT * FROM calculate_quality_score_advanced(
        23.5,                        -- temperature
        55.0,                        -- humidity
        1013.2,                      -- pressure
        'sensor_001',                -- device_id
        NOW(),                       -- current timestamp
        NOW() - INTERVAL '1 minute'  -- last reading timestamp
    )
)
INSERT INTO sensor_readings
    (time, device_id, temperature, humidity, quality_score, quality_flags)
SELECT NOW(), 'sensor_001', 23.5, 55.0, score, flags
FROM quality;
```
Performance Analysis:
| Validation Type | Execution Time | Notes |
|---|---|---|
| Range checks (temp, humidity, pressure) | ~0.01 ms | Simple comparisons, negligible |
| Timestamp freshness | ~0.02 ms | Single function call, negligible |
| Rate-of-change (1 indexed lookup) | ~0.5 ms | Queries previous reading by PK |
| **Total per-row overhead** | **~0.53 ms** | See note below |
Throughput impact note: The 0.53 ms per-row overhead does not directly reduce throughput from 20K to ~1.7K rows/sec. In practice, the ingestion pipeline uses batched inserts and connection pooling – multiple rows are validated concurrently across database connections. With 10 concurrent connections, the system sustains ~18,800 rows/sec (a ~6% reduction from the 20K baseline), not the ~91% reduction that naive per-row serial calculation would suggest. Always benchmark with your actual pipeline architecture.
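The arithmetic behind that note can be checked directly. This sketch models validation overhead spread across parallel connections; the simple min-of-limits model and the connection count are simplifying assumptions, not a substitute for benchmarking:

```python
def effective_throughput(per_row_overhead_ms, baseline_rows_per_sec, connections):
    """Sustainable rows/sec when per-row validation runs on parallel connections."""
    # One connection alone can validate at most 1000 / overhead rows per second
    per_conn_limit = 1000 / per_row_overhead_ms
    # The pipeline is bounded by the slower of validation capacity and baseline
    return min(baseline_rows_per_sec, per_conn_limit * connections)

serial = effective_throughput(0.53, 20_000, 1)     # naive serial view: ~1,887 rows/sec
parallel = effective_throughput(0.53, 20_000, 10)  # 10 connections: ~18,900 rows/sec
```

The serial figure is where the "~91% reduction" fear comes from; with ten connections the validation capacity nearly matches the 20K baseline, consistent with the ~6% reduction cited above.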
Configure your quality scoring thresholds based on data criticality and cost of false positives vs false negatives:
| Data Criticality | Reject Threshold | Quarantine Threshold | Flag Threshold | Reasoning |
|---|---|---|---|---|
| Safety-Critical (gas sensors) | <50 | 50-70 | 70-90 | Cannot tolerate bad data; err on side of caution |
| Operational (temperature) | <30 | 30-50 | 50-70 | Some tolerance for transient issues |
| Best-Effort (traffic counts) | <20 | 20-40 | 40-60 | Data loss acceptable; avoid over-filtering |
| Compliance (billing meters) | <60 | 60-80 | 80-95 | Regulatory requirements demand high quality |
Cost-Benefit Analysis:
Example: Industrial pressure sensor monitoring (1M readings/day, automated triage with manual review of flagged batches)
| Threshold | False Positive Rate | False Negative Rate | Investigation Cost | Missed-Alert Risk Cost | Total Cost |
|---|---|---|---|---|---|
| Reject <30 | 0.1% (1K/day) | 2% (20K/day missed) | $100/month | $50K/month | $50,100 |
| Reject <50 | 0.5% (5K/day) | 0.5% (5K/day missed) | $500/month | $12.5K/month | $13,000 |
| Reject <70 | 2% (20K/day) | 0.1% (1K/day missed) | $2K/month | $2.5K/month | $4,500 |
| Reject <90 | 10% (100K/day) | 0.01% (100/day missed) | $10K/month | $250/month | $10,250 |
Investigation costs assume automated batch triage (grouping false positives by device/time window) with manual review of flagged batches, not individual per-reading human investigation.
Optimal: Reject <70 minimizes total cost ($4,500/month) by balancing false positive investigation cost vs missed alert risk cost.
How to calibrate:

1. **Baseline normal operation (1 week):**
   - Measure the quality score distribution
   - Identify the P95 and P99 percentiles
   - Set the reject threshold at P1 (99% of normal data passes)
2. **Inject known bad data:**
   - Out-of-range values
   - Stale timestamps
   - Duplicate records
   - Measure the detection rate (true positive rate)
3. **Tune thresholds iteratively:**
   - Too many false positives? Lower the reject threshold
   - Missing too many issues? Raise the reject threshold
   - Validate with ROC curve analysis
4. **Separate thresholds by sensor type:**
   - High-precision sensors: stricter thresholds
   - Noisy sensors: looser thresholds
   - Critical sensors: err on the side of rejection
Key Insight: One-size-fits-all thresholds waste effort investigating false positives OR miss real issues. Calibrate per deployment.
Common Mistake: Silent Data Quality Degradation
The Error: A temperature sensor’s calibration drifts, reporting 2C higher than actual over 6 months. No quality monitoring detects the gradual drift because it’s within range checks.
What Goes Wrong:
Timeline:

- Month 0: Sensor accurate (reports 20 °C, actual 20 °C)
- Month 1: +0.3 °C drift (reports 20.3 °C, actual 20 °C) – still within the ±5 °C tolerance
- Month 2: +0.7 °C drift (reports 20.7 °C)
- Month 3: +1.2 °C drift (reports 21.2 °C)
- Month 6: +2.0 °C drift (reports 22.0 °C) – passes range checks but is WRONG
Impact:
HVAC system overcools by 2C -> 15% energy waste ($1,200/year)
Historical trends show false “warming” pattern
Anomaly detection trains on drifted data -> broken model
Only discovered when technician manually checks calibration
The Fix: Cross-validation with peer sensors:
```sql
-- Detect drift by comparing with spatial neighbors
WITH sensor_stats AS (
    SELECT
        sensor_id,
        AVG(temperature) AS avg_temp,
        STDDEV(temperature) AS stddev_temp
    FROM sensor_readings
    WHERE time > NOW() - INTERVAL '7 days'
    GROUP BY sensor_id
),
peer_comparison AS (
    SELECT
        s1.sensor_id,
        s1.avg_temp,
        AVG(s2.avg_temp) AS peer_avg,
        ABS(s1.avg_temp - AVG(s2.avg_temp)) AS deviation
    FROM sensor_stats s1
    JOIN sensors loc1 ON s1.sensor_id = loc1.sensor_id
    JOIN sensors loc2 ON ST_Distance(loc1.location, loc2.location) < 50  -- 50 m radius
    JOIN sensor_stats s2 ON loc2.sensor_id = s2.sensor_id
    WHERE s1.sensor_id != s2.sensor_id
    GROUP BY s1.sensor_id, s1.avg_temp
)
SELECT sensor_id, avg_temp, peer_avg, deviation
FROM peer_comparison
WHERE deviation > 3  -- 3 °C deviation from peers = likely drift
ORDER BY deviation DESC;
```
Early Detection Strategies:
Peer validation: Compare with nearby sensors (same zone)
Temporal consistency: Temperature shouldn’t change by 10C in 1 minute
Physical constraints: Indoor temp 18-26C, outdoor -40 to +50C
Statistical outliers: Z-score > 3 from historical baseline
Rate-of-change limits: Max 2C/hour for building temp
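The statistical-outlier strategy above reduces to a z-score test against a historical baseline. A minimal sketch, with the 3-sigma threshold from the list and illustrative sample data:

```python
from statistics import mean, stdev

def is_statistical_outlier(reading, history, z_threshold=3.0):
    """Flag a reading whose z-score against the historical baseline exceeds the threshold."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # Flat baseline: any deviation at all is anomalous
        return reading != mu
    return abs(reading - mu) / sigma > z_threshold

# Baseline of recent readings from a stable indoor sensor (illustrative)
history = [20.1, 19.8, 20.3, 20.0, 19.9, 20.2, 20.1, 20.0]
is_statistical_outlier(26.0, history)  # far outside the baseline
is_statistical_outlier(20.2, history)  # within normal variation
```

In production the baseline window should be long enough to capture normal daily variation, otherwise legitimate diurnal swings get flagged.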
Calibration Alert System:
```sql
-- Alert when a sensor deviates from peers for 7+ days
SELECT sensor_id, 'Calibration drift suspected' AS alert
FROM (
    SELECT sensor_id, COUNT(*) AS drift_days
    FROM daily_peer_comparison
    WHERE deviation > 2
      AND time > NOW() - INTERVAL '30 days'
    GROUP BY sensor_id
) sub
WHERE drift_days >= 7;
```
Result: Detected drift after 2 weeks instead of 6 months, preventing $1,000 in wasted energy.
Key Insight: Range checks catch broken sensors, but peer validation catches drifting sensors.
7.11 Try It Yourself: Design a Quality Scoring System
Exercise: Smart Agriculture Quality Monitor
Scenario: You’re building a quality monitoring system for a smart agriculture deployment with 500 soil sensors. Each sensor reports every 5 minutes:
Soil moisture (0-100% volumetric water content)
Soil temperature (-10°C to 60°C, depending on climate)
Electrical conductivity (0-20 dS/m for salinity)
Battery voltage (2.0V - 4.2V for lithium cell)
Your tasks:

1. **Define quality checks for each sensor field:**
   - What range is valid for soil moisture? (Hint: sand rarely exceeds 45%, clay can reach 60%)
   - What rate-of-change is suspicious for soil temperature? (Hint: soil changes slowly — 1-2°C/hour max)
   - At what battery voltage should you flag readings as unreliable? (Hint: below 2.8V, ADC readings become inaccurate)
2. **Assign weights to each dimension:**
   - Which matters more for irrigation decisions: accuracy or timeliness?
   - Should a stale reading (20 minutes old) be rejected or flagged?
3. **Choose handling strategies by use case:**
   - Irrigation control (automated): Should you reject or quarantine low-quality data?
   - Seasonal trend analysis (offline): Can you accept flagged data?
   - Frost alert system (safety): What's your minimum quality threshold?

**Bonus:** Write the SQL `calculate_quality_score` function for soil sensors, adapting the pattern from this chapter. Consider: should a low battery voltage reduce the quality score for ALL fields, since the ADC may be unreliable?
Suggested Approach
**Quality check definitions:**

- Soil moisture: valid range 0-70% (reject >100%, flag >70% as possibly waterlogged)
- Soil temperature: valid range -10°C to 60°C, rate-of-change max 2°C/hour
- Electrical conductivity: valid range 0-20 dS/m, flag >8 dS/m as high salinity

**Handling strategies by use case:**

- Seasonal analysis: flag below 70 (include but mark — data analysts can filter)
- Frost alerts: reject below 70, with fallback to the nearest peer sensor reading
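One possible answer to the bonus task, sketched in Python rather than SQL for readability. The deduction weights are illustrative; note how the battery check degrades the whole score, since a starved ADC makes every field suspect:

```python
def soil_quality_score(moisture, temp_c, ec_ds_m, battery_v):
    """Composite 0-100 quality score for one soil sensor reading (illustrative weights)."""
    score = 100
    if not 0 <= moisture <= 70:      # volumetric water content, % (clay tops out ~60%)
        score -= 30
    if not -10 <= temp_c <= 60:      # plausible soil temperature range
        score -= 30
    if not 0 <= ec_ds_m <= 20:       # electrical conductivity, dS/m
        score -= 20
    if battery_v < 2.8:              # below 2.8 V the ADC is unreliable, so
        score -= 40                  # penalize ALL fields, not just battery
    return max(score, 0)

soil_quality_score(35.0, 18.5, 1.2, 3.9)  # healthy reading
soil_quality_score(85.0, 18.5, 1.2, 2.5)  # bad moisture + low battery
```

Translating this into a plpgsql function follows the same `s := s - N` deduction pattern as `calculate_quality_score_advanced` earlier in the chapter.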
7.12 Summary
Six dimensions of data quality (accuracy, completeness, consistency, timeliness, validity, uniqueness) must be monitored
Quality scoring at ingestion enables filtering and alerting on suspect data
Continuous aggregates provide real-time quality dashboards without performance impact
Three handling strategies: Reject (critical apps), Quarantine (investigation), Flag (preserve for analysis)
Performance trade-offs require tiered validation: fast checks at ingestion, complex checks in batch
Schema versioning from day one prevents breaking changes during firmware updates
7.13 Concept Relationships
See Also
**Prerequisites** – read these first:

- Time-Series Databases – foundation for understanding data quality metrics in TSDBs
- Database Selection Framework – choosing databases that support quality validation

**Related Concepts** – explore these next:

- Sharding Strategies – quality checks impact performance in distributed systems
- Stream Processing Fundamentals – real-time quality validation before storage
Practical Applications:
- Worked Examples – fleet management with quality scoring implementation
- Estimate how many readings your system should expect and how many might be missing, based on device count, sampling interval, and observed completeness rate.
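That completeness estimate is a short calculation. A sketch with purely illustrative numbers (500 devices reporting every 5 minutes, as in the agriculture exercise):

```python
def expected_readings(device_count, interval_seconds, window_hours):
    """How many readings the system should receive over the window."""
    return device_count * int(window_hours * 3600 / interval_seconds)

def estimated_missing(device_count, interval_seconds, window_hours, completeness_rate):
    """Readings likely lost, given an observed completeness rate in [0.0, 1.0]."""
    expected = expected_readings(device_count, interval_seconds, window_hours)
    return round(expected * (1 - completeness_rate))

# 500 soil sensors reporting every 300 seconds over a 24-hour window
expected = expected_readings(500, 300, 24)       # 288 readings/device/day
missing = estimated_missing(500, 300, 24, 0.97)  # gap at 97% completeness
```

Comparing this expectation against actual row counts per device is the simplest completeness check a quality dashboard can run.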