After completing this chapter, you will be able to:

- Define and measure the six dimensions of data quality for IoT
- Implement data quality checks at ingestion time
- Design quality monitoring dashboards and alerts
- Handle low-quality data with appropriate strategies
- Balance data quality validation with database performance
6.1 Introduction
Storing IoT data is only valuable if the data quality is high. This chapter covers essential data quality metrics and monitoring strategies for IoT database systems.
6.2 The Six Dimensions of Data Quality
Note: Data Quality Framework for IoT
IoT systems must monitor data quality across six critical dimensions:
| Dimension | Definition | IoT Example | Impact of Poor Quality |
|---|---|---|---|
| Accuracy | Data correctly represents reality | Sensor reports 25°C when the actual temperature is 24°C | Wrong decisions, false alerts |
| Completeness | All required data is present | Temperature readings but missing humidity | Incomplete analysis, ML failures |
| Consistency | Data is uniform across systems | Device A uses Celsius, Device B Fahrenheit | Aggregation errors, wrong comparisons |
| Timeliness | Data is up-to-date and available when needed | Sensor data arrives 10 minutes late | Missed real-time opportunities |
| Validity | Data conforms to defined formats/rules | GPS coordinates outside valid range | Database errors, query failures |
| Uniqueness | No duplicate records exist | Same reading stored twice | Inflated counts, wasted storage |
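Several of these dimensions can be measured directly from a batch of readings. A minimal Python sketch (the field names, the -50 to 50°C validity range, and the 5-minute freshness window are assumptions for illustration, not from any standard):

```python
from datetime import datetime, timedelta, timezone

def score_batch(readings, required_fields=("temperature", "humidity"),
                freshness=timedelta(minutes=5), now=None):
    """Score a batch of reading dicts on four of the six dimensions.

    Each metric is a fraction in [0, 1]; 1.0 means no issues detected.
    """
    now = now or datetime.now(timezone.utc)
    n = len(readings)
    # Completeness: all required fields present and non-null
    complete = sum(all(r.get(f) is not None for f in required_fields)
                   for r in readings)
    # Uniqueness: distinct (device_id, timestamp) pairs
    unique = len({(r["device_id"], r["time"]) for r in readings})
    # Validity: temperature within a plausible physical range
    valid = sum(r.get("temperature") is not None
                and -50 <= r["temperature"] <= 50 for r in readings)
    # Timeliness: reading arrived within the freshness window
    timely = sum(now - r["time"] <= freshness for r in readings)
    return {
        "completeness": complete / n,
        "uniqueness": unique / n,
        "validity": valid / n,
        "timeliness": timely / n,
    }
```

Accuracy and cross-system consistency cannot be computed from a single batch; they need reference measurements or comparisons across devices.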
6.3 Implementing Data Quality Checks at Ingestion
Data quality should be validated as data enters the database. Typical alert thresholds for ingestion-time monitoring:

- Missing data > 10%: network or device failure likely
- Duplicate rate > 1%: application or network retry issues
- Outlier rate > 5%: sensor calibration or environmental anomaly
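These thresholds can be evaluated per device over a time window. A minimal sketch in Python, assuming a 3-sigma outlier rule against baseline statistics (the function name, signature, and alert wording are illustrative):

```python
def quality_alerts(readings, expected_count, baseline_mean, baseline_std):
    """Evaluate one device's window of (timestamp, value) readings
    against the alert thresholds above.

    expected_count: readings the device should have sent in the window.
    baseline_mean/baseline_std: historical stats for outlier detection.
    """
    n = len(readings)
    missing_rate = max(0.0, 1 - n / expected_count)
    duplicate_rate = (n - len(set(readings))) / n if n else 0.0
    outliers = sum(abs(v - baseline_mean) > 3 * baseline_std
                   for _, v in readings)
    outlier_rate = outliers / n if n else 0.0

    alerts = []
    if missing_rate > 0.10:
        alerts.append("missing > 10%: network or device failure likely")
    if duplicate_rate > 0.01:
        alerts.append("duplicates > 1%: application or network retry issue")
    if outlier_rate > 0.05:
        alerts.append("outliers > 5%: calibration or environmental anomaly")
    return missing_rate, duplicate_rate, outlier_rate, alerts
```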
Implementation with continuous aggregates:
```sql
-- Create materialized view for hourly quality metrics
CREATE MATERIALIZED VIEW hourly_quality_summary
WITH (timescaledb.continuous) AS
SELECT
    device_id,
    time_bucket('1 hour', time) AS hour,
    AVG(quality_score) AS avg_quality,
    COUNT(*) AS reading_count,
    COUNT(*) FILTER (WHERE quality_score < 70) AS poor_quality_count,
    AVG(temperature) AS avg_temperature,
    STDDEV(temperature) AS stddev_temperature
FROM sensor_readings
GROUP BY device_id, hour;

-- Auto-refresh every 5 minutes
SELECT add_continuous_aggregate_policy('hourly_quality_summary',
    start_offset => INTERVAL '1 day',
    end_offset => INTERVAL '1 hour',
    schedule_interval => INTERVAL '5 minutes');

-- Query for quick dashboard updates
SELECT * FROM hourly_quality_summary
WHERE hour > NOW() - INTERVAL '24 hours'
  AND avg_quality < 80
ORDER BY avg_quality ASC;
```
6.6 Data Quality vs. Database Performance Trade-offs
Tradeoff: Hot/Warm/Cold Storage Tiers for IoT Data
Option A: Single-Tier (Hot Storage Only)
- All data on a fast SSD-backed database
- Query latency: 1-10 ms for any time range
- Storage cost: $0.10-0.25/GB/month (AWS RDS gp3 SSD)
- 1 TB for 1 year: $1,200-3,000/year
- Operational simplicity: a single database to manage

Option B: Multi-Tier (Hot/Warm/Cold)
- Data moves to cheaper storage as it ages
- Hot tier (SSD, 0-7 days): $0.10-0.25/GB/month, 1-10 ms latency
- Warm tier (HDD/compressed, 7-90 days): $0.02-0.05/GB/month, 50-200 ms latency
- Cold tier (S3/Glacier, 90+ days): $0.004-0.01/GB/month, 100 ms-12 hr latency
- 1 TB for 1 year: $300-800/year (75-80% cost reduction)

Decision Factors:
- Choose Single-Tier when: the dataset is under 100 GB, query patterns are uniform across all time ranges, and operational simplicity is paramount
- Choose Multi-Tier when: the dataset exceeds 500 GB, 90%+ of queries target recent data (the last 7 days), storage cost is a concern, or compliance requires multi-year retention
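The cost gap can be estimated with simple arithmetic. The sketch below uses the low end of the quoted prices and assumed tier age fractions for one year of retention; it models storage cost only, so real bills (which include request and compute charges) will differ:

```python
def annual_cost(total_gb, tiers):
    """tiers: list of (fraction_of_data, price_per_gb_month) pairs."""
    monthly = sum(total_gb * frac * price for frac, price in tiers)
    return monthly * 12

# Option A: 1 TB entirely on hot SSD at $0.10/GB/month
single_tier = annual_cost(1000, [(1.00, 0.10)])

# Option B: age fractions are assumptions (hot ~7/365, warm ~83/365, cold rest)
multi_tier = annual_cost(1000, [(0.02, 0.10),    # hot SSD
                                (0.23, 0.02),    # warm HDD/compressed
                                (0.75, 0.004)])  # cold object storage
```

With these inputs the single-tier option lands at the $1,200/year low end quoted above, and the multi-tier option comes out far cheaper.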
Quality validation has computational costs:
| Quality Check | Performance Impact | When to Apply |
|---|---|---|
| Range validation | Low (simple comparison) | Every insert (always) |
| Duplicate detection | Medium (index lookup) | Every insert (recommended) |
| Outlier detection | High (statistical aggregation) | Batch processing (hourly/daily) |
| Cross-sensor consistency | Very high (joins across datasets) | Offline validation (weekly) |
Optimization strategy:

- Ingestion time: fast checks only (range, format, uniqueness)
- Real-time (1-5 minute delay): lightweight statistical checks (moving-average validation)
- Batch (hourly/daily): expensive checks such as outlier detection
- Offline (weekly): cross-sensor consistency validation
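The first two tiers can be sketched in application code. A minimal example, with a range check for ingestion and a moving-average jump check for the near-real-time tier (the window size, temperature bounds, and jump threshold are illustrative assumptions):

```python
from collections import deque

class TieredValidator:
    """Sketch of tiered validation: cheap checks on every insert,
    a lightweight moving-average check over a short window."""

    def __init__(self, window=10, max_jump=5.0):
        self.history = deque(maxlen=window)
        self.max_jump = max_jump

    def fast_check(self, temperature):
        # Ingestion tier: simple range validation (always on)
        return temperature is not None and -50.0 <= temperature <= 50.0

    def moving_average_check(self, temperature):
        # Near-real-time tier: flag jumps far from the recent average
        if len(self.history) < 3:
            self.history.append(temperature)
            return True  # not enough history to judge yet
        avg = sum(self.history) / len(self.history)
        self.history.append(temperature)
        return abs(temperature - avg) <= self.max_jump
```

Outlier detection and cross-sensor consistency stay out of this hot path; they run as scheduled batch jobs against the database.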
Knowledge Check: A sensor reports temperature as -273.5°C (below absolute zero). Your IoT system must handle this invalid reading. Which data quality strategy is MOST appropriate for a predictive maintenance application?

- Reject the reading at ingestion; invalid data should never enter the database. (Incorrect: rejection means you lose visibility into sensor health. The reading indicates a sensor malfunction that needs investigation.)
- Accept the reading with a quality flag and alert the maintenance team. (Correct: flagging preserves the data for debugging while marking it as suspect, and the alert enables maintenance to investigate the sensor issue. This is the "Flag" strategy.)
- Automatically correct the value to the last known good reading. (Incorrect: auto-correction hides problems and could propagate errors. If the sensor is failing, you need to detect this, not mask it.)
- Discard the reading silently to avoid contaminating the dataset. (Incorrect: silent discarding loses valuable diagnostic information. A -273.5°C reading strongly suggests sensor failure that needs attention.)
6.7 Handling Low-Quality Data
When quality issues are detected, choose the appropriate strategy:
Strategy 1: Reject

When to use: critical applications where bad data is worse than no data.

```sql
-- Reject invalid readings at ingestion
CREATE OR REPLACE FUNCTION reject_invalid_data()
RETURNS TRIGGER AS $$
BEGIN
    -- Calculate quality score
    NEW.quality_score := calculate_quality_score(
        NEW.temperature, NEW.humidity, NEW.time);

    -- Reject if quality too low
    IF NEW.quality_score < 50 THEN
        RAISE EXCEPTION 'Data quality too low: %', NEW.quality_score;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER check_data_quality
    BEFORE INSERT ON sensor_readings
    FOR EACH ROW EXECUTE FUNCTION reject_invalid_data();
```
Strategy 2: Quarantine

When to use: you need to investigate issues while keeping the pipeline flowing.

```sql
-- Create quarantine table
CREATE TABLE sensor_readings_quarantine (
    LIKE sensor_readings INCLUDING ALL,
    quarantine_reason TEXT,
    quarantine_time TIMESTAMPTZ DEFAULT NOW()
);

-- Modified trigger: quarantine instead of reject
CREATE OR REPLACE FUNCTION quarantine_invalid_data()
RETURNS TRIGGER AS $$
BEGIN
    NEW.quality_score := calculate_quality_score(
        NEW.temperature, NEW.humidity, NEW.time);

    IF NEW.quality_score < 50 THEN
        -- Move to quarantine instead of rejecting
        INSERT INTO sensor_readings_quarantine
        VALUES (NEW.*, 'Low quality score: ' || NEW.quality_score, NOW());
        RETURN NULL;  -- Skip insert into main table
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;
```
Strategy 3: Flag

When to use: you want to allow analysis while marking suspect data.

```sql
-- Store all data but flag quality issues
CREATE OR REPLACE FUNCTION flag_data_quality()
RETURNS TRIGGER AS $$
BEGIN
    NEW.quality_score := calculate_quality_score(
        NEW.temperature, NEW.humidity, NEW.time);

    -- Build quality flags JSON
    NEW.quality_flags := jsonb_build_object(
        'temp_valid',      NEW.temperature BETWEEN -50 AND 50,
        'humidity_valid',  NEW.humidity BETWEEN 0 AND 100,
        'timestamp_fresh', EXTRACT(EPOCH FROM (NOW() - NEW.time)) < 300,
        'overall_quality', CASE
            WHEN NEW.quality_score >= 80 THEN 'good'
            WHEN NEW.quality_score >= 50 THEN 'acceptable'
            ELSE 'poor'
        END
    );
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

-- Analytics can filter based on flags
SELECT * FROM sensor_readings
WHERE quality_flags->>'overall_quality' = 'good'
  AND time > NOW() - INTERVAL '1 day';
```
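The triggers in this section call a calculate_quality_score function defined elsewhere in the chapter. As a rough sketch of what such scoring might look like, here is a Python analogue; the penalty weights are hypothetical, and the checks mirror the ranges used in the flag trigger:

```python
from datetime import datetime, timezone

def calculate_quality_score(temperature, humidity, ts, now=None):
    """Return a 0-100 score, deducting points per failed check.

    Penalty weights (40/30/20) are illustrative, not from the chapter.
    """
    now = now or datetime.now(timezone.utc)
    score = 100
    if temperature is None or not -50 <= temperature <= 50:
        score -= 40  # implausible or missing temperature
    if humidity is None or not 0 <= humidity <= 100:
        score -= 30  # implausible or missing humidity
    if (now - ts).total_seconds() > 300:
        score -= 20  # stale reading (older than 5 minutes)
    return max(score, 0)
```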
A useful dashboard panel here is the outlier rate: show the percentage of readings that are outliers over time, which helps identify sensor drift or calibration issues.
6.9 Common Pitfalls
Common Pitfall: Payload Size Too Large for Protocol/Network
The mistake: Sending full JSON payloads over constrained networks like LoRaWAN or Sigfox, exceeding protocol limits and causing transmission failures.
Symptoms:
- Message fragmentation across multiple transmissions
- Transmission failures and dropped messages
- High battery consumption from repeated send attempts
- Gateway buffer overflows
Why it happens: Developers design payloads on desktop environments with ample bandwidth, then deploy to constrained networks without considering size limits. LoRaWAN allows ~50-250 bytes depending on spreading factor; Sigfox allows only 12 bytes.
The fix:
```python
# Use binary encoding for LPWAN
import struct

# Pack data into 10 bytes instead of a 150+ byte JSON document
payload = struct.pack(
    ">HhHHH",                      # 2+2+2+2+2 = 10 bytes, within Sigfox's 12-byte limit
    device_id_short,               # 2 bytes
    round(temp * 10),              # 2 bytes signed, 0.1 C resolution
    round(humidity * 10),          # 2 bytes, 0.1 % resolution
    round((pressure - 900) * 10),  # 2 bytes, 0.1 hPa offset from 900 hPa
    round(battery * 100),          # 2 bytes, volts * 100
)
```
Prevention: Know your protocol’s payload limits before designing data schemas. Use CBOR instead of JSON for constrained networks.
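As a self-contained sketch of the full round trip, the snippet below packs a 10-byte payload on the device and unpacks it on the server; the field layout and scale factors are illustrative assumptions:

```python
import struct

# Assumed 10-byte layout: device id, temperature*10 (signed),
# humidity*10, (pressure - 900)*10, battery voltage*100.
FMT = ">HhHHH"  # 2+2+2+2+2 = 10 bytes, within Sigfox's 12-byte limit

def encode(device_id, temp, humidity, pressure, battery):
    """Device side: pack scaled integer fields into a compact payload."""
    return struct.pack(FMT, device_id, round(temp * 10),
                       round(humidity * 10), round((pressure - 900) * 10),
                       round(battery * 100))

def decode(payload):
    """Server side: reverse the scaling to recover engineering units."""
    device_id, t, h, p, b = struct.unpack(FMT, payload)
    return {"device_id": device_id, "temperature": t / 10,
            "humidity": h / 10, "pressure": 900 + p / 10,
            "battery": b / 100}
```

Note the use of round() rather than int(): truncation of values like 3.71 * 100 (which floats represent as 370.999...) would silently shift readings by one least-significant unit.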
Common Pitfall: Missing Schema Versioning
The mistake: Changing data format without versioning, breaking all existing consumers when devices send incompatible payloads.
Symptoms:
- Parsing errors after firmware updates
- Old and new devices sending incompatible data
- Breaking changes cascade through the system
Why it happens: Teams assume all devices will update simultaneously. In IoT, devices may run old firmware for years.
The fix:
```python
# Include a version field in every payload
payload = {"version": 2, "temperature": 22.5, "unit": "C"}

# Consumer handles multiple versions
def parse(msg):
    version = msg.get("version", 1)  # unversioned payloads are v1
    if version == 1:
        return {"temperature": msg["temp"]}  # old format used "temp"
    return msg  # new format passes through
```
Prevention: Include version field in all payloads from day one. Use schema registries (Apache Avro, Protobuf) for complex systems.
Knowledge Check: Your IoT firmware team releases an update that changes the telemetry JSON format from {"temp": 22.5} to {"temperature": 22.5, "unit": "C"}. Half your devices update immediately, half remain on old firmware. What is the BEST approach to handle this?

- Require all devices to update before accepting new-format data. (Incorrect: in IoT deployments, forcing simultaneous updates is often impossible. Some devices may be offline, in remote locations, or on old hardware.)
- Include a version field and implement backward-compatible parsing on the server. (Correct: add "version" to payloads and write parsers that handle both formats. This allows gradual rollout without breaking existing devices.)
- Create two separate database tables for old and new formats. (Incorrect: this creates data fragmentation and complicates queries. Better to normalize at ingestion with version-aware parsing.)
- Drop all data from old-format devices until they update. (Incorrect: this loses valuable data from devices that may never update. The server should be flexible enough to handle multiple formats.)
6.10 Common Misconception
Common Misconception: "NoSQL is Always Better for IoT Scale"
The Myth: “IoT generates massive data volumes, so we must use NoSQL databases like MongoDB or Cassandra. SQL databases can’t handle IoT scale.”
The Reality: SQL databases with time-series extensions (TimescaleDB) routinely handle millions of writes per second and petabytes of data, matching or exceeding NoSQL performance for time-series workloads.
Real-World Data:
| Database | Write Throughput | Storage Capacity | Query Performance (time-range) |
|---|---|---|---|
| TimescaleDB | 1M+ rows/sec (single node) | Petabytes with compression | 10-100x faster than vanilla PostgreSQL |
| InfluxDB | 500K-1M points/sec | Hundreds of TB | Optimized for time queries |
| MongoDB | 100K-500K docs/sec | Petabytes with sharding | Slower for time-range queries |
| Cassandra | 1M+ writes/sec (distributed) | Petabytes | Poor for time-range aggregations |
When NoSQL wins:
- Flexible schemas that evolve rapidly (device metadata varying by type)
- Distributed availability over consistency (global edge deployments)
- Massive write distribution across geographic regions (Cassandra)

When SQL + TimescaleDB wins:
- Time-series sensor data at regular intervals (90% of IoT use cases)
- Complex aggregations (hourly/daily summaries, percentiles)
- SQL compatibility (existing tools, ORMs, data analysts)
- Cost optimization (compression, continuous aggregates)
The Right Approach: Use polyglot persistence: TimescaleDB for sensor data, PostgreSQL for metadata, Redis for caching, MongoDB for flexible configs.
6.11 Summary
- Six dimensions of data quality (accuracy, completeness, consistency, timeliness, validity, uniqueness) must be monitored
- Quality scoring at ingestion enables filtering and alerting on suspect data
- Continuous aggregates provide real-time quality dashboards without performance impact
- Three handling strategies: reject (critical apps), quarantine (investigation), flag (preserve for analysis)
- Performance trade-offs require tiered validation: fast checks at ingestion, complex checks in batch
- Schema versioning from day one prevents breaking changes during firmware updates
6.12 What's Next
Continue with Sharding Strategies to learn how to scale IoT databases horizontally for massive data volumes.