```mermaid
%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ecf0f1'}}}%%
flowchart LR
    subgraph Input["Raw Data"]
        S1[Sensor<br/>Readings]
    end
    subgraph Validate["Stage 1: Validate"]
        V1[Range Check]
        V2[Rate-of-Change]
        V3[Plausibility]
    end
    subgraph Clean["Stage 2: Clean"]
        C1[Outlier Detection]
        C2[Missing Value<br/>Imputation]
        C3[Noise Filtering]
    end
    subgraph Transform["Stage 3: Transform"]
        T1[Normalization]
        T2[Scaling]
        T3[Feature<br/>Engineering]
    end
    subgraph Output["Clean Data"]
        O1[Analysis<br/>Ready]
    end
    S1 --> V1
    V1 --> V2
    V2 --> V3
    V3 --> C1
    C1 --> C2
    C2 --> C3
    C3 --> T1
    T1 --> T2
    T2 --> T3
    T3 --> O1
    style S1 fill:#E67E22,stroke:#2C3E50,color:#fff
    style V1 fill:#2C3E50,stroke:#16A085,color:#fff
    style V2 fill:#2C3E50,stroke:#16A085,color:#fff
    style V3 fill:#2C3E50,stroke:#16A085,color:#fff
    style C1 fill:#16A085,stroke:#2C3E50,color:#fff
    style C2 fill:#16A085,stroke:#2C3E50,color:#fff
    style C3 fill:#16A085,stroke:#2C3E50,color:#fff
    style T1 fill:#7F8C8D,stroke:#2C3E50,color:#fff
    style T2 fill:#7F8C8D,stroke:#2C3E50,color:#fff
    style T3 fill:#7F8C8D,stroke:#2C3E50,color:#fff
    style O1 fill:#27AE60,stroke:#2C3E50,color:#fff
```
1306 Data Validation and Outlier Detection
1306.1 Learning Objectives
By the end of this chapter, you will be able to:
- Identify Data Quality Issues: Recognize common data quality problems in IoT sensor streams including outliers, missing values, noise, and drift
- Implement Validation Rules: Create real-time data validation pipelines that detect and flag invalid sensor readings
- Apply Outlier Detection: Use statistical methods (Z-score, IQR, MAD) to identify and handle anomalous data points
- Design Context-Aware Validation: Build adaptive validation thresholds that account for environmental conditions
1306.2 Prerequisites
Before diving into this chapter, you should be familiar with:
- Edge Data Acquisition: Understanding how sensor data is collected at the edge provides context for where preprocessing fits in the pipeline
- Sensor Fundamentals: Knowledge of sensor characteristics helps understand why data quality issues arise
- Signal Processing Essentials: Basic concepts of filtering and signal conditioning
Data quality is like being a detective who checks if clues are real or fake before solving a mystery!
1306.2.1 The Sensor Squad Adventure: The Case of the Confused Readings
Sammy the Sensor was the star detective at Sensor Squad Headquarters. Every day, sensors from all over the city sent readings to Sammy for the Big Weather Report.
One morning, something STRANGE happened!
“Temperature Terry says it’s 500 degrees in the park!” gasped Lila the LED. “But Humidity Hannah says the park is underwater AND it’s -100 degrees there!”
Sammy knew something was VERY wrong. “That’s impossible! Water can’t be frozen AND boiling at the same time. And nothing on Earth is 500 degrees outside of volcanoes!”
Bella the Battery suggested, “Maybe the sensors made mistakes? Or maybe a squirrel chewed Terry’s wires?”
Sammy put on his detective hat and created the “Data Quality Checklist”:
Step 1 - VALIDATE (Is it even possible?) “Can a park really be 500 degrees? NO! That’s hotter than an oven! REJECTED!”
Step 2 - CLEAN (Remove the mistakes) “Hannah’s -100 degrees? Let me check the sensors nearby. They all say 25 degrees. Hannah is the odd one out - she needs a checkup!”
After checking everything, Sammy had CLEAN, TRUSTWORTHY data for the Weather Report.
Max the Microcontroller was impressed. “Without your detective work, we would have told everyone the park was on fire AND frozen at the same time!”
The Sensor Squad learned: Bad data in = Bad decisions out! Always check your data before trusting it!
1306.2.2 Key Words for Kids
| Word | What It Means |
|---|---|
| Data Quality | How good and trustworthy your information is - like checking if food is fresh before eating |
| Outlier | A reading that’s WAY different from the others - like one kid saying they’re 100 feet tall |
| Validation | Checking if a reading is even POSSIBLE - humans can’t be 100 feet tall! |
1306.2.3 Try This at Home!
The Data Detective Game:
- Ask 5 family members their height (or age, or favorite number 1-10)
- Write down their answers
- Now add ONE obviously wrong answer: “Uncle Bob is 50 feet tall!”
Play detective:
- Which answer is the OUTLIER? (50 feet - impossible for a human!)
- How did you know it was wrong? (No human is that tall - that's VALIDATION)
Now you’re a Data Quality Detective just like Sammy!
Think of sensor data like ingredients in a recipe. Even the best chef cannot make a great dish from spoiled or contaminated ingredients. Similarly, even the most sophisticated analytics and machine learning models cannot produce reliable results from low-quality sensor data.
Common data quality problems in IoT:
| Problem | Example | Impact |
|---|---|---|
| Outliers | Temperature reads -40C in a room | False alerts, wrong decisions |
| Missing Data | Sensor battery dies for 2 hours | Gaps in analysis, failed predictions |
| Noise | Vibration sensor picks up building HVAC | Obscures real signals |
| Drift | Humidity sensor slowly reads 5% high | Gradual accuracy loss |
| Invalid Range | Humidity shows 105% | Physically impossible readings |
Why preprocessing matters:
- Garbage in, garbage out: ML models amplify data problems
- Edge resources: Cleaning data at the source reduces bandwidth
- Real-time decisions: Invalid data triggers false alarms
- Long-term trends: Drift and bias corrupt historical analysis
Key question this chapter answers: “How do I detect invalid or anomalous sensor readings before they cause problems?”
Core Concept: Data validation is the first line of defense in data quality, checking whether sensor readings are physically possible and plausible before any further processing.
Why It Matters: A single corrupt sensor reading propagated through an IoT system can trigger false alarms, cause equipment shutdowns, or corrupt trained ML models. Validation catches the majority of data quality issues at the point of collection.
Key Takeaway: Always validate readings against physical bounds (a room sensor should never report 100C) and rate-of-change limits (indoor temperature cannot change 20C in one second). A reading that violates physical bounds is always wrong; one that fails a rate-of-change check almost always indicates a sensor fault.
1306.3 The Data Quality Pipeline
Data quality is the foundation of trustworthy IoT analytics. Raw sensor data is inherently noisy, incomplete, and sometimes outright wrong. This section explores the validation and outlier detection stages of data quality preprocessing.
In one sentence: Clean data at the edge before it travels - fixing bad data at the source costs a small fraction of fixing it downstream in the cloud.
Remember this rule: If a reading violates physical laws (temperature below absolute zero, humidity above 100%), it is always wrong.
1306.3.1 Pipeline Overview
This view helps determine the appropriate handling strategy for different data quality issues:
```mermaid
%% fig-alt: "Decision tree for handling data quality issues. Start with the type of problem detected. If value is outside physical bounds, discard and flag sensor fault. If value is statistically unusual but physically possible, check context - if contextually valid keep it, otherwise apply outlier treatment. If value is missing, determine cause - if sensor offline use forward-fill or interpolation, if transmission gap use backfill when data arrives. If value is noisy, apply appropriate filter based on signal characteristics."
%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ecf0f1'}}}%%
flowchart TD
    Start[Data Quality<br/>Issue Detected] --> Type{Issue Type?}
    Type -->|Out of Range| Range[Physical<br/>Bounds Violation]
    Type -->|Unusual Value| Stats[Statistical<br/>Outlier]
    Type -->|No Reading| Missing[Missing<br/>Data]
    Type -->|Fluctuating| Noise[Noisy<br/>Signal]
    Range --> Discard[Discard Reading<br/>Flag Sensor Fault]
    Stats --> Context{Contextually<br/>Valid?}
    Context -->|Yes| Keep[Keep Value<br/>Document]
    Context -->|No| Treat[Apply Outlier<br/>Treatment]
    Missing --> Cause{Cause<br/>Known?}
    Cause -->|Sensor Offline| Forward[Forward-Fill<br/>or Interpolate]
    Cause -->|Transmission Gap| Backfill[Backfill When<br/>Data Arrives]
    Noise --> Filter[Apply Appropriate<br/>Noise Filter]
    style Start fill:#2C3E50,stroke:#16A085,color:#fff
    style Discard fill:#E74C3C,stroke:#2C3E50,color:#fff
    style Keep fill:#27AE60,stroke:#2C3E50,color:#fff
    style Treat fill:#E67E22,stroke:#2C3E50,color:#fff
    style Forward fill:#16A085,stroke:#2C3E50,color:#fff
    style Backfill fill:#16A085,stroke:#2C3E50,color:#fff
    style Filter fill:#3498DB,stroke:#2C3E50,color:#fff
```
Use this decision tree to select the appropriate handling strategy for each type of data quality issue.
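The decision tree above can be sketched as a simple dispatch function. The issue labels, handler names, and return format here are illustrative assumptions, not a standard API; the placeholder moving average stands in for the filters covered in the next chapter.

```python
from statistics import median

def handle_quality_issue(issue, reading, window):
    """Route a flagged reading to a handling strategy.

    issue:   one of 'out_of_range', 'outlier', 'missing', 'noisy'
    reading: the raw value (or None if missing)
    window:  recent accepted values, used for fills and replacements
    Returns (action, value), where value is the reading to keep, or None.
    """
    if issue == 'out_of_range':
        # Physical-bounds violation: never usable - discard, flag the sensor
        return 'discard_flag_fault', None
    if issue == 'outlier':
        # Statistically unusual but physically possible: replace with
        # the window median (one common outlier treatment)
        return 'replace_median', median(window)
    if issue == 'missing':
        # Sensor offline: forward-fill with the last known good value
        return 'forward_fill', window[-1] if window else None
    if issue == 'noisy':
        # Smooth with a short moving average (placeholder for a real filter)
        vals = list(window) + [reading]
        return 'filtered', sum(vals[-5:]) / len(vals[-5:])
    return 'keep', reading

window = [21.8, 22.0, 22.1, 21.9]
print(handle_quality_issue('out_of_range', 500.0, window))
print(handle_quality_issue('outlier', 30.0, window))
```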
1306.4 Data Validation Techniques
1306.4.1 Range Validation
The simplest form of validation checks whether values fall within physically possible or operationally expected bounds:
```python
class RangeValidator:
    def __init__(self, sensor_type):
        """Define valid ranges for common sensor types"""
        self.ranges = {
            'temperature_indoor': (-10, 50),   # Celsius
            'temperature_outdoor': (-40, 60),  # Celsius
            'humidity': (0, 100),              # Percentage
            'pressure': (870, 1084),           # hPa (sea level)
            'light': (0, 100000),              # Lux
            'co2': (300, 5000),                # ppm
            'battery_voltage': (2.5, 4.2),     # Li-ion typical
            'soil_moisture': (0, 100),         # Percentage
        }
        self.min_val, self.max_val = self.ranges.get(
            sensor_type, (float('-inf'), float('inf'))
        )

    def validate(self, value):
        """Check if value is within valid range"""
        if value < self.min_val or value > self.max_val:
            return False, f"Out of range [{self.min_val}, {self.max_val}]"
        return True, "Valid"

# Example usage
temp_validator = RangeValidator('temperature_indoor')
print(temp_validator.validate(22.5))   # (True, 'Valid')
print(temp_validator.validate(-45.0))  # (False, 'Out of range [-10, 50]')
```
1306.4.2 Rate-of-Change Validation
Physical sensors cannot change instantaneously. A temperature sensor reading 22C one second and -40C the next indicates a fault, not an actual temperature change:
```python
class RateOfChangeValidator:
    def __init__(self, max_rate_per_second):
        """
        max_rate_per_second: Maximum expected change per second
        Example: Indoor temperature typically changes < 0.5C/min = 0.0083C/s
        """
        self.max_rate = max_rate_per_second
        self.last_value = None
        self.last_time = None

    def validate(self, value, timestamp):
        if self.last_value is None:
            self.last_value = value
            self.last_time = timestamp
            return True, "First reading"
        time_delta = timestamp - self.last_time
        if time_delta <= 0:
            return False, "Invalid timestamp"
        rate = abs(value - self.last_value) / time_delta
        # Update history (even on failure, so a genuine step change
        # is flagged once rather than on every subsequent reading)
        self.last_value = value
        self.last_time = timestamp
        if rate > self.max_rate:
            return False, f"Rate {rate:.4f}/s exceeds max {self.max_rate}/s"
        return True, f"Rate {rate:.4f}/s OK"

# Indoor temperature: max 0.5C per minute = 0.0083C/s
temp_roc = RateOfChangeValidator(max_rate_per_second=0.0083)
```
1306.4.3 Multi-Sensor Plausibility Checks
When multiple sensors measure related quantities, cross-validation can detect faults:
```python
import math

def cross_validate_weather(temperature, humidity, dew_point_reported):
    """
    Validate weather sensor readings using physical relationships.
    Dew point cannot exceed temperature.
    """
    # Guard against invalid humidity (log() below requires humidity > 0)
    if humidity <= 0 or humidity > 100:
        return False, [f"Humidity {humidity}% outside valid range (0, 100]"]
    # Calculate expected dew point from T and RH
    # Magnus formula approximation
    a = 17.27
    b = 237.7
    alpha = ((a * temperature) / (b + temperature)) + math.log(humidity / 100.0)
    dew_point_calculated = (b * alpha) / (a - alpha)
    # Check if reported dew point is plausible
    errors = []
    if dew_point_reported > temperature:
        errors.append(f"Dew point ({dew_point_reported}C) > "
                      f"Temperature ({temperature}C)")
    if abs(dew_point_calculated - dew_point_reported) > 5:
        errors.append(f"Dew point mismatch: calculated "
                      f"{dew_point_calculated:.1f}C vs reported {dew_point_reported}C")
    return len(errors) == 0, errors
```
The mistake: Using fixed validation thresholds that do not account for context, leading to either excessive false rejections or missed errors.
Symptoms:
- Valid readings rejected during extreme weather events
- Obvious sensor faults accepted because values are technically “in range”
- Different deployments require manual threshold tuning
- Seasonal patterns cause periodic validation failures
Why it happens: Engineers set thresholds based on typical conditions without considering edge cases. A threshold of 35C maximum for indoor temperature works in temperate climates but fails in regions with summer heatwaves.
The fix: Implement context-aware validation:
```python
# Bad: Static threshold
if temperature > 35:
    reject_reading()

# Good: Context-aware threshold
outdoor_temp = get_outdoor_temperature()
if temperature > min(40, outdoor_temp + 10):
    reject_reading()  # Indoor cannot be much hotter than outside
```
Prevention: Define thresholds as functions of context (season, location, related sensors) rather than constants. Log rejection rates and investigate if they exceed 1-2%.
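The snippet above leaves `get_outdoor_temperature()` undefined. A minimal self-contained sketch of the same idea follows; the class name, the callback-based design, and the default margin values are illustrative assumptions, not a prescribed API.

```python
class ContextAwareValidator:
    """Validate indoor temperature against a context-dependent ceiling.

    Instead of one fixed threshold, the upper bound is computed from the
    current outdoor temperature: an indoor reading far above outdoor
    conditions (plus a margin for heating and equipment) is suspect.
    """

    def __init__(self, get_outdoor_temp, hard_max=40.0, margin=10.0):
        self.get_outdoor_temp = get_outdoor_temp  # callable returning degC
        self.hard_max = hard_max  # absolute ceiling, degC
        self.margin = margin      # allowed indoor excess over outdoor, degC

    def validate(self, indoor_temp):
        outdoor = self.get_outdoor_temp()
        limit = min(self.hard_max, outdoor + self.margin)
        if indoor_temp > limit:
            return False, f"{indoor_temp}C exceeds context limit {limit}C"
        return True, "Valid"

# Example: on a mild day the effective limit is outdoor + margin, not 40C
validator = ContextAwareValidator(get_outdoor_temp=lambda: 18.0)
print(validator.validate(25.0))  # within 18 + 10 = 28C limit
print(validator.validate(32.0))  # above the limit, rejected
```

The callback makes the context source pluggable: it could read a co-located outdoor sensor, a weather API, or a seasonal lookup table.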
1306.5 Outlier Detection and Handling
1306.5.1 Z-Score Method
Identifies outliers based on standard deviations from the mean:
```python
import numpy as np

class ZScoreOutlierDetector:
    def __init__(self, threshold=3.0, window_size=100):
        self.threshold = threshold
        self.window_size = window_size
        self.buffer = []

    def detect(self, value):
        """Returns (is_outlier, z_score)"""
        self.buffer.append(value)
        if len(self.buffer) > self.window_size:
            self.buffer.pop(0)
        if len(self.buffer) < 10:
            return False, 0.0  # Not enough history yet
        mean = np.mean(self.buffer)
        std = np.std(self.buffer)
        if std == 0:
            return False, 0.0
        z_score = abs((value - mean) / std)
        return z_score > self.threshold, z_score
```
1306.5.2 Interquartile Range (IQR) Method
More robust to extreme outliers than Z-score:
```python
import numpy as np

class IQROutlierDetector:
    def __init__(self, multiplier=1.5, window_size=100):
        self.multiplier = multiplier
        self.window_size = window_size
        self.buffer = []

    def detect(self, value):
        """Returns (is_outlier, bounds)"""
        self.buffer.append(value)
        if len(self.buffer) > self.window_size:
            self.buffer.pop(0)
        if len(self.buffer) < 10:
            return False, (None, None)
        q1 = np.percentile(self.buffer, 25)
        q3 = np.percentile(self.buffer, 75)
        iqr = q3 - q1
        lower = q1 - self.multiplier * iqr
        upper = q3 + self.multiplier * iqr
        is_outlier = value < lower or value > upper
        return is_outlier, (lower, upper)
```
1306.5.3 Median Absolute Deviation (MAD)
Extremely robust to outliers - the median of absolute deviations from the median:
```python
import numpy as np

class MADOutlierDetector:
    def __init__(self, threshold=3.5, window_size=100):
        self.threshold = threshold
        self.window_size = window_size
        self.buffer = []
        self.scale = 1.4826  # Consistency constant for normal distribution

    def detect(self, value):
        """Returns (is_outlier, modified_z_score)"""
        self.buffer.append(value)
        if len(self.buffer) > self.window_size:
            self.buffer.pop(0)
        if len(self.buffer) < 10:
            return False, 0.0
        median = np.median(self.buffer)
        mad = np.median(np.abs(np.array(self.buffer) - median))
        if mad == 0:
            return False, 0.0
        modified_z = (value - median) / (self.scale * mad)
        return abs(modified_z) > self.threshold, modified_z
```
1306.5.4 Outlier Handling Strategies
| Strategy | When to Use | Implementation |
|---|---|---|
| Discard | Clear sensor faults | Skip reading, log event |
| Replace with median | Single outliers in stable signal | Use window median |
| Winsorize | Preserve data count | Clip to boundary values |
| Interpolate | Time-series continuity needed | Linear/spline interpolation |
| Flag and keep | Audit trail required | Add quality flag column |
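Putting detection and handling together: the short, self-contained sketch below reimplements the MAD test inline, then applies two strategies from the table above to a stream containing one glitch - median replacement plus a quality flag. The function names and the `(value, was_replaced)` output format are illustrative choices.

```python
import numpy as np

def mad_is_outlier(window, value, threshold=3.5):
    """Modified z-score outlier test against a window of recent values."""
    med = np.median(window)
    mad = np.median(np.abs(np.array(window) - med))
    if mad == 0:
        return False
    modified_z = (value - med) / (1.4826 * mad)
    return abs(modified_z) > threshold

def clean_stream(readings, window_size=20):
    """Replace MAD-flagged outliers with the window median, keeping a flag.

    Returns a list of (value, was_replaced) tuples - 'replace with median'
    plus 'flag and keep' from the strategy table. Flagged values are not
    added to the window, so a glitch cannot poison later statistics.
    """
    window, cleaned = [], []
    for value in readings:
        if len(window) >= 10 and mad_is_outlier(window, value):
            cleaned.append((float(np.median(window)), True))
        else:
            cleaned.append((value, False))
            window.append(value)
            window = window[-window_size:]
    return cleaned

stream = [22.0, 22.1, 21.9, 22.0, 22.2, 21.8, 22.0, 22.1, 21.9, 22.0,
          95.0,  # sensor glitch
          22.1]
result = clean_stream(stream)
print(result[10])  # the glitch, replaced by the window median and flagged
```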
1306.6 Knowledge Check
1306.7 Summary
Data validation and outlier detection form the first critical stages of the data quality pipeline:
- Range Validation: Check values against physical bounds - temperature cannot exceed what the environment allows, humidity cannot exceed 100%
- Rate-of-Change Validation: Physical systems have inertia - detect sensor faults by catching impossible jumps
- Multi-Sensor Plausibility: Cross-validate related measurements using physical relationships
- Z-Score Detection: Good for Gaussian distributions, uses standard deviations from mean
- IQR Detection: Robust to skewed data and extreme values, uses percentiles
- MAD Detection: Most robust method, based on median rather than mean
Critical Design Principle: Implement validation at the edge before data travels. Range and rate-of-change checks are computationally cheap and catch most sensor faults immediately.
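As a compact illustration of that principle, the two cheap checks can be chained into a single edge-side gate. This sketch reimplements the range and rate-of-change logic inline rather than reusing the chapter's classes, so it runs standalone; the function signature and the dict-based state are illustrative assumptions, and the defaults model an indoor temperature sensor.

```python
def validate_reading(value, timestamp, state,
                     min_val=-10.0, max_val=50.0, max_rate=0.0083):
    """Cheap edge-side gate: range check first, then rate-of-change.

    state is a dict carrying the last accepted (value, timestamp).
    Returns (ok, reason). Rejected readings do not update state, so the
    rate baseline stays anchored to the last accepted reading.
    """
    # Stage 1: physical/operational bounds - always wrong if violated
    if not (min_val <= value <= max_val):
        return False, f"out of range [{min_val}, {max_val}]"
    # Stage 2: rate of change - physical systems have inertia
    if state.get('last') is not None:
        last_value, last_time = state['last']
        dt = timestamp - last_time
        if dt > 0 and abs(value - last_value) / dt > max_rate:
            return False, "rate-of-change limit exceeded"
    state['last'] = (value, timestamp)
    return True, "valid"

state = {}
print(validate_reading(22.5, 0.0, state))   # first reading, accepted
print(validate_reading(22.6, 60.0, state))  # 0.1C over 60s, accepted
print(validate_reading(30.0, 61.0, state))  # 7.4C in 1s, rejected
```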
1306.8 What’s Next
The next chapter covers Missing Value Imputation and Noise Filtering, exploring how to handle gaps in data and remove noise while preserving the underlying signal.
Data Quality Series:
- Data Quality and Preprocessing - Overview and index
- Missing Value Imputation and Noise Filtering - Handling gaps and noise
- Data Normalization and Preprocessing Lab - Scaling and hands-on practice
Prerequisites:
- Edge Data Acquisition - Where raw data originates
- Sensor Fundamentals - Understanding sensor characteristics
Next Steps:
- Multi-Sensor Data Fusion - Combining preprocessed data
- Anomaly Detection - Finding meaningful outliers in clean data