1306  Data Validation and Outlier Detection

1306.1 Learning Objectives

By the end of this chapter, you will be able to:

  • Identify Data Quality Issues: Recognize common data quality problems in IoT sensor streams including outliers, missing values, noise, and drift
  • Implement Validation Rules: Create real-time data validation pipelines that detect and flag invalid sensor readings
  • Apply Outlier Detection: Use statistical methods (Z-score, IQR, MAD) to identify and handle anomalous data points
  • Design Context-Aware Validation: Build adaptive validation thresholds that account for environmental conditions

1306.2 Prerequisites

Before diving into this chapter, you should be familiar with:

Data quality is like being a detective who checks if clues are real or fake before solving a mystery!

1306.2.1 The Sensor Squad Adventure: The Case of the Confused Readings

Sammy the Sensor was the star detective at Sensor Squad Headquarters. Every day, sensors from all over the city sent readings to Sammy for the Big Weather Report.

One morning, something STRANGE happened!

“Temperature Terry says it’s 500 degrees in the park!” gasped Lila the LED. “But Humidity Hannah says the park is underwater AND it’s -100 degrees there!”

Sammy knew something was VERY wrong. “That’s impossible! Water can’t be frozen AND boiling at the same time. And nothing on Earth is 500 degrees outside of volcanoes!”

Bella the Battery suggested, “Maybe the sensors made mistakes? Or maybe a squirrel chewed Terry’s wires?”

Sammy put on his detective hat and created the “Data Quality Checklist”:

Step 1 - VALIDATE (Is it even possible?) “Can a park really be 500 degrees? NO! That’s hotter than an oven! REJECTED!”

Step 2 - CLEAN (Remove the mistakes) “Hannah’s -100 degrees? Let me check the sensors nearby. They all say 25 degrees. Hannah is the odd one out - she needs a checkup!”

After checking everything, Sammy had CLEAN, TRUSTWORTHY data for the Weather Report.

Max the Microcontroller was impressed. “Without your detective work, we would have told everyone the park was on fire AND frozen at the same time!”

The Sensor Squad learned: Bad data in = Bad decisions out! Always check your data before trusting it!

1306.2.2 Key Words for Kids

| Word | What It Means |
|------|---------------|
| Data Quality | How good and trustworthy your information is - like checking if food is fresh before eating |
| Outlier | A reading that’s WAY different from the others - like one kid saying they’re 100 feet tall |
| Validation | Checking if a reading is even POSSIBLE - humans can’t be 100 feet tall! |

1306.2.3 Try This at Home!

The Data Detective Game:

  1. Ask 5 family members their height (or age, or favorite number 1-10)
  2. Write down their answers
  3. Now add ONE obviously wrong answer: “Uncle Bob is 50 feet tall!”

Play detective:

  • Which answer is the OUTLIER? (50 feet - impossible for a human!)
  • How did you know it was wrong? (No human is that tall - that’s VALIDATION)

Now you’re a Data Quality Detective just like Sammy!

Think of sensor data like ingredients in a recipe. Even the best chef cannot make a great dish from spoiled or contaminated ingredients. Similarly, even the most sophisticated analytics and machine learning models cannot produce reliable results from low-quality sensor data.

Common data quality problems in IoT:

| Problem | Example | Impact |
|---------|---------|--------|
| Outliers | Temperature reads -40C in a room | False alerts, wrong decisions |
| Missing Data | Sensor battery dies for 2 hours | Gaps in analysis, failed predictions |
| Noise | Vibration sensor picks up building HVAC | Obscures real signals |
| Drift | Humidity sensor slowly reads 5% high | Gradual accuracy loss |
| Invalid Range | Humidity shows 105% | Physically impossible readings |

Why preprocessing matters:

  1. Garbage in, garbage out: ML models amplify data problems
  2. Edge resources: Cleaning data at the source reduces bandwidth
  3. Real-time decisions: Invalid data triggers false alarms
  4. Long-term trends: Drift and bias corrupt historical analysis

Key question this chapter answers: “How do I detect invalid or anomalous sensor readings before they cause problems?”

Tip: Minimum Viable Understanding - Data Validation

Core Concept: Data validation is the first line of defense in data quality, checking whether sensor readings are physically possible and plausible before any further processing.

Why It Matters: A single corrupt sensor reading propagated through an IoT system can trigger false alarms, cause equipment shutdowns, or corrupt trained ML models. Simple validation at the point of collection catches a large share of data quality issues before they spread downstream.

Key Takeaway: Always validate readings against physical bounds (temperature cannot exceed 100C for a room sensor) and rate-of-change limits (indoor temperature cannot change 20C in one second). If a reading fails these basic checks, it is always wrong.

1306.3 The Data Quality Pipeline

  • ~10 min | Intermediate | P10.C09.U01

Data quality is the foundation of trustworthy IoT analytics. Raw sensor data is inherently noisy, incomplete, and sometimes outright wrong. This section explores the validation and outlier detection stages of data quality preprocessing.

Note: Key Takeaway

In one sentence: Clean data at the edge before it travels - fixing bad data at the source costs a small fraction of what it costs to fix downstream in the cloud.

Remember this rule: If a reading violates physical laws (temperature below absolute zero, humidity above 100%), it is always wrong.

1306.3.1 Pipeline Overview

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ecf0f1'}}}%%
flowchart LR
    subgraph Input["Raw Data"]
        S1[Sensor<br/>Readings]
    end

    subgraph Validate["Stage 1: Validate"]
        V1[Range Check]
        V2[Rate-of-Change]
        V3[Plausibility]
    end

    subgraph Clean["Stage 2: Clean"]
        C1[Outlier Detection]
        C2[Missing Value<br/>Imputation]
        C3[Noise Filtering]
    end

    subgraph Transform["Stage 3: Transform"]
        T1[Normalization]
        T2[Scaling]
        T3[Feature<br/>Engineering]
    end

    subgraph Output["Clean Data"]
        O1[Analysis<br/>Ready]
    end

    S1 --> V1
    V1 --> V2
    V2 --> V3
    V3 --> C1
    C1 --> C2
    C2 --> C3
    C3 --> T1
    T1 --> T2
    T2 --> T3
    T3 --> O1

    style S1 fill:#E67E22,stroke:#2C3E50,color:#fff
    style V1 fill:#2C3E50,stroke:#16A085,color:#fff
    style V2 fill:#2C3E50,stroke:#16A085,color:#fff
    style V3 fill:#2C3E50,stroke:#16A085,color:#fff
    style C1 fill:#16A085,stroke:#2C3E50,color:#fff
    style C2 fill:#16A085,stroke:#2C3E50,color:#fff
    style C3 fill:#16A085,stroke:#2C3E50,color:#fff
    style T1 fill:#7F8C8D,stroke:#2C3E50,color:#fff
    style T2 fill:#7F8C8D,stroke:#2C3E50,color:#fff
    style T3 fill:#7F8C8D,stroke:#2C3E50,color:#fff
    style O1 fill:#27AE60,stroke:#2C3E50,color:#fff

Figure 1306.1: Three-stage data quality pipeline: Validate (check bounds), Clean (remove errors), Transform (prepare for analysis)

This view helps determine the appropriate handling strategy for different data quality issues:

%% fig-alt: "Decision tree for handling data quality issues. Start with the type of problem detected. If value is outside physical bounds, discard and flag sensor fault. If value is statistically unusual but physically possible, check context - if contextually valid keep it, otherwise apply outlier treatment. If value is missing, determine cause - if sensor offline use forward-fill or interpolation, if transmission gap use backfill when data arrives. If value is noisy, apply appropriate filter based on signal characteristics."
%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ecf0f1'}}}%%
flowchart TD
    Start[Data Quality<br/>Issue Detected] --> Type{Issue Type?}

    Type -->|Out of Range| Range[Physical<br/>Bounds Violation]
    Type -->|Unusual Value| Stats[Statistical<br/>Outlier]
    Type -->|No Reading| Missing[Missing<br/>Data]
    Type -->|Fluctuating| Noise[Noisy<br/>Signal]

    Range --> Discard[Discard Reading<br/>Flag Sensor Fault]

    Stats --> Context{Contextually<br/>Valid?}
    Context -->|Yes| Keep[Keep Value<br/>Document]
    Context -->|No| Treat[Apply Outlier<br/>Treatment]

    Missing --> Cause{Cause<br/>Known?}
    Cause -->|Sensor Offline| Forward[Forward-Fill<br/>or Interpolate]
    Cause -->|Transmission Gap| Backfill[Backfill When<br/>Data Arrives]

    Noise --> Filter[Apply Appropriate<br/>Noise Filter]

    style Start fill:#2C3E50,stroke:#16A085,color:#fff
    style Discard fill:#E74C3C,stroke:#2C3E50,color:#fff
    style Keep fill:#27AE60,stroke:#2C3E50,color:#fff
    style Treat fill:#E67E22,stroke:#2C3E50,color:#fff
    style Forward fill:#16A085,stroke:#2C3E50,color:#fff
    style Backfill fill:#16A085,stroke:#2C3E50,color:#fff
    style Filter fill:#3498DB,stroke:#2C3E50,color:#fff

Use this decision tree to select the appropriate handling strategy for each type of data quality issue.
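
The tree maps naturally onto a small dispatch routine. A minimal sketch - the DataIssue categories and the returned strategy strings are illustrative placeholders, not from any specific library:

from enum import Enum, auto

class DataIssue(Enum):
    OUT_OF_RANGE = auto()         # physical bounds violation
    STATISTICAL_OUTLIER = auto()  # unusual but physically possible
    MISSING = auto()              # no reading received
    NOISY = auto()                # fluctuating signal

def handle_issue(issue, contextually_valid=False, sensor_offline=False):
    """Route a detected data quality issue to a handling strategy,
    mirroring the decision tree above."""
    if issue is DataIssue.OUT_OF_RANGE:
        return "discard reading, flag sensor fault"
    if issue is DataIssue.STATISTICAL_OUTLIER:
        return "keep and document" if contextually_valid else "apply outlier treatment"
    if issue is DataIssue.MISSING:
        return "forward-fill or interpolate" if sensor_offline else "backfill when data arrives"
    return "apply appropriate noise filter"  # DataIssue.NOISY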

1306.4 Data Validation Techniques

  • ~15 min | Intermediate | P10.C09.U02

1306.4.1 Range Validation

The simplest form of validation checks whether values fall within physically possible or operationally expected bounds:

class RangeValidator:
    def __init__(self, sensor_type):
        """Define valid ranges for common sensor types"""
        self.ranges = {
            'temperature_indoor': (-10, 50),      # Celsius
            'temperature_outdoor': (-40, 60),     # Celsius
            'humidity': (0, 100),                 # Percentage
            'pressure': (870, 1084),              # hPa (sea level)
            'light': (0, 100000),                 # Lux
            'co2': (300, 5000),                   # ppm
            'battery_voltage': (2.5, 4.2),        # Li-ion typical
            'soil_moisture': (0, 100),            # Percentage
        }
        self.min_val, self.max_val = self.ranges.get(
            sensor_type, (float('-inf'), float('inf'))
        )

    def validate(self, value):
        """Check if value is within valid range"""
        if value < self.min_val or value > self.max_val:
            return False, f"Out of range [{self.min_val}, {self.max_val}]"
        return True, "Valid"

# Example usage
temp_validator = RangeValidator('temperature_indoor')
print(temp_validator.validate(22.5))   # (True, 'Valid')
print(temp_validator.validate(-45.0))  # (False, 'Out of range [-10, 50]')

1306.4.2 Rate-of-Change Validation

Physical sensors cannot change instantaneously. A temperature sensor reading 22C one second and -40C the next indicates a fault, not an actual temperature change:

class RateOfChangeValidator:
    def __init__(self, max_rate_per_second):
        """
        max_rate_per_second: Maximum expected change per second
        Example: Indoor temperature typically changes < 0.5C/min = 0.0083C/s
        """
        self.max_rate = max_rate_per_second
        self.last_value = None
        self.last_time = None

    def validate(self, value, timestamp):
        if self.last_value is None:
            self.last_value = value
            self.last_time = timestamp
            return True, "First reading"

        time_delta = timestamp - self.last_time
        if time_delta <= 0:
            return False, "Invalid timestamp"

        rate = abs(value - self.last_value) / time_delta

        # Update history
        self.last_value = value
        self.last_time = timestamp

        if rate > self.max_rate:
            return False, f"Rate {rate:.4f}/s exceeds max {self.max_rate}/s"

        return True, f"Rate {rate:.4f}/s OK"

# Indoor temperature: max 0.5C per minute = 0.0083C/s
temp_roc = RateOfChangeValidator(max_rate_per_second=0.0083)
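
A brief usage sketch, assuming timestamps in seconds (for example from time.time()):

# Simulated readings: (value in C, timestamp in seconds)
for value, ts in [(22.0, 0), (22.1, 60), (-40.0, 61)]:
    ok, msg = temp_roc.validate(value, ts)
    print(f"{value:>6.1f}C at t={ts}s -> {msg}")
# The jump from 22.1C to -40C within one second fails the rate check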

1306.4.3 Multi-Sensor Plausibility Checks

When multiple sensors measure related quantities, cross-validation can detect faults:

def cross_validate_weather(temperature, humidity, dew_point_reported):
    """
    Validate weather sensor readings using physical relationships.
    Dew point cannot exceed temperature.
    """
    # Calculate expected dew point from T and RH
    # Magnus formula approximation
    import math

    a = 17.27
    b = 237.7
    alpha = ((a * temperature) / (b + temperature)) + math.log(humidity / 100.0)
    dew_point_calculated = (b * alpha) / (a - alpha)

    # Check if reported dew point is plausible
    errors = []

    if dew_point_reported > temperature:
        errors.append(f"Dew point ({dew_point_reported}C) > Temperature ({temperature}C)")

    if abs(dew_point_calculated - dew_point_reported) > 5:
        errors.append(f"Dew point mismatch: calculated {dew_point_calculated:.1f}C vs reported {dew_point_reported}C")

    return len(errors) == 0, errors
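
For example, a reported dew point above the air temperature fails both checks:

valid, errors = cross_validate_weather(temperature=25.0, humidity=60.0,
                                       dew_point_reported=30.0)
print(valid)   # False
print(errors)  # dew point exceeds temperature; mismatch vs ~16.7C calculated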

Warning: Common Pitfall - Static Validation Thresholds

The mistake: Using fixed validation thresholds that do not account for context, leading to either excessive false rejections or missed errors.

Symptoms:

  • Valid readings rejected during extreme weather events
  • Obvious sensor faults accepted because values are technically “in range”
  • Different deployments require manual threshold tuning
  • Seasonal patterns cause periodic validation failures

Why it happens: Engineers set thresholds based on typical conditions without considering edge cases. A threshold of 35C maximum for indoor temperature works in temperate climates but fails in regions with summer heatwaves.

The fix: Implement context-aware validation:

# Bad: Static threshold
if temperature > 35:
    reject_reading()

# Good: Context-aware threshold
outdoor_temp = get_outdoor_temperature()
if temperature > min(40, outdoor_temp + 10):
    reject_reading()  # Indoor cannot be much hotter than outside

Prevention: Define thresholds as functions of context (season, location, related sensors) rather than constants. Log rejection rates and investigate if they exceed 1-2%.
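
One way to package this pattern is a validator whose upper bound is computed at check time from a context source. A minimal sketch, assuming the caller supplies a callback for the reference reading - the callback, the +10C margin, and the bounds here are illustrative:

class ContextAwareValidator:
    def __init__(self, base_min, base_max, context_fn=None, margin=10.0):
        """base_min/base_max: static physical bounds.
        context_fn: optional callable returning a reference reading
        (e.g. outdoor temperature) used to tighten the upper bound."""
        self.base_min = base_min
        self.base_max = base_max
        self.context_fn = context_fn
        self.margin = margin

    def validate(self, value):
        upper = self.base_max
        if self.context_fn is not None:
            # Indoor temperature cannot be much hotter than outdoors
            upper = min(self.base_max, self.context_fn() + self.margin)
        if not (self.base_min <= value <= upper):
            return False, f"Outside [{self.base_min}, {upper}]"
        return True, "Valid"

# Outdoor reference of 18C tightens the static 50C bound to 28C
validator = ContextAwareValidator(-10, 50, context_fn=lambda: 18.0)
print(validator.validate(35.0))  # (False, 'Outside [-10, 28.0]')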

1306.5 Outlier Detection and Handling

  • ~15 min | Intermediate | P10.C09.U03

1306.5.1 Z-Score Method

Identifies outliers based on standard deviations from the mean:

import numpy as np

class ZScoreOutlierDetector:
    def __init__(self, threshold=3.0, window_size=100):
        self.threshold = threshold
        self.window_size = window_size
        self.buffer = []

    def detect(self, value):
        """Returns (is_outlier, z_score)"""
        self.buffer.append(value)
        if len(self.buffer) > self.window_size:
            self.buffer.pop(0)

        if len(self.buffer) < 10:
            return False, 0.0

        mean = np.mean(self.buffer)
        std = np.std(self.buffer)

        if std == 0:
            return False, 0.0

        z_score = abs((value - mean) / std)
        return z_score > self.threshold, z_score
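
A quick sanity check with a synthetic stream: values around 22C, then one injected spike:

import random
random.seed(42)

detector = ZScoreOutlierDetector(threshold=3.0)
for _ in range(50):
    detector.detect(22.0 + random.gauss(0, 0.2))  # warm up the window

is_outlier, z = detector.detect(30.0)  # injected spike
print(is_outlier, round(z, 1))         # True, with a z-score well above 3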

1306.5.2 Interquartile Range (IQR) Method

More robust to extreme outliers than Z-score:

class IQROutlierDetector:
    def __init__(self, multiplier=1.5, window_size=100):
        self.multiplier = multiplier
        self.window_size = window_size
        self.buffer = []

    def detect(self, value):
        """Returns (is_outlier, bounds)"""
        self.buffer.append(value)
        if len(self.buffer) > self.window_size:
            self.buffer.pop(0)

        if len(self.buffer) < 10:
            return False, (None, None)

        q1 = np.percentile(self.buffer, 25)
        q3 = np.percentile(self.buffer, 75)
        iqr = q3 - q1

        lower = q1 - self.multiplier * iqr
        upper = q3 + self.multiplier * iqr

        is_outlier = value < lower or value > upper
        return is_outlier, (lower, upper)
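
Usage mirrors the Z-score detector; the returned bounds make each decision auditable:

iqr_det = IQROutlierDetector()
for v in [3.7, 3.8, 3.75, 3.82, 3.78, 3.79, 3.81, 3.76, 3.74, 3.8]:
    iqr_det.detect(v)  # fill the window with normal battery voltages

is_outlier, (lower, upper) = iqr_det.detect(2.9)  # voltage sag
print(is_outlier)                        # True
print(round(lower, 3), round(upper, 3))  # bounds derived from the quartiles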

1306.5.3 Median Absolute Deviation (MAD)

Extremely robust to outliers - the median of absolute deviations from the median:

class MADOutlierDetector:
    def __init__(self, threshold=3.5, window_size=100):
        self.threshold = threshold
        self.window_size = window_size
        self.buffer = []
        self.scale = 1.4826  # Consistency constant for normal distribution

    def detect(self, value):
        """Returns (is_outlier, modified_z_score)"""
        self.buffer.append(value)
        if len(self.buffer) > self.window_size:
            self.buffer.pop(0)

        if len(self.buffer) < 10:
            return False, 0.0

        median = np.median(self.buffer)
        mad = np.median(np.abs(np.array(self.buffer) - median))

        if mad == 0:
            return False, 0.0

        modified_z = (value - median) / (self.scale * mad)
        return abs(modified_z) > self.threshold, modified_z
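
The robustness difference shows up when the window itself is contaminated. A small comparison sketch using the detectors defined above (values are illustrative):

import random
random.seed(1)

# Two earlier spikes remain in both buffers and inflate the mean and std
stream = [22.0 + random.gauss(0, 0.2) for _ in range(40)] + [80.0, 85.0]

z_det, mad_det = ZScoreOutlierDetector(), MADOutlierDetector()
for v in stream:
    z_det.detect(v)
    mad_det.detect(v)

print(z_det.detect(35.0))    # (False, ...) - inflated std masks the spike
print(mad_det.detect(35.0))  # (True, ...)  - median and MAD barely move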

1306.5.4 Outlier Handling Strategies

| Strategy | When to Use | Implementation |
|----------|-------------|----------------|
| Discard | Clear sensor faults | Skip reading, log event |
| Replace with median | Single outliers in stable signal | Use window median |
| Winsorize | Preserve data count | Clip to boundary values |
| Interpolate | Time-series continuity needed | Linear/spline interpolation |
| Flag and keep | Audit trail required | Add quality flag column |
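
As a concrete illustration of two of these strategies, a minimal sketch (the function names are illustrative; np is the numpy import from earlier in this section):

def winsorize(values, lower_pct=5, upper_pct=95):
    """Clip extremes to percentile boundaries, preserving the data count."""
    lo, hi = np.percentile(values, [lower_pct, upper_pct])
    return np.clip(values, lo, hi)

def replace_with_median(values, outlier_mask, window=5):
    """Replace flagged outliers with the median of a local window."""
    values = np.asarray(values, dtype=float)
    cleaned = values.copy()
    half = window // 2
    for i in np.flatnonzero(outlier_mask):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        cleaned[i] = np.median(values[lo:hi])
    return cleaned

print(replace_with_median([22.0, 22.1, 95.0, 22.2, 21.9],
                          [False, False, True, False, False]))
# [22.  22.1 22.1 22.2 21.9]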

1306.6 Knowledge Check

Question: A temperature sensor in a server room reports -40C while the HVAC system shows the room at 22C. The sensor’s valid range is -10C to 60C. What type of validation failure is this?

Explanation: The value -40C is outside the valid range of -10C to 60C defined for indoor temperature sensors. This is a range violation - the reading is physically impossible for this sensor placement. While the HVAC cross-check confirms the problem, the primary detection mechanism is range validation. This is likely a sensor fault (disconnection often shows extreme values on NTC thermistors).

Question: Which outlier detection method is most appropriate for battery voltage data that is naturally bounded between 3.0V and 4.2V with occasional voltage sags during transmission?

Explanation: The IQR method is ideal here because it uses percentiles rather than the mean and standard deviation. For bounded data with occasional extreme values (voltage sags), IQR correctly identifies the main distribution without being skewed by the extremes. A Z-score would be distorted by the natural bounds and the occasional sags, and machine learning methods would be overkill for this univariate case.

1306.7 Summary

Data validation and outlier detection form the first critical stages of the data quality pipeline:

  • Range Validation: Check values against physical bounds - temperature cannot exceed what the environment allows, humidity cannot exceed 100%
  • Rate-of-Change Validation: Physical systems have inertia - detect sensor faults by catching impossible jumps
  • Multi-Sensor Plausibility: Cross-validate related measurements using physical relationships
  • Z-Score Detection: Good for Gaussian distributions, uses standard deviations from mean
  • IQR Detection: Robust to skewed data and extreme values, uses percentiles
  • MAD Detection: Most robust method, based on median rather than mean

Critical Design Principle: Implement validation at the edge before data travels. Range and rate-of-change checks are computationally cheap and catch most sensor faults immediately.

1306.8 What’s Next

The next chapter covers Missing Value Imputation and Noise Filtering, exploring how to handle gaps in data and remove noise while preserving the underlying signal.
