1306  Data Validation and Outlier Detection

1306.1 Learning Objectives

By the end of this chapter, you will be able to:

  • Identify Data Quality Issues: Recognize common data quality problems in IoT sensor streams including outliers, missing values, noise, and drift
  • Implement Validation Rules: Create real-time data validation pipelines that detect and flag invalid sensor readings
  • Apply Outlier Detection: Use statistical methods (Z-score, IQR, MAD) to identify and handle anomalous data points
  • Design Context-Aware Validation: Build adaptive validation thresholds that account for environmental conditions

1306.2 Prerequisites

Before diving into this chapter, you should be familiar with:

Data quality is like being a detective who checks if clues are real or fake before solving a mystery!

1306.2.1 The Sensor Squad Adventure: The Case of the Confused Readings

Sammy the Sensor was the star detective at Sensor Squad Headquarters. Every day, sensors from all over the city sent readings to Sammy for the Big Weather Report.

One morning, something STRANGE happened!

“Temperature Terry says it’s 500 degrees in the park!” gasped Lila the LED. “But Humidity Hannah says the park is underwater AND it’s -100 degrees there!”

Sammy knew something was VERY wrong. “That’s impossible! Water can’t be frozen AND boiling at the same time. And nothing on Earth is 500 degrees outside of volcanoes!”

Bella the Battery suggested, “Maybe the sensors made mistakes? Or maybe a squirrel chewed Terry’s wires?”

Sammy put on his detective hat and created the “Data Quality Checklist”:

Step 1 - VALIDATE (Is it even possible?) “Can a park really be 500 degrees? NO! That’s hotter than an oven! REJECTED!”

Step 2 - CLEAN (Remove the mistakes) “Hannah’s -100 degrees? Let me check the sensors nearby. They all say 25 degrees. Hannah is the odd one out - she needs a checkup!”

After checking everything, Sammy had CLEAN, TRUSTWORTHY data for the Weather Report.

Max the Microcontroller was impressed. “Without your detective work, we would have told everyone the park was on fire AND frozen at the same time!”

The Sensor Squad learned: Bad data in = Bad decisions out! Always check your data before trusting it!

1306.2.2 Key Words for Kids

| Word | What It Means |
|------|---------------|
| Data Quality | How good and trustworthy your information is - like checking if food is fresh before eating |
| Outlier | A reading that’s WAY different from the others - like one kid saying they’re 100 feet tall |
| Validation | Checking if a reading is even POSSIBLE - humans can’t be 100 feet tall! |

1306.2.3 Try This at Home!

The Data Detective Game:

  1. Ask 5 family members their height (or age, or favorite number 1-10)
  2. Write down their answers
  3. Now add ONE obviously wrong answer: “Uncle Bob is 50 feet tall!”

Play detective:

  • Which answer is the OUTLIER? (50 feet - impossible for a human!)
  • How did you know it was wrong? (No human is that tall - that’s VALIDATION)

Now you’re a Data Quality Detective just like Sammy!

Think of sensor data like ingredients in a recipe. Even the best chef cannot make a great dish from spoiled or contaminated ingredients. Similarly, even the most sophisticated analytics and machine learning models cannot produce reliable results from low-quality sensor data.

Common data quality problems in IoT:

| Problem | Example | Impact |
|---------|---------|--------|
| Outliers | Temperature reads -40C in a room | False alerts, wrong decisions |
| Missing Data | Sensor battery dies for 2 hours | Gaps in analysis, failed predictions |
| Noise | Vibration sensor picks up building HVAC | Obscures real signals |
| Drift | Humidity sensor slowly reads 5% high | Gradual accuracy loss |
| Invalid Range | Humidity shows 105% | Physically impossible readings |

Why preprocessing matters:

  1. Garbage in, garbage out: ML models amplify data problems
  2. Edge resources: Cleaning data at the source reduces bandwidth
  3. Real-time decisions: Invalid data triggers false alarms
  4. Long-term trends: Drift and bias corrupt historical analysis

Key question this chapter answers: “How do I detect invalid or anomalous sensor readings before they cause problems?”

Tip: Minimum Viable Understanding - Data Validation

Core Concept: Data validation is the first line of defense in data quality, checking whether sensor readings are physically possible and plausible before any further processing.

Why It Matters: A single corrupt sensor reading propagated through an IoT system can trigger false alarms, cause equipment shutdowns, or corrupt trained ML models. Simple validation at the point of collection catches a large share of data quality issues before they spread downstream.

Key Takeaway: Always validate readings against physical bounds (temperature cannot exceed 100C for a room sensor) and rate-of-change limits (indoor temperature cannot change 20C in one second). If a reading fails these basic checks, it is always wrong.

1306.3 The Data Quality Pipeline

  • ~10 min | Intermediate | P10.C09.U01

Data quality is the foundation of trustworthy IoT analytics. Raw sensor data is inherently noisy, incomplete, and sometimes outright wrong. This section explores the validation and outlier detection stages of data quality preprocessing.

Note: Key Takeaway

In one sentence: Clean data at the edge before it travels - fixing bad data at the source costs a small fraction of what it costs to fix downstream in the cloud.

Remember this rule: If a reading violates physical laws (temperature below absolute zero, humidity above 100%), it is always wrong.

1306.3.1 Pipeline Overview

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ecf0f1'}}}%%
flowchart LR
    subgraph Input["Raw Data"]
        S1[Sensor<br/>Readings]
    end

    subgraph Validate["Stage 1: Validate"]
        V1[Range Check]
        V2[Rate-of-Change]
        V3[Plausibility]
    end

    subgraph Clean["Stage 2: Clean"]
        C1[Outlier Detection]
        C2[Missing Value<br/>Imputation]
        C3[Noise Filtering]
    end

    subgraph Transform["Stage 3: Transform"]
        T1[Normalization]
        T2[Scaling]
        T3[Feature<br/>Engineering]
    end

    subgraph Output["Clean Data"]
        O1[Analysis<br/>Ready]
    end

    S1 --> V1
    V1 --> V2
    V2 --> V3
    V3 --> C1
    C1 --> C2
    C2 --> C3
    C3 --> T1
    T1 --> T2
    T2 --> T3
    T3 --> O1

    style S1 fill:#E67E22,stroke:#2C3E50,color:#fff
    style V1 fill:#2C3E50,stroke:#16A085,color:#fff
    style V2 fill:#2C3E50,stroke:#16A085,color:#fff
    style V3 fill:#2C3E50,stroke:#16A085,color:#fff
    style C1 fill:#16A085,stroke:#2C3E50,color:#fff
    style C2 fill:#16A085,stroke:#2C3E50,color:#fff
    style C3 fill:#16A085,stroke:#2C3E50,color:#fff
    style T1 fill:#7F8C8D,stroke:#2C3E50,color:#fff
    style T2 fill:#7F8C8D,stroke:#2C3E50,color:#fff
    style T3 fill:#7F8C8D,stroke:#2C3E50,color:#fff
    style O1 fill:#27AE60,stroke:#2C3E50,color:#fff

Figure 1306.1: Three-stage data quality pipeline: Validate (check bounds), Clean (remove errors), Transform (prepare for analysis)

This view helps determine the appropriate handling strategy for different data quality issues:

%% fig-alt: "Decision tree for handling data quality issues. Start with the type of problem detected. If value is outside physical bounds, discard and flag sensor fault. If value is statistically unusual but physically possible, check context - if contextually valid keep it, otherwise apply outlier treatment. If value is missing, determine cause - if sensor offline use forward-fill or interpolation, if transmission gap use backfill when data arrives. If value is noisy, apply appropriate filter based on signal characteristics."
%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ecf0f1'}}}%%
flowchart TD
    Start[Data Quality<br/>Issue Detected] --> Type{Issue Type?}

    Type -->|Out of Range| Range[Physical<br/>Bounds Violation]
    Type -->|Unusual Value| Stats[Statistical<br/>Outlier]
    Type -->|No Reading| Missing[Missing<br/>Data]
    Type -->|Fluctuating| Noise[Noisy<br/>Signal]

    Range --> Discard[Discard Reading<br/>Flag Sensor Fault]

    Stats --> Context{Contextually<br/>Valid?}
    Context -->|Yes| Keep[Keep Value<br/>Document]
    Context -->|No| Treat[Apply Outlier<br/>Treatment]

    Missing --> Cause{Cause<br/>Known?}
    Cause -->|Sensor Offline| Forward[Forward-Fill<br/>or Interpolate]
    Cause -->|Transmission Gap| Backfill[Backfill When<br/>Data Arrives]

    Noise --> Filter[Apply Appropriate<br/>Noise Filter]

    style Start fill:#2C3E50,stroke:#16A085,color:#fff
    style Discard fill:#E74C3C,stroke:#2C3E50,color:#fff
    style Keep fill:#27AE60,stroke:#2C3E50,color:#fff
    style Treat fill:#E67E22,stroke:#2C3E50,color:#fff
    style Forward fill:#16A085,stroke:#2C3E50,color:#fff
    style Backfill fill:#16A085,stroke:#2C3E50,color:#fff
    style Filter fill:#3498DB,stroke:#2C3E50,color:#fff

Use this decision tree to select the appropriate handling strategy for each type of data quality issue.
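
The tree maps naturally onto a small dispatch routine. A minimal sketch - the DataIssue categories and the returned strategy strings are illustrative placeholders, not from any specific library:

from enum import Enum, auto

class DataIssue(Enum):
    OUT_OF_RANGE = auto()         # physical bounds violation
    STATISTICAL_OUTLIER = auto()  # unusual but physically possible
    MISSING = auto()              # no reading received
    NOISY = auto()                # fluctuating signal

def handle_issue(issue, contextually_valid=False, sensor_offline=False):
    """Route a detected data quality issue to a handling strategy,
    mirroring the decision tree above."""
    if issue is DataIssue.OUT_OF_RANGE:
        return "discard reading, flag sensor fault"
    if issue is DataIssue.STATISTICAL_OUTLIER:
        return "keep and document" if contextually_valid else "apply outlier treatment"
    if issue is DataIssue.MISSING:
        return "forward-fill or interpolate" if sensor_offline else "backfill when data arrives"
    return "apply appropriate noise filter"  # DataIssue.NOISY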

1306.4 Data Validation Techniques

  • ~15 min | Intermediate | P10.C09.U02

1306.4.1 Range Validation

The simplest form of validation checks whether values fall within physically possible or operationally expected bounds:

class RangeValidator:
    def __init__(self, sensor_type):
        """Define valid ranges for common sensor types"""
        self.ranges = {
            'temperature_indoor': (-10, 50),      # Celsius
            'temperature_outdoor': (-40, 60),     # Celsius
            'humidity': (0, 100),                 # Percentage
            'pressure': (870, 1084),              # hPa (sea level)
            'light': (0, 100000),                 # Lux
            'co2': (300, 5000),                   # ppm
            'battery_voltage': (2.5, 4.2),        # Li-ion typical
            'soil_moisture': (0, 100),            # Percentage
        }
        self.min_val, self.max_val = self.ranges.get(
            sensor_type, (float('-inf'), float('inf'))
        )

    def validate(self, value):
        """Check if value is within valid range"""
        if value < self.min_val or value > self.max_val:
            return False, f"Out of range [{self.min_val}, {self.max_val}]"
        return True, "Valid"

# Example usage
temp_validator = RangeValidator('temperature_indoor')
print(temp_validator.validate(22.5))   # (True, 'Valid')
print(temp_validator.validate(-45.0))  # (False, 'Out of range [-10, 50]')

1306.4.2 Rate-of-Change Validation

Physical sensors cannot change instantaneously. A temperature sensor reading 22C one second and -40C the next indicates a fault, not an actual temperature change:

class RateOfChangeValidator:
    def __init__(self, max_rate_per_second):
        """
        max_rate_per_second: Maximum expected change per second
        Example: Indoor temperature typically changes < 0.5C/min = 0.0083C/s
        """
        self.max_rate = max_rate_per_second
        self.last_value = None
        self.last_time = None

    def validate(self, value, timestamp):
        if self.last_value is None:
            self.last_value = value
            self.last_time = timestamp
            return True, "First reading"

        time_delta = timestamp - self.last_time
        if time_delta <= 0:
            return False, "Invalid timestamp"

        rate = abs(value - self.last_value) / time_delta

        # Update history
        self.last_value = value
        self.last_time = timestamp

        if rate > self.max_rate:
            return False, f"Rate {rate:.4f}/s exceeds max {self.max_rate}/s"

        return True, f"Rate {rate:.4f}/s OK"

# Indoor temperature: max 0.5C per minute = 0.0083C/s
temp_roc = RateOfChangeValidator(max_rate_per_second=0.0083)
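
A brief usage sketch, assuming timestamps in seconds (for example from time.time()):

# Simulated readings: (value in C, timestamp in seconds)
for value, ts in [(22.0, 0), (22.1, 60), (-40.0, 61)]:
    ok, msg = temp_roc.validate(value, ts)
    print(f"{value:>6.1f}C at t={ts}s -> {msg}")
# The jump from 22.1C to -40C within one second fails the rate check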

1306.4.3 Multi-Sensor Plausibility Checks

When multiple sensors measure related quantities, cross-validation can detect faults:

def cross_validate_weather(temperature, humidity, dew_point_reported):
    """
    Validate weather sensor readings using physical relationships.
    Dew point cannot exceed temperature.
    """
    # Calculate expected dew point from T and RH
    # Magnus formula approximation
    import math

    a = 17.27
    b = 237.7
    alpha = ((a * temperature) / (b + temperature)) + math.log(humidity / 100.0)
    dew_point_calculated = (b * alpha) / (a - alpha)

    # Check if reported dew point is plausible
    errors = []

    if dew_point_reported > temperature:
        errors.append(f"Dew point ({dew_point_reported}C) > Temperature ({temperature}C)")

    if abs(dew_point_calculated - dew_point_reported) > 5:
        errors.append(f"Dew point mismatch: calculated {dew_point_calculated:.1f}C vs reported {dew_point_reported}C")

    return len(errors) == 0, errors
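
For example, a reported dew point above the air temperature fails both checks:

valid, errors = cross_validate_weather(temperature=25.0, humidity=60.0,
                                       dew_point_reported=30.0)
print(valid)   # False
print(errors)  # dew point exceeds temperature; mismatch vs ~16.7C calculated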

Warning: Common Pitfall - Static Validation Thresholds

The mistake: Using fixed validation thresholds that do not account for context, leading to either excessive false rejections or missed errors.

Symptoms:

  • Valid readings rejected during extreme weather events
  • Obvious sensor faults accepted because values are technically “in range”
  • Different deployments require manual threshold tuning
  • Seasonal patterns cause periodic validation failures

Why it happens: Engineers set thresholds based on typical conditions without considering edge cases. A threshold of 35C maximum for indoor temperature works in temperate climates but fails in regions with summer heatwaves.

The fix: Implement context-aware validation:

# Bad: Static threshold
if temperature > 35:
    reject_reading()

# Good: Context-aware threshold
outdoor_temp = get_outdoor_temperature()
if temperature > min(40, outdoor_temp + 10):
    reject_reading()  # Indoor cannot be much hotter than outside

Prevention: Define thresholds as functions of context (season, location, related sensors) rather than constants. Log rejection rates and investigate if they exceed 1-2%.
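
One way to package this pattern is a validator whose upper bound is computed at check time from a context source. A minimal sketch, assuming the caller supplies a callback for the reference reading - the callback, the +10C margin, and the bounds here are illustrative:

class ContextAwareValidator:
    def __init__(self, base_min, base_max, context_fn=None, margin=10.0):
        """base_min/base_max: static physical bounds.
        context_fn: optional callable returning a reference reading
        (e.g. outdoor temperature) used to tighten the upper bound."""
        self.base_min = base_min
        self.base_max = base_max
        self.context_fn = context_fn
        self.margin = margin

    def validate(self, value):
        upper = self.base_max
        if self.context_fn is not None:
            # Indoor temperature cannot be much hotter than outdoors
            upper = min(self.base_max, self.context_fn() + self.margin)
        if not (self.base_min <= value <= upper):
            return False, f"Outside [{self.base_min}, {upper}]"
        return True, "Valid"

# Outdoor reference of 18C tightens the static 50C bound to 28C
validator = ContextAwareValidator(-10, 50, context_fn=lambda: 18.0)
print(validator.validate(35.0))  # (False, 'Outside [-10, 28.0]')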

1306.5 Outlier Detection and Handling

  • ~15 min | Intermediate | P10.C09.U03

1306.5.1 Z-Score Method

Identifies outliers based on standard deviations from the mean:

import numpy as np

class ZScoreOutlierDetector:
    def __init__(self, threshold=3.0, window_size=100):
        self.threshold = threshold
        self.window_size = window_size
        self.buffer = []

    def detect(self, value):
        """Returns (is_outlier, z_score)"""
        self.buffer.append(value)
        if len(self.buffer) > self.window_size:
            self.buffer.pop(0)

        if len(self.buffer) < 10:
            return False, 0.0

        mean = np.mean(self.buffer)
        std = np.std(self.buffer)

        if std == 0:
            return False, 0.0

        z_score = abs((value - mean) / std)
        return z_score > self.threshold, z_score
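
A quick sanity check with a synthetic stream: values around 22C, then one injected spike:

import random
random.seed(42)

detector = ZScoreOutlierDetector(threshold=3.0)
for _ in range(50):
    detector.detect(22.0 + random.gauss(0, 0.2))  # warm up the window

is_outlier, z = detector.detect(30.0)  # injected spike
print(is_outlier, round(z, 1))         # True, with a z-score well above 3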

1306.5.2 Interquartile Range (IQR) Method

More robust to extreme outliers than Z-score:

class IQROutlierDetector:
    def __init__(self, multiplier=1.5, window_size=100):
        self.multiplier = multiplier
        self.window_size = window_size
        self.buffer = []

    def detect(self, value):
        """Returns (is_outlier, bounds)"""
        self.buffer.append(value)
        if len(self.buffer) > self.window_size:
            self.buffer.pop(0)

        if len(self.buffer) < 10:
            return False, (None, None)

        q1 = np.percentile(self.buffer, 25)
        q3 = np.percentile(self.buffer, 75)
        iqr = q3 - q1

        lower = q1 - self.multiplier * iqr
        upper = q3 + self.multiplier * iqr

        is_outlier = value < lower or value > upper
        return is_outlier, (lower, upper)
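
Usage mirrors the Z-score detector; the returned bounds make each decision auditable:

iqr_det = IQROutlierDetector()
for v in [3.7, 3.8, 3.75, 3.82, 3.78, 3.79, 3.81, 3.76, 3.74, 3.8]:
    iqr_det.detect(v)  # fill the window with normal battery voltages

is_outlier, (lower, upper) = iqr_det.detect(2.9)  # voltage sag
print(is_outlier)                        # True
print(round(lower, 3), round(upper, 3))  # bounds derived from the quartiles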

1306.5.3 Median Absolute Deviation (MAD)

Extremely robust to outliers - the median of absolute deviations from the median:

class MADOutlierDetector:
    def __init__(self, threshold=3.5, window_size=100):
        self.threshold = threshold
        self.window_size = window_size
        self.buffer = []
        self.scale = 1.4826  # Consistency constant for normal distribution

    def detect(self, value):
        """Returns (is_outlier, modified_z_score)"""
        self.buffer.append(value)
        if len(self.buffer) > self.window_size:
            self.buffer.pop(0)

        if len(self.buffer) < 10:
            return False, 0.0

        median = np.median(self.buffer)
        mad = np.median(np.abs(np.array(self.buffer) - median))

        if mad == 0:
            return False, 0.0

        modified_z = (value - median) / (self.scale * mad)
        return abs(modified_z) > self.threshold, modified_z
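
The robustness difference shows up when the window itself is contaminated. A small comparison sketch using the detectors defined above (values are illustrative):

import random
random.seed(1)

# Two earlier spikes remain in both buffers and inflate the mean and std
stream = [22.0 + random.gauss(0, 0.2) for _ in range(40)] + [80.0, 85.0]

z_det, mad_det = ZScoreOutlierDetector(), MADOutlierDetector()
for v in stream:
    z_det.detect(v)
    mad_det.detect(v)

print(z_det.detect(35.0))    # (False, ...) - inflated std masks the spike
print(mad_det.detect(35.0))  # (True, ...)  - median and MAD barely move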

1306.5.4 Outlier Handling Strategies

| Strategy | When to Use | Implementation |
|----------|-------------|----------------|
| Discard | Clear sensor faults | Skip reading, log event |
| Replace with median | Single outliers in stable signal | Use window median |
| Winsorize | Preserve data count | Clip to boundary values |
| Interpolate | Time-series continuity needed | Linear/spline interpolation |
| Flag and keep | Audit trail required | Add quality flag column |
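
As a concrete illustration of two of these strategies, a minimal sketch (the function names are illustrative; np is the numpy import from earlier in this section):

def winsorize(values, lower_pct=5, upper_pct=95):
    """Clip extremes to percentile boundaries, preserving the data count."""
    lo, hi = np.percentile(values, [lower_pct, upper_pct])
    return np.clip(values, lo, hi)

def replace_with_median(values, outlier_mask, window=5):
    """Replace flagged outliers with the median of a local window."""
    values = np.asarray(values, dtype=float)
    cleaned = values.copy()
    half = window // 2
    for i in np.flatnonzero(outlier_mask):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        cleaned[i] = np.median(values[lo:hi])
    return cleaned

print(replace_with_median([22.0, 22.1, 95.0, 22.2, 21.9],
                          [False, False, True, False, False]))
# [22.  22.1 22.1 22.2 21.9]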

1306.6 Knowledge Check

Question: A temperature sensor in a server room reports -40C while the HVAC system shows the room at 22C. The sensor’s valid range is -10C to 60C. What type of validation failure is this?

Explanation: The value -40C is outside the valid range of -10C to 60C defined for indoor temperature sensors. This is a range violation - the reading is physically impossible for this sensor placement. While the HVAC cross-check confirms the problem, the primary detection mechanism is range validation. This is likely a sensor fault (disconnection often shows extreme values on NTC thermistors).

Question: Which outlier detection method is most appropriate for battery voltage data that is naturally bounded between 3.0V and 4.2V with occasional voltage sags during transmission?

Explanation: The IQR method is ideal here because it uses percentiles rather than the mean and standard deviation. For bounded data with occasional extreme values (voltage sags), IQR correctly identifies the main distribution without being skewed by the extremes. A Z-score would be distorted by the natural bounds and the occasional sags, and machine learning methods would be overkill for this univariate case.

1306.7 Summary

Data validation and outlier detection form the first critical stages of the data quality pipeline:

  • Range Validation: Check values against physical bounds - temperature cannot exceed what the environment allows, humidity cannot exceed 100%
  • Rate-of-Change Validation: Physical systems have inertia - detect sensor faults by catching impossible jumps
  • Multi-Sensor Plausibility: Cross-validate related measurements using physical relationships
  • Z-Score Detection: Good for Gaussian distributions, uses standard deviations from mean
  • IQR Detection: Robust to skewed data and extreme values, uses percentiles
  • MAD Detection: Most robust method, based on median rather than mean

Critical Design Principle: Implement validation at the edge before data travels. Range and rate-of-change checks are computationally cheap and catch most sensor faults immediately.

1306.8 What’s Next

The next chapter covers Missing Value Imputation and Noise Filtering, exploring how to handle gaps in data and remove noise while preserving the underlying signal.
