1305  Data Quality and Preprocessing

1305.1 Overview

Data quality preprocessing is the foundation of trustworthy IoT analytics. Raw sensor data is inherently noisy, incomplete, and sometimes outright wrong. This series explores practical techniques for detecting and correcting data quality issues in real time on resource-constrained edge devices.

Tip: Minimum Viable Understanding (Data Quality Pipeline)

Core Concept: Data quality preprocessing is a multi-stage pipeline that validates, cleans, and transforms raw sensor readings into analysis-ready data, catching problems at the source before they propagate to decisions.

Why It Matters: A single corrupt sensor reading propagated through an IoT system can trigger false alarms, cause equipment shutdowns, or corrupt trained ML models. The cost of fixing a data quality issue grows 10-100x as bad data moves downstream. Preprocessing at the edge catches 90% of issues at 1% of the cost.

Key Takeaway: Implement the “validate-clean-transform” pattern at the edge: first validate readings against physical bounds and rate-of-change limits, then apply appropriate noise filters, and finally normalize for downstream analysis. Never skip validation: a temperature sensor cannot read -40°C indoors, and humidity cannot exceed 100%.
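To make the takeaway concrete, here is a minimal C++ sketch of the validate step for an indoor temperature sensor. The bounds and the rate-of-change limit are illustrative assumptions, not datasheet values; tune them to the sensor and deployment.

```cpp
#include <cmath>
#include <optional>

// Illustrative bounds for an indoor temperature sensor (assumed values, tune per deployment).
constexpr float kMinTempC   = -10.0f;  // indoor readings below this are physically implausible
constexpr float kMaxTempC   =  60.0f;
constexpr float kMaxRateCps =   2.0f;  // max plausible change in deg C per second (assumption)

// Returns true if the reading passes range and rate-of-change validation.
bool validateTemperature(float tempC, float dtSeconds, std::optional<float>& lastValid) {
    if (std::isnan(tempC) || tempC < kMinTempC || tempC > kMaxTempC)
        return false;                                   // range check against physical bounds
    if (lastValid && dtSeconds > 0.0f &&
        std::fabs(tempC - *lastValid) / dtSeconds > kMaxRateCps)
        return false;                                   // impossible jump between samples
    lastValid = tempC;                                  // accept and remember for the next check
    return true;
}
```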

1305.2 Chapter Series

This topic is covered in three focused chapters:

1305.2.1 Data Validation and Outlier Detection

The first stage of the data quality pipeline focuses on detecting invalid and anomalous readings; an outlier-detection sketch follows the list:

  • Range Validation: Check values against physical bounds
  • Rate-of-Change Validation: Detect impossible sensor jumps
  • Multi-Sensor Plausibility: Cross-validate related measurements
  • Z-Score Detection: Identify outliers in Gaussian distributions
  • IQR and MAD Detection: Robust outlier detection for skewed data
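As a sketch of the statistical checks above, the following sliding-window z-score detector flags readings more than three standard deviations from the recent mean. The window size and 3-sigma threshold are common defaults assumed here, not values prescribed by the chapter; for skewed data, the IQR and MAD variants are the more robust choice.

```cpp
#include <cmath>
#include <cstddef>
#include <deque>
#include <numeric>

// Sliding-window z-score outlier detector (window size and threshold are assumed defaults).
class ZScoreDetector {
public:
    explicit ZScoreDetector(std::size_t window = 32, float threshold = 3.0f)
        : window_(window), threshold_(threshold) {}

    // Returns true if x lies more than threshold_ standard deviations from the window mean.
    bool isOutlier(float x) {
        if (buf_.size() < window_) { buf_.push_back(x); return false; }  // warm-up period
        float mean = std::accumulate(buf_.begin(), buf_.end(), 0.0f) / buf_.size();
        float var = 0.0f;
        for (float v : buf_) var += (v - mean) * (v - mean);
        float sd = std::sqrt(var / buf_.size());
        bool outlier = sd > 0.0f && std::fabs(x - mean) / sd > threshold_;
        if (!outlier) { buf_.pop_front(); buf_.push_back(x); }  // learn only from inliers
        return outlier;
    }

private:
    std::size_t window_;
    float threshold_;
    std::deque<float> buf_;
};
```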

1305.2.2 Missing Value Imputation and Noise Filtering

The second stage handles gaps in data and removes noise while preserving the underlying signal; a filtering sketch follows the list:

  • Forward Fill: Simple imputation for slowly changing values
  • Linear Interpolation: Fill gaps in trending data
  • Seasonal Decomposition: Use periodic patterns for imputation
  • Sensor-Specific Strategies: Match imputation to sensor semantics
  • Moving Average and Median Filters: Smooth steady-state noise and remove spikes
  • Exponential Smoothing: Real-time filtering with tunable responsiveness
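A minimal sketch of the filtering stage, combining exponential smoothing with forward-fill, assuming NaN marks a missing sample and alpha = 0.2 as an illustrative default:

```cpp
#include <cmath>

// Exponential smoothing with forward-fill for gaps (NaN marks a missing sample).
// alpha near 1 tracks the raw signal closely; alpha near 0 smooths aggressively.
class EmaFilter {
public:
    explicit EmaFilter(float alpha = 0.2f) : alpha_(alpha) {}

    float update(float x) {
        if (std::isnan(x)) return state_;        // forward-fill: repeat the last estimate
        if (std::isnan(state_)) state_ = x;      // seed on the first valid reading
        else state_ = alpha_ * x + (1.0f - alpha_) * state_;
        return state_;
    }

private:
    float alpha_;
    float state_ = NAN;
};
```

Forward-fill suits slowly changing values such as room temperature; for trending data, linear interpolation between the samples bracketing a gap is the better match, at the cost of buffering until the gap closes.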

1305.2.3 Data Normalization and Preprocessing Lab

The final stage prepares data for analysis and provides hands-on practice; a scaling sketch follows the list:

  • Min-Max Scaling: Transform data to bounded ranges for neural networks
  • Z-Score Normalization: Center data for clustering and SVM
  • Robust Scaling: Outlier-resistant normalization
  • ESP32 Wokwi Lab: Complete data quality pipeline implementation
  • Challenge Exercises: Extend the pipeline with advanced techniques
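As a sketch of the transform stage, min-max scaling onto [0, 1] using fixed calibration bounds; the example bounds are assumptions standing in for whatever range the lab's sensors actually report:

```cpp
#include <algorithm>

// Min-max scaling onto [0, 1] given fixed calibration bounds (assumed known in advance).
// Clamping ensures downstream models never see out-of-range inputs.
float minMaxScale(float x, float lo, float hi) {
    return std::clamp((x - lo) / (hi - lo), 0.0f, 1.0f);
}

// Example: map an indoor temperature in [-10, 60] deg C onto [0, 1].
// float norm = minMaxScale(tempC, -10.0f, 60.0f);
```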

1305.3 The Complete Pipeline

```mermaid
%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ecf0f1'}}}%%
flowchart LR
    subgraph Input["Raw Data"]
        S1[Sensor<br/>Readings]
    end

    subgraph Validate["Stage 1: Validate"]
        V1[Range Check]
        V2[Rate-of-Change]
        V3[Plausibility]
    end

    subgraph Clean["Stage 2: Clean"]
        C1[Outlier Detection]
        C2[Missing Value<br/>Imputation]
        C3[Noise Filtering]
    end

    subgraph Transform["Stage 3: Transform"]
        T1[Normalization]
        T2[Scaling]
        T3[Feature<br/>Engineering]
    end

    subgraph Output["Clean Data"]
        O1[Analysis<br/>Ready]
    end

    S1 --> V1
    V1 --> V2
    V2 --> V3
    V3 --> C1
    C1 --> C2
    C2 --> C3
    C3 --> T1
    T1 --> T2
    T2 --> T3
    T3 --> O1

    style S1 fill:#E67E22,stroke:#2C3E50,color:#fff
    style V1 fill:#2C3E50,stroke:#16A085,color:#fff
    style V2 fill:#2C3E50,stroke:#16A085,color:#fff
    style V3 fill:#2C3E50,stroke:#16A085,color:#fff
    style C1 fill:#16A085,stroke:#2C3E50,color:#fff
    style C2 fill:#16A085,stroke:#2C3E50,color:#fff
    style C3 fill:#16A085,stroke:#2C3E50,color:#fff
    style T1 fill:#7F8C8D,stroke:#2C3E50,color:#fff
    style T2 fill:#7F8C8D,stroke:#2C3E50,color:#fff
    style T3 fill:#7F8C8D,stroke:#2C3E50,color:#fff
    style O1 fill:#27AE60,stroke:#2C3E50,color:#fff
```

Figure 1305.1: The three-stage data quality pipeline: Validate (check bounds), Clean (remove errors), and Transform (prepare for analysis).
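Putting the stages together, a hypothetical processSample function could chain the earlier sketches in the order the diagram shows. The names (validateTemperature, ZScoreDetector, EmaFilter, minMaxScale) come from the illustrative sketches above, not from the lab code:

```cpp
#include <cmath>
#include <optional>

// Chains the illustrative stages: validate -> clean -> transform.
// Depends on the sketches defined earlier in this section. Returns NAN
// while the pipeline has no estimate to emit yet.
float processSample(float raw, float dtSeconds,
                    std::optional<float>& lastValid,
                    ZScoreDetector& detector, EmaFilter& filter) {
    if (!validateTemperature(raw, dtSeconds, lastValid))  // Stage 1: validate
        raw = NAN;                                        // treat a rejected reading as a gap
    else if (detector.isOutlier(raw))                     // Stage 2: outlier check
        raw = NAN;
    float smoothed = filter.update(raw);                  // Stage 2: fill gaps and filter noise
    if (std::isnan(smoothed)) return NAN;
    return minMaxScale(smoothed, -10.0f, 60.0f);          // Stage 3: normalize for analysis
}
```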

1305.4 Learning Path

Recommended order:

  1. Start with Data Validation and Outlier Detection to understand how to catch invalid data at the source
  2. Continue with Missing Value Imputation and Noise Filtering to learn gap handling and signal smoothing
  3. Complete with Data Normalization and Preprocessing Lab for scaling techniques and hands-on practice
