1305  Data Quality and Preprocessing

1305.1 Overview

Data quality preprocessing is the foundation of trustworthy IoT analytics. Raw sensor data is inherently noisy, incomplete, and sometimes outright wrong. This series explores practical techniques for detecting and correcting data quality issues in real time on resource-constrained edge devices.

Tip: Minimum Viable Understanding (Data Quality Pipeline)

Core Concept: Data quality preprocessing is a multi-stage pipeline that validates, cleans, and transforms raw sensor readings into analysis-ready data, catching problems at the source before they propagate to decisions.

Why It Matters: A single corrupt sensor reading propagated through an IoT system can trigger false alarms, cause equipment shutdowns, or corrupt trained ML models. The cost of fixing a data quality issue grows 10-100x as bad data moves downstream. Preprocessing at the edge catches 90% of issues at 1% of the cost.

Key Takeaway: Implement the “validate-clean-transform” pattern at the edge: first validate readings against physical bounds and rate-of-change limits, then apply appropriate noise filters, and finally normalize for downstream analysis. Never skip validation: a temperature sensor cannot read -40°C indoors, and humidity cannot exceed 100%.
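To make the takeaway concrete, here is a minimal C++ sketch of the validate step for an indoor temperature sensor. The bounds and the rate-of-change limit are illustrative assumptions, not datasheet values; tune them to the sensor and deployment.

```cpp
#include <cmath>
#include <optional>

// Illustrative bounds for an indoor temperature sensor (assumed values, tune per deployment).
constexpr float kMinTempC   = -10.0f;  // indoor readings below this are physically implausible
constexpr float kMaxTempC   =  60.0f;
constexpr float kMaxRateCps =   2.0f;  // max plausible change in deg C per second (assumption)

// Returns true if the reading passes range and rate-of-change validation.
bool validateTemperature(float tempC, float dtSeconds, std::optional<float>& lastValid) {
    if (std::isnan(tempC) || tempC < kMinTempC || tempC > kMaxTempC)
        return false;                                   // range check against physical bounds
    if (lastValid && dtSeconds > 0.0f &&
        std::fabs(tempC - *lastValid) / dtSeconds > kMaxRateCps)
        return false;                                   // impossible jump between samples
    lastValid = tempC;                                  // accept and remember for the next check
    return true;
}
```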

1305.2 Chapter Series

This topic is covered in three focused chapters:

1305.2.1 Data Validation and Outlier Detection

The first stage of the data quality pipeline focuses on detecting invalid and anomalous readings; an outlier-detection sketch follows the list:

  • Range Validation: Check values against physical bounds
  • Rate-of-Change Validation: Detect impossible sensor jumps
  • Multi-Sensor Plausibility: Cross-validate related measurements
  • Z-Score Detection: Identify outliers in Gaussian distributions
  • IQR and MAD Detection: Robust outlier detection for skewed data
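As a sketch of the statistical checks above, the following sliding-window z-score detector flags readings more than three standard deviations from the recent mean. The window size and 3-sigma threshold are common defaults assumed here, not values prescribed by the chapter; for skewed data, the IQR and MAD variants are the more robust choice.

```cpp
#include <cmath>
#include <cstddef>
#include <deque>
#include <numeric>

// Sliding-window z-score outlier detector (window size and threshold are assumed defaults).
class ZScoreDetector {
public:
    explicit ZScoreDetector(std::size_t window = 32, float threshold = 3.0f)
        : window_(window), threshold_(threshold) {}

    // Returns true if x lies more than threshold_ standard deviations from the window mean.
    bool isOutlier(float x) {
        if (buf_.size() < window_) { buf_.push_back(x); return false; }  // warm-up period
        float mean = std::accumulate(buf_.begin(), buf_.end(), 0.0f) / buf_.size();
        float var = 0.0f;
        for (float v : buf_) var += (v - mean) * (v - mean);
        float sd = std::sqrt(var / buf_.size());
        bool outlier = sd > 0.0f && std::fabs(x - mean) / sd > threshold_;
        if (!outlier) { buf_.pop_front(); buf_.push_back(x); }  // learn only from inliers
        return outlier;
    }

private:
    std::size_t window_;
    float threshold_;
    std::deque<float> buf_;
};
```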

1305.2.2 Missing Value Imputation and Noise Filtering

The second stage handles gaps in data and removes noise while preserving the underlying signal; a filtering sketch follows the list:

  • Forward Fill: Simple imputation for slowly changing values
  • Linear Interpolation: Fill gaps in trending data
  • Seasonal Decomposition: Use periodic patterns for imputation
  • Sensor-Specific Strategies: Match imputation to sensor semantics
  • Moving Average and Median Filters: Smooth steady-state noise and remove spikes
  • Exponential Smoothing: Real-time filtering with tunable responsiveness
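A minimal sketch of the filtering stage, combining exponential smoothing with forward-fill, assuming NaN marks a missing sample and alpha = 0.2 as an illustrative default:

```cpp
#include <cmath>

// Exponential smoothing with forward-fill for gaps (NaN marks a missing sample).
// alpha near 1 tracks the raw signal closely; alpha near 0 smooths aggressively.
class EmaFilter {
public:
    explicit EmaFilter(float alpha = 0.2f) : alpha_(alpha) {}

    float update(float x) {
        if (std::isnan(x)) return state_;        // forward-fill: repeat the last estimate
        if (std::isnan(state_)) state_ = x;      // seed on the first valid reading
        else state_ = alpha_ * x + (1.0f - alpha_) * state_;
        return state_;
    }

private:
    float alpha_;
    float state_ = NAN;
};
```

Forward-fill suits slowly changing values such as room temperature; for trending data, linear interpolation between the samples bracketing a gap is the better match, at the cost of buffering until the gap closes.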

1305.2.3 Data Normalization and Preprocessing Lab

The final stage prepares data for analysis and provides hands-on practice; a scaling sketch follows the list:

  • Min-Max Scaling: Transform data to bounded ranges for neural networks
  • Z-Score Normalization: Center data for clustering and SVM
  • Robust Scaling: Outlier-resistant normalization
  • ESP32 Wokwi Lab: Complete data quality pipeline implementation
  • Challenge Exercises: Extend the pipeline with advanced techniques
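As a sketch of the transform stage, min-max scaling onto [0, 1] using fixed calibration bounds; the example bounds are assumptions standing in for whatever range the lab's sensors actually report:

```cpp
#include <algorithm>

// Min-max scaling onto [0, 1] given fixed calibration bounds (assumed known in advance).
// Clamping ensures downstream models never see out-of-range inputs.
float minMaxScale(float x, float lo, float hi) {
    return std::clamp((x - lo) / (hi - lo), 0.0f, 1.0f);
}

// Example: map an indoor temperature in [-10, 60] deg C onto [0, 1].
// float norm = minMaxScale(tempC, -10.0f, 60.0f);
```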

1305.3 The Complete Pipeline

```mermaid
%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ecf0f1'}}}%%
flowchart LR
    subgraph Input["Raw Data"]
        S1[Sensor<br/>Readings]
    end

    subgraph Validate["Stage 1: Validate"]
        V1[Range Check]
        V2[Rate-of-Change]
        V3[Plausibility]
    end

    subgraph Clean["Stage 2: Clean"]
        C1[Outlier Detection]
        C2[Missing Value<br/>Imputation]
        C3[Noise Filtering]
    end

    subgraph Transform["Stage 3: Transform"]
        T1[Normalization]
        T2[Scaling]
        T3[Feature<br/>Engineering]
    end

    subgraph Output["Clean Data"]
        O1[Analysis<br/>Ready]
    end

    S1 --> V1
    V1 --> V2
    V2 --> V3
    V3 --> C1
    C1 --> C2
    C2 --> C3
    C3 --> T1
    T1 --> T2
    T2 --> T3
    T3 --> O1

    style S1 fill:#E67E22,stroke:#2C3E50,color:#fff
    style V1 fill:#2C3E50,stroke:#16A085,color:#fff
    style V2 fill:#2C3E50,stroke:#16A085,color:#fff
    style V3 fill:#2C3E50,stroke:#16A085,color:#fff
    style C1 fill:#16A085,stroke:#2C3E50,color:#fff
    style C2 fill:#16A085,stroke:#2C3E50,color:#fff
    style C3 fill:#16A085,stroke:#2C3E50,color:#fff
    style T1 fill:#7F8C8D,stroke:#2C3E50,color:#fff
    style T2 fill:#7F8C8D,stroke:#2C3E50,color:#fff
    style T3 fill:#7F8C8D,stroke:#2C3E50,color:#fff
    style O1 fill:#27AE60,stroke:#2C3E50,color:#fff
```

Figure 1305.1: The three-stage data quality pipeline: Validate (check bounds), Clean (remove errors), and Transform (prepare for analysis).
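Putting the stages together, a hypothetical processSample function could chain the earlier sketches in the order the diagram shows. The names (validateTemperature, ZScoreDetector, EmaFilter, minMaxScale) come from the illustrative sketches above, not from the lab code:

```cpp
#include <cmath>
#include <optional>

// Chains the illustrative stages: validate -> clean -> transform.
// Depends on the sketches defined earlier in this section. Returns NAN
// while the pipeline has no estimate to emit yet.
float processSample(float raw, float dtSeconds,
                    std::optional<float>& lastValid,
                    ZScoreDetector& detector, EmaFilter& filter) {
    if (!validateTemperature(raw, dtSeconds, lastValid))  // Stage 1: validate
        raw = NAN;                                        // treat a rejected reading as a gap
    else if (detector.isOutlier(raw))                     // Stage 2: outlier check
        raw = NAN;
    float smoothed = filter.update(raw);                  // Stage 2: fill gaps and filter noise
    if (std::isnan(smoothed)) return NAN;
    return minMaxScale(smoothed, -10.0f, 60.0f);          // Stage 3: normalize for analysis
}
```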

1305.4 Learning Path

Recommended order:

  1. Start with Data Validation and Outlier Detection to understand how to catch invalid data at the source
  2. Continue with Missing Value Imputation and Noise Filtering to learn gap handling and signal smoothing
  3. Complete with Data Normalization and Preprocessing Lab for scaling techniques and hands-on practice
