```mermaid
%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ecf0f1'}}}%%
flowchart LR
subgraph Input["Raw Data"]
S1[Sensor<br/>Readings]
end
subgraph Validate["Stage 1: Validate"]
V1[Range Check]
V2[Rate-of-Change]
V3[Plausibility]
end
subgraph Clean["Stage 2: Clean"]
C1[Outlier Detection]
C2[Missing Value<br/>Imputation]
C3[Noise Filtering]
end
subgraph Transform["Stage 3: Transform"]
T1[Normalization]
T2[Scaling]
T3[Feature<br/>Engineering]
end
subgraph Output["Clean Data"]
O1[Analysis<br/>Ready]
end
S1 --> V1
V1 --> V2
V2 --> V3
V3 --> C1
C1 --> C2
C2 --> C3
C3 --> T1
T1 --> T2
T2 --> T3
T3 --> O1
style S1 fill:#E67E22,stroke:#2C3E50,color:#fff
style V1 fill:#2C3E50,stroke:#16A085,color:#fff
style V2 fill:#2C3E50,stroke:#16A085,color:#fff
style V3 fill:#2C3E50,stroke:#16A085,color:#fff
style C1 fill:#16A085,stroke:#2C3E50,color:#fff
style C2 fill:#16A085,stroke:#2C3E50,color:#fff
style C3 fill:#16A085,stroke:#2C3E50,color:#fff
style T1 fill:#7F8C8D,stroke:#2C3E50,color:#fff
style T2 fill:#7F8C8D,stroke:#2C3E50,color:#fff
style T3 fill:#7F8C8D,stroke:#2C3E50,color:#fff
style O1 fill:#27AE60,stroke:#2C3E50,color:#fff
```
1305 Data Quality and Preprocessing
1305.1 Overview
Data quality preprocessing is the foundation of trustworthy IoT analytics. Raw sensor data is inherently noisy, incomplete, and sometimes outright wrong. This series explores practical techniques for detecting and correcting data quality issues in real time on resource-constrained edge devices.
Core Concept: Data quality preprocessing is a multi-stage pipeline that validates, cleans, and transforms raw sensor readings into analysis-ready data, catching problems at the source before they propagate into decisions.
Why It Matters: A single corrupt sensor reading propagated through an IoT system can trigger false alarms, cause equipment shutdowns, or corrupt trained ML models. The cost of fixing a data quality issue grows sharply as bad data moves downstream; a common rule of thumb puts the multiplier at 10-100x per stage. Preprocessing at the edge catches roughly 90% of issues at about 1% of that cost.
Key Takeaway: Implement the "validate-clean-transform" pattern at the edge: first validate readings against physical bounds and rate-of-change limits, then apply appropriate noise filters, and finally normalize for downstream analysis. Never skip validation: a temperature sensor cannot legitimately read -40 °C indoors, and humidity cannot exceed 100%. A minimal validation sketch follows.
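To make the first step concrete, here is a minimal validation sketch in C++. The function name, the temperature bounds, and the rate-of-change limit are illustrative assumptions, not values from this series; real limits come from the sensor's datasheet and the deployment context.

```cpp
#include <cmath>
#include <optional>

// Illustrative bounds for an indoor temperature sensor (assumptions):
// physical range and maximum plausible change between samples.
constexpr float TEMP_MIN_C  = -10.0f;
constexpr float TEMP_MAX_C  = 60.0f;
constexpr float MAX_DELTA_C = 2.0f;

// Validate one reading against range and rate-of-change limits.
// Returns the reading if it passes, std::nullopt if it must be rejected.
std::optional<float> validateTemperature(float reading,
                                         std::optional<float> previous) {
    // Stage 1a: range check against physical bounds.
    if (reading < TEMP_MIN_C || reading > TEMP_MAX_C) {
        return std::nullopt;
    }
    // Stage 1b: rate-of-change check against the last accepted value.
    if (previous && std::fabs(reading - *previous) > MAX_DELTA_C) {
        return std::nullopt;
    }
    return reading;
}
```

A rejected reading is not simply dropped: it becomes a gap that the imputation strategies in the second chapter are designed to fill.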
1305.2 Chapter Series
This topic is covered in three focused chapters:
1305.2.1 Data Validation and Outlier Detection
The first stage of the data quality pipeline focuses on detecting invalid and anomalous readings (a sketch of the two statistical detectors follows the list):
- Range Validation: Check values against physical bounds
- Rate-of-Change Validation: Detect impossible sensor jumps
- Multi-Sensor Plausibility: Cross-validate related measurements
- Z-Score Detection: Identify outliers in Gaussian distributions
- IQR and MAD Detection: Robust outlier detection for skewed data
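As a rough illustration of the last two bullets, here is one way the z-score and MAD detectors could look in C++. The function names and the default threshold k = 3 are assumptions for illustration; the MAD detector is the robust choice when the data is skewed or already contains outliers, because the mean and standard deviation are themselves distorted by the very outliers a z-score test is trying to find.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Median of a copy of the data (helper for the MAD detector below).
float median(std::vector<float> v) {
    std::sort(v.begin(), v.end());
    const std::size_t n = v.size();
    return (n % 2 != 0) ? v[n / 2] : 0.5f * (v[n / 2 - 1] + v[n / 2]);
}

// Z-score detector: flags x when it lies more than k standard deviations
// from the mean. Appropriate for roughly Gaussian data.
bool isOutlierZScore(const std::vector<float>& data, float x, float k = 3.0f) {
    float mean = 0.0f;
    for (float v : data) mean += v;
    mean /= static_cast<float>(data.size());

    float var = 0.0f;
    for (float v : data) var += (v - mean) * (v - mean);
    const float sd = std::sqrt(var / static_cast<float>(data.size()));

    return sd > 0.0f && std::fabs(x - mean) / sd > k;
}

// MAD detector: a robust alternative built on the median absolute
// deviation. The 1.4826 factor makes MAD comparable to a standard
// deviation for Gaussian data, so the same threshold k can be reused.
bool isOutlierMAD(const std::vector<float>& data, float x, float k = 3.0f) {
    const float med = median(data);
    std::vector<float> dev;
    dev.reserve(data.size());
    for (float v : data) dev.push_back(std::fabs(v - med));
    const float mad = 1.4826f * median(dev);
    return mad > 0.0f && std::fabs(x - med) / mad > k;
}
```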
1305.2.2 Missing Value Imputation and Noise Filtering
The second stage handles gaps in the data and removes noise while preserving the underlying signal (a filtering sketch follows the list):
- Forward Fill: Simple imputation for slowly changing values
- Linear Interpolation: Fill gaps in trending data
- Seasonal Decomposition: Use periodic patterns for imputation
- Sensor-Specific Strategies: Match imputation to sensor semantics
- Moving Average and Median Filters: Smooth steady-state noise and remove spikes
- Exponential Smoothing: Real-time filtering with tunable responsiveness
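A minimal sketch of two of these filters, assuming per-sample processing on a microcontroller; the struct and function names are illustrative, and the smoothing factor alpha would be tuned per sensor:

```cpp
#include <algorithm>

// Exponential smoothing: alpha near 1 tracks the signal quickly, alpha
// near 0 smooths aggressively. State is a single float, cheap enough to
// run per sample on a microcontroller.
struct ExpSmoother {
    float alpha = 0.2f;          // responsiveness (tune per sensor)
    float state = 0.0f;
    bool  initialized = false;

    float update(float x) {
        if (!initialized) { state = x; initialized = true; }
        else              { state = alpha * x + (1.0f - alpha) * state; }
        return state;
    }
};

// Three-point median filter: removes single-sample spikes while
// preserving edges better than a moving average does.
float median3(float a, float b, float c) {
    return std::max(std::min(a, b), std::min(std::max(a, b), c));
}
```

A common arrangement applies the median filter first to strip spikes, then the smoother to damp the remaining steady-state noise.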
1305.2.3 Data Normalization and Preprocessing Lab
The final stage prepares data for analysis and provides hands-on practice (a scaling sketch follows the list):
- Min-Max Scaling: Transform data to bounded ranges for neural networks
- Z-Score Normalization: Center data for clustering and SVM
- Robust Scaling: Outlier-resistant normalization
- ESP32 Wokwi Lab: Complete data quality pipeline implementation
- Challenge Exercises: Extend the pipeline with advanced techniques
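Each scaling technique reduces to a one-line transform once its statistics are known. A minimal sketch, assuming the bounds, mean and standard deviation, and median and IQR have already been estimated (for example, from a calibration window); the function names are illustrative:

```cpp
// Min-max scaling to [0, 1]: requires known (or estimated) bounds.
float minMaxScale(float x, float lo, float hi) {
    return (x - lo) / (hi - lo);
}

// Z-score normalization: centers on the mean and scales by the
// standard deviation.
float zScoreNormalize(float x, float mean, float stddev) {
    return (x - mean) / stddev;
}

// Robust scaling: centers on the median and scales by the interquartile
// range, so a few extreme values do not distort the result.
float robustScale(float x, float median, float iqr) {
    return (x - median) / iqr;
}
```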
1305.3 The Complete Pipeline
The flowchart at the top of this page shows the full flow: raw sensor readings pass through validation (range, rate-of-change, plausibility), cleaning (outlier detection, imputation, noise filtering), and transformation (normalization, scaling, feature engineering) before emerging analysis-ready.
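To show how the stages compose on-device, here is a hypothetical single-channel pipeline in C++ that chains the sketches above; all names and constants are illustrative assumptions, not part of this series:

```cpp
#include <cmath>
#include <optional>

// Hypothetical single-channel pipeline composing the three stages:
// validate -> clean -> transform. All constants are placeholders.
struct Pipeline {
    float lastGood = NAN;   // last accepted reading (rate-of-change check)
    float smoothed = NAN;   // exponential-smoothing state
    float alpha    = 0.2f;  // smoothing factor (illustrative)

    // Returns a value scaled to [0, 1], or std::nullopt when the raw
    // reading fails validation and should be handled by imputation.
    std::optional<float> process(float raw) {
        // Stage 1: validate against physical bounds and rate of change.
        if (raw < -10.0f || raw > 60.0f) return std::nullopt;
        if (!std::isnan(lastGood) && std::fabs(raw - lastGood) > 2.0f)
            return std::nullopt;
        lastGood = raw;

        // Stage 2: clean with exponential smoothing.
        smoothed = std::isnan(smoothed)
                       ? raw
                       : alpha * raw + (1.0f - alpha) * smoothed;

        // Stage 3: transform with min-max scaling over the valid range.
        return (smoothed + 10.0f) / 70.0f;
    }
};
```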
1305.4 Learning Path
Recommended order:
- Start with Data Validation and Outlier Detection to understand how to catch invalid data at the source
- Continue with Missing Value Imputation and Noise Filtering to learn gap handling and signal smoothing
- Complete with Data Normalization and Preprocessing Lab for scaling techniques and hands-on practice
Prerequisites:
- Edge Data Acquisition - Where raw data originates
- Sensor Fundamentals - Understanding sensor characteristics
- Signal Processing Essentials - Basic filtering concepts
Next Steps:
- Multi-Sensor Data Fusion - Combining preprocessed data
- Anomaly Detection - Finding meaningful outliers in clean data
- Stream Processing - Real-time data pipelines