%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ecf0f1'}}}%%
flowchart TD
Start[Choose Normalization<br/>Method] --> Outliers{Significant<br/>Outliers?}
Outliers -->|Yes| Robust[Robust Scaling<br/>Median/IQR]
Outliers -->|No| Downstream{Downstream<br/>Use Case?}
Downstream -->|Neural Network| MinMax[Min-Max<br/>0 to 1]
Downstream -->|Clustering/SVM| ZScore[Z-Score<br/>Mean=0, Std=1]
Downstream -->|Visualization| MinMax
Start --> Skewed{Data<br/>Distribution?}
Skewed -->|Right-Skewed| Log[Log Transform<br/>Then Normalize]
Skewed -->|Normal| Continue[Continue to<br/>Other Checks]
style Start fill:#2C3E50,stroke:#16A085,color:#fff
style MinMax fill:#27AE60,stroke:#2C3E50,color:#fff
style ZScore fill:#16A085,stroke:#2C3E50,color:#fff
style Robust fill:#E67E22,stroke:#2C3E50,color:#fff
style Log fill:#9B59B6,stroke:#2C3E50,color:#fff
1308 Data Normalization and Preprocessing Lab
1308.1 Learning Objectives
By the end of this chapter, you will be able to:
- Normalize and Scale Data: Apply min-max scaling, Z-score normalization, and other techniques for multi-sensor data fusion
- Choose Normalization Methods: Select appropriate scaling based on downstream use case (neural networks, clustering, visualization)
- Implement a Complete Pipeline: Build an end-to-end data quality system on an ESP32 microcontroller
- Monitor Data Quality Metrics: Track validation rates, outlier counts, and imputation statistics
1308.2 Prerequisites
Before diving into this chapter, you should be familiar with:
- Data Validation and Outlier Detection: Understanding validation as the first stage of data quality
- Missing Value Imputation and Noise Filtering: Handling gaps and noise in sensor data
- Edge Data Acquisition: Understanding how sensor data is collected at the edge
Imagine comparing apples and elephants. Temperature might range from -10 to 50 degrees Celsius, while light levels range from 0 to 100,000 lux. If you feed both to a scale-sensitive machine learning model without normalizing, the model will treat light as roughly 2,000 times more important simply because its numbers are bigger!
Normalization puts all sensors on the same scale:
| Before Normalization | After Min-Max (0-1) |
|---|---|
| Temperature: 25C | Temperature: 0.58 |
| Light: 50,000 lux | Light: 0.50 |
Now both contribute equally based on their actual information content, not their arbitrary unit scales.
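The table values can be checked by hand. The sketch below assumes the ranges quoted earlier (-10 to 50 C for temperature, 0 to 100,000 lux for light); the helper name `min_max` is illustrative only.

```python
def min_max(value, lo, hi):
    """Scale value from the range [lo, hi] to [0, 1]."""
    if hi == lo:
        raise ValueError("range must be non-zero")
    return (value - lo) / (hi - lo)

# Assumed sensor ranges, taken from the example in the text
temp_norm = min_max(25.0, -10.0, 50.0)           # (25 + 10) / 60
light_norm = min_max(50_000.0, 0.0, 100_000.0)   # 50,000 / 100,000

print(round(temp_norm, 2), round(light_norm, 2))  # 0.58 0.5
```

Both readings now sit near the middle of the 0 to 1 range, so neither dominates purely because of its units.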
When to use different methods:
| Method | Use When | Output Range |
|---|---|---|
| Min-Max | Need bounded outputs (neural networks) | 0 to 1 |
| Z-Score | Using K-means/SVM; outliers should keep their influence | Mean=0, StdDev=1 |
| Robust | Many outliers you want to ignore | Median-centered |
| Log | Data spans orders of magnitude | Compressed scale |
Key question this chapter answers: “How do I prepare multi-sensor data so it can be combined and analyzed fairly?”
Core Concept: Normalization transforms sensor readings to a common scale, enabling fair comparison and combination of data from sensors with vastly different measurement ranges.
Why It Matters: Without normalization, a light sensor reading 100,000 lux would dominate a temperature sensor reading 25C in any scale-sensitive analysis, even though neither reading is inherently more informative. Many machine learning models and statistical methods (K-means, SVMs, gradient-trained neural networks) work best when features are on comparable scales.
Key Takeaway: Use min-max scaling (0-1) for neural networks and bounded outputs; use Z-score normalization for clustering and when outliers should retain their influence; use robust scaling when outliers should be dampened.
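To see how a single spike affects each method, the following sketch scales the same small series three ways using only the Python standard library. The data and function names are made up for illustration; the quartiles use linear interpolation, which is one of several common conventions.

```python
import statistics

readings = [20.0, 21.0, 22.0, 23.0, 24.0, 100.0]  # last value is a spike

def min_max_scale(xs):
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def z_score_scale(xs):
    mean = statistics.fmean(xs)
    std = statistics.pstdev(xs)  # population standard deviation
    return [(x - mean) / std for x in xs]

def robust_scale(xs):
    # quantiles(n=4) returns the three quartile cut points: Q1, median, Q3
    q1, med, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    return [(x - med) / (q3 - q1) for x in xs]

for name, fn in [("min-max", min_max_scale),
                 ("z-score", z_score_scale),
                 ("robust", robust_scale)]:
    print(name, [round(v, 2) for v in fn(readings)])
```

With min-max, the five normal readings are squeezed into the bottom 5% of the range because the spike defines the maximum. With Z-score, the spike inflates the mean and standard deviation but stays within a few standard deviations. With robust scaling, the normal readings keep a usable spread and the spike stands out as a very large value.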
1308.3 Data Normalization and Scaling
1308.3.1 Min-Max Scaling
Scales data to a fixed range (typically 0-1):
class MinMaxScaler:
    def __init__(self, feature_range=(0, 1)):
        self.min_val = None
        self.max_val = None
        self.feature_min, self.feature_max = feature_range

    def fit(self, data):
        """Learn min/max from training data"""
        self.min_val = min(data)
        self.max_val = max(data)

    def transform(self, value):
        """Scale a single value"""
        if self.max_val == self.min_val:
            return self.feature_min
        scaled = (value - self.min_val) / (self.max_val - self.min_val)
        return scaled * (self.feature_max - self.feature_min) + self.feature_min

    def inverse_transform(self, scaled_value):
        """Convert back to original scale"""
        original = (scaled_value - self.feature_min) / (self.feature_max - self.feature_min)
        return original * (self.max_val - self.min_val) + self.min_val

1308.3.2 Z-Score Normalization (Standardization)
Centers data at zero mean and scales it to unit variance:
import numpy as np

class ZScoreNormalizer:
    def __init__(self):
        self.mean = None
        self.std = None

    def fit(self, data):
        """Learn mean and std from training data"""
        self.mean = np.mean(data)
        self.std = np.std(data)  # population std (ddof=0)

    def transform(self, value):
        """Normalize a single value"""
        if self.std == 0:
            return 0.0
        return (value - self.mean) / self.std

    def inverse_transform(self, normalized_value):
        """Convert back to original scale"""
        return normalized_value * self.std + self.mean

1308.3.3 Robust Scaling
Uses median and IQR, robust to outliers:
import numpy as np

class RobustScaler:
    def __init__(self):
        self.median = None
        self.iqr = None

    def fit(self, data):
        """Learn median and IQR from training data"""
        self.median = np.median(data)
        q1 = np.percentile(data, 25)
        q3 = np.percentile(data, 75)
        self.iqr = q3 - q1

    def transform(self, value):
        """Scale using median and IQR"""
        if self.iqr == 0:
            return 0.0
        return (value - self.median) / self.iqr

    def inverse_transform(self, scaled_value):
        """Convert back to original scale"""
        return scaled_value * self.iqr + self.median

1308.3.4 When to Use Each Method
| Method | Use Case | Preserves | Sensitive To |
|---|---|---|---|
| Min-Max | Neural networks, bounded outputs | Distribution shape | Outliers |
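| Z-Score | SVM, K-means, when outliers should keep their influence | Relative distances | Outliers (they shift the mean and inflate the standard deviation) |
| Robust Scaling | When outliers should not affect the scale | Relative ordering; outliers stay extreme | Near-constant data (IQR close to 0) |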
| Log Transform | Right-skewed data (power, counts) | Multiplicative relationships | Zero/negative values |
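The log-transform row can be illustrated with a short sketch. It assumes `log1p` (the log of 1 + x) as one common way to tolerate zero readings; the function name and the sample lux values are hypothetical, and negative inputs are rejected rather than silently shifted.

```python
import math

def log_then_min_max(values):
    """Compress the scale with log1p, then min-max scale to [0, 1]."""
    if any(v < 0 for v in values):
        raise ValueError("negative readings must be shifted or clipped first")
    logged = [math.log1p(v) for v in values]
    lo, hi = min(logged), max(logged)
    if hi == lo:
        return [0.0 for _ in logged]
    return [(x - lo) / (hi - lo) for x in logged]

lux = [0.0, 10.0, 100.0, 1_000.0, 100_000.0]

# Plain min-max for comparison: small readings collapse toward zero
plain = [(v - min(lux)) / (max(lux) - min(lux)) for v in lux]
scaled = log_then_min_max(lux)

print([round(v, 3) for v in plain])
print([round(v, 3) for v in scaled])
```

With plain min-max, a reading of 100 lux maps to 0.001 and is nearly indistinguishable from darkness. After the log step, the same reading lands well inside the range, so differences at low light levels remain visible.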
1308.3.5 Normalization Decision Tree
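The flowchart at the top of this chapter can also be written as a small function. This is one possible reading of the diagram, not a definitive rule set: the use-case labels are hypothetical strings, and the sketch assumes the skew check runs first because the diagram says "Log Transform Then Normalize".

```python
def choose_normalization(significant_outliers, use_case, right_skewed=False):
    """Return the ordered preprocessing steps suggested by the flowchart."""
    steps = []
    if right_skewed:
        steps.append("log transform")
    if significant_outliers:
        steps.append("robust scaling (median/IQR)")
    elif use_case in ("neural_network", "visualization"):
        steps.append("min-max (0 to 1)")
    elif use_case in ("clustering", "svm"):
        steps.append("z-score (mean=0, std=1)")
    else:
        raise ValueError(f"unknown use case: {use_case!r}")
    return steps

print(choose_normalization(False, "neural_network"))
print(choose_normalization(True, "clustering"))
print(choose_normalization(False, "svm", right_skewed=True))
```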
1308.4 Data Quality Lab: ESP32 Wokwi Simulation
1308.4.1 Lab Overview
In this hands-on lab, you will implement a complete data quality preprocessing pipeline on an ESP32 microcontroller. The simulation demonstrates real-world techniques for handling sensor data problems including outliers, missing values, noise, and the need for normalization.
What You Will Learn:
- Sensor Data Validation: Implementing range checks and rate-of-change validation
- Outlier Detection: Using Z-score and IQR methods to identify anomalous readings
- Missing Value Handling: Forward-fill and interpolation techniques for gap handling
- Noise Filtering: Moving average, median filter, and exponential smoothing
- Data Normalization: Min-max scaling and Z-score normalization for multi-sensor fusion
Skills Practiced:
- Real-time data processing on embedded systems
- Statistical calculations with limited memory
- Circular buffer implementations
- Streaming algorithm design
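Before moving to the ESP32, it can help to prototype the streaming ideas on a desktop. The sketch below is an assumption-laden illustration, not the lab code: it uses `collections.deque` with `maxlen` as a fixed-size circular buffer and Welford's online algorithm for mean and variance. The class name `RunningStats` is hypothetical.

```python
from collections import deque
import math

class RunningStats:
    """Welford's online algorithm: mean and variance in constant memory."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the running mean

    def add(self, value):
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    def variance(self):
        """Sample variance (divides by n - 1)."""
        return self._m2 / (self.count - 1) if self.count > 1 else 0.0

    def std_dev(self):
        return math.sqrt(self.variance())

window = deque(maxlen=5)  # oldest sample is dropped once the buffer is full
stats = RunningStats()
for sample in [21.0, 21.5, 22.0, 22.5, 23.0, 23.5]:
    window.append(sample)
    stats.add(sample)

print(list(window))
print(stats.mean, stats.variance())
```

Welford's update avoids subtracting two large, nearly equal sums, which matters on microcontrollers that use 32-bit floats and accumulate values as large as 100,000 lux.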
1308.4.2 Lab Components
| Technique | Implementation | Visual Indicator |
|---|---|---|
| Range Validation | Physical bounds checking | Red LED on violation |
| Z-Score Outliers | Running statistics over accepted samples | Yellow LED on outlier |
| Median Filter | 5-sample sliding window | Smoothed output in serial |
| Moving Average | 10-sample window | Trend visualization |
| Normalization | 0-1 scaling | Normalized value in serial output |
1308.4.3 Wokwi Simulator
Use the embedded simulator below to build your data quality preprocessing system:
1308.4.4 Circuit Setup
Connect the sensors and indicators to the ESP32:
| Component | ESP32 Pin | Purpose |
|---|---|---|
| Temperature Sensor (NTC) | GPIO 34 | Primary data source |
| Light Sensor (LDR) | GPIO 35 | Secondary data source |
| Potentiometer | GPIO 32 | Simulate sensor drift |
| Red LED | GPIO 18 | Range violation indicator |
| Yellow LED | GPIO 19 | Outlier detection indicator |
| Green LED | GPIO 21 | Valid data indicator |
| Blue LED | GPIO 22 | Missing data indicator |
Add this diagram.json configuration in Wokwi:
{
  "version": 1,
  "author": "IoT Class - Data Quality Lab",
  "editor": "wokwi",
  "parts": [
    { "type": "wokwi-esp32-devkit-v1", "id": "esp", "top": 0, "left": 0 },
    { "type": "wokwi-ntc-temperature-sensor", "id": "temp1", "top": -120, "left": 80 },
    { "type": "wokwi-photoresistor-sensor", "id": "ldr1", "top": -120, "left": 180 },
    { "type": "wokwi-potentiometer", "id": "pot1", "top": -120, "left": 280 },
    { "type": "wokwi-led", "id": "led_red", "top": 180, "left": 80, "attrs": { "color": "red" } },
    { "type": "wokwi-led", "id": "led_yellow", "top": 180, "left": 130, "attrs": { "color": "yellow" } },
    { "type": "wokwi-led", "id": "led_green", "top": 180, "left": 180, "attrs": { "color": "green" } },
    { "type": "wokwi-led", "id": "led_blue", "top": 180, "left": 230, "attrs": { "color": "blue" } },
    { "type": "wokwi-resistor", "id": "r1", "top": 230, "left": 80, "attrs": { "value": "220" } },
    { "type": "wokwi-resistor", "id": "r2", "top": 230, "left": 130, "attrs": { "value": "220" } },
    { "type": "wokwi-resistor", "id": "r3", "top": 230, "left": 180, "attrs": { "value": "220" } },
    { "type": "wokwi-resistor", "id": "r4", "top": 230, "left": 230, "attrs": { "value": "220" } }
  ],
  "connections": [
    ["esp:GND.1", "temp1:GND", "black", ["h0"]],
    ["esp:3V3", "temp1:VCC", "red", ["h0"]],
    ["esp:34", "temp1:OUT", "green", ["h0"]],
    ["esp:GND.1", "ldr1:GND", "black", ["h0"]],
    ["esp:3V3", "ldr1:VCC", "red", ["h0"]],
    ["esp:35", "ldr1:OUT", "orange", ["h0"]],
    ["esp:GND.1", "pot1:GND", "black", ["h0"]],
    ["esp:3V3", "pot1:VCC", "red", ["h0"]],
    ["esp:32", "pot1:SIG", "purple", ["h0"]],
    ["esp:18", "led_red:A", "red", ["h0"]],
    ["led_red:C", "r1:1", "black", ["h0"]],
    ["r1:2", "esp:GND.1", "black", ["h0"]],
    ["esp:19", "led_yellow:A", "yellow", ["h0"]],
    ["led_yellow:C", "r2:1", "black", ["h0"]],
    ["r2:2", "esp:GND.1", "black", ["h0"]],
    ["esp:21", "led_green:A", "green", ["h0"]],
    ["led_green:C", "r3:1", "black", ["h0"]],
    ["r3:2", "esp:GND.1", "black", ["h0"]],
    ["esp:22", "led_blue:A", "blue", ["h0"]],
    ["led_blue:C", "r4:1", "black", ["h0"]],
    ["r4:2", "esp:GND.1", "black", ["h0"]]
  ]
}
1308.4.5 Complete Arduino Code
Copy this code into the Wokwi editor:
// ============================================================================
// DATA QUALITY LAB: Comprehensive IoT Data Preprocessing Pipeline
// ============================================================================
// Demonstrates: Validation, Outlier Detection, Missing Value Handling,
// Noise Filtering, Normalization, and Data Scaling
// ============================================================================
#include <Arduino.h>
#include <math.h>
// ====================== PIN DEFINITIONS ======================
const int TEMP_PIN = 34; // Temperature sensor (NTC)
const int LIGHT_PIN = 35; // Light sensor (LDR)
const int DRIFT_PIN = 32; // Potentiometer for drift simulation
const int LED_RED = 18; // Range violation indicator
const int LED_YELLOW = 19; // Outlier detection indicator
const int LED_GREEN = 21; // Valid data indicator
const int LED_BLUE = 22; // Missing data indicator
// ====================== SAMPLING PARAMETERS ======================
const int SAMPLE_INTERVAL_MS = 200; // 5 Hz sampling rate
const int VALIDATION_WINDOW = 50; // Samples for statistics
const int FILTER_WINDOW_MEDIAN = 5; // Median filter window
const int FILTER_WINDOW_MA = 10; // Moving average window
const float EXP_SMOOTHING_ALPHA = 0.3; // Exponential smoothing factor
// ====================== VALIDATION THRESHOLDS ======================
// Temperature sensor valid range (Celsius after conversion)
const float TEMP_MIN_VALID = -10.0;
const float TEMP_MAX_VALID = 60.0;
const float TEMP_MAX_RATE = 2.0; // Max 2C change per second
// Light sensor valid range (0-4095 ADC, 0-100000 lux mapped)
const float LIGHT_MIN_VALID = 0.0;
const float LIGHT_MAX_VALID = 100000.0;
// Outlier detection thresholds
const float ZSCORE_THRESHOLD = 3.0; // Standard deviations
const float IQR_MULTIPLIER = 1.5; // IQR outlier multiplier
const float MAD_THRESHOLD = 3.5; // Modified Z-score threshold
// Missing data parameters
const int MAX_MISSING_SAMPLES = 10; // Max consecutive missing allowed
const float MISSING_PROBABILITY = 0.05; // Simulate 5% missing data
// ====================== DATA STRUCTURES ======================
// Circular buffer for streaming statistics
template<int SIZE>
struct CircularBuffer {
float data[SIZE];
int head;
int count;
CircularBuffer() : head(0), count(0) {
for (int i = 0; i < SIZE; i++) data[i] = 0;
}
void push(float value) {
data[head] = value;
head = (head + 1) % SIZE;
if (count < SIZE) count++;
}
float get(int index) const {
// Get value at index (0 = oldest)
int actual = (head - count + index + SIZE) % SIZE;
return data[actual];
}
bool isFull() const { return count == SIZE; }
int size() const { return count; }
};
// Statistics calculator for streaming data
struct StreamingStats {
float sum;
float sumSq;
int count;
float min_val;
float max_val;
StreamingStats() : sum(0), sumSq(0), count(0),
min_val(INFINITY), max_val(-INFINITY) {}
void reset() {
sum = sumSq = 0;
count = 0;
min_val = INFINITY;
max_val = -INFINITY;
}
void add(float value) {
sum += value;
sumSq += value * value;
count++;
if (value < min_val) min_val = value;
if (value > max_val) max_val = value;
}
float mean() const { return count > 0 ? sum / count : 0; }
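float variance() const {
if (count < 2) return 0;
// sumSq - sum^2/n can turn slightly negative from float rounding; clamp at 0
float v = (sumSq - sum * sum / count) / (count - 1);
return v > 0 ? v : 0;
}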
float stdDev() const { return sqrt(variance()); }
};
// Sensor reading with quality metadata
struct SensorReading {
float rawValue;
float cleanedValue;
float normalizedValue;
unsigned long timestamp;
bool isValid;
bool isOutlier;
bool isMissing;
bool isImputed;
String qualityFlags;
};
// ====================== GLOBAL STATE ======================
// Circular buffers for different processing stages
CircularBuffer<VALIDATION_WINDOW> tempRawBuffer;
CircularBuffer<VALIDATION_WINDOW> tempCleanBuffer;
CircularBuffer<FILTER_WINDOW_MEDIAN> tempMedianBuffer;
CircularBuffer<FILTER_WINDOW_MA> tempMABuffer;
CircularBuffer<VALIDATION_WINDOW> lightRawBuffer;
CircularBuffer<VALIDATION_WINDOW> lightCleanBuffer;
// Streaming statistics
StreamingStats tempStats;
StreamingStats lightStats;
// For rate-of-change validation
float lastValidTemp = NAN;
unsigned long lastValidTempTime = 0;
float lastValidLight = NAN;
unsigned long lastValidLightTime = 0;
// For exponential smoothing
float expSmoothedTemp = NAN;
float expSmoothedLight = NAN;
// For missing data handling
int consecutiveMissingTemp = 0;
int consecutiveMissingLight = 0;
float lastImputedTemp = NAN;
float lastImputedLight = NAN;
// Normalization parameters (learned from data)
float tempMinSeen = INFINITY;
float tempMaxSeen = -INFINITY;
float lightMinSeen = INFINITY;
float lightMaxSeen = -INFINITY;
// Statistics counters
unsigned long totalSamples = 0;
unsigned long validSamples = 0;
unsigned long outlierSamples = 0;
unsigned long missingSamples = 0;
unsigned long imputedSamples = 0;
unsigned long rangeViolations = 0;
unsigned long rateViolations = 0;
// ====================== HELPER FUNCTIONS ======================
// Convert ADC reading to temperature (NTC thermistor approximation)
float adcToTemperature(int adcValue) {
if (adcValue == 0) return -INFINITY;
if (adcValue >= 4095) return INFINITY;
// Beta-parameter equation (a simplified Steinhart-Hart form) for a 10K NTC, B = 3950
float resistance = 10000.0 * (4095.0 / adcValue - 1.0);
float steinhart = resistance / 10000.0;
steinhart = log(steinhart);
steinhart /= 3950.0;
steinhart += 1.0 / (25.0 + 273.15);
steinhart = 1.0 / steinhart;
steinhart -= 273.15;
return steinhart;
}
// Convert ADC reading to light level (LDR approximation)
float adcToLight(int adcValue) {
// Map ADC to approximate lux (quadratic curve, not a calibrated conversion)
if (adcValue < 10) return 0;
float lux = 100000.0 * pow((float)adcValue / 4095.0, 2);
return lux;
}
// ====================== VALIDATION FUNCTIONS ======================
// Range validation - check if value is within physical bounds
bool validateRange(float value, float minValid, float maxValid, String& errorMsg) {
if (isnan(value) || isinf(value)) {
errorMsg = "NaN/Inf";
return false;
}
if (value < minValid) {
errorMsg = "Below min (" + String(minValid) + ")";
return false;
}
if (value > maxValid) {
errorMsg = "Above max (" + String(maxValid) + ")";
return false;
}
errorMsg = "OK";
return true;
}
// Rate-of-change validation - detect impossible jumps
bool validateRateOfChange(float currentValue, float lastValue,
unsigned long currentTime, unsigned long lastTime,
float maxRate, String& errorMsg) {
if (isnan(lastValue) || lastTime == 0) {
errorMsg = "First reading";
return true;
}
float timeDelta = (currentTime - lastTime) / 1000.0; // seconds
if (timeDelta <= 0) {
errorMsg = "Invalid timestamp";
return false;
}
float rate = fabs(currentValue - lastValue) / timeDelta;
if (rate > maxRate) {
errorMsg = "Rate " + String(rate, 2) + " exceeds max " + String(maxRate);
return false;
}
errorMsg = "Rate OK (" + String(rate, 2) + ")";
return true;
}
// ====================== OUTLIER DETECTION ======================
// Z-Score outlier detection
bool detectOutlierZScore(float value, float mean, float stdDev,
float threshold, float& zScore) {
if (stdDev == 0) {
zScore = 0;
return false;
}
zScore = fabs((value - mean) / stdDev);
return zScore > threshold;
}
// IQR outlier detection (sorts a temporary copy of the buffer)
bool detectOutlierIQR(CircularBuffer<VALIDATION_WINDOW>& buffer, float value,
float multiplier, float& lowerBound, float& upperBound) {
if (buffer.size() < 10) return false;
// Copy to temporary array for sorting
float sorted[VALIDATION_WINDOW];
int n = buffer.size();
for (int i = 0; i < n; i++) {
sorted[i] = buffer.get(i);
}
// Simple bubble sort (OK for small buffers)
for (int i = 0; i < n - 1; i++) {
for (int j = 0; j < n - i - 1; j++) {
if (sorted[j] > sorted[j + 1]) {
float temp = sorted[j];
sorted[j] = sorted[j + 1];
sorted[j + 1] = temp;
}
}
}
// Calculate quartiles
int q1Idx = n / 4;
int q3Idx = 3 * n / 4;
float q1 = sorted[q1Idx];
float q3 = sorted[q3Idx];
float iqr = q3 - q1;
lowerBound = q1 - multiplier * iqr;
upperBound = q3 + multiplier * iqr;
return (value < lowerBound) || (value > upperBound);
}
// ====================== NOISE FILTERING ======================
// Moving average filter
float filterMovingAverage(CircularBuffer<FILTER_WINDOW_MA>& buffer, float newValue) {
buffer.push(newValue);
float sum = 0;
for (int i = 0; i < buffer.size(); i++) {
sum += buffer.get(i);
}
return sum / buffer.size();
}
// Median filter (excellent for spike removal)
float filterMedian(CircularBuffer<FILTER_WINDOW_MEDIAN>& buffer, float newValue) {
buffer.push(newValue);
// Copy and sort
float sorted[FILTER_WINDOW_MEDIAN];
int n = buffer.size();
for (int i = 0; i < n; i++) {
sorted[i] = buffer.get(i);
}
for (int i = 0; i < n - 1; i++) {
for (int j = 0; j < n - i - 1; j++) {
if (sorted[j] > sorted[j + 1]) {
float temp = sorted[j];
sorted[j] = sorted[j + 1];
sorted[j + 1] = temp;
}
}
}
if (n % 2 == 0) {
return (sorted[n/2 - 1] + sorted[n/2]) / 2.0;
}
return sorted[n / 2];
}
// Exponential smoothing filter
float filterExponentialSmoothing(float& smoothed, float newValue, float alpha) {
if (isnan(smoothed)) {
smoothed = newValue;
} else {
smoothed = alpha * newValue + (1 - alpha) * smoothed;
}
return smoothed;
}
// ====================== MISSING VALUE HANDLING ======================
// Simulate missing data (for demonstration)
bool simulateMissingData() {
return random(1000) < (MISSING_PROBABILITY * 1000);
}
// Forward-fill imputation with limit
float imputeForwardFill(float lastValidValue, int& consecutiveMissing,
int maxMissing, bool& wasImputed) {
consecutiveMissing++;
if (consecutiveMissing > maxMissing || isnan(lastValidValue)) {
wasImputed = false;
return NAN;
}
wasImputed = true;
return lastValidValue;
}
// ====================== NORMALIZATION ======================
// Min-Max scaling to 0-1 range
float normalizeMinMax(float value, float minSeen, float maxSeen) {
if (maxSeen == minSeen) return 0.5;
return (value - minSeen) / (maxSeen - minSeen);
}
// Z-Score normalization (standardization)
float normalizeZScore(float value, float mean, float stdDev) {
if (stdDev == 0) return 0;
return (value - mean) / stdDev;
}
// Update normalization parameters
void updateNormalizationParams(float value, float& minSeen, float& maxSeen) {
if (value < minSeen) minSeen = value;
if (value > maxSeen) maxSeen = value;
}
// ====================== MAIN PROCESSING FUNCTION ======================
SensorReading processTemperature(int adcValue, unsigned long timestamp) {
SensorReading reading;
reading.timestamp = timestamp;
reading.isValid = true;
reading.isOutlier = false;
reading.isMissing = false;
reading.isImputed = false;
reading.qualityFlags = "";
// Step 0: Check for simulated missing data
if (simulateMissingData()) {
reading.isMissing = true;
missingSamples++;
// Try forward-fill imputation
bool wasImputed;
float imputed = imputeForwardFill(lastValidTemp, consecutiveMissingTemp,
MAX_MISSING_SAMPLES, wasImputed);
if (wasImputed) {
reading.rawValue = NAN;
reading.cleanedValue = imputed;
reading.isImputed = true;
reading.qualityFlags += "IMPUTED_FFILL ";
imputedSamples++;
lastImputedTemp = imputed;
} else {
reading.rawValue = NAN;
reading.cleanedValue = NAN;
reading.normalizedValue = NAN;
reading.qualityFlags += "MISSING_NO_IMPUTE ";
totalSamples++; // count this sample before the early return
return reading;
}
} else {
consecutiveMissingTemp = 0;
// Step 1: Convert ADC to temperature
reading.rawValue = adcToTemperature(adcValue);
// Step 2: Range validation
String rangeError;
if (!validateRange(reading.rawValue, TEMP_MIN_VALID, TEMP_MAX_VALID, rangeError)) {
reading.isValid = false;
reading.qualityFlags += "RANGE_VIOLATION(" + rangeError + ") ";
rangeViolations++;
}
// Step 3: Rate-of-change validation
String rateError;
if (!validateRateOfChange(reading.rawValue, lastValidTemp,
timestamp, lastValidTempTime,
TEMP_MAX_RATE, rateError)) {
reading.isValid = false;
reading.qualityFlags += "RATE_VIOLATION(" + rateError + ") ";
rateViolations++;
}
// Step 4: Outlier detection (only if range-valid)
if (reading.isValid && tempStats.count > 20) {
float zScore;
if (detectOutlierZScore(reading.rawValue, tempStats.mean(),
tempStats.stdDev(), ZSCORE_THRESHOLD, zScore)) {
reading.isOutlier = true;
reading.qualityFlags += "ZSCORE_OUTLIER(z=" + String(zScore, 2) + ") ";
outlierSamples++;
}
float lowerBound, upperBound;
if (detectOutlierIQR(tempRawBuffer, reading.rawValue,
IQR_MULTIPLIER, lowerBound, upperBound)) {
if (!reading.isOutlier) outlierSamples++; // count each outlier sample once
reading.isOutlier = true;
reading.qualityFlags += "IQR_OUTLIER ";
}
}
// Step 5: Apply noise filtering
if (reading.isValid && !reading.isOutlier) {
// Median filter first (removes spikes)
float medianFiltered = filterMedian(tempMedianBuffer, reading.rawValue);
// Then moving average (smooths remaining noise)
float maFiltered = filterMovingAverage(tempMABuffer, medianFiltered);
// Exponential smoothing for final output
reading.cleanedValue = filterExponentialSmoothing(expSmoothedTemp,
maFiltered,
EXP_SMOOTHING_ALPHA);
// Update statistics with valid data
tempStats.add(reading.rawValue);
tempRawBuffer.push(reading.rawValue);
tempCleanBuffer.push(reading.cleanedValue);
// Update last valid
lastValidTemp = reading.rawValue;
lastValidTempTime = timestamp;
validSamples++;
} else if (reading.isOutlier) {
// Use median of recent values for outlier replacement
reading.cleanedValue = filterMedian(tempMedianBuffer,
tempMedianBuffer.get(tempMedianBuffer.size() - 1));
reading.qualityFlags += "OUTLIER_REPLACED ";
} else {
// Range/rate violation - use last valid with flag
if (!isnan(lastValidTemp)) {
reading.cleanedValue = lastValidTemp;
reading.qualityFlags += "USING_LAST_VALID ";
} else {
reading.cleanedValue = NAN;
}
}
}
// Step 6: Normalization
if (!isnan(reading.cleanedValue)) {
updateNormalizationParams(reading.cleanedValue, tempMinSeen, tempMaxSeen);
reading.normalizedValue = normalizeMinMax(reading.cleanedValue,
tempMinSeen, tempMaxSeen);
} else {
reading.normalizedValue = NAN;
}
if (reading.qualityFlags == "") {
reading.qualityFlags = "CLEAN";
}
totalSamples++;
return reading;
}
SensorReading processLight(int adcValue, unsigned long timestamp) {
SensorReading reading;
reading.timestamp = timestamp;
reading.isValid = true;
reading.isOutlier = false;
reading.isMissing = false;
reading.isImputed = false;
reading.qualityFlags = "";
// Similar processing as temperature (abbreviated for space)
if (simulateMissingData()) {
reading.isMissing = true;
missingSamples++;
bool wasImputed;
float imputed = imputeForwardFill(lastValidLight, consecutiveMissingLight,
MAX_MISSING_SAMPLES, wasImputed);
if (wasImputed) {
reading.rawValue = NAN;
reading.cleanedValue = imputed;
reading.isImputed = true;
reading.qualityFlags = "IMPUTED_FFILL";
imputedSamples++;
} else {
reading.rawValue = NAN;
reading.cleanedValue = NAN;
reading.normalizedValue = NAN;
reading.qualityFlags = "MISSING_NO_IMPUTE";
totalSamples++; // count this sample before the early return
return reading;
}
} else {
consecutiveMissingLight = 0;
reading.rawValue = adcToLight(adcValue);
// Simplified processing for light sensor
String rangeError;
if (!validateRange(reading.rawValue, LIGHT_MIN_VALID, LIGHT_MAX_VALID, rangeError)) {
reading.isValid = false;
reading.qualityFlags = "RANGE_VIOLATION";
rangeViolations++;
}
if (reading.isValid) {
reading.cleanedValue = filterExponentialSmoothing(expSmoothedLight,
reading.rawValue,
EXP_SMOOTHING_ALPHA);
lightStats.add(reading.rawValue);
lightRawBuffer.push(reading.rawValue);
lastValidLight = reading.rawValue;
lastValidLightTime = timestamp;
validSamples++;
} else {
reading.cleanedValue = lastValidLight;
}
}
// Normalization
if (!isnan(reading.cleanedValue)) {
updateNormalizationParams(reading.cleanedValue, lightMinSeen, lightMaxSeen);
reading.normalizedValue = normalizeMinMax(reading.cleanedValue,
lightMinSeen, lightMaxSeen);
} else {
reading.normalizedValue = NAN; // avoid returning an uninitialized field
}
if (reading.qualityFlags == "") {
reading.qualityFlags = "CLEAN";
}
totalSamples++;
return reading;
}
// ====================== LED INDICATOR CONTROL ======================
void updateLEDs(const SensorReading& tempReading, const SensorReading& lightReading) {
// Red LED: Range violation
if (!tempReading.isValid || !lightReading.isValid) {
digitalWrite(LED_RED, HIGH);
} else {
digitalWrite(LED_RED, LOW);
}
// Yellow LED: Outlier detected
if (tempReading.isOutlier || lightReading.isOutlier) {
digitalWrite(LED_YELLOW, HIGH);
} else {
digitalWrite(LED_YELLOW, LOW);
}
// Green LED: Clean valid data
if (tempReading.qualityFlags == "CLEAN" && lightReading.qualityFlags == "CLEAN") {
digitalWrite(LED_GREEN, HIGH);
} else {
digitalWrite(LED_GREEN, LOW);
}
// Blue LED: Missing/imputed data
if (tempReading.isMissing || tempReading.isImputed ||
lightReading.isMissing || lightReading.isImputed) {
digitalWrite(LED_BLUE, HIGH);
} else {
digitalWrite(LED_BLUE, LOW);
}
}
// ====================== SERIAL OUTPUT ======================
void printSensorReading(const char* sensorName, const SensorReading& reading) {
Serial.print(sensorName);
Serial.print(": Raw=");
if (isnan(reading.rawValue)) {
Serial.print("NaN");
} else {
Serial.print(reading.rawValue, 2);
}
Serial.print(", Clean=");
if (isnan(reading.cleanedValue)) {
Serial.print("NaN");
} else {
Serial.print(reading.cleanedValue, 2);
}
Serial.print(", Norm=");
if (isnan(reading.normalizedValue)) {
Serial.print("NaN");
} else {
Serial.print(reading.normalizedValue, 3);
}
Serial.print(" [");
Serial.print(reading.qualityFlags);
Serial.println("]");
}
void printStatistics() {
Serial.println("\n========== DATA QUALITY STATISTICS ==========");
Serial.print("Total Samples: ");
Serial.println(totalSamples);
Serial.print("Valid Samples: ");
Serial.print(validSamples);
Serial.print(" (");
Serial.print(100.0 * validSamples / max(totalSamples, 1UL), 1);
Serial.println("%)");
Serial.print("Outliers Detected: ");
Serial.print(outlierSamples);
Serial.print(" (");
Serial.print(100.0 * outlierSamples / max(totalSamples, 1UL), 1);
Serial.println("%)");
Serial.print("Missing Values: ");
Serial.print(missingSamples);
Serial.print(" (");
Serial.print(100.0 * missingSamples / max(totalSamples, 1UL), 1);
Serial.println("%)");
Serial.print("Imputed Values: ");
Serial.print(imputedSamples);
Serial.println();
Serial.print("Range Violations: ");
Serial.println(rangeViolations);
Serial.print("Rate Violations: ");
Serial.println(rateViolations);
Serial.println("\n--- Temperature Statistics ---");
Serial.print("Mean: ");
Serial.print(tempStats.mean(), 2);
Serial.print(" C, StdDev: ");
Serial.print(tempStats.stdDev(), 2);
Serial.print(" C, Range: [");
Serial.print(tempMinSeen, 1);
Serial.print(", ");
Serial.print(tempMaxSeen, 1);
Serial.println("] C");
Serial.println("\n--- Light Statistics ---");
Serial.print("Mean: ");
Serial.print(lightStats.mean(), 0);
Serial.print(" lux, StdDev: ");
Serial.print(lightStats.stdDev(), 0);
Serial.print(" lux, Range: [");
Serial.print(lightMinSeen, 0);
Serial.print(", ");
Serial.print(lightMaxSeen, 0);
Serial.println("] lux");
Serial.println("==============================================\n");
}
// ====================== SETUP AND LOOP ======================
void setup() {
Serial.begin(115200);
delay(1000);
// Initialize pins
pinMode(TEMP_PIN, INPUT);
pinMode(LIGHT_PIN, INPUT);
pinMode(DRIFT_PIN, INPUT);
pinMode(LED_RED, OUTPUT);
pinMode(LED_YELLOW, OUTPUT);
pinMode(LED_GREEN, OUTPUT);
pinMode(LED_BLUE, OUTPUT);
// All LEDs off initially
digitalWrite(LED_RED, LOW);
digitalWrite(LED_YELLOW, LOW);
digitalWrite(LED_GREEN, LOW);
digitalWrite(LED_BLUE, LOW);
// Seed random for missing data simulation
randomSeed(analogRead(0));
Serial.println("============================================");
Serial.println(" DATA QUALITY LAB: IoT Preprocessing Demo ");
Serial.println("============================================");
Serial.println("Features demonstrated:");
Serial.println(" - Range validation (physical bounds)");
Serial.println(" - Rate-of-change validation");
Serial.println(" - Z-Score outlier detection");
Serial.println(" - IQR outlier detection");
Serial.println(" - Median filter (spike removal)");
Serial.println(" - Moving average filter (smoothing)");
Serial.println(" - Exponential smoothing");
Serial.println(" - Missing value imputation (forward-fill)");
Serial.println(" - Min-Max normalization (0-1 scaling)");
Serial.println("============================================");
Serial.println("LED Indicators:");
Serial.println(" RED: Range/Rate violation");
Serial.println(" YELLOW: Outlier detected");
Serial.println(" GREEN: Clean valid data");
Serial.println(" BLUE: Missing/Imputed data");
Serial.println("============================================\n");
Serial.println("Starting data collection...\n");
}
unsigned long lastSampleTime = 0;
unsigned long lastStatsTime = 0;
const unsigned long STATS_INTERVAL_MS = 10000; // Print stats every 10 seconds
void loop() {
unsigned long currentTime = millis();
// Sample at defined interval
if (currentTime - lastSampleTime >= SAMPLE_INTERVAL_MS) {
lastSampleTime = currentTime;
// Read sensors
int tempADC = analogRead(TEMP_PIN);
int lightADC = analogRead(LIGHT_PIN);
int driftADC = analogRead(DRIFT_PIN);
// Add simulated drift from potentiometer (optional)
float driftFactor = (driftADC - 2048) / 2048.0; // -1 to +1
tempADC = constrain(tempADC + (int)(driftFactor * 500), 0, 4095);
// Process readings through data quality pipeline
SensorReading tempReading = processTemperature(tempADC, currentTime);
SensorReading lightReading = processLight(lightADC, currentTime);
// Update LED indicators
updateLEDs(tempReading, lightReading);
// Print readings
printSensorReading("TEMP", tempReading);
printSensorReading("LIGHT", lightReading);
Serial.println();
}
// Print statistics periodically
if (currentTime - lastStatsTime >= STATS_INTERVAL_MS) {
lastStatsTime = currentTime;
printStatistics();
}
}
1308.4.6 Step-by-Step Instructions
1308.4.6.1 Step 1: Set Up the Simulator
- Open the Wokwi simulator embedded above (or visit wokwi.com)
- Create a new ESP32 project
- Click the diagram.json tab and paste the circuit configuration
- Replace the default code with the complete Arduino code above
1308.4.6.2 Step 2: Run and Observe Validation
- Click the Play button to start the simulation
- Open the Serial Monitor to see the data processing output
- Observe the quality flags showing validation status for each reading
- Watch for “CLEAN” flags indicating data passed all quality checks
1308.4.6.3 Step 3: Trigger Range Violations
- Click the NTC temperature sensor in the simulator
- Drag the slider to extreme values (very hot or very cold)
- Watch the RED LED turn on when values exceed physical bounds
- Note the “RANGE_VIOLATION” flag in the serial output
- Observe how the system falls back to the last valid value when the current reading is invalid
1308.4.6.4 Step 4: Observe Outlier Detection
- Make sudden temperature changes by quickly dragging the sensor slider
- Watch the YELLOW LED blink when outliers are detected
- Look for the “ZSCORE_OUTLIER” and “IQR_OUTLIER” flags in the serial output
- Notice how outliers are replaced with median values
1308.4.6.5 Step 5: Observe Missing Data Handling
- The code simulates 5% random missing data
- Watch the BLUE LED flash when data is missing or imputed
- See “IMPUTED_FFILL” flags showing forward-fill imputation
- Note “MISSING_NO_IMPUTE” when too many consecutive values are missing
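The forward-fill behavior observed in this step can be sketched in a few lines. This is a minimal illustration, not the lab's actual implementation: the `ForwardFill` struct, the `FillFlag` enum, and the limit of 3 are assumptions chosen to mirror the “IMPUTED_FFILL” and “MISSING_NO_IMPUTE” flags.

```cpp
#include <cmath>

// Hypothetical names throughout; the lab's real code may be organized differently.
const int MAX_MISSING_SAMPLES = 3;  // assumed limit, not necessarily the lab's value

enum FillFlag { FILL_CLEAN, FILL_IMPUTED_FFILL, FILL_MISSING_NO_IMPUTE };

struct ForwardFill {
  float lastValid = NAN;  // most recent real measurement
  int missingRun = 0;     // consecutive missing samples so far

  // Returns the value to use and sets `flag` to describe how it was obtained.
  float update(bool present, float value, FillFlag &flag) {
    if (present) {
      lastValid = value;
      missingRun = 0;
      flag = FILL_CLEAN;
      return value;
    }
    missingRun++;
    if (!std::isnan(lastValid) && missingRun <= MAX_MISSING_SAMPLES) {
      flag = FILL_IMPUTED_FFILL;  // short gap: repeat the last valid value
      return lastValid;
    }
    flag = FILL_MISSING_NO_IMPUTE;  // gap too long, or nothing to repeat yet
    return NAN;
  }
};
```

Capping the run length keeps a stale value from masquerading as live data during a long sensor outage.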
1308.4.6.6 Step 6: Experiment with Filtering
- Observe the Raw vs Clean values in the serial output
- Notice how Clean values are smoother due to median + moving average filters
- Compare the normalized values (0-1 range) for multi-sensor comparison
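To see why the Clean values are smoother, it can help to look at the two filters in isolation. The sketch below chains a 3-sample median (spike removal) into a 4-sample moving average (smoothing). The window sizes and the names `median3`, `MovingAverage`, and `CleanFilter` are assumptions for illustration and may differ from the lab code.

```cpp
#include <algorithm>
#include <cstddef>

// Median of three samples: a single-sample spike can never be the median.
float median3(float a, float b, float c) {
  return std::max(std::min(a, b), std::min(std::max(a, b), c));
}

// Moving average over the last N samples using a circular buffer.
template <std::size_t N>
struct MovingAverage {
  float buf[N] = {};
  std::size_t count = 0, head = 0;
  float sum = 0.0f;

  float update(float x) {
    if (count == N) sum -= buf[head];  // drop the oldest sample once full
    else count++;
    buf[head] = x;
    sum += x;
    head = (head + 1) % N;
    return sum / static_cast<float>(count);
  }
};

// Median first (removes spikes), then moving average (smooths residual noise).
struct CleanFilter {
  float prev1 = 0.0f, prev2 = 0.0f;
  int seen = 0;
  MovingAverage<4> avg;

  float update(float raw) {
    float m = raw;
    if (seen >= 2) m = median3(prev2, prev1, raw);
    else seen++;  // not enough history yet: pass the raw value through
    prev2 = prev1;
    prev1 = raw;
    return avg.update(m);
  }
};
```

The order matters: averaging first would smear a spike across several output samples, whereas the median discards it before it reaches the average.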
1308.4.6.7 Step 7: Analyze Statistics
- Wait for the statistics report (prints every 10 seconds)
- Review the data quality percentages: valid, outliers, missing
- Examine the sensor statistics: mean, standard deviation, range
- Consider how these metrics would inform production monitoring
1308.4.7 Challenge Exercises
Difficulty: Intermediate
Task: The code includes MAD outlier detection but does not use it in the main pipeline. Modify the processTemperature() function to use MAD as the primary outlier detection method.
Hints:
- MAD is more robust than Z-score for non-Gaussian data
- The detectOutlierMAD() function is already implemented
- Replace or complement the Z-score check with MAD
Expected Outcome: MAD should detect outliers even when extreme values skew the mean and standard deviation.
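As a reference point for this challenge, here is a standalone sketch of MAD-based detection using the modified Z-score, 0.6745 × (x − median) / MAD, with the commonly used threshold of 3.5. It is not the lab's detectOutlierMAD(); the names `medianOf`, `isOutlierMAD`, and `MAX_WINDOW` are hypothetical, and the handling of a zero MAD is an assumption.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>

const std::size_t MAX_WINDOW = 32;  // assumed upper bound on window size

// Median of n values (n <= MAX_WINDOW); sorts a copy so the input is untouched.
float medianOf(const float *data, std::size_t n) {
  float tmp[MAX_WINDOW];
  std::copy(data, data + n, tmp);
  std::sort(tmp, tmp + n);
  return (n % 2) ? tmp[n / 2] : 0.5f * (tmp[n / 2 - 1] + tmp[n / 2]);
}

// True if x is an outlier relative to the window, by modified Z-score.
bool isOutlierMAD(const float *window, std::size_t n, float x,
                  float threshold = 3.5f) {
  if (n == 0 || n > MAX_WINDOW) return false;
  float med = medianOf(window, n);
  float dev[MAX_WINDOW];
  for (std::size_t i = 0; i < n; i++) dev[i] = std::fabs(window[i] - med);
  float mad = medianOf(dev, n);
  if (mad < 1e-6f) return false;  // assumption: no spread, so no verdict
  float modifiedZ = 0.6745f * (x - med) / mad;
  return std::fabs(modifiedZ) > threshold;
}
```

Because both the center and the spread are medians, one extreme value already in the window barely changes them, which is what lets MAD keep flagging outliers where the mean and standard deviation would be pulled off course.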
Difficulty: Intermediate
Task: Currently, forward-fill uses a fixed limit (MAX_MISSING_SAMPLES). Implement an exponential decay on the confidence of imputed values.
Requirements:
- Add a confidence field to SensorReading
- Reduce confidence by 10% for each consecutive imputed value
- Stop imputing when confidence drops below 50%
- Display confidence in serial output
Expected Outcome: Imputed values should be flagged with decreasing confidence as gaps grow longer.
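One possible shape for this challenge is sketched below. It keeps the confidence in a separate `ImputedReading` struct rather than modifying the lab's SensorReading, and every name and constant here is an assumption made for illustration.

```cpp
#include <cmath>

// All names and constants here are illustrative, not the lab's actual code.
const float CONFIDENCE_DECAY = 0.9f;  // lose 10% per consecutive imputed sample
const float MIN_CONFIDENCE = 0.5f;    // stop imputing below this level

struct ImputedReading {
  float value;       // measured or forward-filled value (NAN if unavailable)
  float confidence;  // 1.0 for a real measurement, decays while imputing
  bool imputed;      // true when the value was forward-filled
  bool valid;        // false when no usable value could be produced
};

struct DecayingForwardFill {
  float lastValid = NAN;
  float confidence = 1.0f;

  ImputedReading update(bool present, float value) {
    if (present) {
      lastValid = value;
      confidence = 1.0f;  // a real measurement restores full confidence
      return {value, 1.0f, false, true};
    }
    confidence *= CONFIDENCE_DECAY;  // exponential decay across the gap
    if (std::isnan(lastValid) || confidence < MIN_CONFIDENCE) {
      return {NAN, 0.0f, false, false};
    }
    return {lastValid, confidence, true, true};
  }
};
```

With these constants, imputation stops at the seventh consecutive missing sample, since 0.9^6 ≈ 0.53 is still above the threshold and 0.9^7 ≈ 0.48 is not.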
Difficulty: Advanced
Task: Add plausibility checking between temperature and light sensors. If it is very bright (high light), temperature should be reasonable for daytime.
Requirements:
- If light > 50000 lux and temperature < 10 °C, flag as suspicious
- If light < 100 lux and temperature > 35 °C (outdoors), flag as suspicious
- Add a new LED or serial indicator for cross-sensor anomalies
Expected Outcome: The system should detect when sensor readings are physically inconsistent with each other.
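The two rules in this challenge translate almost directly into code. The sketch below uses the thresholds stated in the requirements; the `CrossSensorFlag` enum and `checkCrossSensor()` function are hypothetical names, and wiring the result to an LED or serial message is left to the exercise.

```cpp
// Hypothetical flag values for cross-sensor plausibility results.
enum CrossSensorFlag { CROSS_OK, CROSS_BRIGHT_BUT_COLD, CROSS_DARK_BUT_HOT };

// Compares cleaned temperature (deg C) and light (lux) for physical consistency.
CrossSensorFlag checkCrossSensor(float tempC, float lux) {
  if (lux > 50000.0f && tempC < 10.0f) return CROSS_BRIGHT_BUT_COLD;
  if (lux < 100.0f && tempC > 35.0f) return CROSS_DARK_BUT_HOT;
  return CROSS_OK;
}
```

A check like this is best run on the cleaned values, after validation and filtering, so that a single spike on one sensor does not raise a cross-sensor alarm.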
Difficulty: Advanced
Task: Replace the exponential smoothing filter with a simple Kalman filter for temperature.
Requirements:
- Implement 1D Kalman filter with process noise and measurement noise
- Estimate the Kalman gain dynamically
- Output both the filtered value and the uncertainty estimate
Learning: Kalman filters provide optimal estimation for linear systems with Gaussian noise when the process and measurement noise characteristics are known.
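A minimal 1D Kalman filter for a slowly varying quantity such as temperature might look like the sketch below. It assumes a constant-value process model (the prediction step only inflates the uncertainty), and the `Kalman1D` struct and its parameter names are illustrative, not part of the lab code. The noise variances `q` and `r` would need tuning for a real sensor.

```cpp
// Scalar Kalman filter, assuming the true value is roughly constant between
// samples. All names are hypothetical.
struct Kalman1D {
  float x;     // current state estimate (e.g., temperature in deg C)
  float p;     // variance of the estimate (the uncertainty output)
  float q;     // process noise variance: how fast the true value may drift
  float r;     // measurement noise variance: how noisy the sensor is
  float gain;  // most recent Kalman gain, kept for inspection

  Kalman1D(float x0, float p0, float q_, float r_)
      : x(x0), p(p0), q(q_), r(r_), gain(0.0f) {}

  // Feed one measurement z; returns the filtered estimate.
  float update(float z) {
    p += q;               // predict: uncertainty grows by the process noise
    gain = p / (p + r);   // gain: how much to trust the new measurement
    x += gain * (z - x);  // correct the estimate toward the measurement
    p *= (1.0f - gain);   // uncertainty shrinks after the correction
    return x;
  }
};
```

The gain is recomputed on every sample: it starts high while the estimate is uncertain and settles to a steady value determined by the ratio of `q` to `r`. Unlike exponential smoothing with a fixed alpha, the filter also reports `p` as an uncertainty estimate.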
1308.4.8 Expected Outcomes
After completing this lab, you should be able to:
- Understand validation trade-offs: Strict validation catches more errors but may reject valid extreme readings
- Choose appropriate outlier methods: Z-score for Gaussian data, IQR/MAD for robust detection
- Select imputation strategies: Forward-fill for slow-changing, interpolation for trending data
- Apply noise filters correctly: Median for spikes, moving average for steady-state noise
- Normalize for fusion: Understand when to use min-max vs Z-score normalization
Quality Metrics to Observe:
- Valid sample rate should be >90% under normal conditions
- Outlier rate should be <5% for stable sensors
- Imputed values should maintain temporal continuity
1308.5 Summary
Data normalization and scaling complete the data quality preprocessing pipeline:
- Min-Max Scaling: Transforms data to 0-1 range, ideal for neural networks and bounded outputs
- Z-Score Normalization: Centers data at zero mean with unit variance, best for clustering and SVM
- Robust Scaling: Uses median and IQR, resistant to outliers
- Log Transform: Compresses right-skewed data spanning orders of magnitude
- Complete Pipeline: Validation -> Cleaning -> Transformation, all implementable on edge devices
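The four scaling methods summarized above each reduce to a one-line formula. The sketch below shows them as small helper functions. The function names, the clamping in `minMaxScale`, and the guards against zero spread are assumptions for illustration, and the statistics (min, max, mean, standard deviation, median, IQR) are assumed to be computed elsewhere in the pipeline.

```cpp
#include <cmath>

// Min-max scaling to [0, 1]. Clamping out-of-range inputs is an assumption.
float minMaxScale(float x, float lo, float hi) {
  if (hi <= lo) return 0.0f;  // degenerate range: nothing to scale
  float s = (x - lo) / (hi - lo);
  if (s < 0.0f) return 0.0f;
  if (s > 1.0f) return 1.0f;
  return s;
}

// Z-score: distance from the mean in units of standard deviation.
float zScore(float x, float mean, float stdDev) {
  if (stdDev <= 0.0f) return 0.0f;
  return (x - mean) / stdDev;
}

// Robust scaling: like Z-score, but with median and IQR so outliers in the
// statistics window have little influence.
float robustScale(float x, float median, float iqr) {
  if (iqr <= 0.0f) return 0.0f;
  return (x - median) / iqr;
}

// Log transform followed by normalization, for non-negative right-skewed data.
// log1p(x) = log(1 + x) keeps x = 0 well defined.
float logScale(float x, float maxX) {
  if (x <= 0.0f || maxX <= 0.0f) return 0.0f;
  return std::log1p(x) / std::log1p(maxX);
}
```

Using the ranges from this chapter's introduction, `minMaxScale(20.0f, -10.0f, 50.0f)` and `minMaxScale(50000.0f, 0.0f, 100000.0f)` both return 0.5, which puts a mid-range temperature and a mid-range light level on the same footing for fusion.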
Critical Design Principle: The complete “validate-clean-transform” pipeline should run at the edge. Catching data quality issues at the source costs 1% of fixing them in the cloud, and normalized data enables fair multi-sensor fusion.
1308.6 What’s Next
The next chapter explores Multi-Sensor Data Fusion, building on these preprocessing techniques to combine data from multiple sensors for improved accuracy and reliability.
Data Quality Series:
- Data Quality and Preprocessing - Overview and index
- Data Validation and Outlier Detection - Validation and outliers
- Missing Value Imputation and Noise Filtering - Handling gaps and noise
Next Steps:
- Multi-Sensor Data Fusion - Combining preprocessed data
- Anomaly Detection - Finding meaningful outliers in clean data
- Stream Processing - Real-time data pipelines
Advanced Topics:
- Modeling and Inferencing - ML model deployment with quality data
- Edge Compute Patterns - Distributed preprocessing architectures