44 Normalization & Preprocessing
44.1 Learning Objectives
By the end of this chapter, you will be able to:
- Apply Normalization Techniques: Implement min-max scaling, Z-score normalization, and robust scaling for multi-sensor data fusion
- Compare Normalization Methods: Evaluate which scaling approach suits each downstream use case (neural networks, clustering, visualization)
- Implement a Complete Pipeline: Build and test an end-to-end data quality system on an ESP32 microcontroller
- Assess Data Quality Metrics: Calculate and interpret validation rates, outlier counts, and imputation statistics
44.2 Prerequisites
Before diving into this chapter, you should be familiar with:
- Data Validation and Outlier Detection: Understanding validation as the first stage of data quality
- Missing Value Imputation and Noise Filtering: Handling gaps and noise in sensor data
- Edge Data Acquisition: Understanding how sensor data is collected at the edge
For Beginners: Why Normalize Data?
Imagine comparing apples and elephants. Temperature might range from -10 to 50 degrees, while light levels range from 0 to 100,000 lux. If you feed both to a machine learning model without normalizing, the model will think light is 2000x more important simply because the numbers are bigger!
Normalization puts all sensors on the same scale:
| Before Normalization | After Min-Max (0-1) |
|---|---|
| Temperature: 25C | Temperature: 0.58 |
| Light: 50,000 lux | Light: 0.50 |
Now both contribute equally based on their actual information content, not their arbitrary unit scales.
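A quick numeric sketch makes the imbalance concrete. Using two hypothetical sensor snapshots, the Euclidean distance between them is driven almost entirely by the light channel until both channels are min-max scaled (the physical ranges below are the same ones used in the table above):

```python
import math

# Two hypothetical snapshots: (temperature in C, light in lux)
a = (25.0, 50_000.0)
b = (26.0, 49_000.0)

# Raw distance: the 1000-lux light difference swamps the 1-degree temp difference
raw_dist = math.dist(a, b)

# Min-max scale each channel using its assumed physical range
def minmax(x, lo, hi):
    return (x - lo) / (hi - lo)

a_n = (minmax(a[0], -10, 50), minmax(a[1], 0, 100_000))
b_n = (minmax(b[0], -10, 50), minmax(b[1], 0, 100_000))
norm_dist = math.dist(a_n, b_n)

print(raw_dist)   # ~1000: essentially just the light difference
print(norm_dist)  # small, with both channels contributing comparably
```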
When to use different methods:
| Method | Use When | Output Range |
|---|---|---|
| Min-Max | Need bounded outputs (neural networks) | 0 to 1 |
| Z-Score | Gaussian data, using K-means/SVM/PCA | Mean=0, StdDev=1 |
| Robust | Many outliers you want to ignore | Median-centered |
| Log | Data spans orders of magnitude | Compressed scale |
Key question this chapter answers: “How do I prepare multi-sensor data so it can be combined and analyzed fairly?”
Minimum Viable Understanding: Data Normalization
Core Concept: Normalization transforms sensor readings to a common scale, enabling fair comparison and combination of data from sensors with vastly different measurement ranges.
Why It Matters: Without normalization, a light sensor reading 100,000 lux would dominate a temperature sensor reading 25C in any analysis, even though both carry equal information. Machine learning models and statistical methods assume comparable scales.
Key Takeaway: Use min-max scaling (0-1) for neural networks and bounded outputs; use Z-score normalization for clustering and algorithms assuming Gaussian distributions; use robust scaling when outliers are present and should be dampened.
44.3 Data Normalization and Scaling
Key Concepts
- Min-max normalization: Scaling all values to the [0, 1] range by subtracting the minimum and dividing by the range; sensitive to outliers, which compress the majority of values into a narrow band.
- Z-score standardization: Transforming values to zero mean and unit variance by subtracting the mean and dividing by the standard deviation; assumes an approximately Gaussian distribution.
- Robust scaling: Scaling using the median and interquartile range instead of the mean and standard deviation, making it resistant to outliers; preferred for sensor data with occasional extreme spikes.
- Feature scaling: The general process of transforming sensor channels to comparable magnitude ranges so that machine learning algorithms treat all features equally, regardless of their engineering units.
- Normalization artifact: A distortion introduced by incorrect normalization, for example scaling the train and test sets with different parameters, or using the wrong normalization for the downstream algorithm.
- Data leakage: Contamination of model evaluation with information from the test set, often caused by normalizing all data together before splitting into train/test, which leaks test statistics into the normalization parameters.
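The leakage pitfall in the last bullet is easy to avoid in practice: split first, fit the scaler on the training portion only, then apply the learned parameters to both splits. A minimal sketch in plain Python, using hypothetical readings:

```python
readings = [18.0, 22.0, 25.0, 19.0, 21.0, 30.0, 17.0, 23.0]

# Split BEFORE computing any statistics
train, test = readings[:6], readings[6:]

# Fit min-max parameters on the training split only
lo, hi = min(train), max(train)

def scale(x):
    return (x - lo) / (hi - lo)

train_scaled = [scale(x) for x in train]
test_scaled = [scale(x) for x in test]  # may fall outside [0, 1]; that's expected

print(train_scaled)
print(test_scaled)  # 17.0 scales below 0 because the test set never influenced lo/hi
```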
44.3.1 Min-Max Scaling
Scales data to a fixed range (typically 0-1):
```python
class MinMaxScaler:
    def __init__(self, feature_range=(0, 1)):
        self.min_val = None
        self.max_val = None
        self.feature_min, self.feature_max = feature_range

    def fit(self, data):
        """Learn min/max from training data"""
        self.min_val = min(data)
        self.max_val = max(data)

    def transform(self, value):
        """Scale a single value"""
        if self.max_val == self.min_val:
            return self.feature_min
        scaled = (value - self.min_val) / (self.max_val - self.min_val)
        return scaled * (self.feature_max - self.feature_min) + self.feature_min

    def inverse_transform(self, scaled_value):
        """Convert back to original scale"""
        original = (scaled_value - self.feature_min) / (self.feature_max - self.feature_min)
        return original * (self.max_val - self.min_val) + self.min_val
```
44.3.2 Z-Score Normalization (Standardization)
Centers data around mean with unit variance:
```python
import numpy as np

class ZScoreNormalizer:
    def __init__(self):
        self.mean = None
        self.std = None

    def fit(self, data):
        """Learn mean and std from training data"""
        self.mean = np.mean(data)
        self.std = np.std(data)

    def transform(self, value):
        """Normalize a single value"""
        if self.std == 0:
            return 0.0
        return (value - self.mean) / self.std

    def inverse_transform(self, normalized_value):
        """Convert back to original scale"""
        return normalized_value * self.std + self.mean
```
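The fit/transform split matters on a device: fit once on a calibration window, then transform each new sample as it streams in, without re-fitting. A minimal functional sketch of that usage pattern (plain Python, hypothetical calibration data):

```python
# Calibration window collected at startup (hypothetical readings, in C)
calibration = [18.0, 22.0, 25.0, 19.0, 21.0]

# "fit": learn the parameters once
mu = sum(calibration) / len(calibration)
sigma = (sum((x - mu) ** 2 for x in calibration) / len(calibration)) ** 0.5

# "transform": apply to each streaming sample using the frozen parameters
def standardize(x):
    return (x - mu) / sigma if sigma else 0.0

for sample in [20.0, 26.5, 21.0]:
    print(round(standardize(sample), 3))

# "inverse_transform": recover engineering units when needed
assert abs(standardize(26.5) * sigma + mu - 26.5) < 1e-9
```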
Putting Numbers to It
Min-max and Z-score normalization transform sensor data to comparable scales for multi-sensor fusion and machine learning.
Min-max formula: \(x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}\)
Z-score formula: \(z = \frac{x - \mu}{\sigma}\)
Worked example: A temperature sensor collects 5 readings: [18C, 22C, 25C, 19C, 21C]. Normalize using both methods:
Min-max scaling:
- \(x_{\min} = 18\), \(x_{\max} = 25\)
- For 22C: \(x' = \frac{22 - 18}{25 - 18} = \frac{4}{7} = 0.571\)
- For 25C: \(x' = \frac{25 - 18}{25 - 18} = \frac{7}{7} = 1.000\)
Z-score normalization:
- \(\mu = 21\), \(\sigma = \sqrt{\frac{(18{-}21)^2 + (22{-}21)^2 + (25{-}21)^2 + (19{-}21)^2 + (21{-}21)^2}{5}} = \sqrt{6} \approx 2.449\)
- For 22C: \(z = \frac{22 - 21}{2.449} = 0.408\)
- For 25C: \(z = \frac{25 - 21}{2.449} = 1.633\)
Result: Min-max gives bounded [0,1] range, while Z-score preserves statistical distance (25C is 1.63 std deviations above mean).
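The worked example above can be verified in a few lines of plain Python:

```python
readings = [18, 22, 25, 19, 21]

# Min-max scaling
lo, hi = min(readings), max(readings)
mm = [(x - lo) / (hi - lo) for x in readings]
print(round(mm[1], 3))  # 22C -> 0.571
print(round(mm[2], 3))  # 25C -> 1.0

# Z-score normalization (population standard deviation, as in the worked example)
mu = sum(readings) / len(readings)
sigma = (sum((x - mu) ** 2 for x in readings) / len(readings)) ** 0.5
z = [(x - mu) / sigma for x in readings]
print(round(z[1], 3))  # 22C -> 0.408
print(round(z[2], 3))  # 25C -> 1.633
```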
44.3.3 Robust Scaling
Uses median and IQR, robust to outliers:
```python
import numpy as np

class RobustScaler:
    def __init__(self):
        self.median = None
        self.iqr = None

    def fit(self, data):
        """Learn median and IQR from training data"""
        self.median = np.median(data)
        q1 = np.percentile(data, 25)
        q3 = np.percentile(data, 75)
        self.iqr = q3 - q1

    def transform(self, value):
        """Scale using median and IQR"""
        if self.iqr == 0:
            return 0.0
        return (value - self.median) / self.iqr

    def inverse_transform(self, scaled_value):
        """Convert back to original scale"""
        return scaled_value * self.iqr + self.median
```
44.3.4 When to Use Each Method
| Method | Use Case | Preserves | Sensitive To |
|---|---|---|---|
| Min-Max | Neural networks, bounded outputs | Distribution shape | Outliers |
| Z-Score | SVM, K-means, PCA, Gaussian assumptions | Relative distances | Outliers (shift mean/std) |
| Robust Scaling | When outliers should not affect range | Median-based | Extremely sparse data |
| Log Transform | Right-skewed data (power, counts) | Multiplicative relationships | Zero/negative values |
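To see why the robust-scaling row matters, compare min-max and robust scaling on a hypothetical series with one spike (this sketch uses the lower-half/upper-half median convention for quartiles, which differs slightly from numpy's interpolated percentiles):

```python
from statistics import median

data = [20.0, 21.0, 22.0, 21.5, 20.5, 95.0]  # 95.0 is a sensor spike

# Min-max: the spike defines the range, crushing normal readings toward 0
lo, hi = min(data), max(data)
mm = [(x - lo) / (hi - lo) for x in data]

# Robust: median and IQR barely notice the spike
s = sorted(data)
med = median(s)
q1 = median(s[: len(s) // 2])           # median of the lower half
q3 = median(s[len(s) - len(s) // 2:])   # median of the upper half
rs = [(x - med) / (q3 - q1) for x in data]

print([round(v, 3) for v in mm])  # normal readings all land below ~0.03
print([round(v, 3) for v in rs])  # normal readings keep a usable spread; spike stands out
```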
44.3.5 Interactive Normalization Calculator
Try different sensor values and see how each normalization method transforms them:
44.4 Data Quality Lab: ESP32 Wokwi Simulation
44.4.1 Lab Overview
In this hands-on lab, you will implement a complete data quality preprocessing pipeline on an ESP32 microcontroller. The simulation demonstrates real-world techniques for handling sensor data problems including outliers, missing values, noise, and the need for normalization.
What You Will Learn:
- Sensor Data Validation: Implementing range checks and rate-of-change validation
- Outlier Detection: Using Z-score and IQR methods to identify anomalous readings
- Missing Value Handling: Forward-fill and interpolation techniques for gap handling
- Noise Filtering: Moving average, median filter, and exponential smoothing
- Data Normalization: Min-max scaling and Z-score normalization for multi-sensor fusion
Skills Practiced:
- Real-time data processing on embedded systems
- Statistical calculations with limited memory
- Circular buffer implementations
- Streaming algorithm design
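Before writing the C++ version, it can help to prototype the lab's core data structure in Python: a fixed-size circular window with a median, which is what makes a median filter reject spikes. This is a sketch of the idea, not the lab code itself:

```python
from collections import deque

class CircularBuffer:
    """Fixed-size window: the oldest sample falls off as each new one arrives."""
    def __init__(self, size):
        self.buf = deque(maxlen=size)

    def push(self, value):
        self.buf.append(value)

    def median(self):
        s = sorted(self.buf)
        n = len(s)
        mid = n // 2
        return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2

window = CircularBuffer(5)
for v in [20.0, 21.0, 95.0, 20.5, 21.5, 22.0]:  # 95.0 is a spike
    window.push(v)

print(window.median())  # the spike is outvoted by the rest of the window
```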
44.4.2 Lab Components
| Technique | Implementation | Visual Indicator |
|---|---|---|
| Range Validation | Physical bounds checking | Red LED on violation |
| Z-Score Outliers | Rolling statistics | Yellow LED on outlier |
| Median Filter | 5-sample sliding window | Smoothed output in serial |
| Moving Average | 10-sample window | Trend visualization |
| Normalization | 0-1 scaling | Percentage output |
44.4.3 Wokwi Simulator
Use the embedded simulator below to build your data quality preprocessing system:
44.4.4 Circuit Setup
Connect the sensors and indicators to the ESP32:
| Component | ESP32 Pin | Purpose |
|---|---|---|
| Temperature Sensor (NTC) | GPIO 34 | Primary data source |
| Light Sensor (LDR) | GPIO 35 | Secondary data source |
| Potentiometer | GPIO 32 | Simulate sensor drift |
| Red LED | GPIO 18 | Range violation indicator |
| Yellow LED | GPIO 19 | Outlier detection indicator |
| Green LED | GPIO 21 | Valid data indicator |
| Blue LED | GPIO 22 | Missing data indicator |
Add this diagram.json configuration in Wokwi:
```json
{
  "version": 1,
  "author": "IoT Class - Data Quality Lab",
  "editor": "wokwi",
  "parts": [
    { "type": "wokwi-esp32-devkit-v1", "id": "esp", "top": 0, "left": 0 },
    { "type": "wokwi-ntc-temperature-sensor", "id": "temp1", "top": -120, "left": 80 },
    { "type": "wokwi-photoresistor-sensor", "id": "ldr1", "top": -120, "left": 180 },
    { "type": "wokwi-potentiometer", "id": "pot1", "top": -120, "left": 280 },
    { "type": "wokwi-led", "id": "led_red", "top": 180, "left": 80, "attrs": { "color": "red" } },
    { "type": "wokwi-led", "id": "led_yellow", "top": 180, "left": 130, "attrs": { "color": "yellow" } },
    { "type": "wokwi-led", "id": "led_green", "top": 180, "left": 180, "attrs": { "color": "green" } },
    { "type": "wokwi-led", "id": "led_blue", "top": 180, "left": 230, "attrs": { "color": "blue" } },
    { "type": "wokwi-resistor", "id": "r1", "top": 230, "left": 80, "attrs": { "value": "220" } },
    { "type": "wokwi-resistor", "id": "r2", "top": 230, "left": 130, "attrs": { "value": "220" } },
    { "type": "wokwi-resistor", "id": "r3", "top": 230, "left": 180, "attrs": { "value": "220" } },
    { "type": "wokwi-resistor", "id": "r4", "top": 230, "left": 230, "attrs": { "value": "220" } }
  ],
  "connections": [
    ["esp:GND.1", "temp1:GND", "black", ["h0"]],
    ["esp:3V3", "temp1:VCC", "red", ["h0"]],
    ["esp:34", "temp1:OUT", "green", ["h0"]],
    ["esp:GND.1", "ldr1:GND", "black", ["h0"]],
    ["esp:3V3", "ldr1:VCC", "red", ["h0"]],
    ["esp:35", "ldr1:OUT", "orange", ["h0"]],
    ["esp:GND.1", "pot1:GND", "black", ["h0"]],
    ["esp:3V3", "pot1:VCC", "red", ["h0"]],
    ["esp:32", "pot1:SIG", "purple", ["h0"]],
    ["esp:18", "led_red:A", "red", ["h0"]],
    ["led_red:C", "r1:1", "black", ["h0"]],
    ["r1:2", "esp:GND.1", "black", ["h0"]],
    ["esp:19", "led_yellow:A", "yellow", ["h0"]],
    ["led_yellow:C", "r2:1", "black", ["h0"]],
    ["r2:2", "esp:GND.1", "black", ["h0"]],
    ["esp:21", "led_green:A", "green", ["h0"]],
    ["led_green:C", "r3:1", "black", ["h0"]],
    ["r3:2", "esp:GND.1", "black", ["h0"]],
    ["esp:22", "led_blue:A", "blue", ["h0"]],
    ["led_blue:C", "r4:1", "black", ["h0"]],
    ["r4:2", "esp:GND.1", "black", ["h0"]]
  ]
}
```
44.4.5 Complete Arduino Code
Copy this code into the Wokwi editor:
// ============================================================================
// DATA QUALITY LAB: Comprehensive IoT Data Preprocessing Pipeline
// ============================================================================
// Demonstrates: Validation, Outlier Detection, Missing Value Handling,
// Noise Filtering, Normalization, and Data Scaling
// ============================================================================
#include <Arduino.h>
#include <math.h>
// ====================== PIN DEFINITIONS ======================
const int TEMP_PIN = 34; // Temperature sensor (NTC)
const int LIGHT_PIN = 35; // Light sensor (LDR)
const int DRIFT_PIN = 32; // Potentiometer for drift simulation
const int LED_RED = 18; // Range violation indicator
const int LED_YELLOW = 19; // Outlier detection indicator
const int LED_GREEN = 21; // Valid data indicator
const int LED_BLUE = 22; // Missing data indicator
// ====================== SAMPLING PARAMETERS ======================
const int SAMPLE_INTERVAL_MS = 200; // 5 Hz sampling rate
const int VALIDATION_WINDOW = 50; // Samples for statistics
const int FILTER_WINDOW_MEDIAN = 5; // Median filter window
const int FILTER_WINDOW_MA = 10; // Moving average window
const float EXP_SMOOTHING_ALPHA = 0.3; // Exponential smoothing factor
// ====================== VALIDATION THRESHOLDS ======================
// Temperature sensor valid range (Celsius after conversion)
const float TEMP_MIN_VALID = -10.0;
const float TEMP_MAX_VALID = 60.0;
const float TEMP_MAX_RATE = 2.0; // Max 2C change per second
// Light sensor valid range (0-4095 ADC, 0-100000 lux mapped)
const float LIGHT_MIN_VALID = 0.0;
const float LIGHT_MAX_VALID = 100000.0;
// Outlier detection thresholds
const float ZSCORE_THRESHOLD = 3.0; // Standard deviations
const float IQR_MULTIPLIER = 1.5; // IQR outlier multiplier
const float MAD_THRESHOLD = 3.5; // Modified Z-score threshold
// Missing data parameters
const int MAX_MISSING_SAMPLES = 10; // Max consecutive missing allowed
const float MISSING_PROBABILITY = 0.05; // Simulate 5% missing data
// ====================== DATA STRUCTURES ======================
// Circular buffer for streaming statistics
template<int SIZE>
struct CircularBuffer {
float data[SIZE];
int head;
int count;
CircularBuffer() : head(0), count(0) {
for (int i = 0; i < SIZE; i++) data[i] = 0;
}
void push(float value) {
data[head] = value;
head = (head + 1) % SIZE;
if (count < SIZE) count++;
}
float get(int index) const {
// Get value at index (0 = oldest)
int actual = (head - count + index + SIZE) % SIZE;
return data[actual];
}
bool isFull() const { return count == SIZE; }
int size() const { return count; }
};
// Statistics calculator for streaming data
struct StreamingStats {
float sum;
float sumSq;
int count;
float min_val;
float max_val;
StreamingStats() : sum(0), sumSq(0), count(0),
min_val(INFINITY), max_val(-INFINITY) {}
void reset() {
sum = sumSq = 0;
count = 0;
min_val = INFINITY;
max_val = -INFINITY;
}
void add(float value) {
sum += value;
sumSq += value * value;
count++;
if (value < min_val) min_val = value;
if (value > max_val) max_val = value;
}
float mean() const { return count > 0 ? sum / count : 0; }
float variance() const {
if (count < 2) return 0;
return (sumSq - sum * sum / count) / (count - 1);
}
float stdDev() const { return sqrt(variance()); }
};
// Sensor reading with quality metadata
struct SensorReading {
float rawValue;
float cleanedValue;
float normalizedValue;
unsigned long timestamp;
bool isValid;
bool isOutlier;
bool isMissing;
bool isImputed;
String qualityFlags;
};
// ====================== GLOBAL STATE ======================
// Circular buffers for different processing stages
CircularBuffer<VALIDATION_WINDOW> tempRawBuffer;
CircularBuffer<VALIDATION_WINDOW> tempCleanBuffer;
CircularBuffer<FILTER_WINDOW_MEDIAN> tempMedianBuffer;
CircularBuffer<FILTER_WINDOW_MA> tempMABuffer;
CircularBuffer<VALIDATION_WINDOW> lightRawBuffer;
CircularBuffer<VALIDATION_WINDOW> lightCleanBuffer;
// Streaming statistics
StreamingStats tempStats;
StreamingStats lightStats;
// For rate-of-change validation
float lastValidTemp = NAN;
unsigned long lastValidTempTime = 0;
float lastValidLight = NAN;
unsigned long lastValidLightTime = 0;
// For exponential smoothing
float expSmoothedTemp = NAN;
float expSmoothedLight = NAN;
// For missing data handling
int consecutiveMissingTemp = 0;
int consecutiveMissingLight = 0;
float lastImputedTemp = NAN;
float lastImputedLight = NAN;
// Normalization parameters (learned from data)
float tempMinSeen = INFINITY;
float tempMaxSeen = -INFINITY;
float lightMinSeen = INFINITY;
float lightMaxSeen = -INFINITY;
// Statistics counters
unsigned long totalSamples = 0;
unsigned long validSamples = 0;
unsigned long outlierSamples = 0;
unsigned long missingSamples = 0;
unsigned long imputedSamples = 0;
unsigned long rangeViolations = 0;
unsigned long rateViolations = 0;
// ====================== HELPER FUNCTIONS ======================
// Convert ADC reading to temperature (NTC thermistor approximation)
float adcToTemperature(int adcValue) {
if (adcValue == 0) return -INFINITY;
if (adcValue >= 4095) return INFINITY;
// Simplified Steinhart-Hart approximation for 10K NTC
float resistance = 10000.0 * (4095.0 / adcValue - 1.0);
float steinhart = resistance / 10000.0;
steinhart = log(steinhart);
steinhart /= 3950.0;
steinhart += 1.0 / (25.0 + 273.15);
steinhart = 1.0 / steinhart;
steinhart -= 273.15;
return steinhart;
}
// Convert ADC reading to light level (LDR approximation)
float adcToLight(int adcValue) {
// Map ADC to approximate lux (logarithmic)
if (adcValue < 10) return 0;
float lux = 100000.0 * pow((float)adcValue / 4095.0, 2);
return lux;
}
// ====================== VALIDATION FUNCTIONS ======================
// Range validation - check if value is within physical bounds
bool validateRange(float value, float minValid, float maxValid, String& errorMsg) {
if (isnan(value) || isinf(value)) {
errorMsg = "NaN/Inf";
return false;
}
if (value < minValid) {
errorMsg = "Below min (" + String(minValid) + ")";
return false;
}
if (value > maxValid) {
errorMsg = "Above max (" + String(maxValid) + ")";
return false;
}
errorMsg = "OK";
return true;
}
// Rate-of-change validation - detect impossible jumps
bool validateRateOfChange(float currentValue, float lastValue,
unsigned long currentTime, unsigned long lastTime,
float maxRate, String& errorMsg) {
if (isnan(lastValue) || lastTime == 0) {
errorMsg = "First reading";
return true;
}
float timeDelta = (currentTime - lastTime) / 1000.0; // seconds
if (timeDelta <= 0) {
errorMsg = "Invalid timestamp";
return false;
}
float rate = abs(currentValue - lastValue) / timeDelta;
if (rate > maxRate) {
errorMsg = "Rate " + String(rate, 2) + " exceeds max " + String(maxRate);
return false;
}
errorMsg = "Rate OK (" + String(rate, 2) + ")";
return true;
}
// ====================== OUTLIER DETECTION ======================
// Z-Score outlier detection
bool detectOutlierZScore(float value, float mean, float stdDev,
float threshold, float& zScore) {
if (stdDev == 0) {
zScore = 0;
return false;
}
zScore = abs((value - mean) / stdDev);
return zScore > threshold;
}
// IQR outlier detection (requires sorted buffer)
bool detectOutlierIQR(CircularBuffer<VALIDATION_WINDOW>& buffer, float value,
float multiplier, float& lowerBound, float& upperBound) {
if (buffer.size() < 10) return false;
// Copy to temporary array for sorting
float sorted[VALIDATION_WINDOW];
int n = buffer.size();
for (int i = 0; i < n; i++) {
sorted[i] = buffer.get(i);
}
// Simple bubble sort (OK for small buffers)
for (int i = 0; i < n - 1; i++) {
for (int j = 0; j < n - i - 1; j++) {
if (sorted[j] > sorted[j + 1]) {
float temp = sorted[j];
sorted[j] = sorted[j + 1];
sorted[j + 1] = temp;
}
}
}
// Calculate quartiles
int q1Idx = n / 4;
int q3Idx = 3 * n / 4;
float q1 = sorted[q1Idx];
float q3 = sorted[q3Idx];
float iqr = q3 - q1;
lowerBound = q1 - multiplier * iqr;
upperBound = q3 + multiplier * iqr;
return (value < lowerBound) || (value > upperBound);
}
// ====================== NOISE FILTERING ======================
// Moving average filter
float filterMovingAverage(CircularBuffer<FILTER_WINDOW_MA>& buffer, float newValue) {
buffer.push(newValue);
float sum = 0;
for (int i = 0; i < buffer.size(); i++) {
sum += buffer.get(i);
}
return sum / buffer.size();
}
// Median filter (excellent for spike removal)
float filterMedian(CircularBuffer<FILTER_WINDOW_MEDIAN>& buffer, float newValue) {
buffer.push(newValue);
// Copy and sort
float sorted[FILTER_WINDOW_MEDIAN];
int n = buffer.size();
for (int i = 0; i < n; i++) {
sorted[i] = buffer.get(i);
}
for (int i = 0; i < n - 1; i++) {
for (int j = 0; j < n - i - 1; j++) {
if (sorted[j] > sorted[j + 1]) {
float temp = sorted[j];
sorted[j] = sorted[j + 1];
sorted[j + 1] = temp;
}
}
}
if (n % 2 == 0) {
return (sorted[n/2 - 1] + sorted[n/2]) / 2.0;
}
return sorted[n / 2];
}
// Exponential smoothing filter
float filterExponentialSmoothing(float& smoothed, float newValue, float alpha) {
if (isnan(smoothed)) {
smoothed = newValue;
} else {
smoothed = alpha * newValue + (1 - alpha) * smoothed;
}
return smoothed;
}
// ====================== MISSING VALUE HANDLING ======================
// Simulate missing data (for demonstration)
bool simulateMissingData() {
return random(1000) < (MISSING_PROBABILITY * 1000);
}
// Forward-fill imputation with limit
float imputeForwardFill(float lastValidValue, int& consecutiveMissing,
int maxMissing, bool& wasImputed) {
consecutiveMissing++;
if (consecutiveMissing > maxMissing || isnan(lastValidValue)) {
wasImputed = false;
return NAN;
}
wasImputed = true;
return lastValidValue;
}
// ====================== NORMALIZATION ======================
// Min-Max scaling to 0-1 range
float normalizeMinMax(float value, float minSeen, float maxSeen) {
if (maxSeen == minSeen) return 0.5;
return (value - minSeen) / (maxSeen - minSeen);
}
// Z-Score normalization (standardization)
float normalizeZScore(float value, float mean, float stdDev) {
if (stdDev == 0) return 0;
return (value - mean) / stdDev;
}
// Update normalization parameters
void updateNormalizationParams(float value, float& minSeen, float& maxSeen) {
if (value < minSeen) minSeen = value;
if (value > maxSeen) maxSeen = value;
}
// ====================== MAIN PROCESSING FUNCTION ======================
SensorReading processTemperature(int adcValue, unsigned long timestamp) {
SensorReading reading;
reading.timestamp = timestamp;
reading.isValid = true;
reading.isOutlier = false;
reading.isMissing = false;
reading.isImputed = false;
reading.qualityFlags = "";
// Step 0: Check for simulated missing data
if (simulateMissingData()) {
reading.isMissing = true;
missingSamples++;
// Try forward-fill imputation
bool wasImputed;
float imputed = imputeForwardFill(lastValidTemp, consecutiveMissingTemp,
MAX_MISSING_SAMPLES, wasImputed);
if (wasImputed) {
reading.rawValue = NAN;
reading.cleanedValue = imputed;
reading.isImputed = true;
reading.qualityFlags += "IMPUTED_FFILL ";
imputedSamples++;
lastImputedTemp = imputed;
} else {
reading.rawValue = NAN;
reading.cleanedValue = NAN;
reading.normalizedValue = NAN;
reading.qualityFlags += "MISSING_NO_IMPUTE ";
return reading;
}
} else {
consecutiveMissingTemp = 0;
// Step 1: Convert ADC to temperature
reading.rawValue = adcToTemperature(adcValue);
// Step 2: Range validation
String rangeError;
if (!validateRange(reading.rawValue, TEMP_MIN_VALID, TEMP_MAX_VALID, rangeError)) {
reading.isValid = false;
reading.qualityFlags += "RANGE_VIOLATION(" + rangeError + ") ";
rangeViolations++;
}
// Step 3: Rate-of-change validation
String rateError;
if (!validateRateOfChange(reading.rawValue, lastValidTemp,
timestamp, lastValidTempTime,
TEMP_MAX_RATE, rateError)) {
reading.isValid = false;
reading.qualityFlags += "RATE_VIOLATION(" + rateError + ") ";
rateViolations++;
}
// Step 4: Outlier detection (only if range-valid)
if (reading.isValid && tempStats.count > 20) {
float zScore;
if (detectOutlierZScore(reading.rawValue, tempStats.mean(),
tempStats.stdDev(), ZSCORE_THRESHOLD, zScore)) {
reading.isOutlier = true;
reading.qualityFlags += "ZSCORE_OUTLIER(z=" + String(zScore, 2) + ") ";
outlierSamples++;
}
float lowerBound, upperBound;
if (detectOutlierIQR(tempRawBuffer, reading.rawValue,
IQR_MULTIPLIER, lowerBound, upperBound)) {
reading.isOutlier = true;
reading.qualityFlags += "IQR_OUTLIER ";
}
}
// Step 5: Apply noise filtering
if (reading.isValid && !reading.isOutlier) {
// Median filter first (removes spikes)
float medianFiltered = filterMedian(tempMedianBuffer, reading.rawValue);
// Then moving average (smooths remaining noise)
float maFiltered = filterMovingAverage(tempMABuffer, medianFiltered);
// Exponential smoothing for final output
reading.cleanedValue = filterExponentialSmoothing(expSmoothedTemp,
maFiltered,
EXP_SMOOTHING_ALPHA);
// Update statistics with valid data
tempStats.add(reading.rawValue);
tempRawBuffer.push(reading.rawValue);
tempCleanBuffer.push(reading.cleanedValue);
// Update last valid
lastValidTemp = reading.rawValue;
lastValidTempTime = timestamp;
validSamples++;
} else if (reading.isOutlier) {
// Use median of recent values for outlier replacement
reading.cleanedValue = filterMedian(tempMedianBuffer,
tempMedianBuffer.get(tempMedianBuffer.size() - 1));
reading.qualityFlags += "OUTLIER_REPLACED ";
} else {
// Range/rate violation - use last valid with flag
if (!isnan(lastValidTemp)) {
reading.cleanedValue = lastValidTemp;
reading.qualityFlags += "USING_LAST_VALID ";
} else {
reading.cleanedValue = NAN;
}
}
}
// Step 6: Normalization
if (!isnan(reading.cleanedValue)) {
updateNormalizationParams(reading.cleanedValue, tempMinSeen, tempMaxSeen);
reading.normalizedValue = normalizeMinMax(reading.cleanedValue,
tempMinSeen, tempMaxSeen);
} else {
reading.normalizedValue = NAN;
}
if (reading.qualityFlags == "") {
reading.qualityFlags = "CLEAN";
}
totalSamples++;
return reading;
}
SensorReading processLight(int adcValue, unsigned long timestamp) {
SensorReading reading;
reading.timestamp = timestamp;
reading.isValid = true;
reading.isOutlier = false;
reading.isMissing = false;
reading.isImputed = false;
reading.qualityFlags = "";
// Similar processing as temperature (abbreviated for space)
if (simulateMissingData()) {
reading.isMissing = true;
missingSamples++;
bool wasImputed;
float imputed = imputeForwardFill(lastValidLight, consecutiveMissingLight,
MAX_MISSING_SAMPLES, wasImputed);
if (wasImputed) {
reading.rawValue = NAN;
reading.cleanedValue = imputed;
reading.isImputed = true;
reading.qualityFlags = "IMPUTED_FFILL";
imputedSamples++;
} else {
reading.rawValue = NAN;
reading.cleanedValue = NAN;
reading.normalizedValue = NAN;
reading.qualityFlags = "MISSING_NO_IMPUTE";
return reading;
}
} else {
consecutiveMissingLight = 0;
reading.rawValue = adcToLight(adcValue);
// Simplified processing for light sensor
String rangeError;
if (!validateRange(reading.rawValue, LIGHT_MIN_VALID, LIGHT_MAX_VALID, rangeError)) {
reading.isValid = false;
reading.qualityFlags = "RANGE_VIOLATION";
rangeViolations++;
}
if (reading.isValid) {
reading.cleanedValue = filterExponentialSmoothing(expSmoothedLight,
reading.rawValue,
EXP_SMOOTHING_ALPHA);
lightStats.add(reading.rawValue);
lightRawBuffer.push(reading.rawValue);
lastValidLight = reading.rawValue;
lastValidLightTime = timestamp;
validSamples++;
} else {
reading.cleanedValue = lastValidLight;
}
}
// Normalization
if (!isnan(reading.cleanedValue)) {
updateNormalizationParams(reading.cleanedValue, lightMinSeen, lightMaxSeen);
reading.normalizedValue = normalizeMinMax(reading.cleanedValue,
lightMinSeen, lightMaxSeen);
}
if (reading.qualityFlags == "") {
reading.qualityFlags = "CLEAN";
}
totalSamples++;
return reading;
}
// ====================== LED INDICATOR CONTROL ======================
void updateLEDs(const SensorReading& tempReading, const SensorReading& lightReading) {
// Red LED: Range violation
if (!tempReading.isValid || !lightReading.isValid) {
digitalWrite(LED_RED, HIGH);
} else {
digitalWrite(LED_RED, LOW);
}
// Yellow LED: Outlier detected
if (tempReading.isOutlier || lightReading.isOutlier) {
digitalWrite(LED_YELLOW, HIGH);
} else {
digitalWrite(LED_YELLOW, LOW);
}
// Green LED: Clean valid data
if (tempReading.qualityFlags == "CLEAN" && lightReading.qualityFlags == "CLEAN") {
digitalWrite(LED_GREEN, HIGH);
} else {
digitalWrite(LED_GREEN, LOW);
}
// Blue LED: Missing/imputed data
if (tempReading.isMissing || tempReading.isImputed ||
lightReading.isMissing || lightReading.isImputed) {
digitalWrite(LED_BLUE, HIGH);
} else {
digitalWrite(LED_BLUE, LOW);
}
}
// ====================== SERIAL OUTPUT ======================
void printSensorReading(const char* sensorName, const SensorReading& reading) {
Serial.print(sensorName);
Serial.print(": Raw=");
if (isnan(reading.rawValue)) {
Serial.print("NaN");
} else {
Serial.print(reading.rawValue, 2);
}
Serial.print(", Clean=");
if (isnan(reading.cleanedValue)) {
Serial.print("NaN");
} else {
Serial.print(reading.cleanedValue, 2);
}
Serial.print(", Norm=");
if (isnan(reading.normalizedValue)) {
Serial.print("NaN");
} else {
Serial.print(reading.normalizedValue, 3);
}
Serial.print(" [");
Serial.print(reading.qualityFlags);
Serial.println("]");
}
void printStatistics() {
Serial.println("\n========== DATA QUALITY STATISTICS ==========");
Serial.print("Total Samples: ");
Serial.println(totalSamples);
Serial.print("Valid Samples: ");
Serial.print(validSamples);
Serial.print(" (");
Serial.print(100.0 * validSamples / max(totalSamples, 1UL), 1);
Serial.println("%)");
Serial.print("Outliers Detected: ");
Serial.print(outlierSamples);
Serial.print(" (");
Serial.print(100.0 * outlierSamples / max(totalSamples, 1UL), 1);
Serial.println("%)");
Serial.print("Missing Values: ");
Serial.print(missingSamples);
Serial.print(" (");
Serial.print(100.0 * missingSamples / max(totalSamples, 1UL), 1);
Serial.println("%)");
Serial.print("Imputed Values: ");
Serial.print(imputedSamples);
Serial.println();
Serial.print("Range Violations: ");
Serial.println(rangeViolations);
Serial.print("Rate Violations: ");
Serial.println(rateViolations);
Serial.println("\n--- Temperature Statistics ---");
Serial.print("Mean: ");
Serial.print(tempStats.mean(), 2);
Serial.print(" C, StdDev: ");
Serial.print(tempStats.stdDev(), 2);
Serial.print(" C, Range: [");
Serial.print(tempMinSeen, 1);
Serial.print(", ");
Serial.print(tempMaxSeen, 1);
Serial.println("] C");
Serial.println("\n--- Light Statistics ---");
Serial.print("Mean: ");
Serial.print(lightStats.mean(), 0);
Serial.print(" lux, StdDev: ");
Serial.print(lightStats.stdDev(), 0);
Serial.print(" lux, Range: [");
Serial.print(lightMinSeen, 0);
Serial.print(", ");
Serial.print(lightMaxSeen, 0);
Serial.println("] lux");
Serial.println("==============================================\n");
}
// ====================== SETUP AND LOOP ======================
void setup() {
Serial.begin(115200);
delay(1000);
// Initialize pins
pinMode(TEMP_PIN, INPUT);
pinMode(LIGHT_PIN, INPUT);
pinMode(DRIFT_PIN, INPUT);
pinMode(LED_RED, OUTPUT);
pinMode(LED_YELLOW, OUTPUT);
pinMode(LED_GREEN, OUTPUT);
pinMode(LED_BLUE, OUTPUT);
// All LEDs off initially
digitalWrite(LED_RED, LOW);
digitalWrite(LED_YELLOW, LOW);
digitalWrite(LED_GREEN, LOW);
digitalWrite(LED_BLUE, LOW);
// Seed random for missing data simulation
randomSeed(analogRead(0));
Serial.println("============================================");
Serial.println(" DATA QUALITY LAB: IoT Preprocessing Demo ");
Serial.println("============================================");
Serial.println("Features demonstrated:");
Serial.println(" - Range validation (physical bounds)");
Serial.println(" - Rate-of-change validation");
Serial.println(" - Z-Score outlier detection");
Serial.println(" - IQR outlier detection");
Serial.println(" - Median filter (spike removal)");
Serial.println(" - Moving average filter (smoothing)");
Serial.println(" - Exponential smoothing");
Serial.println(" - Missing value imputation (forward-fill)");
Serial.println(" - Min-Max normalization (0-1 scaling)");
Serial.println("============================================");
Serial.println("LED Indicators:");
Serial.println(" RED: Range/Rate violation");
Serial.println(" YELLOW: Outlier detected");
Serial.println(" GREEN: Clean valid data");
Serial.println(" BLUE: Missing/Imputed data");
Serial.println("============================================\n");
Serial.println("Starting data collection...\n");
}
unsigned long lastSampleTime = 0;
unsigned long lastStatsTime = 0;
const unsigned long STATS_INTERVAL_MS = 10000; // Print stats every 10 seconds
void loop() {
unsigned long currentTime = millis();
// Sample at defined interval
if (currentTime - lastSampleTime >= SAMPLE_INTERVAL_MS) {
lastSampleTime = currentTime;
// Read sensors
int tempADC = analogRead(TEMP_PIN);
int lightADC = analogRead(LIGHT_PIN);
int driftADC = analogRead(DRIFT_PIN);
// Add simulated drift from potentiometer (optional)
float driftFactor = (driftADC - 2048) / 2048.0; // -1 to +1
tempADC = constrain(tempADC + (int)(driftFactor * 500), 0, 4095);
// Process readings through data quality pipeline
SensorReading tempReading = processTemperature(tempADC, currentTime);
SensorReading lightReading = processLight(lightADC, currentTime);
// Update LED indicators
updateLEDs(tempReading, lightReading);
// Print readings
printSensorReading("TEMP", tempReading);
printSensorReading("LIGHT", lightReading);
Serial.println();
}
// Print statistics periodically
if (currentTime - lastStatsTime >= STATS_INTERVAL_MS) {
lastStatsTime = currentTime;
printStatistics();
}
}
44.4.6 Step-by-Step Instructions
44.4.6.1 Step 1: Set Up the Simulator
- Open the Wokwi simulator embedded above (or visit wokwi.com)
- Create a new ESP32 project
- Click the diagram.json tab and paste the circuit configuration
- Replace the default code with the complete Arduino code above
44.4.6.2 Step 2: Run and Observe Validation
- Click the Play button to start the simulation
- Open the Serial Monitor to see the data processing output
- Observe the quality flags showing validation status for each reading
- Watch for “CLEAN” flags indicating data passed all quality checks
44.4.6.3 Step 3: Trigger Range Violations
- Click the NTC temperature sensor in the simulator
- Drag the slider to extreme values (very hot or very cold)
- Watch the RED LED turn on when values exceed physical bounds
- Note the “RANGE_VIOLATION” flag in the serial output
- Observe how the system uses the last valid value when current is invalid
44.4.6.4 Step 4: Observe Outlier Detection
- Make sudden temperature changes by quickly dragging the sensor slider
- Watch the YELLOW LED blink when outliers are detected
- See the “ZSCORE_OUTLIER” and “IQR_OUTLIER” flags in output
- Notice how outliers are replaced with median values
44.4.6.5 Step 5: Observe Missing Data Handling
- The code simulates 5% random missing data
- Watch the BLUE LED flash when data is missing or imputed
- See “IMPUTED_FFILL” flags showing forward-fill imputation
- Note “MISSING_NO_IMPUTE” when too many consecutive values are missing
44.4.6.6 Step 6: Experiment with Filtering
- Observe the Raw vs Clean values in the serial output
- Notice how Clean values are smoother due to median + moving average filters
- Compare the normalized values (0-1 range) for multi-sensor comparison
44.4.6.7 Step 7: Analyze Statistics
- Wait for the statistics report (prints every 10 seconds)
- Review the data quality percentages: valid, outliers, missing
- Examine the sensor statistics: mean, standard deviation, range
- Consider how these metrics would inform production monitoring
44.4.7 Challenge Exercises
Challenge 1: Implement Median Absolute Deviation (MAD) Filtering
Difficulty: Intermediate
Task: The code defines a MAD_THRESHOLD constant but does not implement MAD outlier detection. Implement a detectOutlierMAD() function and integrate it into the processTemperature() pipeline as the primary outlier detection method.
Hints:
- MAD is more robust than Z-score for non-Gaussian data
- You will need to implement a detectOutlierMAD() function (modified Z-score = 0.6745 * (x - median) / MAD)
- Replace or complement the Z-score check with MAD
Expected Outcome: MAD should detect outliers even when extreme values skew the mean and standard deviation.
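The modified Z-score formula from the hint can be sketched in a few lines of Python before porting it to the Arduino sketch's `detectOutlierMAD()`. The function name and the 3.5 threshold (a conventional default for modified Z-scores) are illustrative choices, not part of the lab code:

```python
# Sketch of MAD-based outlier detection (modified Z-score).
# Illustrative Python; the challenge asks for an equivalent C++
# detectOutlierMAD() integrated into processTemperature().

def median(values):
    s = sorted(values)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2.0

def is_outlier_mad(window, x, threshold=3.5):
    """Flag x as an outlier if its modified Z-score exceeds threshold."""
    med = median(window)
    mad = median([abs(v - med) for v in window])
    if mad == 0:
        return False  # degenerate window: no spread to judge against
    modified_z = 0.6745 * (x - med) / mad
    return abs(modified_z) > threshold
```

Because both center (median) and spread (MAD) are medians, a single extreme value in the window barely moves them, which is exactly why this check survives the skew that breaks Z-score detection.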
Challenge 2: Add Exponential Backoff for Missing Data
Difficulty: Intermediate
Task: Currently, forward-fill uses a fixed limit (MAX_MISSING_SAMPLES). Implement an exponential decay on the confidence of imputed values.
Requirements:
- Add a confidence field to SensorReading
- Reduce confidence by 10% for each consecutive imputed value
- Stop imputing when confidence drops below 50%
- Display confidence in serial output
Expected Outcome: Imputed values should be flagged with decreasing confidence as gaps grow longer.
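Interpreting "reduce confidence by 10%" as multiplying by 0.9 per consecutive imputed sample (one reasonable reading of the requirement; the names below are illustrative), the stopping rule reduces to a one-line power:

```python
# Confidence-decay sketch for forward-fill imputation.
# Assumption: confidence starts at 1.0 for a real reading and is
# multiplied by 0.9 for each consecutive imputed sample.

IMPUTE_CONFIDENCE_FLOOR = 0.5

def imputation_confidence(consecutive_imputed):
    """Confidence of a forward-filled value after n consecutive imputations."""
    return 0.9 ** consecutive_imputed

def should_impute(consecutive_imputed):
    """Stop imputing once confidence drops below the 50% floor."""
    return imputation_confidence(consecutive_imputed) >= IMPUTE_CONFIDENCE_FLOOR
```

Under this rule, 0.9^6 ≈ 0.531 still passes but 0.9^7 ≈ 0.478 does not, so imputation stops after the seventh consecutive gap.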
Challenge 3: Cross-Sensor Validation
Difficulty: Advanced
Task: Add plausibility checking between temperature and light sensors. If it is very bright (high light), temperature should be reasonable for daytime.
Requirements:
- If light > 50000 lux and temperature < 10C, flag as suspicious
- If light < 100 lux and temperature > 35C (outdoors), flag as suspicious
- Add a new LED or serial indicator for cross-sensor anomalies
Expected Outcome: The system should detect when sensor readings are physically inconsistent with each other.
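The two rules reduce to a small predicate. The thresholds come straight from the requirements; the function name is an illustrative choice:

```python
# Cross-sensor plausibility sketch using the thresholds from the task.

def cross_sensor_suspicious(light_lux, temp_c):
    """Return True when light and temperature readings are physically inconsistent."""
    if light_lux > 50000 and temp_c < 10:
        return True   # bright daylight but near-freezing: implausible outdoors
    if light_lux < 100 and temp_c > 35:
        return True   # darkness but very hot: implausible outdoors
    return False
```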
Challenge 4: Implement Kalman Filter
Difficulty: Advanced
Task: Replace the exponential smoothing filter with a simple Kalman filter for temperature.
Requirements:
- Implement 1D Kalman filter with process noise and measurement noise
- Estimate the Kalman gain dynamically
- Output both the filtered value and the uncertainty estimate
Learning: Kalman filters provide optimal estimation when process and measurement noise characteristics are known.
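A 1D Kalman filter is compact enough to sketch here. This Python version ports almost line-for-line to the Arduino sketch; the process noise q and measurement noise r are assumed starting values you would tune for the actual sensor:

```python
# Minimal 1D Kalman filter sketch for a scalar temperature estimate.
# State model: temperature is a random walk (constant plus process noise).

class Kalman1D:
    def __init__(self, q=0.01, r=0.5, initial=25.0, initial_p=1.0):
        self.q = q          # process noise variance (assumed, tune per sensor)
        self.r = r          # measurement noise variance (assumed, tune per sensor)
        self.x = initial    # state estimate
        self.p = initial_p  # estimate uncertainty (variance)

    def update(self, measurement):
        # Predict: uncertainty grows by the process noise
        self.p += self.q
        # Update: Kalman gain balances prediction vs measurement trust
        k = self.p / (self.p + self.r)
        self.x += k * (measurement - self.x)
        self.p *= (1 - k)
        return self.x, self.p  # filtered value and its uncertainty
```

Unlike exponential smoothing's fixed alpha, the gain k here adapts: it is large when the estimate is uncertain and shrinks as confidence accumulates, and the returned p gives the uncertainty estimate the challenge asks for.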
44.4.8 Expected Outcomes
After completing this lab, you should be able to:
- Understand validation trade-offs: Strict validation catches more errors but may reject valid extreme readings
- Choose appropriate outlier methods: Z-score for Gaussian data, IQR/MAD for robust detection
- Select imputation strategies: Forward-fill for slow-changing, interpolation for trending data
- Apply noise filters correctly: Median for spikes, moving average for steady-state noise
- Normalize for fusion: Understand when to use min-max vs Z-score normalization
Quality Metrics to Observe:
- Valid sample rate should be >90% under normal conditions
- Outlier rate should be <5% for stable sensors
- Imputed values should maintain temporal continuity
Worked Example: Multi-Sensor Normalization for Anomaly Detection
Scenario: You’re building an anomaly detection system for a server room with 3 sensors: temperature (15-35°C), CO2 (400-5000 ppm), and humidity (30-70%). A neural network needs all inputs on the same scale to detect abnormal conditions.
Given:
- Temperature sensor: range 15-35°C, current reading 28°C
- CO2 sensor: range 400-5000 ppm, current reading 1200 ppm
- Humidity sensor: range 30-70%, current reading 55%
- Neural network requires inputs in 0-1 range
Question: Normalize these readings and explain why proper normalization matters for the neural network.
Solution:
Step 1: Calculate min-max normalization for each sensor
Temperature normalization:
normalized_temp = (28 - 15) / (35 - 15)
= 13 / 20
= 0.65
CO2 normalization:
normalized_co2 = (1200 - 400) / (5000 - 400)
= 800 / 4600
= 0.174
Humidity normalization:
normalized_humidity = (55 - 30) / (70 - 30)
= 25 / 40
= 0.625
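The arithmetic above can be double-checked with a tiny helper (`min_max` is an illustrative name, not part of the lab code):

```python
# Recomputing the three normalizations from the scenario.

def min_max(x, lo, hi):
    """Scale x from the range [lo, hi] into [0, 1]."""
    return (x - lo) / (hi - lo)

temp_n = min_max(28, 15, 35)       # temperature, 15-35 C
co2_n = min_max(1200, 400, 5000)   # CO2, 400-5000 ppm
hum_n = min_max(55, 30, 70)        # humidity, 30-70 %
```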
Step 2: Show the impact WITHOUT normalization
If we fed raw values to the neural network:
- Input vector: [28, 1200, 55]
- The CO2 value is roughly 43x larger than the temperature value
- Neural network gradient updates would be dominated by CO2
Example gradient calculation (simplified):
Cost function: J = (pred - actual)²
Gradient w.r.t. weight: ∂J/∂w = 2 × (pred - actual) × input_value
For temperature: gradient ∝ 28
For CO2: gradient ∝ 1200 (43x larger!)
For humidity: gradient ∝ 55
The network would learn to minimize CO2 error while ignoring temperature and humidity!
Step 3: Verify normalized inputs have equal influence
Normalized input vector: [0.65, 0.174, 0.625]
Now all gradients are on similar scales:
For temperature: gradient ∝ 0.65
For CO2: gradient ∝ 0.174
For humidity: gradient ∝ 0.625
Each sensor contributes equally to gradient updates during training.
Step 4: Calculate percentage of each sensor’s range
This helps interpret the normalized values:
- Temperature: 65% of its range (moderately warm)
- CO2: 17.4% of its range (relatively low, good ventilation)
- Humidity: 62.5% of its range (comfortable level)
Step 5: Detect an anomaly scenario
Normal condition: [0.65, 0.174, 0.625]
Anomaly (AC failure): [0.95, 0.350, 0.825]
- Temperature: 95% of range = 34°C (very hot!)
- CO2: 35% of range = 2010 ppm (rising, poor ventilation)
- Humidity: 82.5% of range = 63% (uncomfortable)
The neural network trained on normalized data can now detect this pattern as anomalous, because all three sensors contributed equally during training.
Key Insight: Without normalization, the neural network’s loss function is dominated by the largest-magnitude features. Min-max scaling ensures each sensor contributes proportionally to its information content, not its arbitrary measurement scale. This is why normalization is mandatory for neural networks, SVM, K-means, and any algorithm that uses distance metrics or gradient descent.
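The distance-metric point can be verified on the worked example's own vectors. Converting the anomaly back to raw units (34°C, 2010 ppm, 63%) and comparing Euclidean distances shows the unnormalized distance is almost entirely CO2:

```python
import math

# Raw-unit vs normalized distance between the normal and anomaly
# conditions from the worked example.

normal_raw = [28, 1200, 55]        # temp C, CO2 ppm, humidity %
anomaly_raw = [34, 2010, 63]       # AC-failure scenario in raw units

normal_norm = [0.65, 0.174, 0.625]
anomaly_norm = [0.95, 0.350, 0.825]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

raw_dist = euclidean(normal_raw, anomaly_raw)
norm_dist = euclidean(normal_norm, anomaly_norm)

# Fraction of the squared raw distance contributed by CO2 alone
co2_share = (2010 - 1200) ** 2 / raw_dist ** 2
```

CO2 accounts for more than 99.9% of the raw squared distance, so a raw-unit K-means or k-NN would effectively be a one-sensor model; after normalization all three sensors contribute comparably.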
44.4.9 Try It: Multi-Sensor Fusion Calculator
Adjust the sensor readings below to see how min-max normalization brings different scales into alignment:
Decision Framework: Normalization Method Selection
Choose the appropriate normalization method based on your data characteristics and downstream algorithm:
| Data Characteristic | Normalization Method | Output Range | Best For | Avoid When |
|---|---|---|---|---|
| Bounded range known (temperature, humidity) | Min-Max Scaling | 0 to 1 | Neural networks, image processing, bounded outputs | Outliers present (they compress valid range) |
| Outliers expected (sensor noise, network latency) | Robust Scaling | Median-centered | K-means, SVM, any distance-based method | Need exact 0-1 bounds |
| Gaussian distribution (many natural phenomena) | Z-Score Normalization | Mean=0, Std=1 | Clustering, PCA, algorithms assuming normal distribution | Binary features (0/1) |
| Exponential/power-law (network traffic, wealth) | Log Transform then Z-Score | Variable | Right-skewed data, multiplicative relationships | Zero or negative values present |
| Mixed data types (some outliers + some bounded) | Hybrid: Robust for outlier features, Min-Max for clean | Variable per feature | Real-world messy datasets | Need uniform scaling method |
Decision Tree:
1. Are there extreme outliers (>5% of values beyond 3σ)?
   - YES → Use Robust Scaling (median + IQR)
   - NO → Continue to step 2
2. Is your algorithm neural-network-based?
   - YES → Use Min-Max Scaling (0-1) for activation function compatibility
   - NO → Continue to step 3
3. Does your algorithm assume Gaussian distribution?
   - YES (PCA, LDA) → Use Z-Score Normalization
   - NO → Continue to step 4
4. Is your data heavily right-skewed (long tail)?
   - YES → Log Transform + Z-Score
   - NO → Default to Min-Max Scaling
Example Python Implementation (using scikit-learn scalers; StandardScaler provides Z-score normalization, and FunctionTransformer stands in for the log step):
import numpy as np
from sklearn.preprocessing import (
    RobustScaler, MinMaxScaler, StandardScaler, FunctionTransformer
)

def select_normalizer(data_characteristics):
    """
    Select appropriate normalization based on data characteristics.
    Returns: (normalizer_class, parameters)
    """
    has_outliers = data_characteristics['outlier_rate'] > 0.05
    is_neural_network = data_characteristics['model_type'] == 'neural_network'
    is_gaussian = data_characteristics['distribution'] == 'normal'
    is_skewed = data_characteristics['skewness'] > 2.0
    if has_outliers:
        return (RobustScaler, {})
    elif is_neural_network:
        return (MinMaxScaler, {'feature_range': (0, 1)})
    elif is_gaussian:
        return (StandardScaler, {})  # Z-score normalization
    elif is_skewed:
        # Log transform; follow with StandardScaler for the Z-score step
        return (FunctionTransformer, {'func': np.log1p})
    else:
        return (MinMaxScaler, {'feature_range': (0, 1)})

# Usage example
characteristics = {
    'outlier_rate': 0.08,    # 8% outliers
    'model_type': 'kmeans',
    'distribution': 'unknown',
    'skewness': 1.2
}
normalizer_class, params = select_normalizer(characteristics)
# Returns: RobustScaler (because outlier_rate > 0.05)
Warning Signs of Wrong Normalization:
- Neural network accuracy plateaus at 60% → Inputs not normalized
- Clustering groups all high-magnitude features together → Need Z-score instead of raw
- Model ignores certain sensors → Their raw ranges are too small (need min-max)
- Training loss explodes after first epoch → Gradients too large (need normalization)
Common Mistake: Normalizing Before Train/Test Split
The Mistake: Calculating normalization parameters (min, max, mean, std) on the entire dataset before splitting into train and test sets. This causes data leakage, where the test set’s statistics influence the training process, leading to overly optimistic performance estimates.
Why It Happens: The normalization step feels like “data preparation” rather than “model training,” so developers apply it before the split. Many tutorials skip this detail. Scikit-learn’s fit_transform() makes it easy to accidentally normalize everything at once.
Example of the Problem:
# WRONG: Normalize first, split second
data = load_sensor_data() # 10,000 samples
normalized = MinMaxScaler().fit_transform(data) # Uses ALL data stats!
train, test = train_test_split(normalized, test_size=0.2)
# The test set's min/max values influenced the scaling parameters!
# Model evaluation is now too optimistic.
Why This Is Wrong:
- Data Leakage: Test set statistics “leak” into training through normalization parameters
- Overfitting: Model appears to generalize better than it actually does
- Production Failure: Real-world data has different min/max than training data
Real-World Example:
Imagine temperature sensor data:
- Training period (winter): 15-25°C
- Test period (summer): 20-35°C
WRONG approach:
# Calculate on ALL data (winter + summer)
overall_min = 15°C, overall_max = 35°C
# Normalize training data using overall stats
train_normalized = (train - 15) / (35 - 15)
# Result: Training temps (15-25°C) map to 0.0-0.5
# Normalize test data using SAME overall stats
test_normalized = (test - 15) / (35 - 15)
# Result: Test temps (20-35°C) map to 0.25-1.0
# Model trained on 0.0-0.5 range, tested on 0.25-1.0 range
# Test performance appears good because model "saw" summer data during normalization!
The Fix: Fit normalization ONLY on training data, then apply to test:
# CORRECT: Split first, normalize second
data = load_sensor_data()
train, test = train_test_split(data, test_size=0.2) # Split FIRST
# Fit normalization on training data ONLY
scaler = MinMaxScaler()
scaler.fit(train) # Learns min=15, max=25 from training (winter)
# Apply fitted scaler to both train and test
train_normalized = scaler.transform(train)
test_normalized = scaler.transform(test) # Summer data (20-35°C) may exceed [0,1]!
# This is CORRECT - test data reflecting real-world distribution
Correct Pipeline Order:
1. Split data into train/validation/test sets (e.g., 70/15/15)
2. Fit normalization parameters on TRAINING data only
3. Transform train, validation, and test sets using those parameters
4. Train model on normalized training data
5. Evaluate on normalized validation/test data
Code Template:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
# 1. Split FIRST
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5)
# 2. Fit normalizer on TRAINING data only
scaler = MinMaxScaler()
scaler.fit(X_train) # Only training data!
# 3. Transform ALL sets using training parameters
X_train_norm = scaler.transform(X_train)
X_val_norm = scaler.transform(X_val)
X_test_norm = scaler.transform(X_test)
# 4. Train model
model.fit(X_train_norm, y_train)
# 5. Evaluate (no data leakage!)
val_score = model.score(X_val_norm, y_val)
test_score = model.score(X_test_norm, y_test)
Warning Signs of This Mistake:
- Test accuracy is suspiciously high (>95%) on first try
- Model performance degrades significantly in production
- Test set values occasionally exceed [0, 1] after normalization (this is actually GOOD - means no leakage!)
- Reviewer asks “when did you fit the scaler?” and you can’t answer clearly
Real-World Impact: A smart building energy prediction model achieved 96% test accuracy during development but only 73% accuracy in production. Root cause: normalization was fit on the entire year’s data (including test set), so the model “saw” summer peak loads during training. In production, the next summer’s peak loads were outside the normalized range, causing poor predictions.
Common Pitfalls
1. Fitting normalisation parameters on the full dataset before train/test split
Computing Min-Max bounds or Z-score mean/std on all data before splitting leaks test set information into training. Always fit normalisation parameters on the training set only and apply them to both train and test sets.
2. Applying the same normalisation to all sensor types
Binary status signals (0/1), count data, and continuous physical measurements require different normalisation strategies. Normalise each channel according to its distribution, not uniformly.
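Per-channel normalization can be a simple dispatch. The channel names, the 15-35°C bound, and the strategy assignments below are illustrative assumptions, not lab code:

```python
import math

# Per-channel normalization sketch: each sensor type gets the strategy
# suited to its distribution instead of one global scaler.

def normalize_record(record):
    """Normalize one multi-sensor record channel by channel."""
    return {
        'door_open': record['door_open'],                  # binary flag: leave as 0/1
        'temp_c': (record['temp_c'] - 15) / (35 - 15),     # bounded physical value: min-max
        'event_count': math.log1p(record['event_count']),  # count data: log compression
    }
```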
3. Normalising labels (target variables)
In regression tasks, some implementations accidentally normalise the prediction target along with features, causing the model to predict normalised units that require inverse transformation. Keep target variables in their original units unless the algorithm specifically requires otherwise.
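When an algorithm genuinely requires a scaled target, keep the fitted parameters so predictions can be mapped back to original units. A hand-rolled stand-in for scikit-learn's fit/transform/inverse_transform pattern (the class name is illustrative):

```python
# Sketch: scale a regression target while retaining the parameters
# needed to convert predictions back to real units.

class TargetScaler:
    def fit(self, y):
        self.lo, self.hi = min(y), max(y)
        return self

    def transform(self, y):
        return [(v - self.lo) / (self.hi - self.lo) for v in y]

    def inverse_transform(self, y_scaled):
        # Undo the scaling so predictions come back in original units
        return [v * (self.hi - self.lo) + self.lo for v in y_scaled]
```

The inverse step is the part implementations forget: without it, a model trained on scaled targets reports predictions in meaningless 0-1 units.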
4. Not re-normalising when new sensors are added
Adding a new sensor channel to an existing normalised dataset requires recomputing normalisation parameters. Hardcoded normalisation bounds from the initial dataset will not accommodate the new sensor’s value range.
44.5 Summary
Data normalization and scaling complete the data quality preprocessing pipeline:
- Min-Max Scaling: Transforms data to 0-1 range, ideal for neural networks and bounded outputs
- Z-Score Normalization: Centers data around mean with unit variance, best for clustering and SVM
- Robust Scaling: Uses median and IQR, resistant to outliers
- Log Transform: Compresses right-skewed data spanning orders of magnitude
- Complete Pipeline: Validation -> Cleaning -> Transformation, all implementable on edge devices
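The four methods in the summary can be sketched as minimal functions. Applying them to a window containing one stuck-sensor spike shows the behavior described above: min-max compresses all valid readings toward zero, while robust scaling leaves them usable. The sample data and quartile convention (median of each sorted half) are illustrative choices:

```python
import math

# The four scaling methods from the summary, as minimal sketches.

def min_max_scale(xs):
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def z_score_scale(xs):
    mean = sum(xs) / len(xs)
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))  # population std
    return [(x - mean) / std for x in xs]

def _median(xs):
    s = sorted(xs)
    n = len(s)
    return s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2

def robust_scale(xs):
    med = _median(xs)
    s = sorted(xs)
    q1 = _median(s[: len(s) // 2])       # median of lower half
    q3 = _median(s[-(len(s) // 2):])     # median of upper half
    return [(x - med) / (q3 - q1) for x in xs]

def log_scale(xs):
    return [math.log1p(x) for x in xs]   # log1p handles zeros safely

readings = [20, 21, 22, 21, 20, 22, 21, 999]  # one stuck-sensor outlier
```

On this sample, min-max maps every valid reading below 0.003 because the 999 spike becomes the max, whereas robust scaling keeps valid readings within about ±0.7 and pushes the spike to a huge score where any outlier check will catch it.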
Critical Design Principle: The complete “validate-clean-transform” pipeline should run at the edge. Catching data quality issues at the source costs 1% of fixing them in the cloud, and normalized data enables fair multi-sensor fusion.
For Kids: Meet the Sensor Squad!
Normalization is like making sure everyone on the team speaks the same language!
44.5.1 The Sensor Squad Adventure: The Unfair Contest
The Sensor Squad was having a “Who Detected the Most?” contest, but something was NOT fair!
Temperature Terry reported: “I measured 25 degrees today!” Light Lucy shouted: “Well, I measured FIFTY THOUSAND lux! I WIN!” Terry looked sad. “But… 25 degrees is actually really important too…”
Max the Microcontroller stepped in: “Wait! This contest is not fair! Terry’s numbers go from 0 to 50, but Lucy’s numbers go from 0 to 100,000. We need to put everyone on the SAME SCALE!”
So Max invented three ways to make it fair:
Method 1 – The Percentage Trick (Min-Max): “Turn everything into a percentage of its range!” Terry’s 25 out of 50 max = 50%. Lucy’s 50,000 out of 100,000 max = 50%. “We are actually EQUAL!” they both cheered.
Method 2 – The “How Unusual?” Trick (Z-Score): “How far is each reading from normal?” Terry’s normal is 20 degrees, so 25 is a little above normal. Lucy’s normal is 40,000 lux, so 50,000 is also a little above normal. Both scored about the same “unusualness.”
Method 3 – The Tough Cookie Trick (Robust): Pressure Pete had a WEIRD reading of 9999 that messed up the averages. “Use the MIDDLE value instead of the average!” said Bella the Battery. “The middle value ignores crazy numbers!”
Now ALL the sensors were on the same scale, and the contest was fair! “Normalization makes sure no sensor gets more attention just because it uses BIGGER numbers!” explained Max.
44.5.2 Key Words for Kids
| Word | What It Means |
|---|---|
| Normalization | Putting different measurements on the same scale so they can be compared fairly |
| Min-Max | Turning a number into a percentage between 0 and 1 |
| Z-Score | Measuring how far a number is from “normal” |
| Robust | A method that still works well even when some numbers are wrong |
| Scale | The range of numbers a sensor uses (like 0-50 or 0-100,000) |
Key Takeaway
Data normalization is essential before combining multi-sensor data or feeding it to machine learning models. Choose your method based on data characteristics: min-max for bounded outputs, Z-score for Gaussian assumptions, robust scaling when outliers are present. Always fit normalization parameters on training data only to avoid data leakage. Implement the complete validate-clean-transform pipeline at the edge to catch data quality issues at the source.
44.6 Concept Relationships
This chapter builds on validation and cleaning while introducing normalization as the final pipeline stage:
Prerequisites (Must understand first):
- Data Validation and Outlier Detection - Range checks and outlier detection must occur before normalization to avoid skewing scaling parameters with invalid data
- Missing Value Imputation and Noise Filtering - Cleaning must precede normalization; otherwise, gaps and noise corrupt min/max or mean/std calculations
Related Concepts (Enhance understanding):
- Multi-Sensor Data Fusion - Normalization enables fair comparison when fusing sensors with vastly different measurement ranges (temperature vs light intensity)
- Edge Data Acquisition - Edge devices normalize locally to reduce transmission bandwidth and prepare data for edge ML inference
Advanced Applications (Build on this):
- Modeling and Inferencing - Neural networks require normalized inputs ([0,1] or mean=0, std=1) for stable gradient descent
- Anomaly Detection - Z-score normalization makes distance-based anomaly detection work across features with different scales
Key Insight: Normalization is the final stage of the validate-clean-transform pipeline. Applying it before validation or cleaning causes incorrect scaling parameters (e.g., min-max uses outlier as max value, compressing all valid data into a tiny range).
44.7 What’s Next
| If you want to… | Read this |
|---|---|
| Understand imputation techniques that precede normalisation | Data Quality Imputation and Filtering |
| Study the full preprocessing pipeline | Data Quality and Preprocessing |
| Apply normalised data to ML model training | Modeling and Inferencing |
| Understand data validation before normalisation | Data Quality Validation |
| Return to the module overview | Big Data Overview |
See Also
Data Quality Series:
- Data Quality and Preprocessing - Overview and index
- Data Validation and Outlier Detection - Validation and outliers
- Missing Value Imputation and Noise Filtering - Handling gaps and noise
Practical Applications:
- Multi-Sensor Data Fusion - Combining preprocessed data
- Stream Processing - Real-time data pipelines
- Edge Compute Patterns - Distributed preprocessing architectures
Advanced Topics:
- Modeling and Inferencing - ML model deployment with quality data
- Anomaly Detection - Finding meaningful outliers in clean data