44  Normalization & Preprocessing

44.1 Learning Objectives

By the end of this chapter, you will be able to:

  • Apply Normalization Techniques: Implement min-max scaling, Z-score normalization, and robust scaling for multi-sensor data fusion
  • Compare Normalization Methods: Evaluate which scaling approach suits each downstream use case (neural networks, clustering, visualization)
  • Implement a Complete Pipeline: Build and test an end-to-end data quality system on an ESP32 microcontroller
  • Assess Data Quality Metrics: Calculate and interpret validation rates, outlier counts, and imputation statistics

In 60 Seconds

This hands-on lab guides you through implementing data normalization and standardization on real IoT sensor datasets, demonstrating concretely how improper scaling causes ML models to ignore low-magnitude sensors and how to prevent it. You will compare min-max normalization, Z-score standardization, and robust scaling on data with outliers, and see the impact on a simple classifier.

44.2 Prerequisites

Before diving into this chapter, you should be familiar with:

  • Basic descriptive statistics: mean, standard deviation, median, and quartiles
  • Reading Python classes and Arduino/C++ sketches, both of which appear in this chapter's examples

Imagine comparing apples and elephants. Temperature might range from -10 to 50 degrees, while light levels range from 0 to 100,000 lux. If you feed both to a machine learning model without normalizing, the model will think light is 2000x more important simply because the numbers are bigger!

Normalization puts all sensors on the same scale:

| Sensor      | Before Normalization | After Min-Max (0-1) |
|-------------|----------------------|---------------------|
| Temperature | 25C                  | 0.58                |
| Light       | 50,000 lux           | 0.50                |

Now both contribute equally based on their actual information content, not their arbitrary unit scales.
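
A minimal sketch of that scaling, assuming the ranges quoted above (temperature -10 to 50 degrees, light 0 to 100,000 lux):

```python
# Min-max scaling maps each sensor onto [0, 1] using its own range,
# so both contribute comparably regardless of units.
def min_max(value, lo, hi):
    return (value - lo) / (hi - lo)

temp_scaled = min_max(25, -10, 50)          # (25 + 10) / 60
light_scaled = min_max(50_000, 0, 100_000)  # 50000 / 100000

print(round(temp_scaled, 2), round(light_scaled, 2))  # 0.58 0.5
```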

When to use different methods:

| Method  | Use When                               | Output Range     |
|---------|----------------------------------------|------------------|
| Min-Max | Need bounded outputs (neural networks) | 0 to 1           |
| Z-Score | Gaussian data, using K-means/SVM/PCA   | Mean=0, StdDev=1 |
| Robust  | Many outliers you want to ignore       | Median-centered  |
| Log     | Data spans orders of magnitude         | Compressed scale |

Key question this chapter answers: “How do I prepare multi-sensor data so it can be combined and analyzed fairly?”

Minimum Viable Understanding: Data Normalization

Core Concept: Normalization transforms sensor readings to a common scale, enabling fair comparison and combination of data from sensors with vastly different measurement ranges.

Why It Matters: Without normalization, a light sensor reading 100,000 lux would dominate a temperature sensor reading 25C in any analysis, even though both carry equal information. Machine learning models and statistical methods assume comparable scales.

Key Takeaway: Use min-max scaling (0-1) for neural networks and bounded outputs; use Z-score normalization for clustering and algorithms assuming Gaussian distributions; use robust scaling when outliers are present and should be dampened.

44.3 Data Normalization and Scaling

  • ~10 min | Intermediate | P10.C09.U06

Key Concepts

  • Min-Max normalization: Scaling all values to the [0, 1] range by subtracting the minimum and dividing by the range; sensitive to outliers, which compress the majority of values into a narrow band.
  • Z-score standardization: Transforming values to zero mean and unit variance by subtracting the mean and dividing by the standard deviation; assumes an approximately Gaussian distribution.
  • Robust scaling: Scaling using the median and interquartile range instead of the mean and standard deviation, making it resistant to outliers; preferred for sensor data with occasional extreme spikes.
  • Feature scaling: The general process of transforming sensor channels to comparable magnitude ranges so that machine learning algorithms treat all features equally, regardless of their engineering units.
  • Normalization artifact: A distortion introduced by incorrect normalization, such as training-set statistics applied incorrectly to the test set, or the wrong normalization for the downstream algorithm.
  • Data leakage: Contamination of model evaluation with information from the test set, often caused by normalizing all data together before splitting into train/test, which leaks test statistics into the normalization parameters.
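
A small Python sketch of the fit-on-train-only discipline described in the last bullet; the readings are made-up values for illustration:

```python
# Avoiding data leakage: learn normalization parameters from the
# training split only, then apply those SAME parameters to the test split.
readings = [18.0, 22.0, 25.0, 19.0, 21.0, 30.0, 17.0, 23.0]  # illustrative

split = int(0.75 * len(readings))
train, test = readings[:split], readings[split:]

lo, hi = min(train), max(train)   # fitted on the training split only

train_scaled = [(x - lo) / (hi - lo) for x in train]
test_scaled = [(x - lo) / (hi - lo) for x in test]

# Test values can fall outside [0, 1]; that is expected and correct.
# Refitting on (or including) the test data would leak its statistics.
```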

44.3.1 Min-Max Scaling

Scales data to a fixed range (typically 0-1):

class MinMaxScaler:
    def __init__(self, feature_range=(0, 1)):
        self.min_val = None
        self.max_val = None
        self.feature_min, self.feature_max = feature_range

    def fit(self, data):
        """Learn min/max from training data"""
        self.min_val = min(data)
        self.max_val = max(data)

    def transform(self, value):
        """Scale a single value"""
        if self.max_val == self.min_val:
            return self.feature_min

        scaled = (value - self.min_val) / (self.max_val - self.min_val)
        return scaled * (self.feature_max - self.feature_min) + self.feature_min

    def inverse_transform(self, scaled_value):
        """Convert back to original scale"""
        original = (scaled_value - self.feature_min) / (self.feature_max - self.feature_min)
        return original * (self.max_val - self.min_val) + self.min_val
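
A quick self-contained check that `transform` and `inverse_transform` are exact inverses, restated here as plain functions so the snippet runs on its own (assumes the default `feature_range` of (0, 1)):

```python
# The scaler's transform / inverse_transform pair, restated standalone.
def transform(x, lo, hi):
    return (x - lo) / (hi - lo)

def inverse_transform(s, lo, hi):
    return s * (hi - lo) + lo

lo, hi = 18.0, 25.0   # the min/max a fit() call would learn
for x in [18.0, 22.0, 25.0]:
    # Scaling and unscaling must return the original value
    assert abs(inverse_transform(transform(x, lo, hi), lo, hi) - x) < 1e-9
```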

44.3.2 Z-Score Normalization (Standardization)

Centers data around mean with unit variance:

class ZScoreNormalizer:
    def __init__(self):
        self.mean = None
        self.std = None

    def fit(self, data):
        """Learn mean and std from training data"""
        import numpy as np
        self.mean = np.mean(data)
        self.std = np.std(data)

    def transform(self, value):
        """Normalize a single value"""
        if self.std == 0:
            return 0.0
        return (value - self.mean) / self.std

    def inverse_transform(self, normalized_value):
        """Convert back to original scale"""
        return normalized_value * self.std + self.mean
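
One payoff of standardization is unit-free outlier flagging: readings more than about 3 standard deviations from the mean are commonly treated as anomalous (the lab later in this chapter uses the same |z| > 3 rule). A standard-library sketch with illustrative data:

```python
import statistics

# Z-scores answer "how unusual is this reading?" independent of units.
data = [19.8, 20.1, 20.0, 19.9, 20.2] * 4 + [35.0]  # illustrative; 35.0 is a spike
mu = statistics.fmean(data)
sigma = statistics.pstdev(data)   # population std, matching np.std's default

outliers = [x for x in data if abs((x - mu) / sigma) > 3]
print(outliers)   # [35.0]
```

Note that with very few samples a single spike inflates sigma enough to mask itself; this is why the lab computes statistics over a 50-sample window rather than a handful of readings.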

Min-max and Z-score normalization transform sensor data to comparable scales for multi-sensor fusion and machine learning.

Min-max formula: \(x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}\)

Z-score formula: \(z = \frac{x - \mu}{\sigma}\)

Worked example: A temperature sensor collects 5 readings: [18C, 22C, 25C, 19C, 21C]. Normalize using both methods:

Min-max scaling:

  • \(x_{\min} = 18\), \(x_{\max} = 25\)
  • For 22C: \(x' = \frac{22 - 18}{25 - 18} = \frac{4}{7} = 0.571\)
  • For 25C: \(x' = \frac{25 - 18}{25 - 18} = \frac{7}{7} = 1.000\)

Z-score normalization:

  • \(\mu = 21\), \(\sigma = \sqrt{\frac{(18{-}21)^2 + (22{-}21)^2 + (25{-}21)^2 + (19{-}21)^2 + (21{-}21)^2}{5}} = \sqrt{6} \approx 2.449\)
  • For 22C: \(z = \frac{22 - 21}{2.449} = 0.408\)
  • For 25C: \(z = \frac{25 - 21}{2.449} = 1.633\)

Result: Min-max gives bounded [0,1] range, while Z-score preserves statistical distance (25C is 1.63 std deviations above mean).
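
The arithmetic above can be checked in a few lines (population standard deviation, dividing by n, as in the formula):

```python
import math

data = [18, 22, 25, 19, 21]
mu = sum(data) / len(data)                                       # 21.0
sigma = math.sqrt(sum((x - mu) ** 2 for x in data) / len(data))  # sqrt(6)

print(round((22 - 18) / (25 - 18), 3))   # min-max for 22C -> 0.571
print(round((22 - mu) / sigma, 3))       # z-score for 22C -> 0.408
```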

44.3.3 Robust Scaling

Uses median and IQR, robust to outliers:

class RobustScaler:
    def __init__(self):
        self.median = None
        self.iqr = None

    def fit(self, data):
        """Learn median and IQR from training data"""
        import numpy as np
        self.median = np.median(data)
        q1 = np.percentile(data, 25)
        q3 = np.percentile(data, 75)
        self.iqr = q3 - q1

    def transform(self, value):
        """Scale using median and IQR"""
        if self.iqr == 0:
            return 0.0
        return (value - self.median) / self.iqr

    def inverse_transform(self, scaled_value):
        """Convert back to original scale"""
        return scaled_value * self.iqr + self.median
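
A sketch comparing min-max and robust scaling on data with one spike, written with plain-Python versions of the formulas above so the snippet runs on its own (the readings are illustrative):

```python
import statistics

# One spike distorts min-max badly but barely moves robust scaling.
data = [19.5, 20.0, 20.5, 21.0] * 3 + [100.0]   # 100.0 is a spike

# Min-max: the spike defines the range, squeezing normal readings toward 0
lo, hi = min(data), max(data)
minmax = [(x - lo) / (hi - lo) for x in data]

# Robust: median and IQR come from the bulk of the data, not the spike
median = statistics.median(data)
q1, _, q3 = statistics.quantiles(data, n=4)
robust = [(x - median) / (q3 - q1) for x in data]

print(max(minmax[:-1]))   # all normal readings land below ~0.02
```

The normal readings keep their relative spread under robust scaling, while min-max crushes them into the bottom 2% of the output range.
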

Try It: Outlier Impact on Scaling Methods

See how a single outlier distorts Min-Max and Z-Score scaling while Robust Scaling remains stable. Drag the outlier value to see the effect in real time.

44.3.4 When to Use Each Method

| Method         | Use Case                                | Preserves                    | Sensitive To              |
|----------------|-----------------------------------------|------------------------------|---------------------------|
| Min-Max        | Neural networks, bounded outputs        | Distribution shape           | Outliers                  |
| Z-Score        | SVM, K-means, PCA, Gaussian assumptions | Relative distances           | Outliers (shift mean/std) |
| Robust Scaling | When outliers should not affect range   | Median-based scale           | Extremely sparse data     |
| Log Transform  | Right-skewed data (power, counts)       | Multiplicative relationships | Zero/negative values      |
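
One note on the log-transform row: `log1p` (the log of 1 + x) is a common choice for sensor counts because it tolerates zeros, though negative values still need separate handling. A minimal sketch:

```python
import math

# Log transform compresses data spanning orders of magnitude into a
# narrow band; log1p(0) = 0, so zero readings are safe.
lux = [0, 10, 1_000, 100_000]
compressed = [math.log1p(x) for x in lux]
# the 0..100000 range shrinks to roughly 0..11.5
```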

44.3.5 Interactive Normalization Calculator

Try different sensor values and see how each normalization method transforms them:

44.4 Data Quality Lab: ESP32 Wokwi Simulation

  • ~45 min | Advanced | P10.C09.LAB

44.4.1 Lab Overview

In this hands-on lab, you will implement a complete data quality preprocessing pipeline on an ESP32 microcontroller. The simulation demonstrates real-world techniques for handling sensor data problems including outliers, missing values, noise, and the need for normalization.

What You Will Learn:

  1. Sensor Data Validation: Implementing range checks and rate-of-change validation
  2. Outlier Detection: Using Z-score and IQR methods to identify anomalous readings
  3. Missing Value Handling: Forward-fill and interpolation techniques for gap handling
  4. Noise Filtering: Moving average, median filter, and exponential smoothing
  5. Data Normalization: Min-max scaling and Z-score normalization for multi-sensor fusion

Skills Practiced:

  • Real-time data processing on embedded systems
  • Statistical calculations with limited memory
  • Circular buffer implementations
  • Streaming algorithm design
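
The limited-memory statistics the lab relies on can be sketched in Python: keep only running sums, never the sample history. This mirrors the `StreamingStats` struct in the Arduino sketch below; a numerically sturdier variant would use Welford's algorithm.

```python
import math

class StreamingStats:
    """Mean/variance from running sums: O(1) memory, as on the ESP32."""
    def __init__(self):
        self.n = 0
        self.s = 0.0    # running sum
        self.sq = 0.0   # running sum of squares

    def add(self, x):
        self.n += 1
        self.s += x
        self.sq += x * x

    def mean(self):
        return self.s / self.n if self.n else 0.0

    def variance(self):
        # Sample variance (n - 1 denominator), matching the lab code
        if self.n < 2:
            return 0.0
        return (self.sq - self.s * self.s / self.n) / (self.n - 1)

    def std_dev(self):
        return math.sqrt(self.variance())
```

Feeding it the readings [18, 22, 25, 19, 21] gives mean 21 and sample variance 7.5 without storing a single sample.
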

44.4.2 Lab Components

| Technique        | Implementation           | Visual Indicator          |
|------------------|--------------------------|---------------------------|
| Range Validation | Physical bounds checking | Red LED on violation      |
| Z-Score Outliers | Rolling statistics       | Yellow LED on outlier     |
| Median Filter    | 5-sample sliding window  | Smoothed output in serial |
| Moving Average   | 10-sample window         | Trend visualization       |
| Normalization    | 0-1 scaling              | Percentage output         |

44.4.3 Wokwi Simulator

Use the embedded simulator below to build your data quality preprocessing system:

44.4.4 Circuit Setup

Connect the sensors and indicators to the ESP32:

| Component                | ESP32 Pin | Purpose                     |
|--------------------------|-----------|-----------------------------|
| Temperature Sensor (NTC) | GPIO 34   | Primary data source         |
| Light Sensor (LDR)       | GPIO 35   | Secondary data source       |
| Potentiometer            | GPIO 32   | Simulate sensor drift       |
| Red LED                  | GPIO 18   | Range violation indicator   |
| Yellow LED               | GPIO 19   | Outlier detection indicator |
| Green LED                | GPIO 21   | Valid data indicator        |
| Blue LED                 | GPIO 22   | Missing data indicator      |

Add this diagram.json configuration in Wokwi:

{
  "version": 1,
  "author": "IoT Class - Data Quality Lab",
  "editor": "wokwi",
  "parts": [
    { "type": "wokwi-esp32-devkit-v1", "id": "esp", "top": 0, "left": 0 },
    { "type": "wokwi-ntc-temperature-sensor", "id": "temp1", "top": -120, "left": 80 },
    { "type": "wokwi-photoresistor-sensor", "id": "ldr1", "top": -120, "left": 180 },
    { "type": "wokwi-potentiometer", "id": "pot1", "top": -120, "left": 280 },
    { "type": "wokwi-led", "id": "led_red", "top": 180, "left": 80, "attrs": { "color": "red" } },
    { "type": "wokwi-led", "id": "led_yellow", "top": 180, "left": 130, "attrs": { "color": "yellow" } },
    { "type": "wokwi-led", "id": "led_green", "top": 180, "left": 180, "attrs": { "color": "green" } },
    { "type": "wokwi-led", "id": "led_blue", "top": 180, "left": 230, "attrs": { "color": "blue" } },
    { "type": "wokwi-resistor", "id": "r1", "top": 230, "left": 80, "attrs": { "value": "220" } },
    { "type": "wokwi-resistor", "id": "r2", "top": 230, "left": 130, "attrs": { "value": "220" } },
    { "type": "wokwi-resistor", "id": "r3", "top": 230, "left": 180, "attrs": { "value": "220" } },
    { "type": "wokwi-resistor", "id": "r4", "top": 230, "left": 230, "attrs": { "value": "220" } }
  ],
  "connections": [
    ["esp:GND.1", "temp1:GND", "black", ["h0"]],
    ["esp:3V3", "temp1:VCC", "red", ["h0"]],
    ["esp:34", "temp1:OUT", "green", ["h0"]],
    ["esp:GND.1", "ldr1:GND", "black", ["h0"]],
    ["esp:3V3", "ldr1:VCC", "red", ["h0"]],
    ["esp:35", "ldr1:OUT", "orange", ["h0"]],
    ["esp:GND.1", "pot1:GND", "black", ["h0"]],
    ["esp:3V3", "pot1:VCC", "red", ["h0"]],
    ["esp:32", "pot1:SIG", "purple", ["h0"]],
    ["esp:18", "led_red:A", "red", ["h0"]],
    ["led_red:C", "r1:1", "black", ["h0"]],
    ["r1:2", "esp:GND.1", "black", ["h0"]],
    ["esp:19", "led_yellow:A", "yellow", ["h0"]],
    ["led_yellow:C", "r2:1", "black", ["h0"]],
    ["r2:2", "esp:GND.1", "black", ["h0"]],
    ["esp:21", "led_green:A", "green", ["h0"]],
    ["led_green:C", "r3:1", "black", ["h0"]],
    ["r3:2", "esp:GND.1", "black", ["h0"]],
    ["esp:22", "led_blue:A", "blue", ["h0"]],
    ["led_blue:C", "r4:1", "black", ["h0"]],
    ["r4:2", "esp:GND.1", "black", ["h0"]]
  ]
}

44.4.5 Complete Arduino Code

Copy this code into the Wokwi editor:

// ============================================================================
// DATA QUALITY LAB: Comprehensive IoT Data Preprocessing Pipeline
// ============================================================================
// Demonstrates: Validation, Outlier Detection, Missing Value Handling,
//               Noise Filtering, Normalization, and Data Scaling
// ============================================================================

#include <Arduino.h>
#include <math.h>

// ====================== PIN DEFINITIONS ======================
const int TEMP_PIN = 34;        // Temperature sensor (NTC)
const int LIGHT_PIN = 35;       // Light sensor (LDR)
const int DRIFT_PIN = 32;       // Potentiometer for drift simulation

const int LED_RED = 18;         // Range violation indicator
const int LED_YELLOW = 19;      // Outlier detection indicator
const int LED_GREEN = 21;       // Valid data indicator
const int LED_BLUE = 22;        // Missing data indicator

// ====================== SAMPLING PARAMETERS ======================
const int SAMPLE_INTERVAL_MS = 200;        // 5 Hz sampling rate
const int VALIDATION_WINDOW = 50;          // Samples for statistics
const int FILTER_WINDOW_MEDIAN = 5;        // Median filter window
const int FILTER_WINDOW_MA = 10;           // Moving average window
const float EXP_SMOOTHING_ALPHA = 0.3;     // Exponential smoothing factor

// ====================== VALIDATION THRESHOLDS ======================
// Temperature sensor valid range (Celsius after conversion)
const float TEMP_MIN_VALID = -10.0;
const float TEMP_MAX_VALID = 60.0;
const float TEMP_MAX_RATE = 2.0;           // Max 2C change per second

// Light sensor valid range (0-4095 ADC, 0-100000 lux mapped)
const float LIGHT_MIN_VALID = 0.0;
const float LIGHT_MAX_VALID = 100000.0;

// Outlier detection thresholds
const float ZSCORE_THRESHOLD = 3.0;        // Standard deviations
const float IQR_MULTIPLIER = 1.5;          // IQR outlier multiplier
const float MAD_THRESHOLD = 3.5;           // Modified Z-score threshold

// Missing data parameters
const int MAX_MISSING_SAMPLES = 10;        // Max consecutive missing allowed
const float MISSING_PROBABILITY = 0.05;    // Simulate 5% missing data

// ====================== DATA STRUCTURES ======================

// Circular buffer for streaming statistics
template<int SIZE>
struct CircularBuffer {
    float data[SIZE];
    int head;
    int count;

    CircularBuffer() : head(0), count(0) {
        for (int i = 0; i < SIZE; i++) data[i] = 0;
    }

    void push(float value) {
        data[head] = value;
        head = (head + 1) % SIZE;
        if (count < SIZE) count++;
    }

    float get(int index) const {
        // Get value at index (0 = oldest)
        int actual = (head - count + index + SIZE) % SIZE;
        return data[actual];
    }

    bool isFull() const { return count == SIZE; }
    int size() const { return count; }
};

// Statistics calculator for streaming data
struct StreamingStats {
    float sum;
    float sumSq;
    int count;
    float min_val;
    float max_val;

    StreamingStats() : sum(0), sumSq(0), count(0),
                       min_val(INFINITY), max_val(-INFINITY) {}

    void reset() {
        sum = sumSq = 0;
        count = 0;
        min_val = INFINITY;
        max_val = -INFINITY;
    }

    void add(float value) {
        sum += value;
        sumSq += value * value;
        count++;
        if (value < min_val) min_val = value;
        if (value > max_val) max_val = value;
    }

    float mean() const { return count > 0 ? sum / count : 0; }

    float variance() const {
        if (count < 2) return 0;
        return (sumSq - sum * sum / count) / (count - 1);
    }

    float stdDev() const { return sqrt(variance()); }
};

// Sensor reading with quality metadata
struct SensorReading {
    float rawValue;
    float cleanedValue;
    float normalizedValue;
    unsigned long timestamp;
    bool isValid;
    bool isOutlier;
    bool isMissing;
    bool isImputed;
    String qualityFlags;
};

// ====================== GLOBAL STATE ======================

// Circular buffers for different processing stages
CircularBuffer<VALIDATION_WINDOW> tempRawBuffer;
CircularBuffer<VALIDATION_WINDOW> tempCleanBuffer;
CircularBuffer<FILTER_WINDOW_MEDIAN> tempMedianBuffer;
CircularBuffer<FILTER_WINDOW_MA> tempMABuffer;

CircularBuffer<VALIDATION_WINDOW> lightRawBuffer;
CircularBuffer<VALIDATION_WINDOW> lightCleanBuffer;

// Streaming statistics
StreamingStats tempStats;
StreamingStats lightStats;

// For rate-of-change validation
float lastValidTemp = NAN;
unsigned long lastValidTempTime = 0;
float lastValidLight = NAN;
unsigned long lastValidLightTime = 0;

// For exponential smoothing
float expSmoothedTemp = NAN;
float expSmoothedLight = NAN;

// For missing data handling
int consecutiveMissingTemp = 0;
int consecutiveMissingLight = 0;
float lastImputedTemp = NAN;
float lastImputedLight = NAN;

// Normalization parameters (learned from data)
float tempMinSeen = INFINITY;
float tempMaxSeen = -INFINITY;
float lightMinSeen = INFINITY;
float lightMaxSeen = -INFINITY;

// Statistics counters
unsigned long totalSamples = 0;
unsigned long validSamples = 0;
unsigned long outlierSamples = 0;
unsigned long missingSamples = 0;
unsigned long imputedSamples = 0;
unsigned long rangeViolations = 0;
unsigned long rateViolations = 0;

// ====================== HELPER FUNCTIONS ======================

// Convert ADC reading to temperature (NTC thermistor approximation)
float adcToTemperature(int adcValue) {
    if (adcValue == 0) return -INFINITY;
    if (adcValue >= 4095) return INFINITY;

    // Simplified Steinhart-Hart approximation for 10K NTC
    float resistance = 10000.0 * (4095.0 / adcValue - 1.0);
    float steinhart = resistance / 10000.0;
    steinhart = log(steinhart);
    steinhart /= 3950.0;
    steinhart += 1.0 / (25.0 + 273.15);
    steinhart = 1.0 / steinhart;
    steinhart -= 273.15;

    return steinhart;
}

// Convert ADC reading to light level (LDR approximation)
float adcToLight(int adcValue) {
    // Map ADC to approximate lux (logarithmic)
    if (adcValue < 10) return 0;
    float lux = 100000.0 * pow((float)adcValue / 4095.0, 2);
    return lux;
}

// ====================== VALIDATION FUNCTIONS ======================

// Range validation - check if value is within physical bounds
bool validateRange(float value, float minValid, float maxValid, String& errorMsg) {
    if (isnan(value) || isinf(value)) {
        errorMsg = "NaN/Inf";
        return false;
    }
    if (value < minValid) {
        errorMsg = "Below min (" + String(minValid) + ")";
        return false;
    }
    if (value > maxValid) {
        errorMsg = "Above max (" + String(maxValid) + ")";
        return false;
    }
    errorMsg = "OK";
    return true;
}

// Rate-of-change validation - detect impossible jumps
bool validateRateOfChange(float currentValue, float lastValue,
                          unsigned long currentTime, unsigned long lastTime,
                          float maxRate, String& errorMsg) {
    if (isnan(lastValue) || lastTime == 0) {
        errorMsg = "First reading";
        return true;
    }

    float timeDelta = (currentTime - lastTime) / 1000.0;  // seconds
    if (timeDelta <= 0) {
        errorMsg = "Invalid timestamp";
        return false;
    }

    float rate = abs(currentValue - lastValue) / timeDelta;
    if (rate > maxRate) {
        errorMsg = "Rate " + String(rate, 2) + " exceeds max " + String(maxRate);
        return false;
    }

    errorMsg = "Rate OK (" + String(rate, 2) + ")";
    return true;
}

// ====================== OUTLIER DETECTION ======================

// Z-Score outlier detection
bool detectOutlierZScore(float value, float mean, float stdDev,
                         float threshold, float& zScore) {
    if (stdDev == 0) {
        zScore = 0;
        return false;
    }

    zScore = abs((value - mean) / stdDev);
    return zScore > threshold;
}

// IQR outlier detection (requires sorted buffer)
bool detectOutlierIQR(CircularBuffer<VALIDATION_WINDOW>& buffer, float value,
                      float multiplier, float& lowerBound, float& upperBound) {
    if (buffer.size() < 10) return false;

    // Copy to temporary array for sorting
    float sorted[VALIDATION_WINDOW];
    int n = buffer.size();
    for (int i = 0; i < n; i++) {
        sorted[i] = buffer.get(i);
    }

    // Simple bubble sort (OK for small buffers)
    for (int i = 0; i < n - 1; i++) {
        for (int j = 0; j < n - i - 1; j++) {
            if (sorted[j] > sorted[j + 1]) {
                float temp = sorted[j];
                sorted[j] = sorted[j + 1];
                sorted[j + 1] = temp;
            }
        }
    }

    // Calculate quartiles
    int q1Idx = n / 4;
    int q3Idx = 3 * n / 4;
    float q1 = sorted[q1Idx];
    float q3 = sorted[q3Idx];
    float iqr = q3 - q1;

    lowerBound = q1 - multiplier * iqr;
    upperBound = q3 + multiplier * iqr;

    return (value < lowerBound) || (value > upperBound);
}

// ====================== NOISE FILTERING ======================

// Moving average filter
float filterMovingAverage(CircularBuffer<FILTER_WINDOW_MA>& buffer, float newValue) {
    buffer.push(newValue);

    float sum = 0;
    for (int i = 0; i < buffer.size(); i++) {
        sum += buffer.get(i);
    }

    return sum / buffer.size();
}

// Median filter (excellent for spike removal)
float filterMedian(CircularBuffer<FILTER_WINDOW_MEDIAN>& buffer, float newValue) {
    buffer.push(newValue);

    // Copy and sort
    float sorted[FILTER_WINDOW_MEDIAN];
    int n = buffer.size();
    for (int i = 0; i < n; i++) {
        sorted[i] = buffer.get(i);
    }

    for (int i = 0; i < n - 1; i++) {
        for (int j = 0; j < n - i - 1; j++) {
            if (sorted[j] > sorted[j + 1]) {
                float temp = sorted[j];
                sorted[j] = sorted[j + 1];
                sorted[j + 1] = temp;
            }
        }
    }

    if (n % 2 == 0) {
        return (sorted[n/2 - 1] + sorted[n/2]) / 2.0;
    }
    return sorted[n / 2];
}

// Exponential smoothing filter
float filterExponentialSmoothing(float& smoothed, float newValue, float alpha) {
    if (isnan(smoothed)) {
        smoothed = newValue;
    } else {
        smoothed = alpha * newValue + (1 - alpha) * smoothed;
    }
    return smoothed;
}

// ====================== MISSING VALUE HANDLING ======================

// Simulate missing data (for demonstration)
bool simulateMissingData() {
    return random(1000) < (MISSING_PROBABILITY * 1000);
}

// Forward-fill imputation with limit
float imputeForwardFill(float lastValidValue, int& consecutiveMissing,
                        int maxMissing, bool& wasImputed) {
    consecutiveMissing++;

    if (consecutiveMissing > maxMissing || isnan(lastValidValue)) {
        wasImputed = false;
        return NAN;
    }

    wasImputed = true;
    return lastValidValue;
}

// ====================== NORMALIZATION ======================

// Min-Max scaling to 0-1 range
float normalizeMinMax(float value, float minSeen, float maxSeen) {
    if (maxSeen == minSeen) return 0.5;
    return (value - minSeen) / (maxSeen - minSeen);
}

// Z-Score normalization (standardization)
float normalizeZScore(float value, float mean, float stdDev) {
    if (stdDev == 0) return 0;
    return (value - mean) / stdDev;
}

// Update normalization parameters
void updateNormalizationParams(float value, float& minSeen, float& maxSeen) {
    if (value < minSeen) minSeen = value;
    if (value > maxSeen) maxSeen = value;
}

// ====================== MAIN PROCESSING FUNCTION ======================

SensorReading processTemperature(int adcValue, unsigned long timestamp) {
    SensorReading reading;
    reading.timestamp = timestamp;
    reading.isValid = true;
    reading.isOutlier = false;
    reading.isMissing = false;
    reading.isImputed = false;
    reading.qualityFlags = "";

    // Step 0: Check for simulated missing data
    if (simulateMissingData()) {
        reading.isMissing = true;
        missingSamples++;

        // Try forward-fill imputation
        bool wasImputed;
        float imputed = imputeForwardFill(lastValidTemp, consecutiveMissingTemp,
                                          MAX_MISSING_SAMPLES, wasImputed);

        if (wasImputed) {
            reading.rawValue = NAN;
            reading.cleanedValue = imputed;
            reading.isImputed = true;
            reading.qualityFlags += "IMPUTED_FFILL ";
            imputedSamples++;
            lastImputedTemp = imputed;
        } else {
            reading.rawValue = NAN;
            reading.cleanedValue = NAN;
            reading.normalizedValue = NAN;
            reading.qualityFlags += "MISSING_NO_IMPUTE ";
            return reading;
        }
    } else {
        consecutiveMissingTemp = 0;

        // Step 1: Convert ADC to temperature
        reading.rawValue = adcToTemperature(adcValue);

        // Step 2: Range validation
        String rangeError;
        if (!validateRange(reading.rawValue, TEMP_MIN_VALID, TEMP_MAX_VALID, rangeError)) {
            reading.isValid = false;
            reading.qualityFlags += "RANGE_VIOLATION(" + rangeError + ") ";
            rangeViolations++;
        }

        // Step 3: Rate-of-change validation
        String rateError;
        if (!validateRateOfChange(reading.rawValue, lastValidTemp,
                                  timestamp, lastValidTempTime,
                                  TEMP_MAX_RATE, rateError)) {
            reading.isValid = false;
            reading.qualityFlags += "RATE_VIOLATION(" + rateError + ") ";
            rateViolations++;
        }

        // Step 4: Outlier detection (only if range-valid)
        if (reading.isValid && tempStats.count > 20) {
            float zScore;
            if (detectOutlierZScore(reading.rawValue, tempStats.mean(),
                                    tempStats.stdDev(), ZSCORE_THRESHOLD, zScore)) {
                reading.isOutlier = true;
                reading.qualityFlags += "ZSCORE_OUTLIER(z=" + String(zScore, 2) + ") ";
                outlierSamples++;
            }

            float lowerBound, upperBound;
            if (detectOutlierIQR(tempRawBuffer, reading.rawValue,
                                 IQR_MULTIPLIER, lowerBound, upperBound)) {
                reading.isOutlier = true;
                reading.qualityFlags += "IQR_OUTLIER ";
            }
        }

        // Step 5: Apply noise filtering
        if (reading.isValid && !reading.isOutlier) {
            // Median filter first (removes spikes)
            float medianFiltered = filterMedian(tempMedianBuffer, reading.rawValue);

            // Then moving average (smooths remaining noise)
            float maFiltered = filterMovingAverage(tempMABuffer, medianFiltered);

            // Exponential smoothing for final output
            reading.cleanedValue = filterExponentialSmoothing(expSmoothedTemp,
                                                              maFiltered,
                                                              EXP_SMOOTHING_ALPHA);

            // Update statistics with valid data
            tempStats.add(reading.rawValue);
            tempRawBuffer.push(reading.rawValue);
            tempCleanBuffer.push(reading.cleanedValue);

            // Update last valid
            lastValidTemp = reading.rawValue;
            lastValidTempTime = timestamp;
            validSamples++;
        } else if (reading.isOutlier) {
            // Use median of recent values for outlier replacement
            reading.cleanedValue = filterMedian(tempMedianBuffer,
                                                tempMedianBuffer.get(tempMedianBuffer.size() - 1));
            reading.qualityFlags += "OUTLIER_REPLACED ";
        } else {
            // Range/rate violation - use last valid with flag
            if (!isnan(lastValidTemp)) {
                reading.cleanedValue = lastValidTemp;
                reading.qualityFlags += "USING_LAST_VALID ";
            } else {
                reading.cleanedValue = NAN;
            }
        }
    }

    // Step 6: Normalization
    if (!isnan(reading.cleanedValue)) {
        updateNormalizationParams(reading.cleanedValue, tempMinSeen, tempMaxSeen);
        reading.normalizedValue = normalizeMinMax(reading.cleanedValue,
                                                   tempMinSeen, tempMaxSeen);
    } else {
        reading.normalizedValue = NAN;
    }

    if (reading.qualityFlags == "") {
        reading.qualityFlags = "CLEAN";
    }

    totalSamples++;
    return reading;
}

SensorReading processLight(int adcValue, unsigned long timestamp) {
    SensorReading reading;
    reading.timestamp = timestamp;
    reading.isValid = true;
    reading.isOutlier = false;
    reading.isMissing = false;
    reading.isImputed = false;
    reading.qualityFlags = "";

    // Similar processing as temperature (abbreviated for space)
    if (simulateMissingData()) {
        reading.isMissing = true;
        missingSamples++;

        bool wasImputed;
        float imputed = imputeForwardFill(lastValidLight, consecutiveMissingLight,
                                          MAX_MISSING_SAMPLES, wasImputed);

        if (wasImputed) {
            reading.rawValue = NAN;
            reading.cleanedValue = imputed;
            reading.isImputed = true;
            reading.qualityFlags = "IMPUTED_FFILL";
            imputedSamples++;
        } else {
            reading.rawValue = NAN;
            reading.cleanedValue = NAN;
            reading.normalizedValue = NAN;
            reading.qualityFlags = "MISSING_NO_IMPUTE";
            return reading;
        }
    } else {
        consecutiveMissingLight = 0;
        reading.rawValue = adcToLight(adcValue);

        // Simplified processing for light sensor
        String rangeError;
        if (!validateRange(reading.rawValue, LIGHT_MIN_VALID, LIGHT_MAX_VALID, rangeError)) {
            reading.isValid = false;
            reading.qualityFlags = "RANGE_VIOLATION";
            rangeViolations++;
        }

        if (reading.isValid) {
            reading.cleanedValue = filterExponentialSmoothing(expSmoothedLight,
                                                              reading.rawValue,
                                                              EXP_SMOOTHING_ALPHA);
            lightStats.add(reading.rawValue);
            lightRawBuffer.push(reading.rawValue);
            lastValidLight = reading.rawValue;
            lastValidLightTime = timestamp;
            validSamples++;
        } else {
            // Fall back to the last valid reading (NAN if none seen yet)
            reading.cleanedValue = lastValidLight;
        }
    }

    // Normalization
    if (!isnan(reading.cleanedValue)) {
        updateNormalizationParams(reading.cleanedValue, lightMinSeen, lightMaxSeen);
        reading.normalizedValue = normalizeMinMax(reading.cleanedValue,
                                                   lightMinSeen, lightMaxSeen);
    } else {
        reading.normalizedValue = NAN;  // keep the field defined when no clean value exists
    }

    if (reading.qualityFlags == "") {
        reading.qualityFlags = "CLEAN";
    }

    totalSamples++;
    return reading;
}

// ====================== LED INDICATOR CONTROL ======================

void updateLEDs(const SensorReading& tempReading, const SensorReading& lightReading) {
    // Red LED: Range violation
    if (!tempReading.isValid || !lightReading.isValid) {
        digitalWrite(LED_RED, HIGH);
    } else {
        digitalWrite(LED_RED, LOW);
    }

    // Yellow LED: Outlier detected
    if (tempReading.isOutlier || lightReading.isOutlier) {
        digitalWrite(LED_YELLOW, HIGH);
    } else {
        digitalWrite(LED_YELLOW, LOW);
    }

    // Green LED: Clean valid data
    if (tempReading.qualityFlags == "CLEAN" && lightReading.qualityFlags == "CLEAN") {
        digitalWrite(LED_GREEN, HIGH);
    } else {
        digitalWrite(LED_GREEN, LOW);
    }

    // Blue LED: Missing/imputed data
    if (tempReading.isMissing || tempReading.isImputed ||
        lightReading.isMissing || lightReading.isImputed) {
        digitalWrite(LED_BLUE, HIGH);
    } else {
        digitalWrite(LED_BLUE, LOW);
    }
}

// ====================== SERIAL OUTPUT ======================

void printSensorReading(const char* sensorName, const SensorReading& reading) {
    Serial.print(sensorName);
    Serial.print(": Raw=");
    if (isnan(reading.rawValue)) {
        Serial.print("NaN");
    } else {
        Serial.print(reading.rawValue, 2);
    }

    Serial.print(", Clean=");
    if (isnan(reading.cleanedValue)) {
        Serial.print("NaN");
    } else {
        Serial.print(reading.cleanedValue, 2);
    }

    Serial.print(", Norm=");
    if (isnan(reading.normalizedValue)) {
        Serial.print("NaN");
    } else {
        Serial.print(reading.normalizedValue, 3);
    }

    Serial.print(" [");
    Serial.print(reading.qualityFlags);
    Serial.println("]");
}

void printStatistics() {
    Serial.println("\n========== DATA QUALITY STATISTICS ==========");
    Serial.print("Total Samples: ");
    Serial.println(totalSamples);
    Serial.print("Valid Samples: ");
    Serial.print(validSamples);
    Serial.print(" (");
    Serial.print(100.0 * validSamples / max(totalSamples, 1UL), 1);
    Serial.println("%)");
    Serial.print("Outliers Detected: ");
    Serial.print(outlierSamples);
    Serial.print(" (");
    Serial.print(100.0 * outlierSamples / max(totalSamples, 1UL), 1);
    Serial.println("%)");
    Serial.print("Missing Values: ");
    Serial.print(missingSamples);
    Serial.print(" (");
    Serial.print(100.0 * missingSamples / max(totalSamples, 1UL), 1);
    Serial.println("%)");
    Serial.print("Imputed Values: ");
    Serial.print(imputedSamples);
    Serial.println();
    Serial.print("Range Violations: ");
    Serial.println(rangeViolations);
    Serial.print("Rate Violations: ");
    Serial.println(rateViolations);

    Serial.println("\n--- Temperature Statistics ---");
    Serial.print("Mean: ");
    Serial.print(tempStats.mean(), 2);
    Serial.print(" C, StdDev: ");
    Serial.print(tempStats.stdDev(), 2);
    Serial.print(" C, Range: [");
    Serial.print(tempMinSeen, 1);
    Serial.print(", ");
    Serial.print(tempMaxSeen, 1);
    Serial.println("] C");

    Serial.println("\n--- Light Statistics ---");
    Serial.print("Mean: ");
    Serial.print(lightStats.mean(), 0);
    Serial.print(" lux, StdDev: ");
    Serial.print(lightStats.stdDev(), 0);
    Serial.print(" lux, Range: [");
    Serial.print(lightMinSeen, 0);
    Serial.print(", ");
    Serial.print(lightMaxSeen, 0);
    Serial.println("] lux");
    Serial.println("==============================================\n");
}

// ====================== SETUP AND LOOP ======================

void setup() {
    Serial.begin(115200);
    delay(1000);

    // Initialize pins
    pinMode(TEMP_PIN, INPUT);
    pinMode(LIGHT_PIN, INPUT);
    pinMode(DRIFT_PIN, INPUT);

    pinMode(LED_RED, OUTPUT);
    pinMode(LED_YELLOW, OUTPUT);
    pinMode(LED_GREEN, OUTPUT);
    pinMode(LED_BLUE, OUTPUT);

    // All LEDs off initially
    digitalWrite(LED_RED, LOW);
    digitalWrite(LED_YELLOW, LOW);
    digitalWrite(LED_GREEN, LOW);
    digitalWrite(LED_BLUE, LOW);

    // Seed random for missing data simulation
    randomSeed(analogRead(0));

    Serial.println("============================================");
    Serial.println("  DATA QUALITY LAB: IoT Preprocessing Demo  ");
    Serial.println("============================================");
    Serial.println("Features demonstrated:");
    Serial.println("  - Range validation (physical bounds)");
    Serial.println("  - Rate-of-change validation");
    Serial.println("  - Z-Score outlier detection");
    Serial.println("  - IQR outlier detection");
    Serial.println("  - Median filter (spike removal)");
    Serial.println("  - Moving average filter (smoothing)");
    Serial.println("  - Exponential smoothing");
    Serial.println("  - Missing value imputation (forward-fill)");
    Serial.println("  - Min-Max normalization (0-1 scaling)");
    Serial.println("============================================");
    Serial.println("LED Indicators:");
    Serial.println("  RED: Range/Rate violation");
    Serial.println("  YELLOW: Outlier detected");
    Serial.println("  GREEN: Clean valid data");
    Serial.println("  BLUE: Missing/Imputed data");
    Serial.println("============================================\n");

    Serial.println("Starting data collection...\n");
}

unsigned long lastSampleTime = 0;
unsigned long lastStatsTime = 0;
const unsigned long STATS_INTERVAL_MS = 10000;  // Print stats every 10 seconds

void loop() {
    unsigned long currentTime = millis();

    // Sample at defined interval
    if (currentTime - lastSampleTime >= SAMPLE_INTERVAL_MS) {
        lastSampleTime = currentTime;

        // Read sensors
        int tempADC = analogRead(TEMP_PIN);
        int lightADC = analogRead(LIGHT_PIN);
        int driftADC = analogRead(DRIFT_PIN);

        // Add simulated drift from potentiometer (optional)
        float driftFactor = (driftADC - 2048) / 2048.0;  // -1 to +1
        tempADC = constrain(tempADC + (int)(driftFactor * 500), 0, 4095);

        // Process readings through data quality pipeline
        SensorReading tempReading = processTemperature(tempADC, currentTime);
        SensorReading lightReading = processLight(lightADC, currentTime);

        // Update LED indicators
        updateLEDs(tempReading, lightReading);

        // Print readings
        printSensorReading("TEMP", tempReading);
        printSensorReading("LIGHT", lightReading);
        Serial.println();
    }

    // Print statistics periodically
    if (currentTime - lastStatsTime >= STATS_INTERVAL_MS) {
        lastStatsTime = currentTime;
        printStatistics();
    }
}

44.4.6 Step-by-Step Instructions

44.4.6.1 Step 1: Set Up the Simulator

  1. Open the Wokwi simulator embedded above (or visit wokwi.com)
  2. Create a new ESP32 project
  3. Click the diagram.json tab and paste the circuit configuration
  4. Replace the default code with the complete Arduino code above

44.4.6.2 Step 2: Run and Observe Validation

  1. Click the Play button to start the simulation
  2. Open the Serial Monitor to see the data processing output
  3. Observe the quality flags showing validation status for each reading
  4. Watch for “CLEAN” flags indicating data passed all quality checks

44.4.6.3 Step 3: Trigger Range Violations

  1. Click the NTC temperature sensor in the simulator
  2. Drag the slider to extreme values (very hot or very cold)
  3. Watch the RED LED turn on when values exceed physical bounds
  4. Note the “RANGE_VIOLATION” flag in the serial output
  5. Observe how the system uses the last valid value when current is invalid

44.4.6.4 Step 4: Observe Outlier Detection

  1. Make sudden temperature changes by quickly dragging the sensor slider
  2. Watch the YELLOW LED blink when outliers are detected
  3. See the “ZSCORE_OUTLIER” and “IQR_OUTLIER” flags in output
  4. Notice how outliers are replaced with median values

44.4.6.5 Step 5: Observe Missing Data Handling

  1. The code simulates 5% random missing data
  2. Watch the BLUE LED flash when data is missing or imputed
  3. See “IMPUTED_FFILL” flags showing forward-fill imputation
  4. Note “MISSING_NO_IMPUTE” when too many consecutive values are missing

44.4.6.6 Step 6: Experiment with Filtering

  1. Observe the Raw vs Clean values in the serial output
  2. Notice how Clean values are smoother due to median + moving average filters
  3. Compare the normalized values (0-1 range) for multi-sensor comparison

44.4.6.7 Step 7: Analyze Statistics

  1. Wait for the statistics report (prints every 10 seconds)
  2. Review the data quality percentages: valid, outliers, missing
  3. Examine the sensor statistics: mean, standard deviation, range
  4. Consider how these metrics would inform production monitoring

44.4.7 Challenge Exercises

Difficulty: Intermediate

Task: The code defines a MAD_THRESHOLD constant but does not implement MAD outlier detection. Implement a detectOutlierMAD() function and integrate it into the processTemperature() pipeline as the primary outlier detection method.

Hints:

  • MAD is more robust than Z-score for non-Gaussian data
  • You will need to implement a detectOutlierMAD() function (modified Z-score = 0.6745 * (x - median) / MAD)
  • Replace or complement the Z-score check with MAD

Expected Outcome: MAD should detect outliers even when extreme values skew the mean and standard deviation.
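
As a starting point, the modified Z-score logic looks like this in Python; translating it into a C++ detectOutlierMAD() over the lab's sample buffer is the exercise (the 3.5 threshold is a common convention, not mandated by the lab code):

```python
import statistics

def detect_outlier_mad(window, x, threshold=3.5):
    """Flag x as an outlier via the median absolute deviation of window."""
    med = statistics.median(window)
    mad = statistics.median([abs(v - med) for v in window])
    if mad == 0:
        return False  # zero spread: modified Z-score is undefined
    modified_z = 0.6745 * (x - med) / mad
    return abs(modified_z) > threshold

window = [20.0, 20.5, 21.0, 20.2, 20.8]
print(detect_outlier_mad(window, 35.0))  # True: far from the median
print(detect_outlier_mad(window, 20.6))  # False
```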

Difficulty: Intermediate

Task: Currently, forward-fill uses a fixed limit (MAX_MISSING_SAMPLES). Implement an exponential decay on the confidence of imputed values.

Requirements:

  • Add a confidence field to SensorReading
  • Reduce confidence by 10% for each consecutive imputed value
  • Stop imputing when confidence drops below 50%
  • Display confidence in serial output

Expected Outcome: Imputed values should be flagged with decreasing confidence as gaps grow longer.
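
One possible shape for the confidence rule, using a multiplicative 10% decay per consecutive gap (the function name and tuple return are hypothetical; the ESP32 version would store confidence in SensorReading):

```python
def forward_fill_with_confidence(last_valid, consecutive_missing,
                                 decay=0.10, floor=0.50):
    """Return (imputed_value, confidence), or (None, 0.0) once confidence is too low."""
    confidence = (1.0 - decay) ** consecutive_missing
    if confidence < floor:
        return None, 0.0   # stop imputing; report the gap instead
    return last_valid, confidence

val, conf = forward_fill_with_confidence(22.0, 3)
print(val, round(conf, 3))                     # 22.0 0.729
print(forward_fill_with_confidence(22.0, 7))   # (None, 0.0) -- 0.9**7 < 0.5
```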

Difficulty: Advanced

Task: Add plausibility checking between temperature and light sensors. If it is very bright (high light), temperature should be reasonable for daytime.

Requirements:

  • If light > 50000 lux and temperature < 10C, flag as suspicious
  • If light < 100 lux and temperature > 35C (outdoors), flag as suspicious
  • Add a new LED or serial indicator for cross-sensor anomalies

Expected Outcome: The system should detect when sensor readings are physically inconsistent with each other.
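
A sketch of the plausibility rule in Python (thresholds taken from the requirements; the flag strings are hypothetical):

```python
def cross_sensor_check(temp_c, light_lux):
    """Flag physically inconsistent temperature/light combinations."""
    if light_lux > 50000 and temp_c < 10:
        return "SUSPICIOUS_BRIGHT_BUT_COLD"
    if light_lux < 100 and temp_c > 35:
        return "SUSPICIOUS_DARK_BUT_HOT"
    return "CONSISTENT"

print(cross_sensor_check(5, 80000))   # SUSPICIOUS_BRIGHT_BUT_COLD
print(cross_sensor_check(25, 30000))  # CONSISTENT
```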

Difficulty: Advanced

Task: Replace the exponential smoothing filter with a simple Kalman filter for temperature.

Requirements:

  • Implement 1D Kalman filter with process noise and measurement noise
  • Estimate the Kalman gain dynamically
  • Output both the filtered value and the uncertainty estimate

Learning: Kalman filters provide optimal estimation when process and measurement noise characteristics are known.
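
A minimal 1D sketch assuming a constant-value process model; the noise variances q and r must be tuned for the actual sensor:

```python
class Kalman1D:
    """1D Kalman filter with a constant-value process model."""
    def __init__(self, q, r, x0=0.0, p0=1.0):
        self.q, self.r = q, r    # process / measurement noise variances
        self.x, self.p = x0, p0  # state estimate and its variance

    def update(self, z):
        self.p += self.q                 # predict: uncertainty grows
        k = self.p / (self.p + self.r)   # Kalman gain, computed dynamically
        self.x += k * (z - self.x)       # correct toward the measurement
        self.p *= (1.0 - k)              # uncertainty shrinks after the update
        return self.x, self.p            # filtered value + uncertainty estimate

kf = Kalman1D(q=0.01, r=1.0, x0=20.0)
for z in [21.0, 21.0, 21.0]:
    x, p = kf.update(z)
```

Note how the gain k falls as p shrinks, so repeated consistent measurements are trusted less and less individually while the estimate converges.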

44.4.8 Expected Outcomes

After completing this lab, you should be able to:

  1. Understand validation trade-offs: Strict validation catches more errors but may reject valid extreme readings
  2. Choose appropriate outlier methods: Z-score for Gaussian data, IQR/MAD for robust detection
  3. Select imputation strategies: Forward-fill for slow-changing, interpolation for trending data
  4. Apply noise filters correctly: Median for spikes, moving average for steady-state noise
  5. Normalize for fusion: Understand when to use min-max vs Z-score normalization

Quality Metrics to Observe:

  • Valid sample rate should be >90% under normal conditions
  • Outlier rate should be <5% for stable sensors
  • Imputed values should maintain temporal continuity

Try It: Exponential Smoothing Explorer

Adjust the smoothing factor (alpha) and noise level to see how exponential smoothing filters noisy sensor data. A low alpha trusts the history more (smoother), while a high alpha trusts new readings more (more responsive).
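
The filter behind the explorer is a single weighted average; a minimal sketch:

```python
def exp_smooth(prev, new, alpha):
    """alpha near 0 trusts history (smoother); alpha near 1 trusts the new reading."""
    return alpha * new + (1.0 - alpha) * prev

smoothed = 20.0
for reading in [30.0, 30.0, 30.0]:   # step change in the input
    smoothed = exp_smooth(smoothed, reading, alpha=0.2)
print(round(smoothed, 2))            # 24.88: slowly approaching 30
```

With alpha=0.2 the output covers only about half of a step change after three samples, which is exactly the smoothing-vs-responsiveness trade-off the explorer demonstrates.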

Scenario: You’re building an anomaly detection system for a server room with 3 sensors: temperature (15-35°C), CO2 (400-5000 ppm), and humidity (30-70%). A neural network needs all inputs on the same scale to detect abnormal conditions.

Given:

  • Temperature sensor: range 15-35°C, current reading 28°C
  • CO2 sensor: range 400-5000 ppm, current reading 1200 ppm
  • Humidity sensor: range 30-70%, current reading 55%
  • Neural network requires inputs in 0-1 range

Question: Normalize these readings and explain why proper normalization matters for the neural network.

Solution:

Step 1: Calculate min-max normalization for each sensor

Temperature normalization:

normalized_temp = (28 - 15) / (35 - 15)
               = 13 / 20
               = 0.65

CO2 normalization:

normalized_co2 = (1200 - 400) / (5000 - 400)
              = 800 / 4600
              = 0.174

Humidity normalization:

normalized_humidity = (55 - 30) / (70 - 30)
                   = 25 / 40
                   = 0.625
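
The three calculations above can be checked with a short Python sketch:

```python
def normalize_min_max(value, lo, hi):
    """Scale a reading into [0, 1] given its sensor range."""
    return (value - lo) / (hi - lo)

temp = normalize_min_max(28, 15, 35)       # 0.65
co2 = normalize_min_max(1200, 400, 5000)   # ~0.174
hum = normalize_min_max(55, 30, 70)        # 0.625
print([round(v, 3) for v in (temp, co2, hum)])  # [0.65, 0.174, 0.625]
```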

Step 2: Show the impact WITHOUT normalization

If we fed raw values to the neural network:

  • Input vector: [28, 1200, 55]
  • The CO2 value is roughly 43x larger than the temperature value
  • Neural network gradient updates would be dominated by CO2

Example gradient calculation (simplified):

Cost function: J = (pred - actual)²
Gradient w.r.t. weight: ∂J/∂w = 2 × (pred - actual) × input_value

For temperature: gradient ∝ 28
For CO2:         gradient ∝ 1200  (43x larger!)
For humidity:    gradient ∝ 55

The network would learn to minimize CO2 error while ignoring temperature and humidity!
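
A toy calculation of the simplified gradient above (the error value 0.5 is an arbitrary placeholder, not from the scenario):

```python
inputs_raw = [28, 1200, 55]   # [temperature, CO2, humidity]
error = 0.5                   # hypothetical (pred - actual)

# dJ/dw = 2 * (pred - actual) * input_value
grads = [2 * error * x for x in inputs_raw]
print(grads)                  # [28.0, 1200.0, 55.0]
print(grads[1] / grads[0])    # ~42.9: the CO2 weight dominates the update
```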

Step 3: Verify normalized inputs have equal influence

Normalized input vector: [0.65, 0.174, 0.625]

Now all gradients are on similar scales:

For temperature: gradient ∝ 0.65
For CO2:         gradient ∝ 0.174
For humidity:    gradient ∝ 0.625

Each sensor contributes equally to gradient updates during training.

Step 4: Calculate percentage of each sensor’s range

This helps interpret the normalized values:

  • Temperature: 65% of its range (moderately warm)
  • CO2: 17.4% of its range (relatively low, good ventilation)
  • Humidity: 62.5% of its range (comfortable level)

Step 5: Detect an anomaly scenario

Normal condition: [0.65, 0.174, 0.625]
Anomaly (AC failure): [0.95, 0.350, 0.825]

  • Temperature: 95% of range = 34°C (very hot!)
  • CO2: 35% of range = 2010 ppm (rising, poor ventilation)
  • Humidity: 82.5% of range = 63% (uncomfortable)

The neural network trained on normalized data can now detect this pattern as anomalous, because all three sensors contributed equally during training.

Key Insight: Without normalization, the neural network’s loss function is dominated by the largest-magnitude features. Min-max scaling ensures each sensor contributes proportionally to its information content, not its arbitrary measurement scale. This is why normalization is mandatory for neural networks, SVM, K-means, and any algorithm that uses distance metrics or gradient descent.

44.4.9 Try It: Multi-Sensor Fusion Calculator

Adjust the sensor readings below to see how min-max normalization brings different scales into alignment:

Choose the appropriate normalization method based on your data characteristics and downstream algorithm:

| Data Characteristic | Normalization Method | Output Range | Best For | Avoid When |
|---|---|---|---|---|
| Bounded range known (temperature, humidity) | Min-Max Scaling | 0 to 1 | Neural networks, image processing, bounded outputs | Outliers present (they compress valid range) |
| Outliers expected (sensor noise, network latency) | Robust Scaling | Median-centered | K-means, SVM, any distance-based method | Need exact 0-1 bounds |
| Gaussian distribution (many natural phenomena) | Z-Score Normalization | Mean=0, Std=1 | Clustering, PCA, algorithms assuming normal distribution | Binary features (0/1) |
| Exponential/power-law (network traffic, wealth) | Log Transform then Z-Score | Variable | Right-skewed data, multiplicative relationships | Zero or negative values present |
| Mixed data types (some outliers + some bounded) | Hybrid: Robust for outlier features, Min-Max for clean | Variable per feature | Real-world messy datasets | Need uniform scaling method |
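
A quick numeric illustration of the "Avoid When" column: a single spike ruins min-max scaling, while robust scaling shrugs it off:

```python
import statistics

readings = [20, 21, 22, 21, 20, 22, 21, 500]   # one sensor spike

# Min-max: the spike becomes the max, squashing every valid reading toward 0
lo, hi = min(readings), max(readings)
minmax = [(x - lo) / (hi - lo) for x in readings]
print(max(minmax[:-1]))      # ~0.004: the valid range has collapsed

# Robust: median and IQR ignore the spike, so valid readings keep their spread
med = statistics.median(readings)
q1, _, q3 = statistics.quantiles(readings, n=4)
robust = [(x - med) / (q3 - q1) for x in readings]
print(round(robust[2], 2))   # ~0.57: valid readings remain distinguishable
```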

Decision Tree:

  1. Are there extreme outliers (>5% of values beyond 3σ)?
    • YES → Use Robust Scaling (median + IQR)
    • NO → Continue to step 2
  2. Is your algorithm neural-network-based?
    • YES → Use Min-Max Scaling (0-1) for activation function compatibility
    • NO → Continue to step 3
  3. Does your algorithm assume Gaussian distribution?
    • YES (PCA, LDA) → Use Z-Score Normalization
    • NO → Continue to step 4
  4. Is your data heavily right-skewed (long tail)?
    • YES → Log Transform + Z-Score
    • NO → Default to Min-Max Scaling

Example Python Implementation:

# Selector sketch; the scaler classes come from scikit-learn
import numpy as np
from sklearn.preprocessing import (FunctionTransformer, MinMaxScaler,
                                   RobustScaler, StandardScaler)

def select_normalizer(data_characteristics):
    """
    Select appropriate normalization based on data characteristics.
    Returns: (normalizer_class, constructor_parameters)
    """
    has_outliers = data_characteristics['outlier_rate'] > 0.05
    is_neural_network = data_characteristics['model_type'] == 'neural_network'
    is_gaussian = data_characteristics['distribution'] == 'normal'
    is_skewed = data_characteristics['skewness'] > 2.0

    if has_outliers:
        return (RobustScaler, {})
    elif is_neural_network:
        return (MinMaxScaler, {'feature_range': (0, 1)})
    elif is_gaussian:
        return (StandardScaler, {})  # StandardScaler performs Z-score normalization
    elif is_skewed:
        # log-compress first; follow with StandardScaler in a pipeline
        return (FunctionTransformer, {'func': np.log1p})
    else:
        return (MinMaxScaler, {'feature_range': (0, 1)})

# Usage example
characteristics = {
    'outlier_rate': 0.08,  # 8% outliers
    'model_type': 'kmeans',
    'distribution': 'unknown',
    'skewness': 1.2
}

normalizer_class, params = select_normalizer(characteristics)
# Returns: RobustScaler (because outlier_rate > 0.05)

Warning Signs of Wrong Normalization:

  • Neural network accuracy plateaus at 60% → Inputs not normalized
  • Clustering groups all high-magnitude features together → Need Z-score instead of raw
  • Model ignores certain sensors → Their raw ranges are too small (need min-max)
  • Training loss explodes after first epoch → Gradients too large (need normalization)

Common Mistake: Normalizing Before Train/Test Split

The Mistake: Calculating normalization parameters (min, max, mean, std) on the entire dataset before splitting into train and test sets. This causes data leakage, where the test set’s statistics influence the training process, leading to overly optimistic performance estimates.

Why It Happens: The normalization step feels like “data preparation” rather than “model training,” so developers apply it before the split. Many tutorials skip this detail. Scikit-learn’s fit_transform() makes it easy to accidentally normalize everything at once.

Example of the Problem:

# WRONG: Normalize first, split second
data = load_sensor_data()  # 10,000 samples
normalized = MinMaxScaler().fit_transform(data)  # Uses ALL data stats!
train, test = train_test_split(normalized, test_size=0.2)

# The test set's min/max values influenced the scaling parameters!
# Model evaluation is now too optimistic.

Why This Is Wrong:

  1. Data Leakage: Test set statistics “leak” into training through normalization parameters
  2. Overfitting: Model appears to generalize better than it actually does
  3. Production Failure: Real-world data has different min/max than training data

Real-World Example:

Imagine temperature sensor data:

  • Training period (winter): 15-25°C
  • Test period (summer): 20-35°C

WRONG approach:

# Calculate on ALL data (winter + summer)
overall_min = 15°C, overall_max = 35°C

# Normalize training data using overall stats
train_normalized = (train - 15) / (35 - 15)
# Result: Training temps (15-25°C) map to 0.0-0.5

# Normalize test data using SAME overall stats
test_normalized = (test - 15) / (35 - 15)
# Result: Test temps (20-35°C) map to 0.25-1.0

# Model trained on 0.0-0.5 range, tested on 0.25-1.0 range
# Test performance appears good because model "saw" summer data during normalization!

The Fix: Fit normalization ONLY on training data, then apply to test:

# CORRECT: Split first, normalize second
data = load_sensor_data()
train, test = train_test_split(data, test_size=0.2)  # Split FIRST

# Fit normalization on training data ONLY
scaler = MinMaxScaler()
scaler.fit(train)  # Learns min=15, max=25 from training (winter)

# Apply fitted scaler to both train and test
train_normalized = scaler.transform(train)
test_normalized = scaler.transform(test)  # Summer data (20-35°C) may exceed [0,1]!

# This is CORRECT - test data reflecting real-world distribution

Correct Pipeline Order:

  1. Split data into train/validation/test sets (e.g., 70/15/15)
  2. Fit normalization parameters on TRAINING data only
  3. Transform train, validation, and test sets using those parameters
  4. Train model on normalized training data
  5. Evaluate on normalized validation/test data

Code Template:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression  # any estimator works here

# 1. Split FIRST (X, y are your feature matrix and labels)
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5)

# 2. Fit normalizer on TRAINING data only
scaler = MinMaxScaler()
scaler.fit(X_train)  # Only training data!

# 3. Transform ALL sets using training parameters
X_train_norm = scaler.transform(X_train)
X_val_norm = scaler.transform(X_val)
X_test_norm = scaler.transform(X_test)

# 4. Train model
model = LogisticRegression()
model.fit(X_train_norm, y_train)

# 5. Evaluate (no data leakage!)
val_score = model.score(X_val_norm, y_val)
test_score = model.score(X_test_norm, y_test)

Warning Signs of This Mistake:

  • Test accuracy is suspiciously high (>95%) on first try
  • Model performance degrades significantly in production
  • Test set values occasionally exceed [0, 1] after normalization (this is actually GOOD - means no leakage!)
  • Reviewer asks “when did you fit the scaler?” and you can’t answer clearly

Real-World Impact: A smart building energy prediction model achieved 96% test accuracy during development but only 73% accuracy in production. Root cause: normalization was fit on the entire year’s data (including test set), so the model “saw” summer peak loads during training. In production, the next summer’s peak loads were outside the normalized range, causing poor predictions.

Try It: Data Leakage Visualizer

Explore how normalizing before vs. after the train/test split changes the scaling parameters. Adjust the test set distribution to see how leakage distorts the training normalization.
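
The winter/summer example reduces to a few lines; note how correct (train-only) fitting lets test values exceed 1.0:

```python
winter = [15, 18, 20, 22, 25]    # training period, degrees C
summer = [20, 25, 30, 33, 35]    # test period

# Leaky: parameters computed on ALL data
leaky_lo, leaky_hi = min(winter + summer), max(winter + summer)   # 15, 35

# Correct: parameters from the training set only
lo, hi = min(winter), max(winter)                                 # 15, 25
test_norm = [(x - lo) / (hi - lo) for x in summer]
print(max(test_norm))   # 2.0 -- outside [0, 1], as expected: no leakage
```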

Common Pitfalls

Computing Min-Max bounds or Z-score mean/std on all data before splitting leaks test set information into training. Always fit normalisation parameters on the training set only and apply them to both train and test sets.

Binary status signals (0/1), count data, and continuous physical measurements require different normalisation strategies. Normalise each channel according to its distribution, not uniformly.

In regression tasks, some implementations accidentally normalise the prediction target along with features, causing the model to predict normalised units that require inverse transformation. Keep target variables in their original units unless the algorithm specifically requires otherwise.

Adding a new sensor channel to an existing normalised dataset requires recomputing normalisation parameters. Hardcoded normalisation bounds from the initial dataset will not accommodate the new sensor’s value range.
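
A sketch of the per-channel idea from the second pitfall above (channel names and data are hypothetical):

```python
import statistics

def scale_minmax(xs):
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def scale_zscore(xs):
    mu, sd = statistics.mean(xs), statistics.stdev(xs)
    return [(x - mu) / sd for x in xs]

# Each channel gets a strategy matching its distribution
channels = {
    "door_open":   ([0, 1, 0, 1, 1],             lambda xs: xs),  # binary: leave as-is
    "temperature": ([20.1, 21.3, 19.8, 22.0],    scale_minmax),   # bounded physical
    "packet_rate": ([120.0, 95.0, 130.0, 110.0], scale_zscore),   # unbounded count-like
}
scaled = {name: fn(xs) for name, (xs, fn) in channels.items()}
```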

44.5 Summary

Data normalization and scaling complete the data quality preprocessing pipeline:

  • Min-Max Scaling: Transforms data to 0-1 range, ideal for neural networks and bounded outputs
  • Z-Score Normalization: Centers data around mean with unit variance, best for clustering and SVM
  • Robust Scaling: Uses median and IQR, resistant to outliers
  • Log Transform: Compresses right-skewed data spanning orders of magnitude
  • Complete Pipeline: Validation -> Cleaning -> Transformation, all implementable on edge devices

Critical Design Principle: The complete “validate-clean-transform” pipeline should run at the edge. Catching data quality issues at the source costs 1% of fixing them in the cloud, and normalized data enables fair multi-sensor fusion.

Normalization is like making sure everyone on the team speaks the same language!

44.5.1 The Sensor Squad Adventure: The Unfair Contest

The Sensor Squad was having a “Who Detected the Most?” contest, but something was NOT fair!

Temperature Terry reported: “I measured 25 degrees today!” Light Lucy shouted: “Well, I measured FIFTY THOUSAND lux! I WIN!” Terry looked sad. “But… 25 degrees is actually really important too…”

Max the Microcontroller stepped in: “Wait! This contest is not fair! Terry’s numbers go from 0 to 50, but Lucy’s numbers go from 0 to 100,000. We need to put everyone on the SAME SCALE!”

So Max invented three ways to make it fair:

Method 1 – The Percentage Trick (Min-Max): “Turn everything into a percentage of its range!” Terry’s 25 out of 50 max = 50%. Lucy’s 50,000 out of 100,000 max = 50%. “We are actually EQUAL!” they both cheered.

Method 2 – The “How Unusual?” Trick (Z-Score): “How far is each reading from normal?” Terry’s normal is 20 degrees, so 25 is a little above normal. Lucy’s normal is 40,000 lux, so 50,000 is also a little above normal. Both scored about the same “unusualness.”

Method 3 – The Tough Cookie Trick (Robust): Pressure Pete had a WEIRD reading of 9999 that messed up the averages. “Use the MIDDLE value instead of the average!” said Bella the Battery. “The middle value ignores crazy numbers!”

Now ALL the sensors were on the same scale, and the contest was fair! “Normalization makes sure no sensor gets more attention just because it uses BIGGER numbers!” explained Max.

44.5.2 Key Words for Kids

| Word | What It Means |
|---|---|
| Normalization | Putting different measurements on the same scale so they can be compared fairly |
| Min-Max | Turning a number into a percentage between 0 and 1 |
| Z-Score | Measuring how far a number is from “normal” |
| Robust | A method that still works well even when some numbers are wrong |
| Scale | The range of numbers a sensor uses (like 0-50 or 0-100,000) |

Key Takeaway

Data normalization is essential before combining multi-sensor data or feeding it to machine learning models. Choose your method based on data characteristics: min-max for bounded outputs, Z-score for Gaussian assumptions, robust scaling when outliers are present. Always fit normalization parameters on training data only to avoid data leakage. Implement the complete validate-clean-transform pipeline at the edge to catch data quality issues at the source.

44.6 Concept Relationships

This chapter builds on validation and cleaning while introducing normalization as the final pipeline stage:

Related Concepts (Enhance understanding):

  • Multi-Sensor Data Fusion - Normalization enables fair comparison when fusing sensors with vastly different measurement ranges (temperature vs light intensity)
  • Edge Data Acquisition - Edge devices normalize locally to reduce transmission bandwidth and prepare data for edge ML inference

Advanced Applications (Build on this):

  • Modeling and Inferencing - Neural networks require normalized inputs ([0,1] or mean=0, std=1) for stable gradient descent
  • Anomaly Detection - Z-score normalization makes distance-based anomaly detection work across features with different scales

Key Insight: Normalization is the final stage of the validate-clean-transform pipeline. Applying it before validation or cleaning causes incorrect scaling parameters (e.g., min-max uses outlier as max value, compressing all valid data into a tiny range).

44.7 What’s Next

| If you want to… | Read this |
|---|---|
| Understand imputation techniques that precede normalisation | Data Quality Imputation and Filtering |
| Study the full preprocessing pipeline | Data Quality and Preprocessing |
| Apply normalised data to ML model training | Modeling and Inferencing |
| Understand data validation before normalisation | Data Quality Validation |
| Return to the module overview | Big Data Overview |
