44 Normalization & Preprocessing
44.1 Learning Objectives
By the end of this chapter, you will be able to:
- Apply Normalization Techniques: Implement min-max scaling, Z-score normalization, and robust scaling for multi-sensor data fusion
- Compare Normalization Methods: Evaluate which scaling approach suits each downstream use case (neural networks, clustering, visualization)
- Implement a Complete Pipeline: Build and test an end-to-end data quality system on an ESP32 microcontroller
- Assess Data Quality Metrics: Calculate and interpret validation rates, outlier counts, and imputation statistics
44.2 Prerequisites
Before diving into this chapter, you should be familiar with:
- Data Validation and Outlier Detection: Understanding validation as the first stage of data quality
- Missing Value Imputation and Noise Filtering: Handling gaps and noise in sensor data
- Edge Data Acquisition: Understanding how sensor data is collected at the edge
For Beginners: Why Normalize Data?
Imagine comparing apples and elephants. Temperature might range from -10 to 50 degrees, while light levels range from 0 to 100,000 lux. If you feed both to a machine learning model without normalizing, the model will think light is 2000x more important simply because the numbers are bigger!
Normalization puts all sensors on the same scale:
| Before Normalization | After Min-Max (0-1) |
|---|---|
| Temperature: 25C | Temperature: 0.58 |
| Light: 50,000 lux | Light: 0.50 |
Now both contribute equally based on their actual information content, not their arbitrary unit scales.
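A quick numeric sketch makes the imbalance concrete. Using two hypothetical sensor snapshots, the Euclidean distance between them is driven almost entirely by the light channel until both channels are min-max scaled (the physical ranges below are the same ones used in the table above):

```python
import math

# Two hypothetical snapshots: (temperature in C, light in lux)
a = (25.0, 50_000.0)
b = (26.0, 49_000.0)

# Raw distance: the 1000-lux light difference swamps the 1-degree temp difference
raw_dist = math.dist(a, b)

# Min-max scale each channel using its assumed physical range
def minmax(x, lo, hi):
    return (x - lo) / (hi - lo)

a_n = (minmax(a[0], -10, 50), minmax(a[1], 0, 100_000))
b_n = (minmax(b[0], -10, 50), minmax(b[1], 0, 100_000))
norm_dist = math.dist(a_n, b_n)

print(raw_dist)   # ~1000: essentially just the light difference
print(norm_dist)  # small, with both channels contributing comparably
```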
When to use different methods:
| Method | Use When | Output Range |
|---|---|---|
| Min-Max | Need bounded outputs (neural networks) | 0 to 1 |
| Z-Score | Gaussian data, using K-means/SVM/PCA | Mean=0, StdDev=1 |
| Robust | Many outliers you want to ignore | Median-centered |
| Log | Data spans orders of magnitude | Compressed scale |
Key question this chapter answers: “How do I prepare multi-sensor data so it can be combined and analyzed fairly?”
Minimum Viable Understanding: Data Normalization
Core Concept: Normalization transforms sensor readings to a common scale, enabling fair comparison and combination of data from sensors with vastly different measurement ranges.
Why It Matters: Without normalization, a light sensor reading 100,000 lux would dominate a temperature sensor reading 25C in any analysis, even though both carry equal information. Machine learning models and statistical methods assume comparable scales.
Key Takeaway: Use min-max scaling (0-1) for neural networks and bounded outputs; use Z-score normalization for clustering and algorithms assuming Gaussian distributions; use robust scaling when outliers are present and should be dampened.
44.3 Data Normalization and Scaling
Key Concepts
- Min-max normalization: Scaling all values to the [0, 1] range by subtracting the minimum and dividing by the range; sensitive to outliers, which compress the majority of values into a narrow band.
- Z-score standardization: Transforming values to zero mean and unit variance by subtracting the mean and dividing by the standard deviation; assumes an approximately Gaussian distribution.
- Robust scaling: Scaling using the median and interquartile range instead of the mean and standard deviation, making it resistant to outliers; preferred for sensor data with occasional extreme spikes.
- Feature scaling: The general process of transforming sensor channels to comparable magnitude ranges so that machine learning algorithms treat all features equally, regardless of their engineering units.
- Normalization artifact: A distortion introduced by incorrect normalization, for example scaling the train and test sets with different parameters, or using the wrong normalization for the downstream algorithm.
- Data leakage: Contamination of model evaluation with information from the test set, often caused by normalizing all data together before splitting into train/test, which leaks test statistics into the normalization parameters.
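The leakage pitfall in the last bullet is easy to avoid in practice: split first, fit the scaler on the training portion only, then apply the learned parameters to both splits. A minimal sketch in plain Python, using hypothetical readings:

```python
readings = [18.0, 22.0, 25.0, 19.0, 21.0, 30.0, 17.0, 23.0]

# Split BEFORE computing any statistics
train, test = readings[:6], readings[6:]

# Fit min-max parameters on the training split only
lo, hi = min(train), max(train)

def scale(x):
    return (x - lo) / (hi - lo)

train_scaled = [scale(x) for x in train]
test_scaled = [scale(x) for x in test]  # may fall outside [0, 1]; that's expected

print(train_scaled)
print(test_scaled)  # 17.0 scales below 0 because the test set never influenced lo/hi
```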
44.3.1 Min-Max Scaling
Scales data to a fixed range (typically 0-1):
```python
class MinMaxScaler:
    def __init__(self, feature_range=(0, 1)):
        self.min_val = None
        self.max_val = None
        self.feature_min, self.feature_max = feature_range

    def fit(self, data):
        """Learn min/max from training data"""
        self.min_val = min(data)
        self.max_val = max(data)

    def transform(self, value):
        """Scale a single value"""
        if self.max_val == self.min_val:
            return self.feature_min
        scaled = (value - self.min_val) / (self.max_val - self.min_val)
        return scaled * (self.feature_max - self.feature_min) + self.feature_min

    def inverse_transform(self, scaled_value):
        """Convert back to original scale"""
        original = (scaled_value - self.feature_min) / (self.feature_max - self.feature_min)
        return original * (self.max_val - self.min_val) + self.min_val
```
44.3.2 Z-Score Normalization (Standardization)
Centers data around mean with unit variance:
```python
import numpy as np

class ZScoreNormalizer:
    def __init__(self):
        self.mean = None
        self.std = None

    def fit(self, data):
        """Learn mean and std from training data"""
        self.mean = np.mean(data)
        self.std = np.std(data)

    def transform(self, value):
        """Normalize a single value"""
        if self.std == 0:
            return 0.0
        return (value - self.mean) / self.std

    def inverse_transform(self, normalized_value):
        """Convert back to original scale"""
        return normalized_value * self.std + self.mean
```
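The fit/transform split matters on a device: fit once on a calibration window, then transform each new sample as it streams in, without re-fitting. A minimal functional sketch of that usage pattern (plain Python, hypothetical calibration data):

```python
# Calibration window collected at startup (hypothetical readings, in C)
calibration = [18.0, 22.0, 25.0, 19.0, 21.0]

# "fit": learn the parameters once
mu = sum(calibration) / len(calibration)
sigma = (sum((x - mu) ** 2 for x in calibration) / len(calibration)) ** 0.5

# "transform": apply to each streaming sample using the frozen parameters
def standardize(x):
    return (x - mu) / sigma if sigma else 0.0

for sample in [20.0, 26.5, 21.0]:
    print(round(standardize(sample), 3))

# "inverse_transform": recover engineering units when needed
assert abs(standardize(26.5) * sigma + mu - 26.5) < 1e-9
```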
Putting Numbers to It
Min-max and Z-score normalization transform sensor data to comparable scales for multi-sensor fusion and machine learning.
Min-max formula: \(x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}\)
Z-score formula: \(z = \frac{x - \mu}{\sigma}\)
Worked example: A temperature sensor collects 5 readings: [18C, 22C, 25C, 19C, 21C]. Normalize using both methods:
Min-max scaling:
- \(x_{\min} = 18\), \(x_{\max} = 25\)
- For 22C: \(x' = \frac{22 - 18}{25 - 18} = \frac{4}{7} = 0.571\)
- For 25C: \(x' = \frac{25 - 18}{25 - 18} = \frac{7}{7} = 1.000\)
Z-score normalization:
- \(\mu = 21\), \(\sigma = \sqrt{\frac{(18{-}21)^2 + (22{-}21)^2 + (25{-}21)^2 + (19{-}21)^2 + (21{-}21)^2}{5}} = \sqrt{6} \approx 2.449\)
- For 22C: \(z = \frac{22 - 21}{2.449} = 0.408\)
- For 25C: \(z = \frac{25 - 21}{2.449} = 1.633\)
Result: Min-max gives bounded [0,1] range, while Z-score preserves statistical distance (25C is 1.63 std deviations above mean).
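The worked example above can be verified in a few lines of plain Python:

```python
readings = [18, 22, 25, 19, 21]

# Min-max scaling
lo, hi = min(readings), max(readings)
mm = [(x - lo) / (hi - lo) for x in readings]
print(round(mm[1], 3))  # 22C -> 0.571
print(round(mm[2], 3))  # 25C -> 1.0

# Z-score normalization (population standard deviation, as in the worked example)
mu = sum(readings) / len(readings)
sigma = (sum((x - mu) ** 2 for x in readings) / len(readings)) ** 0.5
z = [(x - mu) / sigma for x in readings]
print(round(z[1], 3))  # 22C -> 0.408
print(round(z[2], 3))  # 25C -> 1.633
```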
44.3.3 Robust Scaling
Uses median and IQR, robust to outliers:
```python
import numpy as np

class RobustScaler:
    def __init__(self):
        self.median = None
        self.iqr = None

    def fit(self, data):
        """Learn median and IQR from training data"""
        self.median = np.median(data)
        q1 = np.percentile(data, 25)
        q3 = np.percentile(data, 75)
        self.iqr = q3 - q1

    def transform(self, value):
        """Scale using median and IQR"""
        if self.iqr == 0:
            return 0.0
        return (value - self.median) / self.iqr

    def inverse_transform(self, scaled_value):
        """Convert back to original scale"""
        return scaled_value * self.iqr + self.median
```
44.3.4 When to Use Each Method
| Method | Use Case | Preserves | Sensitive To |
|---|---|---|---|
| Min-Max | Neural networks, bounded outputs | Distribution shape | Outliers |
| Z-Score | SVM, K-means, PCA, Gaussian assumptions | Relative distances | Outliers (shift mean/std) |
| Robust Scaling | When outliers should not affect range | Median-based | Extremely sparse data |
| Log Transform | Right-skewed data (power, counts) | Multiplicative relationships | Zero/negative values |
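To see why the robust-scaling row matters, compare min-max and robust scaling on a hypothetical series with one spike (this sketch uses the lower-half/upper-half median convention for quartiles, which differs slightly from numpy's interpolated percentiles):

```python
from statistics import median

data = [20.0, 21.0, 22.0, 21.5, 20.5, 95.0]  # 95.0 is a sensor spike

# Min-max: the spike defines the range, crushing normal readings toward 0
lo, hi = min(data), max(data)
mm = [(x - lo) / (hi - lo) for x in data]

# Robust: median and IQR barely notice the spike
s = sorted(data)
med = median(s)
q1 = median(s[: len(s) // 2])           # median of the lower half
q3 = median(s[len(s) - len(s) // 2:])   # median of the upper half
rs = [(x - med) / (q3 - q1) for x in data]

print([round(v, 3) for v in mm])  # normal readings all land below ~0.03
print([round(v, 3) for v in rs])  # normal readings keep a usable spread; spike stands out
```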
44.3.5 Interactive Normalization Calculator
Try different sensor values and see how each normalization method transforms them:
44.4 Data Quality Lab: ESP32 Wokwi Simulation
44.4.1 Lab Overview
In this hands-on lab, you will implement a complete data quality preprocessing pipeline on an ESP32 microcontroller. The simulation demonstrates real-world techniques for handling sensor data problems including outliers, missing values, noise, and the need for normalization.
What You Will Learn:
- Sensor Data Validation: Implementing range checks and rate-of-change validation
- Outlier Detection: Using Z-score and IQR methods to identify anomalous readings
- Missing Value Handling: Forward-fill and interpolation techniques for gap handling
- Noise Filtering: Moving average, median filter, and exponential smoothing
- Data Normalization: Min-max scaling and Z-score normalization for multi-sensor fusion
Skills Practiced:
- Real-time data processing on embedded systems
- Statistical calculations with limited memory
- Circular buffer implementations
- Streaming algorithm design
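Before writing the C++ version, it can help to prototype the lab's core data structure in Python: a fixed-size circular window with a median, which is what makes a median filter reject spikes. This is a sketch of the idea, not the lab code itself:

```python
from collections import deque

class CircularBuffer:
    """Fixed-size window: the oldest sample falls off as each new one arrives."""
    def __init__(self, size):
        self.buf = deque(maxlen=size)

    def push(self, value):
        self.buf.append(value)

    def median(self):
        s = sorted(self.buf)
        n = len(s)
        mid = n // 2
        return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2

window = CircularBuffer(5)
for v in [20.0, 21.0, 95.0, 20.5, 21.5, 22.0]:  # 95.0 is a spike
    window.push(v)

print(window.median())  # the spike is outvoted by the rest of the window
```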
44.4.2 Lab Components
| Technique | Implementation | Visual Indicator |
|---|---|---|
| Range Validation | Physical bounds checking | Red LED on violation |
| Z-Score Outliers | Rolling statistics | Yellow LED on outlier |
| Median Filter | 5-sample sliding window | Smoothed output in serial |
| Moving Average | 10-sample window | Trend visualization |
| Normalization | 0-1 scaling | Percentage output |
44.4.3 Wokwi Simulator
Use the embedded simulator below to build your data quality preprocessing system:
44.4.4 Circuit Setup
Connect the sensors and indicators to the ESP32:
| Component | ESP32 Pin | Purpose |
|---|---|---|
| Temperature Sensor (NTC) | GPIO 34 | Primary data source |
| Light Sensor (LDR) | GPIO 35 | Secondary data source |
| Potentiometer | GPIO 32 | Simulate sensor drift |
| Red LED | GPIO 18 | Range violation indicator |
| Yellow LED | GPIO 19 | Outlier detection indicator |
| Green LED | GPIO 21 | Valid data indicator |
| Blue LED | GPIO 22 | Missing data indicator |
Add this diagram.json configuration in Wokwi:
```json
{
  "version": 1,
  "author": "IoT Class - Data Quality Lab",
  "editor": "wokwi",
  "parts": [
    { "type": "wokwi-esp32-devkit-v1", "id": "esp", "top": 0, "left": 0 },
    { "type": "wokwi-ntc-temperature-sensor", "id": "temp1", "top": -120, "left": 80 },
    { "type": "wokwi-photoresistor-sensor", "id": "ldr1", "top": -120, "left": 180 },
    { "type": "wokwi-potentiometer", "id": "pot1", "top": -120, "left": 280 },
    { "type": "wokwi-led", "id": "led_red", "top": 180, "left": 80, "attrs": { "color": "red" } },
    { "type": "wokwi-led", "id": "led_yellow", "top": 180, "left": 130, "attrs": { "color": "yellow" } },
    { "type": "wokwi-led", "id": "led_green", "top": 180, "left": 180, "attrs": { "color": "green" } },
    { "type": "wokwi-led", "id": "led_blue", "top": 180, "left": 230, "attrs": { "color": "blue" } },
    { "type": "wokwi-resistor", "id": "r1", "top": 230, "left": 80, "attrs": { "value": "220" } },
    { "type": "wokwi-resistor", "id": "r2", "top": 230, "left": 130, "attrs": { "value": "220" } },
    { "type": "wokwi-resistor", "id": "r3", "top": 230, "left": 180, "attrs": { "value": "220" } },
    { "type": "wokwi-resistor", "id": "r4", "top": 230, "left": 230, "attrs": { "value": "220" } }
  ],
  "connections": [
    ["esp:GND.1", "temp1:GND", "black", ["h0"]],
    ["esp:3V3", "temp1:VCC", "red", ["h0"]],
    ["esp:34", "temp1:OUT", "green", ["h0"]],
    ["esp:GND.1", "ldr1:GND", "black", ["h0"]],
    ["esp:3V3", "ldr1:VCC", "red", ["h0"]],
    ["esp:35", "ldr1:OUT", "orange", ["h0"]],
    ["esp:GND.1", "pot1:GND", "black", ["h0"]],
    ["esp:3V3", "pot1:VCC", "red", ["h0"]],
    ["esp:32", "pot1:SIG", "purple", ["h0"]],
    ["esp:18", "led_red:A", "red", ["h0"]],
    ["led_red:C", "r1:1", "black", ["h0"]],
    ["r1:2", "esp:GND.1", "black", ["h0"]],
    ["esp:19", "led_yellow:A", "yellow", ["h0"]],
    ["led_yellow:C", "r2:1", "black", ["h0"]],
    ["r2:2", "esp:GND.1", "black", ["h0"]],
    ["esp:21", "led_green:A", "green", ["h0"]],
    ["led_green:C", "r3:1", "black", ["h0"]],
    ["r3:2", "esp:GND.1", "black", ["h0"]],
    ["esp:22", "led_blue:A", "blue", ["h0"]],
    ["led_blue:C", "r4:1", "black", ["h0"]],
    ["r4:2", "esp:GND.1", "black", ["h0"]]
  ]
}
```
44.4.5 Complete Arduino Code
Copy this code into the Wokwi editor:
// ============================================================================
// DATA QUALITY LAB: Comprehensive IoT Data Preprocessing Pipeline
// ============================================================================
// Demonstrates: Validation, Outlier Detection, Missing Value Handling,
// Noise Filtering, Normalization, and Data Scaling
// ============================================================================
#include <Arduino.h>
#include <math.h>
// ====================== PIN DEFINITIONS ======================
const int TEMP_PIN = 34; // Temperature sensor (NTC)
const int LIGHT_PIN = 35; // Light sensor (LDR)
const int DRIFT_PIN = 32; // Potentiometer for drift simulation
const int LED_RED = 18; // Range violation indicator
const int LED_YELLOW = 19; // Outlier detection indicator
const int LED_GREEN = 21; // Valid data indicator
const int LED_BLUE = 22; // Missing data indicator
// ====================== SAMPLING PARAMETERS ======================
const int SAMPLE_INTERVAL_MS = 200; // 5 Hz sampling rate
const int VALIDATION_WINDOW = 50; // Samples for statistics
const int FILTER_WINDOW_MEDIAN = 5; // Median filter window
const int FILTER_WINDOW_MA = 10; // Moving average window
const float EXP_SMOOTHING_ALPHA = 0.3; // Exponential smoothing factor
// ====================== VALIDATION THRESHOLDS ======================
// Temperature sensor valid range (Celsius after conversion)
const float TEMP_MIN_VALID = -10.0;
const float TEMP_MAX_VALID = 60.0;
const float TEMP_MAX_RATE = 2.0; // Max 2C change per second
// Light sensor valid range (0-4095 ADC, 0-100000 lux mapped)
const float LIGHT_MIN_VALID = 0.0;
const float LIGHT_MAX_VALID = 100000.0;
// Outlier detection thresholds
const float ZSCORE_THRESHOLD = 3.0; // Standard deviations
const float IQR_MULTIPLIER = 1.5; // IQR outlier multiplier
const float MAD_THRESHOLD = 3.5; // Modified Z-score threshold
// Missing data parameters
const int MAX_MISSING_SAMPLES = 10; // Max consecutive missing allowed
const float MISSING_PROBABILITY = 0.05; // Simulate 5% missing data
// ====================== DATA STRUCTURES ======================
// Circular buffer for streaming statistics
template<int SIZE>
struct CircularBuffer {
float data[SIZE];
int head;
int count;
CircularBuffer() : head(0), count(0) {
for (int i = 0; i < SIZE; i++) data[i] = 0;
}
void push(float value) {
data[head] = value;
head = (head + 1) % SIZE;
if (count < SIZE) count++;
}
float get(int index) const {
// Get value at index (0 = oldest)
int actual = (head - count + index + SIZE) % SIZE;
return data[actual];
}
bool isFull() const { return count == SIZE; }
int size() const { return count; }
};
// Statistics calculator for streaming data
struct StreamingStats {
float sum;
float sumSq;
int count;
float min_val;
float max_val;
StreamingStats() : sum(0), sumSq(0), count(0),
min_val(INFINITY), max_val(-INFINITY) {}
void reset() {
sum = sumSq = 0;
count = 0;
min_val = INFINITY;
max_val = -INFINITY;
}
void add(float value) {
sum += value;
sumSq += value * value;
count++;
if (value < min_val) min_val = value;
if (value > max_val) max_val = value;
}
float mean() const { return count > 0 ? sum / count : 0; }
float variance() const {
if (count < 2) return 0;
return (sumSq - sum * sum / count) / (count - 1);
}
float stdDev() const { return sqrt(variance()); }
};
// Sensor reading with quality metadata
struct SensorReading {
float rawValue;
float cleanedValue;
float normalizedValue;
unsigned long timestamp;
bool isValid;
bool isOutlier;
bool isMissing;
bool isImputed;
String qualityFlags;
};
// ====================== GLOBAL STATE ======================
// Circular buffers for different processing stages
CircularBuffer<VALIDATION_WINDOW> tempRawBuffer;
CircularBuffer<VALIDATION_WINDOW> tempCleanBuffer;
CircularBuffer<FILTER_WINDOW_MEDIAN> tempMedianBuffer;
CircularBuffer<FILTER_WINDOW_MA> tempMABuffer;
CircularBuffer<VALIDATION_WINDOW> lightRawBuffer;
CircularBuffer<VALIDATION_WINDOW> lightCleanBuffer;
// Streaming statistics
StreamingStats tempStats;
StreamingStats lightStats;
// For rate-of-change validation
float lastValidTemp = NAN;
unsigned long lastValidTempTime = 0;
float lastValidLight = NAN;
unsigned long lastValidLightTime = 0;
// For exponential smoothing
float expSmoothedTemp = NAN;
float expSmoothedLight = NAN;
// For missing data handling
int consecutiveMissingTemp = 0;
int consecutiveMissingLight = 0;
float lastImputedTemp = NAN;
float lastImputedLight = NAN;
// Normalization parameters (learned from data)
float tempMinSeen = INFINITY;
float tempMaxSeen = -INFINITY;
float lightMinSeen = INFINITY;
float lightMaxSeen = -INFINITY;
// Statistics counters
unsigned long totalSamples = 0;
unsigned long validSamples = 0;
unsigned long outlierSamples = 0;
unsigned long missingSamples = 0;
unsigned long imputedSamples = 0;
unsigned long rangeViolations = 0;
unsigned long rateViolations = 0;
// ====================== HELPER FUNCTIONS ======================
// Convert ADC reading to temperature (NTC thermistor approximation)
float adcToTemperature(int adcValue) {
if (adcValue == 0) return -INFINITY;
if (adcValue >= 4095) return INFINITY;
// Simplified Steinhart-Hart approximation for 10K NTC
float resistance = 10000.0 * (4095.0 / adcValue - 1.0);
float steinhart = resistance / 10000.0;
steinhart = log(steinhart);
steinhart /= 3950.0;
steinhart += 1.0 / (25.0 + 273.15);
steinhart = 1.0 / steinhart;
steinhart -= 273.15;
return steinhart;
}
// Convert ADC reading to light level (LDR approximation)
float adcToLight(int adcValue) {
// Map ADC to approximate lux (logarithmic)
if (adcValue < 10) return 0;
float lux = 100000.0 * pow((float)adcValue / 4095.0, 2);
return lux;
}
// ====================== VALIDATION FUNCTIONS ======================
// Range validation - check if value is within physical bounds
bool validateRange(float value, float minValid, float maxValid, String& errorMsg) {
if (isnan(value) || isinf(value)) {
errorMsg = "NaN/Inf";
return false;
}
if (value < minValid) {
errorMsg = "Below min (" + String(minValid) + ")";
return false;
}
if (value > maxValid) {
errorMsg = "Above max (" + String(maxValid) + ")";
return false;
}
errorMsg = "OK";
return true;
}
// Rate-of-change validation - detect impossible jumps
bool validateRateOfChange(float currentValue, float lastValue,
unsigned long currentTime, unsigned long lastTime,
float maxRate, String& errorMsg) {
if (isnan(lastValue) || lastTime == 0) {
errorMsg = "First reading";
return true;
}
float timeDelta = (currentTime - lastTime) / 1000.0; // seconds
if (timeDelta <= 0) {
errorMsg = "Invalid timestamp";
return false;
}
float rate = abs(currentValue - lastValue) / timeDelta;
if (rate > maxRate) {
errorMsg = "Rate " + String(rate, 2) + " exceeds max " + String(maxRate);
return false;
}
errorMsg = "Rate OK (" + String(rate, 2) + ")";
return true;
}
// ====================== OUTLIER DETECTION ======================
// Z-Score outlier detection
bool detectOutlierZScore(float value, float mean, float stdDev,
float threshold, float& zScore) {
if (stdDev == 0) {
zScore = 0;
return false;
}
zScore = abs((value - mean) / stdDev);
return zScore > threshold;
}
// IQR outlier detection (requires sorted buffer)
bool detectOutlierIQR(CircularBuffer<VALIDATION_WINDOW>& buffer, float value,
float multiplier, float& lowerBound, float& upperBound) {
if (buffer.size() < 10) return false;
// Copy to temporary array for sorting
float sorted[VALIDATION_WINDOW];
int n = buffer.size();
for (int i = 0; i < n; i++) {
sorted[i] = buffer.get(i);
}
// Simple bubble sort (OK for small buffers)
for (int i = 0; i < n - 1; i++) {
for (int j = 0; j < n - i - 1; j++) {
if (sorted[j] > sorted[j + 1]) {
float temp = sorted[j];
sorted[j] = sorted[j + 1];
sorted[j + 1] = temp;
}
}
}
// Calculate quartiles
int q1Idx = n / 4;
int q3Idx = 3 * n / 4;
float q1 = sorted[q1Idx];
float q3 = sorted[q3Idx];
float iqr = q3 - q1;
lowerBound = q1 - multiplier * iqr;
upperBound = q3 + multiplier * iqr;
return (value < lowerBound) || (value > upperBound);
}
// ====================== NOISE FILTERING ======================
// Moving average filter
float filterMovingAverage(CircularBuffer<FILTER_WINDOW_MA>& buffer, float newValue) {
buffer.push(newValue);
float sum = 0;
for (int i = 0; i < buffer.size(); i++) {
sum += buffer.get(i);
}
return sum / buffer.size();
}
// Median filter (excellent for spike removal)
float filterMedian(CircularBuffer<FILTER_WINDOW_MEDIAN>& buffer, float newValue) {
buffer.push(newValue);
// Copy and sort
float sorted[FILTER_WINDOW_MEDIAN];
int n = buffer.size();
for (int i = 0; i < n; i++) {
sorted[i] = buffer.get(i);
}
for (int i = 0; i < n - 1; i++) {
for (int j = 0; j < n - i - 1; j++) {
if (sorted[j] > sorted[j + 1]) {
float temp = sorted[j];
sorted[j] = sorted[j + 1];
sorted[j + 1] = temp;
}
}
}
if (n % 2 == 0) {
return (sorted[n/2 - 1] + sorted[n/2]) / 2.0;
}
return sorted[n / 2];
}
// Exponential smoothing filter
float filterExponentialSmoothing(float& smoothed, float newValue, float alpha) {
if (isnan(smoothed)) {
smoothed = newValue;
} else {
smoothed = alpha * newValue + (1 - alpha) * smoothed;
}
return smoothed;
}
// ====================== MISSING VALUE HANDLING ======================
// Simulate missing data (for demonstration)
bool simulateMissingData() {
return random(1000) < (MISSING_PROBABILITY * 1000);
}
// Forward-fill imputation with limit
float imputeForwardFill(float lastValidValue, int& consecutiveMissing,
int maxMissing, bool& wasImputed) {
consecutiveMissing++;
if (consecutiveMissing > maxMissing || isnan(lastValidValue)) {
wasImputed = false;
return NAN;
}
wasImputed = true;
return lastValidValue;
}
// ====================== NORMALIZATION ======================
// Min-Max scaling to 0-1 range
float normalizeMinMax(float value, float minSeen, float maxSeen) {
if (maxSeen == minSeen) return 0.5;
return (value - minSeen) / (maxSeen - minSeen);
}
// Z-Score normalization (standardization)
float normalizeZScore(float value, float mean, float stdDev) {
if (stdDev == 0) return 0;
return (value - mean) / stdDev;
}
// Update normalization parameters
void updateNormalizationParams(float value, float& minSeen, float& maxSeen) {
if (value < minSeen) minSeen = value;
if (value > maxSeen) maxSeen = value;
}
// ====================== MAIN PROCESSING FUNCTION ======================
SensorReading processTemperature(int adcValue, unsigned long timestamp) {
SensorReading reading;
reading.timestamp = timestamp;
reading.isValid = true;
reading.isOutlier = false;
reading.isMissing = false;
reading.isImputed = false;
reading.qualityFlags = "";
// Step 0: Check for simulated missing data
if (simulateMissingData()) {
reading.isMissing = true;
missingSamples++;
// Try forward-fill imputation
bool wasImputed;
float imputed = imputeForwardFill(lastValidTemp, consecutiveMissingTemp,
MAX_MISSING_SAMPLES, wasImputed);
if (wasImputed) {
reading.rawValue = NAN;
reading.cleanedValue = imputed;
reading.isImputed = true;
reading.qualityFlags += "IMPUTED_FFILL ";
imputedSamples++;
lastImputedTemp = imputed;
} else {
reading.rawValue = NAN;
reading.cleanedValue = NAN;
reading.normalizedValue = NAN;
reading.qualityFlags += "MISSING_NO_IMPUTE ";
return reading;
}
} else {
consecutiveMissingTemp = 0;
// Step 1: Convert ADC to temperature
reading.rawValue = adcToTemperature(adcValue);
// Step 2: Range validation
String rangeError;
if (!validateRange(reading.rawValue, TEMP_MIN_VALID, TEMP_MAX_VALID, rangeError)) {
reading.isValid = false;
reading.qualityFlags += "RANGE_VIOLATION(" + rangeError + ") ";
rangeViolations++;
}
// Step 3: Rate-of-change validation
String rateError;
if (!validateRateOfChange(reading.rawValue, lastValidTemp,
timestamp, lastValidTempTime,
TEMP_MAX_RATE, rateError)) {
reading.isValid = false;
reading.qualityFlags += "RATE_VIOLATION(" + rateError + ") ";
rateViolations++;
}
// Step 4: Outlier detection (only if range-valid)
if (reading.isValid && tempStats.count > 20) {
float zScore;
if (detectOutlierZScore(reading.rawValue, tempStats.mean(),
tempStats.stdDev(), ZSCORE_THRESHOLD, zScore)) {
reading.isOutlier = true;
reading.qualityFlags += "ZSCORE_OUTLIER(z=" + String(zScore, 2) + ") ";
outlierSamples++;
}
float lowerBound, upperBound;
if (detectOutlierIQR(tempRawBuffer, reading.rawValue,
IQR_MULTIPLIER, lowerBound, upperBound)) {
reading.isOutlier = true;
reading.qualityFlags += "IQR_OUTLIER ";
}
}
// Step 5: Apply noise filtering
if (reading.isValid && !reading.isOutlier) {
// Median filter first (removes spikes)
float medianFiltered = filterMedian(tempMedianBuffer, reading.rawValue);
// Then moving average (smooths remaining noise)
float maFiltered = filterMovingAverage(tempMABuffer, medianFiltered);
// Exponential smoothing for final output
reading.cleanedValue = filterExponentialSmoothing(expSmoothedTemp,
maFiltered,
EXP_SMOOTHING_ALPHA);
// Update statistics with valid data
tempStats.add(reading.rawValue);
tempRawBuffer.push(reading.rawValue);
tempCleanBuffer.push(reading.cleanedValue);
// Update last valid
lastValidTemp = reading.rawValue;
lastValidTempTime = timestamp;
validSamples++;
} else if (reading.isOutlier) {
// Use median of recent values for outlier replacement
reading.cleanedValue = filterMedian(tempMedianBuffer,
tempMedianBuffer.get(tempMedianBuffer.size() - 1));
reading.qualityFlags += "OUTLIER_REPLACED ";
} else {
// Range/rate violation - use last valid with flag
if (!isnan(lastValidTemp)) {
reading.cleanedValue = lastValidTemp;
reading.qualityFlags += "USING_LAST_VALID ";
} else {
reading.cleanedValue = NAN;
}
}
}
// Step 6: Normalization
if (!isnan(reading.cleanedValue)) {
updateNormalizationParams(reading.cleanedValue, tempMinSeen, tempMaxSeen);
reading.normalizedValue = normalizeMinMax(reading.cleanedValue,
tempMinSeen, tempMaxSeen);
} else {
reading.normalizedValue = NAN;
}
if (reading.qualityFlags == "") {
reading.qualityFlags = "CLEAN";
}
totalSamples++;
return reading;
}
SensorReading processLight(int adcValue, unsigned long timestamp) {
SensorReading reading;
reading.timestamp = timestamp;
reading.isValid = true;
reading.isOutlier = false;
reading.isMissing = false;
reading.isImputed = false;
reading.qualityFlags = "";
// Similar processing as temperature (abbreviated for space)
if (simulateMissingData()) {
reading.isMissing = true;
missingSamples++;
bool wasImputed;
float imputed = imputeForwardFill(lastValidLight, consecutiveMissingLight,
MAX_MISSING_SAMPLES, wasImputed);
if (wasImputed) {
reading.rawValue = NAN;
reading.cleanedValue = imputed;
reading.isImputed = true;
reading.qualityFlags = "IMPUTED_FFILL";
imputedSamples++;
} else {
reading.rawValue = NAN;
reading.cleanedValue = NAN;
reading.normalizedValue = NAN;
reading.qualityFlags = "MISSING_NO_IMPUTE";
return reading;
}
} else {
consecutiveMissingLight = 0;
reading.rawValue = adcToLight(adcValue);
// Simplified processing for light sensor
String rangeError;
if (!validateRange(reading.rawValue, LIGHT_MIN_VALID, LIGHT_MAX_VALID, rangeError)) {
reading.isValid = false;
reading.qualityFlags = "RANGE_VIOLATION";
rangeViolations++;
}
if (reading.isValid) {
reading.cleanedValue = filterExponentialSmoothing(expSmoothedLight,
reading.rawValue,
EXP_SMOOTHING_ALPHA);
lightStats.add(reading.rawValue);
lightRawBuffer.push(reading.rawValue);
lastValidLight = reading.rawValue;
lastValidLightTime = timestamp;
validSamples++;
} else {
reading.cleanedValue = lastValidLight;
}
}
// Normalization
if (!isnan(reading.cleanedValue)) {
updateNormalizationParams(reading.cleanedValue, lightMinSeen, lightMaxSeen);
reading.normalizedValue = normalizeMinMax(reading.cleanedValue,
lightMinSeen, lightMaxSeen);
}
if (reading.qualityFlags == "") {
reading.qualityFlags = "CLEAN";
}
totalSamples++;
return reading;
}
// ====================== LED INDICATOR CONTROL ======================
void updateLEDs(const SensorReading& tempReading, const SensorReading& lightReading) {
// Red LED: Range violation
if (!tempReading.isValid || !lightReading.isValid) {
digitalWrite(LED_RED, HIGH);
} else {
digitalWrite(LED_RED, LOW);
}
// Yellow LED: Outlier detected
if (tempReading.isOutlier || lightReading.isOutlier) {
digitalWrite(LED_YELLOW, HIGH);
} else {
digitalWrite(LED_YELLOW, LOW);
}
// Green LED: Clean valid data
if (tempReading.qualityFlags == "CLEAN" && lightReading.qualityFlags == "CLEAN") {
digitalWrite(LED_GREEN, HIGH);
} else {
digitalWrite(LED_GREEN, LOW);
}
// Blue LED: Missing/imputed data
if (tempReading.isMissing || tempReading.isImputed ||
lightReading.isMissing || lightReading.isImputed) {
digitalWrite(LED_BLUE, HIGH);
} else {
digitalWrite(LED_BLUE, LOW);
}
}
// ====================== SERIAL OUTPUT ======================
void printSensorReading(const char* sensorName, const SensorReading& reading) {
Serial.print(sensorName);
Serial.print(": Raw=");
if (isnan(reading.rawValue)) {
Serial.print("NaN");
} else {
Serial.print(reading.rawValue, 2);
}
Serial.print(", Clean=");
if (isnan(reading.cleanedValue)) {
Serial.print("NaN");
} else {
Serial.print(reading.cleanedValue, 2);
}
Serial.print(", Norm=");
if (isnan(reading.normalizedValue)) {
Serial.print("NaN");
} else {
Serial.print(reading.normalizedValue, 3);
}
Serial.print(" [");
Serial.print(reading.qualityFlags);
Serial.println("]");
}
void printStatistics() {
Serial.println("\n========== DATA QUALITY STATISTICS ==========");
Serial.print("Total Samples: ");
Serial.println(totalSamples);
Serial.print("Valid Samples: ");
Serial.print(validSamples);
Serial.print(" (");
Serial.print(100.0 * validSamples / max(totalSamples, 1UL), 1);
Serial.println("%)");
Serial.print("Outliers Detected: ");
Serial.print(outlierSamples);
Serial.print(" (");
Serial.print(100.0 * outlierSamples / max(totalSamples, 1UL), 1);
Serial.println("%)");
Serial.print("Missing Values: ");
Serial.print(missingSamples);
Serial.print(" (");
Serial.print(100.0 * missingSamples / max(totalSamples, 1UL), 1);
Serial.println("%)");
Serial.print("Imputed Values: ");
Serial.print(imputedSamples);
Serial.println();
Serial.print("Range Violations: ");
Serial.println(rangeViolations);
Serial.print("Rate Violations: ");
Serial.println(rateViolations);
Serial.println("\n--- Temperature Statistics ---");
Serial.print("Mean: ");
Serial.print(tempStats.mean(), 2);
Serial.print(" C, StdDev: ");
Serial.print(tempStats.stdDev(), 2);
Serial.print(" C, Range: [");
Serial.print(tempMinSeen, 1);
Serial.print(", ");
Serial.print(tempMaxSeen, 1);
Serial.println("] C");
Serial.println("\n--- Light Statistics ---");
Serial.print("Mean: ");
Serial.print(lightStats.mean(), 0);
Serial.print(" lux, StdDev: ");
Serial.print(lightStats.stdDev(), 0);
Serial.print(" lux, Range: [");
Serial.print(lightMinSeen, 0);
Serial.print(", ");
Serial.print(lightMaxSeen, 0);
Serial.println("] lux");
Serial.println("==============================================\n");
}
// ====================== SETUP AND LOOP ======================
void setup() {
Serial.begin(115200);
delay(1000);
// Initialize pins
pinMode(TEMP_PIN, INPUT);
pinMode(LIGHT_PIN, INPUT);
pinMode(DRIFT_PIN, INPUT);
pinMode(LED_RED, OUTPUT);
pinMode(LED_YELLOW, OUTPUT);
pinMode(LED_GREEN, OUTPUT);
pinMode(LED_BLUE, OUTPUT);
// All LEDs off initially
digitalWrite(LED_RED, LOW);
digitalWrite(LED_YELLOW, LOW);
digitalWrite(LED_GREEN, LOW);
digitalWrite(LED_BLUE, LOW);
// Seed random for missing data simulation
randomSeed(analogRead(0));
Serial.println("============================================");
Serial.println(" DATA QUALITY LAB: IoT Preprocessing Demo ");
Serial.println("============================================");
Serial.println("Features demonstrated:");
Serial.println(" - Range validation (physical bounds)");
Serial.println(" - Rate-of-change validation");
Serial.println(" - Z-Score outlier detection");
Serial.println(" - IQR outlier detection");
Serial.println(" - Median filter (spike removal)");
Serial.println(" - Moving average filter (smoothing)");
Serial.println(" - Exponential smoothing");
Serial.println(" - Missing value imputation (forward-fill)");
Serial.println(" - Min-Max normalization (0-1 scaling)");
Serial.println("============================================");
Serial.println("LED Indicators:");
Serial.println(" RED: Range/Rate violation");
Serial.println(" YELLOW: Outlier detected");
Serial.println(" GREEN: Clean valid data");
Serial.println(" BLUE: Missing/Imputed data");
Serial.println("============================================\n");
Serial.println("Starting data collection...\n");
}
unsigned long lastSampleTime = 0;
unsigned long lastStatsTime = 0;
const unsigned long STATS_INTERVAL_MS = 10000; // Print stats every 10 seconds
void loop() {
unsigned long currentTime = millis();
// Sample at defined interval
if (currentTime - lastSampleTime >= SAMPLE_INTERVAL_MS) {
lastSampleTime = currentTime;
// Read sensors
int tempADC = analogRead(TEMP_PIN);
int lightADC = analogRead(LIGHT_PIN);
int driftADC = analogRead(DRIFT_PIN);
// Add simulated drift from potentiometer (optional)
float driftFactor = (driftADC - 2048) / 2048.0; // -1 to +1
tempADC = constrain(tempADC + (int)(driftFactor * 500), 0, 4095);
// Process readings through data quality pipeline
SensorReading tempReading = processTemperature(tempADC, currentTime);
SensorReading lightReading = processLight(lightADC, currentTime);
// Update LED indicators
updateLEDs(tempReading, lightReading);
// Print readings
printSensorReading("TEMP", tempReading);
printSensorReading("LIGHT", lightReading);
Serial.println();
}
// Print statistics periodically
if (currentTime - lastStatsTime >= STATS_INTERVAL_MS) {
lastStatsTime = currentTime;
printStatistics();
}
}
44.4.6 Step-by-Step Instructions
44.4.6.1 Step 1: Set Up the Simulator
- Open the Wokwi simulator embedded above (or visit wokwi.com)
- Create a new ESP32 project
- Click the diagram.json tab and paste the circuit configuration
- Replace the default code with the complete Arduino code above
44.4.6.2 Step 2: Run and Observe Validation
- Click the Play button to start the simulation
- Open the Serial Monitor to see the data processing output
- Observe the quality flags showing validation status for each reading
- Watch for “CLEAN” flags indicating data passed all quality checks
44.4.6.3 Step 3: Trigger Range Violations
- Click the NTC temperature sensor in the simulator
- Drag the slider to extreme values (very hot or very cold)
- Watch the RED LED turn on when values exceed physical bounds
- Note the “RANGE_VIOLATION” flag in the serial output
- Observe how the system uses the last valid value when current is invalid
44.4.6.4 Step 4: Observe Outlier Detection
- Make sudden temperature changes by quickly dragging the sensor slider
- Watch the YELLOW LED blink when outliers are detected
- See the “ZSCORE_OUTLIER” and “IQR_OUTLIER” flags in output
- Notice how outliers are replaced with median values
44.4.6.5 Step 5: Observe Missing Data Handling
- The code simulates 5% random missing data
- Watch the BLUE LED flash when data is missing or imputed
- See “IMPUTED_FFILL” flags showing forward-fill imputation
- Note “MISSING_NO_IMPUTE” when too many consecutive values are missing
44.4.6.6 Step 6: Experiment with Filtering
- Observe the Raw vs Clean values in the serial output
- Notice how Clean values are smoother due to median + moving average filters
- Compare the normalized values (0-1 range) for multi-sensor comparison
44.4.6.7 Step 7: Analyze Statistics
- Wait for the statistics report (prints every 10 seconds)
- Review the data quality percentages: valid, outliers, missing
- Examine the sensor statistics: mean, standard deviation, range
- Consider how these metrics would inform production monitoring
44.4.7 Challenge Exercises
Challenge 1: Implement Median Absolute Deviation (MAD) Filtering
Difficulty: Intermediate
Task: The code defines a MAD_THRESHOLD constant but does not implement MAD outlier detection. Implement a detectOutlierMAD() function and integrate it into the processTemperature() pipeline as the primary outlier detection method.
Hints:
- MAD is more robust than Z-score for non-Gaussian data
- You will need to implement a detectOutlierMAD() function (modified Z-score = 0.6745 * (x - median) / MAD)
- Replace or complement the Z-score check with MAD
Expected Outcome: MAD should detect outliers even when extreme values skew the mean and standard deviation.
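The modified Z-score formula from the hint can be sketched in a few lines of Python before porting it to the Arduino sketch's `detectOutlierMAD()`. The function name and the 3.5 threshold (a conventional default for modified Z-scores) are illustrative choices, not part of the lab code:

```python
# Sketch of MAD-based outlier detection (modified Z-score).
# Illustrative Python; the challenge asks for an equivalent C++
# detectOutlierMAD() integrated into processTemperature().

def median(values):
    s = sorted(values)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2.0

def is_outlier_mad(window, x, threshold=3.5):
    """Flag x as an outlier if its modified Z-score exceeds threshold."""
    med = median(window)
    mad = median([abs(v - med) for v in window])
    if mad == 0:
        return False  # degenerate window: no spread to judge against
    modified_z = 0.6745 * (x - med) / mad
    return abs(modified_z) > threshold
```

Because both center (median) and spread (MAD) are medians, a single extreme value in the window barely moves them, which is exactly why this check survives the skew that breaks Z-score detection.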
Challenge 2: Add Exponential Backoff for Missing Data
Difficulty: Intermediate
Task: Currently, forward-fill uses a fixed limit (MAX_MISSING_SAMPLES). Implement an exponential decay on the confidence of imputed values.
Requirements:
- Add a confidence field to SensorReading
- Reduce confidence by 10% for each consecutive imputed value
- Stop imputing when confidence drops below 50%
- Display confidence in serial output
Expected Outcome: Imputed values should be flagged with decreasing confidence as gaps grow longer.
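Interpreting "reduce confidence by 10%" as multiplying by 0.9 per consecutive imputed sample (one reasonable reading of the requirement; the names below are illustrative), the stopping rule reduces to a one-line power:

```python
# Confidence-decay sketch for forward-fill imputation.
# Assumption: confidence starts at 1.0 for a real reading and is
# multiplied by 0.9 for each consecutive imputed sample.

IMPUTE_CONFIDENCE_FLOOR = 0.5

def imputation_confidence(consecutive_imputed):
    """Confidence of a forward-filled value after n consecutive imputations."""
    return 0.9 ** consecutive_imputed

def should_impute(consecutive_imputed):
    """Stop imputing once confidence drops below the 50% floor."""
    return imputation_confidence(consecutive_imputed) >= IMPUTE_CONFIDENCE_FLOOR
```

Under this rule, 0.9^6 ≈ 0.531 still passes but 0.9^7 ≈ 0.478 does not, so imputation stops after the seventh consecutive gap.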
Challenge 3: Cross-Sensor Validation
Difficulty: Advanced
Task: Add plausibility checking between temperature and light sensors. If it is very bright (high light), temperature should be reasonable for daytime.
Requirements:
- If light > 50000 lux and temperature < 10C, flag as suspicious
- If light < 100 lux and temperature > 35C (outdoors), flag as suspicious
- Add a new LED or serial indicator for cross-sensor anomalies
Expected Outcome: The system should detect when sensor readings are physically inconsistent with each other.
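The two rules reduce to a small predicate. The thresholds come straight from the requirements; the function name is an illustrative choice:

```python
# Cross-sensor plausibility sketch using the thresholds from the task.

def cross_sensor_suspicious(light_lux, temp_c):
    """Return True when light and temperature readings are physically inconsistent."""
    if light_lux > 50000 and temp_c < 10:
        return True   # bright daylight but near-freezing: implausible outdoors
    if light_lux < 100 and temp_c > 35:
        return True   # darkness but very hot: implausible outdoors
    return False
```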
Challenge 4: Implement Kalman Filter
Difficulty: Advanced
Task: Replace the exponential smoothing filter with a simple Kalman filter for temperature.
Requirements:
- Implement 1D Kalman filter with process noise and measurement noise
- Estimate the Kalman gain dynamically
- Output both the filtered value and the uncertainty estimate
Learning: Kalman filters provide optimal estimation when process and measurement noise characteristics are known.
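A 1D Kalman filter is compact enough to sketch here. This Python version ports almost line-for-line to the Arduino sketch; the process noise q and measurement noise r are assumed starting values you would tune for the actual sensor:

```python
# Minimal 1D Kalman filter sketch for a scalar temperature estimate.
# State model: temperature is a random walk (constant plus process noise).

class Kalman1D:
    def __init__(self, q=0.01, r=0.5, initial=25.0, initial_p=1.0):
        self.q = q          # process noise variance (assumed, tune per sensor)
        self.r = r          # measurement noise variance (assumed, tune per sensor)
        self.x = initial    # state estimate
        self.p = initial_p  # estimate uncertainty (variance)

    def update(self, measurement):
        # Predict: uncertainty grows by the process noise
        self.p += self.q
        # Update: Kalman gain balances prediction vs measurement trust
        k = self.p / (self.p + self.r)
        self.x += k * (measurement - self.x)
        self.p *= (1 - k)
        return self.x, self.p  # filtered value and its uncertainty
```

Unlike exponential smoothing's fixed alpha, the gain k here adapts: it is large when the estimate is uncertain and shrinks as confidence accumulates, and the returned p gives the uncertainty estimate the challenge asks for.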
44.4.8 Expected Outcomes
After completing this lab, you should be able to:
- Understand validation trade-offs: Strict validation catches more errors but may reject valid extreme readings
- Choose appropriate outlier methods: Z-score for Gaussian data, IQR/MAD for robust detection
- Select imputation strategies: Forward-fill for slow-changing, interpolation for trending data
- Apply noise filters correctly: Median for spikes, moving average for steady-state noise
- Normalize for fusion: Understand when to use min-max vs Z-score normalization
Quality Metrics to Observe:
- Valid sample rate should be >90% under normal conditions
- Outlier rate should be <5% for stable sensors
- Imputed values should maintain temporal continuity
Worked Example: Multi-Sensor Normalization for Anomaly Detection
Scenario: You’re building an anomaly detection system for a server room with 3 sensors: temperature (15-35°C), CO2 (400-5000 ppm), and humidity (30-70%). A neural network needs all inputs on the same scale to detect abnormal conditions.
Given:
- Temperature sensor: range 15-35°C, current reading 28°C
- CO2 sensor: range 400-5000 ppm, current reading 1200 ppm
- Humidity sensor: range 30-70%, current reading 55%
- Neural network requires inputs in 0-1 range
Question: Normalize these readings and explain why proper normalization matters for the neural network.
Solution:
Step 1: Calculate min-max normalization for each sensor
Temperature normalization:
normalized_temp = (28 - 15) / (35 - 15)
= 13 / 20
= 0.65
CO2 normalization:
normalized_co2 = (1200 - 400) / (5000 - 400)
= 800 / 4600
= 0.174
Humidity normalization:
normalized_humidity = (55 - 30) / (70 - 30)
= 25 / 40
= 0.625
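The arithmetic above can be double-checked with a tiny helper (`min_max` is an illustrative name, not part of the lab code):

```python
# Recomputing the three normalizations from the scenario.

def min_max(x, lo, hi):
    """Scale x from the range [lo, hi] into [0, 1]."""
    return (x - lo) / (hi - lo)

temp_n = min_max(28, 15, 35)       # temperature, 15-35 C
co2_n = min_max(1200, 400, 5000)   # CO2, 400-5000 ppm
hum_n = min_max(55, 30, 70)        # humidity, 30-70 %
```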
Step 2: Show the impact WITHOUT normalization
If we fed raw values to the neural network:
- Input vector: [28, 1200, 55]
- The CO2 value is roughly 43x larger than the temperature value
- Neural network gradient updates would be dominated by CO2
Example gradient calculation (simplified):
Cost function: J = (pred - actual)²
Gradient w.r.t. weight: ∂J/∂w = 2 × (pred - actual) × input_value
For temperature: gradient ∝ 28
For CO2: gradient ∝ 1200 (43x larger!)
For humidity: gradient ∝ 55
The network would learn to minimize CO2 error while ignoring temperature and humidity!
Step 3: Verify normalized inputs have equal influence
Normalized input vector: [0.65, 0.174, 0.625]
Now all gradients are on similar scales:
For temperature: gradient ∝ 0.65
For CO2: gradient ∝ 0.174
For humidity: gradient ∝ 0.625
Each sensor contributes equally to gradient updates during training.
Step 4: Calculate percentage of each sensor’s range
This helps interpret the normalized values:
- Temperature: 65% of its range (moderately warm)
- CO2: 17.4% of its range (relatively low, good ventilation)
- Humidity: 62.5% of its range (comfortable level)
Step 5: Detect an anomaly scenario
Normal condition: [0.65, 0.174, 0.625]
Anomaly (AC failure): [0.95, 0.350, 0.825]
- Temperature: 95% of range = 34°C (very hot!)
- CO2: 35% of range = 2010 ppm (rising, poor ventilation)
- Humidity: 82.5% of range = 63% (uncomfortable)
The neural network trained on normalized data can now detect this pattern as anomalous, because all three sensors contributed equally during training.
Key Insight: Without normalization, the neural network’s loss function is dominated by the largest-magnitude features. Min-max scaling ensures each sensor contributes proportionally to its information content, not its arbitrary measurement scale. This is why normalization is mandatory for neural networks, SVM, K-means, and any algorithm that uses distance metrics or gradient descent.
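The distance-metric point can be verified on the worked example's own vectors. Converting the anomaly back to raw units (34°C, 2010 ppm, 63%) and comparing Euclidean distances shows the unnormalized distance is almost entirely CO2:

```python
import math

# Raw-unit vs normalized distance between the normal and anomaly
# conditions from the worked example.

normal_raw = [28, 1200, 55]        # temp C, CO2 ppm, humidity %
anomaly_raw = [34, 2010, 63]       # AC-failure scenario in raw units

normal_norm = [0.65, 0.174, 0.625]
anomaly_norm = [0.95, 0.350, 0.825]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

raw_dist = euclidean(normal_raw, anomaly_raw)
norm_dist = euclidean(normal_norm, anomaly_norm)

# Fraction of the squared raw distance contributed by CO2 alone
co2_share = (2010 - 1200) ** 2 / raw_dist ** 2
```

CO2 accounts for more than 99.9% of the raw squared distance, so a raw-unit K-means or k-NN would effectively be a one-sensor model; after normalization all three sensors contribute comparably.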
44.4.9 Try It: Multi-Sensor Fusion Calculator
Adjust the sensor readings below to see how min-max normalization brings different scales into alignment:
Decision Framework: Normalization Method Selection
Choose the appropriate normalization method based on your data characteristics and downstream algorithm:
| Data Characteristic | Normalization Method | Output Range | Best For | Avoid When |
|---|---|---|---|---|
| Bounded range known (temperature, humidity) | Min-Max Scaling | 0 to 1 | Neural networks, image processing, bounded outputs | Outliers present (they compress valid range) |
| Outliers expected (sensor noise, network latency) | Robust Scaling | Median-centered | K-means, SVM, any distance-based method | Need exact 0-1 bounds |
| Gaussian distribution (many natural phenomena) | Z-Score Normalization | Mean=0, Std=1 | Clustering, PCA, algorithms assuming normal distribution | Binary features (0/1) |
| Exponential/power-law (network traffic, wealth) | Log Transform then Z-Score | Variable | Right-skewed data, multiplicative relationships | Zero or negative values present |
| Mixed data types (some outliers + some bounded) | Hybrid: Robust for outlier features, Min-Max for clean | Variable per feature | Real-world messy datasets | Need uniform scaling method |
Decision Tree:
1. Are there extreme outliers (>5% of values beyond 3σ)?
   - YES → Use Robust Scaling (median + IQR)
   - NO → Continue to step 2
2. Is your algorithm neural-network-based?
   - YES → Use Min-Max Scaling (0-1) for activation function compatibility
   - NO → Continue to step 3
3. Does your algorithm assume Gaussian distribution?
   - YES (PCA, LDA) → Use Z-Score Normalization
   - NO → Continue to step 4
4. Is your data heavily right-skewed (long tail)?
   - YES → Log Transform + Z-Score
   - NO → Default to Min-Max Scaling
Example Python Implementation (using scikit-learn scalers; StandardScaler provides Z-score normalization, and FunctionTransformer stands in for the log step):
import numpy as np
from sklearn.preprocessing import (
    RobustScaler, MinMaxScaler, StandardScaler, FunctionTransformer
)

def select_normalizer(data_characteristics):
    """
    Select appropriate normalization based on data characteristics.
    Returns: (normalizer_class, parameters)
    """
    has_outliers = data_characteristics['outlier_rate'] > 0.05
    is_neural_network = data_characteristics['model_type'] == 'neural_network'
    is_gaussian = data_characteristics['distribution'] == 'normal'
    is_skewed = data_characteristics['skewness'] > 2.0
    if has_outliers:
        return (RobustScaler, {})
    elif is_neural_network:
        return (MinMaxScaler, {'feature_range': (0, 1)})
    elif is_gaussian:
        return (StandardScaler, {})  # Z-score normalization
    elif is_skewed:
        # Log transform; follow with StandardScaler for the Z-score step
        return (FunctionTransformer, {'func': np.log1p})
    else:
        return (MinMaxScaler, {'feature_range': (0, 1)})

# Usage example
characteristics = {
    'outlier_rate': 0.08,    # 8% outliers
    'model_type': 'kmeans',
    'distribution': 'unknown',
    'skewness': 1.2
}
normalizer_class, params = select_normalizer(characteristics)
# Returns: RobustScaler (because outlier_rate > 0.05)
Warning Signs of Wrong Normalization:
- Neural network accuracy plateaus at 60% → Inputs not normalized
- Clustering groups all high-magnitude features together → Need Z-score instead of raw
- Model ignores certain sensors → Their raw ranges are too small (need min-max)
- Training loss explodes after first epoch → Gradients too large (need normalization)
Common Mistake: Normalizing Before Train/Test Split
The Mistake: Calculating normalization parameters (min, max, mean, std) on the entire dataset before splitting into train and test sets. This causes data leakage, where the test set’s statistics influence the training process, leading to overly optimistic performance estimates.
Why It Happens: The normalization step feels like “data preparation” rather than “model training,” so developers apply it before the split. Many tutorials skip this detail. Scikit-learn’s fit_transform() makes it easy to accidentally normalize everything at once.
Example of the Problem:
# WRONG: Normalize first, split second
data = load_sensor_data() # 10,000 samples
normalized = MinMaxScaler().fit_transform(data) # Uses ALL data stats!
train, test = train_test_split(normalized, test_size=0.2)
# The test set's min/max values influenced the scaling parameters!
# Model evaluation is now too optimistic.
Why This Is Wrong:
- Data Leakage: Test set statistics “leak” into training through normalization parameters
- Overfitting: Model appears to generalize better than it actually does
- Production Failure: Real-world data has different min/max than training data
Real-World Example:
Imagine temperature sensor data:
- Training period (winter): 15-25°C
- Test period (summer): 20-35°C
WRONG approach:
# Calculate on ALL data (winter + summer)
overall_min = 15°C, overall_max = 35°C
# Normalize training data using overall stats
train_normalized = (train - 15) / (35 - 15)
# Result: Training temps (15-25°C) map to 0.0-0.5
# Normalize test data using SAME overall stats
test_normalized = (test - 15) / (35 - 15)
# Result: Test temps (20-35°C) map to 0.25-1.0
# Model trained on 0.0-0.5 range, tested on 0.25-1.0 range
# Test performance appears good because model "saw" summer data during normalization!
The Fix: Fit normalization ONLY on training data, then apply to test:
# CORRECT: Split first, normalize second
data = load_sensor_data()
train, test = train_test_split(data, test_size=0.2) # Split FIRST
# Fit normalization on training data ONLY
scaler = MinMaxScaler()
scaler.fit(train) # Learns min=15, max=25 from training (winter)
# Apply fitted scaler to both train and test
train_normalized = scaler.transform(train)
test_normalized = scaler.transform(test) # Summer data (20-35°C) may exceed [0,1]!
# This is CORRECT - test data reflecting real-world distribution
Correct Pipeline Order:
1. Split data into train/validation/test sets (e.g., 70/15/15)
2. Fit normalization parameters on TRAINING data only
3. Transform train, validation, and test sets using those parameters
4. Train model on normalized training data
5. Evaluate on normalized validation/test data
Code Template:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
# 1. Split FIRST
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5)
# 2. Fit normalizer on TRAINING data only
scaler = MinMaxScaler()
scaler.fit(X_train) # Only training data!
# 3. Transform ALL sets using training parameters
X_train_norm = scaler.transform(X_train)
X_val_norm = scaler.transform(X_val)
X_test_norm = scaler.transform(X_test)
# 4. Train model
model.fit(X_train_norm, y_train)
# 5. Evaluate (no data leakage!)
val_score = model.score(X_val_norm, y_val)
test_score = model.score(X_test_norm, y_test)
Warning Signs of This Mistake:
- Test accuracy is suspiciously high (>95%) on first try
- Model performance degrades significantly in production
- Test set values occasionally exceed [0, 1] after normalization (this is actually GOOD - means no leakage!)
- Reviewer asks “when did you fit the scaler?” and you can’t answer clearly
Real-World Impact: A smart building energy prediction model achieved 96% test accuracy during development but only 73% accuracy in production. Root cause: normalization was fit on the entire year’s data (including test set), so the model “saw” summer peak loads during training. In production, the next summer’s peak loads were outside the normalized range, causing poor predictions.
Common Pitfalls
1. Fitting normalisation parameters on the full dataset before train/test split
Computing Min-Max bounds or Z-score mean/std on all data before splitting leaks test set information into training. Always fit normalisation parameters on the training set only and apply them to both train and test sets.
2. Applying the same normalisation to all sensor types
Binary status signals (0/1), count data, and continuous physical measurements require different normalisation strategies. Normalise each channel according to its distribution, not uniformly.
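Per-channel normalization can be a simple dispatch. The channel names, the 15-35°C bound, and the strategy assignments below are illustrative assumptions, not lab code:

```python
import math

# Per-channel normalization sketch: each sensor type gets the strategy
# suited to its distribution instead of one global scaler.

def normalize_record(record):
    """Normalize one multi-sensor record channel by channel."""
    return {
        'door_open': record['door_open'],                  # binary flag: leave as 0/1
        'temp_c': (record['temp_c'] - 15) / (35 - 15),     # bounded physical value: min-max
        'event_count': math.log1p(record['event_count']),  # count data: log compression
    }
```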
3. Normalising labels (target variables)
In regression tasks, some implementations accidentally normalise the prediction target along with features, causing the model to predict normalised units that require inverse transformation. Keep target variables in their original units unless the algorithm specifically requires otherwise.
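When an algorithm genuinely requires a scaled target, keep the fitted parameters so predictions can be mapped back to original units. A hand-rolled stand-in for scikit-learn's fit/transform/inverse_transform pattern (the class name is illustrative):

```python
# Sketch: scale a regression target while retaining the parameters
# needed to convert predictions back to real units.

class TargetScaler:
    def fit(self, y):
        self.lo, self.hi = min(y), max(y)
        return self

    def transform(self, y):
        return [(v - self.lo) / (self.hi - self.lo) for v in y]

    def inverse_transform(self, y_scaled):
        # Undo the scaling so predictions come back in original units
        return [v * (self.hi - self.lo) + self.lo for v in y_scaled]
```

The inverse step is the part implementations forget: without it, a model trained on scaled targets reports predictions in meaningless 0-1 units.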
4. Not re-normalising when new sensors are added
Adding a new sensor channel to an existing normalised dataset requires recomputing normalisation parameters. Hardcoded normalisation bounds from the initial dataset will not accommodate the new sensor’s value range.
44.5 Summary
Data normalization and scaling complete the data quality preprocessing pipeline:
- Min-Max Scaling: Transforms data to 0-1 range, ideal for neural networks and bounded outputs
- Z-Score Normalization: Centers data around mean with unit variance, best for clustering and SVM
- Robust Scaling: Uses median and IQR, resistant to outliers
- Log Transform: Compresses right-skewed data spanning orders of magnitude
- Complete Pipeline: Validation -> Cleaning -> Transformation, all implementable on edge devices
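The four methods in the summary can be sketched as minimal functions. Applying them to a window containing one stuck-sensor spike shows the behavior described above: min-max compresses all valid readings toward zero, while robust scaling leaves them usable. The sample data and quartile convention (median of each sorted half) are illustrative choices:

```python
import math

# The four scaling methods from the summary, as minimal sketches.

def min_max_scale(xs):
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def z_score_scale(xs):
    mean = sum(xs) / len(xs)
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))  # population std
    return [(x - mean) / std for x in xs]

def _median(xs):
    s = sorted(xs)
    n = len(s)
    return s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2

def robust_scale(xs):
    med = _median(xs)
    s = sorted(xs)
    q1 = _median(s[: len(s) // 2])       # median of lower half
    q3 = _median(s[-(len(s) // 2):])     # median of upper half
    return [(x - med) / (q3 - q1) for x in xs]

def log_scale(xs):
    return [math.log1p(x) for x in xs]   # log1p handles zeros safely

readings = [20, 21, 22, 21, 20, 22, 21, 999]  # one stuck-sensor outlier
```

On this sample, min-max maps every valid reading below 0.003 because the 999 spike becomes the max, whereas robust scaling keeps valid readings within about ±0.7 and pushes the spike to a huge score where any outlier check will catch it.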
Critical Design Principle: The complete “validate-clean-transform” pipeline should run at the edge. Catching data quality issues at the source costs 1% of fixing them in the cloud, and normalized data enables fair multi-sensor fusion.
For Kids: Meet the Sensor Squad!
Normalization is like making sure everyone on the team speaks the same language!
44.5.1 The Sensor Squad Adventure: The Unfair Contest
The Sensor Squad was having a “Who Detected the Most?” contest, but something was NOT fair!
Temperature Terry reported: “I measured 25 degrees today!” Light Lucy shouted: “Well, I measured FIFTY THOUSAND lux! I WIN!” Terry looked sad. “But… 25 degrees is actually really important too…”
Max the Microcontroller stepped in: “Wait! This contest is not fair! Terry’s numbers go from 0 to 50, but Lucy’s numbers go from 0 to 100,000. We need to put everyone on the SAME SCALE!”
So Max invented three ways to make it fair:
Method 1 – The Percentage Trick (Min-Max): “Turn everything into a percentage of its range!” Terry’s 25 out of 50 max = 50%. Lucy’s 50,000 out of 100,000 max = 50%. “We are actually EQUAL!” they both cheered.
Method 2 – The “How Unusual?” Trick (Z-Score): “How far is each reading from normal?” Terry’s normal is 20 degrees, so 25 is a little above normal. Lucy’s normal is 40,000 lux, so 50,000 is also a little above normal. Both scored about the same “unusualness.”
Method 3 – The Tough Cookie Trick (Robust): Pressure Pete had a WEIRD reading of 9999 that messed up the averages. “Use the MIDDLE value instead of the average!” said Bella the Battery. “The middle value ignores crazy numbers!”
Now ALL the sensors were on the same scale, and the contest was fair! “Normalization makes sure no sensor gets more attention just because it uses BIGGER numbers!” explained Max.
44.5.2 Key Words for Kids
| Word | What It Means |
|---|---|
| Normalization | Putting different measurements on the same scale so they can be compared fairly |
| Min-Max | Turning a number into a percentage between 0 and 1 |
| Z-Score | Measuring how far a number is from “normal” |
| Robust | A method that still works well even when some numbers are wrong |
| Scale | The range of numbers a sensor uses (like 0-50 or 0-100,000) |
Key Takeaway
Data normalization is essential before combining multi-sensor data or feeding it to machine learning models. Choose your method based on data characteristics: min-max for bounded outputs, Z-score for Gaussian assumptions, robust scaling when outliers are present. Always fit normalization parameters on training data only to avoid data leakage. Implement the complete validate-clean-transform pipeline at the edge to catch data quality issues at the source.
44.6 Concept Relationships
This chapter builds on validation and cleaning while introducing normalization as the final pipeline stage:
Prerequisites (Must understand first):
- Data Validation and Outlier Detection - Range checks and outlier detection must occur before normalization to avoid skewing scaling parameters with invalid data
- Missing Value Imputation and Noise Filtering - Cleaning must precede normalization; otherwise, gaps and noise corrupt min/max or mean/std calculations
Related Concepts (Enhance understanding):
- Multi-Sensor Data Fusion - Normalization enables fair comparison when fusing sensors with vastly different measurement ranges (temperature vs light intensity)
- Edge Data Acquisition - Edge devices normalize locally to reduce transmission bandwidth and prepare data for edge ML inference
Advanced Applications (Build on this):
- Modeling and Inferencing - Neural networks require normalized inputs ([0,1] or mean=0, std=1) for stable gradient descent
- Anomaly Detection - Z-score normalization makes distance-based anomaly detection work across features with different scales
Key Insight: Normalization is the final stage of the validate-clean-transform pipeline. Applying it before validation or cleaning causes incorrect scaling parameters (e.g., min-max uses outlier as max value, compressing all valid data into a tiny range).
44.7 What’s Next
| If you want to… | Read this |
|---|---|
| Understand imputation techniques that precede normalisation | Data Quality Imputation and Filtering |
| Study the full preprocessing pipeline | Data Quality and Preprocessing |
| Apply normalised data to ML model training | Modeling and Inferencing |
| Understand data validation before normalisation | Data Quality Validation |
| Return to the module overview | Big Data Overview |
See Also
Data Quality Series:
- Data Quality and Preprocessing - Overview and index
- Data Validation and Outlier Detection - Validation and outliers
- Missing Value Imputation and Noise Filtering - Handling gaps and noise
Practical Applications:
- Multi-Sensor Data Fusion - Combining preprocessed data
- Stream Processing - Real-time data pipelines
- Edge Compute Patterns - Distributed preprocessing architectures
Advanced Topics:
- Modeling and Inferencing - ML model deployment with quality data
- Anomaly Detection - Finding meaningful outliers in clean data