1308  Data Normalization and Preprocessing Lab

1308.1 Learning Objectives

By the end of this chapter, you will be able to:

  • Normalize and Scale Data: Apply min-max scaling, Z-score normalization, and other techniques for multi-sensor data fusion
  • Choose Normalization Methods: Select appropriate scaling based on downstream use case (neural networks, clustering, visualization)
  • Implement a Complete Pipeline: Build an end-to-end data quality system on an ESP32 microcontroller
  • Monitor Data Quality Metrics: Track validation rates, outlier counts, and imputation statistics

1308.2 Prerequisites

Before diving into this chapter, you should be familiar with:

Imagine comparing apples and elephants. Temperature might range from -10 to 50 degrees, while light levels range from 0 to 100,000 lux. If you feed both to a machine learning model without normalizing, the model may treat light as roughly 2,000 times more important simply because the numbers are bigger!

Normalization puts all sensors on the same scale:

| Sensor | Before Normalization | After Min-Max (0-1) |
|---|---|---|
| Temperature | 25C | 0.58 |
| Light | 50,000 lux | 0.50 |

Now both contribute equally based on their actual information content, not their arbitrary unit scales.
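
The numbers in the table can be reproduced with a few lines of Python. This is a minimal sketch that assumes the sensor ranges quoted above (-10 to 50 degrees for temperature, 0 to 100,000 lux for light); the function name `min_max` is illustrative only.

```python
def min_max(value, lo, hi):
    """Scale value from the range [lo, hi] to [0, 1]."""
    return (value - lo) / (hi - lo)

# Assumed sensor ranges, taken from the example above
temp_scaled = min_max(25.0, -10.0, 50.0)          # (25 + 10) / 60
light_scaled = min_max(50_000.0, 0.0, 100_000.0)  # 50,000 / 100,000

print(round(temp_scaled, 2))   # 0.58
print(round(light_scaled, 2))  # 0.5
```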

When to use different methods:

| Method | Use When | Output Range |
|---|---|---|
| Min-Max | Need bounded outputs (neural networks) | 0 to 1 |
| Z-Score | Using K-means/SVM; outliers should retain their influence | Mean=0, StdDev=1 |
| Robust | Many outliers you want to ignore | Median-centered |
| Log | Data spans orders of magnitude | Compressed scale |

Key question this chapter answers: “How do I prepare multi-sensor data so it can be combined and analyzed fairly?”

Tip: Minimum Viable Understanding: Data Normalization

Core Concept: Normalization transforms sensor readings to a common scale, enabling fair comparison and combination of data from sensors with vastly different measurement ranges.

Why It Matters: Without normalization, a light sensor reading 100,000 lux would dominate a temperature sensor reading 25C in any magnitude-based analysis, even if both are equally informative. Many machine learning models and statistical methods, particularly distance-based and gradient-based ones, work best when features are on comparable scales.

Key Takeaway: Use min-max scaling (0-1) for neural networks and bounded outputs; use Z-score normalization for clustering and when outliers should retain their influence; use robust scaling when outliers should be dampened.
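
To make the "dominates" claim concrete, the sketch below compares two made-up readings using Euclidean distance, before and after min-max scaling. The readings, the assumed sensor ranges, and the helper names are all hypothetical and chosen only for illustration.

```python
def min_max(value, lo, hi):
    """Scale value from the range [lo, hi] to [0, 1]."""
    return (value - lo) / (hi - lo)

def light_share(a, b):
    """Fraction of the squared Euclidean distance contributed by the light axis."""
    d_temp = (a[0] - b[0]) ** 2
    d_light = (a[1] - b[1]) ** 2
    return d_light / (d_temp + d_light)

# Hypothetical readings as (temperature in C, light in lux):
# a large temperature change and a small light change.
a = (20.0, 50_000.0)
b = (45.0, 50_500.0)

# Assumed sensor ranges for scaling
TEMP_RANGE = (-10.0, 50.0)
LIGHT_RANGE = (0.0, 100_000.0)

def scale(reading):
    return (min_max(reading[0], *TEMP_RANGE), min_max(reading[1], *LIGHT_RANGE))

raw_share = light_share(a, b)                     # light dominates the raw distance
scaled_share = light_share(scale(a), scale(b))    # temperature dominates after scaling

print(f"light share of raw distance:    {raw_share:.4f}")
print(f"light share of scaled distance: {scaled_share:.4f}")
```

On the raw values, the 500 lux difference accounts for almost all of the distance even though the temperature moved by 25 degrees; after scaling, the temperature change carries nearly all of it.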

1308.3 Data Normalization and Scaling

  • ~10 min | Intermediate | P10.C09.U06

1308.3.1 Min-Max Scaling

Scales data to a fixed range (typically 0-1):

class MinMaxScaler:
    def __init__(self, feature_range=(0, 1)):
        self.min_val = None
        self.max_val = None
        self.feature_min, self.feature_max = feature_range

    def fit(self, data):
        """Learn min/max from training data"""
        self.min_val = min(data)
        self.max_val = max(data)

    def transform(self, value):
        """Scale a single value"""
        if self.max_val == self.min_val:
            return self.feature_min

        scaled = (value - self.min_val) / (self.max_val - self.min_val)
        return scaled * (self.feature_max - self.feature_min) + self.feature_min

    def inverse_transform(self, scaled_value):
        """Convert back to original scale"""
        original = (scaled_value - self.feature_min) / (self.feature_max - self.feature_min)
        return original * (self.max_val - self.min_val) + self.min_val
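
One behavior of min-max scaling is worth seeing in numbers: a value outside the fitted range maps outside 0-1. The sketch below is a standalone, function-based variant of the scaler above; the `clip` option is an assumption added for illustration and is not part of the class shown in this section.

```python
def fit_min_max(data):
    """Return the (min, max) learned from training data."""
    return min(data), max(data)

def transform(value, lo, hi, clip=False):
    """Scale to 0-1; optionally clamp values outside the fitted range."""
    if hi == lo:
        return 0.0
    scaled = (value - lo) / (hi - lo)
    if clip:
        scaled = max(0.0, min(1.0, scaled))
    return scaled

training = [18.0, 20.0, 22.0, 24.0, 26.0]   # hypothetical training temperatures
lo, hi = fit_min_max(training)

in_range = transform(22.0, lo, hi)              # inside the fitted range
above = transform(30.0, lo, hi)                 # hotter than anything seen in training
clipped = transform(30.0, lo, hi, clip=True)    # clamped to the upper bound

print(in_range, above, clipped)  # 0.5 1.5 1.0
```

Whether to clip is a design choice: clipping keeps downstream inputs bounded, but it also hides the fact that the sensor has left its training range.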

1308.3.2 Z-Score Normalization (Standardization)

Centers data at zero mean and scales it to unit variance:

import numpy as np

class ZScoreNormalizer:
    def __init__(self):
        self.mean = None
        self.std = None

    def fit(self, data):
        """Learn mean and std from training data"""
        self.mean = np.mean(data)
        self.std = np.std(data)  # population std (ddof=0), NumPy's default

    def transform(self, value):
        """Normalize a single value"""
        if self.std == 0:
            return 0.0
        return (value - self.mean) / self.std

    def inverse_transform(self, normalized_value):
        """Convert back to original scale"""
        return normalized_value * self.std + self.mean
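
Because the mean and standard deviation are both computed from every sample, a single extreme reading changes the Z-score of every other value. The sketch below illustrates this with hypothetical temperature data, using the standard library's `statistics` module rather than NumPy (`pstdev` is the population standard deviation, matching NumPy's default).

```python
from statistics import fmean, pstdev

def z_score(value, data):
    """Z-score of value relative to the mean and population std of data."""
    std = pstdev(data)
    if std == 0:
        return 0.0
    return (value - fmean(data)) / std

clean = [20.0, 21.0, 22.0, 23.0, 24.0]   # hypothetical readings
with_outlier = clean + [100.0]           # same data plus one faulty spike

z_clean = z_score(24.0, clean)                 # about +1.41
z_with_outlier = z_score(24.0, with_outlier)   # about -0.38

print(round(z_clean, 2), round(z_with_outlier, 2))
```

The same 24-degree reading moves from well above average to below average once the spike is included, which is why heavily contaminated data is usually better served by a median-based method.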

1308.3.3 Robust Scaling

Uses median and IQR, robust to outliers:

import numpy as np

class RobustScaler:
    def __init__(self):
        self.median = None
        self.iqr = None

    def fit(self, data):
        """Learn median and IQR from training data"""
        self.median = np.median(data)
        q1 = np.percentile(data, 25)  # first quartile
        q3 = np.percentile(data, 75)  # third quartile
        self.iqr = q3 - q1

    def transform(self, value):
        """Scale using median and IQR"""
        if self.iqr == 0:
            return 0.0
        return (value - self.median) / self.iqr

    def inverse_transform(self, scaled_value):
        """Convert back to original scale"""
        return scaled_value * self.iqr + self.median
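
The effect of using the median and IQR can be checked on a small hypothetical data set with one faulty spike. This sketch uses the standard library instead of NumPy; `quantiles(..., method="inclusive")` interpolates linearly between data points, which should agree with the default behavior of `np.percentile`.

```python
from statistics import median, quantiles

def fit_robust(data):
    """Return (median, IQR) for the data."""
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    return median(data), q3 - q1

def robust_scale(value, med, iqr):
    """Scale a value by the fitted median and IQR."""
    return 0.0 if iqr == 0 else (value - med) / iqr

clean = [20.0, 21.0, 22.0, 23.0, 24.0]   # hypothetical readings
with_outlier = clean + [100.0]           # same data plus one faulty spike

scaled_clean = robust_scale(24.0, *fit_robust(clean))           # median 22, IQR 2
scaled_dirty = robust_scale(24.0, *fit_robust(with_outlier))    # median 22.5, IQR 2.5

print(round(scaled_clean, 2), round(scaled_dirty, 2))  # 1.0 0.6
```

The spike shifts the scaled value of the 24-degree reading only modestly, from 1.0 to 0.6, because neither the median nor the quartiles move far.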

1308.3.4 When to Use Each Method

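| Method | Use Case | Preserves | Sensitive To |
|---|---|---|---|
| Min-Max | Neural networks, bounded outputs | Distribution shape | Outliers (they set the min/max) |
| Z-Score | SVM, K-means, when outliers should keep their influence | Relative distances | Extreme outliers (they shift the mean and std) |
| Robust Scaling | When outliers should not affect the scale | Median-centered spread | Little (degenerate only when IQR = 0) |
| Log Transform | Right-skewed data (power, counts) | Multiplicative relationships | Zero/negative values |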
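
The log transform is the only method in the table without code in this section, so here is a minimal sketch. It assumes non-negative readings and uses `math.log1p` (the natural log of 1 + x) so that a reading of zero is still defined; the power values are hypothetical and the result is then min-max scaled.

```python
import math

def log_then_min_max(values):
    """Apply log1p to compress the scale, then min-max scale to 0-1."""
    logged = [math.log1p(v) for v in values]   # raises ValueError for v <= -1
    lo, hi = min(logged), max(logged)
    if hi == lo:
        return [0.0 for _ in logged]
    return [(x - lo) / (hi - lo) for x in logged]

# Hypothetical power readings spanning four orders of magnitude
power_mw = [0.0, 9.0, 99.0, 999.0, 9999.0]

scaled = log_then_min_max(power_mw)
plain = [(p - min(power_mw)) / (max(power_mw) - min(power_mw)) for p in power_mw]

print([round(s, 2) for s in scaled])  # evenly spread: 0.0, 0.25, 0.5, 0.75, 1.0
print([round(p, 2) for p in plain])   # squeezed toward 0 without the log step
```

Plain min-max leaves the first four readings packed into the bottom tenth of the range, while the log step spreads them evenly.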

1308.3.5 Normalization Decision Tree

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ecf0f1'}}}%%
flowchart TD
    Start[Choose Normalization<br/>Method] --> Outliers{Significant<br/>Outliers?}

    Outliers -->|Yes| Robust[Robust Scaling<br/>Median/IQR]
    Outliers -->|No| Downstream{Downstream<br/>Use Case?}

    Downstream -->|Neural Network| MinMax[Min-Max<br/>0 to 1]
    Downstream -->|Clustering/SVM| ZScore[Z-Score<br/>Mean=0, Std=1]
    Downstream -->|Visualization| MinMax

    Start --> Skewed{Data<br/>Distribution?}
    Skewed -->|Right-Skewed| Log[Log Transform<br/>Then Normalize]
    Skewed -->|Normal| Continue[Continue to<br/>Other Checks]

    style Start fill:#2C3E50,stroke:#16A085,color:#fff
    style MinMax fill:#27AE60,stroke:#2C3E50,color:#fff
    style ZScore fill:#16A085,stroke:#2C3E50,color:#fff
    style Robust fill:#E67E22,stroke:#2C3E50,color:#fff
    style Log fill:#9B59B6,stroke:#2C3E50,color:#fff
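
The decision tree can also be written as a small function. The sketch below is one possible reading of the diagram, in which the skew check runs first and the outlier and use-case checks follow; the function name and the string labels are illustrative assumptions.

```python
def choose_normalization(significant_outliers, use_case, right_skewed=False):
    """Return the ordered list of preprocessing steps suggested by the diagram."""
    steps = []
    if right_skewed:
        steps.append("log transform")       # "Log Transform, Then Normalize"
    if significant_outliers:
        steps.append("robust scaling")      # median/IQR
    elif use_case in ("neural network", "visualization"):
        steps.append("min-max")             # 0 to 1
    elif use_case in ("clustering", "svm"):
        steps.append("z-score")             # mean=0, std=1
    else:
        raise ValueError(f"unknown use case: {use_case!r}")
    return steps

print(choose_normalization(True, "clustering"))                 # ['robust scaling']
print(choose_normalization(False, "neural network"))            # ['min-max']
print(choose_normalization(False, "svm", right_skewed=True))    # ['log transform', 'z-score']
```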

1308.4 Data Quality Lab: ESP32 Wokwi Simulation

  • ~45 min | Advanced | P10.C09.LAB

1308.4.1 Lab Overview

In this hands-on lab, you will implement a complete data quality preprocessing pipeline on an ESP32 microcontroller. The simulation demonstrates real-world techniques for handling sensor data problems including outliers, missing values, noise, and the need for normalization.

What You Will Learn:

  1. Sensor Data Validation: Implementing range checks and rate-of-change validation
  2. Outlier Detection: Using Z-score and IQR methods to identify anomalous readings
  3. Missing Value Handling: Forward-fill and interpolation techniques for gap handling
  4. Noise Filtering: Moving average, median filter, and exponential smoothing
  5. Data Normalization: Min-max scaling and Z-score normalization for multi-sensor fusion
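
Before turning to the embedded code, the first item on this list (range and rate-of-change validation) can be prototyped on a desktop in a few lines of Python. This is a rough sketch only: the thresholds (-10 to 60 C, at most 2 C per second) mirror the constants used in the lab code, while the readings and function names are hypothetical.

```python
import math

def validate_range(value, lo, hi):
    """Reject NaN/Inf and values outside the physical bounds."""
    if math.isnan(value) or math.isinf(value):
        return False
    return lo <= value <= hi

def validate_rate(value, last_value, dt_seconds, max_rate):
    """Reject jumps faster than max_rate units per second."""
    if last_value is None:
        return True               # first reading has nothing to compare against
    if dt_seconds <= 0:
        return False
    return abs(value - last_value) / dt_seconds <= max_rate

readings = [22.0, 22.1, 25.0, 22.2, 85.0]   # hypothetical samples, 0.2 s apart
results = []
last_value, last_time = None, None
for i, value in enumerate(readings):
    t = i * 0.2
    dt = 0.0 if last_time is None else t - last_time
    ok = validate_range(value, -10.0, 60.0) and validate_rate(value, last_value, dt, 2.0)
    results.append(ok)
    if ok:                        # only accepted readings become the new reference
        last_value, last_time = value, t

print(results)  # [True, True, False, True, False]
```

The third reading fails the rate check (2.9 C in 0.2 s) and the last fails the range check; the rate is always measured against the last accepted reading, not the last raw one.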

Skills Practiced:

  • Real-time data processing on embedded systems
  • Statistical calculations with limited memory
  • Circular buffer implementations
  • Streaming algorithm design
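
As an illustration of streaming statistics with limited memory, the sketch below uses Welford's online algorithm, which keeps only three numbers regardless of how many samples arrive. Note that this is a swapped-in alternative shown for comparison: the lab code later in this section accumulates a running sum and sum of squares instead. The class and method names here are hypothetical.

```python
import math

class RunningStats:
    """Welford's online mean/variance: O(1) memory, one pass."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0   # sum of squared deviations from the running mean

    def add(self, value):
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    def variance(self):
        """Sample variance (divides by count - 1)."""
        return self._m2 / (self.count - 1) if self.count > 1 else 0.0

    def std_dev(self):
        return math.sqrt(self.variance())

stats = RunningStats()
for reading in [20.0, 21.0, 22.0, 23.0, 24.0]:   # hypothetical readings
    stats.add(reading)

print(stats.mean, stats.variance())  # mean 22.0, sample variance 2.5
```

Welford's update avoids subtracting two large, nearly equal numbers, which matters on a microcontroller that computes in 32-bit floats.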

1308.4.2 Lab Components

| Technique | Implementation | Visual Indicator |
|---|---|---|
| Range Validation | Physical bounds checking | Red LED on violation |
| Z-Score Outliers | Rolling statistics | Yellow LED on outlier |
| Median Filter | 5-sample sliding window | Smoothed output in serial |
| Moving Average | 10-sample window | Trend visualization |
| Normalization | 0-1 scaling | Percentage output |
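
To see why the lab uses both a median filter and a moving average, the sketch below runs the two filters over a hypothetical signal containing a single spike. The window sizes follow the table (5 and 10 samples); the helper name `sliding` and the data are illustrative.

```python
from collections import deque
from statistics import fmean, median

def sliding(values, window, reducer):
    """Apply reducer to a sliding window of the most recent samples."""
    buf = deque(maxlen=window)
    out = []
    for v in values:
        buf.append(v)
        out.append(reducer(buf))
    return out

readings = [22.0, 22.0, 22.0, 80.0, 22.0, 22.0, 22.0]   # one spike at index 3

median_out = sliding(readings, 5, median)    # spike never reaches the output
average_out = sliding(readings, 10, fmean)   # spike is smeared across later samples

print(median_out)
print([round(a, 1) for a in average_out])
```

The median filter returns 22.0 for every sample, while the moving average jumps to 36.5 at the spike and stays elevated afterwards. This is the reason for removing spikes with the median filter first and smoothing the remaining noise second.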

1308.4.3 Wokwi Simulator

Use the Wokwi simulator to build your data quality preprocessing system, following the circuit setup and code in the next sections.

1308.4.4 Circuit Setup

Connect the sensors and indicators to the ESP32:

| Component | ESP32 Pin | Purpose |
|---|---|---|
| Temperature Sensor (NTC) | GPIO 34 | Primary data source |
| Light Sensor (LDR) | GPIO 35 | Secondary data source |
| Potentiometer | GPIO 32 | Simulate sensor drift |
| Red LED | GPIO 18 | Range violation indicator |
| Yellow LED | GPIO 19 | Outlier detection indicator |
| Green LED | GPIO 21 | Valid data indicator |
| Blue LED | GPIO 22 | Missing data indicator |

Add this diagram.json configuration in Wokwi:

{
  "version": 1,
  "author": "IoT Class - Data Quality Lab",
  "editor": "wokwi",
  "parts": [
    { "type": "wokwi-esp32-devkit-v1", "id": "esp", "top": 0, "left": 0 },
    { "type": "wokwi-ntc-temperature-sensor", "id": "temp1", "top": -120, "left": 80 },
    { "type": "wokwi-photoresistor-sensor", "id": "ldr1", "top": -120, "left": 180 },
    { "type": "wokwi-potentiometer", "id": "pot1", "top": -120, "left": 280 },
    { "type": "wokwi-led", "id": "led_red", "top": 180, "left": 80, "attrs": { "color": "red" } },
    { "type": "wokwi-led", "id": "led_yellow", "top": 180, "left": 130, "attrs": { "color": "yellow" } },
    { "type": "wokwi-led", "id": "led_green", "top": 180, "left": 180, "attrs": { "color": "green" } },
    { "type": "wokwi-led", "id": "led_blue", "top": 180, "left": 230, "attrs": { "color": "blue" } },
    { "type": "wokwi-resistor", "id": "r1", "top": 230, "left": 80, "attrs": { "value": "220" } },
    { "type": "wokwi-resistor", "id": "r2", "top": 230, "left": 130, "attrs": { "value": "220" } },
    { "type": "wokwi-resistor", "id": "r3", "top": 230, "left": 180, "attrs": { "value": "220" } },
    { "type": "wokwi-resistor", "id": "r4", "top": 230, "left": 230, "attrs": { "value": "220" } }
  ],
  "connections": [
    ["esp:GND.1", "temp1:GND", "black", ["h0"]],
    ["esp:3V3", "temp1:VCC", "red", ["h0"]],
    ["esp:D34", "temp1:OUT", "green", ["h0"]],
    ["esp:GND.1", "ldr1:GND", "black", ["h0"]],
    ["esp:3V3", "ldr1:VCC", "red", ["h0"]],
    ["esp:D35", "ldr1:AO", "orange", ["h0"]],
    ["esp:GND.1", "pot1:GND", "black", ["h0"]],
    ["esp:3V3", "pot1:VCC", "red", ["h0"]],
    ["esp:D32", "pot1:SIG", "purple", ["h0"]],
    ["esp:D18", "led_red:A", "red", ["h0"]],
    ["led_red:C", "r1:1", "black", ["h0"]],
    ["r1:2", "esp:GND.1", "black", ["h0"]],
    ["esp:D19", "led_yellow:A", "yellow", ["h0"]],
    ["led_yellow:C", "r2:1", "black", ["h0"]],
    ["r2:2", "esp:GND.1", "black", ["h0"]],
    ["esp:D21", "led_green:A", "green", ["h0"]],
    ["led_green:C", "r3:1", "black", ["h0"]],
    ["r3:2", "esp:GND.1", "black", ["h0"]],
    ["esp:D22", "led_blue:A", "blue", ["h0"]],
    ["led_blue:C", "r4:1", "black", ["h0"]],
    ["r4:2", "esp:GND.1", "black", ["h0"]]
  ]
}

1308.4.5 Complete Arduino Code

Copy this code into the Wokwi editor:

// ============================================================================
// DATA QUALITY LAB: Comprehensive IoT Data Preprocessing Pipeline
// ============================================================================
// Demonstrates: Validation, Outlier Detection, Missing Value Handling,
//               Noise Filtering, Normalization, and Data Scaling
// ============================================================================

#include <Arduino.h>
#include <math.h>

// ====================== PIN DEFINITIONS ======================
const int TEMP_PIN = 34;        // Temperature sensor (NTC)
const int LIGHT_PIN = 35;       // Light sensor (LDR)
const int DRIFT_PIN = 32;       // Potentiometer for drift simulation

const int LED_RED = 18;         // Range violation indicator
const int LED_YELLOW = 19;      // Outlier detection indicator
const int LED_GREEN = 21;       // Valid data indicator
const int LED_BLUE = 22;        // Missing data indicator

// ====================== SAMPLING PARAMETERS ======================
const int SAMPLE_INTERVAL_MS = 200;        // 5 Hz sampling rate
const int VALIDATION_WINDOW = 50;          // Samples for statistics
const int FILTER_WINDOW_MEDIAN = 5;        // Median filter window
const int FILTER_WINDOW_MA = 10;           // Moving average window
const float EXP_SMOOTHING_ALPHA = 0.3;     // Exponential smoothing factor

// ====================== VALIDATION THRESHOLDS ======================
// Temperature sensor valid range (Celsius after conversion)
const float TEMP_MIN_VALID = -10.0;
const float TEMP_MAX_VALID = 60.0;
const float TEMP_MAX_RATE = 2.0;           // Max 2C change per second

// Light sensor valid range (0-4095 ADC, 0-100000 lux mapped)
const float LIGHT_MIN_VALID = 0.0;
const float LIGHT_MAX_VALID = 100000.0;

// Outlier detection thresholds
const float ZSCORE_THRESHOLD = 3.0;        // Standard deviations
const float IQR_MULTIPLIER = 1.5;          // IQR outlier multiplier
const float MAD_THRESHOLD = 3.5;           // Modified Z-score threshold

// Missing data parameters
const int MAX_MISSING_SAMPLES = 10;        // Max consecutive missing allowed
const float MISSING_PROBABILITY = 0.05;    // Simulate 5% missing data

// ====================== DATA STRUCTURES ======================

// Circular buffer for streaming statistics
template<int SIZE>
struct CircularBuffer {
    float data[SIZE];
    int head;
    int count;

    CircularBuffer() : head(0), count(0) {
        for (int i = 0; i < SIZE; i++) data[i] = 0;
    }

    void push(float value) {
        data[head] = value;
        head = (head + 1) % SIZE;
        if (count < SIZE) count++;
    }

    float get(int index) const {
        // Get value at index (0 = oldest)
        int actual = (head - count + index + SIZE) % SIZE;
        return data[actual];
    }

    bool isFull() const { return count == SIZE; }
    int size() const { return count; }
};

// Statistics calculator for streaming data
struct StreamingStats {
    float sum;
    float sumSq;
    int count;
    float min_val;
    float max_val;

    StreamingStats() : sum(0), sumSq(0), count(0),
                       min_val(INFINITY), max_val(-INFINITY) {}

    void reset() {
        sum = sumSq = 0;
        count = 0;
        min_val = INFINITY;
        max_val = -INFINITY;
    }

    void add(float value) {
        sum += value;
        sumSq += value * value;
        count++;
        if (value < min_val) min_val = value;
        if (value > max_val) max_val = value;
    }

    float mean() const { return count > 0 ? sum / count : 0; }

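    float variance() const {
        if (count < 2) return 0;
        // The sum-of-squares form can go slightly negative from float rounding,
        // which would make sqrt() return NaN; clamp at zero.
        float v = (sumSq - sum * sum / count) / (count - 1);
        return v > 0 ? v : 0;
    }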

    float stdDev() const { return sqrt(variance()); }
};

// Sensor reading with quality metadata
struct SensorReading {
    float rawValue;
    float cleanedValue;
    float normalizedValue;
    unsigned long timestamp;
    bool isValid;
    bool isOutlier;
    bool isMissing;
    bool isImputed;
    String qualityFlags;
};

// ====================== GLOBAL STATE ======================

// Circular buffers for different processing stages
CircularBuffer<VALIDATION_WINDOW> tempRawBuffer;
CircularBuffer<VALIDATION_WINDOW> tempCleanBuffer;
CircularBuffer<FILTER_WINDOW_MEDIAN> tempMedianBuffer;
CircularBuffer<FILTER_WINDOW_MA> tempMABuffer;

CircularBuffer<VALIDATION_WINDOW> lightRawBuffer;
CircularBuffer<VALIDATION_WINDOW> lightCleanBuffer;

// Streaming statistics
StreamingStats tempStats;
StreamingStats lightStats;

// For rate-of-change validation
float lastValidTemp = NAN;
unsigned long lastValidTempTime = 0;
float lastValidLight = NAN;
unsigned long lastValidLightTime = 0;

// For exponential smoothing
float expSmoothedTemp = NAN;
float expSmoothedLight = NAN;

// For missing data handling
int consecutiveMissingTemp = 0;
int consecutiveMissingLight = 0;
float lastImputedTemp = NAN;
float lastImputedLight = NAN;

// Normalization parameters (learned from data)
float tempMinSeen = INFINITY;
float tempMaxSeen = -INFINITY;
float lightMinSeen = INFINITY;
float lightMaxSeen = -INFINITY;

// Statistics counters
unsigned long totalSamples = 0;
unsigned long validSamples = 0;
unsigned long outlierSamples = 0;
unsigned long missingSamples = 0;
unsigned long imputedSamples = 0;
unsigned long rangeViolations = 0;
unsigned long rateViolations = 0;

// ====================== HELPER FUNCTIONS ======================

// Convert ADC reading to temperature (NTC thermistor approximation)
float adcToTemperature(int adcValue) {
    if (adcValue == 0) return -INFINITY;
    if (adcValue >= 4095) return INFINITY;

    // Beta-parameter form of the Steinhart-Hart equation for a 10K NTC (B = 3950)
    float resistance = 10000.0 * (4095.0 / adcValue - 1.0);
    float steinhart = resistance / 10000.0;
    steinhart = log(steinhart);
    steinhart /= 3950.0;
    steinhart += 1.0 / (25.0 + 273.15);
    steinhart = 1.0 / steinhart;
    steinhart -= 273.15;

    return steinhart;
}

// Convert ADC reading to light level (LDR approximation)
float adcToLight(int adcValue) {
    // Map ADC to approximate lux (quadratic approximation)
    if (adcValue < 10) return 0;
    float lux = 100000.0 * pow((float)adcValue / 4095.0, 2);
    return lux;
}

// ====================== VALIDATION FUNCTIONS ======================

// Range validation - check if value is within physical bounds
bool validateRange(float value, float minValid, float maxValid, String& errorMsg) {
    if (isnan(value) || isinf(value)) {
        errorMsg = "NaN/Inf";
        return false;
    }
    if (value < minValid) {
        errorMsg = "Below min (" + String(minValid) + ")";
        return false;
    }
    if (value > maxValid) {
        errorMsg = "Above max (" + String(maxValid) + ")";
        return false;
    }
    errorMsg = "OK";
    return true;
}

// Rate-of-change validation - detect impossible jumps
bool validateRateOfChange(float currentValue, float lastValue,
                          unsigned long currentTime, unsigned long lastTime,
                          float maxRate, String& errorMsg) {
    if (isnan(lastValue) || lastTime == 0) {
        errorMsg = "First reading";
        return true;
    }

    float timeDelta = (currentTime - lastTime) / 1000.0;  // seconds
    if (timeDelta <= 0) {
        errorMsg = "Invalid timestamp";
        return false;
    }

    float rate = fabs(currentValue - lastValue) / timeDelta;  // fabs: always floating-point
    if (rate > maxRate) {
        errorMsg = "Rate " + String(rate, 2) + " exceeds max " + String(maxRate);
        return false;
    }

    errorMsg = "Rate OK (" + String(rate, 2) + ")";
    return true;
}

// ====================== OUTLIER DETECTION ======================

// Z-Score outlier detection
bool detectOutlierZScore(float value, float mean, float stdDev,
                         float threshold, float& zScore) {
    if (stdDev == 0) {
        zScore = 0;
        return false;
    }

    zScore = fabs((value - mean) / stdDev);  // fabs: always floating-point
    return zScore > threshold;
}

// IQR outlier detection (sorts a copy of the buffer)
bool detectOutlierIQR(CircularBuffer<VALIDATION_WINDOW>& buffer, float value,
                      float multiplier, float& lowerBound, float& upperBound) {
    if (buffer.size() < 10) return false;

    // Copy to temporary array for sorting
    float sorted[VALIDATION_WINDOW];
    int n = buffer.size();
    for (int i = 0; i < n; i++) {
        sorted[i] = buffer.get(i);
    }

    // Simple bubble sort (OK for small buffers)
    for (int i = 0; i < n - 1; i++) {
        for (int j = 0; j < n - i - 1; j++) {
            if (sorted[j] > sorted[j + 1]) {
                float temp = sorted[j];
                sorted[j] = sorted[j + 1];
                sorted[j + 1] = temp;
            }
        }
    }

    // Calculate quartiles
    int q1Idx = n / 4;
    int q3Idx = 3 * n / 4;
    float q1 = sorted[q1Idx];
    float q3 = sorted[q3Idx];
    float iqr = q3 - q1;

    lowerBound = q1 - multiplier * iqr;
    upperBound = q3 + multiplier * iqr;

    return (value < lowerBound) || (value > upperBound);
}

// ====================== NOISE FILTERING ======================

// Moving average filter
float filterMovingAverage(CircularBuffer<FILTER_WINDOW_MA>& buffer, float newValue) {
    buffer.push(newValue);

    float sum = 0;
    for (int i = 0; i < buffer.size(); i++) {
        sum += buffer.get(i);
    }

    return sum / buffer.size();
}

// Median filter (excellent for spike removal)
float filterMedian(CircularBuffer<FILTER_WINDOW_MEDIAN>& buffer, float newValue) {
    buffer.push(newValue);

    // Copy and sort
    float sorted[FILTER_WINDOW_MEDIAN];
    int n = buffer.size();
    for (int i = 0; i < n; i++) {
        sorted[i] = buffer.get(i);
    }

    for (int i = 0; i < n - 1; i++) {
        for (int j = 0; j < n - i - 1; j++) {
            if (sorted[j] > sorted[j + 1]) {
                float temp = sorted[j];
                sorted[j] = sorted[j + 1];
                sorted[j + 1] = temp;
            }
        }
    }

    if (n % 2 == 0) {
        return (sorted[n/2 - 1] + sorted[n/2]) / 2.0;
    }
    return sorted[n / 2];
}

// Exponential smoothing filter
float filterExponentialSmoothing(float& smoothed, float newValue, float alpha) {
    if (isnan(smoothed)) {
        smoothed = newValue;
    } else {
        smoothed = alpha * newValue + (1 - alpha) * smoothed;
    }
    return smoothed;
}

// ====================== MISSING VALUE HANDLING ======================

// Simulate missing data (for demonstration)
bool simulateMissingData() {
    return random(1000) < (MISSING_PROBABILITY * 1000);
}

// Forward-fill imputation with limit
float imputeForwardFill(float lastValidValue, int& consecutiveMissing,
                        int maxMissing, bool& wasImputed) {
    consecutiveMissing++;

    if (consecutiveMissing > maxMissing || isnan(lastValidValue)) {
        wasImputed = false;
        return NAN;
    }

    wasImputed = true;
    return lastValidValue;
}

// ====================== NORMALIZATION ======================

// Min-Max scaling to 0-1 range
float normalizeMinMax(float value, float minSeen, float maxSeen) {
    if (maxSeen == minSeen) return 0.5;
    return (value - minSeen) / (maxSeen - minSeen);
}

// Z-Score normalization (standardization)
float normalizeZScore(float value, float mean, float stdDev) {
    if (stdDev == 0) return 0;
    return (value - mean) / stdDev;
}

// Update normalization parameters
void updateNormalizationParams(float value, float& minSeen, float& maxSeen) {
    if (value < minSeen) minSeen = value;
    if (value > maxSeen) maxSeen = value;
}

// ====================== MAIN PROCESSING FUNCTION ======================

SensorReading processTemperature(int adcValue, unsigned long timestamp) {
    SensorReading reading;
    reading.timestamp = timestamp;
    reading.isValid = true;
    reading.isOutlier = false;
    reading.isMissing = false;
    reading.isImputed = false;
    reading.qualityFlags = "";

    // Step 0: Check for simulated missing data
    if (simulateMissingData()) {
        reading.isMissing = true;
        missingSamples++;

        // Try forward-fill imputation
        bool wasImputed;
        float imputed = imputeForwardFill(lastValidTemp, consecutiveMissingTemp,
                                          MAX_MISSING_SAMPLES, wasImputed);

        if (wasImputed) {
            reading.rawValue = NAN;
            reading.cleanedValue = imputed;
            reading.isImputed = true;
            reading.qualityFlags += "IMPUTED_FFILL ";
            imputedSamples++;
            lastImputedTemp = imputed;
        } else {
            reading.rawValue = NAN;
            reading.cleanedValue = NAN;
            reading.normalizedValue = NAN;
            reading.qualityFlags += "MISSING_NO_IMPUTE ";
            totalSamples++;  // count this sample before the early return
            return reading;
        }
    } else {
        consecutiveMissingTemp = 0;

        // Step 1: Convert ADC to temperature
        reading.rawValue = adcToTemperature(adcValue);

        // Step 2: Range validation
        String rangeError;
        if (!validateRange(reading.rawValue, TEMP_MIN_VALID, TEMP_MAX_VALID, rangeError)) {
            reading.isValid = false;
            reading.qualityFlags += "RANGE_VIOLATION(" + rangeError + ") ";
            rangeViolations++;
        }

        // Step 3: Rate-of-change validation
        String rateError;
        if (!validateRateOfChange(reading.rawValue, lastValidTemp,
                                  timestamp, lastValidTempTime,
                                  TEMP_MAX_RATE, rateError)) {
            reading.isValid = false;
            reading.qualityFlags += "RATE_VIOLATION(" + rateError + ") ";
            rateViolations++;
        }

        // Step 4: Outlier detection (only if range- and rate-valid)
        if (reading.isValid && tempStats.count > 20) {
            float zScore;
            if (detectOutlierZScore(reading.rawValue, tempStats.mean(),
                                    tempStats.stdDev(), ZSCORE_THRESHOLD, zScore)) {
                reading.isOutlier = true;
                reading.qualityFlags += "ZSCORE_OUTLIER(z=" + String(zScore, 2) + ") ";
            }

            float lowerBound, upperBound;
            if (detectOutlierIQR(tempRawBuffer, reading.rawValue,
                                 IQR_MULTIPLIER, lowerBound, upperBound)) {
                reading.isOutlier = true;
                reading.qualityFlags += "IQR_OUTLIER ";
            }

            // Count each flagged sample once, whichever detector fired
            if (reading.isOutlier) outlierSamples++;
        }

        // Step 5: Apply noise filtering
        if (reading.isValid && !reading.isOutlier) {
            // Median filter first (removes spikes)
            float medianFiltered = filterMedian(tempMedianBuffer, reading.rawValue);

            // Then moving average (smooths remaining noise)
            float maFiltered = filterMovingAverage(tempMABuffer, medianFiltered);

            // Exponential smoothing for final output
            reading.cleanedValue = filterExponentialSmoothing(expSmoothedTemp,
                                                              maFiltered,
                                                              EXP_SMOOTHING_ALPHA);

            // Update statistics with valid data
            tempStats.add(reading.rawValue);
            tempRawBuffer.push(reading.rawValue);
            tempCleanBuffer.push(reading.cleanedValue);

            // Update last valid
            lastValidTemp = reading.rawValue;
            lastValidTempTime = timestamp;
            validSamples++;
        } else if (reading.isOutlier) {
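            // Replace the outlier with the median of recent accepted values.
            // Note: filterMedian() pushes its argument, so the newest accepted
            // value is pushed again and counted twice in the window.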
            reading.cleanedValue = filterMedian(tempMedianBuffer,
                                                tempMedianBuffer.get(tempMedianBuffer.size() - 1));
            reading.qualityFlags += "OUTLIER_REPLACED ";
        } else {
            // Range/rate violation - use last valid with flag
            if (!isnan(lastValidTemp)) {
                reading.cleanedValue = lastValidTemp;
                reading.qualityFlags += "USING_LAST_VALID ";
            } else {
                reading.cleanedValue = NAN;
            }
        }
    }

    // Step 6: Normalization
    if (!isnan(reading.cleanedValue)) {
        updateNormalizationParams(reading.cleanedValue, tempMinSeen, tempMaxSeen);
        reading.normalizedValue = normalizeMinMax(reading.cleanedValue,
                                                   tempMinSeen, tempMaxSeen);
    } else {
        reading.normalizedValue = NAN;
    }

    if (reading.qualityFlags == "") {
        reading.qualityFlags = "CLEAN";
    }

    totalSamples++;
    return reading;
}

SensorReading processLight(int adcValue, unsigned long timestamp) {
    SensorReading reading;
    reading.timestamp = timestamp;
    reading.isValid = true;
    reading.isOutlier = false;
    reading.isMissing = false;
    reading.isImputed = false;
    reading.qualityFlags = "";

    // Similar processing as temperature (abbreviated for space)
    if (simulateMissingData()) {
        reading.isMissing = true;
        missingSamples++;

        bool wasImputed;
        float imputed = imputeForwardFill(lastValidLight, consecutiveMissingLight,
                                          MAX_MISSING_SAMPLES, wasImputed);

        if (wasImputed) {
            reading.rawValue = NAN;
            reading.cleanedValue = imputed;
            reading.isImputed = true;
            reading.qualityFlags = "IMPUTED_FFILL";
            imputedSamples++;
        } else {
            reading.rawValue = NAN;
            reading.cleanedValue = NAN;
            reading.normalizedValue = NAN;
            reading.qualityFlags = "MISSING_NO_IMPUTE";
            totalSamples++;  // count this sample before the early return
            return reading;
        }
    } else {
        consecutiveMissingLight = 0;
        reading.rawValue = adcToLight(adcValue);

        // Simplified processing for light sensor
        String rangeError;
        if (!validateRange(reading.rawValue, LIGHT_MIN_VALID, LIGHT_MAX_VALID, rangeError)) {
            reading.isValid = false;
            reading.qualityFlags = "RANGE_VIOLATION";
            rangeViolations++;
        }

        if (reading.isValid) {
            reading.cleanedValue = filterExponentialSmoothing(expSmoothedLight,
                                                              reading.rawValue,
                                                              EXP_SMOOTHING_ALPHA);
            lightStats.add(reading.rawValue);
            lightRawBuffer.push(reading.rawValue);
            lastValidLight = reading.rawValue;
            lastValidLightTime = timestamp;
            validSamples++;
        } else {
            reading.cleanedValue = lastValidLight;
        }
    }

    // Normalization
    if (!isnan(reading.cleanedValue)) {
        updateNormalizationParams(reading.cleanedValue, lightMinSeen, lightMaxSeen);
        reading.normalizedValue = normalizeMinMax(reading.cleanedValue,
                                                   lightMinSeen, lightMaxSeen);
    } else {
        reading.normalizedValue = NAN;  // never leave this field uninitialized
    }

    if (reading.qualityFlags == "") {
        reading.qualityFlags = "CLEAN";
    }

    totalSamples++;
    return reading;
}

// ====================== LED INDICATOR CONTROL ======================

void updateLEDs(const SensorReading& tempReading, const SensorReading& lightReading) {
    // Red LED: Range or rate-of-change violation
    if (!tempReading.isValid || !lightReading.isValid) {
        digitalWrite(LED_RED, HIGH);
    } else {
        digitalWrite(LED_RED, LOW);
    }

    // Yellow LED: Outlier detected
    if (tempReading.isOutlier || lightReading.isOutlier) {
        digitalWrite(LED_YELLOW, HIGH);
    } else {
        digitalWrite(LED_YELLOW, LOW);
    }

    // Green LED: Clean valid data
    if (tempReading.qualityFlags == "CLEAN" && lightReading.qualityFlags == "CLEAN") {
        digitalWrite(LED_GREEN, HIGH);
    } else {
        digitalWrite(LED_GREEN, LOW);
    }

    // Blue LED: Missing/imputed data
    if (tempReading.isMissing || tempReading.isImputed ||
        lightReading.isMissing || lightReading.isImputed) {
        digitalWrite(LED_BLUE, HIGH);
    } else {
        digitalWrite(LED_BLUE, LOW);
    }
}

// ====================== SERIAL OUTPUT ======================

void printSensorReading(const char* sensorName, const SensorReading& reading) {
    Serial.print(sensorName);
    Serial.print(": Raw=");
    if (isnan(reading.rawValue)) {
        Serial.print("NaN");
    } else {
        Serial.print(reading.rawValue, 2);
    }

    Serial.print(", Clean=");
    if (isnan(reading.cleanedValue)) {
        Serial.print("NaN");
    } else {
        Serial.print(reading.cleanedValue, 2);
    }

    Serial.print(", Norm=");
    if (isnan(reading.normalizedValue)) {
        Serial.print("NaN");
    } else {
        Serial.print(reading.normalizedValue, 3);
    }

    Serial.print(" [");
    Serial.print(reading.qualityFlags);
    Serial.println("]");
}

void printStatistics() {
    Serial.println("\n========== DATA QUALITY STATISTICS ==========");
    Serial.print("Total Samples: ");
    Serial.println(totalSamples);
    Serial.print("Valid Samples: ");
    Serial.print(validSamples);
    Serial.print(" (");
    Serial.print(100.0 * validSamples / max(totalSamples, 1UL), 1);
    Serial.println("%)");
    Serial.print("Outliers Detected: ");
    Serial.print(outlierSamples);
    Serial.print(" (");
    Serial.print(100.0 * outlierSamples / max(totalSamples, 1UL), 1);
    Serial.println("%)");
    Serial.print("Missing Values: ");
    Serial.print(missingSamples);
    Serial.print(" (");
    Serial.print(100.0 * missingSamples / max(totalSamples, 1UL), 1);
    Serial.println("%)");
    Serial.print("Imputed Values: ");
    Serial.print(imputedSamples);
    Serial.println();
    Serial.print("Range Violations: ");
    Serial.println(rangeViolations);
    Serial.print("Rate Violations: ");
    Serial.println(rateViolations);

    Serial.println("\n--- Temperature Statistics ---");
    Serial.print("Mean: ");
    Serial.print(tempStats.mean(), 2);
    Serial.print(" C, StdDev: ");
    Serial.print(tempStats.stdDev(), 2);
    Serial.print(" C, Range: [");
    Serial.print(tempMinSeen, 1);
    Serial.print(", ");
    Serial.print(tempMaxSeen, 1);
    Serial.println("] C");

    Serial.println("\n--- Light Statistics ---");
    Serial.print("Mean: ");
    Serial.print(lightStats.mean(), 0);
    Serial.print(" lux, StdDev: ");
    Serial.print(lightStats.stdDev(), 0);
    Serial.print(" lux, Range: [");
    Serial.print(lightMinSeen, 0);
    Serial.print(", ");
    Serial.print(lightMaxSeen, 0);
    Serial.println("] lux");
    Serial.println("==============================================\n");
}

// ====================== SETUP AND LOOP ======================

void setup() {
    Serial.begin(115200);
    delay(1000);

    // Initialize pins
    pinMode(TEMP_PIN, INPUT);
    pinMode(LIGHT_PIN, INPUT);
    pinMode(DRIFT_PIN, INPUT);

    pinMode(LED_RED, OUTPUT);
    pinMode(LED_YELLOW, OUTPUT);
    pinMode(LED_GREEN, OUTPUT);
    pinMode(LED_BLUE, OUTPUT);

    // All LEDs off initially
    digitalWrite(LED_RED, LOW);
    digitalWrite(LED_YELLOW, LOW);
    digitalWrite(LED_GREEN, LOW);
    digitalWrite(LED_BLUE, LOW);

    // Seed random for missing data simulation
    randomSeed(analogRead(0));

    Serial.println("============================================");
    Serial.println("  DATA QUALITY LAB: IoT Preprocessing Demo  ");
    Serial.println("============================================");
    Serial.println("Features demonstrated:");
    Serial.println("  - Range validation (physical bounds)");
    Serial.println("  - Rate-of-change validation");
    Serial.println("  - Z-Score outlier detection");
    Serial.println("  - IQR outlier detection");
    Serial.println("  - Median filter (spike removal)");
    Serial.println("  - Moving average filter (smoothing)");
    Serial.println("  - Exponential smoothing");
    Serial.println("  - Missing value imputation (forward-fill)");
    Serial.println("  - Min-Max normalization (0-1 scaling)");
    Serial.println("============================================");
    Serial.println("LED Indicators:");
    Serial.println("  RED: Range/Rate violation");
    Serial.println("  YELLOW: Outlier detected");
    Serial.println("  GREEN: Clean valid data");
    Serial.println("  BLUE: Missing/Imputed data");
    Serial.println("============================================\n");

    Serial.println("Starting data collection...\n");
}

unsigned long lastSampleTime = 0;
unsigned long lastStatsTime = 0;
const unsigned long STATS_INTERVAL_MS = 10000;  // Print stats every 10 seconds

void loop() {
    unsigned long currentTime = millis();

    // Sample at defined interval
    if (currentTime - lastSampleTime >= SAMPLE_INTERVAL_MS) {
        lastSampleTime = currentTime;

        // Read sensors
        int tempADC = analogRead(TEMP_PIN);
        int lightADC = analogRead(LIGHT_PIN);
        int driftADC = analogRead(DRIFT_PIN);

        // Add simulated drift from potentiometer (optional)
        float driftFactor = (driftADC - 2048) / 2048.0;  // -1 to +1
        tempADC = constrain(tempADC + (int)(driftFactor * 500), 0, 4095);

        // Process readings through data quality pipeline
        SensorReading tempReading = processTemperature(tempADC, currentTime);
        SensorReading lightReading = processLight(lightADC, currentTime);

        // Update LED indicators
        updateLEDs(tempReading, lightReading);

        // Print readings
        printSensorReading("TEMP", tempReading);
        printSensorReading("LIGHT", lightReading);
        Serial.println();
    }

    // Print statistics periodically
    if (currentTime - lastStatsTime >= STATS_INTERVAL_MS) {
        lastStatsTime = currentTime;
        printStatistics();
    }
}

1308.4.6 Step-by-Step Instructions

1308.4.6.1 Step 1: Set Up the Simulator

  1. Open the Wokwi simulator embedded above (or visit wokwi.com)
  2. Create a new ESP32 project
  3. Click the diagram.json tab and paste the circuit configuration
  4. Replace the default code with the complete Arduino code above

1308.4.6.2 Step 2: Run and Observe Validation

  1. Click the Play button to start the simulation
  2. Open the Serial Monitor to see the data processing output
  3. Observe the quality flags showing validation status for each reading
  4. Watch for “CLEAN” flags indicating data passed all quality checks

1308.4.6.3 Step 3: Trigger Range Violations

  1. Click the NTC temperature sensor in the simulator
  2. Drag the slider to extreme values (very hot or very cold)
  3. Watch the RED LED turn on when values exceed physical bounds
  4. Note the “RANGE_VIOLATION” flag in the serial output
  5. Observe how the system falls back to the last valid value when the current reading is invalid

1308.4.6.4 Step 4: Observe Outlier Detection

  1. Make sudden temperature changes by quickly dragging the sensor slider
  2. Watch the YELLOW LED blink when outliers are detected
  3. See the “ZSCORE_OUTLIER” and “IQR_OUTLIER” flags in output
  4. Notice how outliers are replaced with median values

1308.4.6.5 Step 5: Observe Missing Data Handling

  1. The code simulates 5% random missing data
  2. Watch the BLUE LED flash when data is missing or imputed
  3. See “IMPUTED_FFILL” flags showing forward-fill imputation
  4. Note “MISSING_NO_IMPUTE” when too many consecutive values are missing

1308.4.6.6 Step 6: Experiment with Filtering

  1. Observe the Raw vs Clean values in the serial output
  2. Notice how Clean values are smoother due to median + moving average filters
  3. Compare the normalized values (0-1 range) for multi-sensor comparison

1308.4.6.7 Step 7: Analyze Statistics

  1. Wait for the statistics report (prints every 10 seconds)
  2. Review the data quality percentages: valid, outliers, missing
  3. Examine the sensor statistics: mean, standard deviation, range
  4. Consider how these metrics would inform production monitoring

1308.4.7 Challenge Exercises

Difficulty: Intermediate

Task: The code includes MAD outlier detection but does not use it in the main pipeline. Modify the processTemperature() function to use MAD as the primary outlier detection method.

Hints:

  • MAD is more robust than Z-score for non-Gaussian data
  • The detectOutlierMAD() function is already implemented
  • Replace or complement the Z-score check with MAD

Expected Outcome: MAD should detect outliers even when extreme values skew the mean and standard deviation.
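As a starting point, the idea behind MAD-based detection can be sketched in plain C++. This is not the lab's detectOutlierMAD() function; the helper names, the window container, and the 3.5 threshold are assumptions chosen for illustration. The 0.6745 factor rescales the MAD so the score is comparable to a Gaussian Z-score.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Median of a copy of the data (the caller's vector is left unmodified).
static float medianOf(std::vector<float> values) {
    const std::size_t n = values.size();
    if (n == 0) return NAN;
    std::sort(values.begin(), values.end());
    return (n % 2 == 1) ? values[n / 2]
                        : 0.5f * (values[n / 2 - 1] + values[n / 2]);
}

// Hypothetical helper (not the lab's detectOutlierMAD()): returns true when
// `value` has a modified Z-score above `threshold` relative to `window`.
bool isOutlierByMad(const std::vector<float>& window, float value,
                    float threshold = 3.5f) {
    if (window.size() < 3) return false;  // too little history to judge

    const float med = medianOf(window);

    // Absolute deviation of every window sample from the median.
    std::vector<float> deviations;
    deviations.reserve(window.size());
    for (float v : window) deviations.push_back(std::fabs(v - med));
    const float mad = medianOf(deviations);

    // A (nearly) constant window gives MAD ~ 0; this sketch declines to
    // flag anything in that case rather than divide by zero.
    if (mad < 1e-6f) return false;

    const float modifiedZ = 0.6745f * std::fabs(value - med) / mad;
    return modifiedZ > threshold;
}
```

Because the median and MAD barely move when a single spike enters the window, this score stays sensitive even after extreme values have inflated the mean and standard deviation used by the Z-score check.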

Difficulty: Intermediate

Task: Currently, forward-fill uses a fixed limit (MAX_MISSING_SAMPLES). Implement an exponential decay on the confidence of imputed values.

Requirements:

  • Add a confidence field to SensorReading
  • Reduce confidence by 10% for each consecutive imputed value
  • Stop imputing when confidence drops below 50%
  • Display confidence in serial output

Expected Outcome: Imputed values should be flagged with decreasing confidence as gaps grow longer.
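One way to approach this is sketched below. It interprets "reduce by 10%" multiplicatively (confidence is multiplied by 0.9 per missing sample), which is an assumption; under that reading, six consecutive samples are imputed (0.9^6 ≈ 0.53) and the seventh is rejected (0.9^7 ≈ 0.48). The struct and function names are hypothetical and kept separate from the lab's SensorReading so the sketch stands alone.

```cpp
#include <cmath>

// Values taken from the challenge requirements; the names are hypothetical.
constexpr float CONFIDENCE_DECAY = 0.9f;  // lose 10% per imputed sample
constexpr float MIN_CONFIDENCE = 0.5f;    // stop imputing below this

struct ImputationState {
    float lastValid = NAN;    // most recent real reading
    float confidence = 0.0f;  // stays 0 until a real reading arrives
};

struct ImputedValue {
    float value;       // reading, forward-filled value, or NaN
    float confidence;  // 1.0 for real data, decayed for imputed data
    bool imputed;      // true only when value was forward-filled
};

// Pass NaN as `reading` to signal a missing sample.
ImputedValue forwardFillWithConfidence(ImputationState& state, float reading) {
    if (!std::isnan(reading)) {
        state.lastValid = reading;
        state.confidence = 1.0f;  // a real reading restores full confidence
        return {reading, 1.0f, false};
    }

    // Missing sample: decay confidence before deciding whether to impute.
    state.confidence *= CONFIDENCE_DECAY;
    if (std::isnan(state.lastValid) || state.confidence < MIN_CONFIDENCE) {
        return {NAN, 0.0f, false};  // gap too long (or no history): give up
    }
    return {state.lastValid, state.confidence, true};
}
```

In the sketch, the returned confidence could be printed next to the quality flags so that long gaps are visible in the serial output.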

Difficulty: Advanced

Task: Add plausibility checking between temperature and light sensors. If it is very bright (high light), temperature should be reasonable for daytime.

Requirements:

  • If light > 50000 lux and temperature < 10C, flag as suspicious
  • If light < 100 lux and temperature > 35C (outdoors), flag as suspicious
  • Add a new LED or serial indicator for cross-sensor anomalies

Expected Outcome: The system should detect when sensor readings are physically inconsistent with each other.
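The two rules above can be expressed as a small pure function, sketched here with hypothetical names. The thresholds are the ones given in the requirements; the enum and the label strings are assumptions made for illustration.

```cpp
// Hypothetical result type for the cross-sensor check.
enum class CrossSensorFlag { Consistent, BrightButCold, DarkButHot };

// Applies the two plausibility rules from the challenge requirements.
// NaN inputs fail every comparison, so they fall through to Consistent;
// missing data should be handled by the imputation stage beforehand.
CrossSensorFlag checkPlausibility(float tempC, float lightLux) {
    if (lightLux > 50000.0f && tempC < 10.0f) {
        return CrossSensorFlag::BrightButCold;
    }
    if (lightLux < 100.0f && tempC > 35.0f) {
        return CrossSensorFlag::DarkButHot;
    }
    return CrossSensorFlag::Consistent;
}

// Short label suitable for appending to the serial output.
const char* flagName(CrossSensorFlag flag) {
    switch (flag) {
        case CrossSensorFlag::BrightButCold: return "BRIGHT_BUT_COLD";
        case CrossSensorFlag::DarkButHot:    return "DARK_BUT_HOT";
        default:                             return "CONSISTENT";
    }
}
```

Keeping the check free of hardware calls makes it easy to test on a desktop before wiring it to an LED or to the quality flags.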

Difficulty: Advanced

Task: Replace the exponential smoothing filter with a simple Kalman filter for temperature.

Requirements:

  • Implement 1D Kalman filter with process noise and measurement noise
  • Estimate the Kalman gain dynamically
  • Output both the filtered value and the uncertainty estimate

Learning: Kalman filters provide optimal (minimum-variance) estimates for linear systems when the process and measurement noise are Gaussian with known variances.

1308.4.8 Expected Outcomes

After completing this lab, you should be able to:

  1. Understand validation trade-offs: Strict validation catches more errors but may reject valid extreme readings
  2. Choose appropriate outlier methods: Z-score for Gaussian data, IQR/MAD for robust detection
  3. Select imputation strategies: Forward-fill for slow-changing signals, interpolation for trending data
  4. Apply noise filters correctly: Median for spikes, moving average for steady-state noise
  5. Normalize for fusion: Understand when to use min-max vs Z-score normalization

Quality Metrics to Observe:

  • Valid sample rate should be >90% under normal conditions
  • Outlier rate should be <5% for stable sensors
  • Imputed values should maintain temporal continuity

1308.5 Summary

Data normalization and scaling complete the data quality preprocessing pipeline:

  • Min-Max Scaling: Transforms data to 0-1 range, ideal for neural networks and bounded outputs
  • Z-Score Normalization: Centers data around mean with unit variance, best for clustering and SVM
  • Robust Scaling: Uses median and IQR, resistant to outliers
  • Log Transform: Compresses right-skewed data spanning orders of magnitude
  • Complete Pipeline: Validation -> Cleaning -> Transformation, all implementable on edge devices
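The four scaling methods in this summary reduce to one-line formulas, sketched here as plain C++ functions. The function names and the guard behaviour for degenerate inputs (returning 0 when the range, standard deviation, or IQR is not positive) are assumptions made for illustration, not part of the lab code.

```cpp
#include <algorithm>
#include <cmath>

// Min-max: maps [lo, hi] onto [0, 1].
float minMaxScale(float x, float lo, float hi) {
    if (hi <= lo) return 0.0f;  // degenerate range: nothing to scale
    return (x - lo) / (hi - lo);
}

// Z-score: distance from the mean in units of standard deviation.
float zScore(float x, float mean, float stdDev) {
    if (stdDev <= 0.0f) return 0.0f;
    return (x - mean) / stdDev;
}

// Robust scaling: like Z-score, but uses median and interquartile range.
float robustScale(float x, float median, float iqr) {
    if (iqr <= 0.0f) return 0.0f;
    return (x - median) / iqr;
}

// Log transform: natural log of (1 + x), so that 0 maps to 0.
// Negative inputs are clamped to 0 in this sketch.
float logScale(float x) {
    return std::log1p(std::max(x, 0.0f));
}
```

With the ranges used in the chapter's introductory table, `minMaxScale(25.0f, -10.0f, 50.0f)` gives about 0.58 for temperature and `minMaxScale(50000.0f, 0.0f, 100000.0f)` gives 0.50 for light.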

Critical Design Principle: The complete “validate-clean-transform” pipeline should run at the edge. As a rule of thumb, catching data quality issues at the source costs roughly 1% of what it costs to fix them in the cloud, and normalized data enables fair multi-sensor fusion.

1308.6 What’s Next

The next chapter explores Multi-Sensor Data Fusion, building on these preprocessing techniques to combine data from multiple sensors for improved accuracy and reliability.
