42  Data Quality and Preprocessing

42.1 Learning Objectives

By the end of this chapter series, you will be able to:

  • Design a Data Quality Pipeline: Architect a validate-clean-transform workflow for IoT sensor streams
  • Compare Preprocessing Techniques: Evaluate validation, imputation, and normalization methods based on sensor type and data characteristics
  • Implement Edge-Side Preprocessing: Build and test resource-efficient data quality checks that run on constrained devices
  • Calculate Data Quality Impact: Quantify the cost of poor data quality versus the investment in preprocessing using the 1-10-100 rule
  • Diagnose Common Pitfalls: Identify and prevent the most frequent data quality mistakes in IoT systems

Key Concepts

  • Data preprocessing pipeline: A sequenced set of transformations applied to raw sensor data: validation → imputation → filtering → normalisation → feature extraction → aggregation, each step feeding the next.
  • Outlier detection and treatment: The process of identifying readings that lie far outside the expected range and deciding whether to remove, cap, or flag them before analysis.
  • Feature extraction: The transformation of raw sensor time series into informative features (mean, variance, FFT components, zero-crossing rate) that capture the patterns relevant to the downstream task.
  • Data windowing: Dividing a continuous sensor stream into fixed-length or event-triggered windows for batch feature extraction and ML model input.
  • Schema-on-read vs schema-on-write: Two approaches to data structure enforcement: schema-on-write validates structure at ingestion (preferred for quality), schema-on-read allows raw storage and validates at query time (flexible but risky).

In 60 Seconds

Data preprocessing transforms raw, noisy IoT sensor readings into clean, structured inputs suitable for analytics and machine learning — and the quality of this step determines the accuracy ceiling of every downstream analysis. The pipeline typically covers validation, imputation, filtering, normalisation, and feature extraction, and each step must be designed with the specific sensor characteristics and downstream use case in mind.

Minimum Viable Understanding

  • Data quality preprocessing is a three-stage pipeline – validate (reject impossible values), clean (fill gaps and remove noise), transform (normalize for analysis) – applied at the edge before data reaches the cloud.
  • Bad data costs 10-100x more to fix downstream – a corrupt sensor reading can trigger false alarms, shut down equipment, or poison ML models, so catching issues at the source is critical.
  • Every sensor type needs tailored quality rules – temperature cannot exceed physical bounds, humidity cannot go above 100%, and rate-of-change limits must match the physical process being measured.
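The three bullets above can be sketched as a single edge-side validation step. This is a minimal illustration, not a production implementation; the bounds and rate limits are made-up values that in practice would come from the sensor datasheet and deployment environment:

```python
def validate_reading(sensor_type, value, prev_value=None, dt_seconds=30):
    """Stage 1 (validate): reject physically impossible readings.

    Bounds and rate limits below are illustrative placeholders.
    """
    bounds = {
        "temperature_c": (-40.0, 60.0),   # outdoor air temperature
        "humidity_pct": (0.0, 100.0),     # relative humidity cannot exceed 100%
    }
    max_rate = {
        "temperature_c": 2.0 / 60.0,      # max 2 °C per minute, expressed in °C/s
        "humidity_pct": 5.0 / 60.0,
    }
    lo, hi = bounds[sensor_type]
    if not (lo <= value <= hi):
        return False, "out_of_range"
    if prev_value is not None:
        rate = abs(value - prev_value) / dt_seconds
        if rate > max_rate[sensor_type]:
            return False, "rate_of_change"
    return True, "ok"
```

A 500-degree reading like Sensor 47's fails the range check immediately, at the cost of two comparisons.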

42.2 Overview

Data quality preprocessing is the foundation of trustworthy IoT analytics. Raw sensor data is inherently noisy, incomplete, and sometimes outright wrong. This series explores practical techniques for detecting and correcting data quality issues in real time on resource-constrained edge devices.

Sammy the Sensor was on duty at the Smart City Weather Station when the alarm went off.

“Boss! We just got a temperature reading of 500 degrees from Sensor 47 in Central Park!” shouted Lila the LED, flashing red.

Sammy stayed calm. “That is hotter than a pizza oven. There is NO WAY a park bench is 500 degrees. Let me check our Data Quality Checklist.”

Sammy pulled out the checklist:

  • Step 1 – VALIDATE: “Is 500 degrees even possible outdoors? Nope! The hottest place on Earth is about 57 degrees Celsius. REJECTED!”
  • Step 2 – CLEAN: “Sensor 47 also had a gap last Tuesday. Let me fill it using the readings from nearby Sensors 46 and 48.”
  • Step 3 – TRANSFORM: “Now let me put all the temperatures on the same scale so our weather prediction model can understand them.”

Max the Microcontroller nodded. “Without your detective work, the weather app would have told everyone the park was on fire!”

Bella the Battery added, “And I saved energy by catching that bad reading right here at the edge, instead of sending it all the way to the cloud!”

The lesson: Always check your data before trusting it. Bad data in means bad decisions out!

Imagine you are baking a cake. Before you start mixing, you check your ingredients: Is the flour fresh or expired? Is there enough sugar? Did someone accidentally put salt in the sugar jar?

Data quality preprocessing works the same way. Before IoT data is analyzed or used to make decisions, it must be checked and cleaned:

  1. Validate – Is this sensor reading even physically possible? (A room temperature of 500 degrees is not.)
  2. Clean – Are there gaps or noise in the data? Fill the gaps and smooth out the noise.
  3. Transform – Are all the readings on the same scale? Convert them so different sensors can be compared.

Why does this matter?

  • A smart thermostat that trusts a faulty temperature reading might blast the AC on a cold day
  • A factory monitoring system that ignores data quality might miss a real equipment failure hidden in noisy data
  • A health monitor that does not validate readings might send a false emergency alert

Key concept: It is much cheaper and faster to catch data problems at the edge (right where the sensor is) than to fix them later in the cloud. Think of it as proofreading your essay before submitting it, not after the teacher has graded it.

42.3 The Data Quality Problem in IoT

IoT systems generate massive volumes of sensor data, but raw readings are rarely analysis-ready. Studies consistently show that data scientists spend 60-80% of their time on data preparation, and in IoT contexts the challenges are amplified by:

  • Harsh environments: Sensors deployed outdoors, in factories, or underwater face temperature extremes, vibration, and electromagnetic interference
  • Resource constraints: Edge devices have limited CPU, memory, and power for sophisticated processing
  • Real-time requirements: Many IoT applications need clean data in milliseconds, not hours
  • Scale: Thousands of sensors producing readings every second create a firehose of potentially dirty data

Figure: The IoT data preprocessing pipeline. Raw sensor data flows through Validate (range check), Clean (remove noise), and Transform (feature extraction) to become analysis-ready.

42.4 The Three-Stage Pipeline

The data quality pipeline follows a strict validate-clean-transform sequence. Each stage builds on the previous one, and skipping a stage leads to compounding errors downstream.

Figure 42.1: The three-stage data quality pipeline. Data flows left to right through Validate (check physical bounds), Clean (remove errors and fill gaps), and Transform (prepare features for analysis).
| Stage | Purpose | Key Techniques | Typical Edge Cost |
|---|---|---|---|
| 1. Validate | Reject impossible readings | Range checks, rate-of-change limits, cross-sensor plausibility | Very low (simple comparisons) |
| 2. Clean | Fill gaps, remove noise | Forward fill, interpolation, moving average, median filter | Low to moderate |
| 3. Transform | Prepare for analysis | Min-max scaling, z-score normalization, robust scaling | Low |

Try It: Three-Stage Pipeline Simulator

Enter a raw sensor reading and configure validation rules to see how data flows through the validate-clean-transform pipeline. Introduce noise, outliers, or missing values to observe how each stage responds.
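One way the three stages might be wired together is sketched below. The stage bodies are deliberately minimal placeholders (range check, forward fill plus exponential smoothing, min-max scaling), not a reference implementation:

```python
def run_pipeline(readings, lo=-10.0, hi=60.0, alpha=0.3):
    """Validate -> clean -> transform, in that strict order.

    readings: list of floats; None marks a missing sample.
    Returns (processed, rejected_count) where processed values are
    EMA-smoothed and min-max scaled to [0, 1].
    """
    # Stage 1: validate (range check); invalid readings become gaps
    rejected = 0
    validated = []
    for r in readings:
        if r is not None and not (lo <= r <= hi):
            rejected += 1
            r = None
        validated.append(r)

    # Stage 2: clean (forward fill gaps, then exponential smoothing)
    cleaned, last, ema = [], None, None
    for r in validated:
        if r is None:
            r = last              # forward fill; still None if no history yet
        if r is None:
            continue              # no data at all yet, skip this slot
        last = r
        ema = r if ema is None else alpha * r + (1 - alpha) * ema
        cleaned.append(ema)

    # Stage 3: transform (min-max scale to [0, 1])
    mn, mx = min(cleaned), max(cleaned)
    span = (mx - mn) or 1.0
    return [(v - mn) / span for v in cleaned], rejected
```

Note that validation runs before any smoothing: if the order were reversed, an impossible 500-degree spike would be averaged into the clean signal instead of being rejected.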

42.5 Chapter Series

This topic is covered in three focused chapters:

42.5.1 Data Validation and Outlier Detection

The first stage of the data quality pipeline focuses on detecting invalid and anomalous readings:

  • Range Validation: Check values against physical bounds
  • Rate-of-Change Validation: Detect impossible sensor jumps
  • Multi-Sensor Plausibility: Cross-validate related measurements
  • Z-Score Detection: Identify outliers in Gaussian distributions
  • IQR and MAD Detection: Robust outlier detection for skewed data
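The detection families listed above can be sketched in a few lines each. The thresholds (3 for z-score, 1.5 for IQR, 3.5 for the MAD modified z-score) are the conventional defaults, not requirements:

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Flag points more than `threshold` standard deviations from the mean.
    Assumes roughly Gaussian data; extreme outliers inflate the mean and
    std themselves, which is why the robust methods below exist."""
    mu = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mu) / sd > threshold]

def iqr_outliers(values, k=1.5):
    """Tukey's fences: flag points outside [Q1 - k*IQR, Q3 + k*IQR].
    Robust to skew because quartiles ignore extreme values."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return [v for v in values if v < q1 - k * iqr or v > q3 + k * iqr]

def mad_outliers(values, threshold=3.5):
    """Median absolute deviation: a modified z-score around the median.
    The 0.6745 factor rescales MAD to match the std of a normal
    distribution."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        return []
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]
```

On a short window like `[20, 21, 20, 22, 21, 20, 500]`, the IQR and MAD detectors flag 500 while the z-score detector misses it, because the outlier drags the mean and standard deviation toward itself.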

42.5.2 Missing Value Imputation and Noise Filtering

The second stage handles gaps in data and removes noise while preserving the underlying signal:

  • Forward Fill: Simple imputation for slowly-changing values
  • Linear Interpolation: Fill gaps in trending data
  • Seasonal Decomposition: Use periodic patterns for imputation
  • Sensor-Specific Strategies: Match imputation to sensor semantics
  • Moving Average and Median Filters: Smooth steady-state noise and remove spikes
  • Exponential Smoothing: Real-time filtering with tunable responsiveness
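Minimal sketches of the imputation and filtering techniques above (illustrative implementations; `None` marks a missing reading):

```python
def forward_fill(series):
    """Carry the last valid reading forward. Suitable only for
    slowly-changing values and short gaps."""
    out, last = [], None
    for v in series:
        last = v if v is not None else last
        out.append(last)
    return out

def linear_interpolate(series):
    """Fill interior gaps on a straight line between the surrounding
    valid readings; leading/trailing gaps are left as None."""
    out = list(series)
    i = 0
    while i < len(out):
        if out[i] is None:
            j = i
            while j < len(out) and out[j] is None:
                j += 1
            if i > 0 and j < len(out):        # gap bounded on both sides
                left, right = out[i - 1], out[j]
                for k in range(i, j):
                    frac = (k - (i - 1)) / (j - (i - 1))
                    out[k] = left + frac * (right - left)
            i = j
        else:
            i += 1
    return out

def median_filter(series, window=3):
    """Sliding median: removes single-sample spikes while preserving edges."""
    half = window // 2
    out = []
    for i in range(len(series)):
        win = sorted(series[max(0, i - half):i + half + 1])
        out.append(win[len(win) // 2])
    return out

def ema(series, alpha=0.3):
    """Exponential moving average: higher alpha reacts faster, smooths less."""
    out, s = [], None
    for v in series:
        s = v if s is None else alpha * v + (1 - alpha) * s
        out.append(s)
    return out
```

Each of these runs in constant memory per reading (the median filter needs only `window` samples), which is what makes them practical on constrained edge devices.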

42.5.3 Data Normalization and Preprocessing Lab

The final stage prepares data for analysis and provides hands-on practice:

  • Min-Max Scaling: Transform data to bounded ranges for neural networks
  • Z-Score Normalization: Center data for clustering and SVM
  • Robust Scaling: Outlier-resistant normalization
  • ESP32 Wokwi Lab: Complete data quality pipeline implementation
  • Challenge Exercises: Extend the pipeline with advanced techniques

42.6 Cost of Poor Data Quality

Understanding why data quality matters requires quantifying the cost of getting it wrong. The 1-10-100 rule is well-established in data engineering:

Figure: The 1-10-100 rule for data quality costs: Detection early costs $1, Correction mid-stage costs $10, and Failure late costs $100, an exponential increase the later an issue is caught in the pipeline.

Real-world examples with quantified costs:

| Failure Scenario | Root Cause | Cost of Failure | Cost of Prevention | Ratio |
|---|---|---|---|---|
| Smart HVAC: stuck sensor reads 15 °C in summer, heater runs 8 hours | No staleness check | $4.80/day energy + $200 investigation | ~$0 (one timestamp comparison) | Effectively infinite |
| Predictive maintenance: EMI noise triggers false "bearing failure" alert | No noise filter | $45,000 (4-hour shutdown of production line) | $0.02 (median filter CPU time per day) | 2,250,000:1 |
| Agricultural IoT: moisture sensors drift 5% over 6 months | No drift detection | $12,000/season (30% water overuse on a 50-hectare farm) | $50 (quarterly calibration check) | 240:1 |
| Cold chain: sensor gap during transport not flagged | No gap detection | $500,000 (rejected pharmaceutical shipment) | ~$0 (missing-reading counter) | Effectively infinite |
| Smart grid: CT sensor phase error corrupts power readings | No cross-sensor validation | $8,000/month (billing errors for 200 units) | ~$0 (compare with utility meter) | Effectively infinite |

The pattern is consistent: prevention costs are negligible (simple comparisons, a few CPU cycles) while failure costs range from hundreds to hundreds of thousands of dollars. This is why data quality should be the first thing you implement, not the last.

42.6.1 Try It: Data Quality Cost Calculator

Use this interactive calculator to explore the 1-10-100 rule with your own failure scenario. Adjust the costs to see how the prevention-to-failure ratio changes.

Quantifying the 1-10-100 Rule for IoT Data Quality

The classic 1-10-100 rule states: $1 to prevent, $10 to correct, $100 when it causes failure.

Example: Smart HVAC with stuck sensor reading 15°C in summer

Prevention Cost ($1 equivalent, a timestamp staleness check):

\[ 1 \text{ comparison} \times 10^{-6} \text{ s CPU} \times 10^{-7}\ \$/\text{s} \approx 10^{-13}\ \$ \approx \$0 \]

Correction Cost ($10 equivalent, retrospective data repair):

\[ 86{,}400 \text{ readings/day} \times 3 \text{ days} = 259{,}200 \text{ readings} \]
\[ 259{,}200 \times 10^{-4} \text{ s/reading} \approx 26 \text{ s compute} \approx \$0.07 \]
\[ \$0.07 + 2 \text{ h engineer time} \times \$100/\text{h} \approx \$200 \]

Failure Cost ($100 equivalent, heater ran 8 hours unnecessarily):

\[ 5 \text{ kW} \times 8 \text{ h} \times \$0.12/\text{kWh} = \$4.80/\text{day} \]
\[ \$4.80/\text{day} \times 30 \text{ days} = \$144 \]
\[ \text{Total} = \$144 + \$200 \text{ investigation} = \$344 \]

Actual Ratio: \[ \frac{\text{Correction}}{\text{Prevention}} = \frac{\$200}{\$0} = \infty \quad \frac{\text{Failure}}{\text{Prevention}} = \frac{\$344}{\$0} = \infty \]

For IoT, prevention is so cheap (single comparison) that the ratio is effectively infinite—making edge-side validation mandatory, not optional.
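The same arithmetic, spelled out in code with the chapter's illustrative figures:

```python
# Illustrative figures from the smart HVAC example above.

# Failure: a 5 kW heater runs 8 h/day at $0.12/kWh for a 30-day month,
# plus a $200 engineering investigation.
energy_per_day = 5 * 8 * 0.12              # $4.80 per day
failure_cost = energy_per_day * 30 + 200   # $344 total

# Correction: retrospectively repair 3 days of per-second data
# (86,400 readings/day) plus 2 hours of engineer time at $100/h.
correction_cost = 0.07 + 2 * 100           # ~$200 (compute cost is negligible)

# Prevention: one timestamp comparison per reading, effectively free,
# so both cost ratios are effectively infinite.
prevention_cost = 0.0
```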

42.7 Worked Example: Smart Building Temperature Pipeline

Scenario: You are deploying 200 temperature sensors across a commercial building for HVAC optimization. Each sensor reports every 30 seconds. You need clean, analysis-ready data for the building management system.

Step 1 – Define Validation Rules

First, establish what constitutes valid data for your specific deployment:

| Rule | Threshold | Rationale |
|---|---|---|
| Physical range | -10 to 60 °C | Building is climate-controlled but accounts for loading docks |
| Rate of change | Max 2 °C per minute | Physical thermal mass prevents faster changes |
| Cross-sensor | Max 8 °C difference from nearest neighbor | Adjacent zones should not differ drastically |
| Staleness | Max 5 minutes between readings | Sensor or network failure if gap exceeds this |
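The four rules above might be encoded as a rule set plus a checker, as a sketch (names and structure are illustrative):

```python
RULES = {
    "min_c": -10.0,             # physical range lower bound
    "max_c": 60.0,              # physical range upper bound
    "max_rate_c_per_min": 2.0,  # thermal mass limits rate of change
    "max_neighbor_diff_c": 8.0, # cross-sensor plausibility
    "max_staleness_s": 300,     # 5 minutes between readings
}

def check_reading(value, prev_value, neighbor_value, age_s, rules=RULES):
    """Return the list of violated rule names (empty list means valid)."""
    violations = []
    if not (rules["min_c"] <= value <= rules["max_c"]):
        violations.append("range")
    if prev_value is not None:
        # readings arrive every 30 s, so scale the per-minute rate limit
        if abs(value - prev_value) > rules["max_rate_c_per_min"] * (30 / 60):
            violations.append("rate_of_change")
    if neighbor_value is not None:
        if abs(value - neighbor_value) > rules["max_neighbor_diff_c"]:
            violations.append("cross_sensor")
    if age_s > rules["max_staleness_s"]:
        violations.append("staleness")
    return violations
```

Returning the violated rule names, rather than a bare pass/fail, makes it cheap to log why a reading was rejected, which matters when diagnosing a failing sensor later.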

Step 2 – Design Cleaning Strategy

For readings that pass validation but have quality issues:

IF reading is missing (gap < 5 minutes):
    Use linear interpolation from neighbors
ELSE IF reading is missing (gap >= 5 minutes):
    Use forward fill + flag as "imputed"

IF noise detected (high-frequency fluctuation > 0.5C):
    Apply exponential moving average (alpha = 0.3)

Try It: Imputation and Noise Filtering Explorer

Simulate a sensor data stream with missing values and noise. Choose an imputation method and a noise filter to see how different strategies affect the output. The chart shows raw data (with gaps), imputed values, and the filtered signal.
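A runnable sketch of the pseudocode's gap-handling branch, assuming the 5-minute threshold from Step 1 (the `interpolated`/`imputed` flags are illustrative labels):

```python
def impute_gap(before, after, gap_len, gap_seconds, max_gap_s=300):
    """Fill a run of `gap_len` missing readings between two valid ones.

    Short gaps (< max_gap_s) are linearly interpolated; longer gaps are
    forward-filled, and every filled value carries a flag so downstream
    consumers know it is estimated, not measured.
    """
    if gap_seconds < max_gap_s:
        step = (after - before) / (gap_len + 1)
        return [(before + step * (i + 1), "interpolated") for i in range(gap_len)]
    return [(before, "imputed") for _ in range(gap_len)]
```

Usage: a two-reading gap of 60 seconds between 20.0 °C and 23.0 °C interpolates to 21.0 and 22.0, while the same gap spanning 10 minutes forward-fills 20.0 twice with the `imputed` flag set.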

Step 3 – Apply Normalization

For the ML-based HVAC optimization model:

Normalized_Temp = (Raw_Temp - Zone_Min) / (Zone_Max - Zone_Min)

Where:
  Zone_Min = historical minimum for this zone (e.g., 18C for office)
  Zone_Max = historical maximum for this zone (e.g., 28C for office)

Result: values in [0, 1] range for neural network input
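The same formula as a small function, with the example office-zone bounds as defaults. Whether to clamp or flag inputs outside the historical range is a design choice; this sketch clamps:

```python
def normalize_zone_temp(raw_temp, zone_min=18.0, zone_max=28.0, clamp=True):
    """Min-max scale a reading using the zone's historical bounds.

    Readings outside [zone_min, zone_max] are clamped into [0, 1];
    a production pipeline might instead flag them for review.
    """
    x = (raw_temp - zone_min) / (zone_max - zone_min)
    return min(1.0, max(0.0, x)) if clamp else x
```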

42.7.1 Try It: Normalization and Throughput Calculator

Experiment with different temperature readings, zone bounds, and sensor configurations to see how min-max normalization works and how pipeline throughput scales.

Result: With the default settings, the pipeline processes 24,000 readings per hour (200 sensors x 2 readings/min x 60 min). With edge-side validation, roughly 0.1-0.5% of readings are flagged or rejected, preventing those errors from reaching the HVAC control algorithm. The cleaning stage fills the approximately 2-3% of readings lost to temporary network issues.

Common Pitfalls in IoT Data Quality

1. Skipping validation because “the sensor is reliable” Even high-quality sensors fail. A $500 industrial temperature sensor can still produce garbage readings when its wiring corrodes, its power supply fluctuates, or firmware bugs cause buffer overflows. Always validate.

2. Using the same thresholds for all environments A valid temperature range for an indoor office (15-30 degrees Celsius) is completely wrong for a cold storage facility (-25 to -15 degrees Celsius) or a server room (18-27 degrees Celsius). Validation rules must be context-specific.

3. Over-smoothing the signal Aggressive noise filtering (large window moving averages, very low alpha in EMA) removes real events along with noise. A sudden temperature spike might be a genuine HVAC failure, not noise. Balance smoothness with responsiveness.

4. Ignoring sensor drift A sensor that reads 0.5 degrees Celsius too high on day 1 might read 3 degrees too high by month 6. Without periodic recalibration or drift detection, your “clean” data slowly becomes systematically wrong.

5. Normalizing before cleaning If you normalize data that contains outliers, the outliers distort the scaling parameters (min, max, mean, standard deviation), making all your normalized values wrong. Always clean first, then normalize.

6. Treating all missing data the same A 30-second gap (one missed reading) is very different from a 2-hour gap (network outage). Simple forward fill works for the former but introduces dangerous stale data for the latter. Match your imputation strategy to the gap duration and sensor type.

Try It: Normalization Methods Comparison

Enter a set of sensor values (including an outlier) to see how three normalization methods – Min-Max, Z-Score, and Robust Scaling – handle the data differently. Notice how outliers distort Min-Max and Z-Score but have less effect on Robust Scaling.
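The effect described above can be reproduced in a few lines; the data set and its single stuck-sensor outlier are illustrative:

```python
import statistics

data = [20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 500.0]

def min_max(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    mu, sd = statistics.mean(values), statistics.pstdev(values)
    return [(v - mu) / sd for v in values]

def robust_scale(values):
    med = statistics.median(values)
    q1, _, q3 = statistics.quantiles(values, n=4)
    return [(v - med) / (q3 - q1) for v in values]

# Min-max squashes the eight normal readings into roughly [0, 0.015]
# because the outlier defines the maximum. Z-score is also distorted:
# with 9 points the outlier's |z| cannot exceed sqrt(n-1) ~ 2.83, so a
# threshold of 3 misses it entirely. Robust scaling, built on the median
# and IQR, keeps the normal readings spread out and isolates the outlier.
```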

42.8 Knowledge Check

Test your understanding of data quality preprocessing concepts:

42.9 Summary and Key Takeaways

Data quality preprocessing is not optional in IoT systems – it is the critical foundation that determines whether your analytics, ML models, and automated decisions can be trusted.

Core principles to remember:

  1. Follow the pipeline order: Validate first, clean second, transform third. Skipping or reordering stages causes compounding errors.
  2. Catch issues at the edge: The 1-10-100 rule shows that prevention at the source is 100x cheaper than fixing downstream failures.
  3. Customize for context: Validation thresholds, imputation strategies, and normalization methods must match the specific sensor type, deployment environment, and downstream use case.
  4. Always flag imputed data: Downstream analysis needs to know which values are measured versus estimated. Never silently replace data.
  5. Balance filtering with responsiveness: Over-smoothing removes real events. Under-smoothing leaves noise that corrupts analysis. Tune your filters to the specific signal characteristics.

42.10 Learning Path

Recommended order:

  1. Start with Data Validation and Outlier Detection to understand how to catch invalid data at the source
  2. Continue with Missing Value Imputation and Noise Filtering to learn gap handling and signal smoothing
  3. Complete with Data Normalization and Preprocessing Lab for scaling techniques and hands-on practice

42.11 Concept Relationships

This overview chapter introduces the three-stage data quality pipeline that underpins all IoT analytics. The validate-clean-transform sequence is critical because each stage builds on the previous one – skipping or reordering stages causes compounding errors.

Critical Dependencies:

  • Edge Data Acquisition – Where raw data originates; edge preprocessing catches issues at source (negligible cost vs. expensive cloud fixes)
  • Sensor Fundamentals – Understanding sensor drift, noise, and failure modes informs validation thresholds

Downstream Applications (Require clean data):

  • Multi-Sensor Data Fusion – Combining sensors; garbage data in one sensor poisons the entire fused output
  • Anomaly Detection – Finding meaningful outliers; poor quality data creates false positives that drown real anomalies
  • Modeling and Inferencing – ML models amplify data quality issues; a 5% error rate in training data can cause 30% accuracy drop

42.12 What’s Next

| If you want to… | Read this |
|---|---|
| Learn imputation and filtering in detail | Data Quality Imputation and Filtering |
| Practise normalisation in a hands-on lab | Data Quality Normalisation Lab |
| Understand data quality validation | Data Quality Validation |
| Apply preprocessed data to ML pipelines | Modeling and Inferencing |
| Return to the module overview | Big Data Overview |
