By the end of this chapter series, you will be able to:
Design a Data Quality Pipeline: Architect a validate-clean-transform workflow for IoT sensor streams
Compare Preprocessing Techniques: Evaluate validation, imputation, and normalization methods based on sensor type and data characteristics
Implement Edge-Side Preprocessing: Build and test resource-efficient data quality checks that run on constrained devices
Calculate Data Quality Impact: Quantify the cost of poor data quality versus the investment in preprocessing using the 1-10-100 rule
Diagnose Common Pitfalls: Identify and prevent the most frequent data quality mistakes in IoT systems
Key Concepts
Data preprocessing pipeline: A sequenced set of transformations applied to raw sensor data: validation → imputation → filtering → normalization → feature extraction → aggregation, each step feeding the next.
Outlier detection and treatment: The process of identifying readings that lie far outside the expected range and deciding whether to remove, cap, or flag them before analysis.
Feature extraction: The transformation of raw sensor time series into informative features (mean, variance, FFT components, zero-crossing rate) that capture the patterns relevant to the downstream task.
Data windowing: Dividing a continuous sensor stream into fixed-length or event-triggered windows for batch feature extraction and ML model input.
Schema-on-read vs schema-on-write: Two approaches to data structure enforcement: schema-on-write validates structure at ingestion (preferred for quality), schema-on-read allows raw storage and validates at query time (flexible but risky).
In 60 Seconds
Data preprocessing transforms raw, noisy IoT sensor readings into clean, structured inputs suitable for analytics and machine learning — and the quality of this step determines the accuracy ceiling of every downstream analysis. The pipeline typically covers validation, imputation, filtering, normalization, and feature extraction, and each step must be designed with the specific sensor characteristics and downstream use case in mind.
Minimum Viable Understanding
Data quality preprocessing is a three-stage pipeline – validate (reject impossible values), clean (fill gaps and remove noise), transform (normalize for analysis) – applied at the edge before data reaches the cloud.
Bad data costs 10-100x more to fix downstream – a corrupt sensor reading can trigger false alarms, shut down equipment, or poison ML models, so catching issues at the source is critical.
Every sensor type needs tailored quality rules – temperature cannot exceed physical bounds, humidity cannot go above 100%, and rate-of-change limits must match the physical process being measured.
42.2 Overview
Data quality preprocessing is the foundation of trustworthy IoT analytics. Raw sensor data is inherently noisy, incomplete, and sometimes outright wrong. This series explores practical techniques for detecting and correcting data quality issues in real-time on resource-constrained edge devices.
Sensor Squad: The Data Quality Detectives
Sammy the Sensor was on duty at the Smart City Weather Station when the alarm went off.
“Boss! We just got a temperature reading of 500 degrees from Sensor 47 in Central Park!” shouted Lila the LED, flashing red.
Sammy stayed calm. “That is hotter than a pizza oven. There is NO WAY a park bench is 500 degrees. Let me check our Data Quality Checklist.”
Sammy pulled out the checklist:
Step 1 – VALIDATE: “Is 500 degrees even possible outdoors? Nope! The hottest place on Earth is about 57 degrees Celsius. REJECTED!”
Step 2 – CLEAN: “Sensor 47 also had a gap last Tuesday. Let me fill it using the readings from nearby Sensors 46 and 48.”
Step 3 – TRANSFORM: “Now let me put all the temperatures on the same scale so our weather prediction model can understand them.”
Max the Microcontroller nodded. “Without your detective work, the weather app would have told everyone the park was on fire!”
Bella the Battery added, “And I saved energy by catching that bad reading right here at the edge, instead of sending it all the way to the cloud!”
The lesson: Always check your data before trusting it. Bad data in means bad decisions out!
For Beginners: What Is Data Quality Preprocessing?
Imagine you are baking a cake. Before you start mixing, you check your ingredients: Is the flour fresh or expired? Is there enough sugar? Did someone accidentally put salt in the sugar jar?
Data quality preprocessing works the same way. Before IoT data is analyzed or used to make decisions, it must be checked and cleaned:
Validate – Is this sensor reading even physically possible? (A room temperature of 500 degrees is not.)
Clean – Are there gaps or noise in the data? Fill the gaps and smooth out the noise.
Transform – Are all the readings on the same scale? Convert them so different sensors can be compared.
Why does this matter?
A smart thermostat that trusts a faulty temperature reading might blast the AC on a cold day
A factory monitoring system that ignores data quality might miss a real equipment failure hidden in noisy data
A health monitor that does not validate readings might send a false emergency alert
Key concept: It is much cheaper and faster to catch data problems at the edge (right where the sensor is) than to fix them later in the cloud. Think of it as proofreading your essay before submitting it, not after the teacher has graded it.
42.3 The Data Quality Problem in IoT
IoT systems generate massive volumes of sensor data, but raw readings are rarely analysis-ready. Studies consistently show that data scientists spend 60-80% of their time on data preparation, and in IoT contexts the challenges are amplified by:
Harsh environments: Sensors deployed outdoors, in factories, or underwater face temperature extremes, vibration, and electromagnetic interference
Resource constraints: Edge devices have limited CPU, memory, and power for sophisticated processing
Real-time requirements: Many IoT applications need clean data in milliseconds, not hours
Scale: Thousands of sensors producing readings every second create a firehose of potentially dirty data
42.4 The Three-Stage Pipeline
The data quality pipeline follows a strict validate-clean-transform sequence. Each stage builds on the previous one, and skipping a stage leads to compounding errors downstream.
Figure 42.1: The validate-clean-transform pipeline.

| Stage | Purpose | Key Techniques | Typical Edge Cost |
|-------|---------|----------------|-------------------|
| 1. Validate | Reject impossible readings | Range checks, rate-of-change limits, cross-sensor plausibility | Very low (simple comparisons) |
| 2. Clean | Fill gaps, remove noise | Forward fill, interpolation, moving average, median filter | Low (small rolling buffers) |
| 3. Transform | Normalize for analysis | Min-max normalization, z-score, robust scaling | Low (simple arithmetic) |
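The three stages can be sketched as a small edge-side routine. This is a minimal illustration, not production code: the thresholds, EMA alpha, and zone bounds are example values, and the function names are ours.

```javascript
// Minimal sketch of the validate-clean-transform pipeline for a single
// temperature sensor. All thresholds and the EMA alpha are example values.
const MIN_C = -10;   // valid range minimum
const MAX_C = 60;    // valid range maximum
const ALPHA = 0.3;   // EMA smoothing factor

// Stage 1: reject physically impossible readings
function validate(tempC) {
  return tempC >= MIN_C && tempC <= MAX_C;
}

// Stage 2: smooth noise with an exponential moving average
function clean(tempC, previousC) {
  return ALPHA * tempC + (1 - ALPHA) * previousC;
}

// Stage 3: min-max normalize into [0, 1] using zone bounds
function transform(tempC, zoneMin, zoneMax) {
  return (tempC - zoneMin) / (zoneMax - zoneMin);
}

function processReading(rawC, previousC, zoneMin, zoneMax) {
  if (!validate(rawC)) return null; // rejected: never reaches later stages
  return transform(clean(rawC, previousC), zoneMin, zoneMax);
}

console.log(processReading(500, 22, 18, 28)); // null (rejected at Stage 1)
console.log(processReading(24, 22, 18, 28));  // ~0.46 (cleaned to 22.6, then normalized)
```

Note how an invalid reading short-circuits the pipeline: the cleaning and transform stages never see it, which is exactly the "each stage builds on the previous one" ordering described above.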
Enter a raw sensor reading and configure validation rules to see how data flows through the validate-clean-transform pipeline. Introduce noise, outliers, or missing values to observe how each stage responds.
| Scenario | Missing Safeguard | Failure Cost | Prevention Cost | Ratio |
|----------|-------------------|--------------|-----------------|-------|
| Agricultural IoT: Moisture sensors drift 5% over 6 months | No drift detection | $12,000/season (30% water overuse on 50-hectare farm) | $50 (quarterly calibration check) | 240:1 |
| Cold chain: Sensor gap during transport not flagged | No gap detection | $500,000 (rejected pharmaceutical shipment) | $0 (missing-reading counter) | Infinite |
| Smart grid: CT sensor phase error corrupts power readings | No cross-sensor validation | $8,000/month (billing errors for 200 units) | $0 (compare with utility meter) | Infinite |
The pattern is consistent: prevention costs are negligible (simple comparisons, a few CPU cycles) while failure costs range from hundreds to hundreds of thousands of dollars. This is why data quality should be the first thing you implement, not the last.
42.6.1 Try It: Data Quality Cost Calculator
Use this interactive calculator to explore the 1-10-100 rule with your own failure scenario. Adjust the costs to see how the prevention-to-failure ratio changes.
For IoT, prevention is so cheap (single comparison) that the ratio is effectively infinite—making edge-side validation mandatory, not optional.
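The calculator's core arithmetic is a single division. As a tiny sketch (the function name is ours, not from the chapter):

```javascript
// Prevention-to-failure ratio behind the 1-10-100 rule: failure cost
// divided by prevention cost. A $0 prevention cost (a free software
// check) makes the ratio infinite.
function preventionRatio(failureCost, preventionCost) {
  if (preventionCost === 0) return Infinity;
  return failureCost / preventionCost;
}

console.log(preventionRatio(12000, 50)); // 240 (the agricultural drift scenario)
console.log(preventionRatio(500000, 0)); // Infinity (a missing-reading counter costs $0)
```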
42.7 Worked Example: Smart Building Temperature Pipeline
Scenario: You are deploying 200 temperature sensors across a commercial building for HVAC optimization. Each sensor reports every 30 seconds. You need clean, analysis-ready data for the building management system.
Step 1 – Define Validation Rules
First, establish what constitutes valid data for your specific deployment:
| Rule | Threshold | Rationale |
|------|-----------|-----------|
| Physical range | -10 to 60 degrees Celsius | Building is climate-controlled but accounts for loading docks |
| Rate of change | Max 2 degrees Celsius per minute | Physical thermal mass prevents faster changes |
| Cross-sensor | Max 8 degrees Celsius difference from nearest neighbor | Adjacent zones should not differ drastically |
| Staleness | Max 5 minutes between readings | Sensor or network failure if gap exceeds this |
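The four rules might be combined into a single check function. This is a hedged sketch: the reading and context object shapes are assumptions for illustration, while the thresholds follow the table above.

```javascript
// Sketch of the four validation rules. Object shapes (tempC, timestamp,
// lastTempC, neighborTempC) are illustrative assumptions.
function validateReading(reading, context) {
  const errors = [];
  const minutes = (reading.timestamp - context.lastTimestamp) / 60000;

  // Rule 1: physical range, -10 to 60 degrees Celsius
  if (reading.tempC < -10 || reading.tempC > 60) errors.push("out_of_range");

  // Rule 2: rate of change, max 2 degrees Celsius per minute
  if (minutes > 0 && Math.abs(reading.tempC - context.lastTempC) / minutes > 2) {
    errors.push("rate_exceeded");
  }

  // Rule 3: cross-sensor, max 8 degrees from nearest neighbor
  if (context.neighborTempC !== undefined &&
      Math.abs(reading.tempC - context.neighborTempC) > 8) {
    errors.push("neighbor_mismatch");
  }

  // Rule 4: staleness, max 5 minutes between readings
  if (minutes > 5) errors.push("stale_gap");

  return { valid: errors.length === 0, errors };
}

const ctx = { lastTempC: 22.8, lastTimestamp: 0, neighborTempC: 23.1 };
console.log(validateReading({ tempC: 23.1, timestamp: 30000 }, ctx));
// { valid: true, errors: [] }
console.log(validateReading({ tempC: 500, timestamp: 30000 }, ctx).errors);
// [ 'out_of_range', 'rate_exceeded', 'neighbor_mismatch' ]
```

Returning the full list of failed rules, rather than a bare boolean, makes it cheap to log why a reading was rejected, which matters later when diagnosing sensor faults.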
Step 2 – Design Cleaning Strategy
For readings that pass validation but have quality issues:
IF reading is missing (gap < 5 minutes):
Use linear interpolation from neighbors
ELSE IF reading is missing (gap >= 5 minutes):
Use forward fill + flag as "imputed"
IF noise detected (high-frequency fluctuation > 0.5C):
Apply exponential moving average (alpha = 0.3)
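The cleaning strategy above could look like the following sketch. The helper names and the sample layout are ours; the thresholds (5-minute gap cutoff, alpha = 0.3) follow the pseudocode.

```javascript
// Short gap: linear interpolation between the two valid neighbor readings
function interpolateGap(before, after, gapLength) {
  const out = [];
  for (let i = 1; i <= gapLength; i++) {
    out.push(before + (after - before) * (i / (gapLength + 1)));
  }
  return out;
}

// Dispatch on gap duration; long-gap values are forward-filled and flagged
function cleanGap(before, after, gapLength, gapMinutes) {
  if (gapMinutes < 5) {
    return { values: interpolateGap(before, after, gapLength), flag: "interpolated" };
  }
  return { values: Array(gapLength).fill(before), flag: "imputed" }; // forward fill
}

// Noise: exponential moving average over a run of readings
function smooth(values, alpha = 0.3) {
  const out = [values[0]];
  for (let i = 1; i < values.length; i++) {
    out.push(alpha * values[i] + (1 - alpha) * out[i - 1]);
  }
  return out;
}

console.log(cleanGap(22, 23, 3, 1.5).values); // [22.25, 22.5, 22.75]
console.log(cleanGap(22, 23, 3, 12).flag);    // "imputed" (forward fill, flagged)
console.log(smooth([22, 25, 22]));            // ~[22, 22.9, 22.63]
```

Carrying the flag alongside the values matters: downstream consumers must be able to tell measured readings from estimated ones.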
Try It: Imputation and Noise Filtering Explorer
Simulate a sensor data stream with missing values and noise. Choose an imputation method and a noise filter to see how different strategies affect the output. The chart shows raw data (with gaps), imputed values, and the filtered signal.
Step 3 – Apply Normalization
For the ML-based HVAC optimization model:
Normalized_Temp = (Raw_Temp - Zone_Min) / (Zone_Max - Zone_Min)
Where:
Zone_Min = historical minimum for this zone (e.g., 18C for office)
Zone_Max = historical maximum for this zone (e.g., 28C for office)
Result: values in [0, 1] range for neural network input
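The formula above as a function, with a guard for inverted zone bounds (a minimal sketch):

```javascript
// Zone-based min-max normalization. Zone bounds come from historical
// per-zone data, as described above.
function normalizeTemp(rawTemp, zoneMin, zoneMax) {
  if (zoneMax <= zoneMin) throw new Error("zoneMax must exceed zoneMin");
  return (rawTemp - zoneMin) / (zoneMax - zoneMin);
}

console.log(normalizeTemp(23, 18, 28)); // 0.5
console.log(normalizeTemp(30, 18, 28)); // 1.2 -- outside [0, 1], worth flagging for review
```

A result outside [0, 1] means the reading exceeded the zone's historical bounds: either the zone bounds need updating or the sensor deserves a closer look.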
42.7.1 Try It: Normalization and Throughput Calculator
Experiment with different temperature readings, zone bounds, and sensor configurations to see how min-max normalization works and how pipeline throughput scales.
Result: With the default settings, the pipeline processes 24,000 readings per hour (200 sensors x 2 readings/min x 60 min). With edge-side validation, roughly 0.1-0.5% of readings are flagged or rejected, preventing those errors from reaching the HVAC control algorithm. The cleaning stage fills the approximately 2-3% of readings lost to temporary network issues.
Common Pitfalls in IoT Data Quality
1. Skipping validation because “the sensor is reliable” Even high-quality sensors fail. A $500 industrial temperature sensor can still produce garbage readings when its wiring corrodes, its power supply fluctuates, or firmware bugs cause buffer overflows. Always validate.
2. Using the same thresholds for all environments A valid temperature range for an indoor office (15-30 degrees Celsius) is completely wrong for a cold storage facility (-25 to -15 degrees Celsius) or a server room (18-27 degrees Celsius). Validation rules must be context-specific.
3. Over-smoothing the signal Aggressive noise filtering (large window moving averages, very low alpha in EMA) removes real events along with noise. A sudden temperature spike might be a genuine HVAC failure, not noise. Balance smoothness with responsiveness.
4. Ignoring sensor drift A sensor that reads 0.5 degrees Celsius too high on day 1 might read 3 degrees too high by month 6. Without periodic recalibration or drift detection, your “clean” data slowly becomes systematically wrong.
5. Normalizing before cleaning If you normalize data that contains outliers, the outliers distort the scaling parameters (min, max, mean, standard deviation), making all your normalized values wrong. Always clean first, then normalize.
6. Treating all missing data the same A 30-second gap (one missed reading) is very different from a 2-hour gap (network outage). Simple forward fill works for the former but introduces dangerous stale data for the latter. Match your imputation strategy to the gap duration and sensor type.
Try It: Normalization Methods Comparison
Enter a set of sensor values (including an outlier) to see how three normalization methods – Min-Max, Z-Score, and Robust Scaling – handle the data differently. Notice how outliers distort Min-Max and Z-Score but have less effect on Robust Scaling.
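The three methods compared here can be sketched as follows. This is a simplified version: the robust scaler uses plain index positions for the median and quartiles rather than interpolated quantiles.

```javascript
function minMax(values) {
  const lo = Math.min(...values), hi = Math.max(...values);
  return values.map(v => (v - lo) / (hi - lo));
}

function zScore(values) {
  const mean = values.reduce((a, b) => a + b, 0) / values.length;
  const std = Math.sqrt(values.reduce((a, v) => a + (v - mean) ** 2, 0) / values.length);
  return values.map(v => (v - mean) / std);
}

function robust(values) {
  // (v - median) / IQR, with simple index-based quartile estimates
  const s = [...values].sort((a, b) => a - b);
  const median = s[Math.floor(s.length / 2)];
  const iqr = s[Math.floor(3 * s.length / 4)] - s[Math.floor(s.length / 4)];
  return values.map(v => (v - median) / iqr);
}

// A single 500 C outlier stretches the min-max range from 3 C to 479 C,
// compressing every normal reading toward zero; robust scaling is unaffected.
const data = [21, 22, 22, 23, 23, 24, 500];
console.log(minMax(data).map(v => v.toFixed(3))); // normal values all land near 0
console.log(robust(data).map(v => v.toFixed(3))); // normal values stay spread out
```

This is the code-level version of pitfall 5 above: when outliers cannot be removed before scaling, median- and IQR-based statistics keep the normal readings usable.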
42.8 Knowledge Check
Test your understanding of data quality preprocessing concepts:
Interactive Quiz: Match Concepts
Interactive Quiz: Sequence the Steps
Label the Diagram
Code Challenge
42.9 Summary and Key Takeaways
Data quality preprocessing is not optional in IoT systems – it is the critical foundation that determines whether your analytics, ML models, and automated decisions can be trusted.
Core principles to remember:
Follow the pipeline order: Validate first, clean second, transform third. Skipping or reordering stages causes compounding errors.
Catch issues at the edge: The 1-10-100 rule shows that prevention at the source is 100x cheaper than fixing downstream failures.
Customize for context: Validation thresholds, imputation strategies, and normalization methods must match the specific sensor type, deployment environment, and downstream use case.
Always flag imputed data: Downstream analysis needs to know which values are measured versus estimated. Never silently replace data.
Balance filtering with responsiveness: Over-smoothing removes real events. Under-smoothing leaves noise that corrupts analysis. Tune your filters to the specific signal characteristics.
This overview chapter introduces the three-stage data quality pipeline that underpins all IoT analytics. The validate-clean-transform sequence is critical because each stage builds on the previous one – skipping or reordering stages causes compounding errors.
Critical Dependencies:
Edge Data Acquisition – Where raw data originates; edge preprocessing catches issues at source (negligible cost vs. expensive cloud fixes)