43 Imputation & Noise Filtering
43.1 Learning Objectives
By the end of this chapter, you will be able to:
- Implement Missing Data Handling: Apply appropriate imputation strategies (forward-fill, interpolation, seasonal decomposition) for different sensor types
- Compare Imputation Methods: Evaluate trade-offs between forward-fill, interpolation, and seasonal decomposition based on data characteristics
- Design Noise Filters: Implement moving average, median, and exponential smoothing filters and assess their signal conditioning performance
- Distinguish Strategy by Sensor Type: Justify the correct imputation and filtering approach based on sensor semantics and physical behavior
43.2 Prerequisites
Before diving into this chapter, you should be familiar with:
- Data Validation and Outlier Detection: Understanding validation as the first stage of the data quality pipeline
- Signal Processing Essentials: Basic concepts of filtering and signal conditioning
- Sensor Fundamentals: Knowledge of different sensor types and their output characteristics
For Kids: The Sensor Squad Fills the Gaps!
Missing data is like a puzzle with missing pieces - but we can use clues to fill them in!
43.2.1 The Sensor Squad Adventure: The Mystery of the Missing Messages
Max the Microcontroller was worried. “Motion Mo didn’t send ANY reading! His battery must have died.”
“Oh no!” said Lila the LED. “What do we put in the report?”
Sammy the Sensor thought carefully. “Well, what kind of sensor is Mo?”
“He’s a motion sensor - he only says something when he sees movement!”
Sammy smiled. “Then if he didn’t report anything… what do you think that means?”
Max realized: “No message means no motion! We can write down ‘No motion detected!’”
But then they looked at Temperature Terry’s readings: 22… 22… GAP… GAP… GAP… 23.
“Hmm,” said Sammy. “Temperature changes slowly. If it was 22 before and 23 after, what was it probably during the gap?”
Bella the Battery did the math: “Probably 22… then 22.5… then 23! It was slowly warming up!”
The Sensor Squad learned two tricks:
- For motion sensors: No news = no motion (use zero!)
- For temperature: Connect the dots between the readings we DO have
43.2.2 Smoothing Out the Noise
Pressure Pete’s readings were jumping around: 100, 5, 98, 3, 101…
“Wait,” said Sammy. “Pressure can’t really jump from 100 to 5 and back! That’s just noise - like static on a radio.”
“How do we fix it?” asked Lila.
“Let me take the AVERAGE of a few readings: (100 + 5 + 98 + 3 + 101) / 5 = about 61… Hmm, that’s not right either because those 5s and 3s are wrong!”
Bella suggested: “What if we put them in order and take the MIDDLE one? 3, 5, 98, 100, 101 - the middle is 98! That’s probably the real pressure!”
The Sensor Squad learned: The median filter ignores the crazy outliers and finds the true reading!
43.2.3 Key Words for Kids
| Word | What It Means |
|---|---|
| Missing Data | When a sensor doesn’t send any information - like a friend who doesn’t answer |
| Imputation | Filling in gaps with good guesses based on clues |
| Noise | Random jumpy readings that hide the real information |
| Filtering | Smoothing out the noise to find the real signal |
| Median | The middle number when you sort them in order |
For Beginners: Why Do We Need Imputation and Filtering?
Missing data and noise are inevitable in real IoT deployments. Sensors lose power, networks drop packets, and electronic noise corrupts readings. Rather than discard incomplete data, we can intelligently fill gaps and smooth noise.
Two key challenges:
| Challenge | Cause | Solution |
|---|---|---|
| Missing Values | Battery death, network outage, sensor failure | Imputation (filling gaps) |
| Noisy Signals | Electrical interference, quantization, vibration | Filtering (smoothing) |
Important distinction:
- Missing: No data point received at all
- Noisy: Data received but corrupted or fluctuating
Key question this chapter answers: “How do I fill gaps in sensor data and smooth out noise without losing important information?”
Minimum Viable Understanding: Imputation and Filtering
Core Concept: Missing value imputation fills data gaps using neighboring or historical values, while noise filtering smooths random fluctuations to reveal the underlying signal - both must be matched to sensor semantics.
Why It Matters: Analytics and ML models require complete data series. Gaps cause errors or require discarding entire time windows. Noise obscures real patterns and triggers false alerts. Proper handling preserves data integrity while enabling downstream processing.
Key Takeaway: Match your strategy to sensor type - use forward-fill for slow-changing values (temperature), zero for event sensors (motion), and median filter for spike removal. Never interpolate binary/categorical data or forward-fill event streams.
43.3 Missing Value Imputation
Key Concepts
- Missing data imputation: The process of estimating and filling in missing sensor readings using statistical or model-based methods rather than discarding incomplete records.
- Forward-fill (Last Observation Carried Forward): An imputation strategy that replaces a missing value with the most recent valid reading — appropriate for slowly changing sensors but misleading for rapidly varying signals.
- Linear interpolation: Estimating a missing value by drawing a straight line between the surrounding valid readings — appropriate for sensors with smooth, continuous dynamics.
- Moving average filter: A signal smoothing technique that replaces each reading with the mean of a surrounding window, attenuating high-frequency noise at the cost of introducing lag.
- Median filter: A non-linear filter that replaces each reading with the median of its window, highly effective at removing impulse noise (transient spikes) without distorting edges.
- Missingness mechanism: The reason data is missing — Missing Completely at Random (MCAR), Missing at Random (MAR), or Missing Not at Random (MNAR) — which determines which imputation methods are statistically valid.
43.3.1 Forward Fill (Last Observation Carried Forward)
Best for slowly-changing values like temperature:
class ForwardFillImputer:
def __init__(self, max_gap=60): # Maximum gap in samples
self.last_valid = None
self.gap_count = 0
self.max_gap = max_gap
def impute(self, value, is_valid):
if is_valid:
self.last_valid = value
self.gap_count = 0
return value, "original"
if self.last_valid is None:
return None, "no_history"
self.gap_count += 1
if self.gap_count > self.max_gap:
return None, "gap_too_large"
return self.last_valid, "imputed_ffill"
43.3.2 Linear Interpolation
Better for trending values when future data is available:
def linear_interpolate(data, timestamps):
"""
Interpolate missing values (None/NaN) using linear interpolation.
Requires knowledge of surrounding valid points.
"""
import numpy as np
data = np.array(data, dtype=float)
timestamps = np.array(timestamps, dtype=float)
valid_mask = ~np.isnan(data)
valid_indices = np.where(valid_mask)[0]
if len(valid_indices) < 2:
return data
# Interpolate
interpolated = np.interp(
timestamps,
timestamps[valid_mask],
data[valid_mask]
)
return interpolated
43.3.3 Seasonal Decomposition Fill
For data with known patterns (e.g., temperature with daily cycles):
def seasonal_fill(data, period=24):
"""
Fill missing values using seasonal pattern from historical data.
period: number of samples in one cycle (e.g., 24 for hourly data with daily cycle)
"""
import numpy as np
data = np.array(data, dtype=float)
n = len(data)
# Calculate seasonal pattern from valid data
seasonal = np.zeros(period)
counts = np.zeros(period)
for i, val in enumerate(data):
if not np.isnan(val):
seasonal[i % period] += val
counts[i % period] += 1
# Average seasonal values
with np.errstate(divide='ignore', invalid='ignore'):
seasonal = np.where(counts > 0, seasonal / counts, np.nan)
# Fill missing with seasonal pattern
filled = data.copy()
for i in range(n):
if np.isnan(filled[i]) and not np.isnan(seasonal[i % period]):
filled[i] = seasonal[i % period]
return filled
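The interpolation step at the heart of `linear_interpolate` above can be exercised on a small gap-laden series. A minimal self-contained sketch (values and timestamps invented for illustration):

```python
import numpy as np

# A temperature series sampled every minute, with a 3-sample gap (NaN)
data = np.array([22.0, 22.1, np.nan, np.nan, np.nan, 22.5], dtype=float)
t = np.arange(len(data), dtype=float)  # timestamps in minutes

valid = ~np.isnan(data)
# Linear interpolation across the gap, as in linear_interpolate() above
filled = np.interp(t, t[valid], data[valid])
print(filled)  # the gap is filled with values rising smoothly from 22.1 to 22.5
```

Note that `np.interp` also holds the endpoint values flat outside the valid range, so leading or trailing gaps become a forward/backward fill rather than an extrapolation.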
Common Pitfall: Wrong Imputation Strategy
The mistake: Using forward-fill for event-driven sensors (motion, door open/close) or interpolation for categorical data.
Symptoms:
- Motion sensor shows constant “motion detected” during sensor offline period
- Door sensor shows “open” for hours when sensor battery died while door was open
- Analytics show unrealistic patterns during imputed periods
Why it happens: One-size-fits-all imputation applied without considering sensor semantics.
The fix: Always match imputation strategy to sensor type. See the Decision Framework later in this chapter for a complete sensor-to-strategy mapping table and decision tree.
Prevention: Create sensor metadata that specifies the imputation strategy for each sensor type in your deployment.
Putting Numbers to It
How long can you safely forward-fill temperature data?
For a typical indoor temperature sensor:
- Normal change rate: 0.5°C per hour (HVAC cycles)
- Maximum change rate: 3°C per hour (HVAC failure, door open)
- Sensor sampling: Every 60 seconds
Gap duration analysis:
| Gap Duration | Expected Change | Forward-Fill Error | Acceptable? |
|---|---|---|---|
| 1 minute | 0.008°C (normal) | ~0.01°C | ✓ Excellent |
| 5 minutes | 0.042°C | ~0.05°C | ✓ Very good |
| 30 minutes | 0.25°C | ~0.3°C | ✓ Good (trend analysis OK) |
| 2 hours | 1.0°C | ~1.5°C | ✗ Poor (alert thresholds invalid) |
Formula for maximum safe gap:
\[t_{max} = \frac{\epsilon_{acceptable}}{r_{max}}\]
Where \(\epsilon_{acceptable}\) is the maximum tolerable error and \(r_{max}\) is the maximum expected change rate.
Example: For ±1°C acceptable error and 3°C/hour max rate:
\[t_{max} = \frac{1°C}{3°C/\text{hour}} = 0.33 \text{ hours} = 20 \text{ minutes}\]
Recommendation: Forward-fill temperature for max 20 minutes. Beyond that, flag as “sensor offline” rather than impute.
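The formula above is simple enough to automate as a helper; a minimal sketch (function name hypothetical):

```python
def max_safe_gap_minutes(acceptable_error_c, max_rate_c_per_hour):
    """Maximum gap (in minutes) that forward-fill can safely span:
    t_max = epsilon_acceptable / r_max, converted from hours to minutes."""
    return 60.0 * acceptable_error_c / max_rate_c_per_hour

# Chapter example: +/-1 degC tolerance, 3 degC/hour worst-case drift
print(max_safe_gap_minutes(1.0, 3.0))  # -> 20.0 minutes
```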
43.3.4 Try It: Forward-Fill Gap Calculator
43.4 Noise Filtering Techniques
43.4.1 Moving Average Filter
Simple and effective for steady-state noise reduction:
class MovingAverageFilter:
def __init__(self, window_size=5):
self.window_size = window_size
self.buffer = []
def filter(self, value):
self.buffer.append(value)
if len(self.buffer) > self.window_size:
self.buffer.pop(0)
return sum(self.buffer) / len(self.buffer)
43.4.2 Median Filter
Excellent for removing spike noise while preserving edges:
class MedianFilter:
def __init__(self, window_size=5):
self.window_size = window_size
self.buffer = []
def filter(self, value):
self.buffer.append(value)
if len(self.buffer) > self.window_size:
self.buffer.pop(0)
sorted_buffer = sorted(self.buffer)
mid = len(sorted_buffer) // 2
if len(sorted_buffer) % 2 == 0:
return (sorted_buffer[mid - 1] + sorted_buffer[mid]) / 2
return sorted_buffer[mid]
43.4.3 Exponential Smoothing
Provides weighted average with more weight on recent values:
class ExponentialSmoothingFilter:
def __init__(self, alpha=0.3):
"""
alpha: smoothing factor (0-1)
Higher alpha = more weight on recent values = less smoothing
Lower alpha = more weight on history = more smoothing
"""
self.alpha = alpha
self.smoothed = None
def filter(self, value):
if self.smoothed is None:
self.smoothed = value
else:
self.smoothed = self.alpha * value + (1 - self.alpha) * self.smoothed
return self.smoothed
43.4.4 Filter Comparison
| Filter | Latency | Edge Preservation | Spike Removal | Best For |
|---|---|---|---|---|
| Moving Average | (N-1)/2 samples | Poor | Moderate | Steady-state signals |
| Median | (N-1)/2 samples | Excellent | Excellent | Spike-contaminated data |
| Exponential | ≈ (1-α)/α samples | Good | Moderate | Real-time smoothing |
| Kalman | Minimal | Excellent | Excellent | Known dynamics, sensor fusion |
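The comparison above can be checked empirically. A self-contained sketch that re-implements the three filters from this chapter in compact streaming form and feeds them a spike-contaminated signal:

```python
from statistics import median

def moving_average(xs, n=5):
    buf, out = [], []
    for x in xs:
        buf.append(x)
        if len(buf) > n:
            buf.pop(0)
        out.append(sum(buf) / len(buf))  # mean of current window
    return out

def median_filter(xs, n=5):
    buf, out = [], []
    for x in xs:
        buf.append(x)
        if len(buf) > n:
            buf.pop(0)
        out.append(median(buf))  # middle value ignores isolated spikes
    return out

def exp_smooth(xs, alpha=0.3):
    s, out = None, []
    for x in xs:
        s = x if s is None else alpha * x + (1 - alpha) * s
        out.append(s)
    return out

signal = [100, 100, 5, 100, 100, 100, 3, 100, 100]  # two impulse spikes
print(median_filter(signal)[-1])   # spike rejected entirely
print(moving_average(signal)[-1])  # spike still leaks into the average
```

The median filter outputs 100 throughout, while the moving average's final window still contains the spike value 3 and is dragged down to 80.6, matching the "Spike Removal" column above.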
Putting Numbers to It
How does filter window size affect noise reduction and latency?
For a moving average filter with Gaussian noise (σ = 2.0°C) on temperature sensor:
Noise reduction formula:
\[\sigma_{filtered} = \frac{\sigma_{original}}{\sqrt{N}}\]
Where \(N\) is the window size.
| Window Size (N) | Noise Reduction | Latency (samples) | Temperature Example |
|---|---|---|---|
| 3 | \(\sigma / \sqrt{3} = 0.58\sigma\) | 1.0 | 2.0°C → 1.15°C noise |
| 5 | \(\sigma / \sqrt{5} = 0.45\sigma\) | 2.0 | 2.0°C → 0.89°C noise |
| 10 | \(\sigma / \sqrt{10} = 0.32\sigma\) | 4.5 | 2.0°C → 0.63°C noise |
| 20 | \(\sigma / \sqrt{20} = 0.22\sigma\) | 9.5 | 2.0°C → 0.45°C noise |
Trade-off: Larger window → better noise rejection BUT longer delay detecting real changes.
Latency calculation: Output lags input by \(\frac{N-1}{2}\) samples. For \(N=10\) at 1 Hz sampling → 4.5 second delay.
Practical rule: Choose \(N\) such that latency is < 10% of the timescale you care about. If monitoring hourly HVAC cycles (3600 s), a 5-10 sample window (2-4.5 s latency at 1 Hz) is easily acceptable. If detecting rapid door-opening events (10 s timescale), use \(N=3\) (1 sample ≈ 1 s latency).
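The noise-versus-latency trade-off above can be tabulated with a small helper (function name hypothetical):

```python
import math

def window_tradeoff(sigma, n, sample_rate_hz):
    """Residual noise after an N-point moving average (sigma / sqrt(N))
    and the lag it introduces ((N-1)/2 samples, converted to seconds)."""
    filtered_sigma = sigma / math.sqrt(n)
    latency_s = (n - 1) / 2 / sample_rate_hz
    return filtered_sigma, latency_s

# Chapter example: 2.0 degC Gaussian noise, 1 Hz sampling
for n in (3, 5, 10, 20):
    s, lag = window_tradeoff(2.0, n, 1.0)
    print(f"N={n:2d}: noise {s:.2f} degC, lag {lag:.1f} s")
```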
43.4.5 Try It: Filter Window Size Calculator
43.4.6 Try It: Exponential Smoothing Explorer
43.4.7 Choosing the Right Filter
def choose_filter(signal_characteristics):
"""
Guide for selecting appropriate noise filter based on signal characteristics.
"""
recommendations = {
'steady_state_with_gaussian_noise': {
'filter': 'MovingAverage',
'reason': 'Averages out random noise effectively',
'window_size': 5 # Adjust based on noise frequency
},
'spiky_noise_impulse': {
'filter': 'MedianFilter',
'reason': 'Completely ignores outlier spikes',
'window_size': 5 # Odd number works best
},
'real_time_tracking': {
'filter': 'ExponentialSmoothing',
'reason': 'Cheap single-sample update with low, alpha-tunable lag',
'alpha': 0.3 # Lower = smoother, higher = more responsive
},
'sensor_fusion_known_dynamics': {
'filter': 'KalmanFilter',
'reason': 'Optimal estimation with uncertainty tracking',
'params': 'process_noise, measurement_noise'
},
'edge_preserving': {
'filter': 'MedianFilter',
'reason': 'Preserves sharp transitions in data',
'window_size': 3
}
}
return recommendations.get(signal_characteristics, recommendations['steady_state_with_gaussian_noise'])
43.4.8 Combining Filters
For robust noise removal, filters can be cascaded:
class CombinedFilter:
"""
Two-stage filter: Median first (remove spikes), then exponential smooth.
"""
def __init__(self, median_window=5, exp_alpha=0.3):
self.median_filter = MedianFilter(median_window)
self.exp_filter = ExponentialSmoothingFilter(exp_alpha)
def filter(self, value):
# Stage 1: Remove spikes with median
despike = self.median_filter.filter(value)
# Stage 2: Smooth remaining noise
smooth = self.exp_filter.filter(despike)
return smooth
# Usage
combined = CombinedFilter(median_window=5, exp_alpha=0.2)
for reading in sensor_stream:
clean_value = combined.filter(reading)
Worked Example: Calculating Optimal Filter Window for Temperature Sensor
Scenario: You have a temperature sensor monitoring a cold storage facility. The sensor reports every 10 seconds. You’ve observed occasional spikes due to electrical interference when the cooling compressor starts (5-10°C jumps), and you need to filter these without losing legitimate temperature trends.
Given:
- Sampling rate: 0.1 Hz (1 sample per 10 seconds)
- Normal temperature: -18°C ± 2°C
- Compressor noise: Random spikes to -8°C or -28°C (duration: 1-2 samples)
- Legitimate temperature changes: 0.5°C per minute maximum
Question: Should you use a moving average or median filter, and what window size?
Solution:
Step 1: Analyze the noise characteristics
- Spike duration: 1-2 samples = 10-20 seconds
- Spike frequency: Approximately 5% of readings (every 200 seconds when compressor cycles)
- Spike magnitude: 10°C deviation (huge compared to the normal 2°C variation)
Step 2: Calculate required window size
For moving average:
- Window needs to span the spike duration
- 3-sample window averages the spike into adjacent readings
- Example: [-18, -28, -18] → average = -21.3°C (still shows distortion)
For median filter:
- Window needs an odd number of samples
- 3-sample window: [-18, -28, -18] → median = -18°C (spike completely removed!)
- 5-sample window: [-18, -18, -28, -18, -18] → median = -18°C (still perfect)
Step 3: Verify edge preservation
Legitimate temperature change over 1 minute:
- Rate: 0.5°C/min = 0.083°C per 10 seconds
- Over 5 samples: [-18.0, -18.1, -18.2, -18.3, -18.4]
- Median of 5: -18.2°C (preserves trend!)
Step 4: Calculate latency
Window size 5 = (5-1)/2 = 2 samples delay = 20 seconds latency
For cold storage monitoring (not time-critical), 20 seconds is acceptable.
Answer: Use 5-sample median filter
Why:
- Completely removes 1-2 sample spikes
- Preserves legitimate temperature trends
- No tuning parameters (unlike moving average weights)
- Latency (20s) acceptable for this application
Implementation:
median_filter = MedianFilter(window_size=5)
for reading in sensor_stream:
clean_temp = median_filter.filter(reading)
if clean_temp > -15: # Warming past the safe threshold; check is reliable after filtering
trigger_high_temp_alarm()
Key Insight: Median filters excel when noise is sparse spikes rather than continuous Gaussian noise. The window should be large enough that spikes remain a minority (< 50%) of the values within it.
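The worked example can be verified numerically with a stand-alone 5-sample median filter (re-implemented here so the snippet runs on its own; the stream values are invented to match the scenario):

```python
from statistics import median

def median_filter(xs, n=5):
    """Streaming median filter over a sliding window of up to n samples."""
    buf, out = [], []
    for x in xs:
        buf.append(x)
        if len(buf) > n:
            buf.pop(0)
        out.append(median(buf))
    return out

# Cold-storage stream: steady -18 degC with compressor spikes
# (one 1-sample spike to -28, one 2-sample spike to -8)
stream = [-18, -18, -28, -18, -18, -18, -8, -8, -18, -18, -18]
clean = median_filter(stream)
print(clean)  # every output is -18: both spikes are suppressed
```

Even the 2-sample spike to -8°C is removed, because two spike values out of five are still a minority within the window.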
Decision Framework: Imputation Strategy Selection
When sensor data goes missing, selecting the correct imputation strategy depends on sensor characteristics and downstream requirements. Use this framework to guide your decision:
| Sensor Characteristic | Imputation Strategy | Rationale | Max Gap Duration |
|---|---|---|---|
| Slowly changing continuous (temperature, humidity) | Forward-fill or linear interpolation | Physical inertia prevents rapid changes | 10-30 minutes |
| Event-driven binary (motion detector, door switch) | Zero/False for missing periods | Absence of event signal = no event occurred | Unlimited (but flag gaps >1 hour) |
| Monotonic counter (energy meter, flow meter) | Zero increment for missing period | No reading = no consumption during gap | Up to 1 day |
| Periodic with known pattern (daily temperature cycle) | Seasonal decomposition | Leverage historical pattern | Up to 6 hours |
| High-frequency volatile (stock price, vibration) | Do not impute - mark as missing | Interpolation creates false data | N/A - preserve gaps |
| Redundant sensor array | Use nearby sensor + bias correction | Spatial correlation for better estimate | Depends on sensor density |
Decision Tree:
- Is the sensor event-driven?
- YES → Use zero/default state for missing periods
- NO → Continue to step 2
- Does the signal change slowly? (Rate < 10% per time constant)
- YES → Forward-fill acceptable for gaps < 10x sampling interval
- NO → Continue to step 3
- Is there a known periodic pattern?
- YES → Use seasonal decomposition fill
- NO → Continue to step 4
- Are there nearby sensors measuring the same quantity?
- YES → Use spatial interpolation from neighbors
- NO → Use linear interpolation or mark as missing
Example Application:
def select_imputation_strategy(sensor_type, gap_duration_minutes):
if sensor_type == "motion_detector":
return "zero_fill" # No motion during gap
elif sensor_type == "temperature":
if gap_duration_minutes < 30:
return "forward_fill"
elif gap_duration_minutes < 360:
return "seasonal_fill" # Use daily pattern
else:
return "mark_missing" # Gap too long
elif sensor_type == "energy_meter":
return "zero_increment" # No consumption
elif sensor_type == "vibration":
return "mark_missing" # Cannot safely interpolateWarning Signs of Wrong Strategy:
- Motion sensor shows continuous “detected” during power outage → Used forward-fill instead of zero
- Temperature shows impossible linear ramp over 6-hour gap → Used interpolation instead of seasonal pattern
- Energy meter shows zero consumption for entire day → Used zero_increment for too-long gap (should alarm)
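A lightweight sanity check can catch the first warning sign automatically: an event stream that was wrongly forward-filled tends to contain implausibly long constant runs. A sketch (function name and threshold hypothetical):

```python
def longest_constant_run(values):
    """Length of the longest run of identical consecutive values."""
    if not values:
        return 0
    best = run = 1
    for a, b in zip(values, values[1:]):
        run = run + 1 if b == a else 1
        best = max(best, run)
    return best

# A motion stream that was wrongly forward-filled during an outage
stream = [0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0]
if longest_constant_run(stream) > 5:  # threshold chosen for illustration
    print("suspiciously long constant run: check the imputation strategy")
```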
Common Mistake: Imputing Before Outlier Detection
The Mistake: Running imputation before outlier detection, causing outliers to be forward-filled or interpolated into the data stream, permanently corrupting adjacent readings.
Why It Happens: Data quality pipelines are often built incrementally. Engineers add imputation first (to handle missing data), then later realize they need outlier detection. By then, the pipeline order is established and changing it requires refactoring.
Example of the Problem:
# WRONG: Impute first, detect outliers second
readings = [22.1, 22.3, 99.9, None, None, None, 22.8] # 99.9 is sensor fault
# Step 1: Forward-fill (MISTAKE - happens before outlier removal)
imputed = forward_fill(readings)
# Result: [22.1, 22.3, 99.9, 99.9, 99.9, 99.9, 22.8]
# Step 2: Outlier detection
cleaned = remove_outliers(imputed, threshold=3) # 3-sigma rule
# Result: [22.1, 22.3, REMOVED, REMOVED, REMOVED, REMOVED, 22.8]
# Lost 4 data points! The None values became 99.9 outliers.
The Fix: Always apply outlier detection and validation BEFORE imputation:
# CORRECT: Validate first, impute second
readings = [22.1, 22.3, 99.9, None, None, None, 22.8]
# Step 1: Outlier detection and removal (mark as None)
validated = remove_outliers(readings, threshold=3) # 3-sigma rule
# Result: [22.1, 22.3, None, None, None, None, 22.8]
# Step 2: Forward-fill ONLY the validated stream
imputed = forward_fill(validated)
# Result: [22.1, 22.3, 22.3, 22.3, 22.3, 22.3, 22.8]
# Correctly preserved legitimate data!
Correct Pipeline Order:
- Range Validation → Mark out-of-range values as None
- Rate-of-Change Validation → Mark impossible jumps as None
- Outlier Detection → Mark statistical outliers as None
- Missing Value Imputation → Fill None values using appropriate strategy
- Noise Filtering → Apply smoothing to cleaned data
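The ordering above can be composed into a single pass. A minimal sketch covering stages 1, 2, and 4 with simplified checks (thresholds and function name hypothetical; the statistical-outlier and filtering stages are omitted for brevity):

```python
def clean_pipeline(readings, lo=-40.0, hi=60.0, max_jump=5.0):
    """Validate first, impute second. Returns (values, flags)."""
    # Stages 1-2: range and rate-of-change validation -> mark bad values None
    validated, prev = [], None
    for r in readings:
        bad = (
            r is None
            or not (lo <= r <= hi)
            or (prev is not None and abs(r - prev) > max_jump)
        )
        validated.append(None if bad else r)
        if not bad:
            prev = r
    # Stage 4: forward-fill ONLY the validated stream, flagging filled values
    filled, flags, last = [], [], None
    for v in validated:
        if v is not None:
            last = v
            filled.append(v); flags.append("original")
        else:
            filled.append(last); flags.append("imputed")
    return filled, flags

vals, flags = clean_pipeline([22.1, 22.3, 99.9, None, None, 22.8])
print(vals)  # the 99.9 fault is caught BEFORE imputation, so 22.3 is carried forward
```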
Real-World Impact: A temperature monitoring system in a pharmaceutical warehouse experienced this bug. A faulty sensor spiked to 85°C for one reading before going offline. The spike was forward-filled for 30 minutes (the gap duration), triggering false temperature excursion alarms and requiring destruction of $50,000 worth of temperature-sensitive drugs. Root cause: imputation ran before outlier removal in the data pipeline.
Prevention Checklist:
- Document the pipeline stage order (validate → detect outliers → impute → filter) and enforce it in code review
- Unit-test the pipeline with a stream containing both an outlier and an adjacent gap
- Flag every imputed value so downstream analytics can distinguish real from filled readings
- Alert rather than impute when a gap exceeds the sensor's maximum safe fill duration
43.5 Knowledge Check
Common Pitfalls
1. Forward-filling rapidly changing sensor signals
Forward-filling works for temperature that changes by 0.5°C per minute but creates flat-line artefacts for high-frequency vibration data. Match the imputation method to the signal dynamics.
2. Imputing without flagging filled values
If downstream analytics cannot distinguish real readings from imputed ones, anomaly detectors may flag imputed values as anomalies or ML models may learn from artefacts. Always add an imputation flag column alongside filled values.
3. Using mean imputation for time-series IoT data
Global mean imputation destroys temporal patterns (seasonality, trends) that are the most valuable features in IoT data. Use time-local methods (linear interpolation, seasonal decomposition imputation) instead.
4. Applying a moving average filter with too large a window
A 100-point moving average on a 1 Hz sensor introduces roughly 50 seconds of lag, which is unacceptable for real-time anomaly detection. Balance noise suppression against lag, or use an exponential moving average, whose lag can be tuned through its smoothing factor.
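Pitfall 2 above recommends carrying an imputation flag alongside each filled value. One way to do that (function name and tuple layout hypothetical):

```python
def forward_fill_with_flags(readings):
    """Forward-fill gaps, returning (value, was_imputed) pairs so downstream
    consumers can exclude filled values from anomaly detection or training."""
    out, last = [], None
    for r in readings:
        if r is not None:
            last = r
            out.append((r, False))          # real reading
        elif last is not None:
            out.append((last, True))        # filled value, explicitly flagged
        else:
            out.append((None, True))        # gap with no history yet
    return out

rows = forward_fill_with_flags([21.5, None, None, 21.8])
print(rows)  # [(21.5, False), (21.5, True), (21.5, True), (21.8, False)]
```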
43.6 Summary
Missing value imputation and noise filtering are essential for producing clean, complete sensor data:
- Forward Fill: Simple and effective for slowly-changing continuous values (temperature, humidity)
- Linear Interpolation: Better for trending data when you have values on both sides of the gap
- Seasonal Fill: Use when data has known periodic patterns (daily temperature cycles)
- Sensor-Specific Imputation: Motion sensors get zero, state sensors get “unknown”, counters get zero increment
- Moving Average: Good for steady-state Gaussian noise, but blurs edges
- Median Filter: Excellent for spike removal, preserves sharp transitions
- Exponential Smoothing: Constant-memory and real-time, with lag and responsiveness tunable via the smoothing factor α
Critical Design Principle: Always match your imputation and filtering strategy to the sensor type and data characteristics. A one-size-fits-all approach will produce incorrect results for at least some of your sensors.
See Also
Data Quality Pipeline:
- Data Quality Overview - Full pipeline architecture
- Data Validation - Validation before imputation
- Anomaly Detection - Finding patterns in clean data
Filtering Techniques:
- Kalman Filters - Optimal filtering with fusion
- Complementary Filters - IMU signal conditioning
- Digital Signal Processing - Filter design
Applications:
- Edge Compute Patterns - Real-time filtering at edge
- Time Series Fundamentals - Handling temporal patterns
- Sensor Calibration - Bias correction
43.7 What’s Next
| If you want to… | Read this |
|---|---|
| Understand data quality validation before imputation | Data Quality Validation |
| Apply preprocessing in the broader pipeline context | Data Quality and Preprocessing |
| Practise normalisation techniques in the lab | Data Quality Normalisation Lab |
| Apply clean data to anomaly detection | Anomaly Detection Overview |
| Return to the module overview | Big Data Overview |