```mermaid
%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ecf0f1'}}}%%
flowchart TD
    Start[Missing Data<br/>Detected] --> Type{Sensor Type?}
    Type -->|Continuous<br/>Temperature, Humidity| Cont[Continuous<br/>Measurement]
    Type -->|Event-Based<br/>Motion, Button| Event[Event<br/>Sensor]
    Type -->|State<br/>Door, Switch| State[Binary<br/>State]
    Type -->|Counter<br/>Energy, Traffic| Counter[Cumulative<br/>Counter]
    Cont --> Gap{Gap Size?}
    Gap -->|Short<br/>< 5 samples| FFill[Forward Fill]
    Gap -->|Medium<br/>5-60 samples| Interp[Linear<br/>Interpolation]
    Gap -->|Long<br/>> 60 samples| Seasonal[Seasonal<br/>Decomposition]
    Event --> Zero[Use Zero<br/>No Event]
    State --> Unknown[Flag as<br/>Unknown]
    Counter --> ZeroInc[Zero<br/>Increment]
    style Start fill:#2C3E50,stroke:#16A085,color:#fff
    style FFill fill:#27AE60,stroke:#2C3E50,color:#fff
    style Interp fill:#16A085,stroke:#2C3E50,color:#fff
    style Seasonal fill:#3498DB,stroke:#2C3E50,color:#fff
    style Zero fill:#E67E22,stroke:#2C3E50,color:#fff
    style Unknown fill:#7F8C8D,stroke:#2C3E50,color:#fff
    style ZeroInc fill:#9B59B6,stroke:#2C3E50,color:#fff
```
1307 Missing Value Imputation and Noise Filtering
1307.1 Learning Objectives
By the end of this chapter, you will be able to:
- Handle Missing Data: Implement appropriate imputation strategies for different sensor types and use cases
- Select Imputation Methods: Choose between forward-fill, interpolation, and seasonal decomposition based on data characteristics
- Design Noise Filters: Implement moving average, median, and exponential smoothing filters for signal conditioning
- Match Strategy to Sensor Type: Apply the correct imputation and filtering approach based on sensor semantics
1307.2 Prerequisites
Before diving into this chapter, you should be familiar with:
- Data Validation and Outlier Detection: Understanding validation as the first stage of the data quality pipeline
- Signal Processing Essentials: Basic concepts of filtering and signal conditioning
- Sensor Fundamentals: Knowledge of different sensor types and their output characteristics
Missing data is like a puzzle with missing pieces - but we can use clues to fill them in!
1307.2.1 The Sensor Squad Adventure: The Mystery of the Missing Messages
Max the Microcontroller was worried. “Motion Mo didn’t send ANY reading! His battery must have died.”
“Oh no!” said Lila the LED. “What do we put in the report?”
Sammy the Sensor thought carefully. “Well, what kind of sensor is Mo?”
“He’s a motion sensor - he only says something when he sees movement!”
Sammy smiled. “Then if he didn’t report anything… what do you think that means?”
Max realized: “No message means no motion! We can write down ‘No motion detected!’”
But then they looked at Temperature Terry’s readings: 22… 22… GAP… GAP… GAP… 23.
“Hmm,” said Sammy. “Temperature changes slowly. If it was 22 before and 23 after, what was it probably during the gap?”
Bella the Battery did the math: “Probably 22… then 22.5… then 23! It was slowly warming up!”
The Sensor Squad learned two tricks:
- For motion sensors: No news = no motion (use zero!)
- For temperature: Connect the dots between the readings we DO have
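Bella's connect-the-dots trick is exactly what engineers call linear interpolation. A minimal sketch of Terry's gap using NumPy's `np.interp` (the minute-by-minute timestamps are assumed for illustration):

```python
import numpy as np

# Terry's readings, one per minute: 22, 22, GAP, GAP, GAP, 23
times = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
readings = np.array([22.0, 22.0, np.nan, np.nan, np.nan, 23.0])

valid = ~np.isnan(readings)
# Connect the dots between the readings we DO have
filled = np.interp(times, times[valid], readings[valid])
# filled -> 22, 22, 22.25, 22.5, 22.75, 23: a slow, steady warm-up
```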
1307.2.2 Smoothing Out the Noise
Pressure Pete’s readings were jumping around: 100, 5, 98, 3, 101…
“Wait,” said Sammy. “Pressure can’t really jump from 100 to 5 and back! That’s just noise - like static on a radio.”
“How do we fix it?” asked Lila.
“Let me take the AVERAGE of a few readings: (100 + 5 + 98 + 3 + 101) / 5 = about 61… Hmm, that’s not right either because those 5s and 3s are wrong!”
Bella suggested: “What if we put them in order and take the MIDDLE one? 3, 5, 98, 100, 101 - the middle is 98! That’s probably the real pressure!”
The Sensor Squad learned: The median filter ignores the crazy outliers and finds the true reading!
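Bella's middle-number trick is the median in one line of Python. A quick check of Pete's readings with the standard library:

```python
from statistics import mean, median

readings = [100, 5, 98, 3, 101]  # Pete's jumpy pressure readings
avg = mean(readings)    # 61.4 -- dragged down by the bogus 5 and 3
mid = median(readings)  # 98   -- the outliers are ignored entirely
```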
1307.2.3 Key Words for Kids
| Word | What It Means |
|---|---|
| Missing Data | When a sensor doesn’t send any information - like a friend who doesn’t answer |
| Imputation | Filling in gaps with good guesses based on clues |
| Noise | Random jumpy readings that hide the real information |
| Filtering | Smoothing out the noise to find the real signal |
| Median | The middle number when you sort them in order |
Missing data and noise are inevitable in real IoT deployments. Sensors lose power, networks drop packets, and electronic noise corrupts readings. Rather than discard incomplete data, we can intelligently fill gaps and smooth noise.
Two key challenges:
| Challenge | Cause | Solution |
|---|---|---|
| Missing Values | Battery death, network outage, sensor failure | Imputation (filling gaps) |
| Noisy Signals | Electrical interference, quantization, vibration | Filtering (smoothing) |
Important distinction:
- Missing: No data point received at all
- Noisy: Data received but corrupted or fluctuating
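A tiny sketch of the distinction (the reading stream, the 100-unit baseline, and the 50-unit threshold are all illustrative):

```python
readings = [100, 101, None, 99, 250, 100]  # None = dropped packet, 250 = spike

# Missing: no data point received at all
missing_idx = [i for i, r in enumerate(readings) if r is None]
# Noisy: data received, but far outside the plausible range
noisy_idx = [i for i, r in enumerate(readings)
             if r is not None and abs(r - 100) > 50]

# missing_idx -> [2] (needs imputation); noisy_idx -> [4] (needs filtering)
```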
Key question this chapter answers: “How do I fill gaps in sensor data and smooth out noise without losing important information?”
Core Concept: Missing value imputation fills data gaps using neighboring or historical values, while noise filtering smooths random fluctuations to reveal the underlying signal - both must be matched to sensor semantics.
Why It Matters: Analytics and ML models require complete data series. Gaps cause errors or require discarding entire time windows. Noise obscures real patterns and triggers false alerts. Proper handling preserves data integrity while enabling downstream processing.
Key Takeaway: Match your strategy to sensor type - use forward-fill for slow-changing values (temperature), zero for event sensors (motion), and median filter for spike removal. Never interpolate binary/categorical data or forward-fill event streams.
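To see why the "never interpolate binary data" rule holds, here is a short illustrative sketch with a hypothetical door sensor (1 = open, 0 = closed):

```python
import numpy as np

# Door state: 1 = open, 0 = closed, NaN = sensor offline
t = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
door = np.array([0.0, 0.0, np.nan, np.nan, np.nan, 1.0])

valid = ~np.isnan(door)
interpolated = np.interp(t, t[valid], door[valid])
# interpolated gap -> 0.25, 0.5, 0.75: a door cannot be 25% open.
# The correct treatment is to flag the gap as unknown, not to invent states.
```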
1307.3 Missing Value Imputation
1307.3.1 Forward Fill (Last Observation Carried Forward)
Best for slowly-changing values like temperature:
```python
class ForwardFillImputer:
    def __init__(self, max_gap=60):  # Maximum gap in samples
        self.last_valid = None
        self.gap_count = 0
        self.max_gap = max_gap

    def impute(self, value, is_valid):
        if is_valid:
            self.last_valid = value
            self.gap_count = 0
            return value, "original"
        if self.last_valid is None:
            return None, "no_history"
        self.gap_count += 1
        if self.gap_count > self.max_gap:
            return None, "gap_too_large"
        return self.last_valid, "imputed_ffill"
```

1307.3.2 Linear Interpolation
Better for trending values when future data is available:
```python
import numpy as np

def linear_interpolate(data, timestamps):
    """
    Interpolate missing values (None/NaN) using linear interpolation.
    Requires at least two valid points to anchor the line.
    """
    data = np.array(data, dtype=float)
    timestamps = np.array(timestamps, dtype=float)
    valid_mask = ~np.isnan(data)
    if valid_mask.sum() < 2:
        return data  # Not enough valid points to interpolate
    # np.interp evaluates the piecewise-linear fit at every timestamp
    return np.interp(timestamps, timestamps[valid_mask], data[valid_mask])
```

1307.3.3 Seasonal Decomposition Fill
For data with known patterns (e.g., temperature with daily cycles):
```python
import numpy as np

def seasonal_fill(data, period=24):
    """
    Fill missing values using the seasonal pattern from historical data.
    period: number of samples in one cycle
    (e.g., 24 for hourly data with a daily cycle).
    """
    data = np.array(data, dtype=float)
    n = len(data)
    # Accumulate the seasonal pattern from valid samples
    seasonal = np.zeros(period)
    counts = np.zeros(period)
    for i, val in enumerate(data):
        if not np.isnan(val):
            seasonal[i % period] += val
            counts[i % period] += 1
    # Average each phase of the cycle; phases with no data stay NaN
    with np.errstate(divide='ignore', invalid='ignore'):
        seasonal = np.where(counts > 0, seasonal / counts, np.nan)
    # Fill gaps with the corresponding phase of the seasonal pattern
    filled = data.copy()
    for i in range(n):
        if np.isnan(filled[i]) and not np.isnan(seasonal[i % period]):
            filled[i] = seasonal[i % period]
    return filled
```

The mistake: Using forward-fill for event-driven sensors (motion, door open/close) or interpolation for categorical data.
Symptoms:
- Motion sensor shows constant “motion detected” during sensor offline period
- Door sensor shows “open” for hours when sensor battery died while door was open
- Analytics show unrealistic patterns during imputed periods
Why it happens: One-size-fits-all imputation applied without considering sensor semantics.
The fix: Match imputation strategy to sensor type:
| Sensor Type | Imputation Strategy | Reason |
|---|---|---|
| Temperature | Forward-fill or interpolate | Slowly changing, continuous |
| Motion | Use “no motion” (0) | Absence of reading means no motion |
| Door state | Flag as unknown | Cannot assume state |
| Counter | Zero increment | Missing period means no events |
| Humidity | Interpolate with bounds | Continuous, bounded 0-100% |
Prevention: Create sensor metadata that specifies the imputation strategy for each sensor type in your deployment.
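One way to implement that prevention step is a metadata table consulted whenever a gap is detected. A minimal sketch, with illustrative type names and strategies following the table above:

```python
# Illustrative per-type metadata: which imputation strategy each sensor gets
IMPUTATION_STRATEGY = {
    "temperature": "forward_fill",   # slowly changing, continuous
    "humidity":    "interpolate",    # continuous; clamp to 0-100% afterwards
    "motion":      "zero",           # no reading means no motion
    "door":        "unknown",        # cannot assume a state
    "counter":     "zero_increment", # missing period means no events
}

def impute_missing(sensor_type, last_valid):
    """Dispatch on sensor semantics rather than one-size-fits-all."""
    strategy = IMPUTATION_STRATEGY.get(sensor_type, "unknown")
    if strategy == "forward_fill":
        return last_valid
    if strategy in ("zero", "zero_increment"):
        return 0
    # "unknown" (and "interpolate", which also needs a future point)
    return None

print(impute_missing("motion", last_valid=1))       # 0 -> "no motion"
print(impute_missing("temperature", last_valid=22)) # 22 -> carried forward
print(impute_missing("door", last_valid=1))         # None -> flag as unknown
```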
1307.3.4 Imputation Strategy Selection
The decision flowchart at the start of this chapter maps each sensor type, and for continuous sensors each gap size, to a recommended strategy.
1307.4 Noise Filtering Techniques
1307.4.1 Moving Average Filter
Simple and effective for steady-state noise reduction:
```python
class MovingAverageFilter:
    def __init__(self, window_size=5):
        self.window_size = window_size
        self.buffer = []

    def filter(self, value):
        self.buffer.append(value)
        if len(self.buffer) > self.window_size:
            self.buffer.pop(0)  # Drop the oldest sample
        return sum(self.buffer) / len(self.buffer)
```

1307.4.2 Median Filter
Excellent for removing spike noise while preserving edges:
```python
class MedianFilter:
    def __init__(self, window_size=5):
        self.window_size = window_size
        self.buffer = []

    def filter(self, value):
        self.buffer.append(value)
        if len(self.buffer) > self.window_size:
            self.buffer.pop(0)  # Drop the oldest sample
        sorted_buffer = sorted(self.buffer)
        mid = len(sorted_buffer) // 2
        if len(sorted_buffer) % 2 == 0:
            return (sorted_buffer[mid - 1] + sorted_buffer[mid]) / 2
        return sorted_buffer[mid]
```

1307.4.3 Exponential Smoothing
Provides weighted average with more weight on recent values:
```python
class ExponentialSmoothingFilter:
    def __init__(self, alpha=0.3):
        """
        alpha: smoothing factor (0-1).
        Higher alpha = more weight on recent values = less smoothing.
        Lower alpha = more weight on history = more smoothing.
        """
        self.alpha = alpha
        self.smoothed = None

    def filter(self, value):
        if self.smoothed is None:
            self.smoothed = value
        else:
            self.smoothed = self.alpha * value + (1 - self.alpha) * self.smoothed
        return self.smoothed
```

1307.4.4 Filter Comparison
| Filter | Latency | Edge Preservation | Spike Removal | Best For |
|---|---|---|---|---|
| Moving Average | window/2 samples | Poor | Moderate | Steady-state signals |
| Median | window/2 samples | Excellent | Excellent | Spike-contaminated data |
| Exponential | Continuous | Good | Moderate | Real-time smoothing |
| Kalman | Minimal | Excellent | Excellent | Known dynamics, sensor fusion |
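The spike-removal column of the table can be checked directly. A minimal self-contained sketch comparing one averaging window against one median window on an impulse spike (standalone functions rather than the streaming classes above):

```python
from statistics import median

signal = [100, 100, 100, 500, 100, 100, 100]  # one impulse spike at index 3
window = signal[1:6]  # the five samples centred on the spike

avg_out = sum(window) / len(window)  # 180.0 -- the spike leaks into the mean
med_out = median(window)             # 100  -- the spike is rejected outright
```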
1307.4.5 Visual Comparison of Filters
```mermaid
%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ecf0f1'}}}%%
flowchart LR
    subgraph Input["Noisy Signal"]
        A[Raw Data<br/>with Spikes]
    end
    subgraph Filters["Filter Options"]
        MA[Moving<br/>Average]
        MED[Median<br/>Filter]
        EXP[Exponential<br/>Smoothing]
    end
    subgraph Results["Filtered Output"]
        MA_OUT[Smoothed<br/>but Blurred]
        MED_OUT[Spikes<br/>Removed]
        EXP_OUT[Responsive<br/>Smooth]
    end
    A --> MA --> MA_OUT
    A --> MED --> MED_OUT
    A --> EXP --> EXP_OUT
    style A fill:#E74C3C,stroke:#2C3E50,color:#fff
    style MA fill:#3498DB,stroke:#2C3E50,color:#fff
    style MED fill:#16A085,stroke:#2C3E50,color:#fff
    style EXP fill:#9B59B6,stroke:#2C3E50,color:#fff
    style MA_OUT fill:#3498DB,stroke:#2C3E50,color:#fff
    style MED_OUT fill:#16A085,stroke:#2C3E50,color:#fff
    style EXP_OUT fill:#9B59B6,stroke:#2C3E50,color:#fff
```
1307.4.6 Choosing the Right Filter
```python
def choose_filter(signal_characteristics):
    """
    Guide for selecting an appropriate noise filter based on signal characteristics.
    """
    recommendations = {
        'steady_state_with_gaussian_noise': {
            'filter': 'MovingAverage',
            'reason': 'Averages out random noise effectively',
            'window_size': 5  # Adjust based on noise frequency
        },
        'spiky_noise_impulse': {
            'filter': 'MedianFilter',
            'reason': 'Completely ignores outlier spikes',
            'window_size': 5  # An odd window works best
        },
        'real_time_tracking': {
            'filter': 'ExponentialSmoothing',
            'reason': 'No latency, responsive to changes',
            'alpha': 0.3  # Lower = smoother, higher = more responsive
        },
        'sensor_fusion_known_dynamics': {
            'filter': 'KalmanFilter',
            'reason': 'Optimal estimation with uncertainty tracking',
            'params': 'process_noise, measurement_noise'
        },
        'edge_preserving': {
            'filter': 'MedianFilter',
            'reason': 'Preserves sharp transitions in data',
            'window_size': 3
        }
    }
    return recommendations.get(signal_characteristics,
                               recommendations['steady_state_with_gaussian_noise'])
```

1307.4.7 Combining Filters
For robust noise removal, filters can be cascaded:
```python
class CombinedFilter:
    """
    Two-stage filter: median first (remove spikes), then exponential smoothing.
    """
    def __init__(self, median_window=5, exp_alpha=0.3):
        self.median_filter = MedianFilter(median_window)
        self.exp_filter = ExponentialSmoothingFilter(exp_alpha)

    def filter(self, value):
        # Stage 1: remove spikes with the median filter
        despiked = self.median_filter.filter(value)
        # Stage 2: smooth the remaining noise
        return self.exp_filter.filter(despiked)

# Usage
combined = CombinedFilter(median_window=5, exp_alpha=0.2)
for reading in sensor_stream:
    clean_value = combined.filter(reading)
```

1307.5 Knowledge Check
1307.6 Summary
Missing value imputation and noise filtering are essential for producing clean, complete sensor data:
- Forward Fill: Simple and effective for slowly-changing continuous values (temperature, humidity)
- Linear Interpolation: Better for trending data when you have values on both sides of the gap
- Seasonal Fill: Use when data has known periodic patterns (daily temperature cycles)
- Sensor-Specific Imputation: Motion sensors get zero, state sensors get “unknown”, counters get zero increment
- Moving Average: Good for steady-state Gaussian noise, but blurs edges
- Median Filter: Excellent for spike removal, preserves sharp transitions
- Exponential Smoothing: Real-time with no latency, tunable responsiveness
Critical Design Principle: Always match your imputation and filtering strategy to the sensor type and data characteristics. A one-size-fits-all approach will produce incorrect results for at least some of your sensors.
1307.7 What’s Next
The next chapter covers Data Normalization and the Preprocessing Lab, exploring how to scale data for multi-sensor fusion and providing hands-on practice with a complete data quality pipeline.
Data Quality Series:
- Data Quality and Preprocessing - Overview and index
- Data Validation and Outlier Detection - Validation and outliers
- Data Normalization and Preprocessing Lab - Scaling and hands-on practice
Advanced Topics:
- Multi-Sensor Data Fusion - Combining preprocessed data
- Stream Processing - Real-time data pipelines
- Anomaly Detection - Finding meaningful patterns in clean data