43  Imputation & Noise Filtering

43.1 Learning Objectives

By the end of this chapter, you will be able to:

  • Implement Missing Data Handling: Apply appropriate imputation strategies (forward-fill, interpolation, seasonal decomposition) for different sensor types
  • Compare Imputation Methods: Evaluate trade-offs between forward-fill, interpolation, and seasonal decomposition based on data characteristics
  • Design Noise Filters: Implement moving average, median, and exponential smoothing filters and assess their signal conditioning performance
  • Distinguish Strategy by Sensor Type: Justify the correct imputation and filtering approach based on sensor semantics and physical behavior

In 60 Seconds

Missing and noisy sensor readings corrupt every downstream analysis, making data imputation (filling gaps) and filtering (smoothing noise) essential preprocessing steps before any IoT analytics pipeline can produce reliable results. The choice of imputation strategy — forward-fill, interpolation, or model-based — fundamentally affects anomaly detection sensitivity and ML model accuracy.

43.2 Prerequisites

Before diving into this chapter, you should be familiar with:

Missing data is like a puzzle with missing pieces - but we can use clues to fill them in!

43.2.1 The Sensor Squad Adventure: The Mystery of the Missing Messages

Max the Microcontroller was worried. “Motion Mo didn’t send ANY reading! His battery must have died.”

“Oh no!” said Lila the LED. “What do we put in the report?”

Sammy the Sensor thought carefully. “Well, what kind of sensor is Mo?”

“He’s a motion sensor - he only says something when he sees movement!”

Sammy smiled. “Then if he didn’t report anything… what do you think that means?”

Max realized: “No message means no motion! We can write down ‘No motion detected!’”

But then they looked at Temperature Terry’s readings: 22… 22… GAP… GAP… GAP… 23.

“Hmm,” said Sammy. “Temperature changes slowly. If it was 22 before and 23 after, what was it probably during the gap?”

Bella the Battery did the math: “Probably 22… then 22.5… then 23! It was slowly warming up!”

The Sensor Squad learned two tricks:

  1. For motion sensors: No news = no motion (use zero!)
  2. For temperature: Connect the dots between the readings we DO have

43.2.2 Smoothing Out the Noise

Pressure Pete’s readings were jumping around: 100, 5, 98, 3, 101…

“Wait,” said Sammy. “Pressure can’t really jump from 100 to 5 and back! That’s just noise - like static on a radio.”

“How do we fix it?” asked Lila.

“Let me take the AVERAGE of a few readings: (100 + 5 + 98 + 3 + 101) / 5 = about 61… Hmm, that’s not right either because those 5s and 3s are wrong!”

Bella suggested: “What if we put them in order and take the MIDDLE one? 3, 5, 98, 100, 101 - the middle is 98! That’s probably the real pressure!”

The Sensor Squad learned: The median filter ignores the crazy outliers and finds the true reading!

43.2.3 Key Words for Kids

| Word | What It Means |
|---|---|
| Missing Data | When a sensor doesn't send any information - like a friend who doesn't answer |
| Imputation | Filling in gaps with good guesses based on clues |
| Noise | Random jumpy readings that hide the real information |
| Filtering | Smoothing out the noise to find the real signal |
| Median | The middle number when you sort them in order |

Missing data and noise are inevitable in real IoT deployments. Sensors lose power, networks drop packets, and electronic noise corrupts readings. Rather than discard incomplete data, we can intelligently fill gaps and smooth noise.

Two key challenges:

| Challenge | Cause | Solution |
|---|---|---|
| Missing Values | Battery death, network outage, sensor failure | Imputation (filling gaps) |
| Noisy Signals | Electrical interference, quantization, vibration | Filtering (smoothing) |

Important distinction:

  • Missing: No data point received at all
  • Noisy: Data received but corrupted or fluctuating

Key question this chapter answers: “How do I fill gaps in sensor data and smooth out noise without losing important information?”

Minimum Viable Understanding: Imputation and Filtering

Core Concept: Missing value imputation fills data gaps using neighboring or historical values, while noise filtering smooths random fluctuations to reveal the underlying signal - both must be matched to sensor semantics.

Why It Matters: Analytics and ML models require complete data series. Gaps cause errors or require discarding entire time windows. Noise obscures real patterns and triggers false alerts. Proper handling preserves data integrity while enabling downstream processing.

Key Takeaway: Match your strategy to sensor type - use forward-fill for slow-changing values (temperature), zero for event sensors (motion), and median filter for spike removal. Never interpolate binary/categorical data or forward-fill event streams.

43.3 Missing Value Imputation

  • ~10 min | Intermediate | P10.C09.U04

Key Concepts

  • Missing data imputation: The process of estimating and filling in missing sensor readings using statistical or model-based methods rather than discarding incomplete records.
  • Forward-fill (Last Observation Carried Forward): An imputation strategy that replaces a missing value with the most recent valid reading — appropriate for slowly changing sensors but misleading for rapidly varying signals.
  • Linear interpolation: Estimating a missing value by drawing a straight line between the surrounding valid readings — appropriate for sensors with smooth, continuous dynamics.
  • Moving average filter: A signal smoothing technique that replaces each reading with the mean of a surrounding window, attenuating high-frequency noise at the cost of introducing lag.
  • Median filter: A non-linear filter that replaces each reading with the median of its window, highly effective at removing impulse noise (transient spikes) without distorting edges.
  • Missingness mechanism: The reason data is missing — Missing Completely at Random (MCAR), Missing at Random (MAR), or Missing Not at Random (MNAR) — which determines which imputation methods are statistically valid.

43.3.1 Forward Fill (Last Observation Carried Forward)

Best for slowly-changing values like temperature:

class ForwardFillImputer:
    def __init__(self, max_gap=60):  # Maximum gap in samples
        self.last_valid = None
        self.gap_count = 0
        self.max_gap = max_gap

    def impute(self, value, is_valid):
        if is_valid:
            self.last_valid = value
            self.gap_count = 0
            return value, "original"

        if self.last_valid is None:
            return None, "no_history"

        self.gap_count += 1
        if self.gap_count > self.max_gap:
            return None, "gap_too_large"

        return self.last_valid, "imputed_ffill"
Try It: Forward Fill Simulator
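The same logic can also be applied in batch over a recorded series. Here is a minimal sketch (the `forward_fill` helper is ours, mirroring the streaming imputer above: `None` marks a missing sample, and gaps longer than `max_gap` samples stay `None` rather than being filled):

```python
# Batch forward-fill mirroring the streaming ForwardFillImputer above.
# None marks a missing sample; gaps longer than max_gap are left unfilled.
def forward_fill(readings, max_gap=60):
    filled, last, gap = [], None, 0
    for v in readings:
        if v is not None:
            last, gap = v, 0
            filled.append(v)
        else:
            gap += 1
            filled.append(last if last is not None and gap <= max_gap else None)
    return filled

# A short gap is carried forward; a sample beyond max_gap is not.
print(forward_fill([22.1, None, None, 22.4]))
# [22.1, 22.1, 22.1, 22.4]
print(forward_fill([22.1, None, None, None, 22.4], max_gap=2))
# [22.1, 22.1, 22.1, None, 22.4]
```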

43.3.2 Linear Interpolation

Better for trending values when future data is available:

def linear_interpolate(data, timestamps):
    """
    Interpolate missing values (None/NaN) using linear interpolation.
    Requires knowledge of surrounding valid points.
    """
    import numpy as np

    data = np.array(data, dtype=float)
    timestamps = np.array(timestamps, dtype=float)

    valid_mask = ~np.isnan(data)
    valid_indices = np.where(valid_mask)[0]

    if len(valid_indices) < 2:
        return data

    # Interpolate
    interpolated = np.interp(
        timestamps,
        timestamps[valid_mask],
        data[valid_mask]
    )

    return interpolated
Try It: Linear Interpolation vs Forward Fill
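A quick check of the behavior, assuming NumPy is available: a two-sample gap between 22.0 and 23.0 is filled as a ramp by `np.interp` (the call `linear_interpolate` makes internally), whereas forward-fill would hold 22.0 flat until the next valid reading.

```python
import numpy as np

# A gap between 22.0 and 23.0: np.interp estimates the linear ramp.
timestamps = np.array([0.0, 1.0, 2.0, 3.0])
data = np.array([22.0, np.nan, np.nan, 23.0])

valid = ~np.isnan(data)
interpolated = np.interp(timestamps, timestamps[valid], data[valid])
print(interpolated)  # the two missing samples become ~22.33 and ~22.67
```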

43.3.3 Seasonal Decomposition Fill

For data with known patterns (e.g., temperature with daily cycles):

def seasonal_fill(data, period=24):
    """
    Fill missing values using seasonal pattern from historical data.
    period: number of samples in one cycle (e.g., 24 for hourly data with daily cycle)
    """
    import numpy as np

    data = np.array(data, dtype=float)
    n = len(data)

    # Calculate seasonal pattern from valid data
    seasonal = np.zeros(period)
    counts = np.zeros(period)

    for i, val in enumerate(data):
        if not np.isnan(val):
            seasonal[i % period] += val
            counts[i % period] += 1

    # Average seasonal values
    with np.errstate(divide='ignore', invalid='ignore'):
        seasonal = np.where(counts > 0, seasonal / counts, np.nan)

    # Fill missing with seasonal pattern
    filled = data.copy()
    for i in range(n):
        if np.isnan(filled[i]) and not np.isnan(seasonal[i % period]):
            filled[i] = seasonal[i % period]

    return filled
Try It: Seasonal Decomposition Fill
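A compact illustration of the same idea, with the period shortened to 4 samples for readability: each missing value is replaced by the average of the values observed at the same phase in other cycles, exactly as `seasonal_fill` does.

```python
import numpy as np

# Two cycles of period-4 data; one sample in cycle 2 is missing.
period = 4
data = np.array([10.0, 14.0, 18.0, 12.0,    # cycle 1
                 10.0, np.nan, 18.0, 12.0])  # cycle 2

phase = np.arange(len(data)) % period
filled = data.copy()
for i in np.where(np.isnan(data))[0]:
    # Average all valid samples that share this sample's phase.
    filled[i] = np.nanmean(data[phase == phase[i]])
print(filled)  # the gap at index 5 becomes 14.0 (the phase-1 average)
```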
Common Pitfall: Wrong Imputation Strategy

The mistake: Using forward-fill for event-driven sensors (motion, door open/close) or interpolation for categorical data.

Symptoms:

  • Motion sensor shows constant “motion detected” during sensor offline period
  • Door sensor shows “open” for hours when sensor battery died while door was open
  • Analytics show unrealistic patterns during imputed periods

Why it happens: One-size-fits-all imputation applied without considering sensor semantics.

The fix: Always match imputation strategy to sensor type. See the Decision Framework later in this chapter for a complete sensor-to-strategy mapping table and decision tree.

Prevention: Create sensor metadata that specifies the imputation strategy for each sensor type in your deployment.

How long can you safely forward-fill temperature data?

For a typical indoor temperature sensor:

  • Normal change rate: 0.5°C per hour (HVAC cycles)
  • Maximum change rate: 3°C per hour (HVAC failure, door open)
  • Sensor sampling: every 60 seconds

Gap duration analysis:

| Gap Duration | Expected Change | Forward-Fill Error | Acceptable? |
|---|---|---|---|
| 1 minute | 0.008°C (normal) | ~0.01°C | ✓ Excellent |
| 5 minutes | 0.042°C | ~0.05°C | ✓ Very good |
| 30 minutes | 0.25°C | ~0.3°C | ✓ Good (trend analysis OK) |
| 2 hours | 1.0°C | ~1.5°C | ✗ Poor (alert thresholds invalid) |

Formula for maximum safe gap:

\[t_{max} = \frac{\epsilon_{acceptable}}{r_{max}}\]

Where \(\epsilon_{acceptable}\) is the maximum tolerable error and \(r_{max}\) is the maximum expected change rate.

Example: For ±1°C acceptable error and 3°C/hour max rate:

\[t_{max} = \frac{1°C}{3°C/\text{hour}} = 0.33 \text{ hours} = 20 \text{ minutes}\]

Recommendation: Forward-fill temperature for max 20 minutes. Beyond that, flag as “sensor offline” rather than impute.
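The maximum safe gap is trivial to compute per sensor. A minimal sketch (the function name is ours); the values reproduce the worked example above:

```python
# t_max = eps_acceptable / r_max, converted from hours to minutes.
def max_safe_gap_minutes(eps_acceptable_c, r_max_c_per_hour):
    return 60.0 * eps_acceptable_c / r_max_c_per_hour

# ±1°C acceptable error at a 3°C/hour maximum change rate:
print(max_safe_gap_minutes(1.0, 3.0))  # 20.0 minutes
```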

43.3.4 Try It: Forward-Fill Gap Calculator

43.4 Noise Filtering Techniques

  • ~15 min | Advanced | P10.C09.U05

43.4.1 Moving Average Filter

Simple and effective for steady-state noise reduction:

class MovingAverageFilter:
    def __init__(self, window_size=5):
        self.window_size = window_size
        self.buffer = []

    def filter(self, value):
        self.buffer.append(value)
        if len(self.buffer) > self.window_size:
            self.buffer.pop(0)

        return sum(self.buffer) / len(self.buffer)
Try It: Moving Average Noise Reduction
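A quick demonstration of the averaging behavior, written as a batch helper (the function is ours but matches the streaming class above): a noisy signal centered on 20 settles close to 20 at the output.

```python
# Batch moving average with the same semantics as MovingAverageFilter.
def moving_average(stream, window=5):
    buf, out = [], []
    for v in stream:
        buf.append(v)
        if len(buf) > window:
            buf.pop(0)
        out.append(sum(buf) / len(buf))
    return out

noisy = [20.0, 21.0, 19.0, 20.5, 19.5, 21.5, 18.5, 20.0]
print([round(v, 2) for v in moving_average(noisy, window=4)])
# [20.0, 20.5, 20.0, 20.12, 20.0, 20.12, 20.0, 19.88]
```

Note how ±1.5 swings in the input become ±0.12 swings at the output once the window is full.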

43.4.2 Median Filter

Excellent for removing spike noise while preserving edges:

class MedianFilter:
    def __init__(self, window_size=5):
        self.window_size = window_size
        self.buffer = []

    def filter(self, value):
        self.buffer.append(value)
        if len(self.buffer) > self.window_size:
            self.buffer.pop(0)

        sorted_buffer = sorted(self.buffer)
        mid = len(sorted_buffer) // 2

        if len(sorted_buffer) % 2 == 0:
            return (sorted_buffer[mid - 1] + sorted_buffer[mid]) / 2
        return sorted_buffer[mid]
Try It: Median Filter vs Moving Average for Spike Removal

43.4.3 Exponential Smoothing

Provides weighted average with more weight on recent values:

class ExponentialSmoothingFilter:
    def __init__(self, alpha=0.3):
        """
        alpha: smoothing factor (0-1)
        Higher alpha = more weight on recent values = less smoothing
        Lower alpha = more weight on history = more smoothing
        """
        self.alpha = alpha
        self.smoothed = None

    def filter(self, value):
        if self.smoothed is None:
            self.smoothed = value
        else:
            self.smoothed = self.alpha * value + (1 - self.alpha) * self.smoothed

        return self.smoothed
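The step response makes the alpha trade-off concrete. A small sketch (the batch helper is ours, matching the class above): after a jump from 20 to 30, each sample closes a fraction alpha of the remaining gap.

```python
# Batch exponential smoothing, same recurrence as ExponentialSmoothingFilter.
def exp_smooth(stream, alpha=0.3):
    s, out = None, []
    for v in stream:
        s = v if s is None else alpha * v + (1 - alpha) * s
        out.append(s)
    return out

vals = exp_smooth([20.0] * 3 + [30.0] * 5, alpha=0.3)
print([round(v, 2) for v in vals])
# [20.0, 20.0, 20.0, 23.0, 25.1, 26.57, 27.6, 28.32]
```

With alpha = 0.3, the output reaches about 83% of a step change after five samples (1 − 0.7⁵ ≈ 0.83), which is the lag you trade for smoothness.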

43.4.4 Filter Comparison

| Filter | Latency | Edge Preservation | Spike Removal | Best For |
|---|---|---|---|---|
| Moving Average | (N-1)/2 samples | Poor | Moderate | Steady-state signals |
| Median | (N-1)/2 samples | Excellent | Excellent | Spike-contaminated data |
| Exponential | Continuous | Good | Moderate | Real-time smoothing |
| Kalman | Minimal | Excellent | Excellent | Known dynamics, sensor fusion |

How does filter window size affect noise reduction and latency?

For a moving average filter with Gaussian noise (σ = 2.0°C) on temperature sensor:

Noise reduction formula:

\[\sigma_{filtered} = \frac{\sigma_{original}}{\sqrt{N}}\]

Where \(N\) is the window size.

| Window Size (N) | Noise Reduction | Latency (samples) | Temperature Example |
|---|---|---|---|
| 3 | \(\sigma / \sqrt{3} = 0.58\sigma\) | 1.0 | 2.0°C → 1.15°C noise |
| 5 | \(\sigma / \sqrt{5} = 0.45\sigma\) | 2.0 | 2.0°C → 0.89°C noise |
| 10 | \(\sigma / \sqrt{10} = 0.32\sigma\) | 4.5 | 2.0°C → 0.63°C noise |
| 20 | \(\sigma / \sqrt{20} = 0.22\sigma\) | 9.5 | 2.0°C → 0.45°C noise |

Trade-off: Larger window → better noise rejection BUT longer delay detecting real changes.

Latency calculation: Output lags input by \(\frac{N-1}{2}\) samples. For \(N=10\) at 1 Hz sampling → 4.5 second delay.

Practical rule: Choose \(N\) such that latency is < 10% of the timescale you care about. If monitoring hourly HVAC cycles (3600 s), a 5-10 sample window (2-4.5 s latency at 1 Hz) is easily acceptable. If detecting rapid door-opening events (10 s timescale), use \(N=3\) (1 s latency at 1 Hz).
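The \(\sigma / \sqrt{N}\) relationship is easy to verify empirically. A quick sketch using only the standard library; non-overlapping window means stand in for the moving-average output, since each mean averages \(N\) independent noise samples:

```python
import random
import statistics

# Empirical check of sigma_filtered ~= sigma / sqrt(N) for N = 10.
random.seed(0)
sigma, N = 2.0, 10
noise = [random.gauss(0.0, sigma) for _ in range(20000)]

# Means of non-overlapping windows estimate the filtered noise level.
means = [statistics.mean(noise[i:i + N]) for i in range(0, len(noise), N)]
print(round(statistics.stdev(means), 2))  # close to 2.0 / sqrt(10) ~= 0.63
```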

43.4.5 Try It: Filter Window Size Calculator

43.4.6 Try It: Exponential Smoothing Explorer

43.4.7 Choosing the Right Filter

def choose_filter(signal_characteristics):
    """
    Guide for selecting appropriate noise filter based on signal characteristics.
    """
    recommendations = {
        'steady_state_with_gaussian_noise': {
            'filter': 'MovingAverage',
            'reason': 'Averages out random noise effectively',
            'window_size': 5  # Adjust based on noise frequency
        },
        'spiky_noise_impulse': {
            'filter': 'MedianFilter',
            'reason': 'Completely ignores outlier spikes',
            'window_size': 5  # Odd number works best
        },
        'real_time_tracking': {
            'filter': 'ExponentialSmoothing',
            'reason': 'Low latency, responsive to changes',
            'alpha': 0.3  # Lower = smoother, higher = more responsive
        },
        'sensor_fusion_known_dynamics': {
            'filter': 'KalmanFilter',
            'reason': 'Optimal estimation with uncertainty tracking',
            'params': 'process_noise, measurement_noise'
        },
        'edge_preserving': {
            'filter': 'MedianFilter',
            'reason': 'Preserves sharp transitions in data',
            'window_size': 3
        }
    }
    return recommendations.get(signal_characteristics, recommendations['steady_state_with_gaussian_noise'])
Try It: Filter Selection Advisor

43.4.8 Combining Filters

For robust noise removal, filters can be cascaded:

class CombinedFilter:
    """
    Two-stage filter: Median first (remove spikes), then exponential smooth.
    """
    def __init__(self, median_window=5, exp_alpha=0.3):
        self.median_filter = MedianFilter(median_window)
        self.exp_filter = ExponentialSmoothingFilter(exp_alpha)

    def filter(self, value):
        # Stage 1: Remove spikes with median
        despike = self.median_filter.filter(value)
        # Stage 2: Smooth remaining noise
        smooth = self.exp_filter.filter(despike)
        return smooth

# Usage
combined = CombinedFilter(median_window=5, exp_alpha=0.2)
for reading in sensor_stream:
    clean_value = combined.filter(reading)
Try It: Two-Stage Combined Filter Pipeline

Scenario: You have a temperature sensor monitoring a cold storage facility. The sensor reports every 10 seconds. You’ve observed occasional spikes due to electrical interference when the cooling compressor starts (5-10°C jumps), and you need to filter these without losing legitimate temperature trends.

Given:

  • Sampling rate: 0.1 Hz (1 sample per 10 seconds)
  • Normal temperature: -18°C ± 2°C
  • Compressor noise: Random spikes to -8°C or -28°C (duration: 1-2 samples)
  • Legitimate temperature changes: 0.5°C per minute maximum

Question: Should you use a moving average or median filter, and what window size?

Solution:

Step 1: Analyze the noise characteristics

  • Spike duration: 1-2 samples = 10-20 seconds
  • Spike frequency: Approximately 5% of readings (every 200 seconds when compressor cycles)
  • Spike magnitude: 10°C deviation (huge compared to normal 2°C variation)

Step 2: Calculate required window size

For moving average:

  • Window needs to span the spike duration
  • 3-sample window: averages noise into adjacent readings
  • Example: [-18, -28, -18] → average = -21.3°C (still shows distortion)

For median filter:

  • Window needs an odd number of samples
  • 3-sample window: [-18, -28, -18] → median = -18°C (spike completely removed!)
  • 5-sample window: [-18, -18, -28, -18, -18] → median = -18°C (still perfect)

Step 3: Verify edge preservation

Legitimate temperature change over 1 minute:

  • Rate: 0.5°C/min = 0.083°C per 10 seconds
  • Over 5 samples: [-18.0, -18.1, -18.2, -18.3, -18.4]
  • Median of 5: -18.2°C (preserves the trend!)

Step 4: Calculate latency

Window size 5 = (5-1)/2 = 2 samples delay = 20 seconds latency

For cold storage monitoring (not time-critical), 20 seconds is acceptable.

Answer: Use 5-sample median filter

Why:

  • Completely removes 1-2 sample spikes
  • Preserves legitimate temperature trends
  • No tuning parameters (unlike moving average weights)
  • Latency (20s) acceptable for this application

Implementation:

median_filter = MedianFilter(window_size=5)
for reading in sensor_stream:
    clean_temp = median_filter.filter(reading)
    if clean_temp > -15:  # warmer than the safe band (illustrative threshold); reliable after filtering
        trigger_high_temp_alarm()

Key Insight: Median filters excel when noise is sparse spikes rather than continuous Gaussian noise. The window should be large enough to ensure spikes are minority values (< 50%) within the window.
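The worked example can be replayed directly. A self-contained sketch (a batch median helper with the same logic as the `MedianFilter` class above, using the standard library's `statistics.median`): the -28°C compressor spike never reaches the output, while the slow downward trend survives.

```python
import statistics

# Batch 5-sample median filter, same semantics as MedianFilter above.
def median_filter(stream, window=5):
    buf, out = [], []
    for v in stream:
        buf.append(v)
        if len(buf) > window:
            buf.pop(0)
        out.append(statistics.median(buf))
    return out

# Cold-storage readings with one compressor spike at index 2.
readings = [-18.0, -18.0, -28.0, -18.0, -18.1, -18.2]
print(median_filter(readings))
# [-18.0, -18.0, -18.0, -18.0, -18.0, -18.1]
```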

When sensor data goes missing, selecting the correct imputation strategy depends on sensor characteristics and downstream requirements. Use this framework to guide your decision:

| Sensor Characteristic | Imputation Strategy | Rationale | Max Gap Duration |
|---|---|---|---|
| Slowly changing continuous (temperature, humidity) | Forward-fill or linear interpolation | Physical inertia prevents rapid changes | 10-30 minutes |
| Event-driven binary (motion detector, door switch) | Zero/False for missing periods | Absence of event signal = no event occurred | Unlimited (but flag gaps > 1 hour) |
| Monotonic counter (energy meter, flow meter) | Zero increment for missing period | No reading = no consumption during gap | Up to 1 day |
| Periodic with known pattern (daily temperature cycle) | Seasonal decomposition | Leverage historical pattern | Up to 6 hours |
| High-frequency volatile (stock price, vibration) | Do not impute; mark as missing | Interpolation creates false data | N/A (preserve gaps) |
| Redundant sensor array | Use nearby sensor + bias correction | Spatial correlation for better estimate | Depends on sensor density |

Decision Tree:

  1. Is the sensor event-driven?
    • YES → Use zero/default state for missing periods
    • NO → Continue to step 2
  2. Does the signal change slowly? (Rate < 10% per time constant)
    • YES → Forward-fill acceptable for gaps < 10x sampling interval
    • NO → Continue to step 3
  3. Is there a known periodic pattern?
    • YES → Use seasonal decomposition fill
    • NO → Continue to step 4
  4. Are there nearby sensors measuring the same quantity?
    • YES → Use spatial interpolation from neighbors
    • NO → Use linear interpolation or mark as missing

Example Application:

def select_imputation_strategy(sensor_type, gap_duration_minutes):
    if sensor_type == "motion_detector":
        return "zero_fill"  # No motion during gap
    elif sensor_type == "temperature":
        if gap_duration_minutes < 30:
            return "forward_fill"
        elif gap_duration_minutes < 360:
            return "seasonal_fill"  # Use daily pattern
        else:
            return "mark_missing"  # Gap too long
    elif sensor_type == "energy_meter":
        return "zero_increment"  # No consumption
    elif sensor_type == "vibration":
        return "mark_missing"  # Cannot safely interpolate
    else:
        return "mark_missing"  # Unknown sensor type: fail safe rather than guess

Warning Signs of Wrong Strategy:

  • Motion sensor shows continuous “detected” during power outage → Used forward-fill instead of zero
  • Temperature shows impossible linear ramp over 6-hour gap → Used interpolation instead of seasonal pattern
  • Energy meter shows zero consumption for entire day → Used zero_increment for too-long gap (should alarm)
Common Mistake: Imputing Before Outlier Detection

The Mistake: Running imputation before outlier detection, causing outliers to be forward-filled or interpolated into the data stream, permanently corrupting adjacent readings.

Why It Happens: Data quality pipelines are often built incrementally. Engineers add imputation first (to handle missing data), then later realize they need outlier detection. By then, the pipeline order is established and changing it requires refactoring.

Example of the Problem:

# WRONG: Impute first, detect outliers second
readings = [22.1, 22.3, 99.9, None, None, None, 22.8]  # 99.9 is sensor fault

# Step 1: Forward-fill (MISTAKE - happens before outlier removal)
imputed = forward_fill(readings)
# Result: [22.1, 22.3, 99.9, 99.9, 99.9, 99.9, 22.8]

# Step 2: Outlier detection
cleaned = remove_outliers(imputed, threshold=3)  # 3-sigma rule
# Result: [22.1, 22.3, REMOVED, REMOVED, REMOVED, REMOVED, 22.8]
# Lost 4 data points! The None values became 99.9 outliers.

The Fix: Always apply outlier detection and validation BEFORE imputation:

# CORRECT: Validate first, impute second
readings = [22.1, 22.3, 99.9, None, None, None, 22.8]

# Step 1: Outlier detection and removal (mark as None)
validated = remove_outliers(readings, threshold=3)  # 3-sigma rule
# Result: [22.1, 22.3, None, None, None, None, 22.8]

# Step 2: Forward-fill ONLY the validated stream
imputed = forward_fill(validated)
# Result: [22.1, 22.3, 22.3, 22.3, 22.3, 22.3, 22.8]
# Correctly preserved legitimate data!

Correct Pipeline Order:

  1. Range Validation → Mark out-of-range values as None
  2. Rate-of-Change Validation → Mark impossible jumps as None
  3. Outlier Detection → Mark statistical outliers as None
  4. Missing Value Imputation → Fill None values using appropriate strategy
  5. Noise Filtering → Apply smoothing to cleaned data
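The ordering above can be sketched end to end. A minimal, self-contained example of stages 1 and 4 (the range limits and helper names here are illustrative assumptions): the 99.9 fault is marked missing before forward-fill, so it never propagates into the gap.

```python
# Stage 1: range validation marks out-of-range values as None.
def validate_range(readings, lo=-40.0, hi=60.0):
    return [v if v is not None and lo <= v <= hi else None for v in readings]

# Stage 4: forward-fill only the validated stream.
def forward_fill(readings):
    out, last = [], None
    for v in readings:
        last = v if v is not None else last
        out.append(last)
    return out

readings = [22.1, 22.3, 99.9, None, None, None, 22.8]
print(forward_fill(validate_range(readings)))
# [22.1, 22.3, 22.3, 22.3, 22.3, 22.3, 22.8]
```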

Real-World Impact: A temperature monitoring system in a pharmaceutical warehouse experienced this bug. A faulty sensor spiked to 85°C for one reading before going offline. The spike was forward-filled for 30 minutes (the gap duration), triggering false temperature excursion alarms and requiring destruction of $50,000 worth of temperature-sensitive drugs. Root cause: imputation ran before outlier removal in the data pipeline.

Prevention Checklist:

43.5 Knowledge Check

Common Pitfalls

Forward-filling works for temperature that changes by 0.5°C per minute but creates flat-line artefacts for high-frequency vibration data. Match the imputation method to the signal dynamics.

If downstream analytics cannot distinguish real readings from imputed ones, anomaly detectors may flag imputed values as anomalies or ML models may learn from artefacts. Always add an imputation flag column alongside filled values.

Global mean imputation destroys temporal patterns (seasonality, trends) that are the most valuable features in IoT data. Use time-local methods (linear interpolation, seasonal decomposition imputation) instead.

A 100-point moving average on a 1 Hz sensor introduces roughly 50 seconds of lag, which is unacceptable for real-time anomaly detection. Balance noise suppression against lag; an exponential moving average reduces the buffering and delay but is not zero-phase. True zero-phase filtering is only possible offline, with forward-backward processing (e.g. SciPy's filtfilt).

43.6 Summary

Missing value imputation and noise filtering are essential for producing clean, complete sensor data:

  • Forward Fill: Simple and effective for slowly-changing continuous values (temperature, humidity)
  • Linear Interpolation: Better for trending data when you have values on both sides of the gap
  • Seasonal Fill: Use when data has known periodic patterns (daily temperature cycles)
  • Sensor-Specific Imputation: Motion sensors get zero, state sensors get “unknown”, counters get zero increment
  • Moving Average: Good for steady-state Gaussian noise, but blurs edges
  • Median Filter: Excellent for spike removal, preserves sharp transitions
  • Exponential Smoothing: Real-time with minimal latency, tunable responsiveness

Critical Design Principle: Always match your imputation and filtering strategy to the sensor type and data characteristics. A one-size-fits-all approach will produce incorrect results for at least some of your sensors.


43.7 What’s Next

| If you want to… | Read this |
|---|---|
| Understand data quality validation before imputation | Data Quality Validation |
| Apply preprocessing in the broader pipeline context | Data Quality and Preprocessing |
| Practise normalisation techniques in the lab | Data Quality Normalisation Lab |
| Apply clean data to anomaly detection | Anomaly Detection Overview |
| Return to the module overview | Big Data Overview |