1307  Missing Value Imputation and Noise Filtering

1307.1 Learning Objectives

By the end of this chapter, you will be able to:

  • Handle Missing Data: Implement appropriate imputation strategies for different sensor types and use cases
  • Select Imputation Methods: Choose between forward-fill, interpolation, and seasonal decomposition based on data characteristics
  • Design Noise Filters: Implement moving average, median, and exponential smoothing filters for signal conditioning
  • Match Strategy to Sensor Type: Apply the correct imputation and filtering approach based on sensor semantics

1307.2 Prerequisites

Before diving into this chapter, you should be familiar with the sensor data collection and data quality concepts covered earlier in this series.

Missing data is like a puzzle with missing pieces - but we can use clues to fill them in!

1307.2.1 The Sensor Squad Adventure: The Mystery of the Missing Messages

Max the Microcontroller was worried. “Motion Mo didn’t send ANY reading! His battery must have died.”

“Oh no!” said Lila the LED. “What do we put in the report?”

Sammy the Sensor thought carefully. “Well, what kind of sensor is Mo?”

“He’s a motion sensor - he only says something when he sees movement!”

Sammy smiled. “Then if he didn’t report anything… what do you think that means?”

Max realized: “No message means no motion! We can write down ‘No motion detected!’”

But then they looked at Temperature Terry’s readings: 22… 22… GAP… GAP… GAP… 23.

“Hmm,” said Sammy. “Temperature changes slowly. If it was 22 before and 23 after, what was it probably during the gap?”

Bella the Battery did the math: “Probably 22… then 22.5… then 23! It was slowly warming up!”

The Sensor Squad learned two tricks:

  1. For motion sensors: No news = no motion (use zero!)
  2. For temperature: Connect the dots between the readings we DO have

1307.2.2 Smoothing Out the Noise

Pressure Pete’s readings were jumping around: 100, 5, 98, 3, 101…

“Wait,” said Sammy. “Pressure can’t really jump from 100 to 5 and back! That’s just noise - like static on a radio.”

“How do we fix it?” asked Lila.

“Let me take the AVERAGE of a few readings: (100 + 5 + 98 + 3 + 101) / 5 = about 61… Hmm, that’s not right either because those 5s and 3s are wrong!”

Bella suggested: “What if we put them in order and take the MIDDLE one? 3, 5, 98, 100, 101 - the middle is 98! That’s probably the real pressure!”

The Sensor Squad learned: The median filter ignores the crazy outliers and finds the true reading!

1307.2.3 Key Words for Kids

| Word | What It Means |
|------|---------------|
| Missing Data | When a sensor doesn’t send any information - like a friend who doesn’t answer |
| Imputation | Filling in gaps with good guesses based on clues |
| Noise | Random jumpy readings that hide the real information |
| Filtering | Smoothing out the noise to find the real signal |
| Median | The middle number when you sort them in order |

Missing data and noise are inevitable in real IoT deployments. Sensors lose power, networks drop packets, and electronic noise corrupts readings. Rather than discard incomplete data, we can intelligently fill gaps and smooth noise.

Two key challenges:

| Challenge | Cause | Solution |
|-----------|-------|----------|
| Missing Values | Battery death, network outage, sensor failure | Imputation (filling gaps) |
| Noisy Signals | Electrical interference, quantization, vibration | Filtering (smoothing) |

Important distinction:

  • Missing: No data point received at all
  • Noisy: Data received but corrupted or fluctuating

Key question this chapter answers: “How do I fill gaps in sensor data and smooth out noise without losing important information?”

Tip - Minimum Viable Understanding: Imputation and Filtering

Core Concept: Missing value imputation fills data gaps using neighboring or historical values, while noise filtering smooths random fluctuations to reveal the underlying signal - both must be matched to sensor semantics.

Why It Matters: Analytics and ML models require complete data series. Gaps cause errors or require discarding entire time windows. Noise obscures real patterns and triggers false alerts. Proper handling preserves data integrity while enabling downstream processing.

Key Takeaway: Match your strategy to sensor type - use forward-fill for slow-changing values (temperature), zero for event sensors (motion), and median filter for spike removal. Never interpolate binary/categorical data or forward-fill event streams.

1307.3 Missing Value Imputation

  • ~10 min | Intermediate | P10.C09.U04

1307.3.1 Forward Fill (Last Observation Carried Forward)

Best for slowly-changing values like temperature:

class ForwardFillImputer:
    def __init__(self, max_gap=60):  # Maximum gap in samples
        self.last_valid = None
        self.gap_count = 0
        self.max_gap = max_gap

    def impute(self, value, is_valid):
        if is_valid:
            self.last_valid = value
            self.gap_count = 0
            return value, "original"

        if self.last_valid is None:
            return None, "no_history"

        self.gap_count += 1
        if self.gap_count > self.max_gap:
            return None, "gap_too_large"

        return self.last_valid, "imputed_ffill"
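The same policy can also be applied in batch form to a whole series. The `forward_fill` helper below is an illustrative sketch (not part of the chapter's API), using `None` for missing samples:

```python
def forward_fill(series, max_gap=60):
    """Carry the last valid reading forward, up to max_gap samples.

    Gaps longer than max_gap stay None so downstream code can flag them.
    """
    filled = []
    last_valid = None
    gap = 0
    for value in series:
        if value is not None:
            last_valid, gap = value, 0
            filled.append(value)
        else:
            gap += 1
            filled.append(last_valid if last_valid is not None and gap <= max_gap else None)
    return filled

forward_fill([22.0, None, None, 23.0], max_gap=2)
# -> [22.0, 22.0, 22.0, 23.0]
```

Note that with `max_gap=1` the second gap sample would remain `None`, preserving the "gap too large" signal from the streaming version.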

1307.3.2 Linear Interpolation

Better for trending values when future data is available:

def linear_interpolate(data, timestamps):
    """
    Interpolate missing values (None/NaN) using linear interpolation.
    Requires knowledge of surrounding valid points.
    """
    import numpy as np

    data = np.array(data, dtype=float)
    timestamps = np.array(timestamps, dtype=float)

    valid_mask = ~np.isnan(data)
    valid_indices = np.where(valid_mask)[0]

    if len(valid_indices) < 2:
        return data

    # Interpolate
    interpolated = np.interp(
        timestamps,
        timestamps[valid_mask],
        data[valid_mask]
    )

    return interpolated
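Applied to the gap from the Sensor Squad story, the function reduces to a direct `np.interp` call; a minimal sketch:

```python
import numpy as np

# Terry's readings: 22, GAP, GAP, 23, sampled at equal intervals.
data = np.array([22.0, np.nan, np.nan, 23.0])
timestamps = np.array([0.0, 1.0, 2.0, 3.0])

valid = ~np.isnan(data)
filled = np.interp(timestamps, timestamps[valid], data[valid])
# filled is approximately [22.0, 22.33, 22.67, 23.0]
```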

1307.3.3 Seasonal Decomposition Fill

For data with known patterns (e.g., temperature with daily cycles):

def seasonal_fill(data, period=24):
    """
    Fill missing values using seasonal pattern from historical data.
    period: number of samples in one cycle (e.g., 24 for hourly data with daily cycle)
    """
    import numpy as np

    data = np.array(data, dtype=float)
    n = len(data)

    # Calculate seasonal pattern from valid data
    seasonal = np.zeros(period)
    counts = np.zeros(period)

    for i, val in enumerate(data):
        if not np.isnan(val):
            seasonal[i % period] += val
            counts[i % period] += 1

    # Average seasonal values
    with np.errstate(divide='ignore', invalid='ignore'):
        seasonal = np.where(counts > 0, seasonal / counts, np.nan)

    # Fill missing with seasonal pattern
    filled = data.copy()
    for i in range(n):
        if np.isnan(filled[i]) and not np.isnan(seasonal[i % period]):
            filled[i] = seasonal[i % period]

    return filled
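A minimal sketch of the same slot-averaging idea on a toy series (the period and values are chosen purely for illustration):

```python
import numpy as np

# Toy series with a cycle of period 4; np.nan marks the gap.
data = np.array([10.0, 20.0, 30.0, 40.0, 10.0, np.nan, 30.0, 40.0])
period = 4

# Average the valid readings in each phase slot of the cycle...
slot_means = [np.nanmean(data[i::period]) for i in range(period)]

# ...then fill each gap with its slot's mean.
filled = data.copy()
for i, v in enumerate(filled):
    if np.isnan(v):
        filled[i] = slot_means[i % period]
# The gap at index 5 (slot 1) is filled with 20.0.
```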

Warning - Common Pitfall: Wrong Imputation Strategy

The mistake: Using forward-fill for event-driven sensors (motion, door open/close) or interpolation for categorical data.

Symptoms:

  • Motion sensor shows constant “motion detected” during sensor offline period
  • Door sensor shows “open” for hours when sensor battery died while door was open
  • Analytics show unrealistic patterns during imputed periods

Why it happens: One-size-fits-all imputation applied without considering sensor semantics.

The fix: Match imputation strategy to sensor type:

| Sensor Type | Imputation Strategy | Reason |
|-------------|---------------------|--------|
| Temperature | Forward-fill or interpolate | Slowly changing, continuous |
| Motion | Use “no motion” (0) | Absence of reading means no motion |
| Door state | Flag as unknown | Cannot assume state |
| Counter | Zero increment | Missing period means no events |
| Humidity | Interpolate with bounds | Continuous, bounded 0-100% |

Prevention: Create sensor metadata that specifies the imputation strategy for each sensor type in your deployment.
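Such metadata can be as simple as a lookup table; a minimal sketch, with illustrative sensor names and strategy labels (not a standard schema):

```python
# Illustrative metadata mapping sensor types to imputation strategies.
IMPUTATION_STRATEGY = {
    "temperature": "forward_fill",    # slowly changing, continuous
    "humidity":    "interpolate",     # continuous, bounded 0-100%
    "motion":      "zero",            # no reading means no motion
    "door_state":  "unknown",         # cannot assume state
    "counter":     "zero_increment",  # missing period means no events
}

def strategy_for(sensor_type):
    # Fall back to the safest option for unrecognized types.
    return IMPUTATION_STRATEGY.get(sensor_type, "unknown")
```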

1307.3.4 Imputation Strategy Selection

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ecf0f1'}}}%%
flowchart TD
    Start[Missing Data<br/>Detected] --> Type{Sensor Type?}

    Type -->|Continuous<br/>Temperature, Humidity| Cont[Continuous<br/>Measurement]
    Type -->|Event-Based<br/>Motion, Button| Event[Event<br/>Sensor]
    Type -->|State<br/>Door, Switch| State[Binary<br/>State]
    Type -->|Counter<br/>Energy, Traffic| Counter[Cumulative<br/>Counter]

    Cont --> Gap{Gap Size?}
    Gap -->|Short<br/>< 5 samples| FFill[Forward Fill]
    Gap -->|Medium<br/>5-60 samples| Interp[Linear<br/>Interpolation]
    Gap -->|Long<br/>> 60 samples| Seasonal[Seasonal<br/>Decomposition]

    Event --> Zero[Use Zero<br/>No Event]

    State --> Unknown[Flag as<br/>Unknown]

    Counter --> ZeroInc[Zero<br/>Increment]

    style Start fill:#2C3E50,stroke:#16A085,color:#fff
    style FFill fill:#27AE60,stroke:#2C3E50,color:#fff
    style Interp fill:#16A085,stroke:#2C3E50,color:#fff
    style Seasonal fill:#3498DB,stroke:#2C3E50,color:#fff
    style Zero fill:#E67E22,stroke:#2C3E50,color:#fff
    style Unknown fill:#7F8C8D,stroke:#2C3E50,color:#fff
    style ZeroInc fill:#9B59B6,stroke:#2C3E50,color:#fff

1307.4 Noise Filtering Techniques

  • ~15 min | Advanced | P10.C09.U05

1307.4.1 Moving Average Filter

Simple and effective for steady-state noise reduction:

class MovingAverageFilter:
    def __init__(self, window_size=5):
        self.window_size = window_size
        self.buffer = []

    def filter(self, value):
        self.buffer.append(value)
        if len(self.buffer) > self.window_size:
            self.buffer.pop(0)

        return sum(self.buffer) / len(self.buffer)
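A quick batch sketch (the `moving_average_stream` helper is illustrative) shows why plain averaging struggles with Pete's spiky readings from the story:

```python
from collections import deque

def moving_average_stream(values, window_size=5):
    # Running windowed mean for each incoming value.
    buf = deque(maxlen=window_size)
    out = []
    for v in values:
        buf.append(v)
        out.append(sum(buf) / len(buf))
    return out

# The outlier spikes (5 and 3) drag the average far below the
# true level of ~100.
averaged = moving_average_stream([100, 5, 98, 3, 101])
# final value: (100 + 5 + 98 + 3 + 101) / 5 = 61.4
```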

1307.4.2 Median Filter

Excellent for removing spike noise while preserving edges:

class MedianFilter:
    def __init__(self, window_size=5):
        self.window_size = window_size
        self.buffer = []

    def filter(self, value):
        self.buffer.append(value)
        if len(self.buffer) > self.window_size:
            self.buffer.pop(0)

        sorted_buffer = sorted(self.buffer)
        mid = len(sorted_buffer) // 2

        if len(sorted_buffer) % 2 == 0:
            return (sorted_buffer[mid - 1] + sorted_buffer[mid]) / 2
        return sorted_buffer[mid]
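Feeding the same spiky readings through a batch median sketch (`median_stream` is an illustrative helper) recovers the true level:

```python
import statistics

def median_stream(values, window_size=5):
    # Running windowed median for each incoming value.
    buf = []
    out = []
    for v in values:
        buf.append(v)
        if len(buf) > window_size:
            buf.pop(0)
        out.append(statistics.median(buf))
    return out

# Once the window is full, the median ignores the 5 and 3 spikes.
filtered = median_stream([100, 5, 98, 3, 101])
# final value: median of [3, 5, 98, 100, 101] = 98
```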

1307.4.3 Exponential Smoothing

Provides weighted average with more weight on recent values:

class ExponentialSmoothingFilter:
    def __init__(self, alpha=0.3):
        """
        alpha: smoothing factor (0-1)
        Higher alpha = more weight on recent values = less smoothing
        Lower alpha = more weight on history = more smoothing
        """
        self.alpha = alpha
        self.smoothed = None

    def filter(self, value):
        if self.smoothed is None:
            self.smoothed = value
        else:
            self.smoothed = self.alpha * value + (1 - self.alpha) * self.smoothed

        return self.smoothed
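A batch sketch (`exp_smooth` is an illustrative helper) shows how spikes still pull the estimate around, which is one reason spiky data favors the median filter instead:

```python
def exp_smooth(values, alpha=0.3):
    # Exponentially weighted running estimate, seeded with the first value.
    smoothed, s = [], None
    for v in values:
        s = v if s is None else alpha * v + (1 - alpha) * s
        smoothed.append(s)
    return smoothed

# On Pete's spiky readings each spike drags the estimate:
# 100 -> 71.5 after the first spike (0.3 * 5 + 0.7 * 100).
out = exp_smooth([100, 5, 98, 3, 101], alpha=0.3)
```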

1307.4.4 Filter Comparison

| Filter | Latency | Edge Preservation | Spike Removal | Best For |
|--------|---------|-------------------|---------------|----------|
| Moving Average | window/2 samples | Poor | Moderate | Steady-state signals |
| Median | window/2 samples | Excellent | Excellent | Spike-contaminated data |
| Exponential | Low (continuous update) | Good | Moderate | Real-time smoothing |
| Kalman | Minimal | Excellent | Excellent | Known dynamics, sensor fusion |

1307.4.5 Visual Comparison of Filters

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ecf0f1'}}}%%
flowchart LR
    subgraph Input["Noisy Signal"]
        A[Raw Data<br/>with Spikes]
    end

    subgraph Filters["Filter Options"]
        MA[Moving<br/>Average]
        MED[Median<br/>Filter]
        EXP[Exponential<br/>Smoothing]
    end

    subgraph Results["Filtered Output"]
        MA_OUT[Smoothed<br/>but Blurred]
        MED_OUT[Spikes<br/>Removed]
        EXP_OUT[Responsive<br/>Smooth]
    end

    A --> MA --> MA_OUT
    A --> MED --> MED_OUT
    A --> EXP --> EXP_OUT

    style A fill:#E74C3C,stroke:#2C3E50,color:#fff
    style MA fill:#3498DB,stroke:#2C3E50,color:#fff
    style MED fill:#16A085,stroke:#2C3E50,color:#fff
    style EXP fill:#9B59B6,stroke:#2C3E50,color:#fff
    style MA_OUT fill:#3498DB,stroke:#2C3E50,color:#fff
    style MED_OUT fill:#16A085,stroke:#2C3E50,color:#fff
    style EXP_OUT fill:#9B59B6,stroke:#2C3E50,color:#fff

1307.4.6 Choosing the Right Filter

def choose_filter(signal_characteristics):
    """
    Guide for selecting appropriate noise filter based on signal characteristics.
    """
    recommendations = {
        'steady_state_with_gaussian_noise': {
            'filter': 'MovingAverage',
            'reason': 'Averages out random noise effectively',
            'window_size': 5  # Adjust based on noise frequency
        },
        'spiky_noise_impulse': {
            'filter': 'MedianFilter',
            'reason': 'Completely ignores outlier spikes',
            'window_size': 5  # Odd number works best
        },
        'real_time_tracking': {
            'filter': 'ExponentialSmoothing',
            'reason': 'Low latency, responsive to changes',
            'alpha': 0.3  # Lower = smoother, higher = more responsive
        },
        'sensor_fusion_known_dynamics': {
            'filter': 'KalmanFilter',
            'reason': 'Optimal estimation with uncertainty tracking',
            'params': 'process_noise, measurement_noise'
        },
        'edge_preserving': {
            'filter': 'MedianFilter',
            'reason': 'Preserves sharp transitions in data',
            'window_size': 3
        }
    }
    return recommendations.get(signal_characteristics, recommendations['steady_state_with_gaussian_noise'])

1307.4.7 Combining Filters

For robust noise removal, filters can be cascaded:

class CombinedFilter:
    """
    Two-stage filter: Median first (remove spikes), then exponential smooth.
    """
    def __init__(self, median_window=5, exp_alpha=0.3):
        self.median_filter = MedianFilter(median_window)
        self.exp_filter = ExponentialSmoothingFilter(exp_alpha)

    def filter(self, value):
        # Stage 1: Remove spikes with median
        despike = self.median_filter.filter(value)
        # Stage 2: Smooth remaining noise
        smooth = self.exp_filter.filter(despike)
        return smooth

# Usage
combined = CombinedFilter(median_window=5, exp_alpha=0.2)
for reading in sensor_stream:
    clean_value = combined.filter(reading)

1307.5 Knowledge Check

Question: A motion sensor (PIR) has 30 seconds of missing data due to a network outage. What is the most appropriate imputation strategy?

Explanation: For event-driven sensors like PIR motion detectors, zero (no motion) is the correct imputation. The absence of a motion event reading should be interpreted as “no motion detected” rather than carrying forward a previous “motion” state. Forward-fill would incorrectly show continuous motion if the last reading was motion. Interpolation makes no sense for binary event data.

Question: A humidity sensor shows 10 consecutive readings of exactly 78.00%. What quality concern does this indicate?

Explanation: Real analog sensors always have some noise - even in stable environments, readings fluctuate by at least the ADC resolution. Ten identical readings (especially to two decimal places) strongly suggests a stuck sensor, frozen firmware, or caching bug. This is a data quality issue that range and outlier checks will miss because the value itself is valid. Implement “stuck value detection” by checking variance over a window.
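Such stuck-value detection can be sketched as a variance check over a sliding window (the threshold here is illustrative; tune it to your sensor's noise floor, at least the ADC resolution):

```python
def is_stuck(window, min_variance=1e-6):
    # Flags a window of readings whose variance is implausibly low
    # for a real analog sensor.
    if len(window) < 2:
        return False
    mean = sum(window) / len(window)
    variance = sum((x - mean) ** 2 for x in window) / len(window)
    return variance < min_variance

is_stuck([78.00] * 10)                     # -> True  (suspiciously flat)
is_stuck([78.0, 78.1, 77.9, 78.2, 78.0])  # -> False (normal jitter)
```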

Question: You have a pressure sensor with occasional spike readings of 0 or 4095 (ADC limits) due to electrical interference. Which filter is best?

Explanation: The median filter is ideal for spike removal because it completely ignores outliers. With a window of 5, even if 2 readings are spikes, the median will select from the valid readings. Moving average would incorporate the spikes into the output (averaging 1000, 0, 1000, 1000, 1000 gives 800, not 1000). Exponential smoothing would also be affected by spikes. Kalman assumes Gaussian noise, not impulse spikes.

1307.6 Summary

Missing value imputation and noise filtering are essential for producing clean, complete sensor data:

  • Forward Fill: Simple and effective for slowly-changing continuous values (temperature, humidity)
  • Linear Interpolation: Better for trending data when you have values on both sides of the gap
  • Seasonal Fill: Use when data has known periodic patterns (daily temperature cycles)
  • Sensor-Specific Imputation: Motion sensors get zero, state sensors get “unknown”, counters get zero increment
  • Moving Average: Good for steady-state Gaussian noise, but blurs edges
  • Median Filter: Excellent for spike removal, preserves sharp transitions
  • Exponential Smoothing: Real-time with minimal latency, tunable responsiveness

Critical Design Principle: Always match your imputation and filtering strategy to the sensor type and data characteristics. A one-size-fits-all approach will produce incorrect results for at least some of your sensors.

1307.7 What’s Next

The next chapter covers Data Normalization and the Preprocessing Lab, exploring how to scale data for multi-sensor fusion and providing hands-on practice with a complete data quality pipeline.
