1307  Missing Value Imputation and Noise Filtering

1307.1 Learning Objectives

By the end of this chapter, you will be able to:

  • Handle Missing Data: Implement appropriate imputation strategies for different sensor types and use cases
  • Select Imputation Methods: Choose between forward-fill, interpolation, and seasonal decomposition based on data characteristics
  • Design Noise Filters: Implement moving average, median, and exponential smoothing filters for signal conditioning
  • Match Strategy to Sensor Type: Apply the correct imputation and filtering approach based on sensor semantics

1307.2 Prerequisites

Before diving into this chapter, you should be familiar with the sensor data collection and data quality concepts covered earlier in this series.

Missing data is like a puzzle with missing pieces - but we can use clues to fill them in!

1307.2.1 The Sensor Squad Adventure: The Mystery of the Missing Messages

Max the Microcontroller was worried. “Motion Mo didn’t send ANY reading! His battery must have died.”

“Oh no!” said Lila the LED. “What do we put in the report?”

Sammy the Sensor thought carefully. “Well, what kind of sensor is Mo?”

“He’s a motion sensor - he only says something when he sees movement!”

Sammy smiled. “Then if he didn’t report anything… what do you think that means?”

Max realized: “No message means no motion! We can write down ‘No motion detected!’”

But then they looked at Temperature Terry’s readings: 22… 22… GAP… GAP… GAP… 23.

“Hmm,” said Sammy. “Temperature changes slowly. If it was 22 before and 23 after, what was it probably during the gap?”

Bella the Battery did the math: “Probably 22… then 22.5… then 23! It was slowly warming up!”

The Sensor Squad learned two tricks:

  1. For motion sensors: No news = no motion (use zero!)
  2. For temperature: Connect the dots between the readings we DO have

1307.2.2 Smoothing Out the Noise

Pressure Pete’s readings were jumping around: 100, 5, 98, 3, 101…

“Wait,” said Sammy. “Pressure can’t really jump from 100 to 5 and back! That’s just noise - like static on a radio.”

“How do we fix it?” asked Lila.

“Let me take the AVERAGE of a few readings: (100 + 5 + 98 + 3 + 101) / 5 = about 61… Hmm, that’s not right either because those 5s and 3s are wrong!”

Bella suggested: “What if we put them in order and take the MIDDLE one? 3, 5, 98, 100, 101 - the middle is 98! That’s probably the real pressure!”

The Sensor Squad learned: The median filter ignores the crazy outliers and finds the true reading!

1307.2.3 Key Words for Kids

| Word | What It Means |
|------|---------------|
| Missing Data | When a sensor doesn’t send any information - like a friend who doesn’t answer |
| Imputation | Filling in gaps with good guesses based on clues |
| Noise | Random jumpy readings that hide the real information |
| Filtering | Smoothing out the noise to find the real signal |
| Median | The middle number when you sort them in order |

Missing data and noise are inevitable in real IoT deployments. Sensors lose power, networks drop packets, and electronic noise corrupts readings. Rather than discard incomplete data, we can intelligently fill gaps and smooth noise.

Two key challenges:

| Challenge | Cause | Solution |
|-----------|-------|----------|
| Missing Values | Battery death, network outage, sensor failure | Imputation (filling gaps) |
| Noisy Signals | Electrical interference, quantization, vibration | Filtering (smoothing) |

Important distinction:

  • Missing: No data point received at all
  • Noisy: Data received but corrupted or fluctuating

Key question this chapter answers: “How do I fill gaps in sensor data and smooth out noise without losing important information?”

Tip - Minimum Viable Understanding: Imputation and Filtering

Core Concept: Missing value imputation fills data gaps using neighboring or historical values, while noise filtering smooths random fluctuations to reveal the underlying signal - both must be matched to sensor semantics.

Why It Matters: Analytics and ML models require complete data series. Gaps cause errors or require discarding entire time windows. Noise obscures real patterns and triggers false alerts. Proper handling preserves data integrity while enabling downstream processing.

Key Takeaway: Match your strategy to sensor type - use forward-fill for slow-changing values (temperature), zero for event sensors (motion), and median filter for spike removal. Never interpolate binary/categorical data or forward-fill event streams.

1307.3 Missing Value Imputation

  • ~10 min | Intermediate | P10.C09.U04

1307.3.1 Forward Fill (Last Observation Carried Forward)

Best for slowly-changing values like temperature:

class ForwardFillImputer:
    def __init__(self, max_gap=60):  # Maximum gap in samples
        self.last_valid = None
        self.gap_count = 0
        self.max_gap = max_gap

    def impute(self, value, is_valid):
        if is_valid:
            self.last_valid = value
            self.gap_count = 0
            return value, "original"

        if self.last_valid is None:
            return None, "no_history"

        self.gap_count += 1
        if self.gap_count > self.max_gap:
            return None, "gap_too_large"

        return self.last_valid, "imputed_ffill"
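The same policy can also be applied in batch form to a whole series. The `forward_fill` helper below is an illustrative sketch (not part of the chapter's API), using `None` for missing samples:

```python
def forward_fill(series, max_gap=60):
    """Carry the last valid reading forward, up to max_gap samples.

    Gaps longer than max_gap stay None so downstream code can flag them.
    """
    filled = []
    last_valid = None
    gap = 0
    for value in series:
        if value is not None:
            last_valid, gap = value, 0
            filled.append(value)
        else:
            gap += 1
            filled.append(last_valid if last_valid is not None and gap <= max_gap else None)
    return filled

forward_fill([22.0, None, None, 23.0], max_gap=2)
# -> [22.0, 22.0, 22.0, 23.0]
```

Note that with `max_gap=1` the second gap sample would remain `None`, preserving the "gap too large" signal from the streaming version.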

1307.3.2 Linear Interpolation

Better for trending values when future data is available:

def linear_interpolate(data, timestamps):
    """
    Interpolate missing values (None/NaN) using linear interpolation.
    Requires knowledge of surrounding valid points.
    """
    import numpy as np

    data = np.array(data, dtype=float)
    timestamps = np.array(timestamps, dtype=float)

    valid_mask = ~np.isnan(data)
    valid_indices = np.where(valid_mask)[0]

    if len(valid_indices) < 2:
        return data

    # Interpolate
    interpolated = np.interp(
        timestamps,
        timestamps[valid_mask],
        data[valid_mask]
    )

    return interpolated
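Applied to the gap from the Sensor Squad story, the function reduces to a direct `np.interp` call; a minimal sketch:

```python
import numpy as np

# Terry's readings: 22, GAP, GAP, 23, sampled at equal intervals.
data = np.array([22.0, np.nan, np.nan, 23.0])
timestamps = np.array([0.0, 1.0, 2.0, 3.0])

valid = ~np.isnan(data)
filled = np.interp(timestamps, timestamps[valid], data[valid])
# filled is approximately [22.0, 22.33, 22.67, 23.0]
```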

1307.3.3 Seasonal Decomposition Fill

For data with known patterns (e.g., temperature with daily cycles):

def seasonal_fill(data, period=24):
    """
    Fill missing values using seasonal pattern from historical data.
    period: number of samples in one cycle (e.g., 24 for hourly data with daily cycle)
    """
    import numpy as np

    data = np.array(data, dtype=float)
    n = len(data)

    # Calculate seasonal pattern from valid data
    seasonal = np.zeros(period)
    counts = np.zeros(period)

    for i, val in enumerate(data):
        if not np.isnan(val):
            seasonal[i % period] += val
            counts[i % period] += 1

    # Average seasonal values
    with np.errstate(divide='ignore', invalid='ignore'):
        seasonal = np.where(counts > 0, seasonal / counts, np.nan)

    # Fill missing with seasonal pattern
    filled = data.copy()
    for i in range(n):
        if np.isnan(filled[i]) and not np.isnan(seasonal[i % period]):
            filled[i] = seasonal[i % period]

    return filled
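A minimal sketch of the same slot-averaging idea on a toy series (the period and values are chosen purely for illustration):

```python
import numpy as np

# Toy series with a cycle of period 4; np.nan marks the gap.
data = np.array([10.0, 20.0, 30.0, 40.0, 10.0, np.nan, 30.0, 40.0])
period = 4

# Average the valid readings in each phase slot of the cycle...
slot_means = [np.nanmean(data[i::period]) for i in range(period)]

# ...then fill each gap with its slot's mean.
filled = data.copy()
for i, v in enumerate(filled):
    if np.isnan(v):
        filled[i] = slot_means[i % period]
# The gap at index 5 (slot 1) is filled with 20.0.
```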

Warning - Common Pitfall: Wrong Imputation Strategy

The mistake: Using forward-fill for event-driven sensors (motion, door open/close) or interpolation for categorical data.

Symptoms:

  • Motion sensor shows constant “motion detected” during sensor offline period
  • Door sensor shows “open” for hours when sensor battery died while door was open
  • Analytics show unrealistic patterns during imputed periods

Why it happens: One-size-fits-all imputation applied without considering sensor semantics.

The fix: Match imputation strategy to sensor type:

| Sensor Type | Imputation Strategy | Reason |
|-------------|---------------------|--------|
| Temperature | Forward-fill or interpolate | Slowly changing, continuous |
| Motion | Use “no motion” (0) | Absence of reading means no motion |
| Door state | Flag as unknown | Cannot assume state |
| Counter | Zero increment | Missing period means no events |
| Humidity | Interpolate with bounds | Continuous, bounded 0-100% |

Prevention: Create sensor metadata that specifies the imputation strategy for each sensor type in your deployment.
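Such metadata can be as simple as a lookup table; a minimal sketch, with illustrative sensor names and strategy labels (not a standard schema):

```python
# Illustrative metadata mapping sensor types to imputation strategies.
IMPUTATION_STRATEGY = {
    "temperature": "forward_fill",    # slowly changing, continuous
    "humidity":    "interpolate",     # continuous, bounded 0-100%
    "motion":      "zero",            # no reading means no motion
    "door_state":  "unknown",         # cannot assume state
    "counter":     "zero_increment",  # missing period means no events
}

def strategy_for(sensor_type):
    # Fall back to the safest option for unrecognized types.
    return IMPUTATION_STRATEGY.get(sensor_type, "unknown")
```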

1307.3.4 Imputation Strategy Selection

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ecf0f1'}}}%%
flowchart TD
    Start[Missing Data<br/>Detected] --> Type{Sensor Type?}

    Type -->|Continuous<br/>Temperature, Humidity| Cont[Continuous<br/>Measurement]
    Type -->|Event-Based<br/>Motion, Button| Event[Event<br/>Sensor]
    Type -->|State<br/>Door, Switch| State[Binary<br/>State]
    Type -->|Counter<br/>Energy, Traffic| Counter[Cumulative<br/>Counter]

    Cont --> Gap{Gap Size?}
    Gap -->|Short<br/>< 5 samples| FFill[Forward Fill]
    Gap -->|Medium<br/>5-60 samples| Interp[Linear<br/>Interpolation]
    Gap -->|Long<br/>> 60 samples| Seasonal[Seasonal<br/>Decomposition]

    Event --> Zero[Use Zero<br/>No Event]

    State --> Unknown[Flag as<br/>Unknown]

    Counter --> ZeroInc[Zero<br/>Increment]

    style Start fill:#2C3E50,stroke:#16A085,color:#fff
    style FFill fill:#27AE60,stroke:#2C3E50,color:#fff
    style Interp fill:#16A085,stroke:#2C3E50,color:#fff
    style Seasonal fill:#3498DB,stroke:#2C3E50,color:#fff
    style Zero fill:#E67E22,stroke:#2C3E50,color:#fff
    style Unknown fill:#7F8C8D,stroke:#2C3E50,color:#fff
    style ZeroInc fill:#9B59B6,stroke:#2C3E50,color:#fff

1307.4 Noise Filtering Techniques

  • ~15 min | Advanced | P10.C09.U05

1307.4.1 Moving Average Filter

Simple and effective for steady-state noise reduction:

class MovingAverageFilter:
    def __init__(self, window_size=5):
        self.window_size = window_size
        self.buffer = []

    def filter(self, value):
        self.buffer.append(value)
        if len(self.buffer) > self.window_size:
            self.buffer.pop(0)

        return sum(self.buffer) / len(self.buffer)
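A quick batch sketch (the `moving_average_stream` helper is illustrative) shows why plain averaging struggles with Pete's spiky readings from the story:

```python
from collections import deque

def moving_average_stream(values, window_size=5):
    # Running windowed mean for each incoming value.
    buf = deque(maxlen=window_size)
    out = []
    for v in values:
        buf.append(v)
        out.append(sum(buf) / len(buf))
    return out

# The outlier spikes (5 and 3) drag the average far below the
# true level of ~100.
averaged = moving_average_stream([100, 5, 98, 3, 101])
# final value: (100 + 5 + 98 + 3 + 101) / 5 = 61.4
```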

1307.4.2 Median Filter

Excellent for removing spike noise while preserving edges:

class MedianFilter:
    def __init__(self, window_size=5):
        self.window_size = window_size
        self.buffer = []

    def filter(self, value):
        self.buffer.append(value)
        if len(self.buffer) > self.window_size:
            self.buffer.pop(0)

        sorted_buffer = sorted(self.buffer)
        mid = len(sorted_buffer) // 2

        if len(sorted_buffer) % 2 == 0:
            return (sorted_buffer[mid - 1] + sorted_buffer[mid]) / 2
        return sorted_buffer[mid]
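Feeding the same spiky readings through a batch median sketch (`median_stream` is an illustrative helper) recovers the true level:

```python
import statistics

def median_stream(values, window_size=5):
    # Running windowed median for each incoming value.
    buf = []
    out = []
    for v in values:
        buf.append(v)
        if len(buf) > window_size:
            buf.pop(0)
        out.append(statistics.median(buf))
    return out

# Once the window is full, the median ignores the 5 and 3 spikes.
filtered = median_stream([100, 5, 98, 3, 101])
# final value: median of [3, 5, 98, 100, 101] = 98
```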

1307.4.3 Exponential Smoothing

Provides weighted average with more weight on recent values:

class ExponentialSmoothingFilter:
    def __init__(self, alpha=0.3):
        """
        alpha: smoothing factor (0-1)
        Higher alpha = more weight on recent values = less smoothing
        Lower alpha = more weight on history = more smoothing
        """
        self.alpha = alpha
        self.smoothed = None

    def filter(self, value):
        if self.smoothed is None:
            self.smoothed = value
        else:
            self.smoothed = self.alpha * value + (1 - self.alpha) * self.smoothed

        return self.smoothed
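A batch sketch (`exp_smooth` is an illustrative helper) shows how spikes still pull the estimate around, which is one reason spiky data favors the median filter instead:

```python
def exp_smooth(values, alpha=0.3):
    # Exponentially weighted running estimate, seeded with the first value.
    smoothed, s = [], None
    for v in values:
        s = v if s is None else alpha * v + (1 - alpha) * s
        smoothed.append(s)
    return smoothed

# On Pete's spiky readings each spike drags the estimate:
# 100 -> 71.5 after the first spike (0.3 * 5 + 0.7 * 100).
out = exp_smooth([100, 5, 98, 3, 101], alpha=0.3)
```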

1307.4.4 Filter Comparison

| Filter | Latency | Edge Preservation | Spike Removal | Best For |
|--------|---------|-------------------|---------------|----------|
| Moving Average | window/2 samples | Poor | Moderate | Steady-state signals |
| Median | window/2 samples | Excellent | Excellent | Spike-contaminated data |
| Exponential | Low (continuous update) | Good | Moderate | Real-time smoothing |
| Kalman | Minimal | Excellent | Excellent | Known dynamics, sensor fusion |

1307.4.5 Visual Comparison of Filters

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ecf0f1'}}}%%
flowchart LR
    subgraph Input["Noisy Signal"]
        A[Raw Data<br/>with Spikes]
    end

    subgraph Filters["Filter Options"]
        MA[Moving<br/>Average]
        MED[Median<br/>Filter]
        EXP[Exponential<br/>Smoothing]
    end

    subgraph Results["Filtered Output"]
        MA_OUT[Smoothed<br/>but Blurred]
        MED_OUT[Spikes<br/>Removed]
        EXP_OUT[Responsive<br/>Smooth]
    end

    A --> MA --> MA_OUT
    A --> MED --> MED_OUT
    A --> EXP --> EXP_OUT

    style A fill:#E74C3C,stroke:#2C3E50,color:#fff
    style MA fill:#3498DB,stroke:#2C3E50,color:#fff
    style MED fill:#16A085,stroke:#2C3E50,color:#fff
    style EXP fill:#9B59B6,stroke:#2C3E50,color:#fff
    style MA_OUT fill:#3498DB,stroke:#2C3E50,color:#fff
    style MED_OUT fill:#16A085,stroke:#2C3E50,color:#fff
    style EXP_OUT fill:#9B59B6,stroke:#2C3E50,color:#fff

1307.4.6 Choosing the Right Filter

def choose_filter(signal_characteristics):
    """
    Guide for selecting appropriate noise filter based on signal characteristics.
    """
    recommendations = {
        'steady_state_with_gaussian_noise': {
            'filter': 'MovingAverage',
            'reason': 'Averages out random noise effectively',
            'window_size': 5  # Adjust based on noise frequency
        },
        'spiky_noise_impulse': {
            'filter': 'MedianFilter',
            'reason': 'Completely ignores outlier spikes',
            'window_size': 5  # Odd number works best
        },
        'real_time_tracking': {
            'filter': 'ExponentialSmoothing',
            'reason': 'Low latency, responsive to changes',
            'alpha': 0.3  # Lower = smoother, higher = more responsive
        },
        'sensor_fusion_known_dynamics': {
            'filter': 'KalmanFilter',
            'reason': 'Optimal estimation with uncertainty tracking',
            'params': 'process_noise, measurement_noise'
        },
        'edge_preserving': {
            'filter': 'MedianFilter',
            'reason': 'Preserves sharp transitions in data',
            'window_size': 3
        }
    }
    return recommendations.get(signal_characteristics, recommendations['steady_state_with_gaussian_noise'])

1307.4.7 Combining Filters

For robust noise removal, filters can be cascaded:

class CombinedFilter:
    """
    Two-stage filter: Median first (remove spikes), then exponential smooth.
    """
    def __init__(self, median_window=5, exp_alpha=0.3):
        self.median_filter = MedianFilter(median_window)
        self.exp_filter = ExponentialSmoothingFilter(exp_alpha)

    def filter(self, value):
        # Stage 1: Remove spikes with median
        despike = self.median_filter.filter(value)
        # Stage 2: Smooth remaining noise
        smooth = self.exp_filter.filter(despike)
        return smooth

# Usage
combined = CombinedFilter(median_window=5, exp_alpha=0.2)
for reading in sensor_stream:
    clean_value = combined.filter(reading)

1307.5 Knowledge Check

Question: A motion sensor (PIR) has 30 seconds of missing data due to a network outage. What is the most appropriate imputation strategy?

Explanation: For event-driven sensors like PIR motion detectors, zero (no motion) is the correct imputation. The absence of a motion event reading should be interpreted as “no motion detected” rather than carrying forward a previous “motion” state. Forward-fill would incorrectly show continuous motion if the last reading was motion. Interpolation makes no sense for binary event data.

Question: A humidity sensor shows 10 consecutive readings of exactly 78.00%. What quality concern does this indicate?

Explanation: Real analog sensors always have some noise - even in stable environments, readings fluctuate by at least the ADC resolution. Ten identical readings (especially to two decimal places) strongly suggests a stuck sensor, frozen firmware, or caching bug. This is a data quality issue that range and outlier checks will miss because the value itself is valid. Implement “stuck value detection” by checking variance over a window.
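Such stuck-value detection can be sketched as a variance check over a sliding window (the threshold here is illustrative; tune it to your sensor's noise floor, at least the ADC resolution):

```python
def is_stuck(window, min_variance=1e-6):
    # Flags a window of readings whose variance is implausibly low
    # for a real analog sensor.
    if len(window) < 2:
        return False
    mean = sum(window) / len(window)
    variance = sum((x - mean) ** 2 for x in window) / len(window)
    return variance < min_variance

is_stuck([78.00] * 10)                     # -> True  (suspiciously flat)
is_stuck([78.0, 78.1, 77.9, 78.2, 78.0])  # -> False (normal jitter)
```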

Question: You have a pressure sensor with occasional spike readings of 0 or 4095 (ADC limits) due to electrical interference. Which filter is best?

Explanation: The median filter is ideal for spike removal because it completely ignores outliers. With a window of 5, even if 2 readings are spikes, the median will select from the valid readings. Moving average would incorporate the spikes into the output (averaging 1000, 0, 1000, 1000, 1000 gives 800, not 1000). Exponential smoothing would also be affected by spikes. Kalman assumes Gaussian noise, not impulse spikes.

1307.6 Summary

Missing value imputation and noise filtering are essential for producing clean, complete sensor data:

  • Forward Fill: Simple and effective for slowly-changing continuous values (temperature, humidity)
  • Linear Interpolation: Better for trending data when you have values on both sides of the gap
  • Seasonal Fill: Use when data has known periodic patterns (daily temperature cycles)
  • Sensor-Specific Imputation: Motion sensors get zero, state sensors get “unknown”, counters get zero increment
  • Moving Average: Good for steady-state Gaussian noise, but blurs edges
  • Median Filter: Excellent for spike removal, preserves sharp transitions
  • Exponential Smoothing: Real-time with minimal latency, tunable responsiveness

Critical Design Principle: Always match your imputation and filtering strategy to the sensor type and data characteristics. A one-size-fits-all approach will produce incorrect results for at least some of your sensors.

1307.7 What’s Next

The next chapter covers Data Normalization and the Preprocessing Lab, exploring how to scale data for multi-sensor fusion and providing hands-on practice with a complete data quality pipeline.
