43 Imputation & Noise Filtering
43.1 Learning Objectives
By the end of this chapter, you will be able to:
- Implement Missing Data Handling: Apply appropriate imputation strategies (forward-fill, interpolation, seasonal decomposition) for different sensor types
- Compare Imputation Methods: Evaluate trade-offs between forward-fill, interpolation, and seasonal decomposition based on data characteristics
- Design Noise Filters: Implement moving average, median, and exponential smoothing filters and assess their signal conditioning performance
- Distinguish Strategy by Sensor Type: Justify the correct imputation and filtering approach based on sensor semantics and physical behavior
43.2 Prerequisites
Before diving into this chapter, you should be familiar with:
- Data Validation and Outlier Detection: Understanding validation as the first stage of the data quality pipeline
- Signal Processing Essentials: Basic concepts of filtering and signal conditioning
- Sensor Fundamentals: Knowledge of different sensor types and their output characteristics
For Kids: The Sensor Squad Fills the Gaps!
Missing data is like a puzzle with missing pieces - but we can use clues to fill them in!
43.2.1 The Sensor Squad Adventure: The Mystery of the Missing Messages
Max the Microcontroller was worried. “Motion Mo didn’t send ANY reading! His battery must have died.”
“Oh no!” said Lila the LED. “What do we put in the report?”
Sammy the Sensor thought carefully. “Well, what kind of sensor is Mo?”
“He’s a motion sensor - he only says something when he sees movement!”
Sammy smiled. “Then if he didn’t report anything… what do you think that means?”
Max realized: “No message means no motion! We can write down ‘No motion detected!’”
But then they looked at Temperature Terry’s readings: 22… 22… GAP… GAP… GAP… 23.
“Hmm,” said Sammy. “Temperature changes slowly. If it was 22 before and 23 after, what was it probably during the gap?”
Bella the Battery did the math: “Probably 22… then 22.5… then 23! It was slowly warming up!”
The Sensor Squad learned two tricks:
- For motion sensors: No news = no motion (use zero!)
- For temperature: Connect the dots between the readings we DO have
43.2.2 Smoothing Out the Noise
Pressure Pete’s readings were jumping around: 100, 5, 98, 3, 101…
“Wait,” said Sammy. “Pressure can’t really jump from 100 to 5 and back! That’s just noise - like static on a radio.”
“How do we fix it?” asked Lila.
“Let me take the AVERAGE of a few readings: (100 + 5 + 98 + 3 + 101) / 5 = about 61… Hmm, that’s not right either because those 5s and 3s are wrong!”
Bella suggested: “What if we put them in order and take the MIDDLE one? 3, 5, 98, 100, 101 - the middle is 98! That’s probably the real pressure!”
The Sensor Squad learned: The median filter ignores the crazy outliers and finds the true reading!
43.2.3 Key Words for Kids
| Word | What It Means |
|---|---|
| Missing Data | When a sensor doesn’t send any information - like a friend who doesn’t answer |
| Imputation | Filling in gaps with good guesses based on clues |
| Noise | Random jumpy readings that hide the real information |
| Filtering | Smoothing out the noise to find the real signal |
| Median | The middle number when you sort them in order |
For Beginners: Why Do We Need Imputation and Filtering?
Missing data and noise are inevitable in real IoT deployments. Sensors lose power, networks drop packets, and electronic noise corrupts readings. Rather than discard incomplete data, we can intelligently fill gaps and smooth noise.
Two key challenges:
| Challenge | Cause | Solution |
|---|---|---|
| Missing Values | Battery death, network outage, sensor failure | Imputation (filling gaps) |
| Noisy Signals | Electrical interference, quantization, vibration | Filtering (smoothing) |
Important distinction:
- Missing: No data point received at all
- Noisy: Data received but corrupted or fluctuating
Key question this chapter answers: “How do I fill gaps in sensor data and smooth out noise without losing important information?”
Minimum Viable Understanding: Imputation and Filtering
Core Concept: Missing value imputation fills data gaps using neighboring or historical values, while noise filtering smooths random fluctuations to reveal the underlying signal - both must be matched to sensor semantics.
Why It Matters: Analytics and ML models require complete data series. Gaps cause errors or require discarding entire time windows. Noise obscures real patterns and triggers false alerts. Proper handling preserves data integrity while enabling downstream processing.
Key Takeaway: Match your strategy to sensor type - use forward-fill for slow-changing values (temperature), zero for event sensors (motion), and median filter for spike removal. Never interpolate binary/categorical data or forward-fill event streams.
43.3 Missing Value Imputation
Key Concepts
- Missing data imputation: The process of estimating and filling in missing sensor readings using statistical or model-based methods rather than discarding incomplete records.
- Forward-fill (Last Observation Carried Forward): An imputation strategy that replaces a missing value with the most recent valid reading — appropriate for slowly changing sensors but misleading for rapidly varying signals.
- Linear interpolation: Estimating a missing value by drawing a straight line between the surrounding valid readings — appropriate for sensors with smooth, continuous dynamics.
- Moving average filter: A signal smoothing technique that replaces each reading with the mean of a surrounding window, attenuating high-frequency noise at the cost of introducing lag.
- Median filter: A non-linear filter that replaces each reading with the median of its window, highly effective at removing impulse noise (transient spikes) without distorting edges.
- Missingness mechanism: The reason data is missing — Missing Completely at Random (MCAR), Missing at Random (MAR), or Missing Not at Random (MNAR) — which determines which imputation methods are statistically valid.
43.3.1 Forward Fill (Last Observation Carried Forward)
Best for slowly-changing values like temperature:
class ForwardFillImputer:
def __init__(self, max_gap=60): # Maximum gap in samples
self.last_valid = None
self.gap_count = 0
self.max_gap = max_gap
def impute(self, value, is_valid):
if is_valid:
self.last_valid = value
self.gap_count = 0
return value, "original"
if self.last_valid is None:
return None, "no_history"
self.gap_count += 1
if self.gap_count > self.max_gap:
return None, "gap_too_large"
return self.last_valid, "imputed_ffill"
43.3.2 Linear Interpolation
Better for trending values when future data is available:
def linear_interpolate(data, timestamps):
"""
Interpolate missing values (None/NaN) using linear interpolation.
Requires knowledge of surrounding valid points.
"""
import numpy as np
data = np.array(data, dtype=float)
timestamps = np.array(timestamps, dtype=float)
valid_mask = ~np.isnan(data)
valid_indices = np.where(valid_mask)[0]
if len(valid_indices) < 2:
return data
# Interpolate
interpolated = np.interp(
timestamps,
timestamps[valid_mask],
data[valid_mask]
)
return interpolated
43.3.3 Seasonal Decomposition Fill
For data with known patterns (e.g., temperature with daily cycles):
def seasonal_fill(data, period=24):
"""
Fill missing values using seasonal pattern from historical data.
period: number of samples in one cycle (e.g., 24 for hourly data with daily cycle)
"""
import numpy as np
data = np.array(data, dtype=float)
n = len(data)
# Calculate seasonal pattern from valid data
seasonal = np.zeros(period)
counts = np.zeros(period)
for i, val in enumerate(data):
if not np.isnan(val):
seasonal[i % period] += val
counts[i % period] += 1
# Average seasonal values
with np.errstate(divide='ignore', invalid='ignore'):
seasonal = np.where(counts > 0, seasonal / counts, np.nan)
# Fill missing with seasonal pattern
filled = data.copy()
for i in range(n):
if np.isnan(filled[i]) and not np.isnan(seasonal[i % period]):
filled[i] = seasonal[i % period]
return filled
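The interpolation step at the heart of `linear_interpolate` above can be exercised on a small gap-laden series. A minimal self-contained sketch (values and timestamps invented for illustration):

```python
import numpy as np

# A temperature series sampled every minute, with a 3-sample gap (NaN)
data = np.array([22.0, 22.1, np.nan, np.nan, np.nan, 22.5], dtype=float)
t = np.arange(len(data), dtype=float)  # timestamps in minutes

valid = ~np.isnan(data)
# Linear interpolation across the gap, as in linear_interpolate() above
filled = np.interp(t, t[valid], data[valid])
print(filled)  # the gap is filled with values rising smoothly from 22.1 to 22.5
```

Note that `np.interp` also holds the endpoint values flat outside the valid range, so leading or trailing gaps become a forward/backward fill rather than an extrapolation.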
Common Pitfall: Wrong Imputation Strategy
The mistake: Using forward-fill for event-driven sensors (motion, door open/close) or interpolation for categorical data.
Symptoms:
- Motion sensor shows constant “motion detected” during sensor offline period
- Door sensor shows “open” for hours when sensor battery died while door was open
- Analytics show unrealistic patterns during imputed periods
Why it happens: One-size-fits-all imputation applied without considering sensor semantics.
The fix: Always match imputation strategy to sensor type. See the Decision Framework later in this chapter for a complete sensor-to-strategy mapping table and decision tree.
Prevention: Create sensor metadata that specifies the imputation strategy for each sensor type in your deployment.
Putting Numbers to It
How long can you safely forward-fill temperature data?
For a typical indoor temperature sensor:
- Normal change rate: 0.5°C per hour (HVAC cycles)
- Maximum change rate: 3°C per hour (HVAC failure, door open)
- Sensor sampling: Every 60 seconds
Gap duration analysis:
| Gap Duration | Expected Change | Forward-Fill Error | Acceptable? |
|---|---|---|---|
| 1 minute | 0.008°C (normal) | ~0.01°C | ✓ Excellent |
| 5 minutes | 0.042°C | ~0.05°C | ✓ Very good |
| 30 minutes | 0.25°C | ~0.3°C | ✓ Good (trend analysis OK) |
| 2 hours | 1.0°C | ~1.5°C | ✗ Poor (alert thresholds invalid) |
Formula for maximum safe gap:
\[t_{max} = \frac{\epsilon_{acceptable}}{r_{max}}\]
Where \(\epsilon_{acceptable}\) is the maximum tolerable error and \(r_{max}\) is the maximum expected change rate.
Example: For ±1°C acceptable error and 3°C/hour max rate:
\[t_{max} = \frac{1°C}{3°C/\text{hour}} = 0.33 \text{ hours} = 20 \text{ minutes}\]
Recommendation: Forward-fill temperature for max 20 minutes. Beyond that, flag as “sensor offline” rather than impute.
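The formula above is simple enough to automate as a helper; a minimal sketch (function name hypothetical):

```python
def max_safe_gap_minutes(acceptable_error_c, max_rate_c_per_hour):
    """Maximum gap (in minutes) that forward-fill can safely span:
    t_max = epsilon_acceptable / r_max, converted from hours to minutes."""
    return 60.0 * acceptable_error_c / max_rate_c_per_hour

# Chapter example: +/-1 degC tolerance, 3 degC/hour worst-case drift
print(max_safe_gap_minutes(1.0, 3.0))  # -> 20.0 minutes
```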
43.3.4 Try It: Forward-Fill Gap Calculator
43.4 Noise Filtering Techniques
43.4.1 Moving Average Filter
Simple and effective for steady-state noise reduction:
class MovingAverageFilter:
def __init__(self, window_size=5):
self.window_size = window_size
self.buffer = []
def filter(self, value):
self.buffer.append(value)
if len(self.buffer) > self.window_size:
self.buffer.pop(0)
return sum(self.buffer) / len(self.buffer)
43.4.2 Median Filter
Excellent for removing spike noise while preserving edges:
class MedianFilter:
def __init__(self, window_size=5):
self.window_size = window_size
self.buffer = []
def filter(self, value):
self.buffer.append(value)
if len(self.buffer) > self.window_size:
self.buffer.pop(0)
sorted_buffer = sorted(self.buffer)
mid = len(sorted_buffer) // 2
if len(sorted_buffer) % 2 == 0:
return (sorted_buffer[mid - 1] + sorted_buffer[mid]) / 2
return sorted_buffer[mid]
43.4.3 Exponential Smoothing
Provides weighted average with more weight on recent values:
class ExponentialSmoothingFilter:
def __init__(self, alpha=0.3):
"""
alpha: smoothing factor (0-1)
Higher alpha = more weight on recent values = less smoothing
Lower alpha = more weight on history = more smoothing
"""
self.alpha = alpha
self.smoothed = None
def filter(self, value):
if self.smoothed is None:
self.smoothed = value
else:
self.smoothed = self.alpha * value + (1 - self.alpha) * self.smoothed
return self.smoothed
43.4.4 Filter Comparison
| Filter | Latency | Edge Preservation | Spike Removal | Best For |
|---|---|---|---|---|
| Moving Average | (N-1)/2 samples | Poor | Moderate | Steady-state signals |
| Median | (N-1)/2 samples | Excellent | Excellent | Spike-contaminated data |
| Exponential | ≈ (1-α)/α samples | Good | Moderate | Real-time smoothing |
| Kalman | Minimal | Excellent | Excellent | Known dynamics, sensor fusion |
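The comparison above can be checked empirically. A self-contained sketch that re-implements the three filters from this chapter in compact streaming form and feeds them a spike-contaminated signal:

```python
from statistics import median

def moving_average(xs, n=5):
    buf, out = [], []
    for x in xs:
        buf.append(x)
        if len(buf) > n:
            buf.pop(0)
        out.append(sum(buf) / len(buf))  # mean of current window
    return out

def median_filter(xs, n=5):
    buf, out = [], []
    for x in xs:
        buf.append(x)
        if len(buf) > n:
            buf.pop(0)
        out.append(median(buf))  # middle value ignores isolated spikes
    return out

def exp_smooth(xs, alpha=0.3):
    s, out = None, []
    for x in xs:
        s = x if s is None else alpha * x + (1 - alpha) * s
        out.append(s)
    return out

signal = [100, 100, 5, 100, 100, 100, 3, 100, 100]  # two impulse spikes
print(median_filter(signal)[-1])   # spike rejected entirely
print(moving_average(signal)[-1])  # spike still leaks into the average
```

The median filter outputs 100 throughout, while the moving average's final window still contains the spike value 3 and is dragged down to 80.6, matching the "Spike Removal" column above.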
Putting Numbers to It
How does filter window size affect noise reduction and latency?
For a moving average filter with Gaussian noise (σ = 2.0°C) on temperature sensor:
Noise reduction formula:
\[\sigma_{filtered} = \frac{\sigma_{original}}{\sqrt{N}}\]
Where \(N\) is the window size.
| Window Size (N) | Noise Reduction | Latency (samples) | Temperature Example |
|---|---|---|---|
| 3 | \(\sigma / \sqrt{3} = 0.58\sigma\) | 1.0 | 2.0°C → 1.15°C noise |
| 5 | \(\sigma / \sqrt{5} = 0.45\sigma\) | 2.0 | 2.0°C → 0.89°C noise |
| 10 | \(\sigma / \sqrt{10} = 0.32\sigma\) | 4.5 | 2.0°C → 0.63°C noise |
| 20 | \(\sigma / \sqrt{20} = 0.22\sigma\) | 9.5 | 2.0°C → 0.45°C noise |
Trade-off: Larger window → better noise rejection BUT longer delay detecting real changes.
Latency calculation: Output lags input by \(\frac{N-1}{2}\) samples. For \(N=10\) at 1 Hz sampling → 4.5 second delay.
Practical rule: Choose \(N\) such that latency is < 10% of the timescale you care about. If monitoring hourly HVAC cycles (3600 s), a 5-10 sample window (2-4.5 s latency at 1 Hz) is easily acceptable. If detecting rapid door-opening events (10 s timescale), use \(N=3\) (1 sample ≈ 1 s latency).
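The noise-versus-latency trade-off above can be tabulated with a small helper (function name hypothetical):

```python
import math

def window_tradeoff(sigma, n, sample_rate_hz):
    """Residual noise after an N-point moving average (sigma / sqrt(N))
    and the lag it introduces ((N-1)/2 samples, converted to seconds)."""
    filtered_sigma = sigma / math.sqrt(n)
    latency_s = (n - 1) / 2 / sample_rate_hz
    return filtered_sigma, latency_s

# Chapter example: 2.0 degC Gaussian noise, 1 Hz sampling
for n in (3, 5, 10, 20):
    s, lag = window_tradeoff(2.0, n, 1.0)
    print(f"N={n:2d}: noise {s:.2f} degC, lag {lag:.1f} s")
```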
43.4.5 Try It: Filter Window Size Calculator
43.4.6 Try It: Exponential Smoothing Explorer
43.4.7 Choosing the Right Filter
def choose_filter(signal_characteristics):
"""
Guide for selecting appropriate noise filter based on signal characteristics.
"""
recommendations = {
'steady_state_with_gaussian_noise': {
'filter': 'MovingAverage',
'reason': 'Averages out random noise effectively',
'window_size': 5 # Adjust based on noise frequency
},
'spiky_noise_impulse': {
'filter': 'MedianFilter',
'reason': 'Completely ignores outlier spikes',
'window_size': 5 # Odd number works best
},
'real_time_tracking': {
'filter': 'ExponentialSmoothing',
'reason': 'Cheap single-sample update with low, alpha-tunable lag',
'alpha': 0.3 # Lower = smoother, higher = more responsive
},
'sensor_fusion_known_dynamics': {
'filter': 'KalmanFilter',
'reason': 'Optimal estimation with uncertainty tracking',
'params': 'process_noise, measurement_noise'
},
'edge_preserving': {
'filter': 'MedianFilter',
'reason': 'Preserves sharp transitions in data',
'window_size': 3
}
}
return recommendations.get(signal_characteristics, recommendations['steady_state_with_gaussian_noise'])
43.4.8 Combining Filters
For robust noise removal, filters can be cascaded:
class CombinedFilter:
"""
Two-stage filter: Median first (remove spikes), then exponential smooth.
"""
def __init__(self, median_window=5, exp_alpha=0.3):
self.median_filter = MedianFilter(median_window)
self.exp_filter = ExponentialSmoothingFilter(exp_alpha)
def filter(self, value):
# Stage 1: Remove spikes with median
despike = self.median_filter.filter(value)
# Stage 2: Smooth remaining noise
smooth = self.exp_filter.filter(despike)
return smooth
# Usage
combined = CombinedFilter(median_window=5, exp_alpha=0.2)
for reading in sensor_stream:
clean_value = combined.filter(reading)
Worked Example: Calculating Optimal Filter Window for Temperature Sensor
Scenario: You have a temperature sensor monitoring a cold storage facility. The sensor reports every 10 seconds. You’ve observed occasional spikes due to electrical interference when the cooling compressor starts (5-10°C jumps), and you need to filter these without losing legitimate temperature trends.
Given:
- Sampling rate: 0.1 Hz (1 sample per 10 seconds)
- Normal temperature: -18°C ± 2°C
- Compressor noise: Random spikes to -8°C or -28°C (duration: 1-2 samples)
- Legitimate temperature changes: 0.5°C per minute maximum
Question: Should you use a moving average or median filter, and what window size?
Solution:
Step 1: Analyze the noise characteristics
- Spike duration: 1-2 samples = 10-20 seconds
- Spike frequency: Approximately 5% of readings (every 200 seconds when compressor cycles)
- Spike magnitude: 10°C deviation (huge compared to the normal 2°C variation)
Step 2: Calculate required window size
For moving average:
- Window needs to span the spike duration
- 3-sample window averages the spike into adjacent readings
- Example: [-18, -28, -18] → average = -21.3°C (still shows distortion)
For median filter:
- Window needs an odd number of samples
- 3-sample window: [-18, -28, -18] → median = -18°C (spike completely removed!)
- 5-sample window: [-18, -18, -28, -18, -18] → median = -18°C (still perfect)
Step 3: Verify edge preservation
Legitimate temperature change over 1 minute:
- Rate: 0.5°C/min = 0.083°C per 10 seconds
- Over 5 samples: [-18.0, -18.1, -18.2, -18.3, -18.4]
- Median of 5: -18.2°C (preserves trend!)
Step 4: Calculate latency
Window size 5 = (5-1)/2 = 2 samples delay = 20 seconds latency
For cold storage monitoring (not time-critical), 20 seconds is acceptable.
Answer: Use 5-sample median filter
Why:
- Completely removes 1-2 sample spikes
- Preserves legitimate temperature trends
- No tuning parameters (unlike moving average weights)
- Latency (20s) acceptable for this application
Implementation:
median_filter = MedianFilter(window_size=5)
for reading in sensor_stream:
clean_temp = median_filter.filter(reading)
if clean_temp > -15: # Warming past the safe threshold; check is reliable after filtering
trigger_high_temp_alarm()
Key Insight: Median filters excel when noise is sparse spikes rather than continuous Gaussian noise. The window should be large enough that spikes remain a minority (< 50%) of the values within it.
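The worked example can be verified numerically with a stand-alone 5-sample median filter (re-implemented here so the snippet runs on its own; the stream values are invented to match the scenario):

```python
from statistics import median

def median_filter(xs, n=5):
    """Streaming median filter over a sliding window of up to n samples."""
    buf, out = [], []
    for x in xs:
        buf.append(x)
        if len(buf) > n:
            buf.pop(0)
        out.append(median(buf))
    return out

# Cold-storage stream: steady -18 degC with compressor spikes
# (one 1-sample spike to -28, one 2-sample spike to -8)
stream = [-18, -18, -28, -18, -18, -18, -8, -8, -18, -18, -18]
clean = median_filter(stream)
print(clean)  # every output is -18: both spikes are suppressed
```

Even the 2-sample spike to -8°C is removed, because two spike values out of five are still a minority within the window.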
Decision Framework: Imputation Strategy Selection
When sensor data goes missing, selecting the correct imputation strategy depends on sensor characteristics and downstream requirements. Use this framework to guide your decision:
| Sensor Characteristic | Imputation Strategy | Rationale | Max Gap Duration |
|---|---|---|---|
| Slowly changing continuous (temperature, humidity) | Forward-fill or linear interpolation | Physical inertia prevents rapid changes | 10-30 minutes |
| Event-driven binary (motion detector, door switch) | Zero/False for missing periods | Absence of event signal = no event occurred | Unlimited (but flag gaps >1 hour) |
| Monotonic counter (energy meter, flow meter) | Zero increment for missing period | No reading = no consumption during gap | Up to 1 day |
| Periodic with known pattern (daily temperature cycle) | Seasonal decomposition | Leverage historical pattern | Up to 6 hours |
| High-frequency volatile (stock price, vibration) | Do not impute - mark as missing | Interpolation creates false data | N/A - preserve gaps |
| Redundant sensor array | Use nearby sensor + bias correction | Spatial correlation for better estimate | Depends on sensor density |
Decision Tree:
- Is the sensor event-driven?
- YES → Use zero/default state for missing periods
- NO → Continue to step 2
- Does the signal change slowly? (Rate < 10% per time constant)
- YES → Forward-fill acceptable for gaps < 10x sampling interval
- NO → Continue to step 3
- Is there a known periodic pattern?
- YES → Use seasonal decomposition fill
- NO → Continue to step 4
- Are there nearby sensors measuring the same quantity?
- YES → Use spatial interpolation from neighbors
- NO → Use linear interpolation or mark as missing
Example Application:
def select_imputation_strategy(sensor_type, gap_duration_minutes):
if sensor_type == "motion_detector":
return "zero_fill" # No motion during gap
elif sensor_type == "temperature":
if gap_duration_minutes < 30:
return "forward_fill"
elif gap_duration_minutes < 360:
return "seasonal_fill" # Use daily pattern
else:
return "mark_missing" # Gap too long
elif sensor_type == "energy_meter":
return "zero_increment" # No consumption
elif sensor_type == "vibration":
return "mark_missing" # Cannot safely interpolateWarning Signs of Wrong Strategy:
- Motion sensor shows continuous “detected” during power outage → Used forward-fill instead of zero
- Temperature shows impossible linear ramp over 6-hour gap → Used interpolation instead of seasonal pattern
- Energy meter shows zero consumption for entire day → Used zero_increment for too-long gap (should alarm)
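A lightweight sanity check can catch the first warning sign automatically: an event stream that was wrongly forward-filled tends to contain implausibly long constant runs. A sketch (function name and threshold hypothetical):

```python
def longest_constant_run(values):
    """Length of the longest run of identical consecutive values."""
    if not values:
        return 0
    best = run = 1
    for a, b in zip(values, values[1:]):
        run = run + 1 if b == a else 1
        best = max(best, run)
    return best

# A motion stream that was wrongly forward-filled during an outage
stream = [0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0]
if longest_constant_run(stream) > 5:  # threshold chosen for illustration
    print("suspiciously long constant run: check the imputation strategy")
```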
Common Mistake: Imputing Before Outlier Detection
The Mistake: Running imputation before outlier detection, causing outliers to be forward-filled or interpolated into the data stream, permanently corrupting adjacent readings.
Why It Happens: Data quality pipelines are often built incrementally. Engineers add imputation first (to handle missing data), then later realize they need outlier detection. By then, the pipeline order is established and changing it requires refactoring.
Example of the Problem:
# WRONG: Impute first, detect outliers second
readings = [22.1, 22.3, 99.9, None, None, None, 22.8] # 99.9 is sensor fault
# Step 1: Forward-fill (MISTAKE - happens before outlier removal)
imputed = forward_fill(readings)
# Result: [22.1, 22.3, 99.9, 99.9, 99.9, 99.9, 22.8]
# Step 2: Outlier detection
cleaned = remove_outliers(imputed, threshold=3) # 3-sigma rule
# Result: [22.1, 22.3, REMOVED, REMOVED, REMOVED, REMOVED, 22.8]
# Lost 4 data points! The None values became 99.9 outliers.
The Fix: Always apply outlier detection and validation BEFORE imputation:
# CORRECT: Validate first, impute second
readings = [22.1, 22.3, 99.9, None, None, None, 22.8]
# Step 1: Outlier detection and removal (mark as None)
validated = remove_outliers(readings, threshold=3) # 3-sigma rule
# Result: [22.1, 22.3, None, None, None, None, 22.8]
# Step 2: Forward-fill ONLY the validated stream
imputed = forward_fill(validated)
# Result: [22.1, 22.3, 22.3, 22.3, 22.3, 22.3, 22.8]
# Correctly preserved legitimate data!
Correct Pipeline Order:
- Range Validation → Mark out-of-range values as None
- Rate-of-Change Validation → Mark impossible jumps as None
- Outlier Detection → Mark statistical outliers as None
- Missing Value Imputation → Fill None values using appropriate strategy
- Noise Filtering → Apply smoothing to cleaned data
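The ordering above can be composed into a single pass. A minimal sketch covering stages 1, 2, and 4 with simplified checks (thresholds and function name hypothetical; the statistical-outlier and filtering stages are omitted for brevity):

```python
def clean_pipeline(readings, lo=-40.0, hi=60.0, max_jump=5.0):
    """Validate first, impute second. Returns (values, flags)."""
    # Stages 1-2: range and rate-of-change validation -> mark bad values None
    validated, prev = [], None
    for r in readings:
        bad = (
            r is None
            or not (lo <= r <= hi)
            or (prev is not None and abs(r - prev) > max_jump)
        )
        validated.append(None if bad else r)
        if not bad:
            prev = r
    # Stage 4: forward-fill ONLY the validated stream, flagging filled values
    filled, flags, last = [], [], None
    for v in validated:
        if v is not None:
            last = v
            filled.append(v); flags.append("original")
        else:
            filled.append(last); flags.append("imputed")
    return filled, flags

vals, flags = clean_pipeline([22.1, 22.3, 99.9, None, None, 22.8])
print(vals)  # the 99.9 fault is caught BEFORE imputation, so 22.3 is carried forward
```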
Real-World Impact: A temperature monitoring system in a pharmaceutical warehouse experienced this bug. A faulty sensor spiked to 85°C for one reading before going offline. The spike was forward-filled for 30 minutes (the gap duration), triggering false temperature excursion alarms and requiring destruction of $50,000 worth of temperature-sensitive drugs. Root cause: imputation ran before outlier removal in the data pipeline.
Prevention Checklist:
- Document the pipeline stage order (validate → detect outliers → impute → filter) and enforce it in code review
- Unit-test the pipeline with a stream containing both an outlier and an adjacent gap
- Flag every imputed value so downstream analytics can distinguish real from filled readings
- Alert rather than impute when a gap exceeds the sensor's maximum safe fill duration
43.5 Knowledge Check
Common Pitfalls
1. Forward-filling rapidly changing sensor signals
Forward-filling works for temperature that changes by 0.5°C per minute but creates flat-line artefacts for high-frequency vibration data. Match the imputation method to the signal dynamics.
2. Imputing without flagging filled values
If downstream analytics cannot distinguish real readings from imputed ones, anomaly detectors may flag imputed values as anomalies or ML models may learn from artefacts. Always add an imputation flag column alongside filled values.
3. Using mean imputation for time-series IoT data
Global mean imputation destroys temporal patterns (seasonality, trends) that are the most valuable features in IoT data. Use time-local methods (linear interpolation, seasonal decomposition imputation) instead.
4. Applying a moving average filter with too large a window
A 100-point moving average on a 1 Hz sensor introduces roughly 50 seconds of lag, which is unacceptable for real-time anomaly detection. Balance noise suppression against lag, or use an exponential moving average, whose lag can be tuned through its smoothing factor.
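Pitfall 2 above recommends carrying an imputation flag alongside each filled value. One way to do that (function name and tuple layout hypothetical):

```python
def forward_fill_with_flags(readings):
    """Forward-fill gaps, returning (value, was_imputed) pairs so downstream
    consumers can exclude filled values from anomaly detection or training."""
    out, last = [], None
    for r in readings:
        if r is not None:
            last = r
            out.append((r, False))          # real reading
        elif last is not None:
            out.append((last, True))        # filled value, explicitly flagged
        else:
            out.append((None, True))        # gap with no history yet
    return out

rows = forward_fill_with_flags([21.5, None, None, 21.8])
print(rows)  # [(21.5, False), (21.5, True), (21.5, True), (21.8, False)]
```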
43.6 Summary
Missing value imputation and noise filtering are essential for producing clean, complete sensor data:
- Forward Fill: Simple and effective for slowly-changing continuous values (temperature, humidity)
- Linear Interpolation: Better for trending data when you have values on both sides of the gap
- Seasonal Fill: Use when data has known periodic patterns (daily temperature cycles)
- Sensor-Specific Imputation: Motion sensors get zero, state sensors get “unknown”, counters get zero increment
- Moving Average: Good for steady-state Gaussian noise, but blurs edges
- Median Filter: Excellent for spike removal, preserves sharp transitions
- Exponential Smoothing: Constant-memory and real-time, with lag and responsiveness tunable via the smoothing factor α
Critical Design Principle: Always match your imputation and filtering strategy to the sensor type and data characteristics. A one-size-fits-all approach will produce incorrect results for at least some of your sensors.
See Also
Data Quality Pipeline:
- Data Quality Overview - Full pipeline architecture
- Data Validation - Validation before imputation
- Anomaly Detection - Finding patterns in clean data
Filtering Techniques:
- Kalman Filters - Optimal filtering with fusion
- Complementary Filters - IMU signal conditioning
- Digital Signal Processing - Filter design
Applications:
- Edge Compute Patterns - Real-time filtering at edge
- Time Series Fundamentals - Handling temporal patterns
- Sensor Calibration - Bias correction
43.7 What’s Next
| If you want to… | Read this |
|---|---|
| Understand data quality validation before imputation | Data Quality Validation |
| Apply preprocessing in the broader pipeline context | Data Quality and Preprocessing |
| Practise normalisation techniques in the lab | Data Quality Normalisation Lab |
| Apply clean data to anomaly detection | Anomaly Detection Overview |
| Return to the module overview | Big Data Overview |