4  Feature Engineering for ML

In 60 Seconds

Feature engineering transforms raw sensor data into discriminative features that ML models can learn from effectively. Physics-based features like acceleration variance and MFCC outperform raw data, and a simple decision tree with well-engineered features can outperform a deep neural network fed raw sensor readings. Reducing 36 features to 5 through importance analysis and correlation pruning often costs less than 4% accuracy while making models 6x faster.

4.1 Learning Objectives

By the end of this chapter, you will be able to:

  • Design Discriminative Features: Create features that separate classes effectively
  • Apply Domain Knowledge: Use physics and domain expertise to engineer powerful features
  • Perform Feature Selection: Identify and remove redundant or uninformative features
  • Optimize for Edge Deployment: Balance feature quality with computational cost

Key Concepts

  • Feature engineering: The process of transforming raw sensor time series into informative numerical representations (features) that capture the patterns relevant to the prediction or classification task.
  • Time-domain features: Statistical descriptors computed directly on raw sensor values in the time domain: mean, standard deviation, skewness, kurtosis, peak-to-peak amplitude, zero-crossing rate.
  • Frequency-domain features: Descriptors of a signal’s frequency content computed via FFT: dominant frequency, spectral centroid, spectral bandwidth, spectral entropy — particularly valuable for vibration and audio signals.
  • Feature selection: The process of identifying which features in a large feature set are most informative for the task and removing redundant or irrelevant features that add noise without adding predictive power.
  • Feature importance: A metric (from tree-based models or permutation testing) quantifying how much each feature contributes to model accuracy, guiding which features to retain and which to discard.
  • Feature window: A fixed-length or event-triggered segment of the sensor time series from which a feature vector is extracted, balancing temporal resolution against feature stability.

Feature engineering is the craft of turning raw sensor numbers into meaningful inputs for machine learning. Think of it as preparing ingredients before cooking – raw data is the unpeeled potato, and features are the neatly chopped pieces ready for the recipe. Good features often matter more than fancy algorithms for getting accurate predictions.

4.2 Prerequisites

Chapter Series: Modeling and Inferencing

This is part 6 of the IoT Machine Learning series:

  1. ML Fundamentals - Core concepts
  2. Mobile Sensing - HAR, transportation
  3. IoT ML Pipeline - 7-step pipeline
  4. Edge ML & Deployment - TinyML
  5. Audio Feature Processing - MFCC
  6. Feature Engineering (this chapter) - Feature design and selection
  7. Production ML - Monitoring

4.3 What Makes a Good Feature?

Feature engineering is often more impactful than algorithm selection. A simple Decision Tree with well-engineered features can outperform a deep neural network with raw sensor data.

4.3.1 How It Works: Feature Engineering Process

Feature engineering transforms raw sensor streams into discriminative inputs through a systematic pipeline:

  1. Raw Data Collection: Accelerometer samples arrive at 50 Hz (50 readings/second) as continuous streams of X/Y/Z values
  2. Windowing: Segment continuous data into overlapping windows (e.g., 2 seconds = 100 samples per window)
  3. Statistical Feature Extraction: Calculate time-domain statistics (mean, std, min, max) that capture motion intensity
  4. Frequency Feature Extraction: Apply FFT to detect periodic patterns (e.g., walking has ~2Hz cadence, running ~3-4Hz)
  5. Domain-Specific Features: Add physics-based features like zero-crossings (direction changes) and signal magnitude area (total energy)
  6. Feature Selection: Prune redundant features using correlation analysis and importance ranking

Example Pipeline: Raw accelerometer (100 samples) → 12 engineered features → Random Forest classifier → Activity label (walk/run/sit)

The key insight is that well-designed features capture what human experts know about the domain—a running pattern has higher variance and faster frequency than walking—making them far more powerful than raw numbers.
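Steps 1-2 above (collection and windowing) reduce to a few lines of NumPy. The `sliding_windows` helper below is an illustrative sketch, not the book's library:

```python
import numpy as np

def sliding_windows(signal, window_size=100, overlap=0.5):
    """Segment a continuous stream into overlapping windows.

    signal: (N, 3) array of X/Y/Z samples; window_size in samples
    (100 samples = 2 s at 50 Hz); overlap as a fraction of the window.
    """
    step = int(window_size * (1 - overlap))
    windows = []
    for start in range(0, len(signal) - window_size + 1, step):
        windows.append(signal[start:start + window_size])
    return np.stack(windows)

# 10 s of 50 Hz data -> 500 samples; 2 s windows with 50% overlap
stream = np.random.randn(500, 3)
w = sliding_windows(stream)
print(w.shape)  # (9, 100, 3): 9 overlapping 2-second windows
```

Each of the 9 windows then becomes one feature vector for steps 3-6, so the classifier sees a fresh prediction every second rather than every two seconds.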

4.3.2 Good vs Bad Features: Visual Comparison

Figure 4.1: Good versus Bad ML Features for Activity Classification. Good features show high inter-class variance, low intra-class variance, noise robustness, and low compute cost; bad features show overlapping class distributions, noise sensitivity, high compute cost, and correlation with other features.

4.3.3 Good Feature Characteristics

High-Quality IoT Features
  1. High inter-class variance (separates classes well)
    • Walking vs Running: Mean acceleration differs by 2-5 m/s²
  2. Low intra-class variance (consistent within class)
    • Walking always ~2.5 m/s², regardless of user height/weight
  3. Robust to noise and sensor drift
    • Accelerometer calibration errors (±5%) don’t flip predictions
  4. Computationally cheap
    • Mean calculation: O(n), minimal CPU/battery impact
  5. Interpretable (helps debugging)
    • “Variance increased 3×” → clearly indicates running vs walking

4.3.4 Bad Feature Characteristics

Features to Avoid
  1. Overlapping class distributions
    • Temperature (20-25 °C) same for walking/running → 50% accuracy
  2. High sensitivity to noise
    • Instantaneous accelerometer sample: ±2 m/s² jitter
  3. Expensive to compute
    • Full FFT on 1024-sample window: ~5ms on Cortex-M4 (still costly vs mean at <0.1ms)
  4. Correlated with other features (redundant)
    • Mean X + Mean Y + Mean Z vs Magnitude: Magnitude captures all
  5. Not domain-relevant
    • Battery level for activity recognition: random performance

Try It: Feature Distribution Overlap Visualizer

Adjust the mean and spread of two activity classes to see how feature overlap affects classification accuracy. A good feature separates classes clearly with minimal overlap.

4.4 Sensor-Specific Feature Engineering

Different sensors require different strategies:

| Sensor Type | Good Features | Bad Features | Why Good Works |
|---|---|---|---|
| Accelerometer | Mean magnitude, Variance, Peak count | Raw samples, Individual axes | Noise reduction, orientation-independent |
| Temperature | Rate of change, Slope | Absolute value | Location-independent, detects events |
| Audio | MFCCs, Spectral energy | Raw waveform | 1000× compression, noise robust |
| Current Sensor | RMS, Peak, Crest factor | Instantaneous reading | Load identification, filters transients |
| GPS | Speed, Heading change rate | Latitude, Longitude | Context-independent, captures motion |
| Gyroscope | Angular velocity variance | Raw rotation matrix | Captures turning behavior |
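As a concrete instance of the current-sensor row, RMS and crest factor are one-liners in NumPy; the 50 Hz sine test signal is an illustrative assumption:

```python
import numpy as np

def current_features(samples):
    """RMS and crest factor for one window of current-sensor samples."""
    rms = np.sqrt(np.mean(np.square(samples)))
    peak = np.max(np.abs(samples))
    crest_factor = peak / rms  # ~1.414 for a pure sine; higher for spiky loads
    return rms, peak, crest_factor

# A clean 5 A, 50 Hz sine sampled for one second
t = np.linspace(0, 1, 1000, endpoint=False)
i = 5.0 * np.sin(2 * np.pi * 50 * t)
rms, peak, cf = current_features(i)
print(round(rms, 2), round(cf, 3))  # 3.54 1.414
```

A motor with bearing wear draws current in spikes, so its crest factor rises well above the 1.414 of a clean sinusoidal load; that is why crest factor identifies loads while instantaneous samples do not.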

4.5 HAR Feature Engineering Code

import numpy as np

def extract_har_features(accel_window):
    """
    Extract 12 efficient features for activity recognition
    Input: accel_window (N, 3) - N samples of [X, Y, Z] acceleration
    Output: 12-element feature vector
    """
    # Compute magnitude (orientation-independent)
    mag = np.linalg.norm(accel_window, axis=1)

    # Time-domain features (cheap to compute)
    features = [
        np.mean(mag),              # Mean magnitude
        np.std(mag),               # Std deviation (separates sit/walk/run)
        np.min(mag),               # Minimum (detects stationary)
        np.max(mag),               # Maximum (detects impacts)
        np.max(mag) - np.min(mag), # Range (motion intensity)

        # Zero crossings (periodicity indicator)
        np.sum(np.diff(np.sign(mag - np.mean(mag))) != 0),

        # Signal Magnitude Area (total energy)
        np.sum(np.abs(accel_window[:, 0])) +
        np.sum(np.abs(accel_window[:, 1])) +
        np.sum(np.abs(accel_window[:, 2])),

        # Vertical component (stairs detection)
        np.std(accel_window[:, 2]),  # Z-axis std deviation

        # Frequency estimation (no FFT needed)
        estimate_dominant_frequency(mag),

        # Statistical moments
        np.percentile(mag, 75) - np.percentile(mag, 25),  # IQR
        np.sum((mag - np.mean(mag)) ** 3) / (len(mag) * np.std(mag)**3 + 1e-10),  # Skewness
        np.sum((mag - np.mean(mag)) ** 4) / (len(mag) * np.std(mag)**4 + 1e-10),  # Kurtosis
    ]

    return np.array(features)

def estimate_dominant_frequency(signal, sample_rate=50):
    """Estimate frequency without FFT (autocorrelation method)"""
    autocorr = np.correlate(signal - np.mean(signal),
                            signal - np.mean(signal), mode='full')
    autocorr = autocorr[len(autocorr)//2:]

    # Find first peak after lag 0
    peaks = (autocorr[1:-1] > autocorr[:-2]) & (autocorr[1:-1] > autocorr[2:])
    if np.any(peaks):
        period = np.argmax(peaks) + 1
        return sample_rate / period  # Hz
    return 0

4.5.1 Computational Cost Comparison

| Feature Approach | Features | Time | Accuracy | Model Size |
|---|---|---|---|---|
| Raw samples | 100 | <1ms | 65% | 200 KB (CNN) |
| Time-domain only | 6 | 2ms | 82% | 15 KB |
| Time + freq | 12 | 8ms | 90% | 25 KB |
| Full FFT + MFCCs | 39 | 45ms | 92% | 80 KB |

Sweet spot: 12 time+freq features balance accuracy (90%) with speed (8ms).

Try It: Accelerometer Feature Extraction Simulator

Simulate different activity patterns and watch how time-domain features are computed from raw accelerometer magnitude. Select an activity to see how mean, variance, and zero crossings differ across walking, running, and sitting.

4.5.2 Feature Count vs Performance Tradeoff

Explore how the number of features affects accuracy, model size, and inference time for edge deployment. Adjust the constraints to find your optimal configuration.

Quantifying Feature Quality: Inter-Class Variance vs Intra-Class Variance

The best features maximize separation between classes while minimizing variation within classes. This is captured by the Fisher Score:

Fisher Score Formula: \[ F(x_i) = \frac{\sum_{c=1}^{C} n_c (\mu_c^i - \mu^i)^2}{\sum_{c=1}^{C} n_c (\sigma_c^i)^2} \]

Where:

  • \(\mu_c^i\) = mean of feature \(i\) for class \(c\)
  • \(\mu^i\) = global mean of feature \(i\)
  • \(\sigma_c^i\) = standard deviation of feature \(i\) for class \(c\)
  • \(n_c\) = number of samples in class \(c\)

Example: Mean Acceleration Magnitude for Walking vs Running

Walking Class: \[ \mu_{\text{walk}} = 2.8 \text{ m/s}^2, \quad \sigma_{\text{walk}} = 0.4 \text{ m/s}^2, \quad n_{\text{walk}} = 1000 \]

Running Class: \[ \mu_{\text{run}} = 7.2 \text{ m/s}^2, \quad \sigma_{\text{run}} = 1.1 \text{ m/s}^2, \quad n_{\text{run}} = 1000 \]

Global Mean: \[ \mu = \frac{1000 \times 2.8 + 1000 \times 7.2}{2000} = 5.0 \text{ m/s}^2 \]

Fisher Score: \[ F = \frac{1000(2.8 - 5.0)^2 + 1000(7.2 - 5.0)^2}{1000(0.4)^2 + 1000(1.1)^2} \] \[ = \frac{1000(4.84 + 4.84)}{1000(0.16 + 1.21)} = \frac{9680}{1370} = 7.07 \]

Interpretation: High Fisher Score (7.07) means mean acceleration magnitude is an excellent feature—classes are 7× more separated between each other than they vary within themselves.

Compare to a BAD feature (device battery level): \[ F_{\text{battery}} = \frac{1000(82 - 80)^2 + 1000(78 - 80)^2}{1000(15)^2 + 1000(18)^2} = \frac{8000}{549000} = 0.015 \]

Battery level has Fisher Score of 0.015 (approximately 470x worse)—no predictive power for activity classification.
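The arithmetic above is easy to check in code. This `fisher_score` helper is an illustrative sketch that takes per-class means, standard deviations, and counts:

```python
import numpy as np

def fisher_score(means, stds, counts):
    """Fisher score for one feature, given per-class statistics."""
    means, stds, counts = map(np.asarray, (means, stds, counts))
    global_mean = np.sum(counts * means) / np.sum(counts)
    between = np.sum(counts * (means - global_mean) ** 2)  # inter-class spread
    within = np.sum(counts * stds ** 2)                    # intra-class spread
    return between / within

# Walking vs running (good feature) and battery level (bad feature)
f_good = fisher_score([2.8, 7.2], [0.4, 1.1], [1000, 1000])
f_bad = fisher_score([82, 78], [15, 18], [1000, 1000])
print(round(f_good, 2), round(f_bad, 3))  # 7.07 0.015
```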

4.5.3 Interactive Fisher Score Calculator

Use this calculator to evaluate feature quality for your own IoT classification tasks. Adjust the class parameters to see how the Fisher Score changes.

4.6 Feature Selection Process

Step 1: Start with statistical features (always compute first)

  • Mean, Standard deviation, Min, Max, Range
  • Cost: O(n) single pass
  • Baseline: 70-80% accuracy

Step 2: Add domain-specific features (if accuracy < 85%)

  • Accelerometer: Zero crossings, Peak count
  • Audio: MFCCs, Spectral energy
  • Temperature: Rate of change, Slope
  • Accuracy boost: +10-15%

Step 3: Consider frequency domain (only if needed)

  • FFT dominant frequency, Spectral entropy
  • When: Periodic signals (walking, rotating machinery)
  • Cost: ~1-5ms for 128-1024 sample FFT on Cortex-M4
  • Skip if: Non-periodic data or tight latency budget

Step 4: Correlation analysis (remove redundancy)

  • Drop features with r > 0.9 correlation
  • Redundant features waste compute and model capacity
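The correlation check in Step 4 can be sketched with pandas: a greedy pass that keeps a feature only if it is not strongly correlated with anything already kept. The 0.9 threshold and the synthetic DataFrame are illustrative assumptions:

```python
import numpy as np
import pandas as pd

def prune_correlated(df, threshold=0.9):
    """Greedily keep features whose |r| with all kept features is <= threshold."""
    corr = df.corr().abs()
    keep = []
    for col in corr.columns:
        if all(corr.loc[col, k] <= threshold for k in keep):
            keep.append(col)
    return keep

# Synthetic example: mean_y is nearly a copy of mean_x
rng = np.random.default_rng(0)
x = rng.normal(size=500)
df = pd.DataFrame({
    'mean_x': x,
    'mean_y': x + rng.normal(scale=0.05, size=500),  # r ~ 0.999 with mean_x
    'zero_crossings': rng.normal(size=500),           # independent feature
})
print(prune_correlated(df))  # ['mean_x', 'zero_crossings']
```

Because the pass keeps the first column of each correlated pair, ordering the DataFrame's columns from most to least interpretable decides which survivor you get.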

Step 5: Test on held-out data

  • 80/20 split by users (not time!)
  • Ensure test users not in training set

Try It: Correlation Pruning Simulator

Explore how removing highly correlated features reduces redundancy without significant accuracy loss. Adjust the correlation threshold to see which features survive pruning and the resulting impact on model efficiency.

4.7 Common Feature Engineering Mistakes

Mistakes to Avoid

Mistake 1: Using absolute values instead of relative

  • Bad: Absolute temperature (22 °C) → Location-dependent
  • Good: Temperature rate of change (2 °C/min) → Detects events

Mistake 2: Including metadata as features

  • Bad: Device ID, Battery %, Wi-Fi SSID → Not causal
  • Good: Motion variance, Audio energy → Physics-based

Mistake 3: Computing expensive features when cheap ones work

  • Bad: Full FFT (~5ms) for non-periodic data when simpler features suffice
  • Good: Variance + Zero crossings (2ms)

Mistake 4: Ignoring correlation between features

  • Bad: Mean X, Mean Y, Mean Z (correlated) → 3 redundant features
  • Good: Magnitude sqrt(x²+y²+z²) → 1 orientation-independent feature

Mistake 5: Training and testing on same user’s data

  • Bad: User A: 80% train, 20% test → 95% accuracy (overfitting)
  • Good: Users A+B: train, User C: test → 85% accuracy (generalizes)
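The user-based split in Mistake 5 is one line with scikit-learn's GroupShuffleSplit; the synthetic windows and user IDs below are assumptions:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# 6 users, 50 feature windows each; groups[i] = user who produced window i
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 12))
y = rng.integers(0, 3, size=300)
groups = np.repeat(np.arange(6), 50)

# Hold out ~1/3 of USERS, not 1/3 of windows
splitter = GroupShuffleSplit(n_splits=1, test_size=0.33, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups))

train_users = set(groups[train_idx])
test_users = set(groups[test_idx])
print(train_users & test_users)  # set(): no user appears in both splits
```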

4.8 Worked Example: Smart Agriculture Soil Monitoring

Scenario: Agricultural IoT with 12 sensors per station, ESP32 edge device (320KB RAM), <50ms inference budget.

Initial: 36 features → 2.8 MB model → 180ms inference (too slow!)

Feature Selection Process:

| Step | Features | Accuracy | Model Size | Inference |
|---|---|---|---|---|
| All 36 features | 36 | 91.2% | 2.8 MB | 180ms |
| Top 15 by importance | 15 | 90.8% | 1.1 MB | 95ms |
| Remove correlated | 8 | 89.7% | 420 KB | 52ms |
| Top 6 uncorrelated | 6 | 88.9% | 180 KB | 38ms |
| Final (top 5) | 5 | 87.2% | 95 KB | 28ms |

Final Feature Set:

  1. soil_moisture_30cm (primary indicator)
  2. soil_temp_15cm (evaporation driver)
  3. moisture_rate_24h (trend)
  4. humidity (atmospheric demand)
  5. evapotranspiration (physics-based derived)

Key Insight: Top 5 features captured 76% of predictive power. Correlation analysis essential—soil moisture at 15cm and 30cm were r=0.89 correlated, so keeping both wastes capacity.

4.9 Feature Importance Analysis

from sklearn.ensemble import RandomForestClassifier
import numpy as np

# Train model
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)

# Get feature importances
importances = model.feature_importances_
feature_names = ['mean', 'std', 'min', 'max', 'range', 'zero_crossings',
                 'sma', 'z_std', 'freq', 'iqr', 'skew', 'kurtosis']

# Sort by importance
indices = np.argsort(importances)[::-1]

print("Feature ranking:")
for i in range(len(feature_names)):
    print(f"{i+1}. {feature_names[indices[i]]}: {importances[indices[i]]:.3f}")

# Output example:
# 1. std: 0.342        <- Variance most discriminative
# 2. freq: 0.251       <- Frequency second
# 3. sma: 0.158        <- Signal magnitude area
# 4. mean: 0.089       <- Mean contributes but less
# ... (remaining < 5% each)

# Drop features with < 2% importance
important_features = [f for i, f in enumerate(feature_names)
                      if importances[i] > 0.02]
# Result: 12 features -> 7 features, 90% -> 89% accuracy, 40% faster

4.10 Knowledge Check

Scenario: A manufacturing plant monitors motor health using a 3-axis accelerometer sampling at 1 kHz. Initial feature extraction produces 48 features per 2-second window: mean, std, min, max, range, RMS, peak-to-peak, kurtosis, skewness, spectral centroid, spectral rolloff, and zero-crossings for each axis (X/Y/Z), plus FFT dominant frequencies.

Challenge: 48 features × 500 Hz effective rate × 2 bytes/value = 48 KB/sec data stream. ESP32 (520 KB RAM) can only store about 11 seconds of history before overflow. Model size: 2.1 MB (won’t fit). Inference time: 180 ms (too slow for real-time anomaly detection).

Step-by-Step Solution:

Step 1: Correlation Analysis

import numpy as np
import pandas as pd

# Correlation matrix reveals redundancies
correlation_matrix = df[feature_names].corr()

# Identify highly correlated pairs (r > 0.9)
correlated_pairs = []
for i in range(len(correlation_matrix.columns)):
    for j in range(i+1, len(correlation_matrix.columns)):
        if abs(correlation_matrix.iloc[i, j]) > 0.9:
            correlated_pairs.append((
                correlation_matrix.columns[i],
                correlation_matrix.columns[j],
                correlation_matrix.iloc[i, j]
            ))

# Results: Mean_X, Mean_Y, Mean_Z all r>0.92 with Magnitude_Mean
# Solution: Keep only Magnitude_Mean (orientation-independent)
# Reduction: 48 → 40 features

Step 2: Feature Importance Ranking

from sklearn.ensemble import RandomForestClassifier

# Train model on all 40 remaining features
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)  # 3 classes: Normal, Bearing Wear, Imbalance

# Extract importance scores
importances = pd.DataFrame({
    'feature': feature_names,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)

# Top 10 features capture 89% of total importance:
# 1. RMS_Magnitude (0.18)
# 2. Spectral_Centroid_X (0.15)
# 3. Kurtosis_Y (0.12)
# 4. FFT_Peak_1_Hz (0.11)
# 5. Std_Magnitude (0.09)
# 6. Zero_Crossings_Z (0.08)
# 7. Spectral_Rolloff_Y (0.07)
# 8. Peak_to_Peak_Magnitude (0.05)
# 9. Range_X (0.03)
# 10. Skewness_Z (0.01)

Step 3: Incremental Testing

# Test accuracy vs feature count
results = []
for n in [5, 8, 10, 15, 20]:
    top_n_features = importances.head(n)['feature'].values
    X_subset = X_train[top_n_features]

    model_subset = RandomForestClassifier(n_estimators=50)
    model_subset.fit(X_subset, y_train)

    accuracy = model_subset.score(X_test[top_n_features], y_test)
    model_size = estimate_model_size(model_subset)  # project-specific helper (not shown)
    inference_time = measure_inference_time(model_subset, X_test.iloc[0])  # project-specific helper (not shown)

    results.append((n, accuracy, model_size, inference_time))

# Results:
# 5 features: 87.2% accuracy, 95 KB, 28 ms
# 8 features: 91.5% accuracy, 180 KB, 52 ms
# 10 features: 92.8% accuracy, 280 KB, 68 ms
# 15 features: 93.1% accuracy, 520 KB, 95 ms
# 20 features: 93.4% accuracy, 890 KB, 145 ms

Step 4: Select Optimal Configuration

| Criteria | Target | 5 Features | 8 Features | 10 Features |
|---|---|---|---|---|
| Accuracy | >90% | 87.2% ❌ | 91.5% ✓ | 92.8% ✓ |
| Model Size | <200 KB | 95 KB ✓ | 180 KB ✓ | 280 KB ❌ |
| Inference | <100 ms | 28 ms ✓ | 52 ms ✓ | 68 ms ✓ |

Winner: 8 features – Achieves 91.5% accuracy while meeting all memory and latency constraints.

Final Feature Set:

  1. RMS_Magnitude: Overall vibration energy
  2. Spectral_Centroid_X: Frequency distribution center
  3. Kurtosis_Y: Spikiness (detects bearing wear)
  4. FFT_Peak_1_Hz: Dominant frequency (detects imbalance)
  5. Std_Magnitude: Variation intensity
  6. Zero_Crossings_Z: Axial oscillation frequency
  7. Spectral_Rolloff_Y: High-frequency content
  8. Peak_to_Peak_Magnitude: Maximum swing amplitude

Results:

  • Data rate: 48 KB/sec → 8 KB/sec (6× reduction)
  • Model size: 2.1 MB → 180 KB (12× reduction)
  • Inference: 180 ms → 52 ms (3.5× faster)
  • Accuracy: 94.2% → 91.5% (2.7% acceptable loss)
  • Memory buffer: ~11 sec → 65 sec history

Key Insight: The sweet spot between accuracy and efficiency often exists around 8-12 well-chosen features. Beyond that, diminishing returns kick in—each additional feature costs RAM and compute but adds <0.5% accuracy.

Use this framework to systematically select features for your IoT ML application:

| Stage | Question | If YES → | If NO → |
|---|---|---|---|
| 1. Start | Do I have labeled data for this application? | Proceed to Stage 2 | Collect labeled data first |
| 2. Baseline | Is the data time-series (sequential)? | Start with time-domain features (mean, std, min, max) | Use static features appropriate to data type |
| 3. Periodicity | Does the signal have periodic patterns (vibration, walking, heartbeat)? | Add frequency-domain features (FFT peaks, spectral energy) | Skip frequency features; focus on statistical |
| 4. Compute Budget | Is inference time <50ms critical? | Limit to 5-10 cheap features (mean, std, range) | Can use 15-25 features including FFT |
| 5. Memory Budget | RAM <100 KB? | Use int8 quantization, <10 features | Can store 20-40 features + larger models |
| 6. Domain Knowledge | Do I understand the physics? | Add domain-specific features (bearing frequencies, gait parameters) | Rely on generic statistical features |
| 7. Correlation Check | Are any features correlated >0.9? | Drop redundant features; keep most interpretable | Proceed to Stage 8 |
| 8. Importance Ranking | Do I have feature importance scores? | Keep top N features capturing >85% cumulative importance | Train Random Forest to get importance scores |
| 9. Validation | Does accuracy drop <5% with reduced features? | Deploy optimized feature set | Add features back until acceptable accuracy |

Example Decision Path: Smart Agriculture Soil Moisture Prediction

  • Q1: Labeled data available? → YES (6 months historical data)
  • Q2: Time-series? → YES (measurements every 15 min)
  • Q3: Periodic patterns? → NO (slow environmental trends, not oscillations) → Skip FFT features
  • Q4: Inference time critical? → NO (predictions every hour on gateway)
  • Q5: Memory constraint? → YES (ESP32 edge deployment) → Limit to 8-12 features
  • Q6: Domain knowledge? → YES (evapotranspiration physics) → Add rate-of-change features, ET calculation, humidity deficit
  • Q7: Correlation check? → YES (soil moisture at 15cm and 30cm depth: r=0.89) → Drop 15cm depth, keep 30cm (more stable)
  • Q8: Feature importance? → YES (trained Random Forest) → Top 5 features: soil_moisture_30cm, soil_temp_15cm, humidity, ET, moisture_rate_24h
  • Q9: Accuracy check? → 91.2% (all 36 features) vs 87.2% (top 5) = 4% drop → Acceptable! Deploy with 5 features

Feature Selection Cheat Sheet:

| Sensor Type | Always Include | Consider Adding | Usually Skip |
|---|---|---|---|
| Accelerometer | Mean magnitude, Std magnitude | FFT peaks, Zero crossings | Individual axis means |
| Temperature | Current value, Rate of change | 24-hour trend, Daily min/max | Absolute historical values |
| Audio | MFCCs (1-13), RMS energy | Spectral centroid, Zero-crossing rate | Raw waveform samples |
| Current | RMS, Mean, Std | Crest factor, THD | Instantaneous samples |
| GPS | Speed, Heading change rate | Distance from waypoint | Raw lat/lon coordinates |

When to Re-evaluate Features:

  • Every 3 months: Check for feature drift (distributions changing)
  • After accuracy drop >5%: Add features or retrain
  • When new sensor added: Correlation check with existing features
  • Before deployment: Verify inference time on target hardware

Common Mistake: Using Absolute Values Instead of Relative Features

The Mistake: An environmental monitoring system uses absolute temperature values (22.3°C, 22.7°C, 23.1°C) as features for anomaly detection. The model achieves 95% accuracy in the lab but fails completely when deployed to different geographic regions where baseline temperatures differ by 10-20°C.

Why It Happens:

  • Lab testing occurs in controlled environment (single location)
  • Absolute values work when all training and test data come from same baseline
  • Developers don’t anticipate deployment to different thermal environments
  • Model overfits to specific temperature ranges in training data

Real-World Impact:

# Lab training data (California, baseline 20-25°C)
# Model learns: temp < 18°C = "cold anomaly"
#              temp > 28°C = "hot anomaly"

# Deployment in Alaska (baseline 0-5°C)
# Normal 2°C reading → Flagged as "cold anomaly" ❌
# Model generates 1,000+ false alerts per day
# Operators disable anomaly detection → System useless

The Fix: Use relative features instead of absolute values

Bad Approach:

# Absolute temperature features
features = {
    'temp_current': 22.3,      # Meaningless without context
    'temp_avg_1h': 22.1,       # Location-dependent
    'temp_max_24h': 25.8       # Seasonal variation
}

Good Approach:

# Relative and rate-based features
features = {
    # Rate of change (location-independent)
    'temp_rate_1h': +0.5,              # Degrees per hour
    'temp_rate_6h': +2.1,              # Trend direction

    # Deviation from local baseline (adaptive)
    'temp_deviation_24h_avg': -1.2,   # Current vs 24h average
    'temp_deviation_7d_avg': +0.8,    # Current vs weekly average

    # Percentile within historical distribution (normalized)
    'temp_percentile_30d': 65,        # 65th percentile of last 30 days

    # Time-of-day normalization
    'temp_vs_expected_hour': -0.3,    # Deviation from typical 10 AM temp
}

# These features work in ANY location because they are self-referential

Validation Example:

# Test model on diverse deployments
locations = [
    ('California', 22),   # (name, baseline temperature in °C)
    ('Alaska', 2),
    ('Arizona', 32),
    ('Norway', -5),
]

# Absolute features model
for loc, baseline in locations:
    test_data_absolute = generate_data(baseline, absolute_features=True)
    accuracy_absolute = model_absolute.score(test_data_absolute, labels)
    print(f"{loc}: {accuracy_absolute:.1%}")

# Results:
# California: 95.2% (training location)
# Alaska: 12.3% (completely fails)
# Arizona: 18.7% (mostly false positives)
# Norway: 8.1% (unusable)

# Relative features model
for loc, baseline in locations:
    test_data_relative = generate_data(baseline, relative_features=True)
    accuracy_relative = model_relative.score(test_data_relative, labels)
    print(f"{loc}: {accuracy_relative:.1%}")

# Results:
# California: 93.8%
# Alaska: 91.2%
# Arizona: 92.5%
# Norway: 90.7%

Other Common Absolute→Relative Transformations:

| Bad (Absolute) | Good (Relative) | Why Better |
|---|---|---|
| GPS latitude/longitude | Distance from home/waypoint, Speed, Heading change | Location-independent, privacy-preserving |
| Light sensor lux value | % change from baseline, Rate of darkening | Works indoors/outdoors, day/night |
| Audio volume (dB) | Volume change rate, Ratio to ambient | Microphone-independent, environment-adaptive |
| Pressure (Pa) | Pressure derivative (altitude change) | Works at any elevation |
| Battery voltage (V) | % remaining, Discharge rate | Device/battery-agnostic |
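A minimal sketch of the absolute-to-relative transformation for temperature, assuming one reading every 15 minutes; the helper name and window sizes are illustrative:

```python
import numpy as np

def relative_temp_features(temps, samples_per_hour=4):
    """Rate of change and deviation from a rolling baseline.

    temps: 1-D array of absolute readings (one every 15 min here).
    The returned features are self-referential, so they transfer
    across deployment locations with different baselines.
    """
    current = temps[-1]
    one_hour_ago = temps[-1 - samples_per_hour]
    day = 24 * samples_per_hour
    return {
        'temp_rate_1h': current - one_hour_ago,               # deg per hour
        'temp_deviation_24h_avg': current - np.mean(temps[-day:]),
    }

# Same +2 deg warming trend at two different baselines (California vs Alaska)
trend = np.linspace(0, 2, 96)  # 24 h of 15-minute samples
feats_ca = relative_temp_features(22 + trend)
feats_ak = relative_temp_features(2 + trend)
same = all(np.isclose(feats_ca[k], feats_ak[k]) for k in feats_ca)
print(same)  # True: relative features ignore the baseline
```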

Checklist to Avoid This Mistake:

If you answer “no” to any question, convert to relative features before training your model.

Try It: Absolute vs Relative Feature Comparison

See why absolute sensor values fail across deployment locations while relative features generalize. Select a deployment scenario and observe how absolute features produce false alerts, while relative features remain stable.

4.11 Concept Relationships

Feature engineering connects to several IoT ML topics:

  • ML Fundamentals establishes why features matter more than raw data for model learning
  • Mobile Sensing provides real-world context for activity recognition features
  • IoT ML Pipeline shows where feature engineering fits in the 7-step workflow (Step 3)
  • Edge ML & Deployment demonstrates how feature selection enables models to run on resource-constrained devices
  • Production ML covers monitoring feature drift when deployed models degrade over time

The bidirectional relationship is critical: good features enable simpler models (Random Forest vs deep learning), and deployment constraints (10ms inference budget) force aggressive feature selection.

4.12 See Also

Related Chapters:

  • ML Fundamentals (modeling-ml-fundamentals.html)
  • Audio Feature Processing (modeling-audio-features.html)
  • Edge ML and TinyML Deployment (modeling-edge-deployment.html)
  • Production ML (modeling-production.html)

External Resources:

  • Scikit-learn Feature Selection Guide: sklearn feature selection
  • “Feature Engineering for Machine Learning” by Alice Zheng & Amanda Casari (O’Reilly, 2018)
  • TinyML Feature Engineering Best Practices: tinyml.org

4.13 Try It Yourself

Hands-On Challenge: Build a step counter using accelerometer feature engineering

Task: Implement a simple step detection algorithm using only time-domain features (no FFT required):

  1. Collect 10 seconds of accelerometer data while walking (use your phone or simulated data)
  2. Calculate these 5 features per 1-second window:
    • Magnitude mean: Average of sqrt(x² + y² + z²)
    • Magnitude variance: Variability indicates motion intensity
    • Peak count: Number of local maxima (each step creates a peak)
    • Zero crossings: Direction changes per second
    • Signal magnitude area: Total energy across all axes
  3. Threshold the peak count feature: if peaks > 1.5 per second → walking, else stationary
  4. Compare your feature-based approach to a naive “just count when magnitude > threshold” method
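The peak-count logic from steps 2-3 can be sketched as follows; the synthetic walking/sitting signals and the 0.5 m/s² peak margin are illustrative assumptions:

```python
import numpy as np

def count_peaks(mag, margin=0.5):
    """Count local maxima rising at least `margin` m/s^2 above the window mean."""
    thresh = np.mean(mag) + margin
    interior = mag[1:-1]
    peaks = (interior > mag[:-2]) & (interior > mag[2:]) & (interior > thresh)
    return int(np.sum(peaks))

def classify_window(mag, sample_rate=50):
    """Walking if more than 1.5 peaks per second, else stationary."""
    peaks_per_sec = count_peaks(mag) / (len(mag) / sample_rate)
    return 'walking' if peaks_per_sec > 1.5 else 'stationary'

t = np.arange(0, 1, 1 / 50)                      # one 1-second window at 50 Hz
walking = 9.8 + 2.0 * np.sin(2 * np.pi * 2 * t)  # ~2 steps/sec cadence
rng = np.random.default_rng(1)
sitting = 9.8 + 0.05 * rng.standard_normal(50)   # gravity + sensor noise

print(classify_window(walking), classify_window(sitting))  # walking stationary
```

The margin is what defeats the naive threshold: tiny noise wiggles while sitting never rise 0.5 m/s² above the window mean, so they are not counted as steps.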

What to Observe:

  • The naive threshold fails when you tilt the phone or shake it while stationary
  • Peak count + variance combination correctly distinguishes walking from random motion
  • Well-engineered features make the logic simple: one threshold vs. complex rules

Bonus: Try running while holding the phone—notice how variance increases dramatically compared to walking, enabling activity classification.

Common Pitfalls

Time-domain statistical features work well for slowly changing sensors (temperature, humidity) but miss the frequency-domain patterns that characterise vibration, audio, and RF signals. Match the feature engineering strategy to the signal characteristics.

Features computed on the full dataset (global mean, global variance) are not suitable for real-time inference on streaming data. Always engineer features within fixed-length sliding windows to enable online, real-time prediction.

Including both mean and median of the same signal, or all 13 MFCC coefficients plus their first and second derivatives without selection, can inflate the feature vector unnecessarily and cause overfitting. Apply feature selection or PCA to remove redundancy.

Computing feature normalisation parameters (mean, std) or selecting features based on their performance on the test set introduces data leakage. All feature engineering decisions must be made on the training set only.
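In code, that means fitting normalisation on the training split only, shown here with scikit-learn's StandardScaler on synthetic feature vectors:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(400, 12))   # 400 feature vectors
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

# Correct: fit on the training split only, then apply to both splits
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Leakage: fitting on the full dataset lets test rows influence the mean/std
leaky = StandardScaler().fit(X)  # wrong for evaluation

print(np.allclose(scaler.mean_, X_train.mean(axis=0)))  # True
```

The same rule applies to feature selection: rank and prune features using training data only, then freeze the feature set before touching the test split.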

4.14 Summary

This chapter covered feature engineering for IoT ML:

  • Good Features: High inter-class variance, low intra-class variance, cheap to compute
  • Domain Knowledge: Physics-based features (MFCC for audio, variance for motion) outperform generic statistics
  • Feature Selection: Use importance analysis and correlation pruning to reduce 36 → 5 features
  • Edge Optimization: Balance accuracy vs computational cost for deployment

Key Insight: Feature engineering contributes more to accuracy than algorithm choice—spend 80% of time on features, 20% on model selection.

Key Takeaway

Feature engineering contributes more to ML accuracy than algorithm choice – a simple decision tree with well-engineered physics-based features can outperform a deep neural network fed raw sensor data. Start with cheap statistical features (mean, variance), add domain-specific features only if accuracy is below 85%, and use correlation analysis to prune redundant features. Reducing from 36 features to 5 through importance ranking and correlation pruning costs less than 4% accuracy while making models 6x faster.

How does a smartwatch know if you are walking, running, or sitting? The Sensor Squad explains feature engineering!

Sammy the Sensor lives inside a smartwatch. Every second, he feels the wrist moving and writes down 50 tiny measurements about how fast the arm is going. But staring at 50 numbers per second is like trying to read a book that has a million pages – impossible!

“I need CLUES, not raw data!” says Max the Microcontroller.

So the Sensor Squad creates a detective toolkit:

Clue 1: Average Motion (Mean) “How bumpy is the ride overall?” If the average motion is LOW, the person is probably sitting. If it is HIGH, they are running!

Clue 2: How Much Things Change (Variance) “Is the motion smooth or jerky?” Walking has a nice steady pattern. Running is bouncier. Sitting barely moves at all.

Clue 3: How Often Direction Changes (Zero Crossings) “How many times does the arm swing back and forth?” Walking: about 2 swings per second. Running: 3-4 swings per second!

“See?” says Lila the LED. “Instead of 50 confusing numbers, we now have just 3 simple clues. And those 3 clues are WAY better at telling activities apart!”

Max runs his mini-brain (ML model) on just these 3 clues and gets it right 90% of the time. When they tried feeding all 50 raw numbers, the model only got 65% right!

Bella the Battery adds: “Plus, figuring out 3 clues uses barely any energy. Figuring out 50 raw numbers would drain me in hours!”

The lesson: Good clues (features) beat more data every time. It is like being a detective – one perfect fingerprint solves the case faster than a room full of random evidence!

4.14.1 Try This at Home!

Walk around your room for 10 seconds, then sit still for 10 seconds. Hold your hand flat and notice how it moves. When you walk, your hand bounces up and down in a rhythm. When you sit, it barely moves. Your smartwatch uses exactly these patterns (the “clues”) to know what you are doing!

4.15 What’s Next

| Direction | Chapter | Link |
|---|---|---|
| Next | Production ML | modeling-production.html |
| Previous | Audio Feature Processing | modeling-audio-features.html |
| Related | Edge ML and TinyML Deployment | modeling-edge-deployment.html |
| Related | ML Fundamentals | modeling-ml-fundamentals.html |