Feature engineering transforms raw sensor data into discriminative features that ML models can learn from effectively. Physics-based features like acceleration variance and MFCC outperform raw data, and a simple decision tree with well-engineered features can outperform a deep neural network fed raw sensor readings. Reducing 36 features to 5 through importance analysis and correlation pruning often costs less than 4% accuracy while making models 6x faster.
4.1 Learning Objectives
By the end of this chapter, you will be able to:
Design Discriminative Features: Create features that separate classes effectively
Apply Domain Knowledge: Use physics and domain expertise to engineer powerful features
Perform Feature Selection: Identify and remove redundant or uninformative features
Optimize for Edge Deployment: Balance feature quality with computational cost
Key Concepts
Feature engineering: The process of transforming raw sensor time series into informative numerical representations (features) that capture the patterns relevant to the prediction or classification task.
Time-domain features: Statistical descriptors computed directly on raw sensor values in the time domain: mean, standard deviation, skewness, kurtosis, peak-to-peak amplitude, zero-crossing rate.
Frequency-domain features: Descriptors of a signal’s frequency content computed via FFT: dominant frequency, spectral centroid, spectral bandwidth, spectral entropy — particularly valuable for vibration and audio signals.
Feature selection: The process of identifying which features in a large feature set are most informative for the task and removing redundant or irrelevant features that add noise without adding predictive power.
Feature importance: A metric (from tree-based models or permutation testing) quantifying how much each feature contributes to model accuracy, guiding which features to retain and which to discard.
Feature window: A fixed-length or event-triggered segment of the sensor time series from which a feature vector is extracted, balancing temporal resolution against feature stability.
For Beginners: Feature Engineering for IoT
Feature engineering is the craft of turning raw sensor numbers into meaningful inputs for machine learning. Think of it as preparing ingredients before cooking – raw data is the unpeeled potato, and features are the neatly chopped pieces ready for the recipe. Good features often matter more than fancy algorithms for getting accurate predictions.
Feature engineering is often more impactful than algorithm selection. A simple Decision Tree with well-engineered features can outperform a deep neural network with raw sensor data.
4.3.1 How It Works: Feature Engineering Process
Feature engineering transforms raw sensor streams into discriminative inputs through a systematic pipeline:
Raw Data Collection: Accelerometer samples arrive at 50 Hz (50 readings/second) as continuous streams of X/Y/Z values
Windowing: Segment continuous data into overlapping windows (e.g., 2 seconds = 100 samples per window)
Frequency Feature Extraction: Apply FFT to detect periodic patterns (e.g., walking has ~2Hz cadence, running ~3-4Hz)
Domain-Specific Features: Add physics-based features like zero-crossings (direction changes) and signal magnitude area (total energy)
Feature Selection: Prune redundant features using correlation analysis and importance ranking
Example Pipeline: Raw accelerometer (100 samples) → 12 engineered features → Random Forest classifier → Activity label (walk/run/sit)
The key insight is that well-designed features capture what human experts know about the domain—a running pattern has higher variance and faster frequency than walking—making them far more powerful than raw numbers.
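The pipeline above can be sketched in a few lines of Python. This is an illustrative helper (the function name and feature choices are ours, following the walking example in the text), not a library API:

```python
import numpy as np

def extract_window_features(mag, fs=50.0):
    """Time- and frequency-domain features from one window of
    accelerometer magnitude samples (illustrative sketch)."""
    mean = float(np.mean(mag))
    centered = mag - mean
    std = float(np.std(mag))
    # Zero crossings of the mean-removed signal = direction changes
    zero_crossings = int(np.sum(np.diff(np.sign(centered)) != 0))
    # Rough energy proxy (related to signal magnitude area)
    sma = float(np.mean(np.abs(centered)))
    # Dominant frequency via FFT, skipping the DC bin
    spectrum = np.abs(np.fft.rfft(centered))
    freqs = np.fft.rfftfreq(len(mag), d=1.0 / fs)
    dominant_freq = float(freqs[1:][np.argmax(spectrum[1:])])
    return {"mean": mean, "std": std, "zero_crossings": zero_crossings,
            "sma": sma, "dominant_freq": dominant_freq}

# Simulated 2-second "walking" window: ~2 Hz oscillation around 9.8 m/s²
t = np.arange(100) / 50.0
walking = 9.8 + 2.0 * np.sin(2 * np.pi * 2.0 * t)
features = extract_window_features(walking)
# features["dominant_freq"] recovers the ~2 Hz walking cadence
```

A real pipeline would run this once per (possibly overlapping) window and feed the resulting feature vectors to the classifier.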
4.3.2 Good vs Bad Features: Visual Comparison
Figure 4.1: Good versus Bad ML Features for Activity Classification
4.3.3 Good Feature Characteristics
High-Quality IoT Features
High inter-class variance (separates classes well)
Walking vs Running: Mean acceleration differs by 2-5 m/s²
Low intra-class variance (consistent within class)
Walking stays near ~2.5 m/s², regardless of user height/weight
Cheap to compute on-device
Computing a mean takes <0.1 ms on a Cortex-M4
Low-Quality IoT Features
Expensive to compute
Full FFT on a 1024-sample window: ~5 ms on a Cortex-M4 (vs <0.1 ms for a mean)
Correlated with other features (redundant)
Mean X + Mean Y + Mean Z vs Magnitude: Magnitude captures all three
Not domain-relevant
Battery level for activity recognition: chance-level performance
Try It: Feature Distribution Overlap Visualizer
Adjust the mean and spread of two activity classes to see how feature overlap affects classification accuracy. A good feature separates classes clearly with minimal overlap.
Simulate different activity patterns and watch how time-domain features are computed from raw accelerometer magnitude. Select an activity to see how mean, variance, and zero crossings differ across walking, running, and sitting.
Explore how the number of features affects accuracy, model size, and inference time for edge deployment. Adjust the constraints to find your optimal configuration.
The Fisher Score quantifies feature quality as between-class separation divided by within-class spread:

\[
F_i = \frac{\sum_c n_c \left(\mu_c^i - \mu^i\right)^2}{\sum_c n_c \left(\sigma_c^i\right)^2}
\]

Where:
- \(\mu_c^i\) = Mean of feature \(i\) for class \(c\)
- \(\mu^i\) = Global mean of feature \(i\)
- \(\sigma_c^i\) = Standard deviation of feature \(i\) for class \(c\)
- \(n_c\) = Number of samples in class \(c\)
Example: Mean Acceleration Magnitude for Walking vs Running
Interpretation: A high Fisher Score (7.07) means mean acceleration magnitude is an excellent feature: the separation between class means is about 7× larger than the variation within each class.
Compare to a BAD feature (device battery level): \[
F_{\text{battery}} = \frac{1000(82 - 80)^2 + 1000(78 - 80)^2}{1000(15)^2 + 1000(18)^2} = \frac{8000}{549000} = 0.015
\]
Battery level has Fisher Score of 0.015 (approximately 470x worse)—no predictive power for activity classification.
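Per-class summary statistics are all the formula needs, so the score is cheap to evaluate. A minimal sketch (the function name is ours, not a library API) that reproduces the battery example:

```python
def fisher_score(class_stats):
    """Fisher Score for one feature.
    class_stats: list of (n_c, mean_c, std_c), one tuple per class."""
    total_n = sum(n for n, _, _ in class_stats)
    global_mean = sum(n * m for n, m, _ in class_stats) / total_n
    between = sum(n * (m - global_mean) ** 2 for n, m, _ in class_stats)
    within = sum(n * s ** 2 for n, _, s in class_stats)
    return between / within

# Battery "feature": 1000 samples per class, means 82 and 78, stds 15 and 18
f = fisher_score([(1000, 82, 15), (1000, 78, 18)])
print(round(f, 3))  # -> 0.015, matching the worked example
```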
4.5.3 Interactive Fisher Score Calculator
Use this calculator to evaluate feature quality for your own IoT classification tasks. Adjust the class parameters to see how the Fisher Score changes.
Step 1: Start with statistical features (always compute first)
- Mean, Standard deviation, Min, Max, Range
- Cost: O(n) single pass
- Baseline: 70-80% accuracy

Step 2: Add domain-specific features (if accuracy < 85%)
- Accelerometer: Zero crossings, Peak count
- Audio: MFCCs, Spectral energy
- Temperature: Rate of change, Slope
- Accuracy boost: +10-15%

Step 3: Consider frequency domain (only if needed)
- FFT dominant frequency, Spectral entropy
- When: Periodic signals (walking, rotating machinery)
- Cost: ~1-5ms for 128-1024 sample FFT on Cortex-M4
- Skip if: Non-periodic data or tight latency budget

Step 4: Correlation analysis (remove redundancy)
- Drop features with r > 0.9 correlation
- Redundant features waste compute and model capacity

Step 5: Test on held-out data
- 80/20 split by users (not time!)
- Ensure test users not in training set
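Step 4's pruning rule can be implemented greedily: walk the feature list in priority order and keep a feature only if it is not strongly correlated with anything already kept. A sketch (the helper and the toy feature names are illustrative):

```python
import numpy as np
import pandas as pd

def prune_correlated(df, threshold=0.9):
    """Greedily drop one feature from every pair with |r| > threshold,
    keeping the column that appears first in the DataFrame."""
    corr = df.corr().abs()
    keep = []
    for col in df.columns:
        if all(corr.loc[col, k] <= threshold for k in keep):
            keep.append(col)
    return keep

# Toy data: mean_x nearly duplicates magnitude_mean (hypothetical names)
rng = np.random.default_rng(0)
mag = rng.normal(9.8, 1.0, 500)
df = pd.DataFrame({
    "magnitude_mean": mag,
    "mean_x": mag * 0.8 + rng.normal(0, 0.05, 500),  # r ~ 1 with magnitude
    "zero_crossings": rng.normal(4.0, 1.0, 500),     # independent
})
print(prune_correlated(df))  # -> ['magnitude_mean', 'zero_crossings']
```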
Try It: Correlation Pruning Simulator
Explore how removing highly correlated features reduces redundancy without significant accuracy loss. Adjust the correlation threshold to see which features survive pruning and the resulting impact on model efficiency.
Mistake 5: Training and testing on same user’s data
Bad: User A: 80% train, 20% test → 95% accuracy (overfitting)
Good: Users A+B: train, User C: test → 85% accuracy (generalizes)
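Scikit-learn's GroupShuffleSplit implements the user-level split directly. A toy sketch (the users, labels, and feature values are made up):

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical windows from three users
X = np.arange(12).reshape(6, 2)           # feature vectors
y = np.array([0, 0, 1, 1, 0, 1])          # activity labels
users = np.array(["A", "A", "B", "B", "C", "C"])

# Hold out whole users, never individual windows
splitter = GroupShuffleSplit(n_splits=1, test_size=0.33, random_state=42)
train_idx, test_idx = next(splitter.split(X, y, groups=users))

# No user appears in both sets, so accuracy reflects generalization
assert set(users[train_idx]).isdisjoint(set(users[test_idx]))
```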
4.8 Worked Example: Smart Agriculture Soil Monitoring
Scenario: Agricultural IoT with 12 sensors per station, ESP32 edge device (320KB RAM), <50ms inference budget.
Initial: 36 features → 2.8 MB model → 180ms inference (too slow!)
Feature Selection Process:
| Step | Features | Accuracy | Model Size | Inference |
|---|---|---|---|---|
| All 36 features | 36 | 91.2% | 2.8 MB | 180ms |
| Top 15 by importance | 15 | 90.8% | 1.1 MB | 95ms |
| Remove correlated | 8 | 89.7% | 420 KB | 52ms |
| Top 6 uncorrelated | 6 | 88.9% | 180 KB | 38ms |
| Final (top 5) | 5 | 87.2% | 95 KB | 28ms |
Final Feature Set:
soil_moisture_30cm (primary indicator)
soil_temp_15cm (evaporation driver)
moisture_rate_24h (trend)
humidity (atmospheric demand)
evapotranspiration (physics-based derived)
Key Insight: Top 5 features captured 76% of predictive power. Correlation analysis essential—soil moisture at 15cm and 30cm were r=0.89 correlated, so keeping both wastes capacity.
4.9 Feature Importance Analysis
```python
from sklearn.ensemble import RandomForestClassifier
import numpy as np

# Train model
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)

# Get feature importances
importances = model.feature_importances_
feature_names = ['mean', 'std', 'min', 'max', 'range', 'zero_crossings',
                 'sma', 'z_std', 'freq', 'iqr', 'skew', 'kurtosis']

# Sort by importance
indices = np.argsort(importances)[::-1]
print("Feature ranking:")
for i in range(len(feature_names)):
    print(f"{i+1}. {feature_names[indices[i]]}: {importances[indices[i]]:.3f}")

# Output example:
# 1. std: 0.342   <- Variance most discriminative
# 2. freq: 0.251  <- Frequency second
# 3. sma: 0.158   <- Signal magnitude area
# 4. mean: 0.089  <- Mean contributes but less
# ... (remaining < 5% each)

# Drop features with < 2% importance
important_features = [f for i, f in enumerate(feature_names)
                      if importances[i] > 0.02]
# Result: 12 features -> 7 features, 90% -> 89% accuracy, 40% faster
```
4.10 Knowledge Check
Worked Example: Feature Selection for Industrial Vibration Monitoring
Scenario: A manufacturing plant monitors motor health using a 3-axis accelerometer sampling at 1 kHz. Initial feature extraction produces 48 features per 2-second window: mean, std, min, max, range, RMS, peak-to-peak, kurtosis, skewness, spectral centroid, spectral rolloff, and zero-crossings for each axis (X/Y/Z), plus FFT dominant frequencies.
Challenge: 48 features × 500 Hz effective rate × 2 bytes per value = 48 KB/sec data stream. The ESP32 (520 KB RAM) can buffer only about 11 seconds of history before overflow. Model size: 2.1 MB (won’t fit). Inference time: 180 ms (too slow for real-time anomaly detection).
Step-by-Step Solution:
Step 1: Correlation Analysis
```python
import numpy as np
import pandas as pd

# Correlation matrix reveals redundancies
correlation_matrix = df[feature_names].corr()

# Identify highly correlated pairs (r > 0.9)
correlated_pairs = []
for i in range(len(correlation_matrix.columns)):
    for j in range(i + 1, len(correlation_matrix.columns)):
        if abs(correlation_matrix.iloc[i, j]) > 0.9:
            correlated_pairs.append((
                correlation_matrix.columns[i],
                correlation_matrix.columns[j],
                correlation_matrix.iloc[i, j]
            ))

# Results: Mean_X, Mean_Y, Mean_Z all r > 0.92 with Magnitude_Mean
# Solution: Keep only Magnitude_Mean (orientation-independent)
# Reduction: 48 → 40 features
```
Step 2: Feature Importance Ranking
```python
from sklearn.ensemble import RandomForestClassifier
import pandas as pd

# Train model on all 40 remaining features
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)  # 3 classes: Normal, Bearing Wear, Imbalance

# Extract importance scores
importances = pd.DataFrame({
    'feature': feature_names,
    'importance': model.feature_importances_
}).sort_values('importance', ascending=False)

# Top 10 features capture 89% of total importance:
# 1.  RMS_Magnitude          (0.18)
# 2.  Spectral_Centroid_X    (0.15)
# 3.  Kurtosis_Y             (0.12)
# 4.  FFT_Peak_1_Hz          (0.11)
# 5.  Std_Magnitude          (0.09)
# 6.  Zero_Crossings_Z       (0.08)
# 7.  Spectral_Rolloff_Y     (0.07)
# 8.  Peak_to_Peak_Magnitude (0.05)
# 9.  Range_X                (0.03)
# 10. Skewness_Z             (0.01)
```
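One way to turn the ranking into the final cut is a cumulative-importance threshold, which lands on 8 features here. A sketch (the function name and the 85% target are illustrative, not from the plant's code; the small epsilon guards against float rounding):

```python
def select_by_cumulative_importance(ranked, target=0.85):
    """ranked: (feature, importance) pairs sorted descending."""
    kept, cumulative = [], 0.0
    for name, importance in ranked:
        kept.append(name)
        cumulative += importance
        if cumulative >= target - 1e-9:  # epsilon for float rounding
            break
    return kept

ranked = [("RMS_Magnitude", 0.18), ("Spectral_Centroid_X", 0.15),
          ("Kurtosis_Y", 0.12), ("FFT_Peak_1_Hz", 0.11),
          ("Std_Magnitude", 0.09), ("Zero_Crossings_Z", 0.08),
          ("Spectral_Rolloff_Y", 0.07), ("Peak_to_Peak_Magnitude", 0.05),
          ("Range_X", 0.03), ("Skewness_Z", 0.01)]
print(len(select_by_cumulative_importance(ranked)))  # -> 8
```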
Winner: 8 features – Achieves 91.5% accuracy while meeting all memory and latency constraints.
Final Feature Set:
RMS_Magnitude: Overall vibration energy
Spectral_Centroid_X: Frequency distribution center
Kurtosis_Y: Spikiness (detects bearing wear)
FFT_Peak_1_Hz: Dominant frequency (detects imbalance)
Std_Magnitude: Variation intensity
Zero_Crossings_Z: Axial oscillation frequency
Spectral_Rolloff_Y: High-frequency content
Peak_to_Peak_Magnitude: Maximum swing amplitude
Results:
Data rate: 48 KB/sec → 8 KB/sec (6× reduction)
Model size: 2.1 MB → 180 KB (12× reduction)
Inference: 180 ms → 52 ms (3.5× faster)
Accuracy: 94.2% → 91.5% (2.7% acceptable loss)
Memory buffer: ~11 sec → 65 sec history
Key Insight: The sweet spot between accuracy and efficiency often exists around 8-12 well-chosen features. Beyond that, diminishing returns kick in—each additional feature costs RAM and compute but adds <0.5% accuracy.
Decision Framework: When to Use Which Feature Type
Use this framework to systematically select features for your IoT ML application:
| Stage | Question | If YES → | If NO → |
|---|---|---|---|
| 1. Start | Do I have labeled data for this application? | Proceed to Stage 2 | Collect labeled data first |
| 2. Baseline | Is the data time-series (sequential)? | Start with time-domain features (mean, std, min, max) | Use static features appropriate to data type |
| 3. Periodicity | Does the signal have periodic patterns (vibration, walking, heartbeat)? | Add frequency-domain features (FFT peaks, spectral energy) | Skip frequency features; focus on statistical |
| 4. Compute Budget | Is inference time <50ms critical? | Limit to 5-10 cheap features (mean, std, range) | Can use 15-25 features including FFT |
| 5. Memory Budget | RAM <100 KB? | Use int8 quantization, <10 features | Can store 20-40 features + larger models |
| 6. Domain Knowledge | Do I understand the physics? | Add domain-specific features (bearing frequencies, gait parameters) | Rely on generic statistical features |
| 7. Correlation Check | Are any features correlated >0.9? | Drop redundant features; keep most interpretable | Proceed to Stage 8 |
| 8. Importance Ranking | Do I have feature importance scores? | Keep top N features capturing >85% cumulative importance | Train Random Forest to get importance scores |
| 9. Validation | Does accuracy drop <5% with reduced features? | Deploy optimized feature set | Add features back until acceptable accuracy |
Example Decision Path: Smart Agriculture Soil Moisture Prediction
Q1: Labeled data available? → YES (6 months historical data)
Q2: Time-series? → YES (measurements every 15 min)
Q3: Periodic patterns? → NO (slow environmental trends, not oscillations) → Skip FFT features
Q4: Inference time critical? → NO (predictions every hour on gateway)
Q5: Memory constraint? → YES (ESP32 edge deployment) → Limit to 8-12 features
Q6: Domain knowledge? → YES (evapotranspiration physics) → Add: Rate of change features, ET calculation, humidity deficit
Q7: Correlation check? → YES (soil moisture at 15cm and 30cm depth: r=0.89) → Drop 15cm depth, keep 30cm (more stable)
Q8: Feature importance? → YES (trained Random Forest) → Top 5 features: soil_moisture_30cm, soil_temp_15cm, humidity, ET, moisture_rate_24h
Q9: Accuracy check? → 91.2% (all 36 features) vs 87.2% (top 5) = 4% drop → Acceptable! Deploy with 5 features
Feature Selection Cheat Sheet:
| Sensor Type | Always Include | Consider Adding | Usually Skip |
|---|---|---|---|
| Accelerometer | Mean magnitude, Std magnitude | FFT peaks, Zero crossings | Individual axis means |
| Temperature | Current value, Rate of change | 24-hour trend, Daily min/max | Absolute historical values |
| Audio | MFCCs (1-13), RMS energy | Spectral centroid, Zero-crossing rate | Raw waveform samples |
| Current | RMS, Mean, Std | Crest factor, THD | Instantaneous samples |
| GPS | Speed, Heading change rate | Distance from waypoint | Raw lat/lon coordinates |
When to Re-evaluate Features:
Every 3 months: Check for feature drift (distributions changing)
After accuracy drop >5%: Add features or retrain
When new sensor added: Correlation check with existing features
Before deployment: Verify inference time on target hardware
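The drift check in the first bullet can be as simple as comparing the recent mean of each feature against its training-time statistics. A heuristic sketch (function name, thresholds, and the soil-moisture numbers are illustrative):

```python
import math

def drifted_mean(baseline_mean, baseline_std, recent, k=3.0):
    """Flag drift when the recent mean sits more than k standard
    errors from the training-time mean (simple heuristic)."""
    n = len(recent)
    recent_mean = sum(recent) / n
    standard_error = baseline_std / math.sqrt(n)
    return abs(recent_mean - baseline_mean) > k * standard_error

# Soil-moisture feature trained at mean 32.0, std 4.0 (made-up numbers)
stable = [31.5, 32.8, 30.9, 33.1, 32.2, 31.7, 32.5, 31.9, 32.4]
shifted = [38.2, 39.1, 37.8, 38.9, 39.4, 38.5, 39.0, 38.3, 38.8]
print(drifted_mean(32.0, 4.0, stable))   # -> False
print(drifted_mean(32.0, 4.0, shifted))  # -> True
```

Production systems usually compare whole distributions (e.g., with a statistical distance), but a mean-shift test catches the most common failures cheaply.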
Common Mistake: Using Absolute Values Instead of Relative Features
The Mistake: An environmental monitoring system uses absolute temperature values (22.3°C, 22.7°C, 23.1°C) as features for anomaly detection. The model achieves 95% accuracy in the lab but fails completely when deployed to different geographic regions where baseline temperatures differ by 10-20°C.
Why It Happens:
Lab testing occurs in controlled environment (single location)
Absolute values work when all training and test data come from same baseline
Developers don’t anticipate deployment to different thermal environments
Model overfits to specific temperature ranges in training data
Real-World Impact:
```python
# Lab training data (California, baseline 20-25°C)
# Model learns: temp < 18°C = "cold anomaly"
#               temp > 28°C = "hot anomaly"

# Deployment in Alaska (baseline 0-5°C)
# Normal 2°C reading → Flagged as "cold anomaly" ❌
# Model generates 1,000+ false alerts per day
# Operators disable anomaly detection → System useless
```
The Fix: Use relative features instead of absolute values
Bad Approach:
```python
# Absolute temperature features
features = {
    'temp_current': 22.3,  # Meaningless without context
    'temp_avg_1h': 22.1,   # Location-dependent
    'temp_max_24h': 25.8   # Seasonal variation
}
```
Good Approach:
```python
# Relative and rate-based features
features = {
    # Rate of change (location-independent)
    'temp_rate_1h': +0.5,            # Degrees per hour
    'temp_rate_6h': +2.1,            # Trend direction

    # Deviation from local baseline (adaptive)
    'temp_deviation_24h_avg': -1.2,  # Current vs 24h average
    'temp_deviation_7d_avg': +0.8,   # Current vs weekly average

    # Percentile within historical distribution (normalized)
    'temp_percentile_30d': 65,       # 65th percentile of last 30 days

    # Time-of-day normalization
    'temp_vs_expected_hour': -0.3,   # Deviation from typical 10 AM temp
}
# These features work in ANY location because they are self-referential
```
Validation Example:
```python
# Test model on diverse deployments
locations = [
    ('California', 22),
    ('Alaska', 2),
    ('Arizona', 32),
    ('Norway', -5),
]

# Absolute features model
for loc, baseline in locations:
    test_data_absolute = generate_data(baseline, absolute_features=True)
    accuracy_absolute = model_absolute.score(test_data_absolute, labels)
    print(f"{loc}: {accuracy_absolute:.1%}")
# Results:
# California: 95.2%  (training location)
# Alaska:     12.3%  (completely fails)
# Arizona:    18.7%  (mostly false positives)
# Norway:      8.1%  (unusable)

# Relative features model
for loc, baseline in locations:
    test_data_relative = generate_data(baseline, relative_features=True)
    accuracy_relative = model_relative.score(test_data_relative, labels)
    print(f"{loc}: {accuracy_relative:.1%}")
# Results:
# California: 93.8%
# Alaska:     91.2%
# Arizona:    92.5%
# Norway:     90.7%
```
Other Common Absolute→Relative Transformations:
| Bad (Absolute) | Good (Relative) | Why Better |
|---|---|---|
| GPS latitude/longitude | Distance from home/waypoint, Speed, Heading change | Location-independent, privacy-preserving |
| Light sensor lux value | % change from baseline, Rate of darkening | Works indoors/outdoors, day/night |
| Audio volume (dB) | Volume change rate, Ratio to ambient | Microphone-independent, environment-adaptive |
| Pressure (Pa) | Pressure derivative (altitude change) | Works at any elevation |
| Battery voltage (V) | % remaining, Discharge rate | Device/battery-agnostic |
Checklist to Avoid This Mistake:

- Would this feature have the same meaning at a different deployment site?
- Is the feature independent of the specific device and sensor calibration?
- Is the feature expressed relative to a local baseline, rate, or percentile rather than a raw reading?

If you answer “no” to any question, convert to relative features before training your model.
Try It: Absolute vs Relative Feature Comparison
See why absolute sensor values fail across deployment locations while relative features generalize. Select a deployment scenario and observe how absolute features produce false alerts, while relative features remain stable.
4.11 Concept Relationships
Feature engineering connects to several IoT ML topics:
ML Fundamentals establishes why features matter more than raw data for model learning
Mobile Sensing provides real-world context for activity recognition features
IoT ML Pipeline shows where feature engineering fits in the 7-step workflow (Step 3)
Edge ML & Deployment demonstrates how feature selection enables models to run on resource-constrained devices
Production ML covers monitoring feature drift when deployed models degrade over time
The bidirectional relationship is critical: good features enable simpler models (Random Forest vs deep learning), and deployment constraints (10ms inference budget) force aggressive feature selection.
Try It: Phone Step Detector

Using your phone’s accelerometer magnitude, compute three features and use them to detect walking:

Peak count: Number of local maxima (each step creates a peak)
Zero crossings: Direction changes per second
Signal magnitude area: Total energy across all axes
Threshold the peak count feature: if peaks > 1.5 per second → walking, else stationary
Compare your feature-based approach to a naive “just count when magnitude > threshold” method
What to Observe:
The naive threshold fails when you tilt the phone or shake it while stationary
Peak count + variance combination correctly distinguishes walking from random motion
Well-engineered features make the logic simple: one threshold vs. complex rules
Bonus: Try running while holding the phone—notice how variance increases dramatically compared to walking, enabling activity classification.
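The exercise logic can be sketched as a toy classifier over simulated magnitude windows. The thresholds and helper name are illustrative, following the 1.5 peaks/second rule above:

```python
import numpy as np

def classify_motion(mag, fs=50.0):
    """Two-feature classifier from the exercise: peak count per second
    plus variance distinguish walking from near-stationary motion."""
    var = float(np.var(mag))
    mean = np.mean(mag)
    # A local maximum must also stand clearly above the mean to count
    peaks = 0
    for i in range(1, len(mag) - 1):
        if mag[i] > mag[i - 1] and mag[i] > mag[i + 1] and mag[i] > mean + 0.5:
            peaks += 1
    peaks_per_sec = peaks / (len(mag) / fs)
    if peaks_per_sec > 1.5 and var > 0.5:
        return "walking"
    return "stationary"

t = np.arange(200) / 50.0  # 4 seconds at 50 Hz
walking = 9.8 + 2.0 * np.sin(2 * np.pi * 2.0 * t)   # ~2 peaks/sec
sitting = 9.8 + 0.05 * np.random.default_rng(1).normal(size=200)
print(classify_motion(walking))  # -> walking
print(classify_motion(sitting))  # -> stationary
```

Requiring both conditions is what defeats the naive magnitude threshold: shaking a stationary phone can raise the mean, but it rarely produces the steady ~2 Hz peak rhythm of walking.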
Interactive Quiz: Match Concepts
Interactive Quiz: Sequence the Steps
Common Pitfalls
1. Using the same feature set for all sensor types
Time-domain statistical features work well for slowly changing sensors (temperature, humidity) but miss the frequency-domain patterns that characterise vibration, audio, and RF signals. Match the feature engineering strategy to the signal characteristics.
2. Computing features on the entire time series instead of windows
Features computed on the full dataset (global mean, global variance) are not suitable for real-time inference on streaming data. Always engineer features within fixed-length sliding windows to enable online, real-time prediction.
3. Including highly correlated features without dimensionality reduction
Including both mean and median of the same signal, or all 13 MFCC coefficients plus their first and second derivatives without selection, can inflate the feature vector unnecessarily and cause overfitting. Apply feature selection or PCA to remove redundancy.
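As a sketch of the PCA route (the feature construction and 95% variance target are illustrative): three features where two are near-duplicates collapse into two principal components:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic redundant feature matrix: "mean" and "median" of the same
# signal move together, so most variance lives in few components
rng = np.random.default_rng(0)
signal_level = rng.normal(0, 1, (300, 1))
X = np.hstack([
    signal_level + rng.normal(0, 0.01, (300, 1)),  # "mean"
    signal_level + rng.normal(0, 0.01, (300, 1)),  # "median" (near-duplicate)
    rng.normal(0, 1, (300, 1)),                    # independent feature
])

# Keep enough components to explain 95% of the variance
pca = PCA(n_components=0.95).fit(X)
print(pca.n_components_)  # -> 2: the redundant pair collapsed
```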
4. Doing feature engineering on the test set
Computing feature normalisation parameters (mean, std) or selecting features based on their performance on the test set introduces data leakage. All feature engineering decisions must be made on the training set only.
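In scikit-learn terms the rule is: fit the scaler (or selector) on training data only, then reuse the fitted transform on the test set. A minimal sketch with synthetic data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(20.0, 5.0, (100, 3))  # e.g. temperature features
X_test = rng.normal(22.0, 5.0, (40, 3))

# Correct: normalisation parameters come from the training set only
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# scaler.mean_ was computed from X_train alone — no test-set leakage
assert np.allclose(scaler.mean_, X_train.mean(axis=0))
```

Fitting on the combined data (or on the test set) would leak test statistics into training and inflate reported accuracy.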
Label the Diagram
4.14 Summary
This chapter covered feature engineering for IoT ML:
Good Features: High inter-class variance, low intra-class variance, cheap to compute
Domain Knowledge: Physics-based features (MFCC for audio, variance for motion) outperform generic statistics
Feature Selection: Use importance analysis and correlation pruning to reduce 36 → 5 features
Edge Optimization: Balance accuracy vs computational cost for deployment
Key Insight: Feature engineering contributes more to accuracy than algorithm choice—spend 80% of time on features, 20% on model selection.
Key Takeaway
Feature engineering contributes more to ML accuracy than algorithm choice – a simple decision tree with well-engineered physics-based features can outperform a deep neural network fed raw sensor data. Start with cheap statistical features (mean, variance), add domain-specific features only if accuracy is below 85%, and use correlation analysis to prune redundant features. Reducing from 36 features to 5 through importance ranking and correlation pruning costs less than 4% accuracy while making models 6x faster.
For Kids: Meet the Sensor Squad!
How does a smartwatch know if you are walking, running, or sitting? The Sensor Squad explains feature engineering!
Sammy the Sensor lives inside a smartwatch. Every second, he feels the wrist moving and writes down 50 tiny measurements about how fast the arm is going. But staring at 50 numbers per second is like trying to read a book that has a million pages – impossible!
“I need CLUES, not raw data!” says Max the Microcontroller.
So the Sensor Squad creates a detective toolkit:
Clue 1: Average Motion (Mean) “How bumpy is the ride overall?” If the average motion is LOW, the person is probably sitting. If it is HIGH, they are running!
Clue 2: How Much Things Change (Variance) “Is the motion smooth or jerky?” Walking has a nice steady pattern. Running is bouncier. Sitting barely moves at all.
Clue 3: How Often Direction Changes (Zero Crossings) “How many times does the arm swing back and forth?” Walking: about 2 swings per second. Running: 3-4 swings per second!
“See?” says Lila the LED. “Instead of 50 confusing numbers, we now have just 3 simple clues. And those 3 clues are WAY better at telling activities apart!”
Max runs his mini-brain (ML model) on just these 3 clues and gets it right 90% of the time. When they tried feeding all 50 raw numbers, the model only got 65% right!
Bella the Battery adds: “Plus, figuring out 3 clues uses barely any energy. Figuring out 50 raw numbers would drain me in hours!”
The lesson: Good clues (features) beat more data every time. It is like being a detective – one perfect fingerprint solves the case faster than a room full of random evidence!
4.14.1 Try This at Home!
Walk around your room for 10 seconds, then sit still for 10 seconds. Hold your hand flat and notice how it moves. When you walk, your hand bounces up and down in a rhythm. When you sit, it barely moves. Your smartwatch uses exactly these patterns (the “clues”) to know what you are doing!