%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ecf0f1'}}}%%
flowchart LR
S1[1. Data<br/>Collection] --> S2[2. Data<br/>Cleaning]
S2 --> S3[3. Feature<br/>Engineering]
S3 --> S4[4. Train/Test<br/>Split]
S4 --> S5[5. Model<br/>Selection]
S5 --> S6[6. Evaluation<br/>& Tuning]
S6 --> S7[7. Deployment<br/>& Monitoring]
S7 -.->|Retrain| S1
style S1 fill:#2C3E50,stroke:#16A085,color:#fff
style S2 fill:#16A085,stroke:#2C3E50,color:#fff
style S3 fill:#E67E22,stroke:#2C3E50,color:#fff
style S4 fill:#9B59B6,stroke:#2C3E50,color:#fff
style S5 fill:#3498DB,stroke:#2C3E50,color:#fff
style S6 fill:#1ABC9C,stroke:#2C3E50,color:#fff
style S7 fill:#27AE60,stroke:#2C3E50,color:#fff
1350 IoT Machine Learning Pipeline
1350.1 Learning Objectives
By the end of this chapter, you will be able to:
- Design ML Pipelines: Implement a systematic 7-step ML pipeline for IoT applications
- Avoid Common Pitfalls: Recognize and address data leakage, overfitting, and class imbalance
- Select Appropriate Models: Choose ML algorithms based on accuracy, latency, and deployment constraints
- Evaluate IoT ML Systems: Use appropriate metrics for imbalanced IoT datasets
1350.2 Prerequisites
- ML Fundamentals: Understanding training vs inference and feature extraction
- Mobile Sensing: Activity recognition concepts
This is part 3 of the IoT Machine Learning series:
- ML Fundamentals - Core concepts
- Mobile Sensing - HAR, transportation
- IoT ML Pipeline (this chapter) - 7-step pipeline
- Edge ML & Deployment - TinyML
- Audio Feature Processing - MFCC
- Feature Engineering - Feature design
- Production ML - Monitoring
1350.3 The 7-Step IoT ML Pipeline
The flowchart at the top of this chapter summarizes the seven stages; the sections below walk through each step in turn.
1350.4 Step 1: Data Collection
Goal: Gather representative sensor data that captures the full range of conditions your model will encounter.
| Data Source | Sampling Rate | Duration | Labels |
|---|---|---|---|
| Accelerometer | 50-100 Hz | 1-2 weeks | Activity type |
| Temperature | 0.02-1 Hz (one sample per minute to one per second) | 30+ days | Normal/Anomaly |
| Audio | 16 kHz | 10+ hours | Keyword/Not |
Best Practices:
- Collect from diverse users, devices, and environments
- Include edge cases (unusual activities, sensor noise)
- Document collection conditions (timestamp, device model, location), as sketched below
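A minimal logging sketch of these practices, assuming a pandas-based workflow; the log_reading helper and its field names are illustrative, not a prescribed schema:

import time
import pandas as pd

def log_reading(buffer, accel_xyz, label, device_id, location):
    # Each record carries the measurement plus its collection context,
    # so sessions can later be filtered by device, user, or environment
    buffer.append({
        'timestamp': time.time(),
        'ax': accel_xyz[0], 'ay': accel_xyz[1], 'az': accel_xyz[2],
        'label': label,            # e.g., 'walking', 'sitting'
        'device_id': device_id,    # e.g., 'esp32-07' (hypothetical ID)
        'location': location,      # e.g., 'lab', 'home', 'outdoor'
    })

buffer = []
log_reading(buffer, (0.1, 9.8, 0.3), 'walking', 'esp32-07', 'outdoor')
pd.DataFrame(buffer).to_csv('session_001.csv', index=False)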
1350.5 Step 2: Data Cleaning
Goal: Remove noise, handle missing values, and ensure data quality.
# Common cleaning operations
import pandas as pd

def clean_sensor_data(df):
    # Remove outliers (sensor glitches): keep only physically plausible magnitudes
    df = df[(df['accel_mag'] > 0) & (df['accel_mag'] < 50)]
    # Fill short gaps by linear interpolation (at most 10 consecutive samples)
    df = df.interpolate(method='linear', limit=10)
    # Drop rows that still contain missing values after interpolation
    df = df.dropna()
    # Remove duplicate readings that share a timestamp
    df = df.drop_duplicates(subset=['timestamp'])
    return df

Key Operations:
- Outlier removal: Filter physically impossible values
- Gap filling: Interpolate short gaps (< 10 samples)
- Timestamp alignment: Synchronize multi-sensor data
1350.6 Step 3: Feature Engineering
Goal: Transform raw sensor data into discriminative features that capture patterns.
%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ecf0f1'}}}%%
flowchart TB
Raw[Raw Sensor Data<br/>100 Hz, 3-axis] --> Window[Sliding Window<br/>2 sec, 50% overlap]
Window --> Time[Time Domain<br/>Mean, Std, Min, Max<br/>Zero Crossings]
Window --> Freq[Frequency Domain<br/>FFT Peaks, Energy<br/>Spectral Entropy]
Time --> Vector[Feature Vector<br/>27 features/window]
Freq --> Vector
Vector --> Norm[Normalize<br/>Zero mean, unit variance]
style Raw fill:#7F8C8D,stroke:#2C3E50,color:#fff
style Window fill:#2C3E50,stroke:#16A085,color:#fff
style Time fill:#16A085,stroke:#2C3E50,color:#fff
style Freq fill:#16A085,stroke:#2C3E50,color:#fff
style Vector fill:#E67E22,stroke:#2C3E50,color:#fff
style Norm fill:#27AE60,stroke:#2C3E50,color:#fff
Feature Categories (an extraction sketch follows the table):
| Category | Features | Purpose |
|---|---|---|
| Statistical | Mean, Std, Min, Max, IQR | Central tendency, spread |
| Signal Shape | Zero crossings, Peak count | Periodicity indicators |
| Frequency | FFT peaks, Spectral energy | Periodic patterns |
| Domain-Specific | Step frequency, Bearing frequencies | Application knowledge |
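The sketch below follows the diagram above (2 s windows at 100 Hz, 50% overlap) and computes a few of the listed features per window; the exact feature set here is illustrative, not the full 27-feature vector:

import numpy as np

def extract_features(signal, fs=100, win_sec=2.0, overlap=0.5):
    """Slide a window over a 1-D signal and emit one feature vector per window."""
    win = int(fs * win_sec)
    step = int(win * (1 - overlap))
    features = []
    for start in range(0, len(signal) - win + 1, step):
        w = signal[start:start + win]
        # Time-domain: count sign changes around the mean (periodicity indicator)
        zero_crossings = np.sum(np.diff(np.sign(w - w.mean())) != 0)
        # Frequency-domain: one-sided FFT magnitude spectrum
        spectrum = np.abs(np.fft.rfft(w))
        freqs = np.fft.rfftfreq(win, d=1/fs)
        peak_freq = freqs[np.argmax(spectrum[1:]) + 1]  # skip the DC bin
        spectral_energy = np.sum(spectrum**2) / win
        features.append([w.mean(), w.std(), w.min(), w.max(),
                         zero_crossings, peak_freq, spectral_energy])
    return np.array(features)

feats = extract_features(np.random.randn(100 * 60))  # 60 s of placeholder data
# In practice, fit the mean/std on the training split only and reuse at inference
feats = (feats - feats.mean(axis=0)) / feats.std(axis=0)  # zero mean, unit variance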
1350.7 Step 4: Train/Test Split
Critical for Time-Series: Use chronological splits, NOT random splits.
Wrong: Random 80/20 split (future data leaks into training)
Right: Chronological split (train on past, test on future)
# WRONG - Data leakage!
from sklearn.model_selection import train_test_split
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

# RIGHT - Chronological split (train on the past, test on the future)
train_end = int(len(X) * 0.7)
val_end = int(len(X) * 0.85)
X_train = X[:train_end]        # Days 1-21
X_val = X[train_end:val_end]   # Days 22-25
X_test = X[val_end:]           # Days 26-30

Split Strategy:
| Split | Percentage | Purpose |
|---|---|---|
| Training | 70% | Learn patterns |
| Validation | 15% | Tune hyperparameters |
| Test | 15% | Final evaluation |
1350.8 Step 5: Model Selection
Choose based on constraints:
| Model | Accuracy | Model Size | Inference | Best For |
|---|---|---|---|---|
| Decision Tree | 80-85% | 5-50 KB | < 1ms | Interpretable, MCU |
| Random Forest | 88-93% | 200-500 KB | 5-20ms | Tabular data, ESP32 |
| SVM | 85-90% | 10-100 KB | 1-5ms | High-dimensional |
| Neural Network | 92-98% | 1-10 MB | 20-100ms | Complex patterns |
| Quantized NN | 90-95% | 50-200 KB | 5-30ms | Edge AI |
%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ecf0f1'}}}%%
flowchart TB
Start[Start Model<br/>Selection] --> RAM{RAM < 256KB?}
RAM -->|Yes| Simple[Decision Tree<br/>or Logistic Reg]
RAM -->|No| Latency{Latency < 10ms?}
Latency -->|Yes| RF[Random Forest<br/>50-100 trees]
Latency -->|No| Accuracy{Need 95%+ acc?}
Accuracy -->|Yes| NN[Neural Network<br/>Quantized]
Accuracy -->|No| RF
style Start fill:#2C3E50,stroke:#16A085,color:#fff
style RAM fill:#E67E22,stroke:#2C3E50,color:#fff
style Latency fill:#E67E22,stroke:#2C3E50,color:#fff
style Accuracy fill:#E67E22,stroke:#2C3E50,color:#fff
style Simple fill:#27AE60,stroke:#2C3E50,color:#fff
style RF fill:#27AE60,stroke:#2C3E50,color:#fff
style NN fill:#27AE60,stroke:#2C3E50,color:#fff
1350.9 Step 6: Evaluation and Tuning
Use appropriate metrics for IoT (a computation sketch follows the table):
| Use Case | Primary Metric | Why |
|---|---|---|
| Activity Recognition | F1-Score, Accuracy | Balanced classes |
| Fall Detection | Recall, Specificity | Rare events, false alarm cost |
| Anomaly Detection | Precision @ Recall | Class imbalance |
| Prediction | MAE, RMSE | Continuous output |
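A short sketch of computing these classification metrics with scikit-learn, using hard-coded illustrative labels for a rare-event detector:

from sklearn.metrics import classification_report, recall_score

# Illustrative ground truth and predictions: 95 normal windows, 5 falls
y_true = ['normal'] * 95 + ['fall'] * 5
y_pred = ['normal'] * 94 + ['fall'] + ['normal'] * 2 + ['fall'] * 3

# Per-class precision/recall/F1 - far more informative than raw accuracy
print(classification_report(y_true, y_pred))
# For fall detection, track recall on the rare class explicitly
print('fall recall:', recall_score(y_true, y_pred, pos_label='fall'))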
Hyperparameter Tuning:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 15, None],
    'min_samples_leaf': [1, 2, 5]
}
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=TimeSeriesSplit(n_splits=5),  # chronological folds - plain 5-fold CV would leak future data
    scoring='f1_macro',
    n_jobs=-1
)
grid_search.fit(X_train, y_train)
best_model = grid_search.best_estimator_

1350.10 Step 7: Deployment and Monitoring
Deployment Options (an edge-conversion sketch follows the table):
| Location | When to Use | Tools |
|---|---|---|
| Edge (MCU) | Real-time, offline | TensorFlow Lite Micro |
| Edge (ESP32/RPi) | Moderate complexity | TensorFlow Lite |
| Cloud | Complex models, fleet analytics | AWS SageMaker, Azure ML |
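As one concrete path to the edge targets above, a hedged sketch of converting a small Keras model to TensorFlow Lite with post-training INT8 quantization; the tiny model and random calibration data here are placeholders for your trained model and training features:

import numpy as np
import tensorflow as tf

# Tiny illustrative classifier: 27 features -> 6 activity classes
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(27,)),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(6, activation='softmax'),
])
X_calib = np.random.randn(200, 27).astype('float32')  # stand-in for real training samples

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
# A few hundred representative samples calibrate the INT8 value ranges
converter.representative_dataset = lambda: ([x.reshape(1, -1)] for x in X_calib)
tflite_model = converter.convert()
open('activity_model.tflite', 'wb').write(tflite_model)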
Monitoring Checklist:
- Track inference latency (P50, P95, P99)
- Monitor prediction distribution
- Detect feature drift (KL divergence; see the sketch below)
- Set alerts for accuracy degradation
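A minimal drift-detection sketch using a histogram-based KL divergence between a feature's training-time and live distributions; the 0.1 alert threshold is an assumption to tune per deployment:

import numpy as np

def kl_divergence(p_samples, q_samples, bins=20):
    """KL(P || Q) between two sample sets, estimated via shared-range histograms."""
    lo = min(p_samples.min(), q_samples.min())
    hi = max(p_samples.max(), q_samples.max())
    p, _ = np.histogram(p_samples, bins=bins, range=(lo, hi))
    q, _ = np.histogram(q_samples, bins=bins, range=(lo, hi))
    eps = 1e-9  # avoid log(0) and division by zero in empty bins
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return np.sum(p * np.log(p / q))

train_feature = np.random.normal(0.0, 1.0, 10_000)  # distribution at training time
live_feature = np.random.normal(0.4, 1.2, 1_000)    # drifted live distribution
if kl_divergence(train_feature, live_feature) > 0.1:  # threshold is application-specific
    print('Feature drift detected - consider retraining')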
1350.11 Common Pipeline Pitfalls
Pitfall 1: The Lab-to-Field Gap
The Mistake: Developing ML models on carefully curated datasets collected under controlled laboratory conditions, then expecting the same performance when deployed to production environments.
Why It Happens: Lab datasets are convenient, well-labeled, and produce impressive accuracy numbers. Real-world data collection is expensive and messy.
The Fix:
1. Data augmentation: Add synthetic noise matching expected sensor characteristics (sketched below)
2. Domain randomization: Train on data from multiple devices and environments
3. Staged deployment: Deploy to 5% of devices first, monitor for accuracy degradation
4. Graceful degradation: Output confidence scores, reject uncertain predictions
Rule of thumb: If your lab accuracy is 95%, budget for 80-85% real-world accuracy.
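A sketch of fix 1, adding white sensor noise and per-window bias drift to clean lab windows; the noise levels are assumptions that should be matched to your sensor's datasheet:

import numpy as np

def augment_with_sensor_noise(windows, noise_std=0.05, drift_max=0.1, rng=None):
    """Simulate field conditions on clean lab windows (shape: n_windows x n_samples)."""
    rng = np.random.default_rng(rng)
    noisy = windows + rng.normal(0, noise_std, windows.shape)  # white sensor noise
    offsets = rng.uniform(-drift_max, drift_max, (len(windows), 1))
    return noisy + offsets                                     # per-window bias drift

clean = np.random.randn(100, 200)  # 100 lab-collected windows (placeholder)
augmented = np.vstack([clean, augment_with_sensor_noise(clean)])  # doubled training set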
Pitfall 2: The Accuracy Trap on Imbalanced Data
The Mistake: Training on imbalanced data (95% normal, 5% anomaly) and celebrating "95% accuracy" when the model just predicts "normal" for everything.
Why It Happens: Accuracy rewards majority class prediction.
The Fix:
- Use precision, recall, F1-score, and ROC-AUC
- Apply class weighting or SMOTE oversampling (class weighting is sketched below the example)
- Set decision thresholds based on business costs
Example: Fall detection with 99% normal, 1% falls:
- Naive model: 99% accuracy, 0% recall (misses all falls!)
- Proper model: 95% accuracy, 90% recall (catches most falls)
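A sketch of the class-weighting and threshold fixes on synthetic imbalanced data; the toy dataset and the 0.3 threshold are illustrative:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic imbalanced data: ~99% normal, ~1% falls (illustrative)
rng = np.random.default_rng(0)
X = rng.normal(0, 1, (5000, 10))
y = np.where(rng.random(5000) < 0.01, 'fall', 'normal')
X[y == 'fall'] += 2.0  # make falls somewhat separable

# class_weight='balanced' reweights classes inversely to their frequency,
# so the rare 'fall' class is not drowned out by 'normal'
clf = RandomForestClassifier(n_estimators=100, class_weight='balanced',
                             random_state=42).fit(X, y)

# Lower the decision threshold to trade precision for recall on the rare class
fall_idx = list(clf.classes_).index('fall')
fall_prob = clf.predict_proba(X)[:, fall_idx]
y_pred = np.where(fall_prob > 0.3, 'fall', 'normal')  # threshold set by miss cost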
Pitfall 3: Temporal Data Leakage
The Mistake: Using random train/test splits on time-series data, allowing the model to "see the future" during training.
Why It Happens: Default sklearn train_test_split uses random sampling.
The Fix: Always use chronological splits:
- Training: Past data (days 1-21)
- Validation: Near future (days 22-25)
- Test: Far future (days 26-30)
Impact: Models with data leakage show 10-20% higher test accuracy than real-world performance.
1350.12 Worked Example: Model Selection for Industrial Predictive Maintenance
Scenario: A manufacturing plant needs vibration-based predictive maintenance on 500 CNC machines. Each machine has a 3-axis accelerometer at 4 kHz. The edge device is an ESP32 (520KB RAM, 240 MHz).
Constraints:
- Model must fit in 150 KB
- Inference < 100 ms
- Target: > 90% recall for the Critical class, > 70% precision
Model Comparison:
| Model | Size | Latency | Critical Recall | Critical Precision |
|---|---|---|---|---|
| Random Forest (100 trees) | 2.1 MB | 45ms | 87% | 68% |
| Decision Tree Ensemble (10 trees) | 89 KB | 8ms | 82% | 71% |
| + Frequency Features | 112 KB | 35ms | 91% | 74% |
| + INT8 Quantization | 28 KB | 28ms | 90% | 73% |
Result: INT8 quantized decision tree ensemble with 16 features achieves 90% critical recall, 28ms inference, 28KB model size.
Key Insight: Start with the simplest model that fits your constraints, then add complexity only if metrics demand it.
1350.13 Knowledge Check
Question 1: What is the primary advantage of Random Forests over single Decision Trees for activity recognition?
Explanation: Random Forest = ensemble of many decision trees, each trained on random subset of data/features. Voting (classification) or averaging (regression) produces final prediction. Single tree achieves 75% test accuracy; Random Forest (100 trees) achieves 90% by correcting individual tree mistakes.
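A quick way to see this effect with scikit-learn on synthetic data (i.i.d. here, so a random split is safe, unlike the time-series case above); exact scores will vary by dataset:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=27, n_informative=10,
                           n_classes=4, n_clusters_per_class=2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print('single tree :', tree.score(X_te, y_te))    # typically lower - high variance
print('forest (100):', forest.score(X_te, y_te))  # voting averages out tree errors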
Question 2: What machine learning technique is most appropriate for real-time activity recognition on a smartphone with limited labeled data?
Explanation: Transfer learning leverages knowledge from large datasets to improve small-dataset performance. Pre-train on public dataset (100k users), freeze lower layers (general motion features), fine-tune upper layers on 1 hour of user data. Achieves 85% accuracy with 1 hour personal data vs 70% training from scratch.
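A hedged Keras sketch of the freeze-then-fine-tune pattern described above; the layer sizes, class count, and the commented-out weight file are assumptions:

import tensorflow as tf

# Pre-trained base assumed: lower layers learned general motion features
base = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(27,)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(32, activation='relu'),
])
# base.load_weights('pretrained_har.weights.h5')  # hypothetical weights from the public dataset

base.trainable = False  # freeze the general-purpose lower layers

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(16, activation='relu'),    # fine-tuned on user data
    tf.keras.layers.Dense(6, activation='softmax'),  # 6 activity classes (assumed)
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
# model.fit(X_user, y_user, epochs=10)  # ~1 hour of the user's labeled data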
1350.14 Summary
This chapter covered the systematic 7-step IoT ML pipeline:
1. Data Collection: Gather diverse, representative sensor data
2. Data Cleaning: Remove outliers, handle missing values, align timestamps
3. Feature Engineering: Extract time-domain and frequency-domain features
4. Train/Test Split: Use chronological splits to avoid data leakage
5. Model Selection: Choose based on RAM, latency, and accuracy constraints
6. Evaluation: Use F1-score, recall, precision for imbalanced data
7. Deployment: Monitor inference latency and prediction drift
Key Takeaways:
- Chronological splits are mandatory for time-series data
- Class imbalance requires specialized metrics and techniques
- Start simple (Decision Tree), add complexity only if needed
- Real-world accuracy is typically 10-15% lower than lab accuracy
1350.15 What’s Next
Continue to Edge ML & Deployment to learn TinyML techniques for deploying ML models on resource-constrained IoT devices.