1350  IoT Machine Learning Pipeline

1350.1 Learning Objectives

By the end of this chapter, you will be able to:

  • Design ML Pipelines: Implement a systematic 7-step ML pipeline for IoT applications
  • Avoid Common Pitfalls: Recognize and address data leakage, overfitting, and class imbalance
  • Select Appropriate Models: Choose ML algorithms based on accuracy, latency, and deployment constraints
  • Evaluate IoT ML Systems: Use appropriate metrics for imbalanced IoT datasets

1350.2 Prerequisites

Note: Chapter Series - Modeling and Inferencing

This is part 3 of the IoT Machine Learning series:

  1. ML Fundamentals - Core concepts
  2. Mobile Sensing - HAR, transportation
  3. IoT ML Pipeline (this chapter) - 7-step pipeline
  4. Edge ML & Deployment - TinyML
  5. Audio Feature Processing - MFCC
  6. Feature Engineering - Feature design
  7. Production ML - Monitoring

1350.3 The 7-Step IoT ML Pipeline

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ecf0f1'}}}%%
flowchart LR
    S1[1. Data<br/>Collection] --> S2[2. Data<br/>Cleaning]
    S2 --> S3[3. Feature<br/>Engineering]
    S3 --> S4[4. Train/Test<br/>Split]
    S4 --> S5[5. Model<br/>Selection]
    S5 --> S6[6. Evaluation<br/>& Tuning]
    S6 --> S7[7. Deployment<br/>& Monitoring]

    S7 -.->|Retrain| S1

    style S1 fill:#2C3E50,stroke:#16A085,color:#fff
    style S2 fill:#16A085,stroke:#2C3E50,color:#fff
    style S3 fill:#E67E22,stroke:#2C3E50,color:#fff
    style S4 fill:#9B59B6,stroke:#2C3E50,color:#fff
    style S5 fill:#3498DB,stroke:#2C3E50,color:#fff
    style S6 fill:#1ABC9C,stroke:#2C3E50,color:#fff
    style S7 fill:#27AE60,stroke:#2C3E50,color:#fff

Figure 1350.1: Seven-Step IoT ML Pipeline with Continuous Feedback Loop

1350.4 Step 1: Data Collection

Goal: Gather representative sensor data that captures the full range of conditions your model will encounter.

Data Source     Sampling Rate   Duration    Labels
Accelerometer   50-100 Hz       1-2 weeks   Activity type
Temperature     1-60 Hz         30+ days    Normal/Anomaly
Audio           16 kHz          10+ hours   Keyword/Not

Best Practices:

  • Collect from diverse users, devices, and environments
  • Include edge cases (unusual activities, sensor noise)
  • Document collection conditions (timestamp, device model, location); a minimal logging sketch follows this list
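
A minimal sketch of the documentation practice above; the field names and JSONL file format are illustrative choices, not tied to any particular SDK:

import json
import time

def make_record(accel_xyz, device_id, location, activity_label):
    """Bundle one accelerometer sample with its collection context."""
    return {
        'timestamp': time.time(),       # Unix epoch seconds
        'device_id': device_id,         # e.g., hardware serial number
        'location': location,           # free text or a GPS fix
        'label': activity_label,        # ground-truth activity
        'accel_x': accel_xyz[0],
        'accel_y': accel_xyz[1],
        'accel_z': accel_xyz[2],
    }

# Append one JSON record per line (JSONL) so logs stream and merge easily
with open('collection_log.jsonl', 'a') as f:
    record = make_record((0.1, -0.2, 9.8), 'dev-042', 'lab-3', 'walking')
    f.write(json.dumps(record) + '\n')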

1350.5 Step 2: Data Cleaning

Goal: Remove noise, handle missing values, and ensure data quality.

# Common cleaning operations
import pandas as pd

def clean_sensor_data(df: pd.DataFrame) -> pd.DataFrame:
    # Sort by time so gap interpolation fills values in the right order
    df = df.sort_values('timestamp')

    # Remove outliers (sensor glitches): drop physically impossible magnitudes
    df = df[(df['accel_mag'] > 0) & (df['accel_mag'] < 50)]

    # Handle missing values: interpolate short gaps (<= 10 samples), drop the rest
    df = df.interpolate(method='linear', limit=10)
    df = df.dropna()

    # Remove duplicate readings that share a timestamp
    df = df.drop_duplicates(subset=['timestamp'])

    return df

Key Operations:

  • Outlier removal: Filter physically impossible values
  • Gap filling: Interpolate short gaps (< 10 samples)
  • Timestamp alignment: Synchronize multi-sensor data (see the alignment sketch below)
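
Timestamp alignment deserves its own example; a minimal sketch using pandas merge_asof to pair each accelerometer sample with the nearest earlier temperature reading (column names are illustrative):

import pandas as pd

# Two sensors sampled at different rates, both keyed by time
accel = pd.DataFrame({'timestamp': pd.to_datetime(['2024-01-01 00:00:00.00',
                                                   '2024-01-01 00:00:00.02']),
                      'accel_mag': [9.8, 10.1]})
temp = pd.DataFrame({'timestamp': pd.to_datetime(['2024-01-01 00:00:00.01']),
                     'temp_c': [21.5]})

# merge_asof matches each accel row to the most recent temp reading
# within a 1-second tolerance; unmatched rows get NaN
aligned = pd.merge_asof(accel.sort_values('timestamp'),
                        temp.sort_values('timestamp'),
                        on='timestamp',
                        direction='backward',
                        tolerance=pd.Timedelta('1s'))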

1350.6 Step 3: Feature Engineering

Goal: Transform raw sensor data into discriminative features that capture patterns.

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ecf0f1'}}}%%
flowchart TB
    Raw[Raw Sensor Data<br/>100 Hz, 3-axis] --> Window[Sliding Window<br/>2 sec, 50% overlap]

    Window --> Time[Time Domain<br/>Mean, Std, Min, Max<br/>Zero Crossings]
    Window --> Freq[Frequency Domain<br/>FFT Peaks, Energy<br/>Spectral Entropy]

    Time --> Vector[Feature Vector<br/>27 features/window]
    Freq --> Vector

    Vector --> Norm[Normalize<br/>Zero mean, unit variance]

    style Raw fill:#7F8C8D,stroke:#2C3E50,color:#fff
    style Window fill:#2C3E50,stroke:#16A085,color:#fff
    style Time fill:#16A085,stroke:#2C3E50,color:#fff
    style Freq fill:#16A085,stroke:#2C3E50,color:#fff
    style Vector fill:#E67E22,stroke:#2C3E50,color:#fff
    style Norm fill:#27AE60,stroke:#2C3E50,color:#fff

Figure 1350.2: Feature Engineering Pipeline from Raw Data to Normalized Feature Vector

Feature Categories:

Category          Features                              Purpose
Statistical       Mean, Std, Min, Max, IQR              Central tendency, spread
Signal Shape      Zero crossings, Peak count            Periodicity indicators
Frequency         FFT peaks, Spectral energy            Periodic patterns
Domain-Specific   Step frequency, Bearing frequencies   Application knowledge
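
A minimal sketch of the windowing and extraction flow in Figure 1350.2; the features shown are a small subset, and a full 27-feature vector would add more from each category:

import numpy as np

def extract_features(signal, fs=100, win_s=2.0, overlap=0.5):
    """Slide a window over a 1-D signal and compute features per window."""
    win = int(fs * win_s)
    step = int(win * (1 - overlap))
    features = []
    for start in range(0, len(signal) - win + 1, step):
        w = signal[start:start + win]
        # Time domain: sign changes between consecutive samples
        zero_crossings = np.sum(np.signbit(w[:-1]) != np.signbit(w[1:]))
        # Frequency domain: magnitude spectrum with DC removed
        spectrum = np.abs(np.fft.rfft(w - w.mean()))
        features.append([
            w.mean(), w.std(), w.min(), w.max(),   # statistical features
            zero_crossings,
            spectrum.argmax() * fs / win,          # dominant frequency (Hz)
            (spectrum ** 2).sum(),                 # spectral energy
        ])
    return np.array(features)

# 10 s of stand-in 100 Hz accelerometer magnitude -> (9, 7) feature matrix
X = extract_features(np.random.randn(1000))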

1350.7 Step 4: Train/Test Split

Critical for Time-Series: Use chronological splits, NOT random splits.

Warning: Data Leakage

Wrong: Random 80/20 split (future data leaks into training)

Right: Chronological split (train on past, test on future)

# WRONG - Data leakage: random sampling mixes future windows into training
from sklearn.model_selection import train_test_split
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

# RIGHT - Chronological split: train on the past, evaluate on the future
train_end = int(len(X) * 0.7)
val_end = int(len(X) * 0.85)

X_train, y_train = X[:train_end], y[:train_end]              # Days 1-21
X_val, y_val = X[train_end:val_end], y[train_end:val_end]    # Days 22-25
X_test, y_test = X[val_end:], y[val_end:]                    # Days 26-30

Split Strategy:

Split        Percentage   Purpose
Training     70%          Learn patterns
Validation   15%          Tune hyperparameters
Test         15%          Final evaluation
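
When a single chronological split leaves too little data for tuning, forward-chaining cross-validation preserves the same past-to-future discipline; a minimal sketch with scikit-learn's TimeSeriesSplit:

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(30).reshape(-1, 1)   # 30 days of stand-in features

# Each fold trains on an expanding past window and validates on the next block
tscv = TimeSeriesSplit(n_splits=4)
for fold, (train_idx, val_idx) in enumerate(tscv.split(X)):
    print(f"fold {fold}: train days {train_idx[0]}-{train_idx[-1]}, "
          f"validate days {val_idx[0]}-{val_idx[-1]}")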

1350.8 Step 5: Model Selection

Choose based on constraints:

Model            Accuracy   Model Size   Inference   Best For
Decision Tree    80-85%     5-50 KB      < 1 ms      Interpretable, MCU
Random Forest    88-93%     200-500 KB   5-20 ms     Tabular data, ESP32
SVM              85-90%     10-100 KB    1-5 ms      High-dimensional
Neural Network   92-98%     1-10 MB      20-100 ms   Complex patterns
Quantized NN     90-95%     50-200 KB    5-30 ms     Edge AI

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ecf0f1'}}}%%
flowchart TB
    Start[Start Model<br/>Selection] --> RAM{RAM < 256KB?}

    RAM -->|Yes| Simple[Decision Tree<br/>or Logistic Reg]
    RAM -->|No| Latency{Latency < 10ms?}

    Latency -->|Yes| RF[Random Forest<br/>50-100 trees]
    Latency -->|No| Accuracy{Need 95%+ acc?}

    Accuracy -->|Yes| NN[Neural Network<br/>Quantized]
    Accuracy -->|No| RF

    style Start fill:#2C3E50,stroke:#16A085,color:#fff
    style RAM fill:#E67E22,stroke:#2C3E50,color:#fff
    style Latency fill:#E67E22,stroke:#2C3E50,color:#fff
    style Accuracy fill:#E67E22,stroke:#2C3E50,color:#fff
    style Simple fill:#27AE60,stroke:#2C3E50,color:#fff
    style RF fill:#27AE60,stroke:#2C3E50,color:#fff
    style NN fill:#27AE60,stroke:#2C3E50,color:#fff

Figure 1350.3: Model Selection Decision Tree Based on Device Constraints
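
The branches in Figure 1350.3 can be checked empirically before committing to a model; a minimal sketch that trains a candidate and measures its serialized size and single-sample latency (pickle size is only a rough proxy for the deployed footprint, and the data here is synthetic):

import pickle
import time
import numpy as np
from sklearn.ensemble import RandomForestClassifier

X = np.random.randn(1000, 27)        # 27 features/window, as in Fig. 1350.2
y = np.random.randint(0, 4, 1000)    # 4 activity classes (illustrative)

model = RandomForestClassifier(n_estimators=100).fit(X, y)

# Serialized size as a rough stand-in for on-device model size
size_kb = len(pickle.dumps(model)) / 1024

# Wall-clock time for one single-sample prediction
t0 = time.perf_counter()
model.predict(X[:1])
latency_ms = (time.perf_counter() - t0) * 1000

print(f"size: {size_kb:.0f} KB, latency: {latency_ms:.2f} ms")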

1350.9 Step 6: Evaluation and Tuning

Use appropriate metrics for IoT:

Use Case               Primary Metric        Why
Activity Recognition   F1-Score, Accuracy    Balanced classes
Fall Detection         Recall, Specificity   Rare events, false-alarm cost
Anomaly Detection      Precision @ Recall    Class imbalance
Prediction             MAE, RMSE             Continuous output
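
For the imbalanced rows in this table, accuracy alone hides failure; a minimal sketch showing why, using a synthetic 99%/1% fall-detection label set and scikit-learn's per-class report:

import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Synthetic 99% normal / 1% fall labels, plus a naive all-normal predictor
y_true = np.array([0] * 990 + [1] * 10)
y_naive = np.zeros_like(y_true)

# 99% accuracy but 0% recall on falls - the failure mode from Pitfall 2
print(confusion_matrix(y_true, y_naive))
print(classification_report(y_true, y_naive,
                            target_names=['normal', 'fall'],
                            zero_division=0))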

Hyperparameter Tuning:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [5, 10, 15, None],
    'min_samples_leaf': [1, 2, 5]
}

grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=TimeSeriesSplit(n_splits=5),  # forward-chaining CV; plain k-fold would leak future data
    scoring='f1_macro',
    n_jobs=-1
)

grid_search.fit(X_train, y_train)
best_model = grid_search.best_estimator_

1350.10 Step 7: Deployment and Monitoring

Deployment Options:

Location           When to Use                       Tools
Edge (MCU)         Real-time, offline                TensorFlow Lite Micro
Edge (ESP32/RPi)   Moderate complexity               TensorFlow Lite
Cloud              Complex models, fleet analytics   AWS SageMaker, Azure ML

Monitoring Checklist:

  • Track inference latency (P50, P95, P99)
  • Monitor prediction distribution
  • Detect feature drift (KL divergence); a drift-check sketch follows this list
  • Set alerts for accuracy degradation
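
One concrete way to implement the drift check above: histogram each feature over a live window and compare it with the training distribution. A minimal sketch using scipy's rel_entr (the 0.1 alert threshold is an illustrative assumption, not a standard value):

import numpy as np
from scipy.special import rel_entr

def kl_divergence(train_vals, live_vals, bins=20):
    """KL(live || train) over a shared histogram of one feature."""
    lo = min(train_vals.min(), live_vals.min())
    hi = max(train_vals.max(), live_vals.max())
    p, _ = np.histogram(live_vals, bins=bins, range=(lo, hi), density=True)
    q, _ = np.histogram(train_vals, bins=bins, range=(lo, hi), density=True)
    p, q = p + 1e-9, q + 1e-9          # smooth empty bins
    p, q = p / p.sum(), q / q.sum()
    return rel_entr(p, q).sum()

train_feat = np.random.normal(0.0, 1, 10_000)  # feature at training time
live_feat = np.random.normal(0.5, 1, 1_000)    # drifted live window

if kl_divergence(train_feat, live_feat) > 0.1:  # illustrative threshold
    print("feature drift detected - consider retraining")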

1350.11 Common Pipeline Pitfalls

Caution - Pitfall 1: Training on Clean Lab Data, Deploying to Noisy Real World

The Mistake: Developing ML models using carefully curated datasets collected under controlled laboratory conditions, then expecting the same performance when deployed to production environments.

Why It Happens: Lab datasets are convenient, well-labeled, and produce impressive accuracy numbers. Real-world data collection is expensive and messy.

The Fix:

  1. Data augmentation: Add synthetic noise matching expected sensor characteristics (see the sketch below)
  2. Domain randomization: Train on data from multiple devices and environments
  3. Staged deployment: Deploy to 5% of devices first, monitor for accuracy degradation
  4. Graceful degradation: Output confidence scores, reject uncertain predictions

Rule of thumb: If your lab accuracy is 95%, budget for 80-85% real-world accuracy.
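
The augmentation item in the fix list can be as simple as injecting noise and offsets that mimic field conditions; a minimal sketch (the noise scales are illustrative assumptions, not measured sensor characteristics):

import numpy as np

def augment(window, noise_std=0.05, drift_max=0.1, rng=np.random.default_rng()):
    """Simulate field conditions on one clean accelerometer window."""
    noisy = window + rng.normal(0, noise_std, window.shape)   # sensor noise
    noisy += rng.uniform(-drift_max, drift_max)               # calibration offset
    return noisy

# Each clean lab window yields several field-like training windows
clean = np.random.randn(200)
augmented = [augment(clean) for _ in range(5)]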

Caution - Pitfall 2: Ignoring Class Imbalance

The Mistake: Training on imbalanced data (95% normal, 5% anomaly) and celebrating “95% accuracy” when the model just predicts “normal” for everything.

Why It Happens: Accuracy rewards majority class prediction.

The Fix:

  • Use precision, recall, F1-score, and ROC-AUC
  • Apply class weighting or SMOTE oversampling (a class-weighting sketch follows the example below)
  • Set decision thresholds based on business costs

Example: Fall detection with 99% normal, 1% falls:

  • Naive model: 99% accuracy, 0% recall (misses all falls!)
  • Proper model: 95% accuracy, 90% recall (catches most falls)
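
Class weighting is often the cheapest of the fixes above; a minimal sketch using scikit-learn's built-in 'balanced' mode on synthetic data:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score

X = np.random.randn(1000, 10)
y = np.array([0] * 990 + [1] * 10)     # 99% normal, 1% fall

# 'balanced' scales each class weight by n_samples / (n_classes * class_count),
# so the 10 fall samples count as heavily as the 990 normal ones
model = RandomForestClassifier(class_weight='balanced').fit(X, y)

# Evaluated on training data only, for illustration
print("fall recall:", recall_score(y, model.predict(X), pos_label=1))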

Caution - Pitfall 3: Data Leakage in Time-Series

The Mistake: Using random train/test splits on time-series data, allowing the model to “see the future” during training.

Why It Happens: scikit-learn's train_test_split shuffles rows by default, so windows from the same future period land in both the training and test sets.

The Fix: Always use chronological splits:

  • Training: Past data (days 1-21)
  • Validation: Near future (days 22-25)
  • Test: Far future (days 26-30)

Impact: Models with data leakage show 10-20% higher test accuracy than real-world performance.

1350.12 Worked Example: Model Selection for Industrial Predictive Maintenance

Scenario: A manufacturing plant needs vibration-based predictive maintenance on 500 CNC machines. Each machine has a 3-axis accelerometer sampled at 4 kHz. The edge device is an ESP32 (520 KB RAM, 240 MHz).

Constraints:

  • Model must fit in 150 KB
  • Inference < 100 ms
  • Target: > 90% recall for the Critical class, > 70% precision

Model Comparison:

Model                               Size     Latency   Critical Recall   Critical Precision
Random Forest (100 trees)           2.1 MB   45 ms     87%               68%
Decision Tree Ensemble (10 trees)   89 KB    8 ms      82%               71%
+ Frequency Features                112 KB   35 ms     91%               74%
+ INT8 Quantization                 28 KB    28 ms     90%               73%

Result: The INT8-quantized decision tree ensemble with 16 features achieves 90% Critical recall, 28 ms inference, and a 28 KB model size.
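
The path from the winning ensemble to firmware is not shown above; one plausible route is transpiling the trained trees to plain C and compiling them into the ESP32 image. A minimal sketch using the m2cgen library (an illustrative option, not necessarily the plant's toolchain; the INT8 quantization step in the table would be a separate pass):

import m2cgen
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Stand-in training data: 16 features, 3 health classes (illustrative)
X = np.random.randn(500, 16)
y = np.random.randint(0, 3, 500)   # 0=Normal, 1=Warning, 2=Critical

model = RandomForestClassifier(n_estimators=10, max_depth=8).fit(X, y)

# Transpile the trained ensemble to dependency-free C for the firmware build
with open('model.c', 'w') as f:
    f.write(m2cgen.export_to_c(model))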

Key Insight: Start with the simplest model that fits your constraints, then add complexity only if metrics demand it.

1350.13 Knowledge Check

Question 1: What is the primary advantage of Random Forests over single Decision Trees for activity recognition?

Explanation: A Random Forest is an ensemble of many decision trees, each trained on a random subset of the data and features. Voting (classification) or averaging (regression) produces the final prediction. A single tree achieves 75% test accuracy, while a Random Forest of 100 trees reaches 90% because the ensemble averages out the mistakes of individual trees.

Question 2: What machine learning technique is most appropriate for real-time activity recognition on a smartphone with limited labeled data?

Explanation: Transfer learning leverages knowledge from large datasets to improve performance when labeled data is scarce. Pre-train on a public dataset (100k users), freeze the lower layers (which capture general motion features), and fine-tune the upper layers on the target user's data. This achieves 85% accuracy with 1 hour of personal data, versus 70% when training from scratch. A minimal sketch follows.
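
A minimal sketch of that freeze-and-fine-tune recipe in Keras; the layer sizes, window shape, and class count are illustrative assumptions:

import tensorflow as tf

# Pretend 'base' was pre-trained on a large public HAR dataset
base = tf.keras.Sequential([
    tf.keras.Input(shape=(128, 3)),            # 128-sample, 3-axis windows
    tf.keras.layers.Conv1D(32, 5, activation='relu'),
    tf.keras.layers.GlobalAveragePooling1D(),
])
base.trainable = False                         # freeze general motion features

# New head fine-tuned on ~1 hour of the target user's labeled data
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(16, activation='relu'),
    tf.keras.layers.Dense(6, activation='softmax'),  # 6 activity classes
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')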

1350.14 Summary

This chapter covered the systematic 7-step IoT ML pipeline:

  1. Data Collection: Gather diverse, representative sensor data
  2. Data Cleaning: Remove outliers, handle missing values, align timestamps
  3. Feature Engineering: Extract time-domain and frequency-domain features
  4. Train/Test Split: Use chronological splits to avoid data leakage
  5. Model Selection: Choose based on RAM, latency, and accuracy constraints
  6. Evaluation: Use F1-score, recall, precision for imbalanced data
  7. Deployment: Monitor inference latency and prediction drift

Key Takeaways:

  • Chronological splits are mandatory for time-series data
  • Class imbalance requires specialized metrics and techniques
  • Start simple (Decision Tree), add complexity only if needed
  • Real-world accuracy is typically 10-15% lower than lab accuracy

1350.15 What’s Next

Continue to Edge ML & Deployment to learn TinyML techniques for deploying ML models on resource-constrained IoT devices.