2  Modeling and Inferencing for IoT

Learning Objectives

After completing this chapter series, you will be able to:

  • Distinguish between training and inference phases of the IoT machine learning lifecycle
  • Design feature engineering pipelines that extract domain-specific features from raw sensor data
  • Select appropriate ML model families for common IoT problem types including classification, regression, and anomaly detection
  • Apply model optimization techniques such as quantization and pruning for deployment on constrained edge devices
  • Evaluate edge versus cloud deployment trade-offs based on latency, privacy, and connectivity requirements

In 60 Seconds

Machine learning for IoT transforms raw sensor streams into actionable intelligence through a pipeline of feature engineering, model training, validation, and edge deployment. The distinctive challenge is producing models compact enough to fit on resource-constrained devices while maintaining sufficient accuracy. The key insight is that most IoT ML success comes from good feature engineering, not from choosing a sophisticated algorithm.

MVU — Minimum Viable Understanding

Effective IoT machine learning depends more on well-engineered features from domain knowledge than on choosing the most complex algorithm. Getting the data pipeline right—from sensor data collection through feature engineering to model deployment—is the single biggest determinant of model accuracy in production.

Characters: Sammy the Sensor, Lila the LED, Max the Microcontroller, Bella the Battery

Sammy: “I collect numbers all day—temperature, motion, light. But what do they mean?”

Max: “That’s where machine learning comes in! Think of it like teaching a puppy. You show the puppy lots of pictures of cats, and eventually it learns to recognize them. We show a computer lots of sensor readings, and it learns patterns too!”

Lila: “So if Sammy records vibrations from a machine, the computer can learn what healthy vibrations look like and what broken vibrations look like?”

Max: “Exactly! That’s called predictive maintenance—the computer warns you before something breaks, like a doctor checking your heartbeat.”

Bella: “But I don’t have enough energy to run big programs! Can the learning happen on my tiny chip?”

Max: “Yes! That’s called TinyML—we shrink the trained model so it fits on small devices like us. It’s like summarizing a whole textbook into a one-page cheat sheet!”

Sammy: “So I collect data, Max learns from it, and Lila can flash a warning if something goes wrong? Teamwork!”

Machine learning (ML) for IoT means teaching computers to find patterns in sensor data so they can make predictions or decisions automatically. Instead of writing explicit rules like “if temperature > 80°C, send alert,” ML lets the system learn what “normal” looks like from historical data and flag anything unusual.

Three things to know:

  1. Training vs. Inference: Training is the learning phase (needs lots of data and compute). Inference is using the trained model to make predictions (can run on tiny devices).
  2. Features matter most: The way you prepare and transform raw sensor readings (called “feature engineering”) has a bigger impact on accuracy than which algorithm you pick.
  3. Edge vs. Cloud: You can run ML models on the sensor device itself (edge—fast, private, but limited) or in the cloud (powerful, but needs connectivity and adds latency).

If this is your first time, start with the ML Fundamentals chapter.
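
The contrast between an explicit rule and a learned notion of "normal" can be sketched in a few lines. This is a minimal illustration, not a production detector: the temperature readings, the 3-sigma cutoff, and the function names are all assumptions made for the example.

```python
import statistics

# Hypothetical historical temperature readings (deg C) from normal operation.
history = [61.8, 62.4, 61.9, 62.1, 62.6, 61.7, 62.3, 62.0, 62.2, 61.9]

def rule_based_alert(temp_c, limit_c=80.0):
    """Explicit rule: alert only when a fixed limit is exceeded."""
    return temp_c > limit_c

def fit_normal(readings):
    """'Training': summarize what normal looks like (mean and spread)."""
    return statistics.mean(readings), statistics.stdev(readings)

def learned_alert(temp_c, model, n_sigmas=3.0):
    """'Inference': flag readings far from the learned normal range."""
    mean, std = model
    return abs(temp_c - mean) > n_sigmas * std

model = fit_normal(history)
for reading in (62.1, 70.0, 85.0):
    print(reading, rule_based_alert(reading), learned_alert(reading, model))
```

A reading of 70 °C stays below the fixed 80 °C limit, so the rule is silent, yet it sits far outside what this machine normally does, so the learned check flags it. Fitting the mean and spread is the training phase; the cheap comparison in `learned_alert` is the inference phase.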

2.1 Overview

Machine learning transforms raw IoT sensor data into actionable insights—detecting activities, predicting failures, and enabling intelligent automation. This chapter series covers the complete ML lifecycle from data collection through production deployment.

Vertical process diagram showing the IoT data science pipeline from sensor data collection to cleaning, feature engineering, model training, deployment, and monitoring with a feedback loop for retraining
Figure 2.1: The data science pipeline for IoT follows a systematic progression from raw sensor streams to deployed models.

2.2 Chapter Series

This topic has been organized into seven focused chapters for easier navigation:

  1. ML Fundamentals
    Key topics: Training vs inference, feature extraction, edge vs cloud.
    Difficulty: Beginner.
  2. Mobile Sensing & Activity Recognition
    Key topics: HAR, transportation mode detection, duty cycling.
    Difficulty: Intermediate.
  3. IoT ML Pipeline
    Key topics: 7-step pipeline, data leakage, model selection.
    Difficulty: Intermediate.
  4. Edge ML & TinyML Deployment
    Key topics: Quantization, pruning, HVAC predictive control.
    Difficulty: Intermediate.
  5. Audio Feature Processing
    Key topics: MFCC extraction, wake word detection.
    Difficulty: Intermediate.
  6. Feature Engineering
    Key topics: Good vs bad features, domain knowledge.
    Difficulty: Intermediate.
  7. Production ML
    Key topics: Monitoring, anomaly detection, predictive maintenance.
    Difficulty: Advanced.

2.3 Learning Path

2.3.1 For Beginners

Start with ML Fundamentals to understand:

  • What machine learning does for IoT
  • The difference between training and inference
  • Why feature extraction matters
  • When to use edge vs cloud ML

2.3.2 For Practitioners

Follow the complete pipeline:

  1. Mobile Sensing - Real-world activity recognition
  2. IoT ML Pipeline - Systematic 7-step approach
  3. Edge Deployment - TinyML and quantization
  4. Feature Engineering - Designing discriminative features

2.3.3 For Production Engineers

Focus on deployment and operations:

2.4 Key Concepts Summary

2.4.1 The IoT ML Pipeline

Vertical flowchart showing the six-stage IoT machine learning lifecycle: collect, engineer features, train and validate, optimize, deploy, and monitor, with a feedback loop from monitoring back to data collection
Figure 2.2: The end-to-end IoT ML lifecycle from data collection through deployment.

2.4.2 Edge vs Cloud Decision

Choose edge ML when:

  • Latency must stay below 100 ms
  • Sensor data is sensitive and should remain on-device
  • Connectivity is intermittent or expensive
  • The deployed model must fit in less than 1 MB

Choose cloud ML when:

  • A response time of 1–5 seconds is acceptable
  • Data can be anonymized before upload
  • Devices have reliable network connectivity
  • The model is larger than 10 MB or needs powerful compute
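
The two checklists above can be turned into a small decision helper. The sketch below is one possible encoding, not a definitive rule: the `Requirements` fields, the "hybrid" outcome for conflicting constraints, and the default to cloud when nothing forces on-device inference are assumptions added for illustration. Only the 100 ms and 10 MB thresholds come from the lists.

```python
from dataclasses import dataclass

@dataclass
class Requirements:
    max_latency_ms: float    # tightest acceptable response time
    data_is_sensitive: bool  # should raw sensor data stay on-device?
    reliable_network: bool   # is connectivity dependable and affordable?
    model_size_mb: float     # size of the deployable model

def suggest_deployment(req):
    """Return (choice, reasons) where choice is 'edge', 'cloud', or 'hybrid'."""
    edge_reasons = []
    if req.max_latency_ms < 100:
        edge_reasons.append("latency must stay below 100 ms")
    if req.data_is_sensitive:
        edge_reasons.append("sensitive data should remain on-device")
    if not req.reliable_network:
        edge_reasons.append("connectivity is intermittent or expensive")

    cloud_reasons = []
    if req.model_size_mb > 10:
        cloud_reasons.append("model is larger than 10 MB")

    if edge_reasons and cloud_reasons:
        # Conflicting constraints: split the workload or shrink the model.
        return "hybrid", edge_reasons + cloud_reasons
    if edge_reasons:
        return "edge", edge_reasons
    return "cloud", cloud_reasons or ["no constraint forces on-device inference"]

# Hypothetical device profile: tight latency, small quantized model.
pump = Requirements(max_latency_ms=80, data_is_sensitive=False,
                    reliable_network=True, model_size_mb=0.18)
print(suggest_deployment(pump))
```

Returning the reasons alongside the choice keeps the decision auditable, which helps when requirements change later.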

Consider a predictive maintenance ML model for 500 industrial pumps:

Cloud ML approach:

  • Model: 45 MB TensorFlow SavedModel
  • Raw sensor upload: 10 kHz vibration × 2 bytes = 20 KB/s per pump
    \[\text{Fleet upload} = 20\text{ KB/s} \times 500 = 10{,}000\text{ KB/s} = 10\text{ MB/s}\]
    \[\text{Monthly bandwidth} = 10\text{ MB/s} \times 2{,}592{,}000\text{ s} = 25{,}920{,}000\text{ MB} = 25{,}920\text{ GB} \approx 26\text{ TB}\]
    \[\text{AWS cost} = 26{,}000\text{ GB} \times \text{\$0.09/GB} = \text{\$2,340/month}\]
  • Inference latency: 150-300 ms round-trip

Edge ML approach (TinyML on ESP32):

  • Model: 180 KB INT8 quantized TensorFlow Lite
  • Upload only anomaly alerts: ~5 alerts/day per pump × 100 bytes
    \[\text{Monthly bandwidth} = 5 \times 100\text{ bytes} \times 500 \times 30 = 7.5\text{ MB/month}\]
    \[\text{Cost} = 7.5\text{ MB} \times \text{\$0.09/GB} \approx \text{\$0.00} \text{ (negligible)}\]
  • Inference latency: 65 ms local processing

Result: Edge ML saves $2,340/month in bandwidth while reducing latency by 55-80%. Initial cost: ESP32 modules ($3 each × 500 = $1,500 one-time). Payback period: 19 days.
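
The arithmetic in this comparison is easy to reproduce and adapt to another fleet. The sketch below uses the figures from the example (500 pumps, 10 kHz × 2 bytes, $0.09/GB, $3 per ESP32); the function and constant names are illustrative assumptions. Because it does not round 25,920 GB up to 26 TB, the cloud cost comes out slightly lower than the $2,340 quoted above.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # 2,592,000 s, as in the example
PRICE_PER_GB = 0.09                 # USD per GB, transfer price from the example
PUMPS = 500

def cloud_monthly_gb(sample_rate_hz=10_000, bytes_per_sample=2, pumps=PUMPS):
    """Raw vibration upload for the whole fleet, in decimal GB per month."""
    total_bytes = sample_rate_hz * bytes_per_sample * pumps * SECONDS_PER_MONTH
    return total_bytes / 1e9

def edge_monthly_gb(alerts_per_day=5, bytes_per_alert=100, pumps=PUMPS, days=30):
    """Alert-only upload for the whole fleet, in decimal GB per month."""
    return alerts_per_day * bytes_per_alert * pumps * days / 1e9

cloud_gb = cloud_monthly_gb()
edge_gb = edge_monthly_gb()
cloud_cost = cloud_gb * PRICE_PER_GB
edge_cost = edge_gb * PRICE_PER_GB

hardware_cost = 3 * PUMPS  # one-time ESP32 cost, USD
payback_days = hardware_cost / (cloud_cost - edge_cost) * 30

print(f"Cloud: {cloud_gb:,.0f} GB/month -> ${cloud_cost:,.2f}")
print(f"Edge:  {edge_gb * 1000:.1f} MB/month -> ${edge_cost:.4f}")
print(f"Payback: {payback_days:.1f} days")
```

Changing `PUMPS`, the sample rate, or the alert rate shows how quickly raw-data upload dominates the bill as a fleet grows.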

2.4.3 Feature Engineering Priority

Feature engineering contributes more to accuracy than algorithm choice:

  1. Domain knowledge (physics-based features) > generic statistics
  2. Time-domain features (mean, variance) are cheap and effective
  3. Frequency-domain features (FFT) add 5-10% accuracy for periodic signals
  4. Correlation analysis removes redundant features
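
As a concrete illustration of items 2 and 3, the sketch below computes three time-domain features and one frequency-domain feature using only the standard library. The synthetic 50 Hz vibration window and the function names are assumptions for the example. The naive O(n²) DFT stands in for a real FFT routine, which is what a deployment would normally use.

```python
import cmath
import math

def time_domain_features(window):
    """Cheap statistics computed directly on the samples."""
    n = len(window)
    mean = sum(window) / n
    variance = sum((x - mean) ** 2 for x in window) / n
    rms = math.sqrt(sum(x * x for x in window) / n)
    return {"mean": mean, "variance": variance, "rms": rms}

def dominant_frequency(window, sample_rate_hz):
    """Frequency (Hz) of the strongest non-DC bin, via a naive DFT."""
    n = len(window)
    mean = sum(window) / n
    centered = [x - mean for x in window]  # remove the DC offset
    best_bin, best_mag = 0, 0.0
    for k in range(1, n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(centered))
        if abs(coeff) > best_mag:
            best_bin, best_mag = k, abs(coeff)
    return best_bin * sample_rate_hz / n

# Hypothetical vibration window: 50 Hz sine sampled at 1 kHz for 0.2 s.
sample_rate = 1000
window = [0.5 * math.sin(2 * math.pi * 50 * i / sample_rate) for i in range(200)]

features = time_domain_features(window)
features["dominant_hz"] = dominant_frequency(window, sample_rate)
print(features)
```

The time-domain features cost a single pass over the window, while the frequency feature is more expensive but exposes the periodic structure that mean and variance cannot see.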

2.4.4 ML Model Types for IoT Tasks

Understanding which model family to use for a given IoT problem is critical:

  • Classification: Suitable models include Decision Trees, Random Forest, SVM, and CNN. Typical use cases: activity recognition and fault detection.
  • Regression: Suitable models include Linear Regression, Gradient Boosted Trees, and MLP. Typical use cases: temperature prediction and energy forecasting.
  • Anomaly Detection: Suitable models include Isolation Forest, One-Class SVM, and Autoencoder. Typical use cases: equipment fault detection and intrusion detection.
  • Time-Series Forecasting: Suitable models include LSTM, GRU, Prophet, and ARIMA. Typical use cases: demand prediction and environmental monitoring.
  • Clustering: Suitable models include k-Means, DBSCAN, and Gaussian Mixture. Typical use cases: device profiling and usage pattern discovery.

2.4.5 Model Optimization for Constrained Devices

Deploying ML models on IoT devices requires shrinking them without unacceptable accuracy loss. The typical optimization pipeline progresses through several stages:

  1. Pruning: Remove low-magnitude weights. Typical size reduction: 2–4x. Typical accuracy impact: 0.5–2% loss.
  2. Quantization: Convert FP32 weights to INT8. Typical size reduction: 4x. Typical accuracy impact: 1–3% loss.
  3. Knowledge Distillation: Train a compact student model from a larger teacher. Typical size reduction: 5–20x. Typical accuracy impact: 2–5% loss.
  4. Hardware-Specific Compilation: Target runtimes such as TensorFlow Lite and ONNX Runtime. Typical size reduction: 1.2–2x. Typical accuracy impact: negligible.
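
The first two stages can be sketched on a toy weight vector. This is a simplified illustration under stated assumptions: the weights are made up, the pruning is plain magnitude pruning, and the quantizer is a symmetric per-tensor scheme. Real toolchains apply these ideas per layer, with calibration data and often fine-tuning.

```python
def prune_by_magnitude(weights, fraction):
    """Zero out the given fraction of weights with the smallest magnitude."""
    n_prune = int(len(weights) * fraction)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned

def quantize_int8(weights):
    """Symmetric quantization: w is approximated by scale * q, q in [-127, 127]."""
    max_abs = max(abs(w) for w in weights) or 1.0  # avoid a zero scale
    scale = max_abs / 127
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate FP32 values from INT8 codes."""
    return [qi * scale for qi in q]

# Hypothetical weights of one small layer.
weights = [0.82, -0.03, 0.40, 0.007, -0.66, 0.12, -0.001, 0.29]

pruned = prune_by_magnitude(weights, 0.5)
q, scale = quantize_int8(pruned)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(pruned, restored))

print("pruned:   ", pruned)
print("int8:     ", q)
print("max error:", max_error)
print("bytes: FP32 =", 4 * len(weights), "-> INT8 =", len(q))  # 4x smaller
```

The rounding error of each weight is bounded by half the scale, which is why INT8 quantization usually costs little accuracy when the weight range is not dominated by outliers.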

Common Pitfalls
  1. Skipping feature engineering and jumping to deep learning: In IoT, well-crafted domain-specific features (e.g., vibration RMS, rolling averages, frequency peaks) typically outperform throwing raw data at a neural network. A random forest with 10 good features often beats a deep model trained on raw accelerometer samples.

  2. Training on data without temporal splits: IoT data is time-ordered. If you randomly shuffle data for train/test splitting, you leak future information into the training set (“data leakage”), producing overly optimistic accuracy that collapses in production. Always split by time—train on past, test on future.

  3. Ignoring model drift after deployment: Sensor behavior changes over time due to aging, environmental shifts, and firmware updates. A model that was 95% accurate at deployment can degrade to 70% within months if you do not monitor for concept drift and retrain periodically.

  4. Over-fitting to a single device: Training a model on data from one sensor and deploying it across hundreds of identical units can fail due to manufacturing variance. Sensors of the same type can exhibit different offsets, gains, and noise characteristics. Train on data from multiple devices or apply domain adaptation techniques.

  5. Ignoring class imbalance in fault detection: In predictive maintenance, “healthy” readings vastly outnumber “fault” readings (often 99:1 or worse). A model that always predicts “healthy” achieves 99% accuracy but catches zero faults. Evaluate with precision, recall, and F1-score instead of raw accuracy, and handle the imbalance during training with resampling (e.g., SMOTE) or class weighting.
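
Pitfall 5 is easy to demonstrate numerically. The sketch below builds a 99:1 label set and compares a model that always predicts “healthy” with a hypothetical detector that catches 8 of 10 faults at the cost of 20 false alarms. Both prediction vectors are invented for illustration.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for the positive (fault) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# 990 healthy readings (0) followed by 10 fault readings (1).
y_true = [0] * 990 + [1] * 10

always_healthy = [0] * 1000
# Hypothetical detector: 20 false alarms, 8 of 10 faults caught.
detector = [1] * 20 + [0] * 970 + [1] * 8 + [0] * 2

for name, y_pred in (("always healthy", always_healthy), ("detector", detector)):
    p, r, f1 = precision_recall_f1(y_true, y_pred)
    print(f"{name:>15}: accuracy={accuracy(y_true, y_pred):.3f} "
          f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

The trivial model wins on accuracy (0.990 vs 0.978) while catching no faults at all, which is exactly why accuracy is the wrong headline metric here.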

2.7 Knowledge Check

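Build an ML pipeline, from raw sensor data to a trained and evaluated model, using only Python’s standard library (no external dependencies). The code walks through the first four stages of the 6-stage lifecycle described above (collect, engineer features, train and evaluate, optimize); deployment and monitoring are out of scope for this exercise.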

import random
import math

random.seed(42)

# Stage 1: COLLECT - Simulate 3-axis accelerometer for activity recognition
# Activities: 0=sitting, 1=walking, 2=running
def generate_samples(n_per_class=100):
    data, labels = [], []
    for _ in range(n_per_class):
        # Sitting: low variance, near-zero mean
        data.append([random.gauss(0, 0.1) for _ in range(3)])
        labels.append(0)
        # Walking: moderate variance plus a small random offset
        # (a crude stand-in for rhythmic motion)
        data.append([random.gauss(0, 0.5) + 0.3 * math.sin(random.random())
                      for _ in range(3)])
        labels.append(1)
        # Running: high variance, high magnitude
        data.append([random.gauss(0, 1.2) + 0.8 for _ in range(3)])
        labels.append(2)
    return data, labels

samples, labels = generate_samples(150)

# Stage 2: ENGINEER FEATURES - Extract domain features from raw axes
def extract_features(sample):
    magnitude = math.sqrt(sum(x**2 for x in sample))
    variance = sum((x - sum(sample)/3)**2 for x in sample) / 3
    max_val = max(abs(x) for x in sample)
    return [magnitude, variance, max_val]

features = [extract_features(s) for s in samples]

# Stage 3: TRAIN - Time-based split (first 70% train, last 30% test)
split = int(0.7 * len(features))
X_train, y_train = features[:split], labels[:split]
X_test, y_test = features[split:], labels[split:]

# Simple k-NN classifier (k=5) -- no external libraries needed
def distance(a, b):
    return math.sqrt(sum((ai - bi)**2 for ai, bi in zip(a, b)))

def knn_predict(X_train, y_train, query, k=5):
    dists = [(distance(query, x), y) for x, y in zip(X_train, y_train)]
    dists.sort(key=lambda d: d[0])
    votes = [d[1] for d in dists[:k]]
    return max(set(votes), key=votes.count)

# Stage 4: EVALUATE
correct = 0
confusion = [[0]*3 for _ in range(3)]
for x, y_true in zip(X_test, y_test):
    y_pred = knn_predict(X_train, y_train, x)
    confusion[y_true][y_pred] += 1
    if y_pred == y_true:
        correct += 1

accuracy = correct / len(X_test)
print(f"Activity Recognition Results (k-NN, k=5)")
print(f"Training samples: {len(X_train)}, Test samples: {len(X_test)}")
print(f"Accuracy: {accuracy:.1%}\n")

activity_names = ["Sitting", "Walking", "Running"]
print("Confusion Matrix:")
print(f"{'':>10} {'Pred Sit':>9} {'Pred Walk':>10} {'Pred Run':>9}")
for i, row in enumerate(confusion):
    total = sum(row)
    recall = row[i] / total if total > 0 else 0
    print(f"{activity_names[i]:>10} {row[0]:>9} {row[1]:>10} {row[2]:>9}  "
          f"(recall: {recall:.0%})")

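# Stage 5: OPTIMIZE - Illustrate quantization impact
# The accuracy deltas are rule-of-thumb values, not measurements.
# For k-NN the "model" is the stored training set, so size is computed from it.
fp32_bytes = len(X_train) * len(X_train[0]) * 4  # 4 bytes per FP32 feature
int8_bytes = fp32_bytes // 4                     # 1 byte per INT8 feature
print("\nQuantization Impact (illustrative):")
print(f"  FP32 accuracy: {accuracy:.1%} (baseline)")
print(f"  INT8 estimate:  {accuracy - 0.02:.1%} (typical 1-3% loss)")
print(f"  INT4 estimate:  {accuracy - 0.06:.1%} (typical 3-8% loss)")
print(f"  Stored training set: FP32={fp32_bytes / 1024:.1f}KB -> "
      f"INT8={int8_bytes / 1024:.1f}KB (fits on ESP32)")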

What to Observe:

  • Sitting is easiest to classify (low movement = distinctive features)
  • Walking vs running confusion shows the value of good feature engineering
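  • The 3 extracted features (magnitude, variance, max) summarize each reading by movement intensity, which is what separates the three activities
  • The split is chronological by sample order (earlier samples train, later samples test); on real time-ordered sensor data this is what avoids data leakage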

2.8 Worked Example: Choosing an ML Model for Smart Building Occupancy

Worked Example: Model Selection and Deployment for Office Occupancy Prediction

Scenario: WeWork operates a co-working space in San Francisco with 8 floors and 200 desks. They want to predict hourly occupancy per floor to optimize HVAC and lighting. Available sensor data includes Wi-Fi connected device counts, CO2 levels (ppm), PIR motion events, and door badge swipes.

Given:

  • Training data: 6 months of hourly readings from 4 sensor types across 8 floors (34,944 samples)
  • Features per sample: Wi-Fi count, CO2 ppm, PIR events/hour, badge swipes/hour, hour-of-day, day-of-week
  • Target: Floor occupancy (0-25 people, regression) or occupancy band (empty/low/medium/high, classification)
  • Deployment constraint: Must run on floor-level Raspberry Pi 4 (4GB RAM, no GPU)
  • Latency requirement: Prediction within 100ms for HVAC pre-conditioning

Step 1 – Compare model families on this dataset:

  • Linear Regression: Training time 2 seconds; inference time 0.1 ms on Raspberry Pi 4; MAE 4.2 people; model size 1 KB. Fast but misses nonlinear occupancy patterns.
  • Random Forest (100 trees): Training time 45 seconds; inference time 8 ms; MAE 1.8 people; model size 12 MB. Good accuracy with interpretable feature importance.
  • Gradient Boosted Trees (XGBoost): Training time 90 seconds; inference time 5 ms; MAE 1.5 people; model size 8 MB. Best accuracy, with slightly slower training.
  • Neural Network (2 hidden layers): Training time 5 minutes; inference time 3 ms; MAE 1.7 people; model size 2 MB. Comparable accuracy but needs GPU support for retraining.
  • k-NN (k=7): Training time 0 seconds because it is lazy; inference time 45 ms; MAE 2.1 people; model size 28 MB because it stores the full dataset. Slowest at inference, using nearly half of the 100 ms budget, and both latency and memory grow with the stored dataset.

Step 2 – Feature importance analysis (from Random Forest):

  • Wi-Fi device count: Importance 0.42. Most direct proxy for occupancy because people carry phones.
  • Hour of day: Importance 0.22. Captures the strong daily pattern, with empty floors at night and peak occupancy from 10am to 2pm.
  • CO2 ppm: Importance 0.18. Correlates with breathing occupants, although it lags the actual count by 15–20 minutes.
  • Day of week: Importance 0.10. Fridays have 40% lower occupancy than Tuesdays.
  • Badge swipes/hour: Importance 0.05. Counts entries only and misses people already inside.
  • PIR events/hour: Importance 0.03. Saturates above 10 occupants because motion is visible everywhere.

Key insight: Wi-Fi count alone predicts occupancy with MAE of 2.8 people. Adding hour-of-day improves to 2.1. The remaining 4 features only improve from 2.1 to 1.5 – diminishing returns. For a simpler deployment, Wi-Fi + time features may be sufficient.

Step 3 – Deployment decision:

  • Selected model: XGBoost (MAE 1.5, 5ms inference, 8 MB)
  • Why not Random Forest: XGBoost has about 17% lower MAE (1.5 vs 1.8 people) with similar inference speed
  • Why not Neural Network: Comparable accuracy but requires GPU for retraining; XGBoost retrains on RPi4 in 90 seconds
  • Quantization: Not needed (5ms inference already well under 100ms limit)
  • Retraining schedule: Monthly, using previous 3 months of data (handles seasonal occupancy shifts)
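
The selection logic in Steps 1 and 3 amounts to filtering candidates by the hardware constraints and then picking the lowest error. The sketch below encodes the Step 1 figures. The `Candidate` type and `select_model` helper are illustrative names, and the size limit passed in is an assumption (the scenario only states 4 GB of RAM). Qualitative criteria such as whether retraining needs a GPU are not modeled.

```python
from typing import NamedTuple, Optional, Sequence

class Candidate(NamedTuple):
    name: str
    inference_ms: float  # measured on the Raspberry Pi 4 in Step 1
    mae_people: float
    size_mb: float

CANDIDATES = [
    Candidate("Linear Regression", 0.1, 4.2, 0.001),
    Candidate("Random Forest", 8, 1.8, 12),
    Candidate("XGBoost", 5, 1.5, 8),
    Candidate("Neural Network", 3, 1.7, 2),
    Candidate("k-NN", 45, 2.1, 28),
]

def select_model(candidates: Sequence[Candidate],
                 latency_budget_ms: float,
                 max_size_mb: float) -> Optional[Candidate]:
    """Keep candidates that meet both constraints, then minimize MAE."""
    feasible = [c for c in candidates
                if c.inference_ms <= latency_budget_ms
                and c.size_mb <= max_size_mb]
    if not feasible:
        return None
    return min(feasible, key=lambda c: c.mae_people)

best = select_model(CANDIDATES, latency_budget_ms=100, max_size_mb=4096)
random_forest = CANDIDATES[1]
mae_gain = (random_forest.mae_people - best.mae_people) / random_forest.mae_people

print(f"Selected: {best.name} (MAE {best.mae_people})")
print(f"MAE reduction vs Random Forest: {mae_gain:.0%}")
```

Tightening `max_size_mb` to 5 changes the answer to the neural network, which shows how strongly the target hardware can drive the choice.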

Step 4 – Production monitoring metrics:

  • MAE (7-day rolling): Threshold greater than 3.0 people. Action: trigger retraining.
  • Feature drift (Wi-Fi count distribution): Threshold KL divergence greater than 0.5. Action: investigate whether a sensor changed.
  • Prediction staleness: Threshold more than 2 consecutive hours with a constant prediction. Action: check Raspberry Pi health.
  • HVAC energy waste: Threshold more than 15% over the manual baseline. Action: review prediction accuracy per floor.
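
Two of these checks, the rolling MAE and the feature-drift test, can be sketched with the standard library. The thresholds (3.0 people, KL divergence 0.5) come from the list above. The class and function names, the 5-wide histogram bins, the smoothing constant, and the sample data are assumptions for illustration only.

```python
import math
from collections import deque

MAE_THRESHOLD = 3.0  # people, rolling window
KL_THRESHOLD = 0.5   # feature drift on the Wi-Fi count distribution

class RollingMAE:
    """Mean absolute error over the most recent `window_size` predictions."""
    def __init__(self, window_size):
        self.errors = deque(maxlen=window_size)

    def update(self, predicted, actual):
        self.errors.append(abs(predicted - actual))
        return sum(self.errors) / len(self.errors)

def histogram(values, bin_width, n_bins):
    """Counts per bin for non-negative values; overflow goes to the last bin."""
    counts = [0] * n_bins
    for v in values:
        counts[min(int(v // bin_width), n_bins - 1)] += 1
    return counts

def kl_divergence(p_counts, q_counts, eps=1e-6):
    """KL(P || Q) in nats, with additive smoothing to avoid log(0)."""
    p_total = sum(p_counts) + eps * len(p_counts)
    q_total = sum(q_counts) + eps * len(q_counts)
    total = 0.0
    for p, q in zip(p_counts, q_counts):
        p_prob = (p + eps) / p_total
        q_prob = (q + eps) / q_total
        total += p_prob * math.log(p_prob / q_prob)
    return total

# Rolling MAE over a 7-day window of hourly predictions (hypothetical values).
monitor = RollingMAE(window_size=7 * 24)
for predicted, actual in [(10, 12), (10, 10), (10, 14), (10, 20), (10, 21)]:
    mae = monitor.update(predicted, actual)
needs_retraining = mae > MAE_THRESHOLD

# Drift check: reference week vs a week where Wi-Fi counts dropped to a third.
reference_wifi = [5, 8, 12, 15, 18, 20, 14, 9] * 21
current_wifi = [v // 3 for v in reference_wifi]
ref_hist = histogram(reference_wifi, bin_width=5, n_bins=5)
cur_hist = histogram(current_wifi, bin_width=5, n_bins=5)
drift = kl_divergence(cur_hist, ref_hist)

print(f"rolling MAE = {mae:.2f}, retrain: {needs_retraining}")
print(f"KL divergence = {drift:.2f}, investigate: {drift > KL_THRESHOLD}")
```

KL divergence is sensitive to empty bins, so the measured value depends on the binning and smoothing choices. A threshold such as 0.5 should be calibrated against weeks known to be healthy.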

Result: XGBoost model deployed on 8 Raspberry Pi 4 devices predicts floor occupancy with MAE of 1.5 people (on a 0-25 scale). HVAC pre-conditioning starts 30 minutes before predicted high occupancy, reducing energy waste by 22% compared to fixed schedule. Monthly retraining keeps accuracy stable across seasonal changes. Total deployment cost: 8 x $55 (RPi4) + $0 (open-source ML stack) = $440 hardware.

Key Insight: For IoT ML model selection, inference speed and model size on the target hardware matter more than marginal accuracy differences. XGBoost and Random Forest are the workhorses of tabular IoT data – they handle mixed feature types, require no normalization, provide feature importance for debugging, and run efficiently on ARM processors without GPU.

2.9 How It Works: The IoT ML Lifecycle

The complete IoT machine learning lifecycle operates as a continuous feedback loop with six distinct stages:

Stage 1: Collect - Data acquisition begins with properly timestamped, labeled sensor streams. For example, a smart building collects temperature (5 bytes), occupancy (binary), and HVAC state (binary) every minute from 500 sensors, generating 43,200 samples per sensor over 30 days. Critical requirement: metadata includes sensor ID, location, and calibration state to enable troubleshooting.

Stage 2: Engineer Features - Raw sensor readings transform into domain-informed features. Instead of feeding raw temperature values to a model, extract lag features (temperature 1 hour ago, 30 minutes ago), gradients (outdoor temperature change per hour), and cyclic encodings (hour-of-day as sin/cos pair). This stage contributes 60-80% of final model accuracy – more than algorithm choice.

Stage 3: Train and Validate - Split data chronologically (not randomly!) into training (70%), validation (15%), and test (15%) sets. Train multiple model families (linear regression, random forest, gradient boosted trees, neural networks) on the training set, tune hyperparameters using the validation set, and report final accuracy on the unseen test set. For tabular IoT data, random forests and gradient boosted trees often outperform neural networks because they handle mixed feature types well and need no feature normalization.

Stage 4: Optimize - Apply pruning (for example, removing 70% of weights), quantization (FP32 → INT8, a 4x size reduction), and knowledge distillation (training a small “student” model to mimic a large “teacher” model). Result: a 500 KB model shrinks to 50 KB with only 1-3% accuracy loss, enabling deployment on microcontrollers.

Stage 5: Deploy - Choose edge (low latency, privacy, offline capability) or cloud (powerful, flexible) based on requirements. Edge example: Alexa wake word detection runs on Cortex-M4 with 8KB model and <100ms latency. Cloud example: full natural language understanding requires 100 MB+ models that are impractical to fit on edge devices.

Stage 6: Monitor - Track model drift (accuracy degradation over time), data quality (sensor failures), and prediction staleness (frozen predictions indicate device failure). Retrain monthly using previous 3 months of data to handle seasonal shifts. Example: HVAC model MAE increases from 0.8°C to 3.0°C over 6 months without retraining due to sensor calibration drift.

The feedback loop: Production monitoring (Stage 6) identifies degraded accuracy, triggering data collection (Stage 1) with new edge cases, feature engineering improvements (Stage 2), and model retraining (Stage 3-4), completing the cycle.
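
Stages 2 and 3 above can be sketched together: build lag, gradient, and cyclic hour features from a temperature series, then split the rows chronologically 70/15/15. The synthetic temperature curve, the dictionary layout, and the function names are assumptions for illustration. Each row uses only values from before its target time step, so the features themselves do not leak the target.

```python
import math

def cyclic_hour(hour):
    """Encode hour-of-day as a (sin, cos) pair so 23:00 and 00:00 are neighbors."""
    angle = 2 * math.pi * hour / 24
    return math.sin(angle), math.cos(angle)

def build_rows(temps, hours):
    """One row per time step t >= 2, using only readings before t."""
    rows = []
    for t in range(2, len(temps)):
        sin_h, cos_h = cyclic_hour(hours[t])
        rows.append({
            "t": t,
            "lag_1": temps[t - 1],                     # reading one step ago
            "lag_2": temps[t - 2],                     # reading two steps ago
            "gradient": temps[t - 1] - temps[t - 2],   # recent rate of change
            "hour_sin": sin_h,
            "hour_cos": cos_h,
            "target": temps[t],
        })
    return rows

def chronological_split(rows, train_pct=70, val_pct=15):
    """Split time-ordered rows without shuffling: past trains, future tests."""
    n = len(rows)
    train_end = n * train_pct // 100
    val_end = n * (train_pct + val_pct) // 100
    return rows[:train_end], rows[train_end:val_end], rows[val_end:]

# Hypothetical 48 hours of hourly temperatures with a daily cycle.
hours = [h % 24 for h in range(48)]
temps = [20 + 5 * math.sin(2 * math.pi * h / 24) for h in range(48)]

rows = build_rows(temps, hours)
train, val, test = chronological_split(rows)
print(len(train), len(val), len(test))
print("train ends at t =", train[-1]["t"], "| test starts at t =", test[0]["t"])
```

Integer percentages are used for the split boundaries to avoid floating-point rounding surprises when computing indices.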

2.10 Summary and Key Takeaways

Key Takeaways

Core Principle: IoT machine learning transforms raw sensor streams into actionable predictions, but success depends far more on data quality and feature engineering than on model complexity.

The IoT ML Lifecycle (6 stages, continuously iterating):

  1. Collect — Gather labeled sensor data with proper timestamping and metadata.
  2. Engineer Features — Extract domain-informed features (time-domain statistics, frequency components, cross-sensor correlations) that capture the physical phenomena of interest.
  3. Train and Validate — Use chronological train/test splits to avoid data leakage; select models appropriate for the problem type (classification, regression, anomaly detection, forecasting).
  4. Optimize — Apply pruning, quantization, and knowledge distillation to fit models onto constrained devices (50 MB to 200 KB is achievable).
  5. Deploy — Choose edge (low-latency, private, offline-capable) or cloud (powerful, flexible) deployment based on requirements.
  6. Monitor — Track model drift, data quality, and prediction accuracy in production; trigger retraining when performance degrades.

Essential Rules of Thumb:

  • Feature engineering is the highest-leverage activity: Domain-specific features grounded in physical understanding consistently outperform brute-force approaches. Invest time here before experimenting with complex algorithms.
  • Temporal integrity is non-negotiable: Always split IoT data by time, not randomly. Data leakage is the most common source of inflated accuracy in IoT ML projects.
  • Edge vs. cloud is a design decision: Latency, privacy, connectivity, and model complexity determine where inference runs. Many production systems use a hybrid approach.
  • TinyML enables on-device intelligence: Techniques like quantization (32-bit to 8-bit) and pruning can shrink models by 4–10x with minimal accuracy loss, enabling deployment on microcontrollers.
  • Production models require monitoring: Concept drift, sensor degradation, and environmental changes mean that deployed models must be continuously monitored and periodically retrained.
  • Start simple, add complexity only when needed: A logistic regression or random forest with good features is often the right starting point. Upgrade to neural networks only when simpler models demonstrably fall short.

2.11 Concept Relationships

Parallel concepts:

  • Feature engineering ↔︎ Edge data reduction: Both transform raw data into compact, meaningful representations
  • Edge ML quantization ↔︎ Data compression: Both sacrifice precision for efficiency with minimal quality loss
  • ML model drift ↔︎ Sensor calibration drift: Both require periodic retraining/recalibration to maintain accuracy

2.12 See Also

Chapter series:

  1. ML Fundamentals - Training vs inference, feature extraction, edge vs cloud
  2. Mobile Sensing - HAR, transportation mode detection
  3. IoT ML Pipeline - 7-step systematic approach
  4. Edge ML & Deployment - Quantization, pruning, TinyML
  5. Audio Feature Processing - MFCC extraction for voice recognition
  6. Feature Engineering - Designing discriminative features
  7. Production ML - Monitoring, drift detection, anomaly detection

2.13 What’s Next