%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ecf0f1'}}}%%
flowchart TB
Start[ML Deployment<br/>Decision] --> Latency{Latency < 100ms<br/>required?}
Latency -->|Yes| Privacy{Privacy<br/>critical?}
Latency -->|No| Connectivity{Reliable<br/>connectivity?}
Privacy -->|Yes| Edge[Deploy to<br/>Edge Device]
Privacy -->|No| Bandwidth{High bandwidth<br/>data? e.g. video}
Connectivity -->|Yes| Cloud[Deploy to<br/>Cloud]
Connectivity -->|No| Edge
Bandwidth -->|Yes| Edge
Bandwidth -->|No| Hybrid{Model complexity<br/>> device capacity?}
Hybrid -->|Yes| HybridDeploy[Hybrid:<br/>Edge inference +<br/>Cloud training]
Hybrid -->|No| Edge
style Start fill:#2C3E50,stroke:#16A085,color:#fff
style Latency fill:#E67E22,stroke:#2C3E50,color:#fff
style Privacy fill:#E67E22,stroke:#2C3E50,color:#fff
style Connectivity fill:#E67E22,stroke:#2C3E50,color:#fff
style Bandwidth fill:#E67E22,stroke:#2C3E50,color:#fff
style Hybrid fill:#E67E22,stroke:#2C3E50,color:#fff
style Edge fill:#27AE60,stroke:#2C3E50,color:#fff
style Cloud fill:#3498DB,stroke:#2C3E50,color:#fff
style HybridDeploy fill:#9B59B6,stroke:#2C3E50,color:#fff
1346 Edge ML and TinyML Deployment
1346.1 Learning Objectives
By the end of this chapter, you will be able to:
- Decide Edge vs Cloud: Apply the decision framework to determine optimal ML deployment location
- Implement TinyML: Deploy quantized neural networks on microcontrollers
- Optimize Models: Use pruning, quantization, and knowledge distillation techniques
- Design Real-Time Systems: Build predictive control systems with edge ML
1346.2 Prerequisites
- ML Fundamentals: Training vs inference concepts
- IoT ML Pipeline: 7-step ML pipeline
- Basic understanding of microcontrollers (ESP32, Cortex-M4)
This is part 4 of the IoT Machine Learning series:
- ML Fundamentals - Core concepts
- Mobile Sensing - HAR, transportation
- IoT ML Pipeline - 7-step pipeline
- Edge ML & Deployment (this chapter) - TinyML, quantization
- Audio Feature Processing - MFCC
- Feature Engineering - Feature design
- Production ML - Monitoring
1346.3 Edge vs Cloud ML: Decision Framework
Whether to process data locally (edge) or send it to the cloud is a critical architecture decision:
1346.3.1 Decision Factors
| Factor | Choose Edge | Choose Cloud |
|---|---|---|
| Latency | < 100ms required (safety-critical) | 1-5 seconds acceptable |
| Privacy | Sensitive data (health, financial) | Anonymous/aggregated data |
| Bandwidth | High data rate (video, audio) | Low data rate (temperature) |
| Connectivity | Intermittent or offline | Always connected |
| Model Complexity | Simple models (< 1MB) | Complex models (> 10MB) |
| Cost | High cloud costs, many devices | Expensive edge hardware |
1346.3.2 Example Scenarios
| Application | Best Deployment | Reasoning |
|---|---|---|
| Fall Detection | Edge | < 100ms latency critical for safety |
| Voice Assistant Wake Word | Edge | Privacy (always listening) |
| Full Voice Command | Cloud | Complex NLU requires large models |
| Industrial Anomaly Detection | Hybrid | Real-time alerts (edge) + root cause (cloud) |
| Smart Thermostat | Edge | Works offline, simple model |
| Fleet-wide Predictive Maintenance | Cloud | Cross-device learning required |
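The decision flowchart at the top of the chapter can be condensed into a few lines of code. The sketch below is only illustrative; the function name and boolean parameters are made up for this example and are not part of any framework:

def choose_deployment(latency_under_100ms, privacy_critical, reliable_connectivity,
                      high_bandwidth_data, model_exceeds_device_capacity):
    # Mirrors the Edge-vs-Cloud decision flowchart above
    if latency_under_100ms:
        if privacy_critical or high_bandwidth_data:
            return "edge"
        if model_exceeds_device_capacity:
            return "hybrid"   # edge inference + cloud training
        return "edge"
    # Latency is relaxed, so connectivity decides
    return "cloud" if reliable_connectivity else "edge"

# Example: fall detection is latency-critical and privacy-sensitive
print(choose_deployment(True, True, True, False, False))  # -> edge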
1346.4 TinyML and Model Quantization
TinyML enables machine learning on devices with < 1MB RAM and < 1MB storage. The key technique is quantization—reducing numerical precision.
1346.4.1 Quantization Overview
%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ecf0f1'}}}%%
flowchart LR
FP32[FP32 Model<br/>100 MB<br/>98% accuracy] --> INT8[INT8 Quantized<br/>25 MB<br/>96% accuracy]
INT8 --> INT4[INT4 Quantized<br/>12.5 MB<br/>92% accuracy]
FP32 -.->|4x smaller<br/>2% accuracy loss| INT8
INT8 -.->|2x smaller<br/>4% accuracy loss| INT4
style FP32 fill:#2C3E50,stroke:#16A085,color:#fff
style INT8 fill:#27AE60,stroke:#2C3E50,color:#fff
style INT4 fill:#E67E22,stroke:#2C3E50,color:#fff
| Precision | Bits | Size Reduction | Accuracy Loss | Use Case |
|---|---|---|---|---|
| FP32 | 32 | Baseline | 0% | Training, cloud inference |
| FP16 | 16 | 2x | < 1% | GPU inference |
| INT8 | 8 | 4x | 1-3% | Edge devices, ESP32 |
| INT4 | 4 | 8x | 3-8% | MCUs, Cortex-M4 |
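To make the table concrete, here is a framework-free sketch of affine INT8 quantization: a float tensor is mapped to 8-bit integers through a per-tensor scale and zero point. This illustrates the idea only; it is not TensorFlow Lite's exact implementation:

import numpy as np

weights = np.random.randn(128).astype(np.float32)   # example FP32 weights

# Map [w_min, w_max] onto the 256 representable int8 levels
w_min, w_max = float(weights.min()), float(weights.max())
scale = (w_max - w_min) / 255.0
zero_point = int(round(-128 - w_min / scale))

q = np.clip(np.round(weights / scale) + zero_point, -128, 127).astype(np.int8)  # 4x smaller
dequantized = (q.astype(np.float32) - zero_point) * scale                        # approximate recovery

print("max rounding error:", float(np.abs(weights - dequantized).max()))         # about scale/2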
1346.4.2 TensorFlow Lite Quantization
import numpy as np
import tensorflow as tf

# Train a full-precision model (X_train, y_train, X_val, y_val prepared earlier)
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(27,)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(5, activation='softmax')  # 5 activity classes
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
model.fit(X_train, y_train, epochs=50, validation_data=(X_val, y_val))

# Convert to TensorFlow Lite with full INT8 quantization
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]

# Representative dataset for calibration (required for INT8)
def representative_data_gen():
    for sample in X_calibration[:100]:
        yield [sample.reshape(1, -1).astype(np.float32)]

converter.representative_dataset = representative_data_gen
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

# Convert and save
tflite_model = converter.convert()
with open('activity_classifier_int8.tflite', 'wb') as f:
    f.write(tflite_model)
# Result: 500KB → 125KB (4x reduction)
1346.4.3 Quantization Calibration
INT8 quantization requires a representative dataset to calibrate the mapping from FP32 to INT8 ranges:
# Calibration determines the min/max value ranges for each layer
# Using 100-1000 representative samples works well
def representative_data_gen():
    # Use validation data (NOT training data) for calibration
    for i in range(min(100, len(X_val))):
        yield [X_val[i:i+1].astype(np.float32)]

Calibration Best Practices:
- Use 100-1000 samples from the validation set
- Include diverse examples covering all classes
- Do NOT use training data (calibration would overfit to the training distribution)
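Before deploying, the quantized model can be sanity-checked on a workstation with the TensorFlow Lite interpreter. The sketch below assumes the activity_classifier_int8.tflite file and the X_val array from the code above:

import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path='activity_classifier_int8.tflite')
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

# The INT8 input expects values mapped through its scale/zero-point
scale, zero_point = inp['quantization']
x = np.round(X_val[0:1] / scale + zero_point).astype(np.int8)

interpreter.set_tensor(inp['index'], x)
interpreter.invoke()
scores = interpreter.get_tensor(out['index'])
print("predicted activity class:", int(scores.argmax()))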
1346.5 Model Optimization Techniques
1346.5.1 1. Pruning
Remove unimportant weights (set to zero):
import tensorflow_model_optimization as tfmot

# Apply magnitude-based pruning during training
prune_low_magnitude = tfmot.sparsity.keras.prune_low_magnitude
pruning_params = {
    'pruning_schedule': tfmot.sparsity.keras.PolynomialDecay(
        initial_sparsity=0.30,
        final_sparsity=0.70,
        begin_step=1000,
        end_step=5000
    )
}
model_for_pruning = prune_low_magnitude(model, **pruning_params)
model_for_pruning.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

# Train with pruning (the UpdatePruningStep callback advances the sparsity schedule)
model_for_pruning.fit(
    X_train, y_train, epochs=10,
    callbacks=[tfmot.sparsity.keras.UpdatePruningStep()]
)

# Strip the pruning wrappers before export
final_model = tfmot.sparsity.keras.strip_pruning(model_for_pruning)

# Result: 70% of weights become zero → ~3x compression with <2% accuracy loss

1346.5.2 2. Knowledge Distillation
Train a small “student” model to mimic a large “teacher” model:
%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ecf0f1'}}}%%
flowchart LR
Input[Input Data] --> Teacher[Teacher Model<br/>Large, Accurate<br/>10 MB, 98%]
Input --> Student[Student Model<br/>Small, Fast<br/>500 KB]
Teacher --> Soft[Soft Labels<br/>Probability dist]
Student --> StudentOut[Student Output]
Soft --> Loss[Distillation<br/>Loss]
StudentOut --> Loss
Loss --> Update[Update Student<br/>Weights]
Update --> Student
style Teacher fill:#2C3E50,stroke:#16A085,color:#fff
style Student fill:#27AE60,stroke:#2C3E50,color:#fff
style Soft fill:#E67E22,stroke:#2C3E50,color:#fff
style Loss fill:#16A085,stroke:#2C3E50,color:#fff
Benefits:
- The student can reach 95% accuracy versus 90% when trained directly on hard labels
- Soft labels from the teacher carry more information than hard labels
- Enables deployment of small models with near large-model accuracy
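As a rough sketch of how the distillation loss in the diagram can be implemented in Keras (teacher, student, temperature, and alpha are assumptions for this example, not objects defined elsewhere in the chapter):

import tensorflow as tf

temperature = 3.0   # softens the teacher's probability distribution
alpha = 0.5         # balance between soft-label and hard-label terms

def distillation_loss(y_true, student_logits, teacher_logits):
    # Soft targets: teacher probabilities at elevated temperature
    soft_targets = tf.nn.softmax(teacher_logits / temperature)
    soft_preds = tf.nn.softmax(student_logits / temperature)
    kd = tf.keras.losses.KLDivergence()(soft_targets, soft_preds) * temperature ** 2
    # Hard-label loss on the true classes
    ce = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)(y_true, student_logits)
    return alpha * kd + (1 - alpha) * ce

optimizer = tf.keras.optimizers.Adam()

@tf.function
def train_step(x, y):
    teacher_logits = teacher(x, training=False)      # frozen teacher
    with tf.GradientTape() as tape:
        student_logits = student(x, training=True)
        loss = distillation_loss(y, student_logits, teacher_logits)
    grads = tape.gradient(loss, student.trainable_variables)
    optimizer.apply_gradients(zip(grads, student.trainable_variables))
    return loss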
1346.6 Worked Example: HVAC Predictive Control with Edge LSTM
Scenario: A smart building requires temperature predictions 60 minutes ahead for efficient HVAC pre-conditioning. The edge gateway has 512KB RAM and must work offline.
1346.6.1 System Architecture
%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ecf0f1'}}}%%
flowchart TB
Indoor[Indoor Temp<br/>Sensor] --> Buffer[Circular<br/>Buffer]
Outdoor[Outdoor Temp<br/>Sensor] --> Buffer
Occ[Occupancy<br/>Sensor] --> Buffer
HVAC[HVAC State<br/>Sensor] --> Buffer
Buffer --> Features[Feature<br/>Engineering]
Features --> Model[Quantized LSTM<br/>50 KB]
Model --> Predict[Predict Temp<br/>+60 min]
Predict --> Decision{Will temp<br/>be out of<br/>range?}
Decision -->|Yes, hot| PreCool[Pre-Cool<br/>Start HVAC]
Decision -->|Yes, cold| PreHeat[Pre-Heat<br/>Start HVAC]
Decision -->|No| Wait[Wait<br/>Continue monitoring]
style Indoor fill:#E67E22,stroke:#2C3E50,color:#fff
style Outdoor fill:#E67E22,stroke:#2C3E50,color:#fff
style Model fill:#16A085,stroke:#2C3E50,color:#fff
style Predict fill:#27AE60,stroke:#2C3E50,color:#fff
1346.6.2 Step 1: Data Preparation
Raw Data Collection (30 days of historical readings):
| Metric | Value | Notes |
|---|---|---|
| Total samples | 43,200 | 30 days × 24 hours × 60 minutes |
| Sampling rate | 1 minute | Standard HVAC sensor frequency |
| Sensors | 4 | Indoor temp, outdoor temp, occupancy, HVAC state |
| Storage | ~2 MB | CSV format with timestamps |
Train/Validation/Test Split (time-series aware):
| Split | Samples | Percentage | Date Range |
|---|---|---|---|
| Training | 30,240 | 70% | Days 1-21 |
| Validation | 6,480 | 15% | Days 22-25.5 |
| Test | 6,480 | 15% | Days 25.5-30 |
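A minimal sketch of the chronological split (sensor_log is a placeholder for the 43,200 readings, oldest first; random shuffling would leak future information into training):

n = len(sensor_log)                     # 43,200 one-minute samples
i_train = int(0.70 * n)                 # end of day 21
i_val = int(0.85 * n)                   # end of day 25.5

train = sensor_log[:i_train]            # days 1-21
val = sensor_log[i_train:i_val]         # days 22-25.5
test = sensor_log[i_val:]               # days 25.5-30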
1346.6.3 Step 2: Feature Engineering
Feature Set (14 features engineered from the 4 raw sensors; the most important are shown below):
| Feature Name | Type | Importance | Rationale |
|---|---|---|---|
| temp_lag_1h | Numeric | 0.35 | Strong autocorrelation |
| temp_lag_30m | Numeric | 0.22 | Recent trend indicator |
| outdoor_temp | Numeric | 0.18 | Heat transfer driver |
| outdoor_delta_1h | Numeric | 0.08 | Predicts future heat load |
| hour_sin | Cyclic | 0.05 | Daily temperature cycle |
| hour_cos | Cyclic | 0.03 | Daily cycle phase |
| occupancy | Binary | 0.04 | Body heat, door openings |
Problem: Hour of day (0-23) wraps around: 11 PM is close to 1 AM, but a raw numeric value cannot express that.
Bad approach: Raw hour value (0-23) → the model treats hours 0 and 23 as far apart!
Good approach: Encode each hour as a (sin, cos) pair:
- Hour 0: sin=0, cos=1
- Hour 6: sin=1, cos=0
- Hour 12: sin=0, cos=-1
- Hour 23: sin≈-0.26, cos≈0.97 (close to hour 0!)
This preserves the circular relationship.
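A minimal NumPy sketch of the encoding:

import numpy as np

hours = np.arange(24)
hour_sin = np.sin(2 * np.pi * hours / 24)
hour_cos = np.cos(2 * np.pi * hours / 24)

# Hours 23 and 0 are now neighbours (~0.26 apart) instead of 23 units apart
gap = np.hypot(hour_sin[23] - hour_sin[0], hour_cos[23] - hour_cos[0])
print(round(float(gap), 2))   # 0.26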
1346.6.4 Step 3: Model Selection
| Model | MAE (deg C) | Size (KB) | Inference (ms) | Fits 512KB? |
|---|---|---|---|---|
| Linear Regression | 1.8 | 2 | 0.1 | Yes |
| Random Forest (100 trees) | 0.9 | 450 | 25 | Yes (tight) |
| LSTM (2 layers, 32 units) | 0.7 | 200 | 50 | Yes |
| Quantized LSTM (INT8) | 0.8 | 50 | 15 | Yes |
Winner: Quantized LSTM - Only 0.1 deg C accuracy loss, 4x smaller, 3x faster
1346.6.5 Step 4: Quantization
import tensorflow as tf

# Train full-precision LSTM on 60-minute windows of the 14 features
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, return_sequences=True, input_shape=(60, 14)),
    tf.keras.layers.LSTM(16),
    tf.keras.layers.Dense(1)  # Predict temperature
])
model.compile(optimizer='adam', loss='mse')
model.fit(X_train, y_train, epochs=50, validation_data=(X_val, y_val))

# Convert to TensorFlow Lite with INT8 quantization
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
# representative_data_gen must yield float32 windows of shape (1, 60, 14);
# with a representative dataset, weights and activations are quantized to INT8
# where supported, with float fallback for ops without integer kernels
converter.representative_dataset = representative_data_gen
tflite_model = converter.convert()

# Save quantized model
with open('hvac_predictor_int8.tflite', 'wb') as f:
    f.write(tflite_model)

# Result: 200KB → 50KB (75% reduction)

1346.6.6 Step 5: Results
Deployment Validation:
| Metric | Target | Achieved | Status |
|---|---|---|---|
| Model size | < 512KB | 50KB | Pass |
| Inference latency | < 1000ms | 15ms | Pass (66x margin) |
| MAE accuracy | < 1.0 deg C | 0.8 deg C | Pass |
| Works offline | Required | Yes | Pass |
Energy Savings:
Pre-deployment (reactive HVAC):
- HVAC cycles: 24 per day (start/stop at setpoint)
- Energy waste: HVAC runs until room reaches setpoint, then overshoots
Post-deployment (predictive HVAC):
- HVAC pre-starts 30-45 min before needed
- Reduces cycling: 24 → 18 cycles/day (25% fewer)
Monthly energy comparison:
- Baseline: 1,200 kWh/month
- Predictive: 1,020 kWh/month
- Savings: 180 kWh/month (15% reduction)
Annual savings (10-story office building):
- 180 kWh × 12 months × $0.12/kWh = $259/floor/year
- 10 floors = $2,590/year
- Edge gateway cost: $150 one-time
- Payback period: 0.7 months
Key takeaways:
- Feature engineering drives accuracy: Lag features contributed 57% of model importance
- Quantization is nearly free: INT8 reduced size by 75% with only 0.1 deg C loss
- Time-series requires chronological splits: Random splits cause data leakage
- Cyclic encoding for time: sin/cos encoding preserves circular relationships
- Business value justifies complexity: 15% energy savings pays for development in < 1 month
1346.7 Edge ML Tradeoff
Option A: High-accuracy complex models (deep neural networks, large ensembles)
Option B: Lightweight models optimized for edge deployment (decision trees, quantized CNNs)
Decision Factors:
- Complex models achieve 95-98% accuracy but require megabytes of RAM and GPU/NPU acceleration
- Lightweight models achieve 88-93% accuracy but fit in <256KB and run on basic MCUs
- For battery-powered wearables with 5+ day battery-life requirements: choose lightweight models
- For edge AI accelerators (Jetson, Coral) with consistent power: choose complex models
Sweet spot: Quantized models - INT8 quantization provides 4x size reduction with only 1-2% accuracy loss.
1346.8 Knowledge Check
Question 1: What is the main challenge in deploying neural networks for real-time inference on IoT edge devices?
Explanation: Edge devices face three resource constraints: (1) Memory: a typical MCU has only ~256KB of RAM, (2) Processing: the MCU runs at ~80 MHz, (3) Energy: battery-powered devices cannot drain >5mW continuously. Solutions: quantization (4x smaller), pruning (~3x compression), and efficient architectures (MobileNet, TinyML).
Question 2: A smartphone continuously samples accelerometer at 50 Hz consuming 10 mW. Implementing duty cycling (1 sec active, 4 sec sleep) reduces power consumption to what level?
Explanation: Duty cycle = active time / total time = 1 sec / 5 sec = 20%. Average power = 10 mW × 0.2 + 0 mW × 0.8 = 2 mW (80% savings). Battery impact: Continuous = 38 days, Duty cycled = 192 days.
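A few lines of Python reproduce the arithmetic (sleep-mode power is assumed to be approximately 0 mW):

active_power_mw = 10.0
active_s, sleep_s = 1.0, 4.0

duty_cycle = active_s / (active_s + sleep_s)        # 0.20
avg_power_mw = active_power_mw * duty_cycle         # 2.0 mW, an 80% saving
battery_life_scaling = 1 / duty_cycle               # battery life grows roughly 5x
print(duty_cycle, avg_power_mw, battery_life_scaling)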
1346.9 Summary
This chapter covered edge ML and TinyML deployment:
- Edge vs Cloud Decision: Use edge for low latency, privacy, and offline requirements
- Quantization: INT8 reduces model size by 4x with 1-3% accuracy loss
- Pruning: Remove 70% of weights with <2% accuracy loss
- Knowledge Distillation: Train small models to match large model performance
- HVAC Example: Quantized LSTM achieves 15% energy savings with 0.7-month payback
Key Insight: Start with the simplest model that meets constraints. Quantization is nearly free—always try it before abandoning edge deployment.
1346.10 What’s Next
Continue to Audio Feature Processing to learn MFCC feature extraction for voice recognition and keyword spotting on edge devices.