1346  Edge ML and TinyML Deployment

1346.1 Learning Objectives

By the end of this chapter, you will be able to:

  • Decide Edge vs Cloud: Apply the decision framework to determine optimal ML deployment location
  • Implement TinyML: Deploy quantized neural networks on microcontrollers
  • Optimize Models: Use pruning, quantization, and knowledge distillation techniques
  • Design Real-Time Systems: Build predictive control systems with edge ML

1346.2 Prerequisites

Note: Chapter Series: Modeling and Inferencing

This is part 4 of the IoT Machine Learning series:

  1. ML Fundamentals - Core concepts
  2. Mobile Sensing - HAR, transportation
  3. IoT ML Pipeline - 7-step pipeline
  4. Edge ML & Deployment (this chapter) - TinyML, quantization
  5. Audio Feature Processing - MFCC
  6. Feature Engineering - Feature design
  7. Production ML - Monitoring

1346.3 Edge vs Cloud ML: Decision Framework

Whether to process data locally (at the edge) or send it to the cloud is a critical architecture decision:

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ecf0f1'}}}%%
flowchart TB
    Start[ML Deployment<br/>Decision] --> Latency{Latency < 100ms<br/>required?}

    Latency -->|Yes| Privacy{Privacy<br/>critical?}
    Latency -->|No| Connectivity{Reliable<br/>connectivity?}

    Privacy -->|Yes| Edge[Deploy to<br/>Edge Device]
    Privacy -->|No| Bandwidth{High bandwidth<br/>data? e.g. video}

    Connectivity -->|Yes| Cloud[Deploy to<br/>Cloud]
    Connectivity -->|No| Edge

    Bandwidth -->|Yes| Edge
    Bandwidth -->|No| Hybrid{Model complexity<br/>> device capacity?}

    Hybrid -->|Yes| HybridDeploy[Hybrid:<br/>Edge inference +<br/>Cloud training]
    Hybrid -->|No| Edge

    style Start fill:#2C3E50,stroke:#16A085,color:#fff
    style Latency fill:#E67E22,stroke:#2C3E50,color:#fff
    style Privacy fill:#E67E22,stroke:#2C3E50,color:#fff
    style Connectivity fill:#E67E22,stroke:#2C3E50,color:#fff
    style Bandwidth fill:#E67E22,stroke:#2C3E50,color:#fff
    style Hybrid fill:#E67E22,stroke:#2C3E50,color:#fff
    style Edge fill:#27AE60,stroke:#2C3E50,color:#fff
    style Cloud fill:#3498DB,stroke:#2C3E50,color:#fff
    style HybridDeploy fill:#9B59B6,stroke:#2C3E50,color:#fff

Figure 1346.1: Edge vs Cloud ML Deployment Decision Framework

1346.3.1 Decision Factors

| Factor | Choose Edge | Choose Cloud |
|---|---|---|
| Latency | < 100ms required (safety-critical) | 1-5 seconds acceptable |
| Privacy | Sensitive data (health, financial) | Anonymous/aggregated data |
| Bandwidth | High data rate (video, audio) | Low data rate (temperature) |
| Connectivity | Intermittent or offline | Always connected |
| Model Complexity | Simple models (< 1MB) | Complex models (> 10MB) |
| Cost | High cloud costs, many devices | Expensive edge hardware |

1346.3.2 Example Scenarios

| Application | Best Deployment | Reasoning |
|---|---|---|
| Fall Detection | Edge | < 100ms latency critical for safety |
| Voice Assistant Wake Word | Edge | Privacy (always listening) |
| Full Voice Command | Cloud | Complex NLU requires large models |
| Industrial Anomaly Detection | Hybrid | Real-time alerts (edge) + root cause (cloud) |
| Smart Thermostat | Edge | Works offline, simple model |
| Fleet-wide Predictive Maintenance | Cloud | Cross-device learning required |
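
The flowchart above collapses into a few lines of logic. Below is a minimal sketch in Python; the function name and boolean parameters are illustrative, not part of any framework:

def choose_deployment(latency_under_100ms, privacy_critical,
                      reliable_connectivity, high_bandwidth,
                      model_exceeds_device):
    """Mirror of the edge-vs-cloud decision flowchart (illustrative)."""
    if latency_under_100ms:
        if privacy_critical or high_bandwidth:
            return "edge"
        # Latency-bound, but the data itself could be shipped
        if model_exceeds_device:
            return "hybrid: edge inference + cloud training"
        return "edge"
    # Latency is relaxed: connectivity decides
    return "cloud" if reliable_connectivity else "edge"

# Example: fall detection -> tight latency, private health data
print(choose_deployment(True, True, True, False, False))  # "edge"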

1346.4 TinyML and Model Quantization

TinyML enables machine learning on devices with < 1MB RAM and < 1MB storage. The key technique is quantization: reducing numerical precision.

1346.4.1 Quantization Overview

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ecf0f1'}}}%%
flowchart LR
    FP32[FP32 Model<br/>100 MB<br/>98% accuracy] --> INT8[INT8 Quantized<br/>25 MB<br/>96% accuracy]
    INT8 --> INT4[INT4 Quantized<br/>12.5 MB<br/>92% accuracy]

    FP32 -.->|4x smaller<br/>2% accuracy loss| INT8
    INT8 -.->|2x smaller<br/>4% accuracy loss| INT4

    style FP32 fill:#2C3E50,stroke:#16A085,color:#fff
    style INT8 fill:#27AE60,stroke:#2C3E50,color:#fff
    style INT4 fill:#E67E22,stroke:#2C3E50,color:#fff

Figure 1346.2: Model Quantization from FP32 to INT8 and INT4

| Precision | Bits | Size Reduction | Accuracy Loss | Use Case |
|---|---|---|---|---|
| FP32 | 32 | Baseline | 0% | Training, cloud inference |
| FP16 | 16 | 2x | < 1% | GPU inference |
| INT8 | 8 | 4x | 1-3% | Edge devices, ESP32 |
| INT4 | 4 | 8x | 3-8% | MCUs, Cortex-M4 |
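
The size reductions above come from storing each weight with fewer bits. A minimal sketch of symmetric per-tensor INT8 quantization (the array values are made up for illustration):

import numpy as np

# Map FP32 weights to int8 via a single per-tensor scale factor
w = np.array([-1.3, 0.02, 0.8, 2.5], dtype=np.float32)
scale = np.abs(w).max() / 127  # FP32 multiplier kept alongside the int8 data
w_q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)

# Dequantize to measure the rounding error introduced
w_restored = w_q.astype(np.float32) * scale
print(w_q)                           # stored on device: 1 byte per weight
print(np.abs(w - w_restored).max())  # worst-case quantization error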

1346.4.2 TensorFlow Lite Quantization

import numpy as np
import tensorflow as tf

# Train full-precision model (X_train, y_train, X_val, y_val, and
# X_calibration are assumed to be preloaded NumPy arrays)
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(27,)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(5, activation='softmax')  # 5 activity classes
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
model.fit(X_train, y_train, epochs=50, validation_data=(X_val, y_val))

# Convert to TensorFlow Lite with INT8 quantization
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]

# Representative dataset for calibration (required for INT8)
def representative_data_gen():
    for sample in X_calibration[:100]:
        yield [sample.reshape(1, -1).astype(np.float32)]

converter.representative_dataset = representative_data_gen
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

# Convert and save
tflite_model = converter.convert()
with open('activity_classifier_int8.tflite', 'wb') as f:
    f.write(tflite_model)

# Result: 500KB → 125KB (4x reduction)
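
To run the quantized model on-device (or verify it on a workstation), load it with the TFLite interpreter and convert inputs using the scale and zero point embedded in the model. A minimal sketch, assuming x is a single 27-feature float32 NumPy sample:

import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path='activity_classifier_int8.tflite')
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

# Quantize the float input with the model's stored (scale, zero_point)
scale, zero_point = inp['quantization']
x_q = np.round(x / scale + zero_point).astype(np.int8)

interpreter.set_tensor(inp['index'], x_q.reshape(1, -1))
interpreter.invoke()

# Dequantize the int8 output back to class probabilities
out_scale, out_zero = out['quantization']
probs = (interpreter.get_tensor(out['index']).astype(np.float32) - out_zero) * out_scale
print(probs.argmax())  # predicted activity class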

1346.4.3 Quantization Calibration

INT8 quantization requires a representative dataset to calibrate the mapping from FP32 to INT8 ranges:

# Calibration determines min/max values for each layer
# Using 100-1000 representative samples works well

def representative_data_gen():
    # Use validation data (NOT training data) for calibration
    for i in range(min(100, len(X_val))):
        yield [X_val[i:i+1].astype(np.float32)]

Calibration Best Practices:

  • Use 100-1000 samples from the validation set
  • Include diverse examples covering all classes
  • Do NOT use training data (the calibration ranges would overfit the training distribution)

1346.5 Model Optimization Techniques

1346.5.1 1. Pruning

Remove unimportant weights (set to zero):

import tensorflow_model_optimization as tfmot

# Apply pruning during training
prune_low_magnitude = tfmot.sparsity.keras.prune_low_magnitude

pruning_params = {
    'pruning_schedule': tfmot.sparsity.keras.PolynomialDecay(
        initial_sparsity=0.30,
        final_sparsity=0.70,
        begin_step=1000,
        end_step=5000
    )
}

model_for_pruning = prune_low_magnitude(model, **pruning_params)
model_for_pruning.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

# Train with pruning; the UpdatePruningStep callback is required to
# advance the pruning schedule each training step
model_for_pruning.fit(
    X_train, y_train, epochs=10,
    callbacks=[tfmot.sparsity.keras.UpdatePruningStep()]
)

# Strip the pruning wrappers before export so the saved model is small
final_model = tfmot.sparsity.keras.strip_pruning(model_for_pruning)

# Result: 70% of weights become zero → ~3x compression with <2% accuracy loss

1346.5.2 2. Knowledge Distillation

Train a small “student” model to mimic a large “teacher” model:

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ecf0f1'}}}%%
flowchart LR
    Input[Input Data] --> Teacher[Teacher Model<br/>Large, Accurate<br/>10 MB, 98%]
    Input --> Student[Student Model<br/>Small, Fast<br/>500 KB]

    Teacher --> Soft[Soft Labels<br/>Probability dist]
    Student --> StudentOut[Student Output]

    Soft --> Loss[Distillation<br/>Loss]
    StudentOut --> Loss

    Loss --> Update[Update Student<br/>Weights]
    Update --> Student

    style Teacher fill:#2C3E50,stroke:#16A085,color:#fff
    style Student fill:#27AE60,stroke:#2C3E50,color:#fff
    style Soft fill:#E67E22,stroke:#2C3E50,color:#fff
    style Loss fill:#16A085,stroke:#2C3E50,color:#fff

Figure 1346.3: Knowledge Distillation from Teacher to Student Model

Benefits:

  • Student achieves 95% accuracy vs 90% when trained directly on hard labels
  • Soft labels from the teacher contain more information than hard labels
  • Enables deployment of small models with large-model accuracy
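
A minimal sketch of the distillation loss in TensorFlow (the temperature T and weight alpha are hypothetical hyperparameters; both models are assumed to output raw logits):

import tensorflow as tf

T = 4.0      # temperature: softens the teacher's probability distribution
alpha = 0.7  # weight on the distillation term vs. the hard-label term

def distillation_loss(y_true, student_logits, teacher_logits):
    # Cross-entropy against the teacher's softened distribution
    # (equivalent to KL divergence up to a constant); scaled by T^2
    # to keep gradient magnitudes comparable across temperatures
    soft_teacher = tf.nn.softmax(teacher_logits / T)
    log_soft_student = tf.nn.log_softmax(student_logits / T)
    kd = -tf.reduce_mean(
        tf.reduce_sum(soft_teacher * log_soft_student, axis=-1)) * T ** 2

    # Standard cross-entropy against the ground-truth hard labels
    ce = tf.reduce_mean(tf.keras.losses.sparse_categorical_crossentropy(
        y_true, student_logits, from_logits=True))
    return alpha * kd + (1 - alpha) * ce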

1346.6 Worked Example: HVAC Predictive Control with Edge LSTM

Scenario: A smart building requires temperature predictions 60 minutes ahead for efficient HVAC pre-conditioning. The edge gateway has 512KB RAM and must work offline.

1346.6.1 System Architecture

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ecf0f1'}}}%%
flowchart TB
    Indoor[Indoor Temp<br/>Sensor] --> Buffer[Circular<br/>Buffer]
    Outdoor[Outdoor Temp<br/>Sensor] --> Buffer
    Occ[Occupancy<br/>Sensor] --> Buffer
    HVAC[HVAC State<br/>Sensor] --> Buffer

    Buffer --> Features[Feature<br/>Engineering]
    Features --> Model[Quantized LSTM<br/>50 KB]
    Model --> Predict[Predict Temp<br/>+60 min]

    Predict --> Decision{Will temp<br/>be out of<br/>range?}
    Decision -->|Yes, hot| PreCool[Pre-Cool<br/>Start HVAC]
    Decision -->|Yes, cold| PreHeat[Pre-Heat<br/>Start HVAC]
    Decision -->|No| Wait[Wait<br/>Continue monitoring]

    style Indoor fill:#E67E22,stroke:#2C3E50,color:#fff
    style Outdoor fill:#E67E22,stroke:#2C3E50,color:#fff
    style Model fill:#16A085,stroke:#2C3E50,color:#fff
    style Predict fill:#27AE60,stroke:#2C3E50,color:#fff

Figure 1346.4: HVAC Predictive Control Pipeline with Edge LSTM Model

1346.6.2 Step 1: Data Preparation

Raw Data Collection (30 days of historical readings):

| Metric | Value | Notes |
|---|---|---|
| Total samples | 43,200 | 30 days × 24 hours × 60 minutes |
| Sampling rate | 1 minute | Standard HVAC sensor frequency |
| Sensors | 4 | Indoor temp, outdoor temp, occupancy, HVAC state |
| Storage | ~2 MB | CSV format with timestamps |

Train/Validation/Test Split (time-series aware):

| Split | Samples | Percentage | Date Range |
|---|---|---|---|
| Training | 30,240 | 70% | Days 1-21 |
| Validation | 6,480 | 15% | Days 22-25.5 |
| Test | 6,480 | 15% | Days 25.5-30 |
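
Because this is a time series, the split must be chronological. A minimal sketch, assuming data is a NumPy array of the 43,200 samples ordered by timestamp:

import numpy as np

n = len(data)              # 43,200 minute-level samples
train_end = int(n * 0.70)  # 30,240 -> days 1-21
val_end = int(n * 0.85)    # +6,480 -> days 22-25.5

train = data[:train_end]
val = data[train_end:val_end]
test = data[val_end:]      # final 6,480 -> days 25.5-30
# Never shuffle before splitting: random splits leak future readings
# into the training set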

1346.6.3 Step 2: Feature Engineering

Feature Set (14 features engineered from 4 raw sensors; top 7 by importance shown):

| Feature Name | Type | Importance | Rationale |
|---|---|---|---|
| temp_lag_1h | Numeric | 0.35 | Strong autocorrelation |
| temp_lag_30m | Numeric | 0.22 | Recent trend indicator |
| outdoor_temp | Numeric | 0.18 | Heat transfer driver |
| outdoor_delta_1h | Numeric | 0.08 | Predicts future heat load |
| hour_sin | Cyclic | 0.05 | Daily temperature cycle |
| hour_cos | Cyclic | 0.03 | Daily cycle phase |
| occupancy | Binary | 0.04 | Body heat, door openings |

Tip: Why Cyclic Encoding for Time?

Problem: Hour of day (0-23) has a circular relationship: 11 PM is close to 1 AM, even though the raw values 23 and 1 are far apart numerically.

Bad approach: Raw hour value (0-23) → Model thinks 0 and 23 are far apart!

Good approach: Encode each hour h as the pair (sin(2πh/24), cos(2πh/24)):

  • Hour 0: sin=0, cos=1
  • Hour 6: sin=1, cos=0
  • Hour 12: sin=0, cos=-1
  • Hour 23: sin≈-0.26, cos≈0.97 (close to hour 0!)

This preserves the circular relationship.
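
A minimal sketch of the encoding, assuming hour is the integer hour of day:

import numpy as np

hour = np.arange(24)
hour_sin = np.sin(2 * np.pi * hour / 24)
hour_cos = np.cos(2 * np.pi * hour / 24)

# Hours 23 and 0 end up adjacent in (sin, cos) space:
print(hour_sin[23], hour_cos[23])  # ≈ -0.26, 0.97
print(hour_sin[0], hour_cos[0])    # 0.0, 1.0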

1346.6.4 Step 3: Model Selection

| Model | MAE (deg C) | Size (KB) | Inference (ms) | Fits 512KB? |
|---|---|---|---|---|
| Linear Regression | 1.8 | 2 | 0.1 | Yes |
| Random Forest (100 trees) | 0.9 | 450 | 25 | Yes (tight) |
| LSTM (2 layers, 32 units) | 0.7 | 200 | 50 | Yes |
| Quantized LSTM (INT8) | 0.8 | 50 | 15 | Yes |

Winner: Quantized LSTM - Only 0.1 deg C accuracy loss, 4x smaller, 3x faster

1346.6.5 Step 4: Quantization

import numpy as np
import tensorflow as tf

# Train full-precision LSTM (X_train windows have shape (n, 60, 14))
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, return_sequences=True, input_shape=(60, 14)),
    tf.keras.layers.LSTM(16),
    tf.keras.layers.Dense(1)  # Predict temperature
])
model.compile(optimizer='adam', loss='mse')
model.fit(X_train, y_train, epochs=50, validation_data=(X_val, y_val))

# Convert to TensorFlow Lite with INT8 quantization
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]

# Calibration generator must yield (1, 60, 14) float32 windows
def representative_data_gen():
    for i in range(min(100, len(X_val))):
        yield [X_val[i:i+1].astype(np.float32)]

converter.representative_dataset = representative_data_gen
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
tflite_model = converter.convert()

# Save quantized model
with open('hvac_predictor_int8.tflite', 'wb') as f:
    f.write(tflite_model)
# Result: 200KB → 50KB (75% reduction)
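
The LSTM consumes 60-minute windows of the 14 features. A minimal sketch of the window construction, assuming features is a (n_samples, 14) array at 1-minute resolution and indoor_temp holds the target sensor readings (both names are illustrative):

import numpy as np

def make_windows(features, indoor_temp, window=60, horizon=60):
    """Label each 60-min feature window with the temperature 60 min later."""
    X, y = [], []
    for t in range(window, len(features) - horizon):
        X.append(features[t - window:t])    # shape (60, 14)
        y.append(indoor_temp[t + horizon])  # temp +60 min ahead
    return np.array(X, dtype=np.float32), np.array(y, dtype=np.float32)

# train_features / train_temp: the chronological training split from Step 1
X_train, y_train = make_windows(train_features, train_temp)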

1346.6.6 Step 5: Results

Deployment Validation:

| Metric | Target | Achieved | Status |
|---|---|---|---|
| Model size | < 512KB | 50KB | Pass |
| Inference latency | < 1000ms | 15ms | Pass (66x margin) |
| MAE accuracy | < 1.0 deg C | 0.8 deg C | Pass |
| Works offline | Required | Yes | Pass |

Energy Savings:

Pre-deployment (reactive HVAC):
- HVAC cycles: 24 per day (start/stop at setpoint)
- Energy waste: HVAC runs until room reaches setpoint, then overshoots

Post-deployment (predictive HVAC):
- HVAC pre-starts 30-45 min before needed
- Reduces cycling: 24 → 18 cycles/day (25% fewer)

Monthly energy comparison:
- Baseline: 1,200 kWh/month
- Predictive: 1,020 kWh/month
- Savings: 180 kWh/month (15% reduction)

Annual savings (10-story office building):
- 180 kWh × 12 months × $0.12/kWh = $259/floor/year
- 10 floors = $2,590/year
- Edge gateway cost: $150 one-time
- Payback period: 0.7 months

Note: Key Takeaways from HVAC Example
  1. Feature engineering drives accuracy: Lag features contributed 57% of model importance
  2. Quantization is nearly free: INT8 reduced size by 75% with only 0.1 deg C loss
  3. Time-series requires chronological splits: Random splits cause data leakage
  4. Cyclic encoding for time: sin/cos encoding preserves circular relationships
  5. Business value justifies complexity: 15% energy savings pays for development in < 1 month

1346.7 Edge ML Tradeoff

Tip: Tradeoff: Model Accuracy vs Edge Deployability

Option A: High-accuracy complex models (deep neural networks, large ensembles)

Option B: Lightweight models optimized for edge deployment (decision trees, quantized CNNs)

Decision Factors:

  • Complex models achieve 95-98% accuracy but require MBs of RAM and GPU/NPU acceleration
  • Lightweight models achieve 88-93% accuracy but fit in < 256KB and run on basic MCUs

For battery-powered wearables with 5+ day battery life requirements: choose lightweight

For edge AI accelerators (Jetson, Coral) with consistent power: choose complex models

Sweet spot: Quantized models - INT8 quantization provides 4x size reduction with only 1-2% accuracy loss.

1346.8 Knowledge Check

Question 1: What is the main challenge in deploying neural networks for real-time inference on IoT edge devices?

Explanation: Edge devices face tight resource constraints: (1) Memory: a typical MCU has 256KB of RAM; (2) Processing: the MCU runs at around 80 MHz; (3) Energy: battery-powered devices cannot sustain > 5mW continuously. Solutions: quantization (4x smaller), pruning (~3x compression), efficient architectures (MobileNet, TinyML).

Question 2: A smartphone continuously samples accelerometer at 50 Hz consuming 10 mW. Implementing duty cycling (1 sec active, 4 sec sleep) reduces power consumption to what level?

Explanation: Duty cycle = active time / total time = 1 sec / 5 sec = 20%. Average power = 10 mW × 0.2 + 0 mW × 0.8 = 2 mW (80% savings). Battery impact: Continuous = 38 days, Duty cycled = 192 days.
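
The arithmetic from Question 2 as a quick sketch:

# Duty-cycled power: 1 s active out of every 5 s
active_power_mw = 10.0
duty_cycle = 1 / 5
avg_power_mw = active_power_mw * duty_cycle   # 2.0 mW
savings = 1 - avg_power_mw / active_power_mw  # 0.8 -> 80% savings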

1346.9 Summary

This chapter covered edge ML and TinyML deployment:

  • Edge vs Cloud Decision: Use edge for low latency, privacy, and offline requirements
  • Quantization: INT8 reduces model size by 4x with 1-3% accuracy loss
  • Pruning: Remove 70% of weights with <2% accuracy loss
  • Knowledge Distillation: Train small models to match large model performance
  • HVAC Example: Quantized LSTM achieves 15% energy savings with 0.7-month payback

Key Insight: Start with the simplest model that meets constraints. Quantization is nearly free—always try it before abandoning edge deployment.

1346.10 What’s Next

Continue to Audio Feature Processing to learn MFCC feature extraction for voice recognition and keyword spotting on edge devices.