1346  Edge ML and TinyML Deployment

1346.1 Learning Objectives

By the end of this chapter, you will be able to:

  • Decide Edge vs Cloud: Apply the decision framework to determine optimal ML deployment location
  • Implement TinyML: Deploy quantized neural networks on microcontrollers
  • Optimize Models: Use pruning, quantization, and knowledge distillation techniques
  • Design Real-Time Systems: Build predictive control systems with edge ML

1346.2 Prerequisites

Note: Chapter Series: Modeling and Inferencing

This is part 4 of the IoT Machine Learning series:

  1. ML Fundamentals - Core concepts
  2. Mobile Sensing - HAR, transportation
  3. IoT ML Pipeline - 7-step pipeline
  4. Edge ML & Deployment (this chapter) - TinyML, quantization
  5. Audio Feature Processing - MFCC
  6. Feature Engineering - Feature design
  7. Production ML - Monitoring

1346.3 Edge vs Cloud ML: Decision Framework

Whether to process data locally (at the edge) or send it to the cloud is a critical architecture decision:

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ecf0f1'}}}%%
flowchart TB
    Start[ML Deployment<br/>Decision] --> Latency{Latency < 100ms<br/>required?}

    Latency -->|Yes| Privacy{Privacy<br/>critical?}
    Latency -->|No| Connectivity{Reliable<br/>connectivity?}

    Privacy -->|Yes| Edge[Deploy to<br/>Edge Device]
    Privacy -->|No| Bandwidth{High bandwidth<br/>data? e.g. video}

    Connectivity -->|Yes| Cloud[Deploy to<br/>Cloud]
    Connectivity -->|No| Edge

    Bandwidth -->|Yes| Edge
    Bandwidth -->|No| Hybrid{Model complexity<br/>> device capacity?}

    Hybrid -->|Yes| HybridDeploy[Hybrid:<br/>Edge inference +<br/>Cloud training]
    Hybrid -->|No| Edge

    style Start fill:#2C3E50,stroke:#16A085,color:#fff
    style Latency fill:#E67E22,stroke:#2C3E50,color:#fff
    style Privacy fill:#E67E22,stroke:#2C3E50,color:#fff
    style Connectivity fill:#E67E22,stroke:#2C3E50,color:#fff
    style Bandwidth fill:#E67E22,stroke:#2C3E50,color:#fff
    style Hybrid fill:#E67E22,stroke:#2C3E50,color:#fff
    style Edge fill:#27AE60,stroke:#2C3E50,color:#fff
    style Cloud fill:#3498DB,stroke:#2C3E50,color:#fff
    style HybridDeploy fill:#9B59B6,stroke:#2C3E50,color:#fff

Figure 1346.1: Edge vs Cloud ML Deployment Decision Framework

1346.3.1 Decision Factors

| Factor | Choose Edge | Choose Cloud |
|---|---|---|
| Latency | < 100ms required (safety-critical) | 1-5 seconds acceptable |
| Privacy | Sensitive data (health, financial) | Anonymous/aggregated data |
| Bandwidth | High data rate (video, audio) | Low data rate (temperature) |
| Connectivity | Intermittent or offline | Always connected |
| Model Complexity | Simple models (< 1MB) | Complex models (> 10MB) |
| Cost | High cloud costs, many devices | Expensive edge hardware |

1346.3.2 Example Scenarios

| Application | Best Deployment | Reasoning |
|---|---|---|
| Fall Detection | Edge | < 100ms latency critical for safety |
| Voice Assistant Wake Word | Edge | Privacy (always listening) |
| Full Voice Command | Cloud | Complex NLU requires large models |
| Industrial Anomaly Detection | Hybrid | Real-time alerts (edge) + root cause (cloud) |
| Smart Thermostat | Edge | Works offline, simple model |
| Fleet-wide Predictive Maintenance | Cloud | Cross-device learning required |
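
The flowchart above collapses into a few lines of logic. Below is a minimal sketch in Python; the function name and boolean parameters are illustrative, not part of any framework:

def choose_deployment(latency_under_100ms, privacy_critical,
                      reliable_connectivity, high_bandwidth,
                      model_exceeds_device):
    """Mirror of the edge-vs-cloud decision flowchart (illustrative)."""
    if latency_under_100ms:
        if privacy_critical or high_bandwidth:
            return "edge"
        # Latency-bound, but the data itself could be shipped
        if model_exceeds_device:
            return "hybrid: edge inference + cloud training"
        return "edge"
    # Latency is relaxed: connectivity decides
    return "cloud" if reliable_connectivity else "edge"

# Example: fall detection -> tight latency, private health data
print(choose_deployment(True, True, True, False, False))  # "edge"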

1346.4 TinyML and Model Quantization

TinyML enables machine learning on devices with < 1MB RAM and < 1MB storage. The key technique is quantization: reducing numerical precision.

1346.4.1 Quantization Overview

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ecf0f1'}}}%%
flowchart LR
    FP32[FP32 Model<br/>100 MB<br/>98% accuracy] --> INT8[INT8 Quantized<br/>25 MB<br/>96% accuracy]
    INT8 --> INT4[INT4 Quantized<br/>12.5 MB<br/>92% accuracy]

    FP32 -.->|4x smaller<br/>2% accuracy loss| INT8
    INT8 -.->|2x smaller<br/>4% accuracy loss| INT4

    style FP32 fill:#2C3E50,stroke:#16A085,color:#fff
    style INT8 fill:#27AE60,stroke:#2C3E50,color:#fff
    style INT4 fill:#E67E22,stroke:#2C3E50,color:#fff

Figure 1346.2: Model Quantization from FP32 to INT8 and INT4

| Precision | Bits | Size Reduction | Accuracy Loss | Use Case |
|---|---|---|---|---|
| FP32 | 32 | Baseline | 0% | Training, cloud inference |
| FP16 | 16 | 2x | < 1% | GPU inference |
| INT8 | 8 | 4x | 1-3% | Edge devices, ESP32 |
| INT4 | 4 | 8x | 3-8% | MCUs, Cortex-M4 |
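
The size reductions above come from storing each weight with fewer bits. A minimal sketch of symmetric per-tensor INT8 quantization (the array values are made up for illustration):

import numpy as np

# Map FP32 weights to int8 via a single per-tensor scale factor
w = np.array([-1.3, 0.02, 0.8, 2.5], dtype=np.float32)
scale = np.abs(w).max() / 127  # FP32 multiplier kept alongside the int8 data
w_q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)

# Dequantize to measure the rounding error introduced
w_restored = w_q.astype(np.float32) * scale
print(w_q)                           # stored on device: 1 byte per weight
print(np.abs(w - w_restored).max())  # worst-case quantization error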

1346.4.2 TensorFlow Lite Quantization

import numpy as np
import tensorflow as tf

# Train full-precision model (X_train, y_train, X_val, y_val, and
# X_calibration are assumed to be preloaded NumPy arrays)
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(27,)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(5, activation='softmax')  # 5 activity classes
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
model.fit(X_train, y_train, epochs=50, validation_data=(X_val, y_val))

# Convert to TensorFlow Lite with INT8 quantization
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]

# Representative dataset for calibration (required for INT8)
def representative_data_gen():
    for sample in X_calibration[:100]:
        yield [sample.reshape(1, -1).astype(np.float32)]

converter.representative_dataset = representative_data_gen
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

# Convert and save
tflite_model = converter.convert()
with open('activity_classifier_int8.tflite', 'wb') as f:
    f.write(tflite_model)

# Result: 500KB → 125KB (4x reduction)
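
To run the quantized model on-device (or verify it on a workstation), load it with the TFLite interpreter and convert inputs using the scale and zero point embedded in the model. A minimal sketch, assuming x is a single 27-feature float32 NumPy sample:

import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path='activity_classifier_int8.tflite')
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

# Quantize the float input with the model's stored (scale, zero_point)
scale, zero_point = inp['quantization']
x_q = np.round(x / scale + zero_point).astype(np.int8)

interpreter.set_tensor(inp['index'], x_q.reshape(1, -1))
interpreter.invoke()

# Dequantize the int8 output back to class probabilities
out_scale, out_zero = out['quantization']
probs = (interpreter.get_tensor(out['index']).astype(np.float32) - out_zero) * out_scale
print(probs.argmax())  # predicted activity class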

1346.4.3 Quantization Calibration

INT8 quantization requires a representative dataset to calibrate the mapping from FP32 to INT8 ranges:

# Calibration determines min/max values for each layer
# Using 100-1000 representative samples works well

def representative_data_gen():
    # Use validation data (NOT training data) for calibration
    for i in range(min(100, len(X_val))):
        yield [X_val[i:i+1].astype(np.float32)]

Calibration Best Practices:

  • Use 100-1000 samples from the validation set
  • Include diverse examples covering all classes
  • Do NOT use training data (the calibration ranges would overfit the training distribution)

1346.5 Model Optimization Techniques

1346.5.1 1. Pruning

Remove unimportant weights (set to zero):

import tensorflow_model_optimization as tfmot

# Apply pruning during training
prune_low_magnitude = tfmot.sparsity.keras.prune_low_magnitude

pruning_params = {
    'pruning_schedule': tfmot.sparsity.keras.PolynomialDecay(
        initial_sparsity=0.30,
        final_sparsity=0.70,
        begin_step=1000,
        end_step=5000
    )
}

model_for_pruning = prune_low_magnitude(model, **pruning_params)
model_for_pruning.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

# Train with pruning; the UpdatePruningStep callback is required to
# advance the pruning schedule each training step
model_for_pruning.fit(
    X_train, y_train, epochs=10,
    callbacks=[tfmot.sparsity.keras.UpdatePruningStep()]
)

# Strip the pruning wrappers before export so the saved model is small
final_model = tfmot.sparsity.keras.strip_pruning(model_for_pruning)

# Result: 70% of weights become zero → ~3x compression with <2% accuracy loss

1346.5.2 2. Knowledge Distillation

Train a small “student” model to mimic a large “teacher” model:

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ecf0f1'}}}%%
flowchart LR
    Input[Input Data] --> Teacher[Teacher Model<br/>Large, Accurate<br/>10 MB, 98%]
    Input --> Student[Student Model<br/>Small, Fast<br/>500 KB]

    Teacher --> Soft[Soft Labels<br/>Probability dist]
    Student --> StudentOut[Student Output]

    Soft --> Loss[Distillation<br/>Loss]
    StudentOut --> Loss

    Loss --> Update[Update Student<br/>Weights]
    Update --> Student

    style Teacher fill:#2C3E50,stroke:#16A085,color:#fff
    style Student fill:#27AE60,stroke:#2C3E50,color:#fff
    style Soft fill:#E67E22,stroke:#2C3E50,color:#fff
    style Loss fill:#16A085,stroke:#2C3E50,color:#fff

Figure 1346.3: Knowledge Distillation from Teacher to Student Model

Benefits:

  • Student achieves 95% accuracy vs 90% when trained directly on hard labels
  • Soft labels from the teacher contain more information than hard labels
  • Enables deployment of small models with large-model accuracy
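
A minimal sketch of the distillation loss in TensorFlow (the temperature T and weight alpha are hypothetical hyperparameters; both models are assumed to output raw logits):

import tensorflow as tf

T = 4.0      # temperature: softens the teacher's probability distribution
alpha = 0.7  # weight on the distillation term vs. the hard-label term

def distillation_loss(y_true, student_logits, teacher_logits):
    # Cross-entropy against the teacher's softened distribution
    # (equivalent to KL divergence up to a constant); scaled by T^2
    # to keep gradient magnitudes comparable across temperatures
    soft_teacher = tf.nn.softmax(teacher_logits / T)
    log_soft_student = tf.nn.log_softmax(student_logits / T)
    kd = -tf.reduce_mean(
        tf.reduce_sum(soft_teacher * log_soft_student, axis=-1)) * T ** 2

    # Standard cross-entropy against the ground-truth hard labels
    ce = tf.reduce_mean(tf.keras.losses.sparse_categorical_crossentropy(
        y_true, student_logits, from_logits=True))
    return alpha * kd + (1 - alpha) * ce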

1346.6 Worked Example: HVAC Predictive Control with Edge LSTM

Scenario: A smart building requires temperature predictions 60 minutes ahead for efficient HVAC pre-conditioning. The edge gateway has 512KB RAM and must work offline.

1346.6.1 System Architecture

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ecf0f1'}}}%%
flowchart TB
    Indoor[Indoor Temp<br/>Sensor] --> Buffer[Circular<br/>Buffer]
    Outdoor[Outdoor Temp<br/>Sensor] --> Buffer
    Occ[Occupancy<br/>Sensor] --> Buffer
    HVAC[HVAC State<br/>Sensor] --> Buffer

    Buffer --> Features[Feature<br/>Engineering]
    Features --> Model[Quantized LSTM<br/>50 KB]
    Model --> Predict[Predict Temp<br/>+60 min]

    Predict --> Decision{Will temp<br/>be out of<br/>range?}
    Decision -->|Yes, hot| PreCool[Pre-Cool<br/>Start HVAC]
    Decision -->|Yes, cold| PreHeat[Pre-Heat<br/>Start HVAC]
    Decision -->|No| Wait[Wait<br/>Continue monitoring]

    style Indoor fill:#E67E22,stroke:#2C3E50,color:#fff
    style Outdoor fill:#E67E22,stroke:#2C3E50,color:#fff
    style Model fill:#16A085,stroke:#2C3E50,color:#fff
    style Predict fill:#27AE60,stroke:#2C3E50,color:#fff

Figure 1346.4: HVAC Predictive Control Pipeline with Edge LSTM Model

1346.6.2 Step 1: Data Preparation

Raw Data Collection (30 days of historical readings):

| Metric | Value | Notes |
|---|---|---|
| Total samples | 43,200 | 30 days × 24 hours × 60 minutes |
| Sampling rate | 1 minute | Standard HVAC sensor frequency |
| Sensors | 4 | Indoor temp, outdoor temp, occupancy, HVAC state |
| Storage | ~2 MB | CSV format with timestamps |

Train/Validation/Test Split (time-series aware):

| Split | Samples | Percentage | Date Range |
|---|---|---|---|
| Training | 30,240 | 70% | Days 1-21 |
| Validation | 6,480 | 15% | Days 22-25.5 |
| Test | 6,480 | 15% | Days 25.5-30 |
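
Because this is a time series, the split must be chronological. A minimal sketch, assuming data is a NumPy array of the 43,200 samples ordered by timestamp:

import numpy as np

n = len(data)              # 43,200 minute-level samples
train_end = int(n * 0.70)  # 30,240 -> days 1-21
val_end = int(n * 0.85)    # +6,480 -> days 22-25.5

train = data[:train_end]
val = data[train_end:val_end]
test = data[val_end:]      # final 6,480 -> days 25.5-30
# Never shuffle before splitting: random splits leak future readings
# into the training set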

1346.6.3 Step 2: Feature Engineering

Feature Set (14 features engineered from 4 raw sensors; top 7 by importance shown):

| Feature Name | Type | Importance | Rationale |
|---|---|---|---|
| temp_lag_1h | Numeric | 0.35 | Strong autocorrelation |
| temp_lag_30m | Numeric | 0.22 | Recent trend indicator |
| outdoor_temp | Numeric | 0.18 | Heat transfer driver |
| outdoor_delta_1h | Numeric | 0.08 | Predicts future heat load |
| hour_sin | Cyclic | 0.05 | Daily temperature cycle |
| hour_cos | Cyclic | 0.03 | Daily cycle phase |
| occupancy | Binary | 0.04 | Body heat, door openings |

Tip: Why Cyclic Encoding for Time?

Problem: Hour of day (0-23) has a circular relationship: 11 PM is close to 1 AM, even though the raw values 23 and 1 are far apart numerically.

Bad approach: Raw hour value (0-23) → Model thinks 0 and 23 are far apart!

Good approach: Encode each hour h as the pair (sin(2πh/24), cos(2πh/24)):

  • Hour 0: sin=0, cos=1
  • Hour 6: sin=1, cos=0
  • Hour 12: sin=0, cos=-1
  • Hour 23: sin≈-0.26, cos≈0.97 (close to hour 0!)

This preserves the circular relationship.
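
A minimal sketch of the encoding, assuming hour is the integer hour of day:

import numpy as np

hour = np.arange(24)
hour_sin = np.sin(2 * np.pi * hour / 24)
hour_cos = np.cos(2 * np.pi * hour / 24)

# Hours 23 and 0 end up adjacent in (sin, cos) space:
print(hour_sin[23], hour_cos[23])  # ≈ -0.26, 0.97
print(hour_sin[0], hour_cos[0])    # 0.0, 1.0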

1346.6.4 Step 3: Model Selection

| Model | MAE (deg C) | Size (KB) | Inference (ms) | Fits 512KB? |
|---|---|---|---|---|
| Linear Regression | 1.8 | 2 | 0.1 | Yes |
| Random Forest (100 trees) | 0.9 | 450 | 25 | Yes (tight) |
| LSTM (2 layers, 32 units) | 0.7 | 200 | 50 | Yes |
| Quantized LSTM (INT8) | 0.8 | 50 | 15 | Yes |

Winner: Quantized LSTM - Only 0.1 deg C accuracy loss, 4x smaller, 3x faster

1346.6.5 Step 4: Quantization

import numpy as np
import tensorflow as tf

# Train full-precision LSTM (X_train windows have shape (n, 60, 14))
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, return_sequences=True, input_shape=(60, 14)),
    tf.keras.layers.LSTM(16),
    tf.keras.layers.Dense(1)  # Predict temperature
])
model.compile(optimizer='adam', loss='mse')
model.fit(X_train, y_train, epochs=50, validation_data=(X_val, y_val))

# Convert to TensorFlow Lite with INT8 quantization
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]

# Calibration generator must yield (1, 60, 14) float32 windows
def representative_data_gen():
    for i in range(min(100, len(X_val))):
        yield [X_val[i:i+1].astype(np.float32)]

converter.representative_dataset = representative_data_gen
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
tflite_model = converter.convert()

# Save quantized model
with open('hvac_predictor_int8.tflite', 'wb') as f:
    f.write(tflite_model)
# Result: 200KB → 50KB (75% reduction)
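
The LSTM consumes 60-minute windows of the 14 features. A minimal sketch of the window construction, assuming features is a (n_samples, 14) array at 1-minute resolution and indoor_temp holds the target sensor readings (both names are illustrative):

import numpy as np

def make_windows(features, indoor_temp, window=60, horizon=60):
    """Label each 60-min feature window with the temperature 60 min later."""
    X, y = [], []
    for t in range(window, len(features) - horizon):
        X.append(features[t - window:t])    # shape (60, 14)
        y.append(indoor_temp[t + horizon])  # temp +60 min ahead
    return np.array(X, dtype=np.float32), np.array(y, dtype=np.float32)

# train_features / train_temp: the chronological training split from Step 1
X_train, y_train = make_windows(train_features, train_temp)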

1346.6.6 Step 5: Results

Deployment Validation:

| Metric | Target | Achieved | Status |
|---|---|---|---|
| Model size | < 512KB | 50KB | Pass |
| Inference latency | < 1000ms | 15ms | Pass (66x margin) |
| MAE accuracy | < 1.0 deg C | 0.8 deg C | Pass |
| Works offline | Required | Yes | Pass |

Energy Savings:

Pre-deployment (reactive HVAC):
- HVAC cycles: 24 per day (start/stop at setpoint)
- Energy waste: HVAC runs until room reaches setpoint, then overshoots

Post-deployment (predictive HVAC):
- HVAC pre-starts 30-45 min before needed
- Reduces cycling: 24 → 18 cycles/day (25% fewer)

Monthly energy comparison:
- Baseline: 1,200 kWh/month
- Predictive: 1,020 kWh/month
- Savings: 180 kWh/month (15% reduction)

Annual savings (10-story office building):
- 180 kWh × 12 months × $0.12/kWh = $259/floor/year
- 10 floors = $2,590/year
- Edge gateway cost: $150 one-time
- Payback period: 0.7 months

Note: Key Takeaways from HVAC Example
  1. Feature engineering drives accuracy: Lag features contributed 57% of model importance
  2. Quantization is nearly free: INT8 reduced size by 75% with only 0.1 deg C loss
  3. Time-series requires chronological splits: Random splits cause data leakage
  4. Cyclic encoding for time: sin/cos encoding preserves circular relationships
  5. Business value justifies complexity: 15% energy savings pays for development in < 1 month

1346.7 Edge ML Tradeoff

Tip: Tradeoff: Model Accuracy vs Edge Deployability

Option A: High-accuracy complex models (deep neural networks, large ensembles)

Option B: Lightweight models optimized for edge deployment (decision trees, quantized CNNs)

Decision Factors:

  • Complex models achieve 95-98% accuracy but require MBs of RAM and GPU/NPU acceleration
  • Lightweight models achieve 88-93% accuracy but fit in < 256KB and run on basic MCUs

For battery-powered wearables with 5+ day battery life requirements: choose lightweight

For edge AI accelerators (Jetson, Coral) with consistent power: choose complex models

Sweet spot: Quantized models - INT8 quantization provides 4x size reduction with only 1-2% accuracy loss.

1346.8 Knowledge Check

Question 1: What is the main challenge in deploying neural networks for real-time inference on IoT edge devices?

Explanation: Edge devices face tight resource constraints: (1) Memory: a typical MCU has 256KB of RAM; (2) Processing: the MCU runs at around 80 MHz; (3) Energy: battery-powered devices cannot sustain > 5mW continuously. Solutions: quantization (4x smaller), pruning (~3x compression), efficient architectures (MobileNet, TinyML).

Question 2: A smartphone continuously samples accelerometer at 50 Hz consuming 10 mW. Implementing duty cycling (1 sec active, 4 sec sleep) reduces power consumption to what level?

Explanation: Duty cycle = active time / total time = 1 sec / 5 sec = 20%. Average power = 10 mW × 0.2 + 0 mW × 0.8 = 2 mW (80% savings). Battery impact: Continuous = 38 days, Duty cycled = 192 days.
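
The arithmetic from Question 2 as a quick sketch:

# Duty-cycled power: 1 s active out of every 5 s
active_power_mw = 10.0
duty_cycle = 1 / 5
avg_power_mw = active_power_mw * duty_cycle   # 2.0 mW
savings = 1 - avg_power_mw / active_power_mw  # 0.8 -> 80% savings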

1346.9 Summary

This chapter covered edge ML and TinyML deployment:

  • Edge vs Cloud Decision: Use edge for low latency, privacy, and offline requirements
  • Quantization: INT8 reduces model size by 4x with 1-3% accuracy loss
  • Pruning: Remove 70% of weights with <2% accuracy loss
  • Knowledge Distillation: Train small models to match large model performance
  • HVAC Example: Quantized LSTM achieves 15% energy savings with 0.7-month payback

Key Insight: Start with the simplest model that meets constraints. Quantization is nearly free—always try it before abandoning edge deployment.

1346.10 What’s Next

Continue to Audio Feature Processing to learn MFCC feature extraction for voice recognition and keyword spotting on edge devices.