```mermaid
%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ecf0f1', 'fontSize': '14px'}}}%%
flowchart LR
subgraph Float32["Float32 Model"]
F_Size["14 MB"]
F_Speed["300ms"]
F_Acc["72.0%"]
end
subgraph Int8["Int8 Quantized"]
I_Size["3.5 MB"]
I_Speed["75ms"]
I_Acc["71.6%"]
end
Float32 -->|"4x smaller"| Int8
Float32 -->|"4x faster"| Int8
Float32 -->|"-0.4% acc"| Int8
style Float32 fill:#E67E22,stroke:#2C3E50,color:#fff
style Int8 fill:#16A085,stroke:#2C3E50,color:#fff
```
319 Model Optimization for Edge AI
319.1 Learning Objectives
By the end of this chapter, you will be able to:
- Apply Quantization: Convert float32 models to int8 for 4x size reduction with <1% accuracy loss
- Implement Pruning: Remove 70-90% of neural network weights with minimal accuracy loss
- Use Knowledge Distillation: Train compact student models to match large teacher performance
- Design Optimization Pipelines: Combine techniques for 10-100x total compression
319.2 Introduction
Running full-scale neural networks on edge devices requires aggressive optimization. Three primary techniques enable deploying powerful AI on resource-constrained hardware: quantization, pruning, and knowledge distillation.
319.3 Quantization: Reducing Precision
Concept: Reduce numerical precision from 32-bit floats to 8-bit integers, achieving 4x size reduction and 2-4x speedup with minimal accuracy loss.
319.3.1 Quantization Math
Float32 representation:
- value = (-1)^sign x 1.mantissa x 2^(exponent - 127)
- Range: +/-3.4 x 10^38
- Precision: ~7 decimal digits
- Memory: 4 bytes
Int8 (affine) representation:
- value = scale x (quantized_value - zero_point)
- Range: -128 to 127
- Precision: 256 discrete levels
- Memory: 1 byte
Example: converting weights in the range [-0.5, 0.5]:
- scale = (max - min) / (127 - (-128)) = 1.0 / 255 ≈ 0.00392
- zero_point = 0
- Float weight 0.234 -> int8: round(0.234 / 0.00392) = 60
- Float weight -0.156 -> int8: round(-0.156 / 0.00392) = -40
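The same mapping can be checked in a few lines of NumPy. This is a minimal sketch of the affine quantize/dequantize round trip using the example's range and zero point; real toolchains derive these parameters per tensor or per channel from calibration data.

```python
import numpy as np

# Affine int8 quantization for the worked example above: range [-0.5, 0.5],
# int8 range [-128, 127], zero_point = 0 because the range is symmetric.
r_min, r_max = -0.5, 0.5
q_min, q_max = -128, 127

scale = (r_max - r_min) / (q_max - q_min)   # 1.0 / 255 ≈ 0.00392
zero_point = 0                              # symmetric range -> no offset needed

w = np.array([0.234, -0.156, 0.007, -0.481], dtype=np.float32)

# Quantize: q = round(w / scale) + zero_point, clipped to the int8 range.
q = np.clip(np.round(w / scale) + zero_point, q_min, q_max).astype(np.int8)
# Dequantize: value = scale * (q - zero_point).
w_hat = scale * (q.astype(np.float32) - zero_point)

print(q)                        # [ 60 -40   2 -123] -> 0.234 maps to 60, -0.156 to -40
print(np.abs(w - w_hat).max())  # rounding error is bounded by scale / 2 ≈ 0.002
```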
319.3.2 Quantization Types
| Type | When Applied | Accuracy Impact | Use Case |
|---|---|---|---|
| Post-Training Quantization (PTQ) | After training on float32 model | 0.5-2% loss | Easiest, no retraining needed |
| Quantization-Aware Training (QAT) | During training, simulates quantization | <0.5% loss | Best accuracy, requires retraining |
| Dynamic Range Quantization | After training; weights int8, activations float32 (quantized dynamically at runtime) | 1-3% loss | No calibration data needed; balanced approach |
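In TensorFlow Lite, the usual toolchain for the microcontroller targets discussed below, the dynamic range and full-integer PTQ rows of this table come down to a few converter settings. A hedged sketch, assuming a SavedModel export at a placeholder path and stand-in calibration inputs:

```python
import numpy as np
import tensorflow as tf

# (a) Dynamic range quantization: int8 weights, float activations. One flag, no data.
converter = tf.lite.TFLiteConverter.from_saved_model("path/to/saved_model")  # placeholder path
converter.optimizations = [tf.lite.Optimize.DEFAULT]
dynamic_range_model = converter.convert()

# (b) Full-integer PTQ: weights and activations int8, which requires a representative
# calibration dataset (a few hundred real inputs is typical; random data is a stand-in here).
def representative_dataset():
    for _ in range(500):
        yield [np.random.rand(1, 32, 32, 1).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("path/to/saved_model")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8    # fully int8 I/O for microcontroller runtimes
converter.inference_output_type = tf.int8
full_int8_model = converter.convert()
```

QAT would be applied before conversion (for example with the TensorFlow Model Optimization toolkit) and then exported through the same full-integer path.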
319.3.3 Real-World Results
MobileNetV2 on ImageNet:
- Float32: 72.0% accuracy, 14 MB, 300ms inference (Cortex-M7)
- Int8 PTQ: 70.8% accuracy, 3.5 MB, 75ms inference (4x faster, -1.2% accuracy)
- Int8 QAT: 71.6% accuracy, 3.5 MB, 75ms inference (4x faster, -0.4% accuracy)
YOLOv4-Tiny object detection:
- Float32: 40.2% mAP, 23 MB, 180ms
- Int8 PTQ: 39.1% mAP, 6 MB, 45ms (4x faster, -1.1% mAP)
Scenario: A factory deploys vibration sensors on 500 motors for predictive maintenance. Each sensor runs a CNN-based anomaly detector on an STM32F4 microcontroller (1 MB Flash, 192 KB RAM). The original float32 model must be quantized to fit the hardware constraints.
Given:
- Original model: 3-layer CNN, 890 KB float32 weights, 94.2% anomaly detection accuracy
- Target hardware: STM32F4 with 1 MB Flash, 192 KB RAM
- Flash budget for model: 300 KB (the rest is needed for firmware and buffers)
- RAM budget for inference: 80 KB tensor arena
- Inference time requirement: < 50 ms per vibration window
- Acceptable accuracy loss: < 2%
Steps:
1. Analyze the model size reduction needed:
   - Current size: 890 KB (float32)
   - Target size: 300 KB
   - Required compression: 890 KB / 300 KB = 2.97x minimum
   - INT8 quantization provides 4x compression (32-bit to 8-bit)
   - Expected size after INT8: 890 KB / 4 = 222.5 KB (fits the budget)
2. Calculate RAM requirements for the tensor arena:
   - Largest layer activation: 128 channels x 32 samples = 4,096 values
   - INT8 activations: 4,096 x 1 byte = 4 KB per layer
   - Peak activation memory (two layers live at once): 2 x 4 KB = 8 KB, plus ~4 KB scratch buffers
   - With runtime overhead: ~35 KB total (well within the 80 KB budget; the size and arena arithmetic is collected in the short sketch after the Key Insight below)
3. Apply post-training quantization (PTQ):
   - Collect 500 representative vibration samples for calibration
   - Run PTQ with per-channel quantization for the convolution layers
   - Measure accuracy on the validation set: 92.8% (a 1.4% drop from 94.2%)
4. Evaluate whether QAT is needed:
   - The accuracy drop (1.4%) is within the 2% tolerance
   - PTQ is sufficient; QAT would add training complexity for minimal gain
   - Decision: use PTQ for faster deployment
5. Measure inference performance:
   - Float32 on the STM32F4: 180 ms per window (too slow)
   - INT8 on the STM32F4 with CMSIS-NN: 38 ms per window (meets the < 50 ms requirement)
   - Speedup: 180 / 38 ≈ 4.7x
Result:
- Final model size: 222.5 KB (fits the 300 KB Flash budget)
- Inference time: 38 ms (meets the 50 ms requirement)
- Accuracy: 92.8% (within 2% of the original 94.2%)
- RAM usage: ~35 KB (within the 80 KB budget)
Key Insight: INT8 quantization delivers a reliable 4x size reduction and 4-5x speedup on ARM Cortex-M processors. For most industrial anomaly detection tasks, post-training quantization (PTQ) with representative calibration data achieves production-quality accuracy without the complexity of quantization-aware training. Always verify that the calibration dataset includes edge cases (unusual vibration patterns, temperature extremes) to prevent accuracy collapse in production.
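The budget arithmetic in steps 1 and 2 is simple enough to keep next to the deployment scripts; a few lines of Python (plain arithmetic, using only the numbers from the scenario) make the check repeatable:

```python
# Budget sanity check for the vibration-sensor scenario (plain arithmetic only;
# all numbers come from the scenario above).
flash_budget_kb = 300
model_fp32_kb = 890

required_compression = model_fp32_kb / flash_budget_kb   # ≈ 2.97x needed
model_int8_kb = model_fp32_kb / 4                         # int8 gives 4x -> 222.5 KB

activation_values = 128 * 32              # largest layer: channels x samples = 4,096
activation_kb = activation_values / 1024  # int8 -> 1 byte per value -> 4 KB per layer
peak_kb = 2 * activation_kb + 4           # two live layers plus ~4 KB scratch buffers

print(f"need {required_compression:.2f}x compression; int8 model = {model_int8_kb:.1f} KB "
      f"({'fits' if model_int8_kb <= flash_budget_kb else 'over budget'})")
print(f"peak activation estimate ≈ {peak_kb:.0f} KB before runtime overhead (~35 KB total observed)")
```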
319.4 Pruning: Removing Unnecessary Connections
Concept: Remove redundant neural network connections (set weights to zero), exploiting sparsity to reduce model size and computation.
319.4.1 Pruning Strategies
- Magnitude-Based Pruning: Remove weights with smallest absolute values
- Structured Pruning: Remove entire neurons, channels, or layers
- Unstructured Pruning: Remove individual weights (requires sparse matrix support)
319.4.2 Pruning Process
1. Train full model to convergence (baseline accuracy)
2. Identify low-magnitude weights (e.g., |weight| < 0.001)
3. Set these weights to zero (create sparsity)
4. Fine-tune remaining weights (recover accuracy)
5. Repeat steps 2-4 iteratively (gradual pruning)
Result: 70-90% of weights can typically be pruned with under 2% accuracy loss, and often under 1% at moderate sparsity (see the ResNet-50 numbers below)
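A minimal NumPy sketch of steps 2-3 and the gradual schedule is shown below. It only computes the magnitude mask; in a real pipeline the mask is kept fixed while the surviving weights are fine-tuned between pruning steps (the array sizes here are illustrative).

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
    threshold = np.quantile(np.abs(weights), sparsity)
    return weights * (np.abs(weights) > threshold)

rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=(256, 128)).astype(np.float32)  # stand-in weight matrix

# Gradual pruning: raise the sparsity target in steps, fine-tuning between steps in practice.
for target in (0.5, 0.7, 0.9):
    w = magnitude_prune(w, target)
    # ... fine-tune the remaining (nonzero) weights here before the next step ...
    print(f"target {target:.0%}: {np.mean(w == 0):.1%} of weights are now zero")
```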
319.4.3 Real-World Example
ResNet-50 for ImageNet:
- Dense model: 25.5M parameters, 76.1% accuracy
- 70% pruned: 7.6M parameters, 75.8% accuracy (-0.3%)
- 90% pruned: 2.5M parameters, 74.2% accuracy (-1.9%)
On a microcontroller with 1 MB Flash:
- Dense: 25.5 MB, doesn't come close to fitting
- 90% pruned: 2.5M x 1 byte (int8) = 2.5 MB -> still too large
- 90% pruned + int8 quantization + compressed sparse storage: ~600 KB -> fits
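The reason the pruned model still needs explicit compression is that zeroed weights occupy flash unless a sparse storage format is used. A rough estimate under an assumed CSR-style layout (1-byte values, 2-byte column indices, 4-byte row pointers) shows the trade-off:

```python
# Storage estimate for a 90%-sparse int8 weight matrix (illustrative sizes).
rows, cols, sparsity = 1024, 1024, 0.90
n = rows * cols
nnz = int(n * (1 - sparsity))                         # nonzero weights that must be stored

dense_kb = n / 1024                                   # int8, zeros stored explicitly
csr_kb = (nnz * 1 + nnz * 2 + (rows + 1) * 4) / 1024  # values + column indices + row pointers

print(f"dense: {dense_kb:.0f} KB, CSR-style sparse: {csr_kb:.0f} KB")  # ~1024 KB vs ~311 KB
```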
319.5 Knowledge Distillation: Teacher-Student Training
Concept: Train a small “student” model to mimic a large “teacher” model’s behavior, transferring knowledge without transferring size.
319.5.1 Process
```mermaid
%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ecf0f1', 'fontSize': '14px'}}}%%
flowchart TD
subgraph Teacher["Teacher Model (Large)"]
T1["ResNet-50"]
T2["25M params"]
T3["76% accuracy"]
end
subgraph Student["Student Model (Small)"]
S1["MobileNetV2"]
S2["3.5M params"]
S3["72% accuracy"]
end
subgraph Training["Distillation Training"]
Soft["Soft Labels<br/>(Teacher Predictions)"]
Hard["Hard Labels<br/>(Ground Truth)"]
Loss["Combined Loss"]
end
Teacher -->|"Produces"| Soft
Soft --> Loss
Hard --> Loss
Loss -->|"Trains"| Student
style Teacher fill:#7F8C8D,stroke:#2C3E50,color:#fff
style Student fill:#16A085,stroke:#2C3E50,color:#fff
style Training fill:#E67E22,stroke:#2C3E50,color:#fff
```
319.5.2 Why It Works
Teacher model’s soft predictions (e.g., “80% cat, 15% dog, 5% other”) contain more information than hard labels (“cat”), teaching the student about subtle feature relationships.
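A small NumPy sketch of the standard distillation loss makes this concrete: the teacher's logits are softened with a temperature T, and the student is trained on a weighted mix of that soft distribution and the usual cross-entropy on the hard label. The logits, temperature, and weighting below are made-up illustrative values.

```python
import numpy as np

def softmax(logits: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    z = logits / temperature
    z = z - z.max()                 # for numerical stability
    e = np.exp(z)
    return e / e.sum()

# One training example with 3 classes (cat, dog, other); values are illustrative.
teacher_logits = np.array([4.0, 2.5, 0.5])
student_logits = np.array([3.0, 2.8, 1.0])
hard_label = 0                      # ground truth: "cat"
T, alpha = 4.0, 0.7                 # temperature and soft-loss weight (assumed values)

p_teacher = softmax(teacher_logits, T)
p_student = softmax(student_logits, T)

# Soft term: KL(teacher || student) on temperature-softened distributions.
soft_loss = np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student)))
# Hard term: ordinary cross-entropy against the ground-truth label.
hard_loss = -np.log(softmax(student_logits)[hard_label])

# T^2 rescales the soft-term gradients (the usual convention from Hinton et al.).
total_loss = (1 - alpha) * hard_loss + alpha * (T ** 2) * soft_loss
print(f"hard {hard_loss:.3f}, soft {soft_loss:.4f}, total {total_loss:.3f}")
```

The T = 4 and 0.3/0.7 weighting used in the orchard scenario later in this section plug straight into this formula.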
319.5.3 Combined Optimization Results
Object Detection Model for Edge Camera:
- Baseline: YOLOv4 (244 MB, 500ms, 43% mAP) -> Too large for edge
- Distillation: YOLOv4-Tiny trained with YOLOv4 teacher (23 MB, 180ms, 40% mAP)
- + Quantization: Int8 (6 MB, 45ms, 39% mAP)
- + Pruning 70%: (2 MB, 30ms, 38% mAP)
FINAL: 122x smaller, 16x faster, only 5 points of mAP lost (43% -> 38%) -> deployable on an ESP32-S3!
Scenario: An agricultural IoT startup needs to deploy a pest detection model on ESP32-CAM modules ($10 each) distributed across orchards. The cloud-based ResNet-50 model (100 MB, 97% accuracy) is too large for the 4 MB Flash ESP32. They must distill knowledge into a tiny MobileNetV3-Small student model.
Given:
- Teacher model: ResNet-50, 100 MB float32, 97.2% pest detection accuracy
- Target hardware: ESP32-CAM with 4 MB Flash, 520 KB PSRAM
- Flash budget for model: 800 KB (after firmware and buffers)
- Student architecture: MobileNetV3-Small backbone (1.9M parameters)
- Training dataset: 50,000 labeled pest images (10 species)
- Minimum acceptable accuracy: 92%
Steps:
1. Train a baseline student model (without distillation):
   - Train MobileNetV3-Small from scratch on the pest dataset
   - Result: 89.3% accuracy (7.9% below the teacher)
   - Model size: 7.6 MB float32 (too large for the ESP32)
2. Apply knowledge distillation:
   - Temperature T = 4 for softened probability distributions
   - Loss function: 0.3 x hard-label loss + 0.7 x soft-label loss (KL divergence)
   - The teacher provides soft labels showing confidence across all 10 pest species
   - Train the student for 50 epochs with the distillation loss
3. Measure the distilled student's accuracy:
   - Distilled MobileNetV3-Small: 93.8% accuracy (only 3.4% below the teacher)
   - Improvement from distillation: 93.8% - 89.3% = +4.5%
   - The soft labels taught the student subtle differences (e.g., aphid vs. mite shapes)
4. Apply INT8 quantization to the distilled student:
   - Float32 size: 7.6 MB
   - INT8 quantized size: 7.6 MB / 4 = 1.9 MB
   - Post-quantization accuracy: 93.1% (only -0.7% from quantization)
   - Still too large for the 800 KB budget
5. Apply structured pruning (50% channel reduction):
   - Prune 50% of the channels in each layer based on L1 norm
   - Fine-tune the pruned model for 10 epochs
   - Pruned + INT8 size: 1.9 MB x 0.5 = 950 KB (close to budget)
   - Accuracy after pruning + fine-tuning: 92.4%
6. Finish with weight clustering (a rough sketch of this step follows the Key Insight below):
   - Cluster weights into 16 clusters (4-bit indices)
   - Combine with Huffman coding for additional compression
   - Final model size: 720 KB (fits the 800 KB budget)
   - Final accuracy: 92.1% (meets the 92% minimum)
Result:
- Final model size: 720 KB (139x smaller than the 100 MB teacher)
- Inference time on ESP32-CAM: 180 ms per image
- Accuracy: 92.1% (only 5.1% below the 97.2% teacher)
- Power consumption: 0.8 W during inference (battery-friendly)
Key Insight: Knowledge distillation provides “free” accuracy gains by transferring the teacher’s learned feature relationships to the student. In this example, distillation alone improved accuracy by 4.5% (89.3% to 93.8%), which created enough headroom to absorb the accuracy losses from subsequent quantization (-0.7%), pruning (-0.7%), and weight clustering (-0.3%) while still meeting the 92% target. The key is to apply distillation first, then optimize aggressively knowing you have accuracy margin to trade away.
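Step 6's weight clustering can also be sketched in a few lines to see where the saving comes from: with 16 shared centroids, each weight is stored as a 4-bit index plus a tiny per-layer codebook. The layer size and the simple k-means loop below are illustrative, not a production clustering API.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(scale=0.05, size=100_000).astype(np.float32)  # stand-in layer

# Cluster into 16 shared values (4-bit indices) with a few Lloyd (k-means) iterations.
centroids = np.linspace(weights.min(), weights.max(), 16)
for _ in range(10):
    idx = np.abs(weights[:, None] - centroids[None, :]).argmin(axis=1)
    for k in range(16):
        members = weights[idx == k]
        if members.size:
            centroids[k] = members.mean()

clustered = centroids[idx]                       # weights after clustering
mean_err = np.abs(weights - clustered).mean()

# Storage before any entropy (Huffman) coding: 4 bits per weight + 16 float32 centroids.
kb = (weights.size * 4 + centroids.size * 32) / 8 / 1024
print(f"mean |error| = {mean_err:.4f}; clustered storage ≈ {kb:.0f} KB for {weights.size:,} weights")
```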
319.6 Optimization Selection Guide
```mermaid
%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#7F8C8D', 'fontSize': '14px'}}}%%
flowchart TD
Start["Deployment<br/>Constraint?"] --> Size{Primary<br/>Issue?}
Size -->|"SIZE<br/>(Model too big)"| Q1["Apply QUANTIZATION<br/>4x size reduction"]
Q1 --> StillBig{Still<br/>too big?}
StillBig -->|"Yes"| P1["Add PRUNING<br/>+10x reduction"]
StillBig -->|"No"| Done1["DEPLOY"]
Size -->|"SPEED<br/>(Too slow)"| Q2["Apply QUANTIZATION<br/>4x speedup"]
Q2 --> StillSlow{Still<br/>too slow?}
StillSlow -->|"Yes"| D1["Use DISTILLATION<br/>Smaller architecture"]
StillSlow -->|"No"| Done2["DEPLOY"]
Size -->|"ACCURACY<br/>(Can't lose any)"| QAT["Use QAT<br/>Minimal accuracy loss"]
QAT --> NeedMore{Need more<br/>optimization?}
NeedMore -->|"Yes"| SP["Add STRUCTURED<br/>PRUNING"]
NeedMore -->|"No"| Done3["DEPLOY"]
P1 --> Done1
D1 --> Done2
SP --> Done3
style Start fill:#2C3E50,stroke:#16A085,color:#fff
style Q1 fill:#16A085,stroke:#2C3E50,color:#fff
style Q2 fill:#16A085,stroke:#2C3E50,color:#fff
style QAT fill:#16A085,stroke:#2C3E50,color:#fff
style P1 fill:#E67E22,stroke:#2C3E50,color:#fff
style D1 fill:#E67E22,stroke:#2C3E50,color:#fff
style SP fill:#E67E22,stroke:#2C3E50,color:#fff
style Done1 fill:#27AE60,stroke:#2C3E50,color:#fff
style Done2 fill:#27AE60,stroke:#2C3E50,color:#fff
style Done3 fill:#27AE60,stroke:#2C3E50,color:#fff
```
Optimization Priority: Quantization first (easiest, 4x benefit), then pruning or distillation based on whether size or architecture is the bottleneck. QAT when accuracy is critical.
319.7 Knowledge Check
319.8 Summary
Optimization Techniques:
- Quantization: 4x size reduction (float32 -> int8), 2-4x speedup, <1% accuracy loss with QAT
- Pruning: 70-90% of weights removable with minimal (often under 1%) accuracy impact
- Knowledge Distillation: a small student model achieves 90-95% of the large teacher’s accuracy
Optimization Priority:
1. Start with PTQ (simplest, reliable 4x improvement)
2. Use QAT if the accuracy loss exceeds tolerance
3. Add pruning for additional compression
4. Use distillation when the architecture is fundamentally too large
Combined Results:
- 10-100x total compression achievable
- Enables deployment of powerful AI on $5 microcontrollers
- Real-time inference on milliwatt power budgets
319.9 What’s Next
Now that you understand model optimization techniques, continue to:
- Hardware Accelerators for Edge AI - Choose the right NPU, TPU, GPU, or FPGA for your optimized model
- Edge AI Applications - See optimization techniques applied to real-world use cases
- Edge AI Deployment Pipeline - Learn end-to-end workflows from training to production