319  Model Optimization for Edge AI

319.1 Learning Objectives

By the end of this chapter, you will be able to:

  • Apply Quantization: Convert float32 models to int8 for 4x size reduction with <1% accuracy loss
  • Implement Pruning: Remove 70-90% of neural network weights while preserving accuracy
  • Use Knowledge Distillation: Train compact student models to match large teacher performance
  • Design Optimization Pipelines: Combine techniques for 10-100x total compression

319.2 Introduction

Running full-scale neural networks on edge devices requires aggressive optimization. Three primary techniques enable deploying powerful AI on resource-constrained hardware: quantization, pruning, and knowledge distillation.

319.3 Quantization: Reducing Precision

Concept: Reduce numerical precision from 32-bit floats to 8-bit integers, achieving 4x size reduction and 2-4x speedup with minimal accuracy loss.

319.3.1 Quantization Math

Float32 representation (IEEE 754):
  value = (-1)^sign x 1.mantissa x 2^(exponent - 127)
  Range: +/-3.4 x 10^38
  Precision: ~7 decimal digits
  Memory: 4 bytes

Int8 representation:
  value = scale x (quantized_value - zero_point)
  Range: -128 to 127
  Precision: 256 discrete levels
  Memory: 1 byte

Example: Converting weights in range [-0.5, 0.5]
  scale = (max - min) / (127 - (-128)) = 1.0 / 255 = 0.00392
  zero_point = 0

  Float weight: 0.234 -> Int8: round(0.234 / 0.00392) = 60
  Float weight: -0.156 -> Int8: round(-0.156 / 0.00392) = -40
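
A minimal NumPy sketch of this affine (scale/zero-point) mapping, reproducing the numbers above; the helper names are illustrative and not tied to any particular framework:

```python
import numpy as np

def quantize_int8(weights, w_min, w_max):
    """Affine int8 quantization: q = round(w / scale) + zero_point."""
    scale = (w_max - w_min) / 255.0                 # 256 discrete levels: -128..127
    zero_point = int(round(-128 - w_min / scale))   # 0 for the symmetric [-0.5, 0.5] example
    q = np.clip(np.round(weights / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate float values: w ~= scale * (q - zero_point)."""
    return scale * (q.astype(np.float32) - zero_point)

weights = np.array([0.234, -0.156], dtype=np.float32)
q, scale, zp = quantize_int8(weights, -0.5, 0.5)
print(scale, zp)                  # ~0.00392, 0
print(q)                          # [ 60 -40 ]
print(dequantize(q, scale, zp))   # values close to the originals (small rounding error)
```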

319.3.2 Quantization Types

Three approaches differ in when quantization is applied, the accuracy impact, and the typical use case:

  • Post-Training Quantization (PTQ): applied to the already-trained float32 model; typically 0.5-2% accuracy loss; easiest option, no retraining needed
  • Quantization-Aware Training (QAT): simulates quantization during training; typically <0.5% accuracy loss; best accuracy, but requires retraining
  • Dynamic Range Quantization: int8 weights with float32 activations; typically 1-3% accuracy loss; a balanced approach
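
Post-training quantization is usually a one-shot conversion step. Below is a hedged sketch using the TensorFlow Lite converter, assuming a trained model exported to `saved_model_dir` and a `calibration.npy` file of representative inputs (both are placeholders you would supply):

```python
import numpy as np
import tensorflow as tf

# Placeholder: a few hundred representative float32 inputs, shaped like the model input.
calibration_samples = np.load("calibration.npy")

def representative_data_gen():
    for sample in calibration_samples[:500]:
        yield [sample[np.newaxis, ...].astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
converter.optimizations = [tf.lite.Optimize.DEFAULT]          # enable quantization
converter.representative_dataset = representative_data_gen    # calibration data for activation ranges
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8                      # full-integer inputs/outputs
converter.inference_output_type = tf.int8

tflite_int8_model = converter.convert()
with open("model_int8.tflite", "wb") as f:
    f.write(tflite_int8_model)
```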

319.3.3 Real-World Results

MobileNetV2 on ImageNet:
- Float32: 72.0% accuracy, 14 MB, 300ms inference (Cortex-M7)
- Int8 PTQ: 70.8% accuracy, 3.5 MB, 75ms inference (4x faster, -1.2% accuracy)
- Int8 QAT: 71.6% accuracy, 3.5 MB, 75ms inference (4x faster, -0.4% accuracy)

YOLOv4-Tiny object detection:
- Float32: 40.2% mAP, 23 MB, 180ms
- Int8 PTQ: 39.1% mAP, 6 MB, 45ms (4x faster, -1.1% mAP)
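
Before trusting numbers like these for your own model, measure the artifact you actually produced. A minimal host-side sketch that checks the size of a converted model and runs it once through the TFLite interpreter (`model_int8.tflite` is the placeholder output of the previous sketch; host latency is only a sanity check, not the on-device Cortex-M timing quoted above):

```python
import os
import time
import numpy as np
import tensorflow as tf

# Flash footprint is simply the file size of the converted model.
print("model size: %.1f KB" % (os.path.getsize("model_int8.tflite") / 1024))

# Load the int8 model in the TFLite interpreter and run one inference.
interpreter = tf.lite.Interpreter(model_path="model_int8.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

x = np.zeros(inp["shape"], dtype=inp["dtype"])   # placeholder input; use real samples for accuracy checks
interpreter.set_tensor(inp["index"], x)

start = time.perf_counter()
interpreter.invoke()
elapsed_ms = (time.perf_counter() - start) * 1e3
print("host latency: %.1f ms (device latency must be profiled on target)" % elapsed_ms)
print("output shape:", interpreter.get_tensor(out["index"]).shape)
```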

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ecf0f1', 'fontSize': '14px'}}}%%
flowchart LR
    subgraph Float32["Float32 Model"]
        F_Size["14 MB"]
        F_Speed["300ms"]
        F_Acc["72.0%"]
    end

    subgraph Int8["Int8 Quantized"]
        I_Size["3.5 MB"]
        I_Speed["75ms"]
        I_Acc["71.6%"]
    end

    Float32 -->|"4x smaller"| Int8
    Float32 -->|"4x faster"| Int8
    Float32 -->|"-0.4% acc"| Int8

    style Float32 fill:#E67E22,stroke:#2C3E50,color:#fff
    style Int8 fill:#16A085,stroke:#2C3E50,color:#fff

Figure 319.1: Quantization tradeoffs: Float32 to Int8 conversion achieves 4x improvements with minimal accuracy loss

Worked Example: INT8 Quantization for Vibration Anomaly Detection

Scenario: A factory deploys vibration sensors on 500 motors for predictive maintenance. Each sensor runs a CNN-based anomaly detector on an STM32F4 microcontroller (1 MB Flash, 192 KB RAM). The original float32 model must be quantized to fit the hardware constraints.

Given:
  • Original model: 3-layer CNN, 890 KB of float32 weights, 94.2% anomaly detection accuracy
  • Target hardware: STM32F4 with 1 MB Flash, 192 KB RAM
  • Flash budget for model: 300 KB (the rest is needed for firmware and buffers)
  • RAM budget for inference: 80 KB tensor arena
  • Inference time requirement: < 50 ms per vibration window
  • Acceptable accuracy loss: < 2%

Steps:

  1. Analyze model size reduction needed:
    • Current size: 890 KB (float32)
    • Target size: 300 KB
    • Required compression: 890 KB / 300 KB = 2.97x minimum
    • INT8 quantization provides: 4x compression (32-bit to 8-bit)
    • Expected size after INT8: 890 KB / 4 = 222.5 KB (fits in budget)
  2. Calculate RAM requirements for tensor arena:
    • Largest layer activation: 128 channels x 32 samples = 4,096 values
    • INT8 activations: 4,096 x 1 byte = 4 KB per layer
    • Peak memory (input and output activations of the largest layer held simultaneously): 2 x 4 KB = 8 KB activations + 4 KB scratch buffers = 12 KB
    • With runtime overhead: ~35 KB total (well within 80 KB budget)
  3. Apply post-training quantization (PTQ):
    • Collect 500 representative vibration samples for calibration
    • Run PTQ with per-channel quantization for convolution layers
    • Measure accuracy on validation set: 92.8% (1.4% drop from 94.2%)
  4. Evaluate if QAT is needed:
    • Accuracy drop (1.4%) is within 2% tolerance
    • PTQ is sufficient; QAT would add training complexity for minimal gain
    • Decision: Use PTQ for faster deployment
  5. Measure inference performance:
    • Float32 on STM32F4: 180 ms per window (too slow)
    • INT8 on STM32F4 with CMSIS-NN: 38 ms per window (meets < 50 ms requirement)
    • Speedup: 180 / 38 = 4.7x

Result:
  • Final model size: 222.5 KB (fits the 300 KB Flash budget)
  • Inference time: 38 ms (meets the 50 ms requirement)
  • Accuracy: 92.8% (within 2% of the original 94.2%)
  • RAM usage: 35 KB (within the 80 KB budget)

Key Insight: INT8 quantization delivers a reliable 4x size reduction and 4-5x speedup on ARM Cortex-M processors. For most industrial anomaly detection tasks, post-training quantization (PTQ) with representative calibration data achieves production-quality accuracy without the complexity of quantization-aware training. Always verify that the calibration dataset includes edge cases (unusual vibration patterns, temperature extremes) to prevent accuracy collapse in production.
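
The calibration-coverage point is straightforward to put into practice. A hedged sketch of assembling a calibration set that deliberately mixes typical windows with rarer edge-case windows (the file names and the 80/20 split are hypothetical choices, not values from the scenario):

```python
import numpy as np

# Hypothetical arrays of preprocessed vibration windows, shape [N, window_len, channels].
normal_windows = np.load("normal_windows.npy")     # typical operating conditions
edge_windows = np.load("edge_case_windows.npy")    # startup transients, hot ambient, worn bearings

rng = np.random.default_rng(0)
calib = np.concatenate([
    normal_windows[rng.choice(len(normal_windows), 400, replace=False)],
    edge_windows[rng.choice(len(edge_windows), 100, replace=False)],
])
rng.shuffle(calib)

def representative_data_gen():
    for window in calib:
        yield [window[np.newaxis, ...].astype(np.float32)]

# Pass representative_data_gen to the TFLite converter as in the earlier PTQ sketch.
```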

319.4 Pruning: Removing Unnecessary Connections

Concept: Remove redundant neural network connections (set weights to zero), exploiting sparsity to reduce model size and computation.

319.4.1 Pruning Strategies

  1. Magnitude-Based Pruning: Remove weights with smallest absolute values
  2. Structured Pruning: Remove entire neurons, channels, or layers
  3. Unstructured Pruning: Remove individual weights (requires sparse matrix support)

319.4.2 Pruning Process

1. Train full model to convergence (baseline accuracy)
2. Identify low-magnitude weights (e.g., |weight| < 0.001)
3. Set these weights to zero (create sparsity)
4. Fine-tune remaining weights (recover accuracy)
5. Repeat steps 2-4 iteratively (gradual pruning)

Result: 70-90% of weights can typically be pruned with only a small (roughly 0.3-2%) accuracy loss; the drop grows toward the high end of that sparsity range.
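
A minimal NumPy sketch of the core of steps 2-3, magnitude-based masking at a target sparsity. The fine-tuning of steps 4-5 would happen in your training framework, which must keep the mask applied so pruned weights stay zero:

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction of weights (unstructured pruning)."""
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)                         # number of weights to remove
    threshold = np.partition(flat, k)[k] if k > 0 else 0.0
    mask = (np.abs(weights) > threshold).astype(weights.dtype)
    return weights * mask, mask

rng = np.random.default_rng(0)
w = rng.normal(0, 0.1, size=(64, 128)).astype(np.float32)

# Gradual pruning schedule: 50% -> 70% -> 90% sparsity, fine-tuning between steps.
for target in (0.5, 0.7, 0.9):
    w, mask = magnitude_prune(w, target)
    # ... fine-tune here, multiplying weight updates by `mask` so pruned weights stay zero ...
    print(f"sparsity {target:.0%}: {np.count_nonzero(w)} of {w.size} weights remain")
```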

319.4.3 Real-World Example

ResNet-50 for ImageNet:
- Dense model: 25.5M parameters, 76.1% accuracy
- 70% pruned: 7.6M parameters, 75.8% accuracy (-0.3%)
- 90% pruned: 2.5M parameters, 74.2% accuracy (-1.9%)

On a microcontroller:
- Dense float32 (25.5M x 4 bytes = 102 MB): nowhere near fitting in a 512 KB Flash part
- 90% pruned: 2.5M remaining weights x 1 byte (int8) = 2.5 MB -> still too large
- 90% pruned + int8 quantization + compressed sparse storage: ~600 KB -> deployable on MCUs with 1 MB of Flash, though still above a 512 KB part

319.5 Knowledge Distillation: Teacher-Student Training

Concept: Train a small “student” model to mimic a large “teacher” model’s behavior, transferring knowledge without transferring size.

319.5.1 Process

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ecf0f1', 'fontSize': '14px'}}}%%
flowchart TD
    subgraph Teacher["Teacher Model (Large)"]
        T1["ResNet-50"]
        T2["25M params"]
        T3["76% accuracy"]
    end

    subgraph Student["Student Model (Small)"]
        S1["MobileNetV2"]
        S2["3.5M params"]
        S3["72% accuracy"]
    end

    subgraph Training["Distillation Training"]
        Soft["Soft Labels<br/>(Teacher Predictions)"]
        Hard["Hard Labels<br/>(Ground Truth)"]
        Loss["Combined Loss"]
    end

    Teacher -->|"Produces"| Soft
    Soft --> Loss
    Hard --> Loss
    Loss -->|"Trains"| Student

    style Teacher fill:#7F8C8D,stroke:#2C3E50,color:#fff
    style Student fill:#16A085,stroke:#2C3E50,color:#fff
    style Training fill:#E67E22,stroke:#2C3E50,color:#fff

Figure 319.2: Knowledge distillation: Large teacher trains small student using soft probability labels

319.5.2 Why It Works

Teacher model’s soft predictions (e.g., “80% cat, 15% dog, 5% other”) contain more information than hard labels (“cat”), teaching the student about subtle feature relationships.
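
A hedged sketch of a combined distillation loss in Keras terms; the temperature and the hard/soft weighting are free parameters (the T = 4 and 0.3/0.7 split used in the worked example below are one reasonable choice, not a universal rule):

```python
import tensorflow as tf

def distillation_loss(y_true, student_logits, teacher_logits,
                      temperature=4.0, alpha=0.3):
    """alpha * hard-label cross-entropy + (1 - alpha) * soft-label KL divergence."""
    # Hard-label term: ordinary cross-entropy against the ground-truth class indices.
    hard = tf.reduce_mean(
        tf.keras.losses.sparse_categorical_crossentropy(
            y_true, student_logits, from_logits=True))
    # Soft-label term: match the teacher's temperature-softened class distribution.
    teacher_probs = tf.nn.softmax(teacher_logits / temperature)
    student_probs = tf.nn.softmax(student_logits / temperature)
    soft = tf.keras.losses.KLDivergence()(teacher_probs, student_probs)
    # Scaling by T^2 keeps the soft-label gradients comparable in magnitude to the hard ones.
    return alpha * hard + (1.0 - alpha) * soft * temperature ** 2

# In a custom training step, teacher_logits come from a frozen forward pass of the
# large model on the same batch; only the student's weights receive gradients.
```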

319.5.3 Combined Optimization Results

Object Detection Model for Edge Camera:
- Baseline: YOLOv4 (244 MB, 500ms, 43% mAP) -> Too large for edge
- Distillation: YOLOv4-Tiny trained with YOLOv4 teacher (23 MB, 180ms, 40% mAP)
- + Quantization: Int8 (6 MB, 45ms, 39% mAP)
- + Pruning 70%: (2 MB, 30ms, 38% mAP)
FINAL: 122x smaller, 16x faster, only -5% mAP -> Deployable on ESP32-S3!

Worked Example: Knowledge Distillation for Smart Agriculture Pest Detection

Scenario: An agricultural IoT startup needs to deploy a pest detection model on ESP32-CAM modules ($10 each) distributed across orchards. The cloud-based ResNet-50 model (100 MB, 97% accuracy) is too large for the 4 MB Flash ESP32. They must distill knowledge into a tiny MobileNetV3-Small student model.

Given:
  • Teacher model: ResNet-50, 100 MB float32, 97.2% pest detection accuracy
  • Target hardware: ESP32-CAM with 4 MB Flash, 520 KB PSRAM
  • Flash budget for model: 800 KB (after firmware and buffers)
  • Student architecture: MobileNetV3-Small backbone (1.9M parameters)
  • Training dataset: 50,000 labeled pest images (10 species)
  • Minimum acceptable accuracy: 92%

Steps:

  1. Baseline student model training (without distillation):
    • Train MobileNetV3-Small from scratch on pest dataset
    • Result: 89.3% accuracy (7.9% below teacher)
    • Model size: 7.6 MB float32 (too large for ESP32)
  2. Apply knowledge distillation:
    • Temperature T = 4 for softened probability distributions
    • Loss function: 0.3 x Hard Label Loss + 0.7 x Soft Label Loss (KL divergence)
    • Teacher provides soft labels showing confidence across all 10 pest species
    • Train student for 50 epochs with distillation loss
  3. Measure distilled student accuracy:
    • Distilled MobileNetV3-Small: 93.8% accuracy (only 3.4% below teacher)
    • Improvement from distillation: 93.8% - 89.3% = +4.5%
    • The soft labels taught the student subtle differences (e.g., aphid vs. mite shapes)
  4. Apply INT8 quantization to distilled student:
    • Float32 size: 7.6 MB
    • INT8 quantized size: 7.6 MB / 4 = 1.9 MB
    • Post-quantization accuracy: 93.1% (only -0.7% from quantization)
    • Still too large for 800 KB budget
  5. Apply structured pruning (50% channel reduction):
    • Prune 50% of channels in each layer based on L1 norm
    • Fine-tune pruned model for 10 epochs
    • Pruned + INT8 size: 1.9 MB x 0.5 = 950 KB (close to budget)
    • Accuracy after pruning + fine-tuning: 92.4%
  6. Final optimization with weight clustering (see the sketch after this list):
    • Cluster weights into 16 clusters (4-bit indices)
    • Combine with Huffman coding for additional compression
    • Final model size: 720 KB (fits 800 KB budget)
    • Final accuracy: 92.1% (meets 92% minimum)
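
A minimal NumPy sketch of the weight-clustering idea in step 6: plain k-means shrinks one layer's weights to 16 shared values addressed by 4-bit indices. Production pipelines would typically use a toolchain feature (for example, the clustering API in the TensorFlow Model Optimization toolkit), and Huffman coding of the indices is a separate packing step not shown here:

```python
import numpy as np

def cluster_weights(weights, n_clusters=16, iters=20):
    """Replace each weight with its nearest of n_clusters shared centroids (1-D k-means)."""
    flat = weights.ravel()
    centroids = np.linspace(flat.min(), flat.max(), n_clusters)   # spread initial centroids over the range
    for _ in range(iters):
        idx = np.argmin(np.abs(flat[:, None] - centroids[None, :]), axis=1)
        for c in range(n_clusters):
            members = flat[idx == c]
            if members.size:
                centroids[c] = members.mean()
    idx = np.argmin(np.abs(flat[:, None] - centroids[None, :]), axis=1)
    return idx.reshape(weights.shape).astype(np.uint8), centroids

rng = np.random.default_rng(0)
w = rng.normal(0, 0.05, size=(256, 64)).astype(np.float32)   # illustrative layer weights
indices, centroids = cluster_weights(w)

# Storage estimate: 4-bit indices plus 16 float32 centroids, versus 8-bit weights before clustering.
packed_bits = indices.size * 4 + centroids.size * 32
print("clustered storage: %.1f KB (before entropy coding)" % (packed_bits / 8 / 1024))
```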

Result:
  • Final model size: 720 KB (139x smaller than the 100 MB teacher)
  • Inference time on ESP32-CAM: 180 ms per image
  • Accuracy: 92.1% (only 5.1% below the 97.2% teacher)
  • Power consumption: 0.8 W during inference (battery-friendly)

Key Insight: Knowledge distillation provides “free” accuracy gains by transferring the teacher’s learned feature relationships to the student. In this example, distillation alone improved accuracy by 4.5% (89.3% to 93.8%), which created enough headroom to absorb the accuracy losses from subsequent quantization (-0.7%), pruning (-0.7%), and weight clustering (-0.3%) while still meeting the 92% target. The key is to apply distillation first, then optimize aggressively, knowing you have accuracy margin to trade away.

319.6 Optimization Selection Guide

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#7F8C8D', 'fontSize': '14px'}}}%%
flowchart TD
    Start["Deployment<br/>Constraint?"] --> Size{Primary<br/>Issue?}

    Size -->|"SIZE<br/>(Model too big)"| Q1["Apply QUANTIZATION<br/>4x size reduction"]
    Q1 --> StillBig{Still<br/>too big?}
    StillBig -->|"Yes"| P1["Add PRUNING<br/>+10x reduction"]
    StillBig -->|"No"| Done1["DEPLOY"]

    Size -->|"SPEED<br/>(Too slow)"| Q2["Apply QUANTIZATION<br/>4x speedup"]
    Q2 --> StillSlow{Still<br/>too slow?}
    StillSlow -->|"Yes"| D1["Use DISTILLATION<br/>Smaller architecture"]
    StillSlow -->|"No"| Done2["DEPLOY"]

    Size -->|"ACCURACY<br/>(Can't lose any)"| QAT["Use QAT<br/>Minimal accuracy loss"]
    QAT --> NeedMore{Need more<br/>optimization?}
    NeedMore -->|"Yes"| SP["Add STRUCTURED<br/>PRUNING"]
    NeedMore -->|"No"| Done3["DEPLOY"]

    P1 --> Done1
    D1 --> Done2
    SP --> Done3

    style Start fill:#2C3E50,stroke:#16A085,color:#fff
    style Q1 fill:#16A085,stroke:#2C3E50,color:#fff
    style Q2 fill:#16A085,stroke:#2C3E50,color:#fff
    style QAT fill:#16A085,stroke:#2C3E50,color:#fff
    style P1 fill:#E67E22,stroke:#2C3E50,color:#fff
    style D1 fill:#E67E22,stroke:#2C3E50,color:#fff
    style SP fill:#E67E22,stroke:#2C3E50,color:#fff
    style Done1 fill:#27AE60,stroke:#2C3E50,color:#fff
    style Done2 fill:#27AE60,stroke:#2C3E50,color:#fff
    style Done3 fill:#27AE60,stroke:#2C3E50,color:#fff

Figure 319.3: Model optimization technique selection based on deployment constraints

Optimization Priority: Quantization first (easiest, 4x benefit), then pruning or distillation based on whether size or architecture is the bottleneck. QAT when accuracy is critical.

319.7 Knowledge Check

319.8 Summary

Optimization Techniques:
  • Quantization: 4x size reduction (float32 -> int8), 2-4x speedup, <1% accuracy loss with QAT
  • Pruning: 70-90% of weights removable with roughly 0.3-2% accuracy impact, depending on sparsity
  • Knowledge Distillation: a small student model can reach 90-95% of a large teacher’s accuracy

Optimization Priority:
  1. Start with PTQ (simplest, reliable 4x improvement)
  2. Use QAT if the accuracy loss exceeds tolerance
  3. Add pruning for additional compression
  4. Use distillation when the architecture is fundamentally too large

Combined Results:
  • 10-100x total compression achievable
  • Enables deployment of powerful AI on $5 microcontrollers
  • Real-time inference on milliwatt power budgets

319.9 What’s Next

Now that you understand model optimization techniques, continue to: