23  Model Optimization for Edge AI

In 60 Seconds

Quantization converts float32 neural network weights to int8 for a 4x size reduction and 2-4x inference speedup with under 2% accuracy loss – always try this first. Trained networks contain 70-90% redundant weights that pruning can remove, and knowledge distillation trains a student model 10-50x smaller that reaches 90-95% of the teacher’s accuracy. Combined, these techniques deliver 10-100x total compression for edge deployment.

Key Concepts
  • Post-Training Quantization (PTQ): Technique converting trained float32 model weights and activations to int8 without retraining, achieving 4x model size reduction with <2% accuracy loss
  • Quantization-Aware Training (QAT): Training procedure inserting fake quantization operations to teach the model to tolerate quantization noise, recovering accuracy lost by aggressive PTQ
  • Model Pruning: Removing neural network weights below a magnitude threshold (unstructured) or eliminating entire filters (structured), reducing model size by 50-90% with fine-tuning
  • Knowledge Distillation: Training a small student model to mimic the output distribution of a large teacher model, achieving better accuracy than training the small model from scratch
  • TensorFlow Lite (TFLite): Google’s framework for converting and running optimized neural network models on edge devices and microcontrollers
  • ONNX Runtime: Cross-platform inference engine executing ONNX-format models on diverse edge hardware with hardware-specific optimization backends
  • Operator Fusion: Compiler optimization combining consecutive neural network operations (Conv + BN + ReLU) into a single kernel, reducing memory bandwidth and kernel launch overhead
  • Calibration Dataset: Representative sample of real-world input data used during PTQ to determine optimal int8 scale factors for each layer’s activation distribution

23.1 Learning Objectives

By the end of this chapter, you will be able to:

  • Apply Quantization: Convert float32 models to int8 for 4x size reduction with <1% accuracy loss
  • Implement Pruning: Remove 70-90% of neural network weights while preserving accuracy
  • Apply Knowledge Distillation: Train compact student models to match large teacher performance
  • Design Optimization Pipelines: Combine techniques for 10-100x total compression
  • Select the Right Technique: Choose between PTQ, QAT, pruning, and distillation based on constraints

Minimum Viable Understanding

Before diving into the details, here are the three things you absolutely must know about edge AI model optimization:

  • Quantization delivers 4x compression reliably: Converting neural network weights from 32-bit floating point to 8-bit integers cuts model size by 4x and speeds inference 2-4x on ARM Cortex-M processors, with typical accuracy loss under 2%. Post-training quantization (PTQ) requires no retraining and is always the first technique to try.
  • Neural networks are massively over-parameterized: Trained models contain 70-90% redundant weights that can be pruned (set to zero) with less than 1% accuracy loss after fine-tuning. This means most of the “brain” is doing nothing useful and can be safely removed.
  • Knowledge distillation transfers intelligence, not architecture: A large “teacher” model can train a small “student” model to achieve 90-95% of the teacher’s accuracy at 10-50x smaller size. The student learns from the teacher’s soft probability outputs, which encode rich inter-class relationships that hard labels cannot capture.

23.2 Introduction

Running full-scale neural networks on edge devices requires aggressive optimization. A cloud-trained model might consume 200 MB of memory and require a powerful GPU, but the target edge device – an ESP32 or STM32 microcontroller – has only 1-4 MB of Flash and a few hundred kilobytes of RAM. Bridging this gap requires model optimization: techniques that compress and accelerate neural networks while preserving their accuracy.

Three primary techniques enable deploying powerful AI on resource-constrained hardware: quantization (reducing numerical precision), pruning (removing unnecessary connections), and knowledge distillation (training smaller models to mimic larger ones). These techniques can be applied individually or combined for maximum compression.

Overview diagram showing three model optimization techniques for edge AI: quantization reduces numerical precision from float32 to int8, pruning removes redundant neural network connections, and knowledge distillation transfers knowledge from a large teacher to a small student model. All three feed into a combined optimization pipeline achieving 10-100x compression.

Hey Sensor Squad! Imagine you have a super-smart robot brain that can recognize every type of bug in a garden. The problem? That brain is as big as a refrigerator and needs a huge battery to run!

Now imagine you want to put that bug-detecting brain into a tiny camera the size of a coin, running on a watch battery. You need to make the brain much, much smaller – but still just as smart.

That is exactly what model optimization does for IoT devices:

  • Quantization is like writing your math homework in shorthand instead of full sentences. It takes less space but means the same thing!
  • Pruning is like trimming a bush – you cut off the branches that are not doing anything useful, and the bush still looks great.
  • Knowledge Distillation is like a wise teacher (the big brain) tutoring a young student (the tiny brain). The student learns all the important lessons without needing to be as big as the teacher.

Real-world example: A pest detection camera on a farm needs to tell the difference between a helpful ladybug and a harmful aphid. The big cloud model knows this perfectly, but it is too large for the tiny camera. By optimizing it, we shrink the model from the size of a dictionary to the size of a sticky note – and it still knows its bugs!

When engineers build AI systems in the cloud, they use powerful computers with lots of memory and fast processors. But IoT devices – like a smart camera on a farm or a vibration sensor in a factory – have tiny processors and very little memory. A cloud AI model might be 200 megabytes, while the device only has 1 megabyte of space.

Model optimization means making that AI model much smaller and faster so it can run on tiny devices, without losing too much of its “smartness.”

There are three main approaches:

  • Quantization means using less precise numbers. Instead of storing a weight as 0.23456789 (which takes 4 bytes), you store it as roughly 60 out of 255 (which takes only 1 byte). You lose a tiny bit of precision, but the model still works well and is 4 times smaller.
  • Pruning means removing parts of the neural network that are not contributing much. Research shows that 70-90% of the connections in a trained model are redundant. Removing them is like clearing out unused apps from your phone – everything still works, but with more free space.
  • Knowledge distillation means training a small, efficient model to copy the behavior of a large, accurate model. The large model acts as a “teacher” and the small one acts as a “student.” The student never becomes quite as smart as the teacher, but it gets remarkably close while being 10-50 times smaller.

These techniques can be combined. In practice, engineers often use all three together to shrink a model by 100 times or more – enough to run on a chip that costs less than a dollar.

23.3 Quantization: Reducing Precision

Concept: Reduce numerical precision from 32-bit floats to 8-bit integers, achieving 4x size reduction and 2-4x speedup with minimal accuracy loss.

23.3.1 How Quantization Works

Quantization maps floating-point values to a smaller set of discrete integer values. The key formula is:

\[\text{quantized\_value} = \text{round}\left(\frac{\text{float\_value}}{\text{scale}}\right) + \text{zero\_point}\]

where:

\[\text{scale} = \frac{\text{max} - \text{min}}{2^{bits} - 1}\]

Quantization error accumulates through network layers but remains bounded:

\[\text{max error per weight} = \frac{\text{scale}}{2} = \frac{\text{range}}{2 \times (2^{bits} - 1)}\]

Worked example: for weights in [-0.5, +0.5] quantized to int8, scale = 1.0 ÷ 255 = 0.00392, so the maximum rounding error per weight is 0.00392 ÷ 2 = 0.00196 (about 0.4% of the range). For a 10-layer network with 1M weights, the cumulative error is NOT 10 × 0.4% = 4%; because independent rounding errors partly cancel (central limit theorem), it grows only as √10 × 0.4% ≈ 1.26%. This explains why deep networks tolerate quantization well despite their many layers.

Property Float32 Int8
Memory per weight 4 bytes 1 byte
Range +/-3.4 x 10^38 -128 to 127
Precision ~7 decimal digits 256 discrete levels
Representation sign x 2^exponent x mantissa scale x (value - zero_point)
Quantization Worked Example

Converting weights in range [-0.5, 0.5] to Int8:

  1. Calculate scale: scale = (0.5 - (-0.5)) / (127 - (-128)) = 1.0 / 255 = 0.00392
  2. Zero point: 0 (symmetric range)
  3. Convert float weight 0.234: 0.234 / 0.00392 = 60 (Int8)
  4. Convert float weight -0.156: -0.156 / 0.00392 = -40 (Int8)
  5. Reconstruction: 60 x 0.00392 = 0.235 (error: ~0.001, about 0.5% of the original weight)

The quantization error is tiny because 256 discrete levels are sufficient to represent neural network weight distributions.
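
The same arithmetic can be checked in a few lines of Python. The NumPy sketch below quantizes the two example weights with the symmetric scale computed above and reports the reconstruction error; the variable names and printed values are illustrative only.

```python
import numpy as np

# Symmetric int8 quantization for the worked example: range [-0.5, 0.5], zero_point = 0
scale = (0.5 - (-0.5)) / 255            # = 0.00392
weights = np.array([0.234, -0.156])

quantized = np.clip(np.round(weights / scale), -128, 127).astype(np.int8)
reconstructed = quantized.astype(np.float32) * scale
error = np.abs(reconstructed - weights)

print(quantized)      # [ 60 -40]
print(reconstructed)  # [ 0.235... -0.156...]
print(error.max())    # ~0.0012, below the scale/2 = 0.00196 rounding bound
```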

23.3.2 Quantization Types

Decision tree diagram showing three quantization types: Post-Training Quantization for quick deployment with 0.5-2% accuracy loss, Quantization-Aware Training for best accuracy with less than 0.5% loss but requiring retraining, and Dynamic Range Quantization as a balanced approach with 1-3% loss where weights are int8 but activations remain float32.

Type When Applied Accuracy Impact Use Case
Post-Training Quantization (PTQ) After training on float32 model 0.5-2% loss Easiest, no retraining needed
Quantization-Aware Training (QAT) During training, simulates quantization <0.5% loss Best accuracy, requires retraining
Dynamic Range Quantization Weights int8, activations float32 1-3% loss Balanced approach
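
If PTQ accuracy falls short, quantization-aware training usually recovers most of the loss. Below is a minimal QAT sketch using the TensorFlow Model Optimization Toolkit; the model file name, training dataset (train_ds), and epoch count are placeholders for your own setup.

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

base_model = tf.keras.models.load_model("model_fp32.h5")        # hypothetical path

# Insert fake-quantization ops so the model learns to tolerate int8 rounding noise
qat_model = tfmot.quantization.keras.quantize_model(base_model)
qat_model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
qat_model.fit(train_ds, epochs=3)    # brief fine-tuning is usually enough

# Convert the QAT model to a quantized TFLite flatbuffer
converter = tf.lite.TFLiteConverter.from_keras_model(qat_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()
```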

23.3.3 Real-World Results

Model Configuration Accuracy Size Inference Time Notes
MobileNetV2 (ImageNet) Float32 72.0% 14 MB 300ms (Cortex-M7) Baseline
MobileNetV2 (ImageNet) Int8 PTQ 70.8% 3.5 MB 75ms 4x faster, -1.2%
MobileNetV2 (ImageNet) Int8 QAT 71.6% 3.5 MB 75ms 4x faster, -0.4%
YOLOv4-Tiny (Detection) Float32 40.2% mAP 23 MB 180ms Baseline
YOLOv4-Tiny (Detection) Int8 PTQ 39.1% mAP 6 MB 45ms 4x faster, -1.1%

Post-training quantization requires a representative calibration dataset (typically 100-500 samples) to determine the optimal scale and zero-point for each layer. If your calibration data does not cover edge cases (e.g., unusual sensor readings, extreme temperatures), the quantized model may suffer severe accuracy degradation on those inputs – even if average accuracy looks acceptable. Always include boundary conditions in your calibration set.
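
A minimal TensorFlow Lite PTQ sketch with a calibration generator is shown below; calibration_samples, float_model, and the output file name are placeholders for your own data and model, and the converter settings force full int8 quantization as typically required for microcontroller targets.

```python
import tensorflow as tf

def representative_dataset():
    # Yield a few hundred real input windows, including boundary conditions
    # (unusual sensor readings, temperature extremes)
    for sample in calibration_samples[:500]:        # hypothetical array of real inputs
        yield [sample.astype("float32")[None, ...]]

converter = tf.lite.TFLiteConverter.from_keras_model(float_model)   # trained Keras model
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# Force full integer quantization (weights and activations) for MCU deployment
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

with open("model_int8.tflite", "wb") as f:
    f.write(converter.convert())
```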

Quantization trade-offs chart comparing float32, float16, and int8 precision with model size and accuracy impact
Figure 23.1: Quantization tradeoffs: Float32 to Int8 conversion achieves 4x improvements with minimal accuracy loss
Worked Example: INT8 Quantization for Vibration Anomaly Detection

Scenario: A factory deploys vibration sensors on 500 motors for predictive maintenance. Each sensor runs a CNN-based anomaly detector on an STM32F4 microcontroller (1 MB Flash, 192 KB RAM). The original float32 model must be quantized to fit the hardware constraints.

Given:

  • Original model: 3-layer CNN, 890 KB float32 weights, 94.2% anomaly detection accuracy
  • Target hardware: STM32F4 with 1 MB Flash, 192 KB RAM
  • Flash budget for model: 300 KB (rest needed for firmware, buffers)
  • RAM budget for inference: 80 KB tensor arena
  • Inference time requirement: < 50 ms per vibration window
  • Acceptable accuracy loss: < 2%

Steps:

  1. Analyze model size reduction needed:
    • Current size: 890 KB (float32)
    • Target size: 300 KB
    • Required compression: 890 KB / 300 KB = 2.97x minimum
    • INT8 quantization provides: 4x compression (32-bit to 8-bit)
    • Expected size after INT8: 890 KB / 4 = 222.5 KB (fits in budget)
  2. Calculate RAM requirements for tensor arena:
    • Largest layer activation: 128 channels x 32 samples = 4,096 values
    • INT8 activations: 4,096 x 1 byte = 4 KB per layer
    • Peak memory (input, output, and intermediate tensors held simultaneously): ~16 KB activations + 4 KB buffers = 20 KB
    • With runtime overhead: ~35 KB total (well within 80 KB budget)
  3. Apply post-training quantization (PTQ):
    • Collect 500 representative vibration samples for calibration
    • Run PTQ with per-channel quantization for convolution layers
    • Measure accuracy on validation set: 92.8% (1.4% drop from 94.2%)
  4. Evaluate if QAT is needed:
    • Accuracy drop (1.4%) is within 2% tolerance
    • PTQ is sufficient; QAT would add training complexity for minimal gain
    • Decision: Use PTQ for faster deployment
  5. Measure inference performance:
    • Float32 on STM32F4: 180 ms per window (too slow)
    • INT8 on STM32F4 with CMSIS-NN: 38 ms per window (meets < 50 ms requirement)
    • Speedup: 180 / 38 = 4.7x

Result:

  • Final model size: 222.5 KB (fits 300 KB Flash budget)
  • Inference time: 38 ms (meets 50 ms requirement)
  • Accuracy: 92.8% (within 2% of original 94.2%)
  • RAM usage: 35 KB (within 80 KB budget)

Key Insight: INT8 quantization delivers a reliable 4x size reduction and 4-5x speedup on ARM Cortex-M processors. For most industrial anomaly detection tasks, post-training quantization (PTQ) with representative calibration data achieves production-quality accuracy without the complexity of quantization-aware training. Always verify that the calibration dataset includes edge cases (unusual vibration patterns, temperature extremes) to prevent accuracy collapse in production.
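
The accuracy figure measured in step 3 can be verified by running the quantized model directly with the TFLite interpreter. The sketch below assumes a held-out validation set (X_val, y_val) of vibration windows and the file name used above; both are illustrative.

```python
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="vibration_cnn_int8.tflite")  # hypothetical file
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

correct = 0
for window, label in zip(X_val, y_val):             # hypothetical validation data
    # Full-int8 models expect inputs scaled with the stored quantization parameters
    scale, zero_point = inp["quantization"]
    q_window = np.round(window / scale + zero_point).astype(np.int8)
    interpreter.set_tensor(inp["index"], q_window[None, ...])
    interpreter.invoke()
    pred = int(np.argmax(interpreter.get_tensor(out["index"])))
    correct += int(pred == label)

print(f"Quantized accuracy: {correct / len(y_val):.1%}")   # expect ~92.8% per the example
```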

23.4 Pruning: Removing Unnecessary Connections

Concept: Remove redundant neural network connections (set weights to zero), exploiting the inherent sparsity in trained models to reduce size and computation.

23.4.1 Pruning Strategies

Strategy What It Removes Hardware Support Compression Best For
Magnitude-Based Smallest absolute value weights Sparse matrix libraries 2-10x General use
Structured Pruning Entire neurons, channels, or layers Standard hardware (no sparse support needed) 2-5x MCU deployment
Unstructured Pruning Individual weights anywhere Requires sparse matrix acceleration 5-20x GPUs with sparse support

Structured vs. Unstructured: Structured pruning is preferred for microcontrollers because it produces dense (non-sparse) sub-networks that run efficiently on standard hardware. Unstructured pruning creates sparse matrices that require specialized hardware or libraries (like XNNPACK) to achieve actual speedup.

23.4.2 Pruning Process

Iterative pruning process flowchart showing five steps in a cycle: train full model to convergence, identify low-magnitude weights below a threshold, set those weights to zero creating sparsity, fine-tune remaining weights to recover accuracy, then repeat until reaching 70-90% sparsity with less than 1% accuracy loss.

Key principle: Gradual pruning (removing 10-20% of weights per iteration, then fine-tuning) consistently outperforms one-shot pruning. The fine-tuning step allows remaining weights to compensate for removed connections.
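
A minimal gradual-pruning sketch using the TensorFlow Model Optimization Toolkit is shown below; base_model, train_ds, the 80% target sparsity, and the step counts are placeholders chosen for illustration.

```python
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Ramp sparsity gradually from 0% to 80% during fine-tuning instead of one-shot pruning
schedule = tfmot.sparsity.keras.PolynomialDecay(
    initial_sparsity=0.0, final_sparsity=0.8, begin_step=0, end_step=5000)

pruned = tfmot.sparsity.keras.prune_low_magnitude(base_model, pruning_schedule=schedule)
pruned.compile(optimizer="adam",
               loss="sparse_categorical_crossentropy",
               metrics=["accuracy"])

# UpdatePruningStep advances the sparsity schedule and applies the masks each training step
pruned.fit(train_ds, epochs=5,
           callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])

# Remove the pruning wrappers before export; pruned weights remain zero
final_model = tfmot.sparsity.keras.strip_pruning(pruned)
```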

23.4.3 Real-World Example: ResNet-50 on ImageNet

Configuration Parameters Accuracy Size (Int8) Fits a small-MCU Flash budget?
Dense (baseline) 25.5M 76.1% 25.5 MB No
70% pruned 7.6M 75.8% (-0.3%) 7.6 MB No
90% pruned 2.5M 74.2% (-1.9%) 2.5 MB No
90% pruned + Int8 + compression 2.5M sparse 73.8% (-2.3%) 600 KB Yes

This example illustrates why combining techniques is essential for extreme compression targets. Pruning alone removed 90% of the parameters (a 10x reduction), yet even the pruned int8 model (2.5 MB) did not fit. Compressing the sparse weight encoding provided the final squeeze to roughly 600 KB – about 42x smaller than the dense int8 baseline, and well over 100x smaller than the original float32 model.

23.5 Knowledge Distillation: Teacher-Student Training

Concept: Train a small “student” model to mimic a large “teacher” model’s behavior, transferring learned knowledge through soft probability outputs rather than transferring the architecture or weights directly.

23.5.1 The Distillation Process

Knowledge distillation process diagram showing a large teacher model producing soft probability labels with temperature scaling, which are used alongside hard ground truth labels to train a small student model. The loss function combines soft label loss weighted at 0.7 with hard label loss weighted at 0.3, producing a compact student that achieves 90-95% of the teacher's accuracy.
Figure 23.2: Knowledge distillation: Large teacher trains small student using soft probability labels

23.5.2 Why Soft Labels Work

The teacher model’s soft predictions contain dark knowledge – information about class relationships that hard labels lack:

Label Type Example Output Information Content
Hard label “cat” Binary: correct or incorrect
Soft label (T=1) 95% cat, 3% dog, 2% other Some class similarity info
Soft label (T=4) 80% cat, 15% dog, 5% other Rich inter-class relationships

The temperature parameter T controls how much information the soft labels reveal. Higher temperatures (T=3-5) “soften” the probability distribution, revealing the teacher’s learned understanding that cats look somewhat like dogs but not at all like cars. The student learns these subtle relationships that would take far more training data to discover independently.

Temperature Intuition

At T=1 (standard softmax), the teacher confidently says “99% cat.” At T=4, it reveals nuance: “80% cat, 15% dog.” The student learns that when unsure, cat-like features overlap with dog-like features – knowledge that improves generalization on unseen data. After training, the student uses T=1 for inference (sharp predictions).
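
A minimal sketch of the distillation loss is shown below (TensorFlow, with illustrative defaults T = 4 and a 0.7/0.3 soft/hard weighting as used later in this chapter); the teacher and student logits are assumed to come from your own models.

```python
import tensorflow as tf

def distillation_loss(y_true, teacher_logits, student_logits, T=4.0, alpha=0.7):
    # Softened teacher and student distributions at temperature T
    soft_teacher = tf.nn.softmax(teacher_logits / T)
    log_soft_student = tf.nn.log_softmax(student_logits / T)

    # KL divergence between the softened distributions, scaled by T^2 (Hinton et al.)
    soft_loss = tf.reduce_mean(tf.reduce_sum(
        soft_teacher * (tf.math.log(soft_teacher + 1e-8) - log_soft_student), axis=-1)) * T**2

    # Standard cross-entropy against the hard ground-truth labels
    hard_loss = tf.reduce_mean(tf.keras.losses.sparse_categorical_crossentropy(
        y_true, student_logits, from_logits=True))

    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```

At inference time the student simply uses its own logits at T = 1, so the temperature adds no runtime cost.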

23.5.3 Combined Optimization Pipeline

Combining all three techniques achieves extreme compression:

Combined optimization pipeline for an edge camera object detection model showing four sequential stages: baseline YOLOv4 at 244MB and 43% accuracy, then knowledge distillation producing YOLOv4-Tiny at 23MB and 40% accuracy, then int8 quantization reducing to 6MB and 39% accuracy, then 70% pruning achieving 2MB and 38% accuracy. The final result is 122x smaller and 16x faster, deployable on ESP32-S3.

Result: 122x smaller, 16x faster, only -5% mAP – deployable on ESP32-S3!

The ordering matters: distillation first (changes architecture), then quantization (reduces precision), then pruning (removes redundancy). Each stage builds on the previous one’s output.

Worked Example: Knowledge Distillation for Smart Agriculture Pest Detection

Scenario: An agricultural IoT startup needs to deploy a pest detection model on ESP32-CAM modules ($10 each) distributed across orchards. The cloud-based ResNet-50 model (100 MB, 97% accuracy) is too large for the 4 MB Flash ESP32. They must distill knowledge into a tiny MobileNetV3-Small student model.

Given:

  • Teacher model: ResNet-50, 100 MB float32, 97.2% pest detection accuracy
  • Target hardware: ESP32-CAM with 4 MB Flash, 520 KB PSRAM
  • Flash budget for model: 800 KB (after firmware and buffers)
  • Student architecture: MobileNetV3-Small backbone (1.9M parameters)
  • Training dataset: 50,000 labeled pest images (10 species)
  • Minimum acceptable accuracy: 92%

Steps:

  1. Baseline student model training (without distillation):
    • Train MobileNetV3-Small from scratch on pest dataset
    • Result: 89.3% accuracy (7.9% below teacher)
    • Model size: 7.6 MB float32 (too large for ESP32)
  2. Apply knowledge distillation:
    • Temperature T = 4 for softened probability distributions
    • Loss function: 0.3 x Hard Label Loss + 0.7 x Soft Label Loss (KL divergence)
    • Teacher provides soft labels showing confidence across all 10 pest species
    • Train student for 50 epochs with distillation loss
  3. Measure distilled student accuracy:
    • Distilled MobileNetV3-Small: 93.8% accuracy (only 3.4% below teacher)
    • Improvement from distillation: 93.8% - 89.3% = +4.5%
    • The soft labels taught the student subtle differences (e.g., aphid vs. mite shapes)
  4. Apply INT8 quantization to distilled student:
    • Float32 size: 7.6 MB
    • INT8 quantized size: 7.6 MB / 4 = 1.9 MB
    • Post-quantization accuracy: 93.1% (only -0.7% from quantization)
    • Still too large for 800 KB budget
  5. Apply structured pruning (50% channel reduction):
    • Prune 50% of channels in each layer based on L1 norm
    • Fine-tune pruned model for 10 epochs
    • Pruned + INT8 size: 1.9 MB x 0.5 = 950 KB (close to budget)
    • Accuracy after pruning + fine-tuning: 92.4%
  6. Final optimization with weight clustering:
    • Cluster weights into 16 clusters (4-bit indices)
    • Combine with Huffman coding for additional compression
    • Final model size: 720 KB (fits 800 KB budget)
    • Final accuracy: 92.1% (meets 92% minimum)

Result:

  • Final model size: 720 KB (139x smaller than 100 MB teacher)
  • Inference time on ESP32-CAM: 180 ms per image
  • Accuracy: 92.1% (only 5.1% below 97.2% teacher)
  • Power consumption: 0.8W during inference (battery-friendly)

Key Insight: Knowledge distillation provides “free” accuracy gains by transferring the teacher’s learned feature relationships to the student. In this example, distillation alone improved accuracy by 4.5% (89.3% to 93.8%), which created enough headroom to absorb the accuracy losses from subsequent quantization (-0.7%), pruning (-0.7%), and weight clustering (-0.3%) while still meeting the 92% target. The key is to apply distillation first, then optimize aggressively knowing you have accuracy margin to trade away.
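
Step 6’s weight clustering can be prototyped with the TensorFlow Model Optimization Toolkit as sketched below; distilled_student and train_ds are placeholders, and the Huffman/entropy coding mentioned above is typically applied by the serialization or flashing toolchain rather than this API.

```python
import tensorflow_model_optimization as tfmot

cluster_weights = tfmot.clustering.keras.cluster_weights
CentroidInit = tfmot.clustering.keras.CentroidInitialization

# Restrict each layer's weights to 16 shared values (a 4-bit index per weight)
clustered = cluster_weights(distilled_student,          # hypothetical distilled model
                            number_of_clusters=16,
                            cluster_centroids_init=CentroidInit.KMEANS_PLUS_PLUS)
clustered.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
clustered.fit(train_ds, epochs=3)     # brief fine-tune to recover accuracy

# Strip the clustering wrappers before quantization and TFLite export
final_model = tfmot.clustering.keras.strip_clustering(clustered)
```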

23.6 Optimization Selection Guide

Choosing the right optimization technique depends on your constraints. Use this decision framework:

Decision flowchart for selecting model optimization techniques. Start by checking if the model fits in target memory with 4x reduction; if yes use quantization alone. If not, check if a smaller architecture exists; if yes use knowledge distillation plus quantization. If no smaller architecture exists, apply pruning plus quantization. If accuracy requirements are strict, add quantization-aware training. All paths lead to deployment validation.
Figure 23.3: Model optimization technique selection based on deployment constraints

23.6.1 Quick Reference: Optimization Techniques Compared

Technique Compression Accuracy Impact Effort When to Use
PTQ 4x 0.5-2% loss Low (hours) Always start here
QAT 4x <0.5% loss Medium (days) When PTQ accuracy is insufficient
Structured Pruning 2-5x 0.5-2% loss Medium (days) Model slightly too large after quantization
Unstructured Pruning 5-20x 1-3% loss Medium (days) GPU/NPU with sparse support
Knowledge Distillation 10-50x 3-7% loss High (weeks) Model fundamentally too large
Combined Pipeline 10-100x 3-8% loss High (weeks) Extreme compression needed

Optimization Priority: Quantization first (easiest, 4x benefit), then pruning or distillation based on whether size or architecture is the bottleneck. QAT when accuracy is critical.

Common Mistake: Over-Optimizing

Applying aggressive pruning (>90%) and extreme quantization (4-bit) simultaneously often causes accuracy collapse. A better approach is to optimize incrementally, validating accuracy at each step. If one technique causes unacceptable loss, back off and try a different combination rather than pushing all techniques to their limits.

23.7 Knowledge Check

23.8 Summary

This chapter covered three fundamental techniques for deploying neural networks on resource-constrained IoT devices.

23.8.1 Key Techniques

Technique Mechanism Typical Compression Accuracy Impact Effort
Quantization (PTQ) Float32 to Int8 precision 4x 0.5-2% loss Low
Quantization (QAT) Train with simulated quantization 4x <0.5% loss Medium
Pruning Remove redundant weights/channels 2-10x 0.5-2% loss Medium
Knowledge Distillation Teacher trains compact student 10-50x 3-7% loss High
Combined Pipeline All techniques sequenced 10-100x 3-8% loss High

23.8.2 Optimization Priority

  1. Start with PTQ – simplest, reliable 4x improvement with no retraining
  2. Use QAT if PTQ accuracy loss exceeds tolerance (requires retraining access)
  3. Add structured pruning for additional 2-5x compression on MCU targets
  4. Use knowledge distillation when the original architecture is fundamentally too large (>10x compression needed)
  5. Combine techniques for extreme compression (100x+), applying distillation first to create accuracy headroom

23.8.3 Key Takeaways

  • Neural networks are massively over-parameterized – 70-90% of weights can be pruned with minimal accuracy impact
  • Quantization from float32 to int8 delivers a reliable 4x compression and 2-4x speedup on ARM Cortex-M processors
  • Knowledge distillation provides “free” accuracy gains (4-5%) that create headroom for subsequent aggressive compression
  • The sequencing of optimization matters: distillation first (change architecture), then quantization (reduce precision), then pruning (remove redundancy)
  • Combined pipelines enable deploying powerful AI models on devices costing under $10 with milliwatt power budgets

23.9 Common Pitfalls

Post-training quantization without a representative calibration dataset (100-1000 samples from the target deployment environment) produces poorly fitted per-layer scale factors and can degrade accuracy by 5-15%. Always collect calibration data from the same hardware and conditions where the model will run.

Pruning a model that has not yet fully converged removes weights that would have become significant with more training. Prune only after the model reaches plateau accuracy, and always fine-tune for 5-10 epochs after each pruning round to allow remaining weights to compensate.

A model that is 4x smaller is not automatically 4x faster on every hardware target. DRAM access patterns, operator fusion, and SIMD availability all affect latency independently of model size. Always benchmark actual inference time on the target hardware after each optimization step.

Exporting a model to TFLite or ONNX with unfrozen batch normalization layers produces models that behave differently in inference mode (single-sample batch) versus training mode (mini-batch). Always call model.eval() (PyTorch) or set training=False (TensorFlow) and fuse BN layers before export.
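
A minimal PyTorch sketch of this export-time fix is shown below; the model, checkpoint path, layer names passed to fuse_modules, and the input shape are all illustrative.

```python
import torch

model.load_state_dict(torch.load("student.pt"))     # 'model' defined elsewhere; path is illustrative
model.eval()                                         # freeze BatchNorm / Dropout behavior

# Fuse Conv + BN (+ ReLU) so the exported graph matches inference-time arithmetic
fused = torch.ao.quantization.fuse_modules(
    model, [["conv1", "bn1", "relu1"]])              # hypothetical layer names

dummy_input = torch.randn(1, 3, 96, 96)              # match your model's input shape
torch.onnx.export(fused, dummy_input, "student.onnx", opset_version=13)
```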

23.10 What’s Next

Now that you can apply model optimization techniques, continue to:

Topic Chapter Description
Hardware Accelerators for Edge AI edge-ai-ml-hardware.html Choose the right NPU, TPU, GPU, or FPGA for your optimized model and understand how CMSIS-NN accelerates quantized inference on ARM Cortex-M
Edge AI Applications edge-ai-ml-applications.html Analyze optimization techniques applied to real-world use cases including predictive maintenance, visual inspection, and environmental monitoring
Edge AI Deployment Pipeline edge-ai-ml-applications.html Design end-to-end workflows from cloud training through optimization to production deployment on edge devices
TinyML and Ultra-Low-Power AI edge-ai-ml-tinyml.html Implement machine learning on microcontrollers with sub-milliwatt power budgets