```mermaid
%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#7F8C8D', 'fontSize': '14px'}}}%%
flowchart TD
    Start["Model Size?"] --> Size{"< 200 KB?"}
    Size -->|"Yes"| MCU["Microcontroller<br/>STM32, ESP32<br/>< $25, < 50mW"]
    Size -->|"No"| Power{"Power Budget?"}
    Power -->|"< 5W"| NPU["NPU/TPU<br/>Coral, Movidius<br/>$25-100, 2-5W"]
    Power -->|"5-60W"| Latency{"Latency<br/>Requirement?"}
    Latency -->|"< 10ms"| Custom{"Custom<br/>Operations?"}
    Latency -->|"> 10ms OK"| GPU["Edge GPU<br/>Jetson Nano/Xavier<br/>$99-399, 5-15W"]
    Custom -->|"Standard Ops"| NPU
    Custom -->|"Custom Ops"| FPGA["FPGA<br/>Zynq, Cyclone<br/>$200-500, 5-20W"]
    style Start fill:#2C3E50,stroke:#16A085,color:#fff
    style MCU fill:#16A085,stroke:#2C3E50,color:#fff
    style NPU fill:#16A085,stroke:#2C3E50,color:#fff
    style GPU fill:#E67E22,stroke:#2C3E50,color:#fff
    style FPGA fill:#7F8C8D,stroke:#2C3E50,color:#fff
```
317 Hardware Accelerators for Edge AI
317.1 Learning Objectives
By the end of this chapter, you will be able to:
- Compare NPUs, GPUs, and FPGAs: Understand the tradeoffs between different hardware accelerators
- Select Appropriate Hardware: Choose optimal edge AI hardware based on model size, power, latency, and cost constraints
- Understand TOPS vs GFLOPS: Interpret accelerator metrics and apply them to real workloads
- Apply Selection Framework: Use decision trees to guide hardware selection for different use cases
317.2 Introduction
While microcontrollers handle simple models, complex AI requires specialized hardware: Neural Processing Units (NPUs), Tensor Processing Units (TPUs), GPUs, or FPGAs. Each accelerator type has different strengths for edge AI deployment.
317.3 Neural Processing Units (NPUs)
NPUs are application-specific integrated circuits (ASICs) optimized for neural network inference, offering high throughput at low power.
317.3.1 Popular Edge NPUs
| NPU | TOPS | Power | Cost | Use Case | Example Device |
|---|---|---|---|---|---|
| Google Coral Edge TPU | 4 TOPS (int8) | 2W | $25 (M.2/Mini PCIe) - $60 (USB) | USB accelerator, dev board | Coral Dev Board, USB Accelerator |
| Intel Movidius Myriad X | 1 TOPS (fp16) | 1-2W | $50 | Vision processing | Intel Neural Compute Stick 2 |
| Arm Ethos-U55 | 0.5 TOPS (int8) | 0.5W | Integrated | Microcontroller-class ML | Arm Cortex-M55 + Ethos-U55 |
| Qualcomm Hexagon 690 | 15 TOPS (int8) | 3-5W | Integrated | Smartphone AI | Snapdragon 888 |
| Apple Neural Engine | 11 TOPS (A14) | 5W | Integrated | iOS ML, privacy-focused | iPhone 12/13/14 |
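Raw TOPS alone can mislead when comparing NPUs; TOPS per watt is often the more useful figure for battery-powered deployments. The sketch below ranks the accelerators from the table above by efficiency, taking the midpoint of each listed power range (the midpoints are our assumption, not from the table).

```python
# TOPS-per-watt efficiency, computed from the NPU table above.
# Power values use the midpoint of each listed range (an assumption).
npus = {
    "Coral Edge TPU":      {"tops": 4.0,  "watts": 2.0},
    "Movidius Myriad X":   {"tops": 1.0,  "watts": 1.5},  # midpoint of 1-2W
    "Ethos-U55":           {"tops": 0.5,  "watts": 0.5},
    "Hexagon 690":         {"tops": 15.0, "watts": 4.0},  # midpoint of 3-5W
    "Apple Neural Engine": {"tops": 11.0, "watts": 5.0},
}

def tops_per_watt(spec):
    """Efficiency metric: peak int8 throughput divided by power draw."""
    return spec["tops"] / spec["watts"]

# Rank accelerators from most to least efficient
ranking = sorted(npus, key=lambda n: tops_per_watt(npus[n]), reverse=True)
for name in ranking:
    print(f"{name}: {tops_per_watt(npus[name]):.2f} TOPS/W")
```

By this measure the integrated smartphone NPUs lead, but the Coral remains the strongest discrete option you can attach to an existing board.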
317.3.2 Coral Edge TPU Example
```python
# TensorFlow Lite + Coral Edge TPU acceleration
from pycoral.utils import edgetpu
from pycoral.adapters import common
from pycoral.adapters import classify

# Load a model compiled for the Edge TPU (note the _edgetpu suffix)
interpreter = edgetpu.make_interpreter('mobilenet_v2_edgetpu.tflite')
interpreter.allocate_tensors()

# Copy a preprocessed 224x224x3 image into the input tensor
common.set_input(interpreter, image)

# Run inference (accelerated on the Edge TPU, ~5 ms)
interpreter.invoke()

# Get the top-5 classification results
classes = classify.get_classes(interpreter, top_k=5)
```

Result: ~5 ms inference for MobileNetV2, versus 75 ms on a Cortex-M7.
317.3.3 Performance Comparison
Image Classification (MobileNetV2, 224x224 input):
- Cortex-M7 CPU (216 MHz): 300ms per image (3.3 fps)
- Cortex-M7 + int8: 75ms per image (13 fps)
- Raspberry Pi 4 CPU: 50ms per image (20 fps)
- Coral Edge TPU: 5ms per image (200 fps; 10x faster than the Pi 4, 60x faster than the Cortex-M7 baseline)
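The throughput and speedup figures above follow directly from the latencies. A quick sketch of the arithmetic:

```python
# Throughput and speedup derived from the latency figures above.
latencies_ms = {
    "Cortex-M7 fp32": 300,
    "Cortex-M7 int8": 75,
    "Raspberry Pi 4": 50,
    "Coral Edge TPU": 5,
}

def fps(latency_ms):
    """Frames per second from per-image latency in milliseconds."""
    return 1000 / latency_ms

def speedup(baseline_ms, accelerated_ms):
    """How many times faster the accelerated path is."""
    return baseline_ms / accelerated_ms

print(fps(latencies_ms["Coral Edge TPU"]))       # 200 fps
print(speedup(50, 5))   # 10x vs Raspberry Pi 4
print(speedup(300, 5))  # 60x vs Cortex-M7 fp32
```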
317.4 GPU at the Edge
GPUs excel at parallel processing, making them ideal for running larger models that don’t fit on NPUs.
317.4.1 Edge GPU Options
| Device | GPU | Peak Compute | Power | Cost | Use Case |
|---|---|---|---|---|---|
| NVIDIA Jetson Nano | 128-core Maxwell | 472 GFLOPS (fp16) | 5-10W | $99 | Entry-level edge AI, robotics |
| NVIDIA Jetson Xavier NX | 384-core Volta | 21 TOPS (int8) | 10-15W | $399 | Industrial AI, autonomous machines |
| NVIDIA Jetson AGX Orin | 2048-core Ampere | 275 TOPS (int8) | 15-60W | $1,999 | Autonomous vehicles, advanced robotics |
| Raspberry Pi 5 | VideoCore VII | ~50 GFLOPS | 3-5W | $80 | Hobby edge AI, education |
317.4.2 When to Use GPU vs NPU
Choose NPU (Coral, Movidius) when:
- Running well-optimized int8 models (MobileNet, EfficientNet)
- Need lowest power consumption (<5W)
- Budget-constrained ($25-100)
- Inference-only (no training)
Choose GPU (Jetson) when:
- Running custom/complex models not optimized for NPU
- Need floating-point precision (fp16/fp32)
- Multi-model inference pipeline (object detection + tracking + pose estimation)
- Edge training/fine-tuning required
- Have power budget (10-60W)
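The two checklists above can be condensed into a small helper. This is a minimal sketch of the decision logic as stated, not a definitive rule; the function name and parameters are our own.

```python
# A sketch encoding the NPU-vs-GPU checklists above.
def choose_accelerator(int8_ready, power_budget_w,
                       needs_float=False, needs_training=False,
                       multi_model=False):
    """Return 'NPU' or 'GPU' following the criteria in the text."""
    # Any GPU-only requirement rules out the NPU immediately
    if needs_float or needs_training or multi_model:
        return "GPU"   # Jetson-class: fp16/fp32, edge training, pipelines
    # Otherwise an optimized int8 model in a tight power budget fits an NPU
    if int8_ready and power_budget_w < 5:
        return "NPU"   # Coral/Movidius: inference-only, $25-100
    return "GPU"

print(choose_accelerator(int8_ready=True, power_budget_w=3))        # NPU
print(choose_accelerator(int8_ready=False, power_budget_w=15,
                         needs_float=True, multi_model=True))       # GPU
```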
317.5 FPGA: Programmable Acceleration
FPGAs (Field-Programmable Gate Arrays) offer reconfigurable hardware, enabling custom acceleration for specific models.
317.5.1 Advantages
- Ultra-low latency (<1ms) for critical control loops
- Flexible: reprogram for different models
- Deterministic timing (important for safety-critical)
317.5.2 Disadvantages
- Complex programming (Verilog/VHDL or high-level synthesis)
- Higher cost than NPUs
- Lower peak performance than GPUs
317.5.3 Edge FPGA Examples
- Intel/Altera Cyclone V: Industrial IoT, motor control
- Xilinx Zynq UltraScale+: Autonomous drones, medical devices
- Lattice iCE40 UltraPlus: Ultra-low-power (tens of mW) always-on AI
317.6 Hardware Selection Decision Tree
The flowchart at the start of this chapter captures the selection logic: start with model size (under ~200 KB fits a microcontroller), then power budget, then latency, and finally whether the model needs custom operations.
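The decision tree from the chapter-opening flowchart can be expressed as a function. The thresholds mirror the flowchart; treat them as rules of thumb, not hard limits.

```python
# The hardware-selection decision tree from the flowchart, as code.
def select_hardware(model_kb, power_budget_w, latency_ms, custom_ops=False):
    """Pick a hardware class from model size, power, latency, and op needs."""
    if model_kb < 200:
        return "Microcontroller"   # STM32/ESP32: <$25, <50 mW
    if power_budget_w < 5:
        return "NPU/TPU"           # Coral/Movidius: $25-100, 2-5W
    if latency_ms >= 10:
        return "Edge GPU"          # Jetson Nano/Xavier: $99-399, 5-15W
    # Sub-10 ms latency: FPGA only if the model needs custom operations
    return "FPGA" if custom_ops else "NPU/TPU"

print(select_hardware(150, 1, 100))         # Microcontroller
print(select_hardware(5000, 3, 50))         # NPU/TPU
print(select_hardware(20000, 15, 50))       # Edge GPU
print(select_hardware(20000, 15, 5, True))  # FPGA
```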
317.7 Understanding TOPS vs GFLOPS
Key insight about accelerator metrics:
- TOPS (Tera Operations Per Second): Measures int8 neural network operations (convolutions, activations). Used for NPUs.
- GFLOPS (Giga Floating-point Operations Per Second): Measures general float32 math. Used for GPUs.
Why Coral TPU (4 TOPS) outperforms Jetson Nano (472 GFLOPS) for int8 models:
| Metric | Coral Edge TPU | Jetson Nano |
|---|---|---|
| MobileNetV2 int8 | 5ms | 10ms |
| Peak Compute | 4 TOPS (int8) | 472 GFLOPS (fp16) |
| Power | 2W | 10W |
| Cost | $60 | $99 |
On paper the two are hard to compare at all: 4 TOPS of int8 multiply-accumulates is not the same currency as 472 GFLOPS of general-purpose floating-point math. In practice the Coral’s purpose-built int8 inference engine halves the Nano’s latency at a fifth of the power. Specialized hardware wins for standardized workloads.
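A back-of-envelope calculation shows why headline compute numbers rarely predict latency: real inference is dominated by memory bandwidth, quantization, and operator support, so measured utilization sits far below peak. The figure of ~0.6 G operations per MobileNetV2 inference (roughly 300 M multiply-accumulates, counted as two ops each) is an approximate number we assume here, not one from the text.

```python
# Why headline TOPS/GFLOPS rarely predict latency.
# Assumes MobileNetV2 costs ~0.6e9 ops per inference (approximate).
OPS_PER_INFERENCE = 0.6e9

def theoretical_ms(peak_ops_per_s):
    """Best-case latency if the chip ran at 100% of peak throughput."""
    return OPS_PER_INFERENCE / peak_ops_per_s * 1000

def utilization(measured_ms, peak_ops_per_s):
    """Fraction of peak compute actually achieved."""
    return theoretical_ms(peak_ops_per_s) / measured_ms

coral_util = utilization(5, 4e12)     # measured 5 ms at 4 TOPS peak
nano_util = utilization(10, 472e9)    # measured 10 ms at 472 GFLOPS peak
print(f"Coral: {coral_util:.1%} of peak")
print(f"Nano:  {nano_util:.1%} of peak")
```

Both chips run well under 15% of their peak numbers on this model, which is why measured latency on your actual workload, not the spec sheet, should drive selection.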
317.9 Summary
Hardware Options:
| Type | TOPS/GFLOPS | Power | Cost | Best For |
|---|---|---|---|---|
| Microcontroller | <0.01 TOPS | <50mW | <$25 | TinyML (<200 KB models) |
| NPU/TPU | 1-15 TOPS | 2-5W | $25-100 | Optimized int8 inference |
| Edge GPU | 472 GFLOPS - 275 TOPS | 5-60W | $99-2000 | Custom/complex models |
| FPGA | Custom | 5-20W | $200-500 | Custom ops, deterministic latency |
Selection Principles:
- Specialized hardware (NPU) beats general hardware (GPU) for standard workloads
- TOPS matters for int8 models; GFLOPS for float32
- FPGAs for custom operations and hard real-time requirements
- Match hardware to model size, power budget, and latency needs
317.10 What’s Next
Now that you understand edge AI hardware options, continue to:
- Edge AI Applications - See hardware applied to real-world use cases like visual inspection and predictive maintenance
- Edge AI Deployment Pipeline - Learn end-to-end workflows from model training to hardware deployment
- Edge AI Lab - Build hands-on projects with TinyML and edge accelerators