317  Hardware Accelerators for Edge AI

317.1 Learning Objectives

By the end of this chapter, you will be able to:

  • Compare NPUs, GPUs, and FPGAs: Understand the tradeoffs between different hardware accelerators
  • Select Appropriate Hardware: Choose optimal edge AI hardware based on model size, power, latency, and cost constraints
  • Understand TOPS vs GFLOPS: Interpret accelerator metrics and apply them to real workloads
  • Apply Selection Framework: Use decision trees to guide hardware selection for different use cases

317.2 Introduction

While microcontrollers handle simple models, complex AI requires specialized hardware: Neural Processing Units (NPUs), Tensor Processing Units (TPUs), GPUs, or FPGAs. Each accelerator type has different strengths for edge AI deployment.

317.3 Neural Processing Units (NPUs)

NPUs are application-specific integrated circuits (ASICs) optimized for neural network inference, offering high throughput at low power.

317.3.2 Coral Edge TPU Example

# TensorFlow Lite + Coral Edge TPU acceleration
from PIL import Image
from pycoral.utils import edgetpu
from pycoral.adapters import common
from pycoral.adapters import classify

# Load a model compiled for the Edge TPU
interpreter = edgetpu.make_interpreter('mobilenet_v2_edgetpu.tflite')
interpreter.allocate_tensors()

# Resize the input image to the model's expected shape (224x224x3)
# and copy it into the input tensor ('image.jpg' is a placeholder path)
image = Image.open('image.jpg').resize(common.input_size(interpreter))
common.set_input(interpreter, image)

interpreter.invoke()  # Runs on the Edge TPU in ~5ms

# Get the top-5 results
classes = classify.get_classes(interpreter, top_k=5)
# Result: ~5ms inference for MobileNetV2 (vs 75ms on a Cortex-M7)

317.3.3 Performance Comparison

Image Classification (MobileNetV2, 224x224 input):
- Cortex-M7 CPU (216 MHz): 300ms per image (3.3 fps)
- Cortex-M7 + int8: 75ms per image (13 fps)
- Raspberry Pi 4 CPU: 50ms per image (20 fps)
- Coral Edge TPU: 5ms per image (200 fps, 10x faster than the Pi 4)
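Latency figures like these are straightforward to reproduce with a small timing harness. The sketch below is device-agnostic: the stand-in workload and iteration counts are illustrative, and a warm-up phase is included because the first invocation on an accelerator often pays for model loading and compilation.

```python
import time

def benchmark(run_inference, warmup=5, iters=50):
    """Time an inference callable; return (mean latency in ms, fps)."""
    for _ in range(warmup):          # warm caches / accelerator init
        run_inference()
    start = time.perf_counter()
    for _ in range(iters):
        run_inference()
    elapsed = time.perf_counter() - start
    mean_ms = 1000.0 * elapsed / iters
    return mean_ms, 1000.0 / mean_ms

# Stand-in workload; pass interpreter.invoke to time a real TFLite model
mean_ms, fps = benchmark(lambda: sum(i * i for i in range(10_000)))
print(f"{mean_ms:.2f} ms/inference, {fps:.0f} fps")
```

Averaging over many runs smooths out scheduler jitter, which is especially noticeable on shared CPUs like the Raspberry Pi's.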

317.4 GPU at the Edge

GPUs excel at parallel processing, making them ideal for running larger models that don’t fit on NPUs.

317.4.1 Edge GPU Options

| Device | GPU | Compute | Power | Cost | Use Case |
|---|---|---|---|---|---|
| NVIDIA Jetson Nano | 128-core Maxwell | 472 GFLOPS | 5-10W | $99 | Entry-level edge AI, robotics |
| NVIDIA Jetson Xavier NX | 384-core Volta | 21 TOPS (int8) | 10-15W | $399 | Industrial AI, autonomous machines |
| NVIDIA Jetson AGX Orin | 2048-core Ampere | 275 TOPS (int8) | 15-60W | $1,999 | Autonomous vehicles, advanced robotics |
| Raspberry Pi 5 | VideoCore VII | ~50 GFLOPS | 3-5W | $80 | Hobby edge AI, education |

317.4.2 When to Use GPU vs NPU

Choose NPU (Coral, Movidius) when:
- Running well-optimized int8 models (MobileNet, EfficientNet)
- Need lowest power consumption (<5W)
- Budget-constrained ($25-100)
- Inference-only (no training)

Choose GPU (Jetson) when:
- Running custom/complex models not optimized for NPU
- Need floating-point precision (fp16/fp32)
- Multi-model inference pipeline (object detection + tracking + pose estimation)
- Edge training/fine-tuning required
- Have power budget (10-60W)

317.5 FPGA: Programmable Acceleration

FPGAs (Field-Programmable Gate Arrays) offer reconfigurable hardware, enabling custom acceleration for specific models.

317.5.1 Advantages

  • Ultra-low latency (<1ms) for critical control loops
  • Flexible: reprogram for different models
  • Deterministic timing (important for safety-critical)

317.5.2 Disadvantages

  • Complex programming (Verilog/VHDL or high-level synthesis)
  • Higher cost than NPUs
  • Lower peak performance than GPUs

317.5.3 Edge FPGA Examples

  • Intel/Altera Cyclone V: Industrial IoT, motor control
  • Xilinx Zynq UltraScale+: Autonomous drones, medical devices
  • Lattice iCE40 UltraPlus: Ultra-low-power (tens of mW) always-on AI

317.6 Hardware Selection Decision Tree

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#7F8C8D', 'fontSize': '14px'}}}%%
flowchart TD
    Start["Model Size?"] --> Size{"< 200 KB?"}

    Size -->|"Yes"| MCU["Microcontroller<br/>STM32, ESP32<br/>< $25, < 50mW"]
    Size -->|"No"| Power{"Power Budget?"}

    Power -->|"< 5W"| NPU["NPU/TPU<br/>Coral, Movidius<br/>$25-100, 2-5W"]
    Power -->|"5-60W"| Latency{"Latency<br/>Requirement?"}

    Latency -->|"< 10ms"| Custom{"Custom<br/>Operations?"}
    Latency -->|"> 10ms OK"| GPU["Edge GPU<br/>Jetson Nano/Xavier<br/>$99-399, 5-15W"]

    Custom -->|"Standard Ops"| NPU
    Custom -->|"Custom Ops"| FPGA["FPGA<br/>Zynq, Cyclone<br/>$200-500, 5-20W"]

    style Start fill:#2C3E50,stroke:#16A085,color:#fff
    style MCU fill:#16A085,stroke:#2C3E50,color:#fff
    style NPU fill:#16A085,stroke:#2C3E50,color:#fff
    style GPU fill:#E67E22,stroke:#2C3E50,color:#fff
    style FPGA fill:#7F8C8D,stroke:#2C3E50,color:#fff

Figure 317.1: Edge AI hardware selection decision tree based on model size, power, and latency requirements
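The same branching logic can be captured in a few lines of code. This is a minimal sketch of the decision tree above; the thresholds and device names come directly from the diagram, not from any vendor tool.

```python
def select_hardware(model_kb, power_budget_w, latency_ms, custom_ops=False):
    """Walk the decision tree: model size -> power -> latency -> op types."""
    if model_kb < 200:                 # tiny models fit on an MCU
        return "Microcontroller (STM32, ESP32)"
    if power_budget_w < 5:             # tight power budget -> NPU
        return "NPU/TPU (Coral, Movidius)"
    if latency_ms < 10:                # hard latency: depends on op coverage
        return "FPGA (Zynq, Cyclone)" if custom_ops else "NPU/TPU (Coral, Movidius)"
    return "Edge GPU (Jetson Nano/Xavier)"

# 14 MB model, 12 W budget, ~30 fps target (33 ms)
print(select_hardware(14000, 12, 33))  # -> Edge GPU (Jetson Nano/Xavier)
```

Encoding the tree as a function makes the thresholds easy to revisit as hardware generations shift, for example if a new NPU supports custom operations.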

317.7 Understanding TOPS vs GFLOPS

Key insight about accelerator metrics:

  • TOPS (Tera Operations Per Second): Measures int8 neural network operations (convolutions, activations). Used for NPUs.
  • GFLOPS (Giga Floating-point Operations Per Second): Measures general float32 math. Used for GPUs.

Why Coral TPU (4 TOPS) outperforms Jetson Nano (472 GFLOPS) for int8 models:

| Metric | Coral Edge TPU | Jetson Nano |
|---|---|---|
| MobileNetV2 (int8) | 5ms | 10ms |
| Peak compute | 4 TOPS (int8) | 472 GFLOPS (fp32) |
| Power | 2W | 10W |
| Cost | $60 | $99 |

The Coral’s specialized int8 inference engine beats the Jetson’s general-purpose GPU despite a headline figure roughly 100x smaller (4 vs 472). The comparison is apples to oranges: TOPS counts int8 operations while GFLOPS counts fp32 operations, so for quantized workloads the specialized hardware wins, and at a fifth of the power.
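A back-of-envelope calculation shows why headline numbers only bound real latency rather than predict it. MobileNetV2 at 224x224 requires roughly 300 million multiply-accumulates per image (an approximation; one MAC is counted as two operations):

```python
# Theoretical vs measured latency for MobileNetV2 (figures from the table above)
MACS = 300e6            # approx. multiply-accumulates per 224x224 image
OPS = 2 * MACS          # 1 MAC = 1 multiply + 1 add = 2 ops

def theoretical_latency_ms(peak_ops_per_s):
    """Ideal latency if the accelerator ran at 100% of peak compute."""
    return 1000.0 * OPS / peak_ops_per_s

coral_ms = theoretical_latency_ms(4e12)   # 4 TOPS (int8)
nano_ms = theoretical_latency_ms(472e9)   # 472 GFLOPS (fp32)

print(f"Coral ideal: {coral_ms:.2f} ms (measured ~5 ms)")
print(f"Nano ideal:  {nano_ms:.2f} ms (measured ~10 ms)")
# Measured latency is well above the ideal on both devices: memory
# bandwidth, per-layer overhead, and non-conv ops dominate in practice.
```

The gap between ideal and measured latency is why benchmarking on the target device, not comparing spec sheets, should drive the final hardware choice.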

317.8 Knowledge Check

317.9 Summary

Hardware Options:

| Type | Compute | Power | Cost | Best For |
|---|---|---|---|---|
| Microcontroller | ~0.1 TOPS | <50mW | <$25 | TinyML (<200 KB models) |
| NPU/TPU | 1-15 TOPS | 2-5W | $25-100 | Optimized int8 inference |
| Edge GPU | 472 GFLOPS to 275 TOPS | 5-60W | $99-2,000 | Custom/complex models |
| FPGA | Custom | 5-20W | $200-500 | Custom ops, deterministic latency |

Selection Principles:

  • Specialized hardware (NPU) beats general hardware (GPU) for standard workloads
  • TOPS matters for int8 models; GFLOPS for float32
  • FPGAs for custom operations and hard real-time requirements
  • Match hardware to model size, power budget, and latency needs

317.10 What’s Next

Now that you understand edge AI hardware options, continue to:

  • Edge AI Applications - See hardware applied to real-world use cases like visual inspection and predictive maintenance
  • Edge AI Deployment Pipeline - Learn end-to-end workflows from model training to hardware deployment
  • Edge AI Lab - Build hands-on projects with TinyML and edge accelerators