22  TinyML on Microcontrollers

In 60 Seconds

TinyML brings machine learning to ultra-low-power microcontrollers with as little as 1 KB of RAM, enabling intelligent inference on battery-powered devices that last months or years without recharging. By compressing cloud-scale models (100 MB+) down to 20-200 KB through int8 quantization and architecture optimization, TinyML runs on $5 microcontrollers consuming just 2-50 mW, a 2,000-50,000x power reduction compared to GPU inference. Key frameworks include TensorFlow Lite Micro for general-purpose deployment, Edge Impulse for end-to-end no-code workflows, and CMSIS-NN for ARM-optimized kernels with the smallest footprint.

MVU: Minimum Viable Understanding

If you remember nothing else: TinyML runs machine-learning inference on ultra-low-power microcontrollers (as little as 1 KB of RAM), so battery-powered devices can make decisions locally for months or years without recharging.

The compression magic:

From (Cloud)       To (TinyML)    Reduction
100 MB model       20-200 KB      500-5,000x smaller
Float32 (32-bit)   Int8 (8-bit)   4x smaller
GPU required       $5 MCU works   ~1,000x cheaper
100 W power        2-50 mW        2,000-50,000x less

The “3-3-3 TinyML Rule”:

  • 3 KB minimum RAM for simple models (anomaly detection)
  • 30 KB typical RAM for useful models (gesture recognition)
  • 300 KB maximum RAM for complex models (keyword spotting, image classification)

Key frameworks:

  • TensorFlow Lite Micro (TFLM) - General-purpose, most popular
  • Edge Impulse - End-to-end platform, no-code option
  • CMSIS-NN - ARM-optimized, smallest footprint

Read on for hardware selection guidance and implementation details, or jump to Knowledge Check to test your understanding.

Key Concepts
  • TinyML: Machine learning inference on microcontroller-class hardware (Cortex-M, RISC-V) with <1MB flash, <256KB RAM, running at milliwatt power levels
  • TFLite Micro: Google’s framework for deploying quantized TFLite models on bare-metal microcontrollers without OS dependencies, fitting in 16KB RAM
  • Cortex-M: ARM processor family (M0 through M85) commonly used for TinyML; M4/M7 include DSP extensions and FPU that accelerate int8 neural network inference
  • CMSIS-NN: ARM’s optimized neural network kernels for Cortex-M processors using SIMD instructions to achieve 5-10x inference speedup over naive implementation
  • Memory Footprint: Critical TinyML constraint — model size must fit in flash (typically 256KB-1MB) while activation buffers must fit in SRAM (typically 32KB-256KB)
  • Wake Word Detection: TinyML application running a 10-30KB model continuously at 0.1-1mW to detect a specific audio trigger phrase before activating a larger system
  • Anomaly Detection on MCU: TinyML technique using small autoencoder or one-class classifier to identify unusual sensor patterns without transmitting data to cloud
  • Always-On Inference: Continuous ML inference running on an MCU in low-power mode (1-10mW), waking higher-power systems only when a target event is detected

22.1 Learning Objectives

By the end of this chapter, you will be able to:

  • Select TinyML Platforms: Choose appropriate microcontrollers based on model size, RAM, and power requirements
  • Analyze Model Constraints: Calculate memory budgets for weights, activations, and runtime overhead
  • Configure TensorFlow Lite Micro: Deploy ML models on microcontrollers using the TFLM framework
  • Implement Edge Impulse Workflows: Build end-to-end TinyML solutions from data collection to deployment
  • Calculate Memory Budgets: Determine if a model fits on target hardware
  • Select Appropriate Runtimes: Choose between TFLM, CMSIS-NN, Edge Impulse SDK based on constraints

Key Business Value: TinyML enables intelligent devices without cloud connectivity costs, achieving 90-99% reduction in data transmission while maintaining real-time decision-making capability. This transforms product economics for wearables, industrial sensors, and consumer electronics.

Decision Framework:

Factor               Consideration                               Typical Range
Hardware Cost        MCU with ML capability                      $4-25 per unit
Development Cost     Model training, optimization, deployment    $20,000-100,000
Power Savings        Battery life extension vs cloud-connected   10-100x longer
Accuracy Trade-off   TinyML vs cloud inference                   90-98% of cloud accuracy

When to Choose TinyML:

  • Product requires months/years of battery life
  • Privacy-sensitive data that should not leave device
  • Real-time response needed (< 100ms) without network dependency
  • High-volume deployment where cloud costs would be prohibitive
  • Offline operation in remote or connectivity-limited environments

When NOT to Choose TinyML:

  • Complex models requiring > 512 KB RAM
  • Continuous learning/retraining needed on device
  • Accuracy requirements cannot tolerate quantization loss
  • Single/low-volume deployment where cloud is cost-effective

Competitive Landscape: Edge Impulse dominates rapid prototyping. Google (TensorFlow Lite Micro) leads open-source. ARM (CMSIS-NN) provides optimized kernels. Specialized silicon emerging from Syntiant, Eta Compute, and others.

Implementation Timeline:

  1. Phase 1 (Week 1-2): Proof of concept - Use Edge Impulse to validate feasibility
  2. Phase 2 (Week 3-6): Model development - Train, quantize, and optimize for target hardware
  3. Phase 3 (Week 7-10): Integration - Deploy to production hardware, validate power/performance
  4. Phase 4 (Week 11-12): Production readiness - Testing, certification, manufacturing handoff

22.2 Introduction

TinyML brings machine learning to ultra-low-power microcontrollers with as little as 1 KB of RAM. This enables intelligent inference on battery-powered devices lasting months or years.

The challenge is fitting capable neural networks into devices with a tiny fraction of the memory and compute of even a modern smartphone. This chapter explores the hardware platforms, software frameworks, and design patterns that make TinyML possible.

22.3 What Fits on a Microcontroller?

22.3.2 Hardware Selection Decision Tree

Choosing the right TinyML platform depends on your model requirements and power constraints:

Flowchart decision tree for selecting TinyML hardware platforms. Starting point asks if model requires more than 300 KB RAM - if yes, recommends ESP32-S3 with 512 KB RAM for complex models. If no, asks about battery life requirements over 1 year. For ultra-low power needs, branches to Nordic nRF52840 or STM32L4 based on model size. For budget-constrained projects, recommends Raspberry Pi Pico at $4. For best ecosystem support, recommends Arduino Nano 33 BLE.

Figure 22.1: TinyML hardware selection decision tree based on model and power requirements

22.3.3 Model Size Constraints

Typical TinyML Memory Budget:

Component            Size Range   Storage   Notes
Model weights        20-200 KB    Flash     Int8 quantized
Activation tensors   10-50 KB     RAM       Intermediate layer outputs
Input buffer         5-20 KB      RAM       Sensor data window
Framework overhead   50-100 KB    Flash     TFLM runtime
TOTAL                100-500 KB Flash, 50-150 KB RAM

Real-World TinyML Model Examples:

Application            Model Size   RAM Required   Typical Accuracy
Wake word detection    18 KB        35 KB          92%
Gesture recognition    45 KB        65 KB          88%
Anomaly detection      30 KB        40 KB          96%
Image classification   150 KB       100 KB         94%

Think of TinyML like a smart insect brain versus a human brain.

A human brain (cloud AI) has billions of neurons and consumes about 20 watts of power. An insect brain (TinyML) has a tiny fraction of that neuron count, yet it can fly, navigate, and avoid predators on a minuscule power budget.

TinyML lets tiny devices make smart decisions:

  • Your fitness tracker detects when you’re running vs walking using a 30 KB model
  • Smart earbuds recognize “Hey Siri” using 18 KB of neural network
  • A wildlife camera classifies animals without internet using 150 KB of vision AI

The magic is compression: We take big neural networks trained in the cloud and shrink them 10-100x to fit on $5 microcontrollers. The accuracy drops a little (maybe 95% instead of 98%), but the device works without internet, without batteries dying, and without sending your data anywhere.

Hey kids! Have you ever wondered how your smart watch knows you’re running without being connected to the internet? Let’s meet Tiny the TinyML Brain!

22.3.4 The Sensor Squad Adventure: Tiny Saves the Day

One sunny morning at the Smart Forest Wildlife Preserve, Tiny the TinyML Brain woke up inside a small camera box attached to a tree. Tiny was no bigger than a sugar cube, but inside that tiny chip was a whole neural network - like a miniature brain with thousands of tiny thinking pathways!

“Good morning, world!” chirped Tiny. “Time to watch for wildlife!” Unlike the big AI computers in the city that need tons of electricity, Tiny ran on just a tiny battery that could last for TWO WHOLE YEARS!

Suddenly, a deer walked past the camera. Clicky the Camera Sensor snapped a picture. “Hey Tiny! Is this important?”

Tiny’s neural network sprang into action. In just 50 milliseconds - faster than you can blink - Tiny examined the image using millions of tiny math calculations. “That’s a deer! Recording it!” Tiny saved the picture and went back to sleep to save battery power.

An hour later, a leaf blew past the camera. Clicky took another picture. But Tiny was smart! “That’s just a leaf, not an animal. No need to save that!” By ignoring the boring stuff, Tiny saved battery power and memory space.

Max the Motion Sensor was impressed. “Tiny, how did you get so smart on such a small brain?”

Tiny explained: “My creators taught a HUGE brain in the cloud everything about animals. Then they squeezed all that knowledge down small enough to fit inside me! It’s like taking a library full of books and shrinking it down to fit in your pocket. I might not know quite as much as the big brain, but I know enough to do my job perfectly!”

22.3.5 Key Words for Kids

Word             What It Means
TinyML           Machine learning (making computers smart) on teeny-tiny computer chips
Neural Network   A computer program that thinks a little bit like a brain, with connected pathways
Quantization     Shrinking a big brain’s knowledge to fit in a tiny chip (like compressing a photo)
Milliwatt        A super tiny amount of power - TinyML uses so little it can run on a battery for years!
Inference        When Tiny looks at something and decides what it is (like recognizing a deer)

22.3.6 Fun Facts

  • A TinyML chip uses 1000x LESS power than your phone!
  • Some TinyML devices can run for 10 years on a single battery!
  • Your smart watch uses TinyML to know if you’re walking, running, or sleeping!

22.4 TinyML Deployment Pipeline

The journey from a full neural network to a TinyML model follows a systematic compression and deployment pipeline:

Left-to-right pipeline diagram showing four stages of TinyML deployment. Stage 1 Cloud Training: dataset of 10K+ samples flows through TensorFlow/PyTorch training to produce a 100+ MB Float32 model. Stage 2 Model Optimization: pruning removes redundant weights, quantization converts Float32 to Int8, knowledge distillation compresses further. Stage 3 Conversion: TFLite Converter creates Flatbuffer format, then C Array byte representation. Stage 4 Microcontroller: model weights stored in Flash memory, tensor arena allocated in RAM, TFLM runtime executes inference.

Figure 22.2: TinyML deployment pipeline: from cloud training to microcontroller inference

Key compression stages:

Stage                    Technique                         Typical Reduction   Trade-off
Pruning                  Remove near-zero weights          2-10x smaller       Minor accuracy loss
Quantization             Float32 to Int8                   4x smaller          1-3% accuracy loss
Knowledge Distillation   Train smaller student model       10-100x smaller     Model-dependent
Architecture Search      Find efficient network topology   Variable            Development time

22.5 TensorFlow Lite Micro

TensorFlow Lite Micro (TFLM) is a lightweight inference framework designed for microcontrollers, no operating system required.

22.5.1 Architecture

The TFLM architecture separates model storage from runtime execution:

Layered architecture diagram showing TensorFlow Lite Micro components on a microcontroller. Flash Memory section contains model weights (20-200 KB) and firmware code including TFLM runtime (50-200 KB). RAM section contains tensor arena for input/output buffers and intermediate activations (30-100 KB), plus stack and heap for runtime variables (10-30 KB). TFLM Interpreter section shows Op Resolver for supported operations, Execution Engine for layer-by-layer processing, and Memory Allocator for arena management. Arrows show data flow between components.

Figure 22.3: TensorFlow Lite Micro runtime architecture on a microcontroller

How It Works:

  1. Cloud Training: Train a full TensorFlow model (typically 100 MB+ with float32 weights)
  2. Conversion: Use TFLite Converter to optimize for mobile/embedded
  3. Quantization: Convert float32 to int8 (4x size reduction)
  4. Deployment: Copy model bytes to microcontroller flash memory
  5. Inference: TFLM interpreter loads model, allocates tensor arena in RAM, runs inference

22.5.2 Supported Operations (Subset)

  • Convolutional layers: Conv2D, DepthwiseConv2D
  • Activation functions: ReLU, ReLU6, Sigmoid, Tanh
  • Pooling: MaxPool2D, AveragePool2D
  • Fully connected: Dense
  • Normalization: BatchNormalization
  • Utilities: Reshape, Concatenate, Add, Multiply

22.5.3 Example: Wake Word Detection

// TensorFlow Lite Micro - Arduino Example (Conceptual)
#include <TensorFlowLite.h>
#include "tensorflow/lite/micro/micro_interpreter.h"
#include "tensorflow/lite/micro/micro_mutable_op_resolver.h"
#include "model.h" // 18 KB wake word model, exported as C array g_model

// Tensor arena: RAM for input/output buffers and intermediate activations
constexpr int kTensorArenaSize = 35 * 1024;
uint8_t tensor_arena[kTensorArenaSize];

tflite::MicroInterpreter* interpreter = nullptr;
TfLiteTensor* input = nullptr;

void setup() {
  Serial.begin(115200);
  const tflite::Model* model = tflite::GetModel(g_model);

  // Register only the ops this model uses to keep binary size small
  static tflite::MicroMutableOpResolver<4> ops_resolver;
  ops_resolver.AddConv2D();
  ops_resolver.AddFullyConnected();
  ops_resolver.AddSoftmax();
  ops_resolver.AddReshape();

  static tflite::MicroInterpreter static_interpreter(
      model, ops_resolver, tensor_arena, kTensorArenaSize);
  interpreter = &static_interpreter;
  interpreter->AllocateTensors();

  // Input tensor: audio features (40 MFCC coefficients x 49 frames = 1960)
  input = interpreter->input(0);
}

// Run inference on one window of audio features
void detectWakeWord(const float* audio_features) {
  // Copy features to input tensor
  for (int i = 0; i < 1960; i++) {
    input->data.f[i] = audio_features[i];
  }

  // Invoke inference (takes ~20 ms on a Cortex-M4-class MCU)
  interpreter->Invoke();

  // Output: probability of wake word
  float confidence = interpreter->output(0)->data.f[0];
  if (confidence > 0.85f) {
    Serial.println("Wake word detected!"); // e.g., start streaming to cloud
  }
}

22.6 Edge Impulse: End-to-End TinyML Platform

Edge Impulse provides a complete no-code/low-code platform for building TinyML solutions: data collection -> labeling -> training -> deployment.

22.6.1 Workflow

Five-stage horizontal workflow diagram for Edge Impulse TinyML development. Stage 1 Data Collection: smartphone app, development board, or file upload (CSV/WAV). Stage 2 Data Labeling: audio segmentation, image bounding boxes, or time-series annotation. Stage 3 Signal Processing: MFCC for audio features, FFT for vibration, or image preprocessing. Stage 4 Model Training: AutoML architecture search, quantization optimization, and validation testing. Stage 5 Deployment: outputs to Arduino library, ESP32 firmware, or WebAssembly for browser.

Figure 22.4: Edge Impulse end-to-end TinyML development workflow

Detailed workflow steps:

  1. Data Collection: Use smartphone or hardware to collect sensor data (audio, accelerometer, images)
  2. Data Labeling: Draw bounding boxes, segment audio, label time-series
  3. Signal Processing: Extract features (MFCC for audio, FFT for vibration, image resize)
  4. Model Training: AutoML selects architecture, trains on cloud, optimizes for edge
  5. Deployment: One-click export to Arduino, ESP32, Raspberry Pi, or custom hardware

22.6.2 Example Use Cases

Application              Sensor          Model Type          Model Size   Accuracy
Keyword Spotting         Microphone      1D CNN              18 KB        92%
Gesture Recognition      Accelerometer   LSTM                45 KB        88%
Visual Inspection        Camera          MobileNetV2         150 KB       94%
Predictive Maintenance   Vibration       Anomaly Detection   30 KB        96%

22.6.3 Advantages

  • Rapid prototyping (hours instead of weeks)
  • Automatic feature engineering and model optimization
  • Integrated data versioning and experiment tracking
  • Hardware abstraction (same model runs on Arduino or ESP32)

22.6.4 TinyML Runtime Selection

Choosing the right runtime depends on your hardware constraints and model complexity:

Runtime                 Binary Size   Features                                Best For
TensorFlow Lite Micro   50-200 KB     Full inference, many ops                General TinyML, complex models
CMSIS-NN                10-50 KB      ARM Cortex-M optimized kernels          Ultra-low power, ARM MCUs
X-CUBE-AI               50-100 KB     STM32 optimized, hardware accelerated   STM32 family devices
Edge Impulse SDK        30-100 KB     AutoML generated, optimized             Rapid prototyping, production
microTVM                20-80 KB      Compiler-based, any target              Custom hardware, maximum efficiency

22.7 Knowledge Check

22.8 Summary

TinyML enables machine learning on ultra-low-power devices:

Hardware Options:

  • $4-25 microcontrollers with 128-512 KB RAM
  • Power consumption: 2-50 mW
  • Model budgets: 20-200 KB weights, 30-100 KB RAM

Software Frameworks:

  • TensorFlow Lite Micro for general TinyML
  • Edge Impulse for rapid prototyping
  • CMSIS-NN for ARM optimization

Typical Applications:

  • Keyword spotting: 18 KB model, 92% accuracy
  • Gesture recognition: 45 KB model, 88% accuracy
  • Visual inspection: 150 KB model, 94% accuracy

22.9 Worked Example: TinyML vs Cloud Inference for Predictive Maintenance

Scenario: A wind farm in Aberdeenshire, Scotland has 60 turbines. Each turbine’s gearbox has 3 accelerometers sampling vibration at 4 kHz. The operator wants to detect bearing faults 2 weeks before failure using ML-based vibration analysis.

Option A – Cloud Inference:

  • Raw data rate per turbine: 3 accelerometers x 4,000 samples/sec x 2 bytes = 24 KB/sec = 2.07 GB/day
  • Total farm: 60 turbines x 2.07 GB = 124.2 GB/day
  • 4G cellular backhaul cost: 124,200 MB/day x 30 days x GBP 0.02/MB = GBP 74,520/month
  • Cloud inference (AWS SageMaker): GBP 2,800/month
  • Inference latency: 200 ms (network) + 50 ms (inference) = 250 ms
  • Total annual cost: GBP 928,000/year

Option B – TinyML Edge Inference:

  • Each turbine gets an ESP32-S3 (240 MHz, 512 KB SRAM, 8 MB PSRAM)
  • Vibration model: 1D CNN, 180 KB (INT8 quantised), trained on 6 months of gearbox data
  • On-device processing: FFT + feature extraction + inference = 45 ms per 1-second window
  • Only anomaly scores transmitted (not raw data): 1 byte per second (anomaly flag) = 86.4 KB/day per turbine
  • Total farm data transmitted: 60 x 86.4 KB = 5.18 MB/day (24,000x reduction)
  • 4G cost: 5.18 MB/day x 30 x GBP 0.02/MB = GBP 3.11/month
  • Hardware: 60 x ESP32-S3 boards x GBP 8 = GBP 480 (one-time)
  • Model retraining (quarterly, cloud): GBP 200/quarter
  • Total annual cost: GBP 1,317/year (first year, including the one-time hardware cost)

Comparison:

Metric                    Cloud Inference                  TinyML Edge                 Difference
Annual cost               GBP 928,000                      GBP 1,317                   704x cheaper
Data transmitted/day      124.2 GB                         5.18 MB                     24,000x less
Detection latency         250 ms                           45 ms                       5.6x faster
Works during 4G outage?   No (blind)                       Yes (fully autonomous)      --
Detection accuracy        97.2% (full model, ResNet-18)    94.8% (1D CNN, INT8)        2.4% lower
Privacy                   Raw vibration data leaves site   Only anomaly scores leave   Better

Detection Accuracy Trade-off:

The 2.4% accuracy difference means the TinyML model misses ~1.4 additional faults per year across 60 turbines (assuming roughly one detectable fault event per turbine per year, about 60 farm-wide). Each missed fault costs approximately GBP 85,000 (emergency repair + lost generation). So:

  • Cloud: 97.2% detection → misses 1.7 faults/year → GBP 144,500 in undetected failures
  • TinyML: 94.8% detection → misses 3.1 faults/year → GBP 263,500 in undetected failures
  • Accuracy cost penalty: GBP 119,000/year
  • But TinyML saves GBP 926,720/year in infrastructure costs
  • Net TinyML advantage: GBP 807,720/year

Key Insight: TinyML does not need to match cloud accuracy to be the better choice. The 2.4% accuracy penalty costs GBP 119K/year in additional missed faults, but the 24,000x data reduction saves GBP 927K/year in cellular and cloud costs. The economics are overwhelmingly in favour of edge inference for high-frequency sensor data.

TinyML memory budgeting determines deployment viability on resource-constrained MCUs:

\(\text{Total Flash needed} = \text{model weights} + \text{TFLM runtime} + \text{firmware code}\)

\(\text{Total RAM needed} = \text{tensor arena} + \text{firmware variables}\)

Worked example: Flash: 180 KB model + 50 KB TFLM runtime + 150 KB firmware = 380 KB, well under the ESP32-S3’s 8 MB limit. RAM: 65 KB tensor arena + 80 KB firmware variables = 145 KB, fitting in the ESP32-S3’s 512 KB SRAM and enabling months of battery-powered predictive maintenance.

22.10 Knowledge Check

Common Pitfalls

Many Cortex-M0/M0+ microcontrollers lack hardware floating-point units. Deploying a float32 TFLite model on these devices causes software float emulation, making inference 10-100x slower than an equivalent int8 model. Always check target MCU datasheet for FPU presence and default to int8 quantization for TinyML deployments.

TinyML model files fit in flash, but inference also requires SRAM for activation buffers. A 20 KB model may need 50 KB of SRAM for intermediate activations, exceeding the 32 KB of SRAM on many entry-level Cortex-M0+ boards. Always profile peak SRAM usage with TFLite Micro's RecordingMicroAllocator (via RecordingMicroInterpreter) before targeting a specific MCU.

Training a vibration classifier on 16-bit PC audio samples then deploying to a 12-bit MCU ADC introduces quantization noise that shifts input distributions. The model sees different data than it was trained on, causing silent accuracy degradation. Always train on data collected with the same sensor chain (ADC, anti-aliasing filter, sampling rate) as the deployment hardware.

Running a Cortex-M4 at 168 MHz to hit inference latency targets while powered from a 200mAh coin cell exhausts the battery in 8 hours. TinyML deployment requires co-optimization: use the minimum clock speed that meets the latency requirement, then validate battery life with a current profiler to confirm the power budget is met.

22.11 What’s Next

Now that you can configure TinyML platforms and frameworks, continue to:

Topic                                     Chapter                        Description
Model Optimization Techniques             edge-ai-ml-optimization.html   Apply quantization, pruning, and knowledge distillation to compress models 10-100x for edge deployment
Hardware Accelerators                     edge-ai-ml-hardware.html       Evaluate NPUs, TPUs, and GPUs for accelerating edge inference on different hardware platforms
Edge AI Lab: TinyML Gesture Recognition   edge-ai-ml-lab.html            Implement a working TinyML application with hands-on gesture recognition exercises