22 TinyML on Microcontrollers
In 60 seconds, understand TinyML:
TinyML brings machine learning to ultra-low-power microcontrollers with as little as 1 KB RAM, enabling intelligent inference on battery-powered devices that last months or years without recharging.
The compression magic:
| From (Cloud) | To (TinyML) | Reduction |
|---|---|---|
| 100 MB model | 20-200 KB | 500-5000x smaller |
| Float32 (32-bit) | Int8 (8-bit) | 4x smaller |
| GPU required | $5 MCU works | 1000x cheaper |
| 100W power | 2-50 mW | 2000-50000x less |
The “3-3-3 TinyML Rule”:
- 3 KB minimum RAM for simple models (anomaly detection)
- 30 KB typical RAM for useful models (gesture recognition)
- 300 KB maximum RAM for complex models (keyword spotting, image classification)
Key frameworks:
- TensorFlow Lite Micro (TFLM) - General-purpose, most popular
- Edge Impulse - End-to-end platform, no-code option
- CMSIS-NN - ARM-optimized, smallest footprint
Read on for hardware selection guidance and implementation details, or jump to Knowledge Check to test your understanding.
Key terms:
- TinyML: Machine learning inference on microcontroller-class hardware (Cortex-M, RISC-V) with <1 MB flash, <256 KB RAM, running at milliwatt power levels
- TFLite Micro: Google’s framework for deploying quantized TFLite models on bare-metal microcontrollers without OS dependencies; the core runtime fits in roughly 16 KB on a Cortex-M3
- Cortex-M: ARM processor family (M0 through M85) commonly used for TinyML; M4/M7 include DSP extensions and FPU that accelerate int8 neural network inference
- CMSIS-NN: ARM’s optimized neural network kernels for Cortex-M processors using SIMD instructions to achieve 5-10x inference speedup over naive implementation
- Memory Footprint: Critical TinyML constraint — model size must fit in flash (typically 256KB-1MB) while activation buffers must fit in SRAM (typically 32KB-256KB)
- Wake Word Detection: TinyML application running a 10-30KB model continuously at 0.1-1mW to detect a specific audio trigger phrase before activating a larger system
- Anomaly Detection on MCU: TinyML technique using small autoencoder or one-class classifier to identify unusual sensor patterns without transmitting data to cloud
- Always-On Inference: Continuous ML inference running on an MCU in low-power mode (1-10mW), waking higher-power systems only when a target event is detected
22.1 Learning Objectives
By the end of this chapter, you will be able to:
- Select TinyML Platforms: Choose appropriate microcontrollers based on model size, RAM, and power requirements
- Analyze Model Constraints: Calculate memory budgets for weights, activations, and runtime overhead
- Configure TensorFlow Lite Micro: Deploy ML models on microcontrollers using the TFLM framework
- Implement Edge Impulse Workflows: Build end-to-end TinyML solutions from data collection to deployment
- Calculate Memory Budgets: Determine if a model fits on target hardware
- Select Appropriate Runtimes: Choose between TFLM, CMSIS-NN, Edge Impulse SDK based on constraints
Key Business Value: TinyML enables intelligent devices without cloud connectivity costs, achieving 90-99% reduction in data transmission while maintaining real-time decision-making capability. This transforms product economics for wearables, industrial sensors, and consumer electronics.
Decision Framework:
| Factor | Consideration | Typical Range |
|---|---|---|
| Hardware Cost | MCU with ML capability | $4-25 per unit |
| Development Cost | Model training, optimization, deployment | $20,000-100,000 |
| Power Savings | Battery life extension vs cloud-connected | 10-100x longer |
| Accuracy Trade-off | TinyML vs cloud inference | 90-98% of cloud accuracy |
When to Choose TinyML:
- Product requires months/years of battery life
- Privacy-sensitive data that should not leave device
- Real-time response needed (< 100ms) without network dependency
- High-volume deployment where cloud costs would be prohibitive
- Offline operation in remote or connectivity-limited environments
When NOT to Choose TinyML:
- Complex models requiring > 512 KB RAM
- Continuous learning/retraining needed on device
- Accuracy requirements cannot tolerate quantization loss
- Single/low-volume deployment where cloud is cost-effective
Competitive Landscape: Edge Impulse dominates rapid prototyping. Google (TensorFlow Lite Micro) leads open-source. ARM (CMSIS-NN) provides optimized kernels. Specialized silicon emerging from Syntiant, Eta Compute, and others.
Implementation Timeline:
- Phase 1 (Week 1-2): Proof of concept - Use Edge Impulse to validate feasibility
- Phase 2 (Week 3-6): Model development - Train, quantize, and optimize for target hardware
- Phase 3 (Week 7-10): Integration - Deploy to production hardware, validate power/performance
- Phase 4 (Week 11-12): Production readiness - Testing, certification, manufacturing handoff
22.2 Introduction
TinyML brings machine learning to ultra-low-power microcontrollers with as little as 1 KB of RAM. This enables intelligent inference on battery-powered devices lasting months or years.
The challenge is fitting powerful neural networks into devices with less computational power than a 1990s calculator. This chapter explores the hardware platforms, software frameworks, and design patterns that make TinyML possible.
22.3 What Fits on a Microcontroller?
22.3.1 Popular TinyML Hardware Platforms
| Device | Flash | RAM | CPU | Power | Cost | Typical Use |
|---|---|---|---|---|---|---|
| Arduino Nano 33 BLE | 1 MB | 256 KB | 64 MHz Cortex-M4 | 5 mW | $25 | Keyword spotting, gesture recognition |
| ESP32-S3 | 8 MB | 512 KB | 240 MHz Xtensa dual-core | 20-50 mW | $8 | Smart home, audio processing |
| Raspberry Pi Pico | 2 MB | 264 KB | 133 MHz Cortex-M0+ | 30 mW | $4 | Sensor fusion, vibration analysis |
| STM32L4 | 512 KB | 128 KB | 80 MHz Cortex-M4 | 3 mW | $5 | Ultra-low-power sensing |
| Nordic nRF52840 | 1 MB | 256 KB | 64 MHz Cortex-M4 | 2-8 mW | $7 | BLE + ML wearables |
22.3.2 Hardware Selection Decision Tree
Choosing the right TinyML platform depends on your model requirements and power constraints:
22.3.3 Model Size Constraints
Typical TinyML Memory Budget:
| Component | Size Range | Storage | Notes |
|---|---|---|---|
| Model weights | 20-200 KB | Flash | Int8 quantized |
| Activation tensors | 10-50 KB | RAM | Intermediate layer outputs |
| Input buffer | 5-20 KB | RAM | Sensor data window |
| Framework overhead | 50-100 KB | Flash | TFLM runtime |
| TOTAL | 100-500 KB Flash / 50-150 KB RAM | Flash + RAM | Sum of the components above |
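The budget table above translates directly into a feasibility check. A minimal sketch — the `fits_on_mcu` helper and its default firmware overheads are illustrative assumptions, not part of any framework:

```python
# Illustrative memory-budget check (hypothetical helper, not a real library).
# Figures mirror the budget table above: weights + runtime live in flash,
# activations + input buffer live in RAM.

def fits_on_mcu(model_flash_kb, model_ram_kb, mcu_flash_kb, mcu_ram_kb,
                firmware_flash_kb=150, firmware_ram_kb=30):
    """Return True if model plus firmware fits the MCU's flash and RAM."""
    total_flash = model_flash_kb + firmware_flash_kb
    total_ram = model_ram_kb + firmware_ram_kb
    return total_flash <= mcu_flash_kb and total_ram <= mcu_ram_kb

# Wake word model (18 KB weights + 50 KB TFLM runtime = 68 KB flash,
# 35 KB arena) on an Arduino Nano 33 BLE (1 MB flash, 256 KB RAM):
print(fits_on_mcu(68, 35, 1024, 256))    # fits
# Image classification (150 KB + 50 KB runtime, 100 KB arena) on an STM32L4
# (512 KB flash, 128 KB RAM): arena + firmware RAM exceeds 128 KB
print(fits_on_mcu(200, 100, 512, 128))   # does not fit
```

Running this check against the platform table in 22.3.1 is a useful first pass before committing to hardware.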
Real-World TinyML Model Examples:
| Application | Model Size | RAM Required | Typical Accuracy |
|---|---|---|---|
| Wake word detection | 18 KB | 35 KB | 92% |
| Gesture recognition | 45 KB | 65 KB | 88% |
| Anomaly detection | 30 KB | 40 KB | 96% |
| Image classification | 150 KB | 100 KB | 94% |
Think of TinyML like a smart insect brain versus a human brain.
A human brain (cloud AI) has billions of neurons and consumes 20 watts of power. An insect brain (TinyML) has only thousands of neurons but can fly, navigate, and avoid predators on microwatts of power.
TinyML lets tiny devices make smart decisions:
- Your fitness tracker detects when you’re running vs walking using a 30 KB model
- Smart earbuds recognize “Hey Siri” using 18 KB of neural network
- A wildlife camera classifies animals without internet using 150 KB of vision AI
The magic is compression: We take big neural networks trained in the cloud and shrink them 10-100x to fit on $5 microcontrollers. The accuracy drops a little (maybe 95% instead of 98%), but the device works without internet, without batteries dying, and without sending your data anywhere.
Hey kids! Have you ever wondered how your smart watch knows you’re running without being connected to the internet? Let’s meet Tiny the TinyML Brain!
22.3.4 The Sensor Squad Adventure: Tiny Saves the Day
One sunny morning at the Smart Forest Wildlife Preserve, Tiny the TinyML Brain woke up inside a small camera box attached to a tree. Tiny was no bigger than a sugar cube, but inside that tiny chip was a whole neural network - like a miniature brain with thousands of tiny thinking pathways!
“Good morning, world!” chirped Tiny. “Time to watch for wildlife!” Unlike the big AI computers in the city that need tons of electricity, Tiny ran on just a tiny battery that could last for TWO WHOLE YEARS!
Suddenly, a deer walked past the camera. Clicky the Camera Sensor snapped a picture. “Hey Tiny! Is this important?”
Tiny’s neural network sprang into action. In just 50 milliseconds - faster than you can blink - Tiny examined the image using millions of tiny math calculations. “That’s a deer! Recording it!” Tiny saved the picture and went back to sleep to save battery power.
An hour later, a leaf blew past the camera. Clicky took another picture. But Tiny was smart! “That’s just a leaf, not an animal. No need to save that!” By ignoring the boring stuff, Tiny saved battery power and memory space.
Max the Motion Sensor was impressed. “Tiny, how did you get so smart on such a small brain?”
Tiny explained: “My creators taught a HUGE brain in the cloud everything about animals. Then they squeezed all that knowledge down small enough to fit inside me! It’s like taking a library full of books and shrinking it down to fit in your pocket. I might not know quite as much as the big brain, but I know enough to do my job perfectly!”
22.3.5 Key Words for Kids
| Word | What It Means |
|---|---|
| TinyML | Machine learning (making computers smart) on teeny-tiny computer chips |
| Neural Network | A computer program that thinks a little bit like a brain, with connected pathways |
| Quantization | Shrinking a big brain’s knowledge to fit in a tiny chip (like compressing a photo) |
| Milliwatt | A super tiny amount of power - TinyML uses so little it can run on a battery for years! |
| Inference | When Tiny looks at something and decides what it is (like recognizing a deer) |
22.3.6 Fun Facts
- A TinyML chip uses 1000x LESS power than your phone!
- Some TinyML devices can run for 10 years on a single battery!
- Your smart watch uses TinyML to know if you’re walking, running, or sleeping!
22.4 TinyML Deployment Pipeline
The journey from a full neural network to a TinyML model follows a systematic compression and deployment pipeline:
Key compression stages:
| Stage | Technique | Typical Reduction | Trade-off |
|---|---|---|---|
| Pruning | Remove near-zero weights | 2-10x smaller | Minor accuracy loss |
| Quantization | Float32 to Int8 | 4x smaller | 1-3% accuracy loss |
| Knowledge Distillation | Train smaller student model | 10-100x smaller | Model-dependent |
| Architecture Search | Find efficient network topology | Variable | Development time |
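The quantization stage can be sketched in a few lines. This is a minimal illustration of symmetric scale quantization; real converters also use per-channel scales and zero points:

```python
# Minimal sketch of post-training int8 quantization, showing the 4x size
# reduction (4-byte floats -> 1-byte ints) and the small rounding error.

def quantize_int8(weights):
    """Map float weights to int8 using a single symmetric scale."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.42, -1.30, 0.07, 0.99]   # float32: 4 bytes each
q, scale = quantize_int8(weights)     # int8: 1 byte each -> 4x smaller
restored = dequantize(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q)                              # [41, -127, 7, 97]
print(max_err < scale / 2 + 1e-9)     # rounding error bounded by scale/2
```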
22.5 TensorFlow Lite Micro
TensorFlow Lite Micro (TFLM) is a lightweight inference framework designed for microcontrollers, no operating system required.
22.5.1 Architecture
The TFLM architecture separates model storage from runtime execution:
How It Works:
- Cloud Training: Train a full TensorFlow model (typically 100 MB+ with float32 weights)
- Conversion: Use TFLite Converter to optimize for mobile/embedded
- Quantization: Convert float32 to int8 (4x size reduction)
- Deployment: Copy model bytes to microcontroller flash memory
- Inference: TFLM interpreter loads model, allocates tensor arena in RAM, runs inference
22.5.2 Supported Operations (Subset)
- Convolutional layers: Conv2D, DepthwiseConv2D
- Activation functions: ReLU, ReLU6, Sigmoid, Tanh
- Pooling: MaxPool2D, AveragePool2D
- Fully connected: Dense
- Normalization: BatchNormalization
- Utilities: Reshape, Concatenate, Add, Multiply
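When judging whether a layer fits the 3-3-3 budgets, a back-of-envelope cost model helps. An illustrative sketch for a stride-1, valid-padding Conv2D; the formulas are standard, but the example layer shape is a hypothetical keyword-spotting front end:

```python
# Rough cost model for one Conv2D layer (valid padding, stride 1) -
# useful for back-of-envelope latency and arena-size estimates.

def conv2d_cost(h, w, c_in, c_out, k):
    out_h, out_w = h - k + 1, w - k + 1
    params = k * k * c_in * c_out + c_out            # weights + biases
    macs = out_h * out_w * k * k * c_in * c_out      # multiply-accumulates
    activations = out_h * out_w * c_out              # output tensor elements
    return params, macs, activations

# First conv of a small keyword-spotting net: 49x40 spectrogram input,
# 1 channel, 8 filters of 3x3
params, macs, acts = conv2d_cost(49, 40, 1, 8, 3)
print(params, macs, acts)    # 80 128592 14288
```

Multiplying MACs by cycles-per-MAC for the target core (CMSIS-NN reaches close to 1 int8 MAC per cycle with SIMD on M4/M7) gives a first latency estimate.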
22.5.3 Example: Wake Word Detection
```cpp
// TensorFlow Lite Micro - Arduino sketch (conceptual)
#include <TensorFlowLite.h>
#include "tensorflow/lite/micro/micro_interpreter.h"
#include "tensorflow/lite/micro/micro_mutable_op_resolver.h"
#include "model.h"  // 18 KB wake word model, exported here as g_model[]

// Tensor arena: working RAM for activations and scratch buffers
constexpr int kTensorArenaSize = 35 * 1024;
uint8_t tensor_arena[kTensorArenaSize];

tflite::MicroInterpreter* interpreter = nullptr;
TfLiteTensor* input = nullptr;

void setup() {
  Serial.begin(115200);

  // Map the model bytes stored in flash
  const tflite::Model* model = tflite::GetModel(g_model);

  // Register only the ops the model uses to keep the binary small
  static tflite::MicroMutableOpResolver<4> resolver;
  resolver.AddConv2D();
  resolver.AddFullyConnected();
  resolver.AddSoftmax();
  resolver.AddReshape();

  static tflite::MicroInterpreter static_interpreter(
      model, resolver, tensor_arena, kTensorArenaSize);
  interpreter = &static_interpreter;

  // Carve tensors out of the arena; fails if the arena is too small
  interpreter->AllocateTensors();

  // Input tensor: 40 MFCC coefficients x 49 frames = 1960 features
  input = interpreter->input(0);
}

// Run inference on one window of audio features
void detectWakeWord(const float* audio_features) {
  for (int i = 0; i < 1960; i++) {
    input->data.f[i] = audio_features[i];
  }

  interpreter->Invoke();  // ~20 ms on a 64 MHz Cortex-M4

  // Output: probability of the wake word
  float confidence = interpreter->output(0)->data.f[0];
  if (confidence > 0.85f) {
    Serial.println("Wake word detected!");  // hand off to the larger system
  }
}
```
22.6 Edge Impulse: End-to-End TinyML Platform
Edge Impulse provides a complete no-code/low-code platform for building TinyML solutions: data collection -> labeling -> training -> deployment.
22.6.1 Workflow
Detailed workflow steps:
- Data Collection: Use smartphone or hardware to collect sensor data (audio, accelerometer, images)
- Data Labeling: Draw bounding boxes, segment audio, label time-series
- Signal Processing: Extract features (MFCC for audio, FFT for vibration, image resize)
- Model Training: AutoML selects architecture, trains on cloud, optimizes for edge
- Deployment: One-click export to Arduino, ESP32, Raspberry Pi, or custom hardware
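Step 3 (signal processing) can be illustrated with the simplest possible feature block: windowed RMS energy over raw accelerometer samples. This stands in for Edge Impulse's real DSP blocks (MFCC, spectral analysis), which are considerably more elaborate:

```python
# Sketch of a simple signal-processing block for accelerometer data:
# split the signal into windows and compute RMS energy per window.
import math

def rms_features(samples, window=4):
    feats = []
    for i in range(0, len(samples) - window + 1, window):
        w = samples[i:i + window]
        feats.append(math.sqrt(sum(x * x for x in w) / window))
    return feats

signal = [0.0, 1.0, 0.0, -1.0,   # oscillating window -> high RMS
          0.1, 0.0, -0.1, 0.0]   # near-still window  -> low RMS
print(rms_features(signal))      # first feature ~0.707, second ~0.071
```

The model then classifies these compact feature vectors instead of raw samples, which is what keeps input buffers in the 5-20 KB range.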
22.6.2 Example Use Cases
| Application | Sensor | Model Type | Model Size | Accuracy |
|---|---|---|---|---|
| Keyword Spotting | Microphone | 1D CNN | 18 KB | 92% |
| Gesture Recognition | Accelerometer | LSTM | 45 KB | 88% |
| Visual Inspection | Camera | MobileNetV2 | 150 KB | 94% |
| Predictive Maintenance | Vibration | Anomaly Detection | 30 KB | 96% |
22.6.3 Advantages
- Rapid prototyping (hours instead of weeks)
- Automatic feature engineering and model optimization
- Integrated data versioning and experiment tracking
- Hardware abstraction (same model runs on Arduino or ESP32)
22.6.4 TinyML Runtime Selection
Choosing the right runtime depends on your hardware constraints and model complexity:
| Runtime | Binary Size | Features | Best For |
|---|---|---|---|
| TensorFlow Lite Micro | 50-200 KB | Full inference, many ops | General TinyML, complex models |
| CMSIS-NN | 10-50 KB | ARM Cortex-M optimized | Ultra-low power, ARM MCUs |
| X-CUBE-AI | 50-100 KB | STM32 optimized, hardware accelerated | STM32 family devices |
| Edge Impulse SDK | 30-100 KB | AutoML generated, optimized | Rapid prototyping, production |
| microTVM | 20-80 KB | Compiler-based, any target | Custom hardware, maximum efficiency |
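The table above can be codified as a first-pass chooser. This is a hypothetical helper that encodes the table's rules of thumb, not a real API:

```python
# Hypothetical first-pass runtime chooser encoding the table above.
# Real selection also weighs op coverage, toolchain, and licensing.

def pick_runtime(arch, flash_budget_kb, need_many_ops):
    if arch.startswith("STM32"):
        return "X-CUBE-AI"                 # vendor-optimized for STM32
    if need_many_ops or flash_budget_kb >= 200:
        return "TensorFlow Lite Micro"     # broadest op support
    if arch.startswith("Cortex-M") and flash_budget_kb < 50:
        return "CMSIS-NN"                  # smallest footprint on ARM
    return "Edge Impulse SDK"              # good default for prototyping

print(pick_runtime("Cortex-M0+", 40, False))   # CMSIS-NN
print(pick_runtime("STM32L4", 100, False))     # X-CUBE-AI
```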
22.7 Knowledge Check
22.8 Summary
TinyML enables machine learning on ultra-low-power devices:
Hardware Options:
- $4-25 microcontrollers with 128-512 KB RAM
- Power consumption: 2-50 mW
- Model budgets: 20-200 KB weights, 30-100 KB RAM
Software Frameworks:
- TensorFlow Lite Micro for general TinyML
- Edge Impulse for rapid prototyping
- CMSIS-NN for ARM optimization
Typical Applications:
- Keyword spotting: 18 KB model, 92% accuracy
- Gesture recognition: 45 KB model, 88% accuracy
- Visual inspection: 150 KB model, 94% accuracy
22.9 Worked Example: TinyML vs Cloud Inference for Predictive Maintenance
Scenario: A wind farm in Aberdeenshire, Scotland has 60 turbines. Each turbine’s gearbox has 3 accelerometers sampling vibration at 4 kHz. The operator wants to detect bearing faults 2 weeks before failure using ML-based vibration analysis.
Option A – Cloud Inference:
- Raw data rate per turbine: 3 accelerometers x 4,000 samples/sec x 2 bytes = 24 KB/sec = 2.07 GB/day
- Total farm: 60 turbines x 2.07 GB = 124.2 GB/day
- 4G cellular backhaul cost: 124,200 MB/day x 30 days x GBP 0.02/MB = GBP 74,520/month
- Cloud inference (AWS SageMaker): GBP 2,800/month
- Inference latency: 200 ms (network) + 50 ms (inference) = 250 ms
- Total annual cost: GBP 928,000/year
Option B – TinyML Edge Inference:
- Each turbine gets an ESP32-S3 (240 MHz, 512 KB SRAM, 8 MB PSRAM)
- Vibration model: 1D CNN, 180 KB (INT8 quantised), trained on 6 months of gearbox data
- On-device processing: FFT + feature extraction + inference = 45 ms per 1-second window
- Only anomaly scores transmitted (not raw data): 1 byte per second (anomaly flag) = 86.4 KB/day per turbine
- Total farm data transmitted: 60 x 86.4 KB = 5.18 MB/day (24,000x reduction)
- 4G cost: 5.18 MB/day x 30 x GBP 0.02/MB = GBP 3.11/month
- Hardware: 60 x ESP32-S3 boards x GBP 8 = GBP 480 (one-time)
- Model retraining (quarterly, cloud): GBP 200/quarter
- Total annual cost: GBP 1,317/year
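The data-volume arithmetic above can be checked in a few lines. A minimal sketch using the worked example's assumptions (GBP 0.02/MB cellular rate, 1000-based units); note the chapter rounds per-turbine volume to 2.07 GB/day first, so its monthly figure (GBP 74,520) sits slightly below the exact value here:

```python
# Reproducing the back-of-envelope data volumes from the worked example.
SECONDS_PER_DAY = 86_400
TURBINES = 60

# Option A: raw vibration streamed to the cloud
bytes_per_sec = 3 * 4_000 * 2                    # 3 axes x 4 kHz x 2 bytes
gb_per_day_per_turbine = bytes_per_sec * SECONDS_PER_DAY / 1e9
farm_gb_per_day = TURBINES * gb_per_day_per_turbine
cloud_4g_per_month = farm_gb_per_day * 1_000 * 30 * 0.02   # MB x GBP/MB

# Option B: only a 1-byte anomaly score per second leaves each turbine
farm_mb_per_day = TURBINES * SECONDS_PER_DAY / 1e6
edge_4g_per_month = farm_mb_per_day * 30 * 0.02

print(round(farm_gb_per_day, 1))    # 124.4 GB/day
print(round(farm_mb_per_day, 2))    # 5.18 MB/day
print(round(farm_gb_per_day * 1000 / farm_mb_per_day))   # 24000x reduction
```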
Comparison:
| Metric | Cloud Inference | TinyML Edge | Difference |
|---|---|---|---|
| Annual cost | GBP 928,000 | GBP 1,317 | 704x cheaper |
| Data transmitted/day | 124.2 GB | 5.18 MB | 24,000x less |
| Detection latency | 250 ms | 45 ms | 5.6x faster |
| Works during 4G outage? | No (blind) | Yes (fully autonomous) | |
| Detection accuracy | 97.2% (full model, ResNet-18) | 94.8% (1D CNN, INT8) | 2.4% lower |
| Privacy | Raw vibration data leaves site | Only anomaly scores leave | Better |
Detection Accuracy Trade-off:
The 2.4% accuracy difference means the TinyML model misses ~1.4 additional faults per year across 60 turbines (assuming roughly one developing gearbox fault per turbine per year, i.e. about 60 fault events across the farm). Each missed fault costs approximately GBP 85,000 (emergency repair + lost generation). So:
- Cloud: 97.2% detection → misses 1.7 faults/year → GBP 144,500 in undetected failures
- TinyML: 94.8% detection → misses 3.1 faults/year → GBP 263,500 in undetected failures
- Accuracy cost penalty: GBP 119,000/year
- But TinyML saves GBP 926,720/year in infrastructure costs
- Net TinyML advantage: GBP 807,720/year
Key Insight: TinyML does not need to match cloud accuracy to be the better choice. The 2.4% accuracy penalty costs GBP 119K/year in additional missed faults, but the 24,000x data reduction saves GBP 927K/year in cellular and cloud costs. The economics are overwhelmingly in favour of edge inference for high-frequency sensor data.
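The net-advantage calculation can be reproduced the same way. A sketch under the example's assumptions (~60 fault events/year across the farm, GBP 85,000 per missed fault); using exact percentages rather than the rounded miss counts (1.7 / 3.1) gives slightly different totals than the text:

```python
# The accuracy-vs-infrastructure trade-off as arithmetic (illustrative).
FAULTS_PER_YEAR = 60          # assumed fault events across the farm
COST_PER_MISS = 85_000        # GBP per undetected failure

cloud_missed = FAULTS_PER_YEAR * (1 - 0.972)   # ~1.68 faults/year
tiny_missed = FAULTS_PER_YEAR * (1 - 0.948)    # ~3.12 faults/year
accuracy_penalty = (tiny_missed - cloud_missed) * COST_PER_MISS

infra_savings = 928_000 - 1_317                # annual cost difference
net_advantage = infra_savings - accuracy_penalty

print(round(accuracy_penalty))   # ~GBP 122,400/year penalty
print(net_advantage > 0)         # TinyML still comes out far ahead
```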
TinyML memory budgeting determines deployment viability on resource-constrained MCUs:
- \(\text{Total Flash needed} = \text{model weights} + \text{TFLM runtime} + \text{firmware code}\)
- \(\text{Total RAM needed} = \text{tensor arena} + \text{firmware variables}\)
Worked example (ESP32-S3 target):
- Flash: 180 KB model + 50 KB TFLM runtime + 150 KB firmware = 380 KB, well under the ESP32-S3's 8 MB flash
- RAM: 65 KB tensor arena + 80 KB firmware variables = 145 KB, fitting comfortably in the ESP32-S3's 512 KB SRAM
This headroom is what enables months of battery-powered predictive maintenance.
22.10 Knowledge Check
Common Pitfalls
Many Cortex-M0/M0+ microcontrollers lack hardware floating-point units. Deploying a float32 TFLite model on these devices causes software float emulation, making inference 10-100x slower than an equivalent int8 model. Always check target MCU datasheet for FPU presence and default to int8 quantization for TinyML deployments.
TinyML model files fit in flash, but inference also requires SRAM for activation buffers. A 20 KB model may need 50 KB of SRAM for intermediate activations — far more than an entry-level MCU provides (a classic Arduino Uno has only 2 KB of SRAM). Always profile peak SRAM usage with TFLite Micro's RecordingMicroAllocator before committing to a specific MCU.
Training a vibration classifier on 16-bit PC audio samples then deploying to a 12-bit MCU ADC introduces quantization noise that shifts input distributions. The model sees different data than it was trained on, causing silent accuracy degradation. Always train on data collected with the same sensor chain (ADC, anti-aliasing filter, sampling rate) as the deployment hardware.
Running a Cortex-M4 at 168 MHz to hit inference latency targets while powered from a 200mAh coin cell exhausts the battery in 8 hours. TinyML deployment requires co-optimization: use the minimum clock speed that meets the latency requirement, then validate battery life with a current profiler to confirm the power budget is met.
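The coin-cell pitfall above falls out of a one-line power budget, and it also shows why duty cycling dominates battery life. An illustrative sketch; the current-draw figures are assumed, not measured:

```python
# Battery life from average current draw; duty cycling dominates the result.

def battery_life_hours(capacity_mah, active_ma, sleep_ma, duty_cycle):
    """duty_cycle = fraction of time the MCU is awake and inferring."""
    avg_ma = duty_cycle * active_ma + (1 - duty_cycle) * sleep_ma
    return capacity_mah / avg_ma

# Cortex-M4 at 168 MHz drawing ~25 mA, always on, 200 mAh coin cell:
print(round(battery_life_hours(200, 25.0, 0.005, 1.0)))          # 8 hours
# Same MCU duty-cycled to 1% awake with 5 uA sleep current:
print(round(battery_life_hours(200, 25.0, 0.005, 0.01) / 24))    # ~33 days
```

This is the co-optimization the pitfall describes: pick the lowest clock that meets latency, sleep aggressively between inferences, then confirm with a current profiler.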
22.11 What’s Next
Now that you can configure TinyML platforms and frameworks, continue to:
| Topic | Chapter | Description |
|---|---|---|
| Model Optimization Techniques | edge-ai-ml-optimization.html | Apply quantization, pruning, and knowledge distillation to compress models 10-100x for edge deployment |
| Hardware Accelerators | edge-ai-ml-hardware.html | Evaluate NPUs, TPUs, and GPUs for accelerating edge inference on different hardware platforms |
| Edge AI Lab: TinyML Gesture Recognition | edge-ai-ml-lab.html | Implement a working TinyML application with hands-on gesture recognition exercises |