22  TinyML on Microcontrollers

In 60 Seconds

TinyML brings machine learning to ultra-low-power microcontrollers with as little as 1 KB of RAM, enabling intelligent inference on battery-powered devices that last months or years without recharging. By compressing cloud-scale models (100 MB+) down to 20-200 KB through int8 quantization and architecture optimization, TinyML runs on $5 microcontrollers consuming just 2-50 mW, a 2,000-50,000x power reduction compared to GPU inference. Key frameworks include TensorFlow Lite Micro for general-purpose deployment, Edge Impulse for end-to-end no-code workflows, and CMSIS-NN for ARM-optimized kernels with the smallest footprint.

MVU: Minimum Viable Understanding

If you remember nothing else: TinyML runs machine-learning inference on ultra-low-power microcontrollers (as little as 1 KB of RAM), so battery-powered devices can make decisions locally for months or years without recharging.

The compression magic:

From (Cloud)       To (TinyML)    Reduction
100 MB model       20-200 KB      500-5,000x smaller
Float32 (32-bit)   Int8 (8-bit)   4x smaller
GPU required       $5 MCU works   ~1,000x cheaper
100 W power        2-50 mW        2,000-50,000x less

The “3-3-3 TinyML Rule”:

  • 3 KB minimum RAM for simple models (anomaly detection)
  • 30 KB typical RAM for useful models (gesture recognition)
  • 300 KB maximum RAM for complex models (keyword spotting, image classification)

Key frameworks:

  • TensorFlow Lite Micro (TFLM) - General-purpose, most popular
  • Edge Impulse - End-to-end platform, no-code option
  • CMSIS-NN - ARM-optimized, smallest footprint

Read on for hardware selection guidance and implementation details, or jump to Knowledge Check to test your understanding.

Key Concepts
  • TinyML: Machine learning inference on microcontroller-class hardware (Cortex-M, RISC-V) with <1MB flash, <256KB RAM, running at milliwatt power levels
  • TFLite Micro: Google’s framework for deploying quantized TFLite models on bare-metal microcontrollers without OS dependencies, fitting in 16KB RAM
  • Cortex-M: ARM processor family (M0 through M85) commonly used for TinyML; M4/M7 include DSP extensions and FPU that accelerate int8 neural network inference
  • CMSIS-NN: ARM’s optimized neural network kernels for Cortex-M processors using SIMD instructions to achieve 5-10x inference speedup over naive implementation
  • Memory Footprint: Critical TinyML constraint — model size must fit in flash (typically 256KB-1MB) while activation buffers must fit in SRAM (typically 32KB-256KB)
  • Wake Word Detection: TinyML application running a 10-30KB model continuously at 0.1-1mW to detect a specific audio trigger phrase before activating a larger system
  • Anomaly Detection on MCU: TinyML technique using small autoencoder or one-class classifier to identify unusual sensor patterns without transmitting data to cloud
  • Always-On Inference: Continuous ML inference running on an MCU in low-power mode (1-10mW), waking higher-power systems only when a target event is detected

22.1 Learning Objectives

By the end of this chapter, you will be able to:

  • Select TinyML Platforms: Choose appropriate microcontrollers based on model size, RAM, and power requirements
  • Analyze Model Constraints: Calculate memory budgets for weights, activations, and runtime overhead
  • Configure TensorFlow Lite Micro: Deploy ML models on microcontrollers using the TFLM framework
  • Implement Edge Impulse Workflows: Build end-to-end TinyML solutions from data collection to deployment
  • Calculate Memory Budgets: Determine if a model fits on target hardware
  • Select Appropriate Runtimes: Choose between TFLM, CMSIS-NN, Edge Impulse SDK based on constraints

Key Business Value: TinyML enables intelligent devices without cloud connectivity costs, achieving 90-99% reduction in data transmission while maintaining real-time decision-making capability. This transforms product economics for wearables, industrial sensors, and consumer electronics.

Decision Framework:

Factor               Consideration                               Typical Range
Hardware Cost        MCU with ML capability                      $4-25 per unit
Development Cost     Model training, optimization, deployment    $20,000-100,000
Power Savings        Battery life extension vs cloud-connected   10-100x longer
Accuracy Trade-off   TinyML vs cloud inference                   90-98% of cloud accuracy

When to Choose TinyML:

  • Product requires months/years of battery life
  • Privacy-sensitive data that should not leave device
  • Real-time response needed (< 100ms) without network dependency
  • High-volume deployment where cloud costs would be prohibitive
  • Offline operation in remote or connectivity-limited environments

When NOT to Choose TinyML:

  • Complex models requiring > 512 KB RAM
  • Continuous learning/retraining needed on device
  • Accuracy requirements cannot tolerate quantization loss
  • Single/low-volume deployment where cloud is cost-effective

Competitive Landscape: Edge Impulse dominates rapid prototyping. Google (TensorFlow Lite Micro) leads open-source. ARM (CMSIS-NN) provides optimized kernels. Specialized silicon emerging from Syntiant, Eta Compute, and others.

Implementation Timeline:

  1. Phase 1 (Week 1-2): Proof of concept - Use Edge Impulse to validate feasibility
  2. Phase 2 (Week 3-6): Model development - Train, quantize, and optimize for target hardware
  3. Phase 3 (Week 7-10): Integration - Deploy to production hardware, validate power/performance
  4. Phase 4 (Week 11-12): Production readiness - Testing, certification, manufacturing handoff

22.2 Introduction

TinyML brings machine learning to ultra-low-power microcontrollers with as little as 1 KB of RAM. This enables intelligent inference on battery-powered devices lasting months or years.

The challenge is fitting capable neural networks into devices with a tiny fraction of the memory and compute of even a modern smartphone. This chapter explores the hardware platforms, software frameworks, and design patterns that make TinyML possible.

22.3 What Fits on a Microcontroller?

22.3.2 Hardware Selection Decision Tree

Choosing the right TinyML platform depends on your model requirements and power constraints:

Flowchart decision tree for selecting TinyML hardware platforms. Starting point asks if model requires more than 300 KB RAM - if yes, recommends ESP32-S3 with 512 KB RAM for complex models. If no, asks about battery life requirements over 1 year. For ultra-low power needs, branches to Nordic nRF52840 or STM32L4 based on model size. For budget-constrained projects, recommends Raspberry Pi Pico at $4. For best ecosystem support, recommends Arduino Nano 33 BLE.

Figure 22.1: TinyML hardware selection decision tree based on model and power requirements

22.3.3 Model Size Constraints

Typical TinyML Memory Budget:

Component            Size Range   Storage   Notes
Model weights        20-200 KB    Flash     Int8 quantized
Activation tensors   10-50 KB     RAM       Intermediate layer outputs
Input buffer         5-20 KB      RAM       Sensor data window
Framework overhead   50-100 KB    Flash     TFLM runtime
TOTAL                100-500 KB Flash, 50-150 KB RAM

Real-World TinyML Model Examples:

Application            Model Size   RAM Required   Typical Accuracy
Wake word detection    18 KB        35 KB          92%
Gesture recognition    45 KB        65 KB          88%
Anomaly detection      30 KB        40 KB          96%
Image classification   150 KB       100 KB         94%

Think of TinyML like a smart insect brain versus a human brain.

A human brain (cloud AI) has billions of neurons and consumes about 20 watts of power. An insect brain (TinyML) has a tiny fraction of that neuron count, yet it can fly, navigate, and avoid predators on a minuscule power budget.

TinyML lets tiny devices make smart decisions:

  • Your fitness tracker detects when you’re running vs walking using a 30 KB model
  • Smart earbuds recognize “Hey Siri” using 18 KB of neural network
  • A wildlife camera classifies animals without internet using 150 KB of vision AI

The magic is compression: We take big neural networks trained in the cloud and shrink them 10-100x to fit on $5 microcontrollers. The accuracy drops a little (maybe 95% instead of 98%), but the device works without internet, without batteries dying, and without sending your data anywhere.

Hey kids! Have you ever wondered how your smart watch knows you’re running without being connected to the internet? Let’s meet Tiny the TinyML Brain!

22.3.4 The Sensor Squad Adventure: Tiny Saves the Day

One sunny morning at the Smart Forest Wildlife Preserve, Tiny the TinyML Brain woke up inside a small camera box attached to a tree. Tiny was no bigger than a sugar cube, but inside that tiny chip was a whole neural network - like a miniature brain with thousands of tiny thinking pathways!

“Good morning, world!” chirped Tiny. “Time to watch for wildlife!” Unlike the big AI computers in the city that need tons of electricity, Tiny ran on just a tiny battery that could last for TWO WHOLE YEARS!

Suddenly, a deer walked past the camera. Clicky the Camera Sensor snapped a picture. “Hey Tiny! Is this important?”

Tiny’s neural network sprang into action. In just 50 milliseconds - faster than you can blink - Tiny examined the image using millions of tiny math calculations. “That’s a deer! Recording it!” Tiny saved the picture and went back to sleep to save battery power.

An hour later, a leaf blew past the camera. Clicky took another picture. But Tiny was smart! “That’s just a leaf, not an animal. No need to save that!” By ignoring the boring stuff, Tiny saved battery power and memory space.

Max the Motion Sensor was impressed. “Tiny, how did you get so smart on such a small brain?”

Tiny explained: “My creators taught a HUGE brain in the cloud everything about animals. Then they squeezed all that knowledge down small enough to fit inside me! It’s like taking a library full of books and shrinking it down to fit in your pocket. I might not know quite as much as the big brain, but I know enough to do my job perfectly!”

22.3.5 Key Words for Kids

Word             What It Means
TinyML           Machine learning (making computers smart) on teeny-tiny computer chips
Neural Network   A computer program that thinks a little bit like a brain, with connected pathways
Quantization     Shrinking a big brain’s knowledge to fit in a tiny chip (like compressing a photo)
Milliwatt        A super tiny amount of power - TinyML uses so little it can run on a battery for years!
Inference        When Tiny looks at something and decides what it is (like recognizing a deer)

22.3.6 Fun Facts

  • A TinyML chip uses 1000x LESS power than your phone!
  • Some TinyML devices can run for 10 years on a single battery!
  • Your smart watch uses TinyML to know if you’re walking, running, or sleeping!

22.4 TinyML Deployment Pipeline

The journey from a full neural network to a TinyML model follows a systematic compression and deployment pipeline:

Left-to-right pipeline diagram showing four stages of TinyML deployment. Stage 1 Cloud Training: dataset of 10K+ samples flows through TensorFlow/PyTorch training to produce a 100+ MB Float32 model. Stage 2 Model Optimization: pruning removes redundant weights, quantization converts Float32 to Int8, knowledge distillation compresses further. Stage 3 Conversion: TFLite Converter creates Flatbuffer format, then C Array byte representation. Stage 4 Microcontroller: model weights stored in Flash memory, tensor arena allocated in RAM, TFLM runtime executes inference.

Figure 22.2: TinyML deployment pipeline: from cloud training to microcontroller inference

Key compression stages:

Stage                    Technique                         Typical Reduction   Trade-off
Pruning                  Remove near-zero weights          2-10x smaller       Minor accuracy loss
Quantization             Float32 to Int8                   4x smaller          1-3% accuracy loss
Knowledge Distillation   Train smaller student model       10-100x smaller     Model-dependent
Architecture Search      Find efficient network topology   Variable            Development time

22.5 TensorFlow Lite Micro

TensorFlow Lite Micro (TFLM) is a lightweight inference framework designed for microcontrollers, no operating system required.

22.5.1 Architecture

The TFLM architecture separates model storage from runtime execution:

Layered architecture diagram showing TensorFlow Lite Micro components on a microcontroller. Flash Memory section contains model weights (20-200 KB) and firmware code including TFLM runtime (50-200 KB). RAM section contains tensor arena for input/output buffers and intermediate activations (30-100 KB), plus stack and heap for runtime variables (10-30 KB). TFLM Interpreter section shows Op Resolver for supported operations, Execution Engine for layer-by-layer processing, and Memory Allocator for arena management. Arrows show data flow between components.

Figure 22.3: TensorFlow Lite Micro runtime architecture on a microcontroller

How It Works:

  1. Cloud Training: Train a full TensorFlow model (typically 100 MB+ with float32 weights)
  2. Conversion: Use TFLite Converter to optimize for mobile/embedded
  3. Quantization: Convert float32 to int8 (4x size reduction)
  4. Deployment: Copy model bytes to microcontroller flash memory
  5. Inference: TFLM interpreter loads model, allocates tensor arena in RAM, runs inference

22.5.2 Supported Operations (Subset)

  • Convolutional layers: Conv2D, DepthwiseConv2D
  • Activation functions: ReLU, ReLU6, Sigmoid, Tanh
  • Pooling: MaxPool2D, AveragePool2D
  • Fully connected: Dense
  • Normalization: BatchNormalization
  • Utilities: Reshape, Concatenate, Add, Multiply

22.5.3 Example: Wake Word Detection

// TensorFlow Lite Micro - Arduino Example (Conceptual)
#include <TensorFlowLite.h>
#include "tensorflow/lite/micro/micro_interpreter.h"
#include "tensorflow/lite/micro/micro_mutable_op_resolver.h"
#include "model.h" // 18 KB wake word model, exported as C array g_model

// Tensor arena: RAM for input/output buffers and intermediate activations
constexpr int kTensorArenaSize = 35 * 1024;
uint8_t tensor_arena[kTensorArenaSize];

tflite::MicroInterpreter* interpreter = nullptr;
TfLiteTensor* input = nullptr;

void setup() {
  Serial.begin(115200);
  const tflite::Model* model = tflite::GetModel(g_model);

  // Register only the ops this model uses to keep binary size small
  static tflite::MicroMutableOpResolver<4> ops_resolver;
  ops_resolver.AddConv2D();
  ops_resolver.AddFullyConnected();
  ops_resolver.AddSoftmax();
  ops_resolver.AddReshape();

  static tflite::MicroInterpreter static_interpreter(
      model, ops_resolver, tensor_arena, kTensorArenaSize);
  interpreter = &static_interpreter;
  interpreter->AllocateTensors();

  // Input tensor: audio features (40 MFCC coefficients x 49 frames = 1960)
  input = interpreter->input(0);
}

// Run inference on one window of audio features
void detectWakeWord(const float* audio_features) {
  // Copy features to input tensor
  for (int i = 0; i < 1960; i++) {
    input->data.f[i] = audio_features[i];
  }

  // Invoke inference (takes ~20 ms on a Cortex-M4-class MCU)
  interpreter->Invoke();

  // Output: probability of wake word
  float confidence = interpreter->output(0)->data.f[0];
  if (confidence > 0.85f) {
    Serial.println("Wake word detected!"); // e.g., start streaming to cloud
  }
}

22.6 Edge Impulse: End-to-End TinyML Platform

Edge Impulse provides a complete no-code/low-code platform for building TinyML solutions: data collection -> labeling -> training -> deployment.

22.6.1 Workflow

Five-stage horizontal workflow diagram for Edge Impulse TinyML development. Stage 1 Data Collection: smartphone app, development board, or file upload (CSV/WAV). Stage 2 Data Labeling: audio segmentation, image bounding boxes, or time-series annotation. Stage 3 Signal Processing: MFCC for audio features, FFT for vibration, or image preprocessing. Stage 4 Model Training: AutoML architecture search, quantization optimization, and validation testing. Stage 5 Deployment: outputs to Arduino library, ESP32 firmware, or WebAssembly for browser.

Figure 22.4: Edge Impulse end-to-end TinyML development workflow

Detailed workflow steps:

  1. Data Collection: Use smartphone or hardware to collect sensor data (audio, accelerometer, images)
  2. Data Labeling: Draw bounding boxes, segment audio, label time-series
  3. Signal Processing: Extract features (MFCC for audio, FFT for vibration, image resize)
  4. Model Training: AutoML selects architecture, trains on cloud, optimizes for edge
  5. Deployment: One-click export to Arduino, ESP32, Raspberry Pi, or custom hardware

22.6.2 Example Use Cases

Application              Sensor          Model Type          Model Size   Accuracy
Keyword Spotting         Microphone      1D CNN              18 KB        92%
Gesture Recognition      Accelerometer   LSTM                45 KB        88%
Visual Inspection        Camera          MobileNetV2         150 KB       94%
Predictive Maintenance   Vibration       Anomaly Detection   30 KB        96%

22.6.3 Advantages

  • Rapid prototyping (hours instead of weeks)
  • Automatic feature engineering and model optimization
  • Integrated data versioning and experiment tracking
  • Hardware abstraction (same model runs on Arduino or ESP32)

22.6.4 TinyML Runtime Selection

Choosing the right runtime depends on your hardware constraints and model complexity:

Runtime                 Binary Size   Features                                Best For
TensorFlow Lite Micro   50-200 KB     Full inference, many ops                General TinyML, complex models
CMSIS-NN                10-50 KB      ARM Cortex-M optimized kernels          Ultra-low power, ARM MCUs
X-CUBE-AI               50-100 KB     STM32 optimized, hardware accelerated   STM32 family devices
Edge Impulse SDK        30-100 KB     AutoML generated, optimized             Rapid prototyping, production
microTVM                20-80 KB      Compiler-based, any target              Custom hardware, maximum efficiency

22.7 Knowledge Check

22.8 Summary

TinyML enables machine learning on ultra-low-power devices:

Hardware Options:

  • $4-25 microcontrollers with 128-512 KB RAM
  • Power consumption: 2-50 mW
  • Model budgets: 20-200 KB weights, 30-100 KB RAM

Software Frameworks:

  • TensorFlow Lite Micro for general TinyML
  • Edge Impulse for rapid prototyping
  • CMSIS-NN for ARM optimization

Typical Applications:

  • Keyword spotting: 18 KB model, 92% accuracy
  • Gesture recognition: 45 KB model, 88% accuracy
  • Visual inspection: 150 KB model, 94% accuracy

22.9 Worked Example: TinyML vs Cloud Inference for Predictive Maintenance

Scenario: A wind farm in Aberdeenshire, Scotland has 60 turbines. Each turbine’s gearbox has 3 accelerometers sampling vibration at 4 kHz. The operator wants to detect bearing faults 2 weeks before failure using ML-based vibration analysis.

Option A – Cloud Inference:

  • Raw data rate per turbine: 3 accelerometers x 4,000 samples/sec x 2 bytes = 24 KB/sec = 2.07 GB/day
  • Total farm: 60 turbines x 2.07 GB = 124.2 GB/day
  • 4G cellular backhaul cost: 124,200 MB/day x 30 days x GBP 0.02/MB = GBP 74,520/month
  • Cloud inference (AWS SageMaker): GBP 2,800/month
  • Inference latency: 200 ms (network) + 50 ms (inference) = 250 ms
  • Total annual cost: GBP 928,000/year

Option B – TinyML Edge Inference:

  • Each turbine gets an ESP32-S3 (240 MHz, 512 KB SRAM, 8 MB PSRAM)
  • Vibration model: 1D CNN, 180 KB (INT8 quantised), trained on 6 months of gearbox data
  • On-device processing: FFT + feature extraction + inference = 45 ms per 1-second window
  • Only anomaly scores transmitted (not raw data): 1 byte per second (anomaly flag) = 86.4 KB/day per turbine
  • Total farm data transmitted: 60 x 86.4 KB = 5.18 MB/day (24,000x reduction)
  • 4G cost: 5.18 MB/day x 30 x GBP 0.02/MB = GBP 3.11/month
  • Hardware: 60 x ESP32-S3 boards x GBP 8 = GBP 480 (one-time)
  • Model retraining (quarterly, cloud): GBP 200/quarter
  • Total annual cost: GBP 1,317/year (first year, including the one-time hardware cost)

Comparison:

Metric                    Cloud Inference                  TinyML Edge                 Difference
Annual cost               GBP 928,000                      GBP 1,317                   704x cheaper
Data transmitted/day      124.2 GB                         5.18 MB                     24,000x less
Detection latency         250 ms                           45 ms                       5.6x faster
Works during 4G outage?   No (blind)                       Yes (fully autonomous)      --
Detection accuracy        97.2% (full model, ResNet-18)    94.8% (1D CNN, INT8)        2.4% lower
Privacy                   Raw vibration data leaves site   Only anomaly scores leave   Better

Detection Accuracy Trade-off:

The 2.4% accuracy difference means the TinyML model misses ~1.4 additional faults per year across 60 turbines (assuming roughly one detectable fault event per turbine per year, about 60 farm-wide). Each missed fault costs approximately GBP 85,000 (emergency repair + lost generation). So:

  • Cloud: 97.2% detection → misses 1.7 faults/year → GBP 144,500 in undetected failures
  • TinyML: 94.8% detection → misses 3.1 faults/year → GBP 263,500 in undetected failures
  • Accuracy cost penalty: GBP 119,000/year
  • But TinyML saves GBP 926,720/year in infrastructure costs
  • Net TinyML advantage: GBP 807,720/year

Key Insight: TinyML does not need to match cloud accuracy to be the better choice. The 2.4% accuracy penalty costs GBP 119K/year in additional missed faults, but the 24,000x data reduction saves GBP 927K/year in cellular and cloud costs. The economics are overwhelmingly in favour of edge inference for high-frequency sensor data.

TinyML memory budgeting determines deployment viability on resource-constrained MCUs:

\(\text{Total Flash needed} = \text{model weights} + \text{TFLM runtime} + \text{firmware code}\)

\(\text{Total RAM needed} = \text{tensor arena} + \text{firmware variables}\)

Worked example: Flash: 180 KB model + 50 KB TFLM runtime + 150 KB firmware = 380 KB, well under the ESP32-S3’s 8 MB limit. RAM: 65 KB tensor arena + 80 KB firmware variables = 145 KB, fitting in the ESP32-S3’s 512 KB SRAM and enabling months of battery-powered predictive maintenance.

22.10 Knowledge Check

Common Pitfalls

Many Cortex-M0/M0+ microcontrollers lack hardware floating-point units. Deploying a float32 TFLite model on these devices causes software float emulation, making inference 10-100x slower than an equivalent int8 model. Always check target MCU datasheet for FPU presence and default to int8 quantization for TinyML deployments.

TinyML model files fit in flash, but inference also requires SRAM for activation buffers. A 20 KB model may need 50 KB of SRAM for intermediate activations, exceeding the 32 KB of SRAM on many entry-level Cortex-M0+ boards. Always profile peak SRAM usage with TFLite Micro's RecordingMicroAllocator (via RecordingMicroInterpreter) before targeting a specific MCU.

Training a vibration classifier on 16-bit PC audio samples then deploying to a 12-bit MCU ADC introduces quantization noise that shifts input distributions. The model sees different data than it was trained on, causing silent accuracy degradation. Always train on data collected with the same sensor chain (ADC, anti-aliasing filter, sampling rate) as the deployment hardware.

Running a Cortex-M4 at 168 MHz to hit inference latency targets while powered from a 200mAh coin cell exhausts the battery in 8 hours. TinyML deployment requires co-optimization: use the minimum clock speed that meets the latency requirement, then validate battery life with a current profiler to confirm the power budget is met.

22.11 What’s Next

Now that you can configure TinyML platforms and frameworks, continue to:

Topic                                     Chapter                        Description
Model Optimization Techniques             edge-ai-ml-optimization.html   Apply quantization, pruning, and knowledge distillation to compress models 10-100x for edge deployment
Hardware Accelerators                     edge-ai-ml-hardware.html       Evaluate NPUs, TPUs, and GPUs for accelerating edge inference on different hardware platforms
Edge AI Lab: TinyML Gesture Recognition   edge-ai-ml-lab.html            Implement a working TinyML application with hands-on gesture recognition exercises