26 Lab: TinyML Gestures
Lab execution time can be estimated before starting runs:
\[ T_{\text{total}} = N_{\text{runs}} \times (t_{\text{setup}} + t_{\text{run}} + t_{\text{review}}) \]
Worked example: With 5 runs and per-run times of 4 min setup, 6 min execution, and 3 min review, total lab time is \(5\times(4+6+3)=65\) minutes. This prevents under-scoping and helps schedule complete experimental cycles.
26.1 Learning Objectives
By the end of this lab, you will be able to:
- Explain Neural Network Inference: Describe how data flows through fully-connected layers with activations and trace a forward pass step by step
- Compare Quantization Effects: Measure memory/speed vs accuracy trade-offs between float32 and int8 and calculate compression ratios
- Analyze Pruning Impact: Predict when weight removal significantly degrades model performance and identify the pruning cliff threshold
- Interpret Softmax Output: Convert logits to probabilities and justify confidence-based decision thresholds
- Calculate Inference Latency: Benchmark microsecond-level inference times on microcontrollers and evaluate real-time feasibility
- Design Constrained ML Systems: Apply appropriate optimization techniques for target microcontroller hardware given flash, RAM, and power budgets
26.2 Introduction
Running machine learning on microcontrollers – often called TinyML – is one of the fastest-growing areas of IoT. Instead of streaming sensor data to a cloud server for classification, TinyML enables on-device inference in microseconds, with milliwatt power budgets and no network dependency. This lab lets you experience the core building blocks of TinyML hands-on: forward-pass inference, quantization, pruning, and softmax classification.
You will build a simulated gesture recognition system on an ESP32 that classifies accelerometer patterns into four gestures: shake, tap, tilt, and circle. Along the way, you will observe first-hand how model compression techniques (int8 quantization, weight pruning) trade accuracy for memory and speed – the central design tension of edge AI.
If you only remember three things from this lab:
- On-device inference eliminates network latency (sub-millisecond response), bandwidth costs, and privacy risks by running ML directly on the microcontroller – no cloud connection required.
- Int8 quantization compresses a float32 model by 4x and speeds inference by 1.5–2x, with typically less than 2% accuracy loss – this single technique makes microcontroller ML practical.
- Pruning removes redundant weights (up to 50–70% safely), but beyond a threshold the “pruning cliff” causes sharp accuracy collapse – always test incrementally.
Sammy the Sensor says: “Imagine you teach a really small robot to recognize when you shake, tap, or tilt it – and the robot can figure it out all by itself, without asking the internet for help!”
Lila the LED explains: “It is like how you learn to catch a ball. At first your brain has to think really hard. But after practice, you catch it almost instantly. TinyML teaches a tiny computer chip to recognize patterns almost instantly too!”
Max the Microcontroller adds: “The cool part? This tiny brain uses less power than a small LED light. It can run for years on a single battery while making smart decisions about what it senses.”
Bella the Battery chimes in: “And because the tiny brain does the thinking right here on the chip, I do not have to waste energy sending data to the internet. That means I can keep going for months – or even years – on a single charge!”
Real-world example: Your smartwatch uses TinyML to detect when you raise your wrist to check the time. It runs a tiny neural network right on the watch chip – no phone or internet connection needed!
TinyML inference pipeline: accelerometer data flows through two hidden layers with ReLU activation to a softmax output producing gesture probabilities.
26.3 Lab Overview
Explore how TinyML enables machine learning inference on microcontrollers. This hands-on lab demonstrates the core concepts of edge AI without requiring specialized ML hardware or pre-trained models.
26.3.1 What You’ll Build
An ESP32-based TinyML simulator that demonstrates:
- Simulated Neural Network: A fully-connected network with configurable layers running inference on sensor patterns
- Gesture Recognition: Classify accelerometer patterns into gestures (shake, tap, tilt, circle)
- Quantization Comparison: Toggle between float32 and int8 inference to see memory/speed trade-offs
- Real-time Visualization: LED indicators and serial output showing classification results and confidence
- Model Pruning Demo: Visualize how removing weights affects inference
26.3.2 Hardware Requirements
For Wokwi Simulator (No Physical Hardware Needed):
- ESP32 DevKit v1
- OLED Display (SSD1306 128x64) - shows inference results
- 4x LEDs (Red, Yellow, Green, Blue) - gesture indicators
- Push button - trigger gesture input / mode switch
- Potentiometer - adjust simulated sensor noise
For Real Hardware (Optional):
- ESP32 DevKit v1
- MPU6050 accelerometer/gyroscope module
- SSD1306 OLED display (I2C)
- 4x LEDs with 220 ohm resistors
- Push button
- Breadboard + jumper wires
26.3.3 Circuit Diagram
26.3.4 Key Concepts Demonstrated
This lab illustrates several critical TinyML concepts:
| Concept | What You’ll See | Real-World Application |
|---|---|---|
| Forward Pass | Watch data flow through network layers | Understanding inference pipeline |
| Activation Functions | ReLU clipping negative values | Why non-linearity matters |
| Quantization | Float32 vs Int8 memory comparison | Model compression for MCUs |
| Softmax Output | Probability distribution across classes | Confidence-based decisions |
| Pruning | Zeroed weights visualized | Model size reduction |
| Inference Latency | Microsecond timing measurements | Real-time constraints |
TinyML workflow: model is trained and optimized on a PC or cloud, then exported as C arrays and flashed onto the ESP32 for on-device inference.
A forward pass is simply feeding data through the neural network from input to output, one layer at a time. Think of it like a series of filters:
- Input Layer: Raw sensor readings go in (12 numbers: 4 time samples of X, Y, Z acceleration)
- Hidden Layer 1: 16 neurons each compute a weighted sum of all 12 inputs, then apply ReLU (keep positive values, set negatives to zero)
- Hidden Layer 2: 8 neurons each compute a weighted sum of the 16 outputs from Layer 1, then apply ReLU again
- Output Layer: 4 neurons each compute a weighted sum, then softmax converts these into probabilities that sum to 1.0
The gesture with the highest probability is the prediction. No “learning” happens during inference – the weights are fixed. The network simply transforms input numbers into output probabilities using matrix multiplication and simple functions.
26.4 Wokwi Simulator
- Copy the code below into the Wokwi editor (replace default code)
- Click Run to start the simulation
- Press the button to cycle through demo modes (gesture recognition, quantization comparison, pruning demo)
- Adjust the potentiometer to add noise to input patterns
- Watch the Serial Monitor for detailed inference statistics
26.5 Lab Code: TinyML Gesture Recognition Demo
/*
* TinyML Gesture Recognition Simulator
* =====================================
*
* This educational demo simulates a TinyML gesture recognition system
* running on an ESP32 microcontroller. It demonstrates:
*
* 1. Neural network forward pass (fully-connected layers)
* 2. Activation functions (ReLU, Softmax)
* 3. Model quantization (float32 vs int8)
* 4. Weight pruning visualization
* 5. Real-time inference with confidence scoring
*
* Hardware:
* - ESP32 DevKit v1
* - SSD1306 OLED Display (I2C: SDA=21, SCL=22)
* - 4x LEDs (GPIO 12-15) for gesture indicators
* - Push button (GPIO 4) for mode switching
* - Potentiometer (GPIO 34) for noise adjustment
*
* Author: IoT Educational Platform
* License: MIT
*/
#include <Wire.h>
#include <math.h>
// ============================================================================
// PIN DEFINITIONS
// ============================================================================
#define PIN_SDA 21 // I2C SDA for OLED
#define PIN_SCL 22 // I2C SCL for OLED
#define PIN_LED_RED 12 // Shake gesture indicator
#define PIN_LED_YELLOW 13 // Tap gesture indicator
#define PIN_LED_GREEN 14 // Tilt gesture indicator
#define PIN_LED_BLUE 15 // Circle gesture indicator
#define PIN_BUTTON 4 // Mode switch button
#define PIN_POT 34 // Noise level potentiometer
// ============================================================================
// NEURAL NETWORK CONFIGURATION
// ============================================================================
// Network architecture: Input(12) -> Hidden1(16) -> Hidden2(8) -> Output(4)
// This simulates a small gesture recognition model
#define INPUT_SIZE 12 // 4 samples x 3 axes (X, Y, Z)
#define HIDDEN1_SIZE 16 // First hidden layer neurons
#define HIDDEN2_SIZE 8 // Second hidden layer neurons
#define OUTPUT_SIZE 4 // 4 gesture classes
// Gesture classes
#define GESTURE_SHAKE 0
#define GESTURE_TAP 1
#define GESTURE_TILT 2
#define GESTURE_CIRCLE 3
const char* gestureNames[] = {"SHAKE", "TAP", "TILT", "CIRCLE"};
const int gestureLEDs[] = {PIN_LED_RED, PIN_LED_YELLOW, PIN_LED_GREEN, PIN_LED_BLUE};
// Model weights (initialized in setup)
float weights_ih1[INPUT_SIZE][HIDDEN1_SIZE];
float weights_h1h2[HIDDEN1_SIZE][HIDDEN2_SIZE];
float weights_h2o[HIDDEN2_SIZE][OUTPUT_SIZE];
float bias_h1[HIDDEN1_SIZE];
float bias_h2[HIDDEN2_SIZE];
float bias_o[OUTPUT_SIZE];
// Quantized versions (int8)
int8_t weights_ih1_q[INPUT_SIZE][HIDDEN1_SIZE];
int8_t weights_h1h2_q[HIDDEN1_SIZE][HIDDEN2_SIZE];
int8_t weights_h2o_q[HIDDEN2_SIZE][OUTPUT_SIZE];
// Quantization scale factors
float scale_ih1 = 0.0f;
float scale_h1h2 = 0.0f;
float scale_h2o = 0.0f;
// Pruning mask
uint8_t pruning_mask_h1[INPUT_SIZE][HIDDEN1_SIZE];
float pruning_ratio = 0.0f;
// Gesture patterns (simulated accelerometer signatures)
float pattern_shake[INPUT_SIZE] = {
0.8f, 0.2f, 0.1f, -0.9f, 0.3f, 0.0f,
0.7f, -0.2f, 0.1f, -0.8f, 0.1f, 0.0f
};
float pattern_tap[INPUT_SIZE] = {
0.0f, 0.0f, 0.2f, 0.1f, 0.1f, 0.9f,
0.0f, 0.0f, -0.3f, 0.0f, 0.0f, 0.1f
};
float pattern_tilt[INPUT_SIZE] = {
0.1f, 0.7f, 0.3f, 0.3f, 0.5f, 0.3f,
0.5f, 0.3f, 0.3f, 0.7f, 0.1f, 0.3f
};
float pattern_circle[INPUT_SIZE] = {
0.0f, 0.7f, 0.0f, 0.7f, 0.0f, 0.0f,
0.0f, -0.7f, 0.0f, -0.7f, 0.0f, 0.0f
};
float* gesturePatterns[] = {pattern_shake, pattern_tap, pattern_tilt, pattern_circle};
// State variables
enum DemoMode {
MODE_GESTURE_RECOGNITION,
MODE_QUANTIZATION_COMPARE,
MODE_PRUNING_DEMO,
MODE_LAYER_VISUALIZATION,
MODE_COUNT
};
DemoMode currentMode = MODE_GESTURE_RECOGNITION;
int currentGestureDemo = 0;
unsigned long lastButtonPress = 0;
unsigned long lastInferenceTime = 0;
float noiseLevel = 0.0f;
// Layer activations for visualization
float activations_h1[HIDDEN1_SIZE];
float activations_h2[HIDDEN2_SIZE];
float activations_output[OUTPUT_SIZE];
// ============================================================================
// ACTIVATION FUNCTIONS
// ============================================================================
float relu(float x) {
return (x > 0) ? x : 0;
}
void softmax(float* input, float* output, int size) {
float maxVal = input[0];
for (int i = 1; i < size; i++) {
if (input[i] > maxVal) maxVal = input[i];
}
float sum = 0.0f;
for (int i = 0; i < size; i++) {
output[i] = exp(input[i] - maxVal);
sum += output[i];
}
for (int i = 0; i < size; i++) {
output[i] /= sum;
}
}
// ============================================================================
// NEURAL NETWORK FORWARD PASS (Float32)
// ============================================================================
void forwardPass_float32(float* input, float* output, bool verbose) {
// Hidden Layer 1
for (int j = 0; j < HIDDEN1_SIZE; j++) {
float sum = bias_h1[j];
for (int i = 0; i < INPUT_SIZE; i++) {
if (pruning_mask_h1[i][j]) {
sum += input[i] * weights_ih1[i][j];
}
}
activations_h1[j] = relu(sum);
}
// Hidden Layer 2
for (int j = 0; j < HIDDEN2_SIZE; j++) {
float sum = bias_h2[j];
for (int i = 0; i < HIDDEN1_SIZE; i++) {
sum += activations_h1[i] * weights_h1h2[i][j];
}
activations_h2[j] = relu(sum);
}
// Output Layer
float logits[OUTPUT_SIZE];
for (int j = 0; j < OUTPUT_SIZE; j++) {
float sum = bias_o[j];
for (int i = 0; i < HIDDEN2_SIZE; i++) {
sum += activations_h2[i] * weights_h2o[i][j];
}
logits[j] = sum;
}
// Softmax
softmax(logits, output, OUTPUT_SIZE);
for (int i = 0; i < OUTPUT_SIZE; i++) {
activations_output[i] = output[i];
}
if (verbose) {
Serial.print("[FWD] Output probs: ");
for (int i = 0; i < OUTPUT_SIZE; i++) {
Serial.print(gestureNames[i]);
Serial.print("=");
Serial.print(output[i] * 100, 1);
Serial.print("% ");
}
Serial.println();
}
}
// ============================================================================
// WEIGHT INITIALIZATION
// ============================================================================
void initializeWeights() {
Serial.println("\n[INIT] Initializing neural network weights...");
randomSeed(42);
for (int i = 0; i < INPUT_SIZE; i++) {
for (int j = 0; j < HIDDEN1_SIZE; j++) {
weights_ih1[i][j] = (random(-100, 100) / 100.0f) * 0.5f;
int gestureIdx = j / 4;
if (gestureIdx < OUTPUT_SIZE) { // bias each group of 4 neurons toward one gesture pattern
weights_ih1[i][j] += gesturePatterns[gestureIdx][i] * 0.3f;
}
pruning_mask_h1[i][j] = 1;
}
}
for (int i = 0; i < HIDDEN1_SIZE; i++) {
for (int j = 0; j < HIDDEN2_SIZE; j++) {
weights_h1h2[i][j] = (random(-100, 100) / 100.0f) * 0.5f;
int h1_gesture = i / 4;
int h2_gesture = j / 2;
if (h1_gesture == h2_gesture) {
weights_h1h2[i][j] += 0.3f;
}
}
}
for (int i = 0; i < HIDDEN2_SIZE; i++) {
for (int j = 0; j < OUTPUT_SIZE; j++) {
weights_h2o[i][j] = (random(-100, 100) / 100.0f) * 0.3f;
int h2_gesture = i / 2;
if (h2_gesture == j) {
weights_h2o[i][j] += 0.5f;
}
}
}
for (int i = 0; i < HIDDEN1_SIZE; i++) bias_h1[i] = (random(-50, 50) / 100.0f) * 0.1f;
for (int i = 0; i < HIDDEN2_SIZE; i++) bias_h2[i] = (random(-50, 50) / 100.0f) * 0.1f;
for (int i = 0; i < OUTPUT_SIZE; i++) bias_o[i] = 0.0f;
Serial.println("[INIT] Float32 weights initialized");
}
// ============================================================================
// QUANTIZATION
// ============================================================================
void quantizeWeights() {
Serial.println("\n[QUANT] Quantizing weights to INT8...");
float maxVal;
// Quantize the input->H1 layer (the remaining layers would follow the same steps)
maxVal = 0.0f;
for (int i = 0; i < INPUT_SIZE; i++) {
for (int j = 0; j < HIDDEN1_SIZE; j++) {
if (fabs(weights_ih1[i][j]) > maxVal) maxVal = fabs(weights_ih1[i][j]);
}
}
scale_ih1 = maxVal / 127.0f;
for (int i = 0; i < INPUT_SIZE; i++) {
for (int j = 0; j < HIDDEN1_SIZE; j++) {
weights_ih1_q[i][j] = (int8_t)(weights_ih1[i][j] / scale_ih1);
}
}
int totalParams = (INPUT_SIZE * HIDDEN1_SIZE) + HIDDEN1_SIZE +
(HIDDEN1_SIZE * HIDDEN2_SIZE) + HIDDEN2_SIZE +
(HIDDEN2_SIZE * OUTPUT_SIZE) + OUTPUT_SIZE;
Serial.print("[QUANT] INT8 model size: ");
Serial.print(totalParams);
Serial.println(" bytes");
Serial.println("[QUANT] Compression ratio: 4x");
}
// ============================================================================
// LED CONTROL
// ============================================================================
void setGestureLED(int gesture, bool state) {
digitalWrite(gestureLEDs[gesture], state ? HIGH : LOW);
}
void clearAllLEDs() {
for (int i = 0; i < OUTPUT_SIZE; i++) {
digitalWrite(gestureLEDs[i], LOW);
}
}
void showClassificationResult(int predictedClass, float confidence) {
clearAllLEDs();
setGestureLED(predictedClass, true);
if (confidence < 0.7f) {
delay(100);
setGestureLED(predictedClass, false);
delay(100);
setGestureLED(predictedClass, true);
}
}
// ============================================================================
// DEMO MODES
// ============================================================================
void generateGestureInput(int gestureType, float* output, float noise) {
float* pattern = gesturePatterns[gestureType];
for (int i = 0; i < INPUT_SIZE; i++) {
float noiseVal = (random(-100, 100) / 100.0f) * noise;
output[i] = constrain(pattern[i] + noiseVal, -1.0f, 1.0f);
}
}
void runGestureRecognitionDemo() {
Serial.println("\n========== GESTURE RECOGNITION MODE ==========");
currentGestureDemo = (currentGestureDemo + 1) % OUTPUT_SIZE;
Serial.print("\n[DEMO] Testing gesture: ");
Serial.println(gestureNames[currentGestureDemo]);
float input[INPUT_SIZE];
generateGestureInput(currentGestureDemo, input, noiseLevel);
Serial.print("[INPUT] Noise level: ");
Serial.print(noiseLevel * 100, 0);
Serial.println("%");
float output[OUTPUT_SIZE];
unsigned long startTime = micros();
forwardPass_float32(input, output, true);
unsigned long inferenceTime = micros() - startTime;
int predictedClass = 0;
float maxProb = output[0];
for (int i = 1; i < OUTPUT_SIZE; i++) {
if (output[i] > maxProb) {
maxProb = output[i];
predictedClass = i;
}
}
Serial.println("\n[RESULT] --------------------------------");
Serial.print("[RESULT] Predicted: ");
Serial.print(gestureNames[predictedClass]);
Serial.print(" (");
Serial.print(maxProb * 100, 1);
Serial.println("% confidence)");
Serial.print("[RESULT] Correct: ");
Serial.println(predictedClass == currentGestureDemo ? "YES" : "NO");
Serial.print("[RESULT] Inference time: ");
Serial.print(inferenceTime);
Serial.println(" microseconds");
showClassificationResult(predictedClass, maxProb);
}
// ============================================================================
// BUTTON HANDLER
// ============================================================================
void handleButton() {
static bool lastButtonState = HIGH;
bool buttonState = digitalRead(PIN_BUTTON);
if (buttonState == LOW && lastButtonState == HIGH) {
if (millis() - lastButtonPress > 300) {
lastButtonPress = millis();
currentMode = (DemoMode)((currentMode + 1) % MODE_COUNT);
Serial.println("\n\n========================================");
Serial.print("MODE CHANGED: ");
switch (currentMode) {
case MODE_GESTURE_RECOGNITION:
Serial.println("GESTURE RECOGNITION");
break;
case MODE_QUANTIZATION_COMPARE:
Serial.println("QUANTIZATION COMPARISON");
break;
case MODE_PRUNING_DEMO:
Serial.println("PRUNING VISUALIZATION");
break;
case MODE_LAYER_VISUALIZATION:
Serial.println("LAYER ACTIVATION VIEW");
break;
default:
break;
}
Serial.println("========================================\n");
clearAllLEDs();
}
}
lastButtonState = buttonState;
}
// ============================================================================
// SETUP
// ============================================================================
void setup() {
Serial.begin(115200);
delay(1000);
Serial.println("\n\n");
Serial.println("========================================");
Serial.println(" TinyML Gesture Recognition Lab");
Serial.println("========================================");
pinMode(PIN_LED_RED, OUTPUT);
pinMode(PIN_LED_YELLOW, OUTPUT);
pinMode(PIN_LED_GREEN, OUTPUT);
pinMode(PIN_LED_BLUE, OUTPUT);
pinMode(PIN_BUTTON, INPUT_PULLUP);
pinMode(PIN_POT, INPUT);
Serial.println("[BOOT] Testing LEDs...");
for (int i = 0; i < OUTPUT_SIZE; i++) {
setGestureLED(i, true);
delay(200);
setGestureLED(i, false);
}
initializeWeights();
quantizeWeights();
Serial.println("\n========================================");
Serial.println("INSTRUCTIONS:");
Serial.println("1. Press BUTTON to cycle through modes");
Serial.println("2. Turn POTENTIOMETER to adjust noise");
Serial.println("3. Watch LEDs for classification results:");
Serial.println(" RED=Shake, YELLOW=Tap, GREEN=Tilt, BLUE=Circle");
Serial.println("========================================\n");
Serial.println("[READY] Starting demo loop...\n");
}
// ============================================================================
// MAIN LOOP
// ============================================================================
void loop() {
handleButton();
int potValue = analogRead(PIN_POT);
noiseLevel = potValue / 4095.0f;
if (millis() - lastInferenceTime > 3000) {
lastInferenceTime = millis();
switch (currentMode) {
case MODE_GESTURE_RECOGNITION:
runGestureRecognitionDemo();
break;
default:
runGestureRecognitionDemo();
break;
}
}
delay(10);
}

26.6 Understanding the Code
The lab code demonstrates several key TinyML patterns. Let us walk through the most important sections before moving on to the challenges.
26.6.1 Network Architecture
The neural network uses a fully-connected (dense) architecture with three layers:
Input(12) --> Hidden1(16, ReLU) --> Hidden2(8, ReLU) --> Output(4, Softmax)
- 12 inputs: 4 time-window samples, each with X, Y, Z accelerometer axes
- 16 hidden neurons in Layer 1: extract low-level motion features
- 8 hidden neurons in Layer 2: combine features into gesture-level patterns
- 4 outputs: probability scores for shake, tap, tilt, circle
26.6.2 Memory Footprint Analysis
Understanding memory is critical for microcontroller deployment:
| Component | Float32 Size | Int8 Size | Savings |
|---|---|---|---|
| Weights (Input to H1) | 12 x 16 x 4 = 768 bytes | 12 x 16 x 1 = 192 bytes | 4x |
| Weights (H1 to H2) | 16 x 8 x 4 = 512 bytes | 16 x 8 x 1 = 128 bytes | 4x |
| Weights (H2 to Output) | 8 x 4 x 4 = 128 bytes | 8 x 4 x 1 = 32 bytes | 4x |
| Biases | 28 x 4 = 112 bytes | 28 x 1 = 28 bytes | 4x |
| Total | 1,520 bytes | 380 bytes | 4x |
For production models with millions of parameters, this 4x reduction is the difference between fitting on a microcontroller or not.
26.6.3 Quantization in Detail
The quantizeWeights() function implements symmetric min-max quantization:
- Find the maximum absolute value in each weight matrix
- Compute a scale factor: scale = max_abs / 127
- Map each float32 weight to an int8 value: int8_val = float_val / scale
- During inference, dequantize: float_val = int8_val * scale
This introduces quantization error – the difference between the original float and the dequantized value – but for most IoT models, this error is negligible.
Imagine you have a very precise kitchen scale that shows weight to three decimal places: 1.234 kg. Now imagine you only have a simple scale that shows whole numbers: 1 kg. You lose a tiny bit of detail, but the reading is still useful – and the simple scale is cheaper, smaller, and faster to read.
Quantization works the same way for neural networks:
- Float32 (the precise scale): Each weight is stored as a 32-bit decimal number. Very accurate, but uses 4 bytes of memory per weight.
- Int8 (the simple scale): Each weight is rounded to a whole number between -128 and +127. Less precise, but uses only 1 byte per weight.
Since a microcontroller like the ESP32 has limited memory (around 520 KB of RAM), fitting a model that uses 4 bytes per weight is much harder than one that uses 1 byte per weight. Quantization is the single most important trick for making ML fit on tiny chips.
The key insight: Neural networks are surprisingly tolerant of this rounding. A well-quantized model typically loses less than 2% accuracy – a small price for a 4x reduction in memory.
Quantization process: float32 weights are scaled and mapped to int8 values, achieving 4x memory compression.
26.6.4 Pruning and Sparsity
Weight pruning zeroes out small weights that contribute little to accuracy. The pruning_mask_h1 array stores which weights are active (1) or pruned (0). During inference, pruned weights are skipped:
if (pruning_mask_h1[i][j]) {
  sum += input[i] * weights_ih1[i][j];
}

At low sparsity levels (30–50%), accuracy is barely affected. Beyond 70–80% sparsity, accuracy degrades sharply – this is the pruning cliff that you will observe in the lab.
26.7 Knowledge Check
26.8 Question 1: Quantization Memory Savings
A neural network layer has 1,024 weights stored as float32. After int8 quantization, how much memory does this layer use?
- 512 bytes
- 1,024 bytes
- 2,048 bytes
- 4,096 bytes
b) 1,024 bytes. Each float32 weight uses 4 bytes (1,024 x 4 = 4,096 bytes total). Int8 quantization reduces each weight to 1 byte, so 1,024 weights x 1 byte = 1,024 bytes. This is a 4x reduction from the original 4,096 bytes.
26.9 Question 2: ReLU Activation
What does the ReLU activation function do to a neuron output of -0.35?
- Returns 0.35 (absolute value)
- Returns -0.35 (passes through unchanged)
- Returns 0 (clips negative values)
- Returns 0.65 (maps to 1 minus absolute value)
c) Returns 0 (clips negative values). ReLU (Rectified Linear Unit) is defined as max(0, x). For any negative input, it returns 0. For positive inputs, it returns the value unchanged. This simple non-linearity is critical: without it, stacking multiple layers would be equivalent to a single linear transformation.
26.10 Question 3: Softmax Interpretation
After the softmax layer, the output is [0.85, 0.08, 0.05, 0.02]. What does this mean?
- The model is 85% complete with training
- The model predicts the first gesture class with 85% confidence
- 85% of the weights point to the first class
- The first class has 85% of the neurons activated
b) The model predicts the first gesture class with 85% confidence. Softmax converts raw logits (unbounded numbers) into a probability distribution that sums to 1.0. The value 0.85 means the model assigns 85% probability to the first class (shake). In production systems, you would typically require confidence above a threshold (e.g., 70%) before acting on a prediction.
26.11 Question 4: Pruning Trade-offs
A TinyML model achieves 94% accuracy at 0% pruning and 91% accuracy at 50% pruning. At 80% pruning, accuracy drops to 72%. What phenomenon explains this?
- The model is overfitting to the pruned weights
- The pruning cliff – beyond a threshold, removing weights destroys critical information
- The quantization error compounds with pruning
- The softmax function cannot normalize sparse outputs
b) The pruning cliff – beyond a threshold, removing weights destroys critical information. Neural networks have significant redundancy, so moderate pruning (30-50%) barely affects accuracy. But past a critical threshold, remaining weights cannot compensate for removed ones, causing a steep accuracy drop. The exact cliff point depends on model architecture and data complexity. For this lab’s small model, the cliff appears around 70-80% sparsity.
26.12 Question 5: Edge vs Cloud Inference
Why would you choose on-device TinyML inference over sending data to a cloud ML service?
- TinyML models are always more accurate than cloud models
- On-device inference eliminates network latency, reduces bandwidth, and preserves data privacy
- Cloud services cannot run neural networks
- Microcontrollers have more compute power than cloud GPUs
b) On-device inference eliminates network latency, reduces bandwidth, and preserves data privacy. Cloud ML services generally offer larger, more accurate models. However, TinyML wins when you need sub-millisecond latency (real-time control), cannot guarantee network connectivity (remote sensors), must minimize data transmission (bandwidth/power), or must keep sensitive data on-device (privacy regulations). The trade-off is smaller model capacity and lower accuracy.
26.12.1 Choosing an Optimization Strategy
When deploying a trained model to a microcontroller, selecting the right optimization pipeline depends on your constraints. The following decision flow captures the practical reasoning that TinyML engineers apply:
Decision flow for TinyML model optimization: start with post-training quantization, add pruning if the model still exceeds flash memory, and resort to knowledge distillation or retraining only when simpler techniques are insufficient.
26.13 Challenge Exercises
Challenge: Add a Fifth Gesture (WAVE)
Difficulty: Medium | Time: 20 minutes
Extend the model to recognize a fifth gesture: WAVE (back-and-forth motion in the X-axis with decreasing amplitude).
- Define a new pattern array pattern_wave[INPUT_SIZE] with decaying X-axis oscillation
- Add "WAVE" to the gestureNames array
- Connect a fifth LED (e.g., GPIO 16) for the wave indicator
- Update OUTPUT_SIZE to 5 and reinitialize weights
Success Criteria: The model correctly classifies the wave pattern with >70% confidence.
Challenge: Early-Exit Inference
Difficulty: Medium | Time: 25 minutes
Add an “early exit” feature where inference stops at Hidden Layer 1 if confidence exceeds 90%, saving computation.
- Add a simple classifier after H1 (just 4 output neurons connected to first 16)
- Check confidence after H1 forward pass
- If max probability > 0.9, return early without computing H2 and output layers
- Track and display “early exit rate” (percentage of inferences that exit early)
Success Criteria: At least 30% of clean (low-noise) inputs should trigger early exit.
Challenge: Mixed-Precision Quantization
Difficulty: Hard | Time: 30 minutes
Implement per-layer quantization with different bit widths:
- Keep the first layer at int8 (8-bit)
- Reduce the second layer to int4 (4-bit) - modify the quantization to use only 16 levels
- Compare accuracy vs memory savings
Success Criteria: Document the accuracy-memory trade-off. Can you achieve <5% accuracy loss with 50% additional memory savings?
Challenge: On-Device Calibration
Difficulty: Hard | Time: 40 minutes
Add a “calibration mode” that adjusts weights based on user feedback:
- Add a second button for “correct/incorrect” feedback
- When the user indicates an incorrect classification, slightly adjust the output-layer weights toward the correct class
- Implement a simple learning rate (e.g., 0.01)
- Track improvement over 20 calibration iterations
Success Criteria: Demonstrate 5%+ accuracy improvement after calibration.
26.14 Expected Outcomes
After completing this lab, you should be able to:
| Skill | Demonstration |
|---|---|
| Understand Forward Pass | Explain how data flows through fully-connected layers with activations |
| Compare Quantization | Articulate the memory/speed vs accuracy trade-off of int8 quantization |
| Analyze Pruning Effects | Predict when pruning will significantly degrade model performance |
| Interpret Softmax Output | Convert logits to probabilities and explain confidence scoring |
| Estimate Inference Time | Measure and compare microsecond-level inference latencies |
| Design for Constraints | Choose appropriate optimization techniques for target hardware |
26.14.1 Quantitative Observations
Record these measurements when running the lab:
| Metric | Float32 | Int8 (Quantized) | Improvement |
|---|---|---|---|
| Inference time | ~200-400 us | ~100-200 us | 1.5-2x speedup |
| Model memory | 1,520 bytes | 380 bytes | 4x compression |
| Accuracy (clean input) | Baseline | ~1-2% lower | Minimal loss |
| Accuracy (noisy input) | Varies with noise | Varies with noise | Comparable |
Pruning observations to record:
- 0% pruning: baseline accuracy
- 30% pruning: minimal accuracy loss (<1%)
- 50% pruning: small accuracy loss (1-3%)
- 70% pruning: noticeable loss (3-8%)
- 90% pruning: severe degradation (>15% loss) – the pruning cliff
Optimization trade-off map showing that quantization-aware training and light pruning offer the best compression-to-accuracy ratios, while heavy pruning risks significant accuracy loss.
The concepts in this simulation directly apply to production TinyML development:
| Simulation Concept | Real-World Equivalent |
|---|---|
| Hand-crafted weights | TensorFlow/PyTorch training, Edge Impulse |
| forwardPass_float32() | TensorFlow Lite Micro interpreter |
| applyPruning() | TF Model Optimization Toolkit |
| quantizeWeights() | Post-training quantization, QAT |
| Gesture patterns | Real accelerometer data from MPU6050/LSM6DS3 |
Next Steps for Real Hardware:
- Export a trained model from Edge Impulse as a C++ library
- Replace simulated input with real MPU6050 accelerometer readings
- Use the official TensorFlow Lite Micro inference engine
- Deploy to production with over-the-air model updates
26.15 Common Mistakes and Pitfalls
Applying post-training quantization without running representative data through the model produces poor scale factors. Always use a calibration dataset that covers the expected input distribution.
Symptom: Quantized model accuracy drops by 10%+ instead of the expected 1-2%.
Fix: Run 100-500 representative samples through the float32 model to determine the activation ranges, then use those ranges for quantization.
Model size (weights) is only part of memory usage. During inference, intermediate activations consume additional RAM. For this lab’s model:
- Weights: 380 bytes (int8)
- Layer 1 activations: 16 x 4 = 64 bytes (float32)
- Layer 2 activations: 8 x 4 = 32 bytes (float32)
- Output activations: 4 x 4 = 16 bytes (float32)
- Total runtime RAM: ~492 bytes (not just 380 bytes)
Production models with larger layers can easily exceed microcontroller RAM limits even when weights fit in flash.
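The accounting above can be reproduced with a short helper, assuming the lab's 12-16-8-4 layer sizes, int8 weights plus one bias per output neuron, and float32 activation buffers (runtime_ram_bytes is an illustrative name, not a lab function):

```python
def runtime_ram_bytes(layer_sizes, weight_bytes=1, activation_bytes=4):
    """Model parameters plus per-layer activation buffers."""
    # Each layer pair contributes fan_in * fan_out weights and fan_out biases
    params = sum(a * b + b for a, b in zip(layer_sizes, layer_sizes[1:]))
    # Every non-input layer needs a buffer for its activations
    activations = sum(n * activation_bytes for n in layer_sizes[1:])
    return params * weight_bytes + activations

print(runtime_ram_bytes([12, 16, 8, 4]))  # -> 492 (380 parameter bytes + 112 activation bytes)
```

Changing `activation_bytes` to 1 models fully int8 inference, which is why int8 activations appear as a RAM-reduction technique later in this chapter.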
A model that works perfectly on clean gesture patterns may fail on real-world noisy data. Always test with the potentiometer at various noise levels (25%, 50%, 75%) to evaluate robustness. If accuracy drops sharply with moderate noise, the model needs more training data diversity or data augmentation.
Scenario: Deploy gesture recognition on a wearable device powered by a 2000 mAh 3.7V Li-ion battery. Target: 1 year battery life (365 days).
Power Budget Calculation:
Total Energy Available:
Battery: 2000 mAh × 3.7V = 7,400 mWh = 7.4 Wh
Daily Energy Allowance:
7,400 mWh / 365 days = 20.27 mWh/day
Average power: 20.27 mWh / 24 hours = 0.844 mW continuous
ESP32 Power Consumption (measured):
| State | Current @ 3.7V | Power | Duration/Day | Energy/Day |
|---|---|---|---|---|
| Deep Sleep | 10 μA | 37 μW | 23.9 hours | 0.885 mWh |
| Wake + Sample | 80 mA | 296 mW | 100 samples × 50ms = 5 sec | 0.411 mWh |
| Inference | 120 mA | 444 mW | 100 inferences × 0.3ms = 30 ms | 0.004 mWh |
| BLE Transmit (if used) | 150 mA | 555 mW | 10 transmissions × 2 sec = 20 sec | 3.083 mWh |
| Total | | | | 4.383 mWh/day |
Battery Life (without BLE):
7,400 mWh / (0.885 + 0.411 + 0.004) = 7,400 / 1.3 = 5,692 days = 15.6 years
Battery Life (with BLE transmit 10x/day):
7,400 mWh / 4.383 = 1,688 days = 4.6 years
Problem: BLE transmission dominates power budget (70% of daily energy)!
Optimization Strategy:
Option 1: Reduce BLE Transmissions
- Only transmit when a gesture is detected (no periodic keep-alive); an event notification needs only a short burst (~80 ms assumed here, versus a full 2 s keep-alive connection)
- Assume 20 gestures/day (vs 10 arbitrary transmits)
- BLE energy: 20 × 0.083 s × 555 mW / 3600 s/h ≈ 0.257 mWh/day
- New battery life: 7,400 / (1.3 + 0.257) = 4,752 days = 13 years ✓
Option 2: Switch to BLE 5.0 Long Range Mode
- Lower data rate but 4x longer range, 50% less power
- BLE energy: 0.257 / 2 = 0.129 mWh/day
- New battery life: 7,400 / (1.3 + 0.129) = 5,178 days = 14.2 years ✓
Option 3: Local Inference + Edge Aggregation
- Run inference on-device (no transmission per gesture)
- Transmit one short summary per day: “50 shakes, 30 taps, 10 tilts, 5 circles” (~84 ms burst assumed)
- BLE energy: 1 × 0.084 s × 555 mW / 3600 s/h ≈ 0.013 mWh/day
- New battery life: 7,400 / (1.3 + 0.013) = 5,637 days = 15.4 years ✓
Key Insight: On-device TinyML inference (0.004 mWh/day) consumes 770x LESS energy than BLE transmission (3.083 mWh/day). Local processing dramatically extends battery life by avoiding wireless communication.
Comparison Table:
| Approach | BLE Transmits/Day | Daily Energy | Battery Life |
|---|---|---|---|
| Cloud ML (stream all data) | 100 samples | 30.83 mWh | 240 days (8 months) |
| Edge ML + frequent BLE | 10 summaries | 4.38 mWh | 1,688 days (4.6 years) |
| Edge ML + event-driven BLE | 20 gestures | 1.56 mWh | 4,752 days (13 years) |
| Edge ML + daily summary | 1 summary | 1.31 mWh | 5,637 days (15.4 years) |
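The comparison rows all follow from one formula. Here is a sketch using the battery capacity and per-state energies from the tables above (small rounding differences against the table values are expected):

```python
BATTERY_MWH = 2000 * 3.7  # 2000 mAh at 3.7 V = 7,400 mWh
BASE_MWH_PER_DAY = 0.885 + 0.411 + 0.004  # deep sleep + sampling + inference

def battery_life_days(ble_mwh_per_day):
    """Days of operation given the BLE portion of the daily energy budget."""
    return BATTERY_MWH / (BASE_MWH_PER_DAY + ble_mwh_per_day)

for label, ble in [("frequent BLE (10 summaries)", 3.083),
                   ("event-driven BLE (20 gestures)", 0.257),
                   ("daily summary", 0.013)]:
    print(f"{label}: {battery_life_days(ble):.0f} days")
```

Because the non-BLE baseline is fixed at ~1.3 mWh/day, every mWh shaved off the radio translates directly into years of extra battery life.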
| Model Constraint | Primary Issue | Recommended Technique | Expected Result |
|---|---|---|---|
| Flash Storage (512KB limit) | Model >512KB | Int8 quantization | 4x compression, fits in flash |
| Flash Storage (256KB limit) | Quantized model still >256KB | Quantization + 50% pruning | 8x compression total |
| RAM (96KB limit) | Activation memory >96KB | Reduce hidden layer size OR use int8 activations | 2-4x RAM reduction |
| Inference Latency | >100ms (too slow for real-time) | Reduce model depth OR use MobileNet architecture | 3-10x speedup |
| Accuracy <90% (unacceptable) | Over-quantization or over-pruning | Quantization-aware training OR reduce pruning | +2-5% accuracy |
| Power Budget | >50mW avg (battery dies in weeks) | Reduce inference frequency OR optimize wake time | 10-100x power reduction |
Optimization Pipeline (ordered by effort):
Step 1: Post-Training Quantization (easiest, always do this first)
- Effort: Low (1-2 days with TensorFlow Lite)
- Result: 4x model size reduction, 1.5-2x speedup
- Accuracy loss: 0.5-2%
- When to use: Always (standard practice for TinyML)

Step 2: Pruning (if Step 1 insufficient)
- Effort: Medium (1-2 weeks to find optimal sparsity)
- Result: 2-4x additional size reduction at 50-70% sparsity
- Accuracy loss: 1-3% at 50% pruning, 5-10% at 70%
- When to use: Model still doesn’t fit in flash after quantization

Step 3: Quantization-Aware Training (if accuracy degraded)
- Effort: Medium-High (2-4 weeks, requires retraining)
- Result: Same 4x compression as Step 1, but <0.5% accuracy loss
- When to use: Post-training quantization lost >2% accuracy

Step 4: Knowledge Distillation (if model still too large)
- Effort: High (4-8 weeks, requires training student model)
- Result: 5-20x compression (train tiny student from large teacher)
- Accuracy loss: 2-5%
- When to use: Steps 1-3 combined still insufficient

Step 5: Architecture Search (last resort)
- Effort: Very High (months, requires ML expertise)
- Result: Custom architecture optimized for constraints
- Example: MobileNetV3, EfficientNet-Lite designed for mobile/edge
- When to use: Standard techniques insufficient for your use case
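For Step 4, the core of knowledge distillation is a temperature-softened loss between teacher and student outputs. Here is a minimal sketch (the temperature T=4 and the example logits are illustrative choices, not values from this lab):

```python
import math

def softened_probs(logits, T):
    """Softmax at temperature T; higher T flattens the distribution."""
    m = max(logits)
    exps = [math.exp((x - m) / T) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """KL(teacher || student) on temperature-softened distributions, scaled by T^2."""
    p = softened_probs(teacher_logits, T)
    q = softened_probs(student_logits, T)
    return T * T * sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# A student matching the teacher incurs zero loss; a mismatched one does not
print(distillation_loss([2.0, 0.5, -1.0, 0.1], [2.0, 0.5, -1.0, 0.1]))  # -> 0.0
print(distillation_loss([0.0, 0.0, 0.0, 0.0], [2.0, 0.5, -1.0, 0.1]) > 0)  # -> True
```

Training a tiny student to minimize this loss (usually mixed with ordinary cross-entropy on hard labels) is what yields the 5-20x compression cited above.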
Real-World Example Decision:
Gesture Recognition Model:
- Initial: 1.5MB float32, 92% accuracy, 450ms inference
- Target: <256KB flash, <200ms inference, >88% accuracy
Step 1: Int8 Quantization
- Result: 380KB, 90% accuracy (-2%), 250ms inference
- Status: Still >256KB, need more ✗
Step 2: Add 60% Pruning
- Result: 152KB (380 × 0.4), 87% accuracy (-5% total), 180ms inference
- Status: Fits in flash ✓, accuracy 87% < 88% target ✗
Step 3: Quantization-Aware Training
- Result: 152KB, 89.5% accuracy (-2.5% total), 180ms inference
- Status: Meets all targets ✓✓✓
Decision: Stop at Step 3 (QAT + 60% pruning). No need for distillation.
The Error: Applying post-training quantization using random or synthetic data instead of real sensor data from target deployment.
Real Example:
- Gesture recognition model trained on clean lab accelerometer data
- Quantized using synthetic Gaussian noise (random -1 to +1 values)
- Deployed to real wearable devices
- Result: 78% accuracy in field vs 92% accuracy in lab (14% degradation)
Why This Happens:
Quantization Calibration requires representative data to compute activation ranges:
```python
# WRONG: Synthetic calibration data
calibration_data = np.random.uniform(-1, 1, size=(100, 12))  # Random noise

# Quantization computes ranges from this data:
#   min_activation = -1.0, max_activation = +1.0
#   scale_factor = 2.0 / 255 ≈ 0.0078

# Real-world gesture data after deployment:
#   Shake gesture produces activations in [-0.9, 0.8] -> fits the range OK
#   BUT Circle gesture produces [-0.3, 0.3] -> uses only 30% of the quantization range
```
Result: the Circle gesture loses precision (quantization error roughly 3x higher).
The Fix: Use REAL data from the target environment for calibration:
```python
# CORRECT: Real calibration data from the wearable deployment
real_gestures = load_field_data()  # 100 samples from actual users

# Quantization computes ACTUAL ranges:
#   Shake:  min=-0.92, max=0.85
#   Tap:    min=-0.15, max=0.95
#   Tilt:   min= 0.05, max=0.75
#   Circle: min=-0.35, max=0.32
#
# scale_factor is optimized per layer based on real activation distributions
```
Results Comparison:
| Calibration Data | Field Accuracy | Accuracy Loss | Why |
|---|---|---|---|
| Random synthetic | 78% | 14% | Poor range estimation, wasted quantization bits |
| Lab data only | 84% | 8% | Better but misses field variations (user differences, orientations) |
| Field data (diverse users) | 90% | 2% | Accurate range estimation, optimal quantization |
How to Collect Representative Data:
- Deploy pilot devices (10-50 units) to real users for 1-2 weeks
- Collect telemetry: Log raw accelerometer data for all detected gestures
- Download 500-1,000 samples covering:
- All gesture types (shake, tap, tilt, circle)
- Different users (hand sizes, movement styles)
- Different orientations (wrist up, down, sideways)
- Different contexts (walking, sitting, standing)
- Use for quantization calibration: TensorFlow Lite converter accepts calibration dataset
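Once representative samples are collected, deriving the calibration range is straightforward. A sketch of symmetric range estimation follows; the shake/circle arrays are stand-ins mirroring the ranges quoted above, not real telemetry:

```python
import numpy as np

def calibration_scale(activation_samples, num_bits=8):
    """Estimate a symmetric quantization scale from observed activations."""
    max_abs = float(np.max(np.abs(activation_samples)))
    qmax = 2 ** (num_bits - 1) - 1  # 127 for int8
    return max_abs / qmax

# Stand-in per-gesture activation samples (values echo the ranges above)
shake = np.array([-0.92, 0.85, 0.40, -0.70])
circle = np.array([-0.35, 0.32, 0.10, -0.20])

# One global range sized for shake wastes most of circle's resolution;
# per-layer (or per-channel) calibration keeps both gestures precise
print(round(calibration_scale(shake), 5))   # -> 0.00724 (0.92 / 127)
print(round(calibration_scale(circle), 5))  # -> 0.00276 (0.35 / 127)
```

The ~2.6x difference between the two scales is exactly the precision the Circle gesture loses when calibration data never exercises its narrower activation range.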
Cost-Benefit:
Pilot deployment: $500 (10 devices) + 2 weeks time
Accuracy improvement: 78% → 90% (12 percentage points)
Avoided field failure: Prevents product recall or firmware emergency patch
ROI: $500 investment prevents $50K+ in returns/reputation damage
Lesson: Quantization is only as good as its calibration data. Always use representative real-world data from your target deployment environment, not synthetic or lab-only data. The 2-week pilot deployment pays for itself many times over in avoided field failures.
26.16 Summary and Key Takeaways
This lab demonstrated the core building blocks of TinyML for IoT applications:
Neural network inference on microcontrollers is practical: a 3-layer fully-connected network runs in hundreds of microseconds on an ESP32, classifying accelerometer gestures in real time.
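That forward pass is just repeated matrix-vector products with activations in between. A sketch for the lab's 12-16-8-4 network follows (random stand-in weights; a real deployment loads trained weights, and on the ESP32 this runs as C/C++ rather than Python):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def forward(x, params):
    """Fully-connected forward pass: ReLU on hidden layers, raw logits at the output."""
    for i, (W, b) in enumerate(params):
        x = W @ x + b
        if i < len(params) - 1:  # no activation on the output layer
            x = relu(x)
    return x

rng = np.random.default_rng(42)
sizes = [12, 16, 8, 4]  # input features -> hidden -> hidden -> four gestures
params = [(rng.normal(size=(n_out, n_in)), np.zeros(n_out))
          for n_in, n_out in zip(sizes, sizes[1:])]

logits = forward(rng.normal(size=12), params)
print(logits.shape)  # -> (4,)
```

Counting the multiply-accumulates (12x16 + 16x8 + 8x4 = 352) makes the microsecond-scale latency plausible even on a modest microcontroller core.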
Quantization (float32 to int8) provides 4x memory compression and approximately 2x inference speedup with typically less than 2% accuracy loss – this is the single most important optimization for deploying ML on microcontrollers.
Pruning removes redundant weights to further reduce model size. Up to 50-70% of weights can typically be pruned with minimal accuracy impact, but beyond this threshold a sharp “pruning cliff” causes severe degradation.
Softmax output converts raw neural network outputs into calibrated probabilities, enabling confidence-based decision-making (e.g., only act when confidence exceeds 70%).
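The confidence gate can be sketched directly, using the 70% threshold from the text and the lab's four gesture labels (the example logits are illustrative):

```python
import math

GESTURES = ["shake", "tap", "tilt", "circle"]

def softmax(logits):
    """Convert raw logits to probabilities that sum to 1."""
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def classify(logits, threshold=0.7):
    """Return the predicted gesture, or None if confidence is below threshold."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return GESTURES[best] if probs[best] >= threshold else None

print(classify([4.0, 1.0, 0.0, 0.0]))  # confident prediction -> 'shake'
print(classify([1.0, 1.0, 1.0, 1.0]))  # all classes at 25% -> None
```

Returning None on low confidence is what lets the device ignore ambiguous motion instead of acting on a coin-flip classification.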
On-device inference eliminates network latency, reduces bandwidth usage, and keeps sensitive sensor data private – three properties essential for production IoT deployments.
| Technique | Compression | Speed Gain | Accuracy Impact | Difficulty |
|---|---|---|---|---|
| Int8 Quantization (PTQ) | 4x | 1.5-2x | 0.5-2% loss | Easy |
| Quantization-Aware Training | 4x | 1.5-2x | <0.5% loss | Medium |
| Pruning (50%) | 2x | Variable | 1-3% loss | Medium |
| Knowledge Distillation | 5-20x | Proportional | 1-5% loss | Hard |
| Combined Pipeline | 10-100x | 3-10x | 2-5% loss | Hard |
26.17 Knowledge Check
Common Pitfalls
Lab gesture recognition models built and tested on the same person’s gestures achieve 95%+ accuracy in testing but drop to 60-70% in production when other users try them. Always collect training data from multiple users with diverse hand sizes, orientations, and speeds. Minimum viable dataset: 10+ users × 20 gestures × 5 repetitions each.
Converting a float32 model to int8 without a calibration dataset causes symmetric quantization to misplace scale factors, producing garbage outputs. Always validate quantized model accuracy against a held-out test set and confirm it stays within 2% of the float32 baseline before deploying to hardware.
Developing with an Arduino IDE-style upload-and-run workflow masks inference timing. On a Cortex-M4 at 80 MHz, a 30KB model takes 15-50ms per inference — fast enough for 20fps gesture detection but too slow for 100fps industrial inspection. Profile latency with a hardware timer before finalizing model architecture.
Setting confidence thresholds (e.g., “accept if score > 0.8”) based on lab testing creates false-negative storms in production, where device orientation, movement speed, or user differences shift score distributions. Use a calibration dataset from target deployment conditions to set thresholds that balance precision and recall for the actual use case.
26.18 What’s Next
| Topic | Chapter | Description |
|---|---|---|
| Model Optimization | Model Optimization for Edge AI | Dive deeper into quantization math, knowledge distillation, and production optimization pipelines |
| Hardware Platforms | Edge AI/ML Hardware Platforms | Compare hardware accelerators (Coral TPU, Intel Movidius, NVIDIA Jetson) for different deployment scenarios |
| Fog/Edge Production | Fog/Edge Production and Review | Explore orchestration platforms and workload distribution across edge-fog-cloud tiers |
Hands-On Practice:
- Deploy a TinyML model on Arduino or ESP32 using Edge Impulse
- Build an edge AI application with Coral Edge TPU and TensorFlow Lite
- Implement a predictive maintenance system using vibration sensors and anomaly detection
- Compare cloud vs edge inference for a computer vision application (measure latency, bandwidth, cost)
Further Reading:
- TinyML Foundation: tinyml.org
- TensorFlow Lite Micro Documentation: tensorflow.org/lite/microcontrollers
- Edge Impulse University: docs.edgeimpulse.com/docs
- NVIDIA Jetson Projects: developer.nvidia.com/embedded/community/jetson-projects