26  Lab: TinyML Gestures

In 60 Seconds

TinyML enables on-device inference in microseconds, with milliwatt power budgets and no network dependency. Int8 quantization compresses a float32 model by 4x and speeds inference by 1.5-2x, typically with less than 2% accuracy loss. Pruning can safely remove 50-70% of weights, but beyond a threshold the “pruning cliff” causes a sharp accuracy collapse, so always test incrementally.

Lab execution time can be estimated before starting runs:

\[ T_{\text{total}} = N_{\text{runs}} \times (t_{\text{setup}} + t_{\text{run}} + t_{\text{review}}) \]

Worked example: With 5 runs and per-run times of 4 min setup, 6 min execution, and 3 min review, total lab time is \(5\times(4+6+3)=65\) minutes. This prevents under-scoping and helps schedule complete experimental cycles.

26.1 Learning Objectives

By the end of this lab, you will be able to:

  • Explain Neural Network Inference: Describe how data flows through fully-connected layers with activations and trace a forward pass step by step
  • Compare Quantization Effects: Measure memory/speed vs accuracy trade-offs between float32 and int8 and calculate compression ratios
  • Analyze Pruning Impact: Predict when weight removal significantly degrades model performance and identify the pruning cliff threshold
  • Interpret Softmax Output: Convert logits to probabilities and justify confidence-based decision thresholds
  • Calculate Inference Latency: Benchmark microsecond-level inference times on microcontrollers and evaluate real-time feasibility
  • Design Constrained ML Systems: Apply appropriate optimization techniques for target microcontroller hardware given flash, RAM, and power budgets

26.2 Introduction

Running machine learning on microcontrollers – often called TinyML – is one of the fastest-growing areas of IoT. Instead of streaming sensor data to a cloud server for classification, TinyML enables on-device inference in microseconds, with milliwatt power budgets and no network dependency. This lab lets you experience the core building blocks of TinyML hands-on: forward-pass inference, quantization, pruning, and softmax classification.

You will build a simulated gesture recognition system on an ESP32 that classifies accelerometer patterns into four gestures: shake, tap, tilt, and circle. Along the way, you will observe first-hand how model compression techniques (int8 quantization, weight pruning) trade accuracy for memory and speed – the central design tension of edge AI.

Minimum Viable Understanding

If you only remember three things from this lab:

  • On-device inference eliminates network latency (sub-millisecond response), bandwidth costs, and privacy risks by running ML directly on the microcontroller – no cloud connection required.
  • Int8 quantization compresses a float32 model by 4x and speeds inference by 1.5–2x, with typically less than 2% accuracy loss – this single technique makes microcontroller ML practical.
  • Pruning removes redundant weights (up to 50–70% safely), but beyond a threshold the “pruning cliff” causes sharp accuracy collapse – always test incrementally.

Sammy the Sensor says: “Imagine you teach a really small robot to recognize when you shake, tap, or tilt it – and the robot can figure it out all by itself, without asking the internet for help!”

Lila the LED explains: “It is like how you learn to catch a ball. At first your brain has to think really hard. But after practice, you catch it almost instantly. TinyML teaches a tiny computer chip to recognize patterns almost instantly too!”

Max the Microcontroller adds: “The cool part? This tiny brain uses less power than a small LED light. It can run for years on a single battery while making smart decisions about what it senses.”

Bella the Battery chimes in: “And because the tiny brain does the thinking right here on the chip, I do not have to waste energy sending data to the internet. That means I can keep going for months – or even years – on a single charge!”

Real-world example: Your smartwatch uses TinyML to detect when you raise your wrist to check the time. It runs a tiny neural network right on the watch chip – no phone or internet connection needed!


TinyML inference pipeline: accelerometer data flows through two hidden layers with ReLU activation to a softmax output producing gesture probabilities.

26.3 Lab Overview

Explore how TinyML enables machine learning inference on microcontrollers. This hands-on lab demonstrates the core concepts of edge AI without requiring specialized ML hardware or pre-trained models.

26.3.1 What You’ll Build

An ESP32-based TinyML simulator that demonstrates:

  1. Simulated Neural Network: A fully-connected network with configurable layers running inference on sensor patterns
  2. Gesture Recognition: Classify accelerometer patterns into gestures (shake, tap, tilt, circle)
  3. Quantization Comparison: Toggle between float32 and int8 inference to see memory/speed trade-offs
  4. Real-time Visualization: LED indicators and serial output showing classification results and confidence
  5. Model Pruning Demo: Visualize how removing weights affects inference

26.3.2 Hardware Requirements

For Wokwi Simulator (No Physical Hardware Needed):

  • ESP32 DevKit v1
  • OLED Display (SSD1306 128x64) - shows inference results
  • 4x LEDs (Red, Yellow, Green, Blue) - gesture indicators
  • Push button - trigger gesture input / mode switch
  • Potentiometer - adjust simulated sensor noise

For Real Hardware (Optional):

  • ESP32 DevKit v1
  • MPU6050 accelerometer/gyroscope module
  • SSD1306 OLED display (I2C)
  • 4x LEDs with 220 ohm resistors
  • Push button
  • Breadboard + jumper wires

26.3.3 Circuit Diagram

Wiring diagram showing ESP32 DevKit connected to an SSD1306 OLED display via I2C (SDA pin 21, SCL pin 22), four LEDs on GPIO pins 12-15 representing shake (red), tap (yellow), tilt (green), and circle (blue) gestures, a push button on GPIO 4 for mode switching, and a potentiometer on GPIO 34 for noise level adjustment.

Figure 26.1: Circuit connections for TinyML gesture recognition lab

26.3.4 Key Concepts Demonstrated

This lab illustrates several critical TinyML concepts:

Concept | What You’ll See | Real-World Application
Forward Pass | Watch data flow through network layers | Understanding the inference pipeline
Activation Functions | ReLU clipping negative values | Why non-linearity matters
Quantization | Float32 vs int8 memory comparison | Model compression for MCUs
Softmax Output | Probability distribution across classes | Confidence-based decisions
Pruning | Zeroed weights visualized | Model size reduction
Inference Latency | Microsecond timing measurements | Real-time constraints


TinyML workflow: model is trained and optimized on a PC or cloud, then exported as C arrays and flashed onto the ESP32 for on-device inference.

A forward pass is simply feeding data through the neural network from input to output, one layer at a time. Think of it like a series of filters:

  1. Input Layer: Raw sensor readings go in (12 numbers: 4 time samples of X, Y, Z acceleration)
  2. Hidden Layer 1: 16 neurons each compute a weighted sum of all 12 inputs, then apply ReLU (keep positive values, set negatives to zero)
  3. Hidden Layer 2: 8 neurons each compute a weighted sum of the 16 outputs from Layer 1, then apply ReLU again
  4. Output Layer: 4 neurons each compute a weighted sum, then softmax converts these into probabilities that sum to 1.0

The gesture with the highest probability is the prediction. No “learning” happens during inference – the weights are fixed. The network simply transforms input numbers into output probabilities using matrix multiplication and simple functions.

26.4 Wokwi Simulator

How to Use This Lab
  1. Copy the code below into the Wokwi editor (replace default code)
  2. Click Run to start the simulation
  3. Press the button to cycle through demo modes (gesture recognition, quantization comparison, pruning demo)
  4. Adjust the potentiometer to add noise to input patterns
  5. Watch the Serial Monitor for detailed inference statistics

26.5 Lab Code: TinyML Gesture Recognition Demo

/*
 * TinyML Gesture Recognition Simulator
 * =====================================
 *
 * This educational demo simulates a TinyML gesture recognition system
 * running on an ESP32 microcontroller. It demonstrates:
 *
 * 1. Neural network forward pass (fully-connected layers)
 * 2. Activation functions (ReLU, Softmax)
 * 3. Model quantization (float32 vs int8)
 * 4. Weight pruning visualization
 * 5. Real-time inference with confidence scoring
 *
 * Hardware:
 * - ESP32 DevKit v1
 * - SSD1306 OLED Display (I2C: SDA=21, SCL=22)
 * - 4x LEDs (GPIO 12-15) for gesture indicators
 * - Push button (GPIO 4) for mode switching
 * - Potentiometer (GPIO 34) for noise adjustment
 *
 * Author: IoT Educational Platform
 * License: MIT
 */

#include <Wire.h>
#include <math.h>

// ============================================================================
// PIN DEFINITIONS
// ============================================================================

#define PIN_SDA         21      // I2C SDA for OLED
#define PIN_SCL         22      // I2C SCL for OLED
#define PIN_LED_RED     12      // Shake gesture indicator
#define PIN_LED_YELLOW  13      // Tap gesture indicator
#define PIN_LED_GREEN   14      // Tilt gesture indicator
#define PIN_LED_BLUE    15      // Circle gesture indicator
#define PIN_BUTTON      4       // Mode switch button
#define PIN_POT         34      // Noise level potentiometer

// ============================================================================
// NEURAL NETWORK CONFIGURATION
// ============================================================================

// Network architecture: Input(12) -> Hidden1(16) -> Hidden2(8) -> Output(4)
// This simulates a small gesture recognition model
#define INPUT_SIZE      12      // 4 samples x 3 axes (X, Y, Z)
#define HIDDEN1_SIZE    16      // First hidden layer neurons
#define HIDDEN2_SIZE    8       // Second hidden layer neurons
#define OUTPUT_SIZE     4       // 4 gesture classes

// Gesture classes
#define GESTURE_SHAKE   0
#define GESTURE_TAP     1
#define GESTURE_TILT    2
#define GESTURE_CIRCLE  3

const char* gestureNames[] = {"SHAKE", "TAP", "TILT", "CIRCLE"};
const int gestureLEDs[] = {PIN_LED_RED, PIN_LED_YELLOW, PIN_LED_GREEN, PIN_LED_BLUE};

// Model weights (initialized in setup)
float weights_ih1[INPUT_SIZE][HIDDEN1_SIZE];
float weights_h1h2[HIDDEN1_SIZE][HIDDEN2_SIZE];
float weights_h2o[HIDDEN2_SIZE][OUTPUT_SIZE];
float bias_h1[HIDDEN1_SIZE];
float bias_h2[HIDDEN2_SIZE];
float bias_o[OUTPUT_SIZE];

// Quantized versions (int8)
int8_t weights_ih1_q[INPUT_SIZE][HIDDEN1_SIZE];
int8_t weights_h1h2_q[HIDDEN1_SIZE][HIDDEN2_SIZE];
int8_t weights_h2o_q[HIDDEN2_SIZE][OUTPUT_SIZE];

// Quantization scale factors
float scale_ih1 = 0.0f;
float scale_h1h2 = 0.0f;
float scale_h2o = 0.0f;

// Pruning mask
uint8_t pruning_mask_h1[INPUT_SIZE][HIDDEN1_SIZE];
float pruning_ratio = 0.0f;

// Gesture patterns (simulated accelerometer signatures)
float pattern_shake[INPUT_SIZE] = {
    0.8f, 0.2f, 0.1f,  -0.9f, 0.3f, 0.0f,
    0.7f, -0.2f, 0.1f, -0.8f, 0.1f, 0.0f
};

float pattern_tap[INPUT_SIZE] = {
    0.0f, 0.0f, 0.2f, 0.1f, 0.1f, 0.9f,
    0.0f, 0.0f, -0.3f, 0.0f, 0.0f, 0.1f
};

float pattern_tilt[INPUT_SIZE] = {
    0.1f, 0.7f, 0.3f, 0.3f, 0.5f, 0.3f,
    0.5f, 0.3f, 0.3f, 0.7f, 0.1f, 0.3f
};

float pattern_circle[INPUT_SIZE] = {
    0.0f, 0.7f, 0.0f, 0.7f, 0.0f, 0.0f,
    0.0f, -0.7f, 0.0f, -0.7f, 0.0f, 0.0f
};

float* gesturePatterns[] = {pattern_shake, pattern_tap, pattern_tilt, pattern_circle};

// State variables
enum DemoMode {
    MODE_GESTURE_RECOGNITION,
    MODE_QUANTIZATION_COMPARE,
    MODE_PRUNING_DEMO,
    MODE_LAYER_VISUALIZATION,
    MODE_COUNT
};

DemoMode currentMode = MODE_GESTURE_RECOGNITION;
int currentGestureDemo = 0;
unsigned long lastButtonPress = 0;
unsigned long lastInferenceTime = 0;
float noiseLevel = 0.0f;

// Layer activations for visualization
float activations_h1[HIDDEN1_SIZE];
float activations_h2[HIDDEN2_SIZE];
float activations_output[OUTPUT_SIZE];

// ============================================================================
// ACTIVATION FUNCTIONS
// ============================================================================

float relu(float x) {
    return (x > 0) ? x : 0;
}

void softmax(float* input, float* output, int size) {
    float maxVal = input[0];
    for (int i = 1; i < size; i++) {
        if (input[i] > maxVal) maxVal = input[i];
    }

    float sum = 0.0f;
    for (int i = 0; i < size; i++) {
        output[i] = exp(input[i] - maxVal);
        sum += output[i];
    }

    for (int i = 0; i < size; i++) {
        output[i] /= sum;
    }
}

// ============================================================================
// NEURAL NETWORK FORWARD PASS (Float32)
// ============================================================================

void forwardPass_float32(float* input, float* output, bool verbose) {
    // Hidden Layer 1
    for (int j = 0; j < HIDDEN1_SIZE; j++) {
        float sum = bias_h1[j];
        for (int i = 0; i < INPUT_SIZE; i++) {
            if (pruning_mask_h1[i][j]) {
                sum += input[i] * weights_ih1[i][j];
            }
        }
        activations_h1[j] = relu(sum);
    }

    // Hidden Layer 2
    for (int j = 0; j < HIDDEN2_SIZE; j++) {
        float sum = bias_h2[j];
        for (int i = 0; i < HIDDEN1_SIZE; i++) {
            sum += activations_h1[i] * weights_h1h2[i][j];
        }
        activations_h2[j] = relu(sum);
    }

    // Output Layer
    float logits[OUTPUT_SIZE];
    for (int j = 0; j < OUTPUT_SIZE; j++) {
        float sum = bias_o[j];
        for (int i = 0; i < HIDDEN2_SIZE; i++) {
            sum += activations_h2[i] * weights_h2o[i][j];
        }
        logits[j] = sum;
    }

    // Softmax
    softmax(logits, output, OUTPUT_SIZE);

    for (int i = 0; i < OUTPUT_SIZE; i++) {
        activations_output[i] = output[i];
    }

    if (verbose) {
        Serial.print("[FWD] Output probs: ");
        for (int i = 0; i < OUTPUT_SIZE; i++) {
            Serial.print(gestureNames[i]);
            Serial.print("=");
            Serial.print(output[i] * 100, 1);
            Serial.print("% ");
        }
        Serial.println();
    }
}

// ============================================================================
// WEIGHT INITIALIZATION
// ============================================================================

void initializeWeights() {
    Serial.println("\n[INIT] Initializing neural network weights...");
    randomSeed(42);

    for (int i = 0; i < INPUT_SIZE; i++) {
        for (int j = 0; j < HIDDEN1_SIZE; j++) {
            weights_ih1[i][j] = (random(-100, 100) / 100.0f) * 0.5f;
            int gestureIdx = j / 4;  // each group of 4 H1 neurons favors one gesture
            if (gestureIdx < OUTPUT_SIZE) {
                weights_ih1[i][j] += gesturePatterns[gestureIdx][i] * 0.3f;
            }
            pruning_mask_h1[i][j] = 1;
        }
    }

    for (int i = 0; i < HIDDEN1_SIZE; i++) {
        for (int j = 0; j < HIDDEN2_SIZE; j++) {
            weights_h1h2[i][j] = (random(-100, 100) / 100.0f) * 0.5f;
            int h1_gesture = i / 4;
            int h2_gesture = j / 2;
            if (h1_gesture == h2_gesture) {
                weights_h1h2[i][j] += 0.3f;
            }
        }
    }

    for (int i = 0; i < HIDDEN2_SIZE; i++) {
        for (int j = 0; j < OUTPUT_SIZE; j++) {
            weights_h2o[i][j] = (random(-100, 100) / 100.0f) * 0.3f;
            int h2_gesture = i / 2;
            if (h2_gesture == j) {
                weights_h2o[i][j] += 0.5f;
            }
        }
    }

    for (int i = 0; i < HIDDEN1_SIZE; i++) bias_h1[i] = (random(-50, 50) / 100.0f) * 0.1f;
    for (int i = 0; i < HIDDEN2_SIZE; i++) bias_h2[i] = (random(-50, 50) / 100.0f) * 0.1f;
    for (int i = 0; i < OUTPUT_SIZE; i++) bias_o[i] = 0.0f;

    Serial.println("[INIT] Float32 weights initialized");
}

// ============================================================================
// QUANTIZATION
// ============================================================================

void quantizeWeights() {
    Serial.println("\n[QUANT] Quantizing weights to INT8...");
    float maxVal;

    // Layer 1: input -> hidden1
    maxVal = 0.0f;
    for (int i = 0; i < INPUT_SIZE; i++)
        for (int j = 0; j < HIDDEN1_SIZE; j++)
            if (fabs(weights_ih1[i][j]) > maxVal) maxVal = fabs(weights_ih1[i][j]);
    scale_ih1 = maxVal / 127.0f;
    for (int i = 0; i < INPUT_SIZE; i++)
        for (int j = 0; j < HIDDEN1_SIZE; j++)
            weights_ih1_q[i][j] = (int8_t)(weights_ih1[i][j] / scale_ih1);

    // Layer 2: hidden1 -> hidden2
    maxVal = 0.0f;
    for (int i = 0; i < HIDDEN1_SIZE; i++)
        for (int j = 0; j < HIDDEN2_SIZE; j++)
            if (fabs(weights_h1h2[i][j]) > maxVal) maxVal = fabs(weights_h1h2[i][j]);
    scale_h1h2 = maxVal / 127.0f;
    for (int i = 0; i < HIDDEN1_SIZE; i++)
        for (int j = 0; j < HIDDEN2_SIZE; j++)
            weights_h1h2_q[i][j] = (int8_t)(weights_h1h2[i][j] / scale_h1h2);

    // Layer 3: hidden2 -> output
    maxVal = 0.0f;
    for (int i = 0; i < HIDDEN2_SIZE; i++)
        for (int j = 0; j < OUTPUT_SIZE; j++)
            if (fabs(weights_h2o[i][j]) > maxVal) maxVal = fabs(weights_h2o[i][j]);
    scale_h2o = maxVal / 127.0f;
    for (int i = 0; i < HIDDEN2_SIZE; i++)
        for (int j = 0; j < OUTPUT_SIZE; j++)
            weights_h2o_q[i][j] = (int8_t)(weights_h2o[i][j] / scale_h2o);

    // Biases stay float32 here; the weight matrices dominate the footprint.
    int totalParams = (INPUT_SIZE * HIDDEN1_SIZE) + HIDDEN1_SIZE +
                      (HIDDEN1_SIZE * HIDDEN2_SIZE) + HIDDEN2_SIZE +
                      (HIDDEN2_SIZE * OUTPUT_SIZE) + OUTPUT_SIZE;

    Serial.print("[QUANT] INT8 model size: ");
    Serial.print(totalParams);
    Serial.println(" bytes");
    Serial.println("[QUANT] Compression ratio: 4x");
}

// ============================================================================
// LED CONTROL
// ============================================================================

void setGestureLED(int gesture, bool state) {
    digitalWrite(gestureLEDs[gesture], state ? HIGH : LOW);
}

void clearAllLEDs() {
    for (int i = 0; i < OUTPUT_SIZE; i++) {
        digitalWrite(gestureLEDs[i], LOW);
    }
}

void showClassificationResult(int predictedClass, float confidence) {
    clearAllLEDs();
    setGestureLED(predictedClass, true);

    if (confidence < 0.7f) {
        delay(100);
        setGestureLED(predictedClass, false);
        delay(100);
        setGestureLED(predictedClass, true);
    }
}

// ============================================================================
// DEMO MODES
// ============================================================================

void generateGestureInput(int gestureType, float* output, float noise) {
    float* pattern = gesturePatterns[gestureType];
    for (int i = 0; i < INPUT_SIZE; i++) {
        float noiseVal = (random(-100, 100) / 100.0f) * noise;
        output[i] = constrain(pattern[i] + noiseVal, -1.0f, 1.0f);
    }
}

void runGestureRecognitionDemo() {
    Serial.println("\n========== GESTURE RECOGNITION MODE ==========");
    currentGestureDemo = (currentGestureDemo + 1) % OUTPUT_SIZE;

    Serial.print("\n[DEMO] Testing gesture: ");
    Serial.println(gestureNames[currentGestureDemo]);

    float input[INPUT_SIZE];
    generateGestureInput(currentGestureDemo, input, noiseLevel);

    Serial.print("[INPUT] Noise level: ");
    Serial.print(noiseLevel * 100, 0);
    Serial.println("%");

    float output[OUTPUT_SIZE];
    unsigned long startTime = micros();
    forwardPass_float32(input, output, true);
    unsigned long inferenceTime = micros() - startTime;

    int predictedClass = 0;
    float maxProb = output[0];
    for (int i = 1; i < OUTPUT_SIZE; i++) {
        if (output[i] > maxProb) {
            maxProb = output[i];
            predictedClass = i;
        }
    }

    Serial.println("\n[RESULT] --------------------------------");
    Serial.print("[RESULT] Predicted: ");
    Serial.print(gestureNames[predictedClass]);
    Serial.print(" (");
    Serial.print(maxProb * 100, 1);
    Serial.println("% confidence)");
    Serial.print("[RESULT] Correct: ");
    Serial.println(predictedClass == currentGestureDemo ? "YES" : "NO");
    Serial.print("[RESULT] Inference time: ");
    Serial.print(inferenceTime);
    Serial.println(" microseconds");

    showClassificationResult(predictedClass, maxProb);
}

// ============================================================================
// BUTTON HANDLER
// ============================================================================

void handleButton() {
    static bool lastButtonState = HIGH;
    bool buttonState = digitalRead(PIN_BUTTON);

    if (buttonState == LOW && lastButtonState == HIGH) {
        if (millis() - lastButtonPress > 300) {
            lastButtonPress = millis();
            currentMode = (DemoMode)((currentMode + 1) % MODE_COUNT);

            Serial.println("\n\n========================================");
            Serial.print("MODE CHANGED: ");
            switch (currentMode) {
                case MODE_GESTURE_RECOGNITION:
                    Serial.println("GESTURE RECOGNITION");
                    break;
                case MODE_QUANTIZATION_COMPARE:
                    Serial.println("QUANTIZATION COMPARISON");
                    break;
                case MODE_PRUNING_DEMO:
                    Serial.println("PRUNING VISUALIZATION");
                    break;
                case MODE_LAYER_VISUALIZATION:
                    Serial.println("LAYER ACTIVATION VIEW");
                    break;
                default:
                    break;
            }
            Serial.println("========================================\n");
            clearAllLEDs();
        }
    }
    lastButtonState = buttonState;
}

// ============================================================================
// SETUP
// ============================================================================

void setup() {
    Serial.begin(115200);
    delay(1000);

    Serial.println("\n\n");
    Serial.println("========================================");
    Serial.println("   TinyML Gesture Recognition Lab");
    Serial.println("========================================");

    pinMode(PIN_LED_RED, OUTPUT);
    pinMode(PIN_LED_YELLOW, OUTPUT);
    pinMode(PIN_LED_GREEN, OUTPUT);
    pinMode(PIN_LED_BLUE, OUTPUT);
    pinMode(PIN_BUTTON, INPUT_PULLUP);
    pinMode(PIN_POT, INPUT);

    Serial.println("[BOOT] Testing LEDs...");
    for (int i = 0; i < OUTPUT_SIZE; i++) {
        setGestureLED(i, true);
        delay(200);
        setGestureLED(i, false);
    }

    initializeWeights();
    quantizeWeights();

    Serial.println("\n========================================");
    Serial.println("INSTRUCTIONS:");
    Serial.println("1. Press BUTTON to cycle through modes");
    Serial.println("2. Turn POTENTIOMETER to adjust noise");
    Serial.println("3. Watch LEDs for classification results:");
    Serial.println("   RED=Shake, YELLOW=Tap, GREEN=Tilt, BLUE=Circle");
    Serial.println("========================================\n");

    Serial.println("[READY] Starting demo loop...\n");
}

// ============================================================================
// MAIN LOOP
// ============================================================================

void loop() {
    handleButton();

    int potValue = analogRead(PIN_POT);
    noiseLevel = potValue / 4095.0f;

    if (millis() - lastInferenceTime > 3000) {
        lastInferenceTime = millis();

        switch (currentMode) {
            case MODE_GESTURE_RECOGNITION:
                runGestureRecognitionDemo();
                break;
            default:
                // The remaining modes fall back to the gesture demo in this
                // condensed listing; extend this switch in the challenges.
                runGestureRecognitionDemo();
                break;
        }
    }

    delay(10);
}

26.6 Understanding the Code

The lab code demonstrates several key TinyML patterns. Let us walk through the most important sections before moving on to the challenges.

26.6.1 Network Architecture

The neural network uses a fully-connected (dense) architecture with three layers:

Input(12) --> Hidden1(16, ReLU) --> Hidden2(8, ReLU) --> Output(4, Softmax)
  • 12 inputs: 4 time-window samples, each with X, Y, Z accelerometer axes
  • 16 hidden neurons in Layer 1: extract low-level motion features
  • 8 hidden neurons in Layer 2: combine features into gesture-level patterns
  • 4 outputs: probability scores for shake, tap, tilt, circle

26.6.2 Memory Footprint Analysis

Understanding memory is critical for microcontroller deployment:

Component | Float32 Size | Int8 Size | Savings
Weights (Input to H1) | 12 x 16 x 4 = 768 bytes | 12 x 16 x 1 = 192 bytes | 4x
Weights (H1 to H2) | 16 x 8 x 4 = 512 bytes | 16 x 8 x 1 = 128 bytes | 4x
Weights (H2 to Output) | 8 x 4 x 4 = 128 bytes | 8 x 4 x 1 = 32 bytes | 4x
Biases | 28 x 4 = 112 bytes | 28 x 1 = 28 bytes | 4x
Total | 1,520 bytes | 380 bytes | 4x

For production models with millions of parameters, this 4x reduction is the difference between fitting on a microcontroller or not.

26.6.3 Quantization in Detail

The quantizeWeights() function implements symmetric (max-absolute-value) quantization:

  1. Find the maximum absolute value in each weight matrix
  2. Compute a scale factor: scale = max_abs / 127
  3. Map each float32 weight to an int8 value: int8_val = round(float_val / scale) (the lab code truncates instead of rounding, which is simpler but slightly noisier)
  4. During inference, dequantize: float_val = int8_val * scale

This introduces quantization error (the difference between the original float32 weight and its dequantized value), but for most IoT models this error is negligible.

Imagine you have a very precise kitchen scale that shows weight to three decimal places: 1.234 kg. Now imagine you only have a simple scale that shows whole numbers: 1 kg. You lose a tiny bit of detail, but the reading is still useful – and the simple scale is cheaper, smaller, and faster to read.

Quantization works the same way for neural networks:

  • Float32 (the precise scale): Each weight is stored as a 32-bit decimal number. Very accurate, but uses 4 bytes of memory per weight.
  • Int8 (the simple scale): Each weight is rounded to a whole number between -128 and +127. Less precise, but uses only 1 byte per weight.

Since a microcontroller like the ESP32 has limited memory (around 520 KB of RAM), fitting a model that uses 4 bytes per weight is much harder than one that uses 1 byte per weight. Quantization is the single most important trick for making ML fit on tiny chips.

The key insight: Neural networks are surprisingly tolerant of this rounding. A well-quantized model typically loses less than 2% accuracy – a small price for a 4x reduction in memory.


Quantization process: float32 weights are scaled and mapped to int8 values, achieving 4x memory compression.

26.6.4 Pruning and Sparsity

Weight pruning zeroes out small weights that contribute little to accuracy. The pruning_mask_h1 array stores which weights are active (1) or pruned (0). During inference, pruned weights are skipped:

if (pruning_mask_h1[i][j]) {
    sum += input[i] * weights_ih1[i][j];
}

At low sparsity levels (30–50%), accuracy is barely affected. Beyond 70–80% sparsity, accuracy degrades sharply – this is the pruning cliff that you will observe in the lab.

26.7 Knowledge Check

26.8 Question 1: Quantization Memory Savings

A neural network layer has 1,024 weights stored as float32. After int8 quantization, how much memory does this layer use?

  1. 512 bytes
  2. 1,024 bytes
  3. 2,048 bytes
  4. 4,096 bytes

Answer: option 2 (1,024 bytes). Each float32 weight uses 4 bytes (1,024 x 4 = 4,096 bytes total). Int8 quantization reduces each weight to 1 byte, so 1,024 weights x 1 byte = 1,024 bytes, a 4x reduction from the original 4,096 bytes.

26.9 Question 2: ReLU Activation

What does the ReLU activation function do to a neuron output of -0.35?

  1. Returns 0.35 (absolute value)
  2. Returns -0.35 (passes through unchanged)
  3. Returns 0 (clips negative values)
  4. Returns 0.65 (maps to 1 minus absolute value)

Answer: option 3 (returns 0, clipping negative values). ReLU (Rectified Linear Unit) is defined as max(0, x). For any negative input, it returns 0. For positive inputs, it returns the value unchanged. This simple non-linearity is critical: without it, stacking multiple layers would be equivalent to a single linear transformation.

26.10 Question 3: Softmax Interpretation

After the softmax layer, the output is [0.85, 0.08, 0.05, 0.02]. What does this mean?

  1. The model is 85% complete with training
  2. The model predicts the first gesture class with 85% confidence
  3. 85% of the weights point to the first class
  4. The first class has 85% of the neurons activated

Answer: option 2 (the model predicts the first gesture class with 85% confidence). Softmax converts raw logits (unbounded numbers) into a probability distribution that sums to 1.0. The value 0.85 means the model assigns 85% probability to the first class (shake). In production systems, you would typically require confidence above a threshold (e.g., 70%) before acting on a prediction.

26.11 Question 4: Pruning Trade-offs

A TinyML model achieves 94% accuracy at 0% pruning and 91% accuracy at 50% pruning. At 80% pruning, accuracy drops to 72%. What phenomenon explains this?

  1. The model is overfitting to the pruned weights
  2. The pruning cliff – beyond a threshold, removing weights destroys critical information
  3. The quantization error compounds with pruning
  4. The softmax function cannot normalize sparse outputs

Answer: option 2 (the pruning cliff). Neural networks have significant redundancy, so moderate pruning (30-50%) barely affects accuracy. But past a critical threshold, remaining weights cannot compensate for removed ones, causing a steep accuracy drop. The exact cliff point depends on model architecture and data complexity. For this lab’s small model, the cliff appears around 70-80% sparsity.

26.12 Question 5: Edge vs Cloud Inference

Why would you choose on-device TinyML inference over sending data to a cloud ML service?

  1. TinyML models are always more accurate than cloud models
  2. On-device inference eliminates network latency, reduces bandwidth, and preserves data privacy
  3. Cloud services cannot run neural networks
  4. Microcontrollers have more compute power than cloud GPUs

Answer: option 2 (on-device inference eliminates network latency, reduces bandwidth, and preserves data privacy). Cloud ML services generally offer larger, more accurate models. However, TinyML wins when you need sub-millisecond latency (real-time control), cannot guarantee network connectivity (remote sensors), must minimize data transmission (bandwidth/power), or must keep sensitive data on-device (privacy regulations). The trade-off is smaller model capacity and lower accuracy.

26.12.1 Choosing an Optimization Strategy

When deploying a trained model to a microcontroller, selecting the right optimization pipeline depends on your constraints. The following decision flow captures the practical reasoning that TinyML engineers apply:


Decision flow for TinyML model optimization: start with post-training quantization, add pruning if the model still exceeds flash memory, and resort to knowledge distillation or retraining only when simpler techniques are insufficient.

26.13 Challenge Exercises

Challenge 1: Add a New Gesture Class

Difficulty: Medium | Time: 20 minutes

Extend the model to recognize a fifth gesture: WAVE (back-and-forth motion in the X-axis with decreasing amplitude).

  1. Define a new pattern array pattern_wave[INPUT_SIZE] with decaying X-axis oscillation
  2. Add “WAVE” to the gestureNames array
  3. Connect a fifth LED (e.g., GPIO 16) for the wave indicator
  4. Update OUTPUT_SIZE to 5 and reinitialize weights

Success Criteria: The model correctly classifies the wave pattern with >70% confidence.

Challenge 2: Implement Early Exit

Difficulty: Medium | Time: 25 minutes

Add an “early exit” feature where inference stops at Hidden Layer 1 if confidence exceeds 90%, saving computation.

  1. Add a simple classifier after H1 (just 4 output neurons connected to first 16)
  2. Check confidence after H1 forward pass
  3. If max probability > 0.9, return early without computing H2 and output layers
  4. Track and display “early exit rate” (percentage of inferences that exit early)

Success Criteria: At least 30% of clean (low-noise) inputs should trigger early exit.
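The control flow can be sketched as follows. This is a minimal illustration, not the lab's firmware: `h1_layer`, `aux_head`, and `tail_layers` are hypothetical stand-ins for the real layer functions:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def forward_with_early_exit(x, h1_layer, aux_head, tail_layers, threshold=0.9):
    """Compute H1; if the small auxiliary 4-class head is already confident,
    skip H2 and the output layer entirely."""
    h1 = h1_layer(x)
    aux_probs = softmax(aux_head(h1))
    if max(aux_probs) > threshold:
        return aux_probs, True               # early exit taken
    return softmax(tail_layers(h1)), False   # full forward pass

# Toy stand-ins for the lab's layers (hypothetical):
h1_layer = lambda x: x                       # pretend H1 is identity
confident_head = lambda h: [6.0, 0.0, 0.0, 0.0]
tail = lambda h: [0.0, 0.0, 0.0, 1.0]

probs, exited = forward_with_early_exit([0.1] * 16, h1_layer, confident_head, tail)
```

Tracking the fraction of calls that return `True` gives the early exit rate.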

Challenge 3: Adaptive Quantization

Difficulty: Hard | Time: 30 minutes

Implement per-layer quantization with different bit widths:

  1. Keep the first layer at int8 (8-bit)
  2. Reduce the second layer to int4 (4-bit): modify the quantization to use only 16 levels (integer values -8 to 7)
  3. Compare accuracy vs memory savings

Success Criteria: Document the accuracy-memory trade-off. Can you achieve <5% accuracy loss with 50% additional memory savings?
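A generic symmetric quantizer (a sketch, not the lab's exact `quantizeWeights()`) makes the int4-vs-int8 precision gap concrete — the same weights land on far coarser levels at 4 bits:

```python
def quantize(weights, bits):
    """Symmetric linear quantization to signed `bits`-bit integers.
    int4 uses levels -8..7 (16 levels); int8 uses -128..127 (256 levels)."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    dequant = [v * scale for v in q]   # what the model actually computes with
    return q, dequant, scale

w = [0.8, -0.5, 0.05, -0.02]
q4, d4, s4 = quantize(w, 4)   # only 16 levels: small weights collapse to 0
q8, d8, s8 = quantize(w, 8)   # 256 levels: small weights survive
```

Note how at 4 bits the two small weights quantize to zero — exactly the precision loss the challenge asks you to measure.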

Challenge 4: Online Learning Simulation

Difficulty: Hard | Time: 40 minutes

Add a “calibration mode” that adjusts weights based on user feedback:

  1. Add a second button for “correct/incorrect” feedback
  2. When user indicates incorrect classification, slightly adjust output layer weights toward correct class
  3. Implement a simple learning rate (e.g., 0.01)
  4. Track improvement over 20 calibration iterations

Success Criteria: Demonstrate 5%+ accuracy improvement after calibration.
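Step 2's weight adjustment can be sketched as a perceptron-style update (a minimal illustration; `W_out` rows are output-class weight vectors and `h2` the hidden activations, both hypothetical names):

```python
def calibrate_output_weights(W_out, h2, predicted, correct, lr=0.01):
    """On negative feedback, nudge the correct class's weights toward the
    hidden activations and the wrong prediction's weights away from them."""
    if predicted == correct:
        return W_out  # positive feedback: no change
    for j, a in enumerate(h2):
        W_out[correct][j] += lr * a
        W_out[predicted][j] -= lr * a
    return W_out

# One feedback event: model said class 0, user says class 2
W = [[0.0, 0.0] for _ in range(4)]
W = calibrate_output_weights(W, [1.0, 0.5], predicted=0, correct=2)
```

Repeating this over ~20 feedback events slowly rotates the output layer toward the user's gestures without full backpropagation.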

26.14 Expected Outcomes

After completing this lab, you should be able to:

| Skill | Demonstration |
|---|---|
| Understand Forward Pass | Explain how data flows through fully-connected layers with activations |
| Compare Quantization | Articulate the memory/speed vs accuracy trade-off of int8 quantization |
| Analyze Pruning Effects | Predict when pruning will significantly degrade model performance |
| Interpret Softmax Output | Convert logits to probabilities and explain confidence scoring |
| Estimate Inference Time | Measure and compare microsecond-level inference latencies |
| Design for Constraints | Choose appropriate optimization techniques for target hardware |

26.14.1 Quantitative Observations

Record these measurements when running the lab:

| Metric | Float32 | Int8 (Quantized) | Improvement |
|---|---|---|---|
| Inference time | ~200-400 µs | ~100-200 µs | 1.5-2x speedup |
| Model memory | 1,520 bytes | 380 bytes | 4x compression |
| Accuracy (clean input) | Baseline | ~1-2% lower | Minimal loss |
| Accuracy (noisy input) | Varies with noise | Varies with noise | Comparable |

Pruning observations to record:

  • 0% pruning: baseline accuracy
  • 30% pruning: minimal accuracy loss (<1%)
  • 50% pruning: small accuracy loss (1-3%)
  • 70% pruning: noticeable loss (3-8%)
  • 90% pruning: severe degradation (>15% loss) – the pruning cliff
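The mechanism behind these observations is magnitude pruning: the smallest-magnitude weights carry the least signal, so zeroing them is initially harmless, until the sparsity level starts removing weights that matter. A minimal sketch:

```python
def prune_by_magnitude(weights, sparsity):
    """Zero out the smallest-|w| fraction of weights (magnitude pruning)."""
    n_prune = int(len(weights) * sparsity)
    smallest = sorted(range(len(weights)), key=lambda i: abs(weights[i]))[:n_prune]
    pruned = list(weights)
    for i in smallest:
        pruned[i] = 0.0
    return pruned

w = [0.9, -0.05, 0.4, 0.01, -0.7, 0.2]
pruned = prune_by_magnitude(w, 0.5)  # the three smallest magnitudes are zeroed
```

At 50% sparsity only the near-zero weights go; pushing toward 90% forces large, load-bearing weights to zero — the cliff.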

Optimization trade-off map showing that quantization-aware training and light pruning offer the best compression-to-accuracy ratios, while heavy pruning risks significant accuracy loss.

Connecting to Real TinyML

The concepts in this simulation directly apply to production TinyML development:

| Simulation Concept | Real-World Equivalent |
|---|---|
| Hand-crafted weights | TensorFlow/PyTorch training, Edge Impulse |
| forwardPass_float32() | TensorFlow Lite Micro interpreter |
| applyPruning() | TF Model Optimization Toolkit |
| quantizeWeights() | Post-training quantization, QAT |
| Gesture patterns | Real accelerometer data from MPU6050/LSM6DS3 |

Next Steps for Real Hardware:

  1. Export a trained model from Edge Impulse as a C++ library
  2. Replace simulated input with real MPU6050 accelerometer readings
  3. Use the official TensorFlow Lite Micro inference engine
  4. Deploy to production with over-the-air model updates

26.15 Common Mistakes and Pitfalls

Pitfall 1: Quantizing Without Calibration Data

Applying post-training quantization without running representative data through the model produces poor scale factors. Always use a calibration dataset that covers the expected input distribution.

Symptom: Quantized model accuracy drops by 10%+ instead of the expected 1-2%.

Fix: Run 100-500 representative samples through the float32 model to determine the activation ranges, then use those ranges for quantization.
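The core of this fix — deriving activation ranges from representative samples — can be sketched in a few lines (a simplified illustration with toy numbers; real toolchains do this per layer):

```python
def calibrate_scale(activation_batches, bits=8):
    """Derive an asymmetric quantization scale from the min/max activation
    observed across representative calibration samples."""
    lo = min(min(batch) for batch in activation_batches)
    hi = max(max(batch) for batch in activation_batches)
    scale = (hi - lo) / (2 ** bits - 1)   # spread the range over 255 int8 steps
    return lo, hi, scale

# activations recorded while running 3 representative samples (toy numbers)
samples = [[-0.9, 0.2, 0.8], [-0.4, 0.1, 0.6], [-0.7, 0.0, 0.5]]
lo, hi, scale = calibrate_scale(samples)
```

If the calibration set is unrepresentative, `lo`/`hi` are wrong and every int8 step is wasted on values that never occur.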

Pitfall 2: Ignoring Inference Memory (Not Just Model Size)

Model size (weights) is only part of memory usage. During inference, intermediate activations consume additional RAM. For this lab’s model:

  • Weights: 380 bytes (int8)
  • Layer 1 activations: 16 × 4 = 64 bytes (float32)
  • Layer 2 activations: 8 × 4 = 32 bytes (float32)
  • Output activations: 4 × 4 = 16 bytes (float32)
  • Total runtime RAM: 380 + 112 = 492 bytes (not just the 380 bytes of weights)

Production models with larger layers can easily exceed microcontroller RAM limits even when weights fit in flash.
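The tally above generalizes to a one-line estimator (a hypothetical helper; set `act_bytes=1` if activations are quantized to int8 as well):

```python
def runtime_ram_bytes(weight_bytes, layer_sizes, act_bytes=4):
    """Inference RAM ~= weights + intermediate activations.
    act_bytes=4 for float32 activations, 1 for int8 activations."""
    return weight_bytes + sum(n * act_bytes for n in layer_sizes)

# Lab model: 380 B of int8 weights, float32 activations of sizes 16, 8, 4
total = runtime_ram_bytes(380, [16, 8, 4])
print(total)  # 492
```

Running the same estimate before choosing layer sizes catches RAM overruns before any firmware is written.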

Pitfall 3: Testing Only Clean Inputs

A model that works perfectly on clean gesture patterns may fail on real-world noisy data. Always test with the potentiometer at various noise levels (25%, 50%, 75%) to evaluate robustness. If accuracy drops sharply with moderate noise, the model needs more training data diversity or data augmentation.

Scenario: Deploy gesture recognition on a wearable device powered by a 2000 mAh 3.7V Li-ion battery. Target: 1 year battery life (365 days).

Power Budget Calculation:

Total Energy Available:

Battery: 2000 mAh × 3.7V = 7,400 mWh = 7.4 Wh

Daily Energy Allowance:

7,400 mWh / 365 days = 20.27 mWh/day
Average power: 20.27 mWh / 24 hours = 0.844 mW continuous

ESP32 Power Consumption (measured):

| State | Current @ 3.7 V | Power | Duration/Day | Energy/Day |
|---|---|---|---|---|
| Deep Sleep | 10 μA | 37 μW | 23.9 hours | 0.885 mWh |
| Wake + Sample | 80 mA | 296 mW | 100 samples × 50 ms = 5 s | 0.411 mWh |
| Inference | 120 mA | 444 mW | 100 inferences × 0.3 ms = 30 ms | 0.004 mWh |
| BLE Transmit (if used) | 150 mA | 555 mW | 10 transmissions × 2 s = 20 s | 3.083 mWh |
| Total | | | | 4.383 mWh/day |

Battery Life (without BLE):

7,400 mWh / (0.885 + 0.411 + 0.004) = 7,400 / 1.3 = 5,692 days = 15.6 years

Battery Life (with BLE transmit 10x/day):

7,400 mWh / 4.383 = 1,688 days = 4.6 years

Problem: BLE transmission dominates power budget (70% of daily energy)!
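The daily budget can be recomputed in a few lines; the mW·s → mWh conversion (divide by 3,600, not by seconds per day) is the step that is easiest to get wrong. Powers and durations come from the measured table above; nothing else is assumed:

```python
# (power_mW, active_seconds_per_day) from the measured ESP32 table
states = {
    "deep_sleep":   (0.037, 23.9 * 3600),  # 10 uA x 3.7 V
    "wake_sample":  (296.0, 5.0),          # 100 samples x 50 ms
    "inference":    (444.0, 0.030),        # 100 inferences x 0.3 ms
    "ble_transmit": (555.0, 20.0),         # 10 transmissions x 2 s
}

def daily_energy_mwh(power_mw, seconds):
    return power_mw * seconds / 3600.0     # mW.s -> mWh

total_mwh = sum(daily_energy_mwh(p, s) for p, s in states.values())
battery_days = 7400.0 / total_mwh          # 2000 mAh x 3.7 V battery
print(round(total_mwh, 2), int(battery_days))  # 4.38 1688
```

Swapping in different transmit schedules (or zeroing the BLE row) reproduces each scenario in the comparison below.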

Optimization Strategy:

Option 1: Reduce BLE Transmissions

  • Only transmit when a gesture is detected (no periodic keep-alive), sending a short event packet — assume ~0.2 s of radio time per event instead of the 2 s summary transfer
  • Assume 20 gestures/day
  • BLE energy: 20 × 0.2 s × 555 mW = 2,220 mWs = 0.62 mWh/day
  • New battery life: 7,400 / (1.3 + 0.62) = 3,854 days ≈ 10.6 years

Option 2: Switch to BLE 5.0 Long Range Mode

  • Lower data rate but 4x longer range, roughly 50% less transmit energy
  • BLE energy: 0.62 / 2 = 0.31 mWh/day
  • New battery life: 7,400 / (1.3 + 0.31) = 4,596 days ≈ 12.6 years

Option 3: Local Inference + Edge Aggregation

  • Run inference on-device (no transmission per gesture)
  • Transmit a summary once per day ("50 shakes, 30 taps, 10 tilts, 5 circles") as a single 2 s transfer
  • BLE energy: 1 × 2 s × 555 mW = 1,110 mWs = 0.31 mWh/day
  • New battery life: 7,400 / (1.3 + 0.31) = 4,596 days ≈ 12.6 years
  • Combining this with Option 2's long-range radio halves the BLE cost again (~0.15 mWh/day, ≈ 13.9 years)

Key Insight: On-device TinyML inference (0.004 mWh/day) consumes roughly 770x LESS energy than the baseline BLE transmission budget (3.083 mWh/day). Local processing dramatically extends battery life by avoiding wireless communication.

Comparison Table:

| Approach | BLE Transmits/Day | Daily Energy | Battery Life |
|---|---|---|---|
| Cloud ML (stream all data) | 100 samples | 30.83 mWh | 240 days (8 months) |
| Edge ML + frequent BLE | 10 summaries | 4.38 mWh | 1,688 days (4.6 years) |
| Edge ML + event-driven BLE | 20 short events | 1.92 mWh | 3,854 days (10.6 years) |
| Edge ML + daily summary | 1 summary | 1.61 mWh | 4,596 days (12.6 years) |

Once the power budget is settled, the remaining hardware constraints map to optimization techniques:

| Model Constraint | Primary Issue | Recommended Technique | Expected Result |
|---|---|---|---|
| Flash storage (512 KB limit) | Model >512 KB | Int8 quantization | 4x compression, fits in flash |
| Flash storage (256 KB limit) | Quantized model still >256 KB | Quantization + 50% pruning | 8x compression total |
| RAM (96 KB limit) | Activation memory >96 KB | Reduce hidden layer size OR use int8 activations | 2-4x RAM reduction |
| Inference latency | >100 ms (too slow for real-time) | Reduce model depth OR use a MobileNet-style architecture | 3-10x speedup |
| Accuracy | <90% from over-quantization or over-pruning | Quantization-aware training OR reduce pruning | +2-5% accuracy |
| Power budget | >50 mW average (battery dies in weeks) | Reduce inference frequency OR optimize wake time | 10-100x power reduction |

Optimization Pipeline (ordered by effort):

Step 1: Post-Training Quantization (easiest, always do this first)

  • Effort: Low (1-2 days with TensorFlow Lite)
  • Result: 4x model size reduction, 1.5-2x speedup
  • Accuracy loss: 0.5-2%
  • When to use: Always (standard practice for TinyML)

Step 2: Pruning (if Step 1 is insufficient)

  • Effort: Medium (1-2 weeks to find the optimal sparsity)
  • Result: 2-4x additional size reduction at 50-70% sparsity
  • Accuracy loss: 1-3% at 50% pruning, 5-10% at 70%
  • When to use: Model still doesn’t fit in flash after quantization

Step 3: Quantization-Aware Training (if accuracy degraded)

  • Effort: Medium-High (2-4 weeks, requires retraining)
  • Result: Same 4x compression as Step 1, but <0.5% accuracy loss
  • When to use: Post-training quantization lost >2% accuracy

Step 4: Knowledge Distillation (if the model is still too large)

  • Effort: High (4-8 weeks, requires training a student model)
  • Result: 5-20x compression (train a tiny student from a large teacher)
  • Accuracy loss: 2-5%
  • When to use: Steps 1-3 combined are still insufficient

Step 5: Architecture Search (last resort)

  • Effort: Very High (months, requires ML expertise)
  • Result: Custom architecture optimized for your constraints
  • Example: MobileNetV3 and EfficientNet-Lite were designed for mobile/edge
  • When to use: Standard techniques are insufficient for your use case

Real-World Example Decision:

Gesture Recognition Model:

  • Initial: 1.5MB float32, 92% accuracy, 450ms inference
  • Target: <256KB flash, <50ms inference, >88% accuracy

Step 1: Int8 Quantization

  • Result: 380KB, 90% accuracy (-2%), 250ms inference
  • Status: Still >256KB, need more ✗

Step 2: Add 60% Pruning

  • Result: 152KB (380 × 0.4), 87% accuracy (-5% total), 180ms inference
  • Status: Fits in flash ✓, accuracy 87% < 88% target ✗

Step 3: Quantization-Aware Training

  • Result: 152KB, 89.5% accuracy (-2.5% total), 180ms inference
  • Status: Meets all targets ✓✓✓

Decision: Stop at Step 3 (QAT + 60% pruning). No need for distillation.
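The size arithmetic in this walkthrough fits in a back-of-envelope estimator (a sketch; real converters add per-layer metadata, which is why the 380 KB above differs slightly from the ideal 1536 / 4 = 384 KB):

```python
def pipeline_size_kb(size_kb, quantize=True, prune_sparsity=0.0):
    """Back-of-envelope model size: int8 quantization ~ /4,
    pruning with sparse storage ~ x(1 - sparsity)."""
    if quantize:
        size_kb /= 4.0
    return size_kb * (1.0 - prune_sparsity)

step1 = pipeline_size_kb(1536)                      # ~384 KB (the text rounds to 380)
step2 = pipeline_size_kb(1536, prune_sparsity=0.6)  # ~154 KB (cf. the text's 380 x 0.4 = 152)
```

Running the estimator before training tells you whether quantization alone can meet the flash budget or whether pruning must be planned in from the start.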

Common Mistake: Quantizing Model Without Representative Data

The Error: Applying post-training quantization using random or synthetic data instead of real sensor data from target deployment.

Real Example:

  • Gesture recognition model trained on clean lab accelerometer data
  • Quantized using synthetic Gaussian noise (random -1 to +1 values)
  • Deployed to real wearable devices
  • Result: 78% accuracy in field vs 92% accuracy in lab (14% degradation)

Why This Happens:

Quantization Calibration requires representative data to compute activation ranges:

import numpy as np

# WRONG: synthetic calibration data
calibration_data = np.random.uniform(-1, 1, size=(100, 12))  # random noise

# Quantization then computes ranges from this data:
#   min_activation = -1.0, max_activation = +1.0
#   scale_factor = 2.0 / 255 = 0.0078

# Real-world gesture data after deployment:
#   Shake gesture produces activations in [-0.9, 0.8]  -> fits the range OK
#   BUT Circle produces [-0.3, 0.3] -> uses only 30% of the quantization range
#   Result: Circle gesture loses precision (quantization error ~3x higher)

The Fix: Use REAL data from target environment for calibration:

# CORRECT: real calibration data from the wearable deployment
real_gestures = load_field_data()  # 100 samples from actual users

# Quantization now computes the ACTUAL ranges:
#   Shake:  min=-0.92, max=0.85
#   Tap:    min=-0.15, max=0.95
#   Tilt:   min= 0.05, max=0.75
#   Circle: min=-0.35, max=0.32

# The scale factor is optimized per layer from the real activation
# distributions, using the full precision range for each gesture type.

Results Comparison:

| Calibration Data | Field Accuracy | Accuracy Loss | Why |
|---|---|---|---|
| Random synthetic | 78% | 14% | Poor range estimation, wasted quantization bits |
| Lab data only | 84% | 8% | Better, but misses field variations (user differences, orientations) |
| Field data (diverse users) | 90% | 2% | Accurate range estimation, optimal quantization |

How to Collect Representative Data:

  1. Deploy pilot devices (10-50 units) to real users for 1-2 weeks
  2. Collect telemetry: Log raw accelerometer data for all detected gestures
  3. Download 500-1,000 samples covering:
    • All gesture types (shake, tap, tilt, circle)
    • Different users (hand sizes, movement styles)
    • Different orientations (wrist up, down, sideways)
    • Different contexts (walking, sitting, standing)
  4. Use for quantization calibration: TensorFlow Lite converter accepts calibration dataset

Cost-Benefit:

  • Pilot deployment: $500 (10 devices) + 2 weeks of time
  • Accuracy improvement: 78% → 90% (12 percentage points)
  • Avoided field failure: prevents a product recall or an emergency firmware patch
  • ROI: a $500 investment prevents $50K+ in returns and reputation damage

Lesson: Quantization is only as good as its calibration data. Always use representative real-world data from your target deployment environment, not synthetic or lab-only data. The 2-week pilot deployment pays for itself many times over in avoided field failures.

26.16 Summary and Key Takeaways

This lab demonstrated the core building blocks of TinyML for IoT applications:

  1. Neural network inference on microcontrollers is practical: a 3-layer fully-connected network runs in hundreds of microseconds on an ESP32, classifying accelerometer gestures in real time.

  2. Quantization (float32 to int8) provides 4x memory compression and approximately 2x inference speedup with typically less than 2% accuracy loss – this is the single most important optimization for deploying ML on microcontrollers.

  3. Pruning removes redundant weights to further reduce model size. Up to 50-70% of weights can typically be pruned with minimal accuracy impact, but beyond this threshold a sharp “pruning cliff” causes severe degradation.

  4. Softmax output converts raw neural network outputs into calibrated probabilities, enabling confidence-based decision-making (e.g., only act when confidence exceeds 70%).

  5. On-device inference eliminates network latency, reduces bandwidth usage, and keeps sensitive sensor data private – three properties essential for production IoT deployments.

| Technique | Compression | Speed Gain | Accuracy Impact | Difficulty |
|---|---|---|---|---|
| Int8 Quantization (PTQ) | 4x | 1.5-2x | 0.5-2% loss | Easy |
| Quantization-Aware Training | 4x | 1.5-2x | <0.5% loss | Medium |
| Pruning (50%) | 2x | Variable | 1-3% loss | Medium |
| Knowledge Distillation | 5-20x | Proportional | 1-5% loss | Hard |
| Combined Pipeline | 10-100x | 3-10x | 2-5% loss | Hard |

26.17 Knowledge Check

Common Pitfalls

Lab gesture recognition models built and tested on the same person’s gestures achieve 95%+ accuracy in testing but drop to 60-70% in production when other users try them. Always collect training data from multiple users with diverse hand sizes, orientations, and speeds. Minimum viable dataset: 10+ users × 20 gestures × 5 repetitions each.

Converting a float32 model to int8 without a calibration dataset causes symmetric quantization to misplace scale factors, producing garbage outputs. Always validate quantized model accuracy against a held-out test set and confirm it stays within 2% of the float32 baseline before deploying to hardware.

Developing with an Arduino IDE-style upload-and-run workflow masks inference timing. On a Cortex-M4 at 80 MHz, a 30KB model takes 15-50ms per inference — fast enough for 20fps gesture detection but too slow for 100fps industrial inspection. Profile latency with a hardware timer before finalizing model architecture.

Setting confidence thresholds (e.g., “accept if score > 0.8”) based on lab testing creates false-negative storms in production where lighting, orientation, or user differences shift score distributions. Use a calibration dataset from target deployment conditions to set thresholds that balance precision and recall for the actual use case.

26.18 What’s Next

| Topic | Chapter | Description |
|---|---|---|
| Model Optimization | Model Optimization for Edge AI | Dive deeper into quantization math, knowledge distillation, and production optimization pipelines |
| Hardware Platforms | Edge AI/ML Hardware Platforms | Compare hardware accelerators (Coral TPU, Intel Movidius, NVIDIA Jetson) for different deployment scenarios |
| Fog/Edge Production | Fog/Edge Production and Review | Explore orchestration platforms and workload distribution across edge-fog-cloud tiers |

Hands-On Practice:

  • Deploy a TinyML model on Arduino or ESP32 using Edge Impulse
  • Build an edge AI application with Coral Edge TPU and TensorFlow Lite
  • Implement a predictive maintenance system using vibration sensors and anomaly detection
  • Compare cloud vs edge inference for a computer vision application (measure latency, bandwidth, cost)

Further Reading: