6  Edge ML and TinyML Deployment

In 60 Seconds

Edge ML and TinyML enable machine learning inference on devices with less than 1MB RAM by using quantization, pruning, and knowledge distillation to shrink models. INT8 quantization alone reduces model size by 4x with only 1-3% accuracy loss, making it possible to run neural networks on microcontrollers for applications like predictive maintenance and smart HVAC control.

6.1 Learning Objectives

By the end of this chapter, you will be able to:

  • Decide Edge vs Cloud: Apply the decision framework to determine optimal ML deployment location
  • Implement TinyML: Deploy quantized neural networks on microcontrollers
  • Optimize Models: Use pruning, quantization, and knowledge distillation techniques
  • Design Real-Time Systems: Build predictive control systems with edge ML

Key Concepts

  • Model quantisation: Converting ML model weights from 32-bit floats to 8-bit integers (INT8) or 4-bit values, reducing model size by 4–8× and speeding up inference by 2–4× with minimal accuracy loss.
  • Model pruning: Removing connections (weights near zero) or entire neurons from a trained neural network, reducing model size and inference cost while preserving most of the model’s accuracy.
  • Knowledge distillation: Training a small ‘student’ model to mimic the outputs of a larger ‘teacher’ model, producing a compact model that captures the teacher’s knowledge in a fraction of the parameters.
  • TensorFlow Lite (TFLite): A lightweight ML inference framework designed for microcontrollers and mobile devices, supporting quantised models with a binary footprint as small as 16 KB.
  • TinyML: The field of running ML inference on microcontrollers and other extremely resource-constrained devices (< 1 MB RAM), enabling on-device intelligence without cloud connectivity.
  • ONNX (Open Neural Network Exchange): A cross-framework ML model format enabling models trained in PyTorch or TensorFlow to be exported and run in optimised edge inference runtimes.

Normally, machine learning runs on powerful cloud servers. But what if your device is in a remote forest, a moving vehicle, or a patient’s wrist – with no reliable internet? Edge ML means running the intelligence directly on the device itself. TinyML takes this even further, squeezing models into microcontrollers with less memory than a single photo on your phone. Think of it like carrying a pocket calculator instead of calling a math professor every time you need an answer – it is less powerful, but instant and always available.

6.2 Prerequisites

Chapter Series: Modeling and Inferencing

This is part 4 of the IoT Machine Learning series:

  1. ML Fundamentals - Core concepts
  2. Mobile Sensing - HAR, transportation
  3. IoT ML Pipeline - 7-step pipeline
  4. Edge ML & Deployment (this chapter) - TinyML, quantization
  5. Audio Feature Processing - MFCC
  6. Feature Engineering - Feature design
  7. Production ML - Monitoring

6.3 Edge vs Cloud ML: Decision Framework

Whether to process data locally (edge) or send it to the cloud is a critical architecture decision:

Decision tree for edge vs cloud ML deployment. The questions are asked in sequence: "Is latency under 100ms critical?" (yes → Edge); "Is the data privacy-sensitive?" (yes → Edge); "Is bandwidth high or connectivity intermittent?" (yes → Edge); "Is the model under 1MB?" (yes → Edge, no → Cloud).
Figure 6.1: Edge vs Cloud ML Deployment Decision Framework

6.3.1 Decision Factors

| Factor | Choose Edge | Choose Cloud |
|---|---|---|
| Latency | < 100ms required (safety-critical) | 1-5 seconds acceptable |
| Privacy | Sensitive data (health, financial) | Anonymous/aggregated data |
| Bandwidth | High data rate (video, audio) | Low data rate (temperature) |
| Connectivity | Intermittent or offline | Always connected |
| Model Complexity | Simple models (< 1MB) | Complex models (> 10MB) |
| Cost | High cloud costs, many devices | Expensive edge hardware |
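The flowchart in Figure 6.1 can be expressed as a short function. This is a minimal sketch; `choose_deployment` and its boolean parameters are illustrative names, not part of any framework:

```python
def choose_deployment(latency_critical, privacy_sensitive,
                      offline_or_high_bandwidth, model_under_1mb):
    """Sketch of the edge-vs-cloud decision tree in Figure 6.1."""
    if latency_critical:             # < 100ms required (safety-critical)
        return "edge"
    if privacy_sensitive:            # health, financial, always-listening audio
        return "edge"
    if offline_or_high_bandwidth:    # intermittent connectivity or raw video/audio
        return "edge"
    return "edge" if model_under_1mb else "cloud"

# Fall detection: latency-critical, so the first branch already picks edge.
# Fleet-wide predictive maintenance: none of the edge criteria hold -> cloud.
```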

6.3.2 Example Scenarios

| Application | Best Deployment | Reasoning |
|---|---|---|
| Fall Detection | Edge | < 100ms latency critical for safety |
| Voice Assistant Wake Word | Edge | Privacy (always listening) |
| Full Voice Command | Cloud | Complex NLU requires large models |
| Industrial Anomaly Detection | Hybrid | Real-time alerts (edge) + root cause (cloud) |
| Smart Thermostat | Edge | Works offline, simple model |
| Fleet-wide Predictive Maintenance | Cloud | Cross-device learning required |

Tradeoff: Model Accuracy vs Edge Deployability

Complex models (deep neural networks, large ensembles) achieve 95-98% accuracy but require megabytes of RAM and GPU/NPU acceleration. Lightweight models (decision trees, quantized CNNs) achieve 88-93% accuracy but fit in less than 256KB and run on basic MCUs. For battery-powered wearables requiring 5+ day battery life, choose lightweight. For edge AI accelerators (Jetson, Coral) with consistent power, choose complex models. The sweet spot is INT8 quantization: 4x size reduction with only 1-3% accuracy loss.

6.4 TinyML and Model Quantization

TinyML enables machine learning on devices with < 1MB RAM and < 1MB storage. The key technique is quantization—reducing numerical precision.

6.4.1 Quantization Overview

Weight-precision comparison: FP32 baseline at 500KB; FP16 at 250KB (2x compression); INT8 at 125KB (4x compression, 1-3% accuracy loss); INT4 at 62.5KB (8x compression, 3-8% accuracy loss).
Figure 6.2: Model Quantization from FP32 to INT8 and INT4

| Precision | Bits | Size Reduction | Accuracy Loss | Use Case |
|---|---|---|---|---|
| FP32 | 32 | Baseline | 0% | Training, cloud inference |
| FP16 | 16 | 2x | < 1% | GPU inference |
| INT8 | 8 | 4x | 1-3% | Edge devices, ESP32 |
| INT4 | 4 | 8x | 3-8% | MCUs, Cortex-M4 |

Quantization: Trading Precision for Efficiency

Quantization maps floating-point values to fixed-point integers. The math behind INT8 quantization shows the trade-off:

FP32 to INT8 Mapping: \[ \text{INT8 range} = [-128, 127] \quad \text{(256 discrete values)} \] \[ \text{FP32 range} = [w_{\min}, w_{\max}] \quad \text{(from model weights)} \]

Scale Factor: \[ S = \frac{w_{\max} - w_{\min}}{255} \quad \text{(quantization step size)} \]

Zero Point (offset to handle asymmetric ranges): \[ Z = -128 - \text{round}\left(\frac{w_{\min}}{S}\right) \quad \text{(so that } w_{\min} \text{ maps to } -128\text{)} \]

Quantization Formula: \[ w_{\text{INT8}} = \text{clip}\left(\text{round}\left(\frac{w_{\text{FP32}}}{S}\right) + Z, -128, 127\right) \]

De-quantization (for inference): \[ w_{\text{FP32}} \approx S \times (w_{\text{INT8}} - Z) \]

Example: Model weight range [-0.5, 0.8] \[ S = \frac{0.8 - (-0.5)}{255} = 0.0051 \quad Z = -128 - \text{round}\left(\frac{-0.5}{0.0051}\right) = -128 + 98 = -30 \] \[ w = -0.2 \rightarrow \text{clip}\!\left(\text{round}\left(\frac{-0.2}{0.0051}\right) - 30, -128, 127\right) = \text{clip}(-69, -128, 127) = -69 \quad \text{(INT8)} \]

Quantization Error: \[ \epsilon = S \times 0.5 = 0.0026 \quad \text{(maximum per-weight error)} \]

With 1000 weights, the cumulative error remains bounded by the Central Limit Theorem: \[ \epsilon_{\text{total}} \sim \mathcal{N}(0, \sigma^2), \quad \sigma = \frac{S}{\sqrt{12}} \times \sqrt{n} = 0.046 \quad \text{(< 5% typical)} \]

This explains why INT8 maintains 97-99% accuracy despite 4x compression—quantization errors are zero-mean and largely cancel out across layers.

6.4.2 Try It: Quantization Explorer

Experiment with INT8 quantization: vary the weight range and an input weight, and observe how the scale factor, zero point, quantized value, and reconstruction error change.
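The same exploration takes only a few lines of Python. This is a minimal sketch of the quantization formulas, using the signed-INT8 convention in which the range minimum maps to -128 (as in TensorFlow Lite); the function names are illustrative:

```python
def int8_quantize(w, w_min, w_max):
    """Quantize one FP32 weight to INT8 given the layer's weight range."""
    scale = (w_max - w_min) / 255             # quantization step size S
    zero_point = -128 - round(w_min / scale)  # w_min maps to -128
    q = round(w / scale) + zero_point
    q = max(-128, min(127, q))                # clip to the INT8 range
    return q, scale, zero_point

def int8_dequantize(q, scale, zero_point):
    """Reconstruct the approximate FP32 value."""
    return scale * (q - zero_point)

# Worked example: weight range [-0.5, 0.8], input weight -0.2
q, s, z = int8_quantize(-0.2, -0.5, 0.8)
w_hat = int8_dequantize(q, s, z)
error = abs(-0.2 - w_hat)                     # bounded by s / 2
```

Trying different weight ranges shows that the reconstruction error never exceeds half a quantization step (S/2).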

6.4.3 TensorFlow Lite Quantization

import numpy as np
import tensorflow as tf

# Train full-precision model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(27,)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(5, activation='softmax')  # 5 activity classes
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
model.fit(X_train, y_train, epochs=50, validation_data=(X_val, y_val))

# Convert to TensorFlow Lite with full-integer (INT8) quantization
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]

# Representative dataset for calibration (required for INT8)
def representative_data_gen():
    for sample in X_calibration[:100]:
        yield [sample.reshape(1, -1).astype(np.float32)]

converter.representative_dataset = representative_data_gen
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

# Convert and save
tflite_model = converter.convert()
with open('activity_classifier_int8.tflite', 'wb') as f:
    f.write(tflite_model)

# Result: 500KB → 125KB (4x reduction)

6.4.4 Quantization Calibration

INT8 quantization requires a representative dataset to calibrate the mapping from FP32 to INT8 ranges:

# Calibration determines min/max values for each layer
# Using 100-1000 representative samples works well

def representative_data_gen():
    # Use validation data (NOT training data) for calibration
    for i in range(min(100, len(X_val))):
        yield [X_val[i:i+1].astype(np.float32)]

Calibration Best Practices:

  • Use 100-1000 samples from validation set
  • Include diverse examples covering all classes
  • Do NOT use training data (causes overfit to training distribution)

6.5 Model Optimization Techniques

6.5.1 Pruning

Remove unimportant weights (set to zero):

import tensorflow_model_optimization as tfmot

# Apply pruning during training
prune_low_magnitude = tfmot.sparsity.keras.prune_low_magnitude

pruning_params = {
    'pruning_schedule': tfmot.sparsity.keras.PolynomialDecay(
        initial_sparsity=0.30,
        final_sparsity=0.70,
        begin_step=1000,
        end_step=5000
    )
}

model_for_pruning = prune_low_magnitude(model, **pruning_params)
model_for_pruning.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

# Train with pruning (the UpdatePruningStep callback is required)
model_for_pruning.fit(
    X_train, y_train, epochs=10,
    callbacks=[tfmot.sparsity.keras.UpdatePruningStep()]
)

# Remove the pruning wrappers before export so the sparsity is realized
final_model = tfmot.sparsity.keras.strip_pruning(model_for_pruning)

# Result: 70% of weights become zero → ~3x compression with <2% accuracy loss

6.5.2 Knowledge Distillation

Train small “student” model to mimic large “teacher” model:

A large Teacher Model (10M parameters) outputs soft probability distributions that train a small Student Model (100K parameters); learning from the teacher's soft labels lifts the student to 95% accuracy, versus 90% when trained only on hard ground-truth labels.
Figure 6.3: Knowledge Distillation from Teacher to Student Model

Benefits:

  • Student achieves 95% accuracy vs 90% when trained directly on labels
  • Soft labels from teacher contain more information than hard labels
  • Enables deployment of small models with large-model accuracy
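The distillation objective behind these numbers combines a hard-label cross-entropy term with a temperature-softened KL term. The following is a minimal NumPy sketch; `distillation_loss`, `T`, and `alpha` are illustrative names, and real training would compute this over batches inside the ML framework:

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax; higher T produces softer distributions."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()                      # numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, true_label,
                      T=4.0, alpha=0.5):
    """alpha weights the hard-label term; (1 - alpha) the soft-label term."""
    # Hard term: ordinary cross-entropy against the ground-truth label
    hard = -np.log(softmax(student_logits)[true_label] + 1e-12)
    # Soft term: KL(teacher || student) at temperature T,
    # scaled by T^2 to keep gradient magnitudes comparable
    q_t = softmax(teacher_logits, T)
    q_s = softmax(student_logits, T)
    soft = np.sum(q_t * (np.log(q_t + 1e-12) - np.log(q_s + 1e-12)))
    return alpha * hard + (1 - alpha) * (T ** 2) * soft
```

With alpha around 0.5 and T in the 2-5 range, the student is pulled toward the teacher's full output distribution rather than only its argmax label, which is where the extra information in soft labels comes from.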

6.6 Worked Example: HVAC Predictive Control with Edge LSTM

Scenario: A smart building requires temperature predictions 60 minutes ahead for efficient HVAC pre-conditioning. The edge gateway has 512KB RAM and must work offline.

6.6.1 System Architecture

Data flows from the sensors (indoor temperature, outdoor temperature, occupancy, HVAC state) through a feature-engineering layer (lag features, cyclic time encoding) into a quantized LSTM on the edge gateway, which predicts temperature 60 minutes ahead; an HVAC controller then adjusts the pre-conditioning schedule.
Figure 6.4: HVAC Predictive Control Pipeline with Edge LSTM Model

6.6.2 Step 1: Data Preparation

Raw Data Collection (30 days of historical readings):

| Metric | Value | Notes |
|---|---|---|
| Total samples | 43,200 | 30 days × 24 hours × 60 minutes |
| Sampling rate | 1 minute | Standard HVAC sensor frequency |
| Sensors | 4 | Indoor temp, outdoor temp, occupancy, HVAC state |
| Storage | ~2 MB | CSV format with timestamps |

Train/Validation/Test Split (time-series aware):

| Split | Samples | Percentage | Date Range |
|---|---|---|---|
| Training | 30,240 | 70% | Days 1-21 |
| Validation | 6,480 | 15% | Days 22-25.5 |
| Test | 6,480 | 15% | Days 25.5-30 |

6.6.3 Step 2: Feature Engineering

Feature Set (14 features engineered from the 4 raw sensors; the 7 highest-importance features are shown):

| Feature Name | Type | Importance | Rationale |
|---|---|---|---|
| temp_lag_1h | Numeric | 0.35 | Strong autocorrelation |
| temp_lag_30m | Numeric | 0.22 | Recent trend indicator |
| outdoor_temp | Numeric | 0.18 | Heat transfer driver |
| outdoor_delta_1h | Numeric | 0.08 | Predicts future heat load |
| hour_sin | Cyclic | 0.05 | Daily temperature cycle |
| hour_cos | Cyclic | 0.03 | Daily cycle phase |
| occupancy | Binary | 0.04 | Body heat, door openings |

Why Cyclic Encoding for Time?

Problem: Hour of day (0-23) is categorical but has circular relationship (11 PM is close to 1 AM).

Bad approach: Raw hour value (0-23) → Model thinks 0 and 23 are far apart!

Good approach: Encode as a (sin, cos) pair:

  • Hour 0: sin=0, cos=1
  • Hour 6: sin=1, cos=0
  • Hour 12: sin=0, cos=-1
  • Hour 23: sin≈-0.26, cos≈0.97 (close to hour 0!)

This preserves the circular relationship.
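The encoding above is two lines of NumPy (helper names are illustrative):

```python
import numpy as np

def encode_hour(hour):
    """Map hour-of-day (0-23) onto the unit circle as a (sin, cos) pair."""
    angle = 2 * np.pi * hour / 24
    return np.sin(angle), np.cos(angle)

def encoded_distance(h1, h2):
    """Euclidean distance between two encoded hours."""
    return float(np.hypot(*(np.subtract(encode_hour(h1), encode_hour(h2)))))
```

Distances between encoded hours now match clock distance: 23:00 and 00:00 are neighbors, while 00:00 and 12:00 are maximally far apart.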

6.6.4 Step 3: Model Selection

| Model | MAE (deg C) | Size (KB) | Inference (ms) | Fits 512KB? |
|---|---|---|---|---|
| Linear Regression | 1.8 | 2 | 0.1 | Yes |
| Random Forest (100 trees) | 0.9 | 450 | 25 | Yes (tight) |
| LSTM (2 layers, 32 units) | 0.7 | 200 | 50 | Yes |
| Quantized LSTM (INT8) | 0.8 | 50 | 15 | Yes |

Winner: Quantized LSTM - Only 0.1 deg C accuracy loss, 4x smaller, 3x faster

6.6.5 Step 4: Quantization

import tensorflow as tf

# Train full-precision LSTM
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, return_sequences=True, input_shape=(60, 14)),
    tf.keras.layers.LSTM(16),
    tf.keras.layers.Dense(1)  # Predict temperature
])
model.compile(optimizer='adam', loss='mse')
model.fit(X_train, y_train, epochs=50, validation_data=(X_val, y_val))

# Convert to TensorFlow Lite with INT8 quantization
# (representative_data_gen yields calibration windows, as in Section 6.4.3)
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
tflite_model = converter.convert()

# Save quantized model
with open('hvac_predictor_int8.tflite', 'wb') as f:
    f.write(tflite_model)
# Result: 200KB → 50KB (75% reduction)

6.6.6 Step 5: Results

Deployment Validation:

| Metric | Target | Achieved | Status |
|---|---|---|---|
| Model size | < 512KB | 50KB | Pass |
| Inference latency | < 1000ms | 15ms | Pass (66x margin) |
| MAE accuracy | < 1.0 deg C | 0.8 deg C | Pass |
| Works offline | Required | Yes | Pass |

Energy Savings:

Pre-deployment (reactive HVAC):
- HVAC cycles: 24 per day (start/stop at setpoint)
- Energy waste: HVAC runs until room reaches setpoint, then overshoots

Post-deployment (predictive HVAC):
- HVAC pre-starts 30-45 min before needed
- Reduces cycling: 24 → 18 cycles/day (25% fewer)

Monthly energy comparison:
- Baseline: 1,200 kWh/month
- Predictive: 1,020 kWh/month
- Savings: 180 kWh/month (15% reduction)

Annual savings (10-story office building):
- 180 kWh × 12 months × $0.12/kWh = $259/floor/year
- 10 floors = $2,590/year
- Edge gateway cost: $150 one-time
- Payback period: 0.7 months
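The savings arithmetic can be checked directly (values taken from the figures above; the small difference from the quoted $2,590 is rounding):

```python
baseline_kwh, predictive_kwh = 1200, 1020    # monthly energy per floor
rate_usd_per_kwh = 0.12
floors = 10
gateway_cost_usd = 150                       # one-time edge hardware cost

monthly_savings_kwh = baseline_kwh - predictive_kwh            # 180 kWh
annual_savings_usd = monthly_savings_kwh * 12 * rate_usd_per_kwh * floors
payback_months = gateway_cost_usd / (annual_savings_usd / 12)  # ~0.7 months
```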

6.7 Key Takeaways from HVAC Example

  1. Feature engineering drives accuracy: Lag features contributed 57% of model importance
  2. Quantization is nearly free: INT8 reduced size by 75% with only 0.1 deg C loss
  3. Time-series requires chronological splits: Random splits cause data leakage
  4. Cyclic encoding for time: sin/cos encoding preserves circular relationships
  5. Business value justifies complexity: 15% energy savings pays for development in < 1 month

6.8 Real-World Deployment: Sony’s Spresense and Edge AI for Wildlife Conservation

Wildlife Conservation Society (WCS) deployed edge ML on Sony’s Spresense microcontroller board (Cortex-M4F, 768 KB SRAM, $25/unit) to detect illegal chainsaw activity in tropical rainforests. The system illustrates the full edge ML deployment pipeline in a resource-constrained, real-world environment.

Problem: Illegal logging accounts for 15-30% of global timber trade. Traditional monitoring uses satellite imagery with 3-7 day update cycles – by the time deforestation is detected, the loggers are gone. Audio monitoring can detect chainsaws in real-time, but requires ML inference in remote locations with no internet and solar-only power.

Model optimization journey:

| Stage | Model | Size | Accuracy | Inference Time | Power |
|---|---|---|---|---|---|
| Cloud prototype | ResNet-18 (PyTorch) | 44 MB | 97.2% | 180 ms (GPU) | N/A |
| Compressed | MobileNetV2 (TF Lite) | 3.4 MB | 95.8% | 420 ms (RPi 4) | 3.2 W |
| Quantized INT8 | MobileNetV2 (TF Lite Micro) | 680 KB | 94.1% | 280 ms (Spresense) | 45 mW |
| Pruned + quantized | Custom CNN (TF Lite Micro) | 210 KB | 91.6% | 85 ms (Spresense) | 32 mW |

The final 210 KB model fits comfortably in the Spresense’s 768 KB SRAM with room for audio buffers and firmware. The 5.6% accuracy reduction from the cloud prototype was acceptable because the system’s value comes from real-time detection (minutes, not days) rather than perfect classification.

Deployment results (Peruvian Amazon, 12 months):

  • 50 solar-powered nodes covering 2,500 hectares
  • Battery life: 14 months on 3,000 mAh LiPo + 1W solar panel (duty-cycled: 5 seconds of audio analysis every 30 seconds)
  • True positive rate: 89% for chainsaws at distances up to 300 meters
  • False positive rate: 2.1 per node per day (mainly heavy rain on metal roofs)
  • Response time: Alert transmitted via LoRa to ranger station within 90 seconds of detection
  • Outcome: 7 illegal logging operations intercepted, estimated 180 hectares of forest saved

The key lesson: a 91.6% accurate model running in real-time on a $25 microcontroller delivered more conservation value than a 97.2% accurate model that would have required $500 in gateway hardware, cellular connectivity, and cloud processing per node – making the 50-node deployment financially impossible.

6.9 How It Works: INT8 Quantization

INT8 quantization transforms a floating-point neural network into an integer-only version, reducing model size by 4x with minimal accuracy loss. Here is the complete process (see the "Quantization: Trading Precision for Efficiency" callout in Section 6.4.1 for the detailed math):

Step 1: Calibration – Feed 100-1000 representative samples from the validation set through the FP32 model, recording min/max activation values for each layer. For an activity classifier: Layer 1 activations range [-2.3, 4.7], Layer 2 ranges [-1.8, 3.2], etc.

Step 2: Scale and zero-point calculation – For each layer, compute scale = (max - min) / 255 and zero_point = -128 - round(min / scale), so that the minimum of the range maps to -128. These map the continuous FP32 range onto 256 discrete INT8 values.

Step 3: Weight quantization – Convert each FP32 weight to INT8: q_weight = clip(round(fp32_weight / scale) + zero_point, -128, 127). Each weight shrinks from 4 bytes to 1 byte.

Step 4: Runtime inference – All arithmetic uses integer operations. Schematically, each multiply-accumulate becomes q_output = (q_input * q_weight) >> shift, where shift adjusts for the accumulated scaling (zero-point offsets are folded into the computation as well). No floating-point operations – this enables deployment on microcontrollers without an FPU.

Step 5: Dequantization (output layer only) – Convert INT8 output back to FP32: fp32_output = (q_output - zero_point) * scale. For classification, apply temperature scaling to calibrate softmax probabilities.
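Steps 3-5 can be demonstrated end to end with an integer-only dot product. This is a NumPy sketch using per-tensor scales and the signed-INT8 convention; a real TFLite Micro kernel replaces the final floating-point rescale with a fixed-point multiply and shift:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.uniform(-0.5, 0.8, 64).astype(np.float32)   # FP32 weights
x = rng.uniform(-1.0, 1.0, 64).astype(np.float32)   # FP32 activations

def quantize_tensor(v):
    """Step 3: per-tensor asymmetric quantization to INT8."""
    lo, hi = float(v.min()), float(v.max())
    scale = (hi - lo) / 255
    zero = -128 - int(np.round(lo / scale))          # lo maps to -128
    q = np.clip(np.round(v / scale) + zero, -128, 127).astype(np.int8)
    return q, scale, zero

qw, sw, zw = quantize_tensor(w)
qx, sx, zx = quantize_tensor(x)

# Step 4: integer-only multiply-accumulate in an int32 accumulator
acc = int(np.sum((qw.astype(np.int32) - zw) * (qx.astype(np.int32) - zx)))

# Step 5: one dequantization at the end recovers an FP32 approximation
approx = sw * sx * acc
exact = float(np.dot(w, x))   # approx stays within the quantization error of exact
```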

Why it works: Neural networks are inherently robust to quantization noise. Precise floating-point values (e.g., 1.35472891) contribute no more accuracy than their rounded INT8 equivalents. The 1-3% typical accuracy loss comes from extreme activation values outside the calibration range, which get clipped.

Hardware acceleration: ARM Cortex-M processors with DSP extensions include SIMD instructions (e.g., SMLAD, SMUAD) that perform dual multiply-accumulates per cycle on packed, sign-extended INT8 operands, making quantized inference 3-5x faster than FP32 on the same hardware.

6.10 Knowledge Check

Common Pitfalls

INT8 quantisation typically reduces accuracy by 1–3%, but for some model architectures or data distributions it can drop by 10–20%. Always evaluate quantised model accuracy on a representative test set before deploying.

A model designed for a high-end GPU and then ‘ported’ to a microcontroller is almost always too large and too slow. Start with the edge hardware constraints (RAM, flash, MIPS) and design the model architecture within those limits from the beginning.

IoT environments change (new noise sources, seasonal variation, equipment ageing) causing model accuracy to degrade. Plan for periodic model retraining and OTA model updates from the start of deployment.

A 100 ms inference time is fine for batch analysis but unacceptable for real-time motor control requiring <10 ms response. Profile inference latency on target hardware during model development, not after deployment.

6.11 Summary

This chapter covered edge ML and TinyML deployment:

  • Edge vs Cloud Decision: Use edge for low latency, privacy, and offline requirements
  • Quantization: INT8 reduces model size by 4x with 1-3% accuracy loss
  • Pruning: Remove 70% of weights with <2% accuracy loss
  • Knowledge Distillation: Train small models to match large model performance
  • HVAC Example: Quantized LSTM achieves 15% energy savings with 0.7-month payback

Key Insight: Start with the simplest model that meets constraints. Quantization is nearly free—always try it before abandoning edge deployment.

Key Takeaway

INT8 quantization is the single most impactful optimization for edge ML – it reduces model size by 4x and inference latency by roughly 3x with only 1-3% accuracy loss. Always try quantization before concluding a model cannot run on edge hardware. The HVAC example demonstrates that even a 50KB quantized LSTM on a $150 edge gateway can deliver 15% energy savings with a payback period under one month.

Can a tiny computer be as smart as a big one? The Sensor Squad finds out!

Max the Microcontroller has a problem. He wants to predict when a room will get too hot so he can turn on the air conditioning early. But the super-smart brain (neural network) that can do this is TOO BIG to fit in his tiny memory!

“I only have 512 kilobytes of memory!” Max sighs. “That is like having a bookshelf that only fits 5 books, but the brain needs 200 books!”

Sammy the Sensor has an idea: “What if we make the brain SMALLER?”

They try three tricks:

Trick 1 - Shrink the Numbers (Quantization) Instead of using really precise numbers like 3.14159265, Max rounds everything to simpler numbers like 3. This makes the brain 4 times smaller!

“It is like drawing with 8 crayons instead of 32,” explains Lila the LED. “You lose a tiny bit of detail, but the picture still looks great!”

Trick 2 - Remove the Lazy Parts (Pruning) Some parts of the brain do almost nothing. Max removes 70% of the lazy connections, and the brain still works almost as well!

Trick 3 - Learn from the Expert (Distillation) A big expert brain teaches a tiny student brain. The student learns the shortcuts and becomes almost as smart!

After all three tricks, the brain goes from 200KB down to 50KB – it fits perfectly in Max’s memory! And it only takes 15 milliseconds to make a prediction.

Bella the Battery cheers: “And because the brain is so small, I barely use any energy running it! I can last for DAYS!”

The building saves 15% on electricity because Max can predict when to turn on the AC ahead of time, instead of waiting until the room is already too hot.

6.11.1 Try This at Home!

Draw a detailed picture of your pet (or favorite animal) using 32 colored pencils. Now try drawing the SAME picture using only 8 colors. It still looks like your pet, right? That is quantization! You used fewer colors (less precision) but kept the important information. Computers do the same thing with numbers to make smart brains fit in tiny devices.

6.12 Concept Relationships

Edge ML enables:

  • Audio Processing - Wake word detection on Cortex-M4 using quantized models
  • Production ML - Real-time anomaly detection without cloud dependency

Parallel concepts:

  • Model quantization (FP32 → INT8) ↔︎ Audio compression (PCM → MP3): Both sacrifice precision for size with minimal quality loss
  • Pruning ↔︎ Feature selection: Both remove low-value components to improve efficiency
  • Knowledge distillation (teacher → student) ↔︎ Transfer learning: Both leverage large models to improve small models

6.13 See Also

Hardware platforms:

  • ESP32 - $5 MCU with 520KB SRAM, Wi-Fi/Bluetooth
  • Raspberry Pi 4 - $35-75 SBC with 4-8GB RAM, quad-core ARM
  • NVIDIA Jetson Nano - $99 GPU accelerator with 4GB RAM
  • Google Coral Dev Board - $149 Edge TPU accelerator

6.14 What’s Next

| Direction | Chapter | Link |
|---|---|---|
| Next | Audio Feature Processing | modeling-audio-features.html |
| Previous | IoT ML Pipeline | modeling-pipeline.html |
| Related | Feature Engineering | modeling-feature-engineering.html |
| Related | Production ML | modeling-production.html |