6 Edge ML and TinyML Deployment
6.1 Learning Objectives
By the end of this chapter, you will be able to:
- Decide Edge vs Cloud: Apply the decision framework to determine optimal ML deployment location
- Implement TinyML: Deploy quantized neural networks on microcontrollers
- Optimize Models: Use pruning, quantization, and knowledge distillation techniques
- Design Real-Time Systems: Build predictive control systems with edge ML
Key Concepts
- Model quantisation: Converting ML model weights from 32-bit floats to 8-bit integers (INT8) or 4-bit values, reducing model size by 4–8× and improving inference speed by 2–4× with minimal accuracy loss.
- Model pruning: Removing connections (weights near zero) or entire neurons from a trained neural network, reducing model size and inference cost while preserving most of the model’s accuracy.
- Knowledge distillation: Training a small ‘student’ model to mimic the outputs of a larger ‘teacher’ model, producing a compact model that captures the teacher’s knowledge in a fraction of the parameters.
- TensorFlow Lite (TFLite): A lightweight ML inference framework designed for microcontrollers and mobile devices, supporting quantised models with a binary footprint as small as 16 KB.
- TinyML: The field of running ML inference on microcontrollers and other extremely resource-constrained devices (< 1 MB RAM), enabling on-device intelligence without cloud connectivity.
- ONNX (Open Neural Network Exchange): A cross-framework ML model format enabling models trained in PyTorch or TensorFlow to be exported and run in optimised edge inference runtimes.
For Beginners: Edge ML and TinyML Deployment
Normally, machine learning runs on powerful cloud servers. But what if your device is in a remote forest, a moving vehicle, or a patient’s wrist – with no reliable internet? Edge ML means running the intelligence directly on the device itself. TinyML takes this even further, squeezing models into microcontrollers with less memory than a single photo on your phone. Think of it like carrying a pocket calculator instead of calling a math professor every time you need an answer – it is less powerful, but instant and always available.
6.2 Prerequisites
- ML Fundamentals: Training vs inference concepts
- IoT ML Pipeline: 7-step ML pipeline
- Basic understanding of microcontrollers (ESP32, Cortex-M4)
Chapter Series: Modeling and Inferencing
This is part 4 of the IoT Machine Learning series:
- ML Fundamentals - Core concepts
- Mobile Sensing - HAR, transportation
- IoT ML Pipeline - 7-step pipeline
- Edge ML & Deployment (this chapter) - TinyML, quantization
- Audio Feature Processing - MFCC
- Feature Engineering - Feature design
- Production ML - Monitoring
6.3 Edge vs Cloud ML: Decision Framework
Deciding when to process data locally (at the edge) versus sending it to the cloud is a critical architectural decision:
6.3.1 Decision Factors
| Factor | Choose Edge | Choose Cloud |
|---|---|---|
| Latency | < 100ms required (safety-critical) | 1-5 seconds acceptable |
| Privacy | Sensitive data (health, financial) | Anonymous/aggregated data |
| Bandwidth | High data rate (video, audio) | Low data rate (temperature) |
| Connectivity | Intermittent or offline | Always connected |
| Model Complexity | Simple models (< 1MB) | Complex models (> 10MB) |
| Cost | High cloud costs, many devices | Expensive edge hardware |
6.3.2 Example Scenarios
| Application | Best Deployment | Reasoning |
|---|---|---|
| Fall Detection | Edge | < 100ms latency critical for safety |
| Voice Assistant Wake Word | Edge | Privacy (always listening) |
| Full Voice Command | Cloud | Complex NLU requires large models |
| Industrial Anomaly Detection | Hybrid | Real-time alerts (edge) + root cause (cloud) |
| Smart Thermostat | Edge | Works offline, simple model |
| Fleet-wide Predictive Maintenance | Cloud | Cross-device learning required |
Tradeoff: Model Accuracy vs Edge Deployability
Complex models (deep neural networks, large ensembles) achieve 95-98% accuracy but require megabytes of RAM and GPU/NPU acceleration. Lightweight models (decision trees, quantized CNNs) achieve 88-93% accuracy but fit in less than 256KB and run on basic MCUs. For battery-powered wearables requiring 5+ day battery life, choose lightweight. For edge AI accelerators (Jetson, Coral) with consistent power, choose complex models. The sweet spot is INT8 quantization: 4x size reduction with only 1-2% accuracy loss.
6.4 TinyML and Model Quantization
TinyML enables machine learning on devices with < 1MB RAM and < 1MB storage. The key technique is quantization—reducing numerical precision.
6.4.1 Quantization Overview
| Precision | Bits | Size Reduction | Accuracy Loss | Use Case |
|---|---|---|---|---|
| FP32 | 32 | Baseline | 0% | Training, cloud inference |
| FP16 | 16 | 2x | < 1% | GPU inference |
| INT8 | 8 | 4x | 1-3% | Edge devices, ESP32 |
| INT4 | 4 | 8x | 3-8% | MCUs, Cortex-M4 |
Putting Numbers to It
Quantization: Trading Precision for Efficiency
Quantization maps floating-point values to fixed-point integers. The math behind INT8 quantization shows the trade-off:
FP32 to INT8 Mapping: \[ \text{INT8 range} = [-128, 127] \quad \text{(256 discrete values)} \] \[ \text{FP32 range} = [w_{\min}, w_{\max}] \quad \text{(from model weights)} \]
Scale Factor: \[ S = \frac{w_{\max} - w_{\min}}{255} \quad \text{(quantization step size)} \]
Zero Point (offset to handle asymmetric ranges): \[ Z = -\text{round}\left(\frac{w_{\min}}{S}\right) \quad \text{(the INT8 value that represents FP32 zero)} \]
Quantization Formula: \[ w_{\text{INT8}} = \text{clip}\left(\text{round}\left(\frac{w_{\text{FP32}}}{S}\right) + Z, -128, 127\right) \]
De-quantization (for inference): \[ w_{\text{FP32}} \approx S \times (w_{\text{INT8}} - Z) \]
Example: Model weight range [-0.5, 0.8] \[ S = \frac{0.8 - (-0.5)}{255} = 0.0051 \quad Z = -\text{round}\left(\frac{-0.5}{0.0051}\right) = 98 \] \[ w = -0.2 \rightarrow \text{clip}\!\left(\text{round}\left(\frac{-0.2}{0.0051}\right) + 98, -128, 127\right) = \text{clip}(59, -128, 127) = 59 \quad \text{(INT8)} \]
Quantization Error: \[ \epsilon = S \times 0.5 = 0.0026 \quad \text{(maximum per-weight error)} \]
With \(n = 1000\) weights, the Central Limit Theorem keeps the cumulative error bounded: \[ \epsilon_{\text{total}} \sim \mathcal{N}(0, \frac{S}{\sqrt{12}} \times \sqrt{n}) = \mathcal{N}(0, 0.046) \quad \text{(standard deviation; < 5% typical)} \]
This explains why INT8 maintains 97-99% accuracy despite 4x compression—quantization errors are zero-mean and largely cancel out across layers.
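The formulas above can be checked with a few lines of plain Python. This is a minimal sketch of per-tensor asymmetric INT8 quantization, reproducing the [-0.5, 0.8] worked example:

```python
def int8_params(w_min, w_max):
    """Scale and zero point for asymmetric INT8 quantization."""
    scale = (w_max - w_min) / 255          # quantization step size S
    zero_point = -round(w_min / scale)     # Z: the INT8 value for FP32 zero
    return scale, zero_point

def quantize(w, scale, zero_point):
    """FP32 -> INT8, clipped to [-128, 127]."""
    return max(-128, min(127, round(w / scale) + zero_point))

def dequantize(q, scale, zero_point):
    """INT8 -> approximate FP32."""
    return scale * (q - zero_point)

scale, zp = int8_params(-0.5, 0.8)   # scale ~ 0.0051, zp = 98
q = quantize(-0.2, scale, zp)        # 59, matching the worked example
w_approx = dequantize(q, scale, zp)  # ~ -0.1988; error < scale/2 ~ 0.0026
```

Note that the reconstruction error stays below half a quantization step, as the error bound above predicts.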
6.4.2 Try It: Quantization Explorer
Use the sliders below to experiment with INT8 quantization. Adjust the weight range and input weight to see how the scale factor, zero point, quantized value, and reconstruction error change in real time.
6.4.3 TensorFlow Lite Quantization
import numpy as np
import tensorflow as tf

# Train full-precision model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(27,)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(5, activation='softmax')  # 5 activity classes
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
model.fit(X_train, y_train, epochs=50, validation_data=(X_val, y_val))

# Convert to TensorFlow Lite with INT8 quantization
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]

# Representative dataset for calibration (required for INT8)
def representative_data_gen():
    for sample in X_calibration[:100]:
        yield [sample.reshape(1, -1).astype(np.float32)]

converter.representative_dataset = representative_data_gen
# Restrict the converter to integer-only ops
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

# Convert and save
tflite_model = converter.convert()
with open('activity_classifier_int8.tflite', 'wb') as f:
    f.write(tflite_model)
# Result: 500KB → 125KB (4x reduction)

6.4.4 Quantization Calibration
INT8 quantization requires a representative dataset to calibrate the mapping from FP32 to INT8 ranges:
import numpy as np

# Calibration determines min/max values for each layer
# Using 100-1000 representative samples works well
def representative_data_gen():
    # Use validation data (NOT training data) for calibration
    for i in range(min(100, len(X_val))):
        yield [X_val[i:i+1].astype(np.float32)]

Calibration Best Practices:
- Use 100-1000 samples from validation set
- Include diverse examples covering all classes
- Do NOT use training data (causes overfit to training distribution)
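What the converter's calibration pass computes can be sketched in a few lines of plain Python. The activation values here are illustrative, using the hypothetical Layer-1 range [-2.3, 4.7] discussed later in this chapter:

```python
def calibrate(observed_activations):
    """Derive per-tensor INT8 scale/zero-point from calibration activations."""
    a_min = min(observed_activations)
    a_max = max(observed_activations)
    scale = (a_max - a_min) / 255
    zero_point = -round(a_min / scale)
    return scale, zero_point

# Illustrative: a layer whose calibration activations span [-2.3, 4.7]
scale, zp = calibrate([-2.3, 0.0, 1.2, 4.7])
# At run time, activations outside this observed range are clipped;
# that clipping is the main source of INT8 accuracy loss
```

This is why diverse calibration samples matter: they must cover the activation extremes the model will actually see.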
6.5 Model Optimization Techniques
6.5.1 1. Pruning
Remove unimportant weights (set to zero):
import tensorflow_model_optimization as tfmot

# Apply pruning during training
prune_low_magnitude = tfmot.sparsity.keras.prune_low_magnitude
pruning_params = {
    'pruning_schedule': tfmot.sparsity.keras.PolynomialDecay(
        initial_sparsity=0.30,
        final_sparsity=0.70,
        begin_step=1000,
        end_step=5000
    )
}
model_for_pruning = prune_low_magnitude(model, **pruning_params)
model_for_pruning.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

# Train with pruning (the callback updates the sparsity masks each step)
model_for_pruning.fit(
    X_train, y_train, epochs=10,
    callbacks=[tfmot.sparsity.keras.UpdatePruningStep()]
)
# Result: 70% of weights become zero → ~3x compression with <2% accuracy loss

6.5.2 2. Knowledge Distillation
Train small “student” model to mimic large “teacher” model:
Benefits:
- Student achieves 95% accuracy vs 90% when trained directly on labels
- Soft labels from teacher contain more information than hard labels
- Enables deployment of small models with large-model accuracy
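The soft-label idea can be sketched in plain Python. This is a minimal illustration of the standard temperature-scaled distillation loss; the logits, temperature T=4, and weighting alpha=0.7 are hypothetical values, not from this chapter:

```python
import math

def softmax(logits, T=1.0):
    """Softmax with temperature T (T > 1 softens the distribution)."""
    m = max(logits)
    exps = [math.exp((z - m) / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, label, T=4.0, alpha=0.7):
    """Weighted sum of soft (teacher) and hard (label) cross-entropy.

    The T**2 factor compensates for the smaller gradients of softened targets.
    """
    soft_t = softmax(teacher_logits, T)   # softened teacher probabilities
    soft_s = softmax(student_logits, T)
    soft_loss = -sum(t * math.log(s) for t, s in zip(soft_t, soft_s))
    hard_loss = -math.log(softmax(student_logits)[label])
    return alpha * (T ** 2) * soft_loss + (1 - alpha) * hard_loss

# Toy example: softened teacher output reveals class similarities
# that a one-hot label discards
loss = distillation_loss([2.0, 0.5, -1.0], [3.0, 0.2, -2.0], label=0)
```

Raising T flattens the teacher's output, exposing which wrong classes the teacher considers "almost right"; that relational information is what the student learns beyond the hard labels.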
6.6 Worked Example: HVAC Predictive Control with Edge LSTM
Scenario: A smart building requires temperature predictions 60 minutes ahead for efficient HVAC pre-conditioning. The edge gateway has 512KB RAM and must work offline.
6.6.1 System Architecture
6.6.2 Step 1: Data Preparation
Raw Data Collection (30 days of historical readings):
| Metric | Value | Notes |
|---|---|---|
| Total samples | 43,200 | 30 days × 24 hours × 60 minutes |
| Sampling rate | 1 minute | Standard HVAC sensor frequency |
| Sensors | 4 | Indoor temp, outdoor temp, occupancy, HVAC state |
| Storage | ~2 MB | CSV format with timestamps |
Train/Validation/Test Split (time-series aware):
| Split | Samples | Percentage | Date Range |
|---|---|---|---|
| Training | 30,240 | 70% | Days 1-21 |
| Validation | 6,480 | 15% | Days 22-25.5 |
| Test | 6,480 | 15% | Days 25.5-30 |
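The chronological split above reduces to simple index arithmetic over time-ordered samples. A sketch, with the 70/15/15 fractions and 43,200-sample count taken from the tables above:

```python
def chronological_split(n, train_frac=0.70, val_frac=0.15):
    """Split time-ordered samples without shuffling, to avoid leakage."""
    train_end = round(n * train_frac)
    val_end = train_end + round(n * val_frac)
    # Each split is a (start, end) index pair over the sorted samples
    return (0, train_end), (train_end, val_end), (val_end, n)

train, val, test = chronological_split(43_200)
# train covers days 1-21, val days 22-25.5, test days 25.5-30;
# a random shuffle here would leak future readings into training
```

Validation and test always come strictly *after* training in time, mirroring how the deployed model will see data.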
6.6.3 Step 2: Feature Engineering
Feature Set (14 features from 4 raw sensors):
| Feature Name | Type | Importance | Rationale |
|---|---|---|---|
| temp_lag_1h | Numeric | 0.35 | Strong autocorrelation |
| temp_lag_30m | Numeric | 0.22 | Recent trend indicator |
| outdoor_temp | Numeric | 0.18 | Heat transfer driver |
| outdoor_delta_1h | Numeric | 0.08 | Predicts future heat load |
| hour_sin | Cyclic | 0.05 | Daily temperature cycle |
| hour_cos | Cyclic | 0.03 | Daily cycle phase |
| occupancy | Binary | 0.04 | Body heat, door openings |
Why Cyclic Encoding for Time?
Problem: Hour of day (0-23) is circular (11 PM is close to 1 AM), but a plain numeric encoding hides this.
Bad approach: Raw hour value (0-23) → the model thinks hours 0 and 23 are far apart!
Good approach: Encode each hour as a (sin, cos) pair:
- Hour 0: sin=0, cos=1
- Hour 6: sin=1, cos=0
- Hour 12: sin=0, cos=-1
- Hour 23: sin≈-0.26, cos≈0.97 (close to hour 0!)
This preserves the circular relationship.
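The encoding is a standard mapping onto the unit circle; a short sketch:

```python
import math

def encode_hour(hour):
    """Map hour-of-day (0-23) onto the unit circle as (sin, cos)."""
    angle = 2 * math.pi * hour / 24
    return math.sin(angle), math.cos(angle)

# Hours 23 and 0 are adjacent after encoding, but far apart as raw values
sin23, cos23 = encode_hour(23)   # ~ (-0.259, 0.966)
sin0, cos0 = encode_hour(0)      # (0.0, 1.0)
```

The same pattern works for any cyclic variable (day of week, month, wind direction): divide by the period, multiply by 2π, and take (sin, cos).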
6.6.4 Step 3: Model Selection
| Model | MAE (deg C) | Size (KB) | Inference (ms) | Fits 512KB? |
|---|---|---|---|---|
| Linear Regression | 1.8 | 2 | 0.1 | Yes |
| Random Forest (100 trees) | 0.9 | 450 | 25 | Yes (tight) |
| LSTM (2 layers, 32 units) | 0.7 | 200 | 50 | Yes |
| Quantized LSTM (INT8) | 0.8 | 50 | 15 | Yes |
Winner: Quantized LSTM - Only 0.1 deg C accuracy loss, 4x smaller, 3x faster
6.6.5 Step 4: Quantization
import numpy as np
import tensorflow as tf

# Train full-precision LSTM
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, return_sequences=True, input_shape=(60, 14)),
    tf.keras.layers.LSTM(16),
    tf.keras.layers.Dense(1)  # Predict temperature
])
model.compile(optimizer='adam', loss='mse')
model.fit(X_train, y_train, epochs=50, validation_data=(X_val, y_val))

# Representative dataset must match the LSTM input shape (1, 60, 14)
def representative_data_gen():
    for sample in X_val[:100]:
        yield [sample.reshape(1, 60, 14).astype(np.float32)]

# Convert to TensorFlow Lite with INT8 quantization
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
tflite_model = converter.convert()

# Save quantized model
with open('hvac_predictor_int8.tflite', 'wb') as f:
    f.write(tflite_model)
# Result: 200KB → 50KB (75% reduction)

6.6.6 Step 5: Results
Deployment Validation:
| Metric | Target | Achieved | Status |
|---|---|---|---|
| Model size | < 512KB | 50KB | Pass |
| Inference latency | < 1000ms | 15ms | Pass (66x margin) |
| MAE accuracy | < 1.0 deg C | 0.8 deg C | Pass |
| Works offline | Required | Yes | Pass |
Energy Savings:
Pre-deployment (reactive HVAC):
- HVAC cycles: 24 per day (start/stop at setpoint)
- Energy waste: HVAC runs until room reaches setpoint, then overshoots
Post-deployment (predictive HVAC):
- HVAC pre-starts 30-45 min before needed
- Reduces cycling: 24 → 18 cycles/day (25% fewer)
Monthly energy comparison:
- Baseline: 1,200 kWh/month
- Predictive: 1,020 kWh/month
- Savings: 180 kWh/month (15% reduction)
Annual savings (10-story office building):
- 180 kWh × 12 months × $0.12/kWh = $259/floor/year
- 10 floors = $2,590/year
- Edge gateway cost: $150 one-time
- Payback period: 0.7 months
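The savings arithmetic above can be verified directly; all figures come from the lists in this step:

```python
monthly_savings_kwh = 1_200 - 1_020                          # 180 kWh/month
annual_savings_per_floor = monthly_savings_kwh * 12 * 0.12   # $259.20/floor/year
building_annual_savings = annual_savings_per_floor * 10      # ~ $2,592/year
payback_months = 150 / (building_annual_savings / 12)        # ~ 0.69 months
```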
6.7 Key Takeaways from HVAC Example
- Feature engineering drives accuracy: Lag features contributed 57% of model importance
- Quantization is nearly free: INT8 reduced size by 75% with only 0.1 deg C loss
- Time-series requires chronological splits: Random splits cause data leakage
- Cyclic encoding for time: sin/cos encoding preserves circular relationships
- Business value justifies complexity: 15% energy savings pays for development in < 1 month
6.8 Real-World Deployment: Sony’s Spresense and Edge AI for Wildlife Conservation
Wildlife Conservation Society (WCS) deployed edge ML on Sony’s Spresense microcontroller board (Cortex-M4F, 768 KB SRAM, $25/unit) to detect illegal chainsaw activity in tropical rainforests. The system illustrates the full edge ML deployment pipeline in a resource-constrained, real-world environment.
Problem: Illegal logging accounts for 15-30% of global timber trade. Traditional monitoring uses satellite imagery with 3-7 day update cycles – by the time deforestation is detected, the loggers are gone. Audio monitoring can detect chainsaws in real-time, but requires ML inference in remote locations with no internet and solar-only power.
Model optimization journey:
| Stage | Model | Size | Accuracy | Inference Time | Power |
|---|---|---|---|---|---|
| Cloud prototype | ResNet-18 (PyTorch) | 44 MB | 97.2% | 180 ms (GPU) | N/A |
| Compressed | MobileNetV2 (TF Lite) | 3.4 MB | 95.8% | 420 ms (RPi 4) | 3.2 W |
| Quantized INT8 | MobileNetV2 (TF Lite Micro) | 680 KB | 94.1% | 280 ms (Spresense) | 45 mW |
| Pruned + quantized | Custom CNN (TF Lite Micro) | 210 KB | 91.6% | 85 ms (Spresense) | 32 mW |
The final 210 KB model fits comfortably in the Spresense’s 768 KB SRAM with room for audio buffers and firmware. The 5.6% accuracy reduction from the cloud prototype was acceptable because the system’s value comes from real-time detection (minutes, not days) rather than perfect classification.
Deployment results (Peruvian Amazon, 12 months):
- 50 solar-powered nodes covering 2,500 hectares
- Battery life: 14 months on 3,000 mAh LiPo + 1W solar panel (duty-cycled: 5 seconds of audio analysis every 30 seconds)
- True positive rate: 89% for chainsaws at distances up to 300 meters
- False positive rate: 2.1 per node per day (mainly heavy rain on metal roofs)
- Response time: Alert transmitted via LoRa to ranger station within 90 seconds of detection
- Outcome: 7 illegal logging operations intercepted, estimated 180 hectares of forest saved
The key lesson: a 91.6% accurate model running in real-time on a $25 microcontroller delivered more conservation value than a 97.2% accurate model that would have required $500 in gateway hardware, cellular connectivity, and cloud processing per node – making the 50-node deployment financially impossible.
6.9 How It Works: INT8 Quantization
INT8 quantization transforms a floating-point neural network into an integer-only version, reducing model size by 4x with minimal accuracy loss. Here is the complete process (see the “Putting Numbers to It” callout above for the detailed math):
Step 1: Calibration – Feed 100-1000 representative samples from the validation set through the FP32 model, recording min/max activation values for each layer. For an activity classifier: Layer 1 activations range [-2.3, 4.7], Layer 2 ranges [-1.8, 3.2], etc.
Step 2: Scale and zero-point calculation – For each layer, compute scale = (max - min) / 255 and zero_point = -round(min / scale). These map the continuous FP32 range onto 256 discrete INT8 values.
Step 3: Weight quantization – Convert each FP32 weight to INT8: q_weight = clip(round(fp32_weight / scale) + zero_point, -128, 127). Each weight shrinks from 4 bytes to 1 byte.
Step 4: Runtime inference – All arithmetic uses integer operations. Multiply-accumulate becomes: q_output = (q_input * q_weight) >> shift where shift adjusts for accumulated scaling. No floating-point operations – this enables deployment on microcontrollers without an FPU.
Step 5: Dequantization (output layer only) – Convert INT8 output back to FP32: fp32_output = (q_output - zero_point) * scale. For classification, apply temperature scaling to calibrate softmax probabilities.
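Steps 3 to 5 can be illustrated end-to-end with a toy dot product in plain Python. This sketch uses symmetric quantization (zero point fixed at 0) to keep the rescaling readable; the weights, inputs, output range, and 15-bit shift are illustrative choices, and production kernels (e.g. CMSIS-NN) use per-channel scales and rounding-aware shifts:

```python
def quantize_symmetric(values, num_bits=8):
    """Symmetric per-tensor quantization: zero point fixed at 0."""
    scale = max(abs(v) for v in values) / 127
    q = [max(-128, min(127, round(v / scale))) for v in values]
    return q, scale

# Toy neuron: four weights, four inputs
weights = [0.5, -0.25, 0.75, 0.1]
inputs = [1.2, -0.8, 0.4, 2.0]
q_w, s_w = quantize_symmetric(weights)
q_x, s_x = quantize_symmetric(inputs)

# Step 4: integer-only multiply-accumulate (32-bit accumulator on an MCU)
acc = sum(qx * qw for qx, qw in zip(q_x, q_w))

# Rescale with a fixed-point multiplier and right shift; no FPU needed.
# Output scale chosen to cover roughly [-2, 2] (illustrative).
s_out = 2.0 / 127
shift = 15
multiplier = round((s_x * s_w / s_out) * (1 << shift))
q_out = (acc * multiplier) >> shift

# Step 5: dequantize the output and compare with the float reference
out_fp = q_out * s_out                                   # ~ 1.29
float_ref = sum(w * x for w, x in zip(weights, inputs))  # = 1.3
```

The integer result lands within about 1% of the floating-point dot product, despite every runtime operation being integer arithmetic.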
Why it works: Neural networks are inherently robust to quantization noise. Precise floating-point values (e.g., 1.35472891) contribute no more accuracy than their rounded INT8 equivalents. The 1-3% typical accuracy loss comes from extreme activation values outside the calibration range, which get clipped.
Hardware acceleration: ARM Cortex-M processors include SIMD instructions (SMLAD, SMUAD) that perform 4x INT8 multiplications in a single cycle, making quantized inference 3-5x faster than FP32 on the same hardware.
6.10 Knowledge Check
Common Pitfalls
1. Quantising a model without post-quantisation accuracy evaluation
INT8 quantisation typically reduces accuracy by 1–3%, but for some model architectures or data distributions it can drop by 10–20%. Always evaluate quantised model accuracy on a representative test set before deploying.
2. Designing the ML pipeline without considering the target edge hardware first
A model designed for a high-end GPU and then ‘ported’ to a microcontroller is almost always too large and too slow. Start with the edge hardware constraints (RAM, flash, MIPS) and design the model architecture within those limits from the beginning.
3. Treating model deployment as a one-time task
IoT environments change (new noise sources, seasonal variation, equipment ageing) causing model accuracy to degrade. Plan for periodic model retraining and OTA model updates from the start of deployment.
4. Ignoring inference latency requirements in model selection
A 100 ms inference time is fine for batch analysis but unacceptable for real-time motor control requiring <10 ms response. Profile inference latency on target hardware during model development, not after deployment.
6.11 Summary
This chapter covered edge ML and TinyML deployment:
- Edge vs Cloud Decision: Use edge for low latency, privacy, and offline requirements
- Quantization: INT8 reduces model size by 4x with 1-3% accuracy loss
- Pruning: Remove 70% of weights with <2% accuracy loss
- Knowledge Distillation: Train small models to match large model performance
- HVAC Example: Quantized LSTM achieves 15% energy savings with 0.7-month payback
Key Insight: Start with the simplest model that meets constraints. Quantization is nearly free—always try it before abandoning edge deployment.
Key Takeaway
INT8 quantization is the single most impactful optimization for edge ML – it reduces model size by 4x and inference latency by 3x with only 1-3% accuracy loss. Always try quantization before concluding a model cannot run on edge hardware. The HVAC example demonstrates that even a 50KB quantized LSTM on a $150 edge gateway with only 512KB of RAM can deliver 15% energy savings with a payback period under one month.
For Kids: Meet the Sensor Squad!
Can a tiny computer be as smart as a big one? The Sensor Squad finds out!
Max the Microcontroller has a problem. He wants to predict when a room will get too hot so he can turn on the air conditioning early. But the super-smart brain (neural network) that can do this is TOO BIG to fit in his tiny memory!
“I only have 512 kilobytes of memory!” Max sighs. “That is like having a bookshelf that only fits 5 books, but the brain needs 200 books!”
Sammy the Sensor has an idea: “What if we make the brain SMALLER?”
They try three tricks:
Trick 1 - Shrink the Numbers (Quantization) Instead of using really precise numbers like 3.14159265, Max rounds everything to simpler numbers like 3. This makes the brain 4 times smaller!
“It is like drawing with 8 crayons instead of 32,” explains Lila the LED. “You lose a tiny bit of detail, but the picture still looks great!”
Trick 2 - Remove the Lazy Parts (Pruning) Some parts of the brain do almost nothing. Max removes 70% of the lazy connections, and the brain still works almost as well!
Trick 3 - Learn from the Expert (Distillation) A big expert brain teaches a tiny student brain. The student learns the shortcuts and becomes almost as smart!
After all three tricks, the brain goes from 200KB down to 50KB – it fits perfectly in Max’s memory! And it only takes 15 milliseconds to make a prediction.
Bella the Battery cheers: “And because the brain is so small, I barely use any energy running it! I can last for DAYS!”
The building saves 15% on electricity because Max can predict when to turn on the AC ahead of time, instead of waiting until the room is already too hot.
6.11.1 Try This at Home!
Draw a detailed picture of your pet (or favorite animal) using 32 colored pencils. Now try drawing the SAME picture using only 8 colors. It still looks like your pet, right? That is quantization! You used fewer colors (less precision) but kept the important information. Computers do the same thing with numbers to make smart brains fit in tiny devices.
6.12 Concept Relationships
Edge ML builds on:
- ML Fundamentals - Training vs inference separation enables edge deployment
- IoT ML Pipeline - Stages 4-5 (optimize, deploy) focus on edge preparation
- Edge Computing Architecture - Edge vs cloud decision framework
Edge ML enables:
- Audio Processing - Wake word detection on Cortex-M4 using quantized models
- Production ML - Real-time anomaly detection without cloud dependency
Parallel concepts:
- Model quantization (FP32 → INT8) ↔︎ Audio compression (PCM → MP3): Both sacrifice precision for size with minimal quality loss
- Pruning ↔︎ Feature selection: Both remove low-value components to improve efficiency
- Knowledge distillation (teacher → student) ↔︎ Transfer learning: Both leverage large models to improve small models
6.13 See Also
Chapter series:
- ML Fundamentals - Core concepts and edge vs cloud decision
- IoT ML Pipeline - Complete 7-step pipeline
- Audio Feature Processing - MFCC for voice recognition
- Production ML - Monitoring deployed models
Edge deployment resources:
- TensorFlow Lite - Official quantization and deployment guide
- TensorFlow Lite Micro - MCU deployment
- ARM CMSIS-NN - Optimized neural network kernels for Cortex-M
- Edge Impulse - End-to-end TinyML platform
Hardware platforms:
- ESP32 - $5 MCU with 520KB SRAM, Wi-Fi/Bluetooth
- Raspberry Pi 4 - $35-75 SBC with 4-8GB RAM, quad-core ARM
- NVIDIA Jetson Nano - $99 GPU accelerator with 4GB RAM
- Google Coral Dev Board - $149 Edge TPU accelerator
6.14 What’s Next
| Direction | Chapter | Link |
|---|---|---|
| Next | Audio Feature Processing | modeling-audio-features.html |
| Previous | IoT ML Pipeline | modeling-pipeline.html |
| Related | Feature Engineering | modeling-feature-engineering.html |
| Related | Production ML | modeling-production.html |