6 Edge ML and TinyML Deployment
6.1 Learning Objectives
By the end of this chapter, you will be able to:
- Decide Edge vs Cloud: Apply the decision framework to determine optimal ML deployment location
- Implement TinyML: Deploy quantized neural networks on microcontrollers
- Optimize Models: Use pruning, quantization, and knowledge distillation techniques
- Design Real-Time Systems: Build predictive control systems with edge ML
Key Concepts
- Model quantisation: Converting ML model weights from 32-bit floats to 8-bit integers (INT8) or 4-bit values, reducing model size by 4–8× and speeding up inference by 2–4× with minimal accuracy loss.
- Model pruning: Removing connections (weights near zero) or entire neurons from a trained neural network, reducing model size and inference cost while preserving most of the model’s accuracy.
- Knowledge distillation: Training a small ‘student’ model to mimic the outputs of a larger ‘teacher’ model, producing a compact model that captures the teacher’s knowledge in a fraction of the parameters.
- TensorFlow Lite (TFLite): A lightweight ML inference framework for mobile and embedded devices. Its microcontroller variant, TensorFlow Lite Micro, runs quantised models with a core runtime footprint as small as 16 KB.
- TinyML: The field of running ML inference on microcontrollers and other extremely resource-constrained devices (< 1 MB RAM), enabling on-device intelligence without cloud connectivity.
- ONNX (Open Neural Network Exchange): A cross-framework ML model format enabling models trained in PyTorch or TensorFlow to be exported and run in optimised edge inference runtimes.
For Beginners: Edge ML and TinyML Deployment
Normally, machine learning runs on powerful cloud servers. But what if your device is in a remote forest, a moving vehicle, or a patient’s wrist – with no reliable internet? Edge ML means running the intelligence directly on the device itself. TinyML takes this even further, squeezing models into microcontrollers with less memory than a single photo on your phone. Think of it like carrying a pocket calculator instead of calling a math professor every time you need an answer – it is less powerful, but instant and always available.
6.2 Prerequisites
- ML Fundamentals: Training vs inference concepts
- IoT ML Pipeline: 7-step ML pipeline
- Basic understanding of microcontrollers (ESP32, Cortex-M4)
Chapter Series: Modeling and Inferencing
This is part 4 of the IoT Machine Learning series:
- ML Fundamentals - Core concepts
- Mobile Sensing - HAR, transportation
- IoT ML Pipeline - 7-step pipeline
- Edge ML & Deployment (this chapter) - TinyML, quantization
- Audio Feature Processing - MFCC
- Feature Engineering - Feature design
- Production ML - Monitoring
6.3 Edge vs Cloud ML: Decision Framework
When to process data locally (edge) versus sending to the cloud is a critical architecture decision:
- Latency: Under 100 ms and safety-critical?
- Privacy: Sensitive personal or operational data?
- Connectivity: High bandwidth demand or intermittent links?
- Model fit: Can the optimised model run within the device budget?
Figure 6.1: Work through the latency, privacy, connectivity, and model-fit checks before committing inference to the edge.
6.3.1 Decision Factors
6.3.1.1 Latency
- Choose edge: Under 100 ms for safety-critical or tightly interactive responses
- Choose cloud: 1-5 seconds is acceptable
6.3.1.2 Privacy
- Choose edge: Sensitive health, financial, or operational data stays on-device
- Choose cloud: Data is anonymous or already aggregated
6.3.1.3 Bandwidth
- Choose edge: Video, audio, or other high-rate streams are expensive to upload
- Choose cloud: Low-rate telemetry such as temperature is cheap to transmit
6.3.1.4 Connectivity
- Choose edge: Links are intermittent or the device must keep working offline
- Choose cloud: The deployment is always connected
6.3.1.5 Model complexity
- Choose edge: Optimised models fit within about 1 MB
- Choose cloud: Large models need tens of megabytes or more
6.3.1.6 Cost
- Choose edge: Cloud inference would be expensive across many devices
- Choose cloud: Specialised edge hardware would dominate deployment cost
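The factors above can be folded into a simple rule-of-thumb helper. The following is a minimal sketch rather than a definitive policy: the `Workload` fields, the `recommend_deployment` name, and the way the rules are combined are assumptions made for illustration. Only the 100 ms latency threshold and the roughly 1 MB model budget come from the list above.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical description of an ML workload (field names are illustrative)."""
    max_latency_ms: float    # worst-case response time the application tolerates
    sensitive_data: bool     # health, financial, or operational data
    high_rate_stream: bool   # video, audio, or other expensive-to-upload data
    always_connected: bool   # False if links are intermittent or offline use is needed
    model_size_kb: float     # size of the optimised model


def recommend_deployment(w: Workload, edge_budget_kb: float = 1024.0) -> str:
    """Return 'edge', 'cloud', or 'hybrid' using the chapter's rules of thumb."""
    # Any of these conditions is a reason to keep inference on the device.
    edge_reasons = [
        w.max_latency_ms < 100,
        w.sensitive_data,
        w.high_rate_stream,
        not w.always_connected,
    ]
    fits_on_edge = w.model_size_kb <= edge_budget_kb

    if any(edge_reasons) and fits_on_edge:
        return "edge"
    if any(edge_reasons):
        # Edge is desirable but the model is too large: split the work.
        return "hybrid"
    return "cloud"


fall_detector = Workload(50, True, False, False, 120)
fleet_analytics = Workload(5000, False, False, True, 50_000)
print(recommend_deployment(fall_detector))    # edge
print(recommend_deployment(fleet_analytics))  # cloud
```

Cost is deliberately left out of this sketch because it depends on fleet size and pricing that the list does not quantify.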
6.3.2 Example Scenarios
6.3.2.1 Fall Detection
- Best deployment: Edge
- Why: Safety workflows need sub-100 ms response times
6.3.2.2 Voice Assistant Wake Word
- Best deployment: Edge
- Why: Always-listening audio should stay private on-device
6.3.2.3 Full Voice Command
- Best deployment: Cloud
- Why: Natural-language understanding needs much larger models
6.3.2.4 Industrial Anomaly Detection
- Best deployment: Hybrid
- Why: Edge raises the alert fast while cloud handles deeper root-cause analysis
6.3.2.5 Smart Thermostat
- Best deployment: Edge
- Why: The model is compact and must keep working offline
6.3.2.6 Fleet-wide Predictive Maintenance
- Best deployment: Cloud
- Why: Cross-device learning benefits from shared centralised data
Tradeoff: Model Accuracy vs Edge Deployability
Complex models (deep neural networks, large ensembles) can achieve 95-98% accuracy but require megabytes of RAM and GPU/NPU acceleration. Lightweight models (decision trees, quantized CNNs) typically achieve 88-93% accuracy but fit in less than 256 KB and run on basic MCUs. For battery-powered wearables requiring 5+ day battery life, choose lightweight models. For edge AI accelerators (Jetson, Coral) with consistent power, choose complex models. The sweet spot is INT8 quantization: 4x size reduction with only 1-3% accuracy loss.
6.4 TinyML and Model Quantization
TinyML enables machine learning on devices with < 1MB RAM and < 1MB storage. The key technique is quantization—reducing numerical precision.
6.4.1 Quantization Overview
- FP32: 4 bytes per parameter. Best for training and cloud inference.
- FP16: Half precision for GPU inference when memory is tighter.
- INT8: 4x smaller with a 1-3% accuracy tradeoff on edge devices.
- INT4: 8x smaller for aggressive MCU deployments with more accuracy loss.
Figure 6.2: Each step down in precision trades numerical detail for lower model size and faster inference.
6.4.1.5 FP32
- Bits: 32
- Size reduction: Baseline
- Accuracy loss: 0%
- Use case: Training and cloud inference
6.4.1.6 FP16
- Bits: 16
- Size reduction: 2x
- Accuracy loss: Less than 1%
- Use case: GPU inference
6.4.1.7 INT8
- Bits: 8
- Size reduction: 4x
- Accuracy loss: 1-3%
- Use case: Edge devices such as ESP32-class targets
6.4.1.8 INT4
- Bits: 4
- Size reduction: 8x
- Accuracy loss: 3-8%
- Use case: Very small MCUs such as Cortex-M4 deployments
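The size reductions in the table follow directly from the number of bits stored per parameter. As a rough sketch (weights only, ignoring file metadata and runtime overhead; the `model_size_kb` helper and the 125,000-parameter model are illustrative assumptions):

```python
BITS_PER_PARAM = {"FP32": 32, "FP16": 16, "INT8": 8, "INT4": 4}


def model_size_kb(num_params: int, precision: str) -> float:
    """Approximate weight storage in KB (1 KB = 1000 bytes) for a given precision."""
    return num_params * BITS_PER_PARAM[precision] / 8 / 1000


num_params = 125_000  # hypothetical model
for precision in BITS_PER_PARAM:
    size = model_size_kb(num_params, precision)
    reduction = model_size_kb(num_params, "FP32") / size
    print(f"{precision}: {size:.1f} KB ({reduction:.0f}x smaller than FP32)")
```

For this hypothetical model the FP32 weights take 500 KB and the INT8 weights 125 KB, which is the 4x reduction quoted for INT8.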
Putting Numbers to It
Quantization: Trading Precision for Efficiency
Quantization maps floating-point values to fixed-point integers. The math behind INT8 quantization shows the trade-off:
FP32 to INT8 Mapping: \[ \text{INT8 range} = [-128, 127] \quad \text{(256 discrete values)} \] \[ \text{FP32 range} = [w_{\min}, w_{\max}] \quad \text{(from model weights)} \]
Scale Factor: \[ S = \frac{w_{\max} - w_{\min}}{255} \quad \text{(quantization step size)} \]
Zero Point (offset that maps the smallest weight to -128, which also handles asymmetric ranges): \[ Z = -128 - \text{round}\left(\frac{w_{\min}}{S}\right) \quad \text{(INT8 code that represents 0 in FP32 space)} \]
Quantization Formula: \[ w_{\text{INT8}} = \text{clip}\left(\text{round}\left(\frac{w_{\text{FP32}}}{S}\right) + Z, -128, 127\right) \]
De-quantization (for inference): \[ w_{\text{FP32}} \approx S \times (w_{\text{INT8}} - Z) \]
Example: Model weight range [-0.5, 0.8] \[ S = \frac{0.8 - (-0.5)}{255} = 0.0051 \quad Z = -128 - \text{round}\left(\frac{-0.5}{0.0051}\right) = -128 + 98 = -30 \] \[ w = -0.2 \rightarrow \text{clip}\!\left(\text{round}\left(\frac{-0.2}{0.0051}\right) - 30, -128, 127\right) = \text{clip}(-69, -128, 127) = -69 \quad \text{(INT8)} \] De-quantizing recovers the weight to within one step: \[ 0.0051 \times (-69 - (-30)) = -0.199 \approx -0.2 \]
Quantization Error: \[ \epsilon = S \times 0.5 \approx 0.0025 \quad \text{(maximum per-weight rounding error)} \]
With 1000 weights, treat the rounding errors as independent and uniformly distributed on [-S/2, S/2], each with standard deviation S/√12. By the Central Limit Theorem the accumulated error is then approximately normal: \[ \epsilon_{\text{total}} \sim \mathcal{N}(0, \sigma^2), \quad \sigma = \frac{S}{\sqrt{12}} \times \sqrt{n} = \frac{0.0051}{\sqrt{12}} \times \sqrt{1000} \approx 0.047 \] This is far below the worst-case sum of \(n \times S/2 \approx 2.5\).
This explains why INT8 maintains 97-99% accuracy despite 4x compression—quantization errors are zero-mean and largely cancel out across layers.
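The mapping can be checked numerically. The sketch below is a minimal NumPy illustration, not the converter's implementation; the helper names are assumptions. It uses the signed convention Z = -128 - round(w_min / S), under which w_min lands on -128 and w_max on 127, and it reuses the [-0.5, 0.8] weight range from the example.

```python
import numpy as np


def quantization_params(w_min: float, w_max: float):
    """Scale and zero point for asymmetric INT8 quantization over [w_min, w_max]."""
    scale = (w_max - w_min) / 255.0
    # Signed convention: w_min lands on -128 and w_max on 127.
    zero_point = int(-128 - round(w_min / scale))
    return scale, zero_point


def quantize(w, scale: float, zero_point: int) -> np.ndarray:
    """FP32 -> INT8: divide by the scale, shift by the zero point, clip."""
    q = np.round(np.asarray(w) / scale) + zero_point
    return np.clip(q, -128, 127).astype(np.int8)


def dequantize(q, scale: float, zero_point: int) -> np.ndarray:
    """INT8 -> approximate FP32."""
    return scale * (np.asarray(q, dtype=np.float32) - zero_point)


scale, zero_point = quantization_params(-0.5, 0.8)
weights = np.array([-0.5, -0.2, 0.0, 0.8])
q = quantize(weights, scale, zero_point)
restored = dequantize(q, scale, zero_point)
max_error = np.max(np.abs(restored - weights))

print("scale:", scale, "zero point:", zero_point)
print("INT8 codes:", q)
print("max reconstruction error:", max_error, "bound S/2:", scale / 2)
```

With this convention the weight -0.2 is stored as -69, and every reconstruction error stays below half a quantization step.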
6.4.3 TensorFlow Lite Quantization
import numpy as np
import tensorflow as tf

# Assumes X_train, y_train, X_val, y_val, and X_calibration are NumPy arrays
# with 27 features per sample.

# Train full-precision model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(27,)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(5, activation='softmax')  # 5 activity classes
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
model.fit(X_train, y_train, epochs=50, validation_data=(X_val, y_val))

# Convert to TensorFlow Lite with INT8 quantization
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]

# Representative dataset for calibration (required for INT8)
def representative_data_gen():
    for sample in X_calibration[:100]:
        yield [sample.reshape(1, -1).astype(np.float32)]

converter.representative_dataset = representative_data_gen
# Restrict the converter to integer-only kernels (full-integer quantization)
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

# Convert and save
tflite_model = converter.convert()
with open('activity_classifier_int8.tflite', 'wb') as f:
    f.write(tflite_model)
# Result: roughly 4x smaller than the FP32 model (e.g., 500KB → 125KB)
6.4.4 Quantization Calibration
INT8 quantization requires a representative dataset to calibrate the mapping from FP32 to INT8 ranges:
# Calibration determines min/max values for each layer
# Using 100-1000 representative samples works well
def representative_data_gen():
    # Use validation data (NOT training data) for calibration
    for i in range(min(100, len(X_val))):
        yield [X_val[i:i+1].astype(np.float32)]
Calibration Best Practices:
- Use 100-1000 samples from validation set
- Include diverse examples covering all classes
- Do NOT use training data (causes overfit to training distribution)
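Conceptually, calibration just records the activation ranges that the representative samples produce. The sketch below illustrates the idea with a toy two-layer network in NumPy. The random weights, the layer sizes, and the `calibrate_ranges` helper are assumptions for illustration; this is not how the TensorFlow Lite converter is implemented internally.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy FP32 network: 27 inputs -> 16 hidden (ReLU) -> 5 outputs (illustrative sizes).
W1, b1 = rng.normal(size=(27, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, 5)), np.zeros(5)


def forward(x: np.ndarray) -> dict:
    """Run the toy network and return every tensor whose range must be calibrated."""
    hidden = np.maximum(x @ W1 + b1, 0.0)
    output = hidden @ W2 + b2
    return {"input": x, "hidden": hidden, "output": output}


def calibrate_ranges(samples: np.ndarray) -> dict:
    """Track the running min/max of each tensor over the representative samples."""
    ranges = {}
    for sample in samples:
        for name, tensor in forward(sample[np.newaxis, :]).items():
            lo, hi = float(tensor.min()), float(tensor.max())
            old_lo, old_hi = ranges.get(name, (lo, hi))
            ranges[name] = (min(lo, old_lo), max(hi, old_hi))
    return ranges


X_rep = rng.normal(size=(200, 27))  # stand-in for 100-1000 validation samples
ranges = calibrate_ranges(X_rep)
for name, (lo, hi) in ranges.items():
    scale = (hi - lo) / 255.0
    print(f"{name}: range [{lo:.2f}, {hi:.2f}], scale {scale:.4f}")
```

If the representative samples miss part of the real input distribution, the recorded ranges are too narrow and out-of-range activations are clipped at inference time, which is why diverse examples matter.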
6.5 Model Optimization Techniques
6.5.1 Pruning
Remove unimportant weights (set to zero):
import tensorflow_model_optimization as tfmot

# Apply pruning during training
prune_low_magnitude = tfmot.sparsity.keras.prune_low_magnitude
pruning_params = {
    'pruning_schedule': tfmot.sparsity.keras.PolynomialDecay(
        initial_sparsity=0.30,
        final_sparsity=0.70,
        begin_step=1000,
        end_step=5000
    )
}
model_for_pruning = prune_low_magnitude(model, **pruning_params)
model_for_pruning.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

# Train with pruning; the UpdatePruningStep callback is required,
# otherwise the pruning schedule never advances
model_for_pruning.fit(
    X_train, y_train, epochs=10,
    callbacks=[tfmot.sparsity.keras.UpdatePruningStep()]
)

# Remove the pruning wrappers before export
final_model = tfmot.sparsity.keras.strip_pruning(model_for_pruning)
# Result: 70% of weights become zero → ~3x compression (once the sparse weights are compressed) with <2% accuracy loss
6.5.2 Knowledge Distillation
Train a small “student” model to mimic a large “teacher” model:
- Teacher: Large reference model produces richer probability outputs.
- Soft labels: Confidence scores transfer more structure than hard labels alone.
- Student: Small deployment model learns the teacher’s behaviour with fewer parameters.
Figure 6.3: Distillation compresses a large model into a smaller student without starting from scratch.
Benefits:
- A distilled student can reach noticeably higher accuracy than the same architecture trained directly on hard labels (for example, 95% versus 90%)
- Soft labels from teacher contain more information than hard labels
- Enables deployment of small models with large-model accuracy
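A common way to train the student is to combine a soft-label term (matching the teacher's temperature-softened probabilities) with the usual hard-label cross-entropy. The NumPy sketch below shows one such loss. The function names, the temperature of 4, the 0.5 weighting, and the example logits are illustrative assumptions rather than values from this chapter.

```python
import numpy as np


def softmax(logits: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    """Temperature-scaled softmax; a higher temperature gives softer probabilities."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    exp = np.exp(z)
    return exp / exp.sum(axis=-1, keepdims=True)


def distillation_loss(student_logits, teacher_logits, hard_labels,
                      temperature: float = 4.0, alpha: float = 0.5) -> float:
    """Weighted sum of soft-label KL divergence and hard-label cross-entropy."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student_soft = softmax(student_logits, temperature)
    # KL(teacher || student) on the softened distributions, averaged over the batch.
    kl = np.sum(p_teacher * (np.log(p_teacher) - np.log(p_student_soft)), axis=-1).mean()

    p_student = softmax(student_logits)
    rows = np.arange(len(hard_labels))
    cross_entropy = -np.log(p_student[rows, hard_labels]).mean()

    # The commonly used T^2 factor keeps the soft-label term on a comparable scale.
    return float(alpha * temperature**2 * kl + (1 - alpha) * cross_entropy)


teacher = np.array([[6.0, 2.0, -1.0], [0.5, 4.0, 1.0]])
student = np.array([[3.0, 1.5, 0.0], [0.0, 2.0, 1.0]])
labels = np.array([0, 1])
loss = distillation_loss(student, teacher, labels)
print(loss)
```

The soft-label term is zero when the student reproduces the teacher's logits exactly, so minimising it pulls the student toward the teacher's full output distribution rather than only its top prediction.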
6.6 Worked Example: HVAC Predictive Control with Edge LSTM
Scenario: A smart building requires temperature predictions 60 minutes ahead for efficient HVAC pre-conditioning. The edge gateway has 512KB RAM and must work offline.
6.6.1 System Architecture
- Sensors: Indoor temperature, outdoor temperature, occupancy, and HVAC state.
- Feature prep: Lag features plus cyclic time encoding summarise recent context.
- Edge LSTM: Quantized model predicts the room state 60 minutes ahead.
- Controller: HVAC schedule adjusts before occupants feel the temperature drift.
Figure 6.4: The predictive-control loop keeps feature prep, inference, and actuation on the edge gateway.
6.6.2 Step 1: Data Preparation
Raw Data Collection (30 days of historical readings):
6.6.2.1 Total samples
- Value: 43,200
- Notes: 30 days × 24 hours × 60 minutes
6.6.2.2 Sampling rate
- Value: 1 minute
- Notes: Standard HVAC sensor frequency
6.6.2.3 Sensors
- Value: 4
- Notes: Indoor temp, outdoor temp, occupancy, HVAC state
6.6.2.4 Storage
- Value: ~2 MB
- Notes: CSV format with timestamps
Train/Validation/Test Split (time-series aware):
6.6.2.5 Training
- Samples: 30,240
- Share: 70%
- Date range: Days 1-21
6.6.2.6 Validation
- Samples: 6,480
- Share: 15%
- Date range: Days 22-25.5
6.6.2.7 Test
- Samples: 6,480
- Share: 15%
- Date range: Days 25.5-30
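A time-series-aware split simply cuts the data in chronological order instead of shuffling it. A minimal sketch, assuming the readings are already sorted by timestamp (the `chronological_split` helper is illustrative, and zeros stand in for real sensor data):

```python
import numpy as np


def chronological_split(data: np.ndarray, train_frac: float = 0.70, val_frac: float = 0.15):
    """Split time-ordered rows into train/validation/test without shuffling."""
    n = len(data)
    train_end = int(round(n * train_frac))
    val_end = train_end + int(round(n * val_frac))
    return data[:train_end], data[train_end:val_end], data[val_end:]


# 30 days of 1-minute readings from 4 sensors (zeros stand in for real data).
readings = np.zeros((30 * 24 * 60, 4))
train, val, test = chronological_split(readings)
print(len(train), len(val), len(test))  # 30240 6480 6480
```

Because every validation and test row is later in time than every training row, the model is never evaluated on a period it has already seen.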
6.6.3 Step 2: Feature Engineering
Feature Set (14 features from 4 raw sensors; the seven features listed below account for 0.95 of total importance):
6.6.3.1 temp_lag_1h
- Type: Numeric
- Importance: 0.35
- Why it matters: Strong autocorrelation
6.6.3.2 temp_lag_30m
- Type: Numeric
- Importance: 0.22
- Why it matters: Recent trend indicator
6.6.3.3 outdoor_temp
- Type: Numeric
- Importance: 0.18
- Why it matters: Heat-transfer driver
6.6.3.4 outdoor_delta_1h
- Type: Numeric
- Importance: 0.08
- Why it matters: Predicts future heat load
6.6.3.5 hour_sin
- Type: Cyclic
- Importance: 0.05
- Why it matters: Captures the daily temperature cycle
6.6.3.6 hour_cos
- Type: Cyclic
- Importance: 0.03
- Why it matters: Captures daily cycle phase
6.6.3.7 occupancy
- Type: Binary
- Importance: 0.04
- Why it matters: Represents body heat and door openings
Why Cyclic Encoding for Time?
Problem: Hour of day (0-23) looks like an ordinary number but is circular (11 PM is close to 1 AM).
Bad approach: Raw hour value (0-23) → Model thinks 0 and 23 are far apart!
Good approach: Encode as a (sin, cos) pair, using angle = 2π × hour / 24:
- Hour 0: sin=0, cos=1
- Hour 6: sin=1, cos=0
- Hour 12: sin=0, cos=-1
- Hour 23: sin≈-0.26, cos≈0.97 (close to hour 0!)
This preserves the circular relationship.
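The encoding is a few lines of code. The sketch below uses only the standard library; the helper names are illustrative.

```python
import math


def encode_hour(hour: float) -> tuple:
    """Map an hour of day (0-24) onto the unit circle as (sin, cos)."""
    angle = 2 * math.pi * hour / 24
    return math.sin(angle), math.cos(angle)


def circular_distance(hour_a: float, hour_b: float) -> float:
    """Euclidean distance between two encoded hours."""
    (sa, ca), (sb, cb) = encode_hour(hour_a), encode_hour(hour_b)
    return math.hypot(sa - sb, ca - cb)


print(encode_hour(23))            # approximately (-0.26, 0.97)
print(circular_distance(23, 0))   # small: 11 PM is next to midnight
print(circular_distance(12, 0))   # approximately 2.0: noon is farthest from midnight
```

Both components are needed: sin alone cannot distinguish hour 0 from hour 12, and cos alone cannot distinguish hour 6 from hour 18.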
6.6.4 Step 3: Model Selection
6.6.4.1 Linear Regression
- MAE: 1.8 deg C
- Size: 2 KB
- Inference: 0.1 ms
- Fits 512 KB? Yes
6.6.4.2 Random Forest (100 trees)
- MAE: 0.9 deg C
- Size: 450 KB
- Inference: 25 ms
- Fits 512 KB? Yes, but tightly
6.6.4.3 LSTM (2 layers, 32 units)
- MAE: 0.7 deg C
- Size: 200 KB
- Inference: 50 ms
- Fits 512 KB? Yes
6.6.4.4 Quantized LSTM (INT8)
- MAE: 0.8 deg C
- Size: 50 KB
- Inference: 15 ms
- Fits 512 KB? Yes
Winner: Quantized LSTM - Only 0.1 deg C accuracy loss, 4x smaller, 3x faster
6.6.5 Step 4: Quantization
import numpy as np
import tensorflow as tf

# Train full-precision LSTM on windows of 60 time steps x 14 features
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(32, return_sequences=True, input_shape=(60, 14)),
    tf.keras.layers.LSTM(16),
    tf.keras.layers.Dense(1)  # Predict temperature
])
model.compile(optimizer='adam', loss='mse')
model.fit(X_train, y_train, epochs=50, validation_data=(X_val, y_val))

# Calibration windows of shape (1, 60, 14) drawn from the validation set
def representative_data_gen():
    for i in range(min(100, len(X_val))):
        yield [X_val[i:i+1].astype(np.float32)]

# Convert to TensorFlow Lite with INT8 quantization
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data_gen
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
tflite_model = converter.convert()

# Save quantized model
with open('hvac_predictor_int8.tflite', 'wb') as f:
    f.write(tflite_model)
# Result: 200KB → 50KB (75% reduction)
6.6.6 Step 5: Results
Deployment Validation:
6.6.6.1 Model size
- Target: Less than 512 KB
- Achieved: 50 KB
- Status: Pass
6.6.6.2 Inference latency
- Target: Less than 1000 ms
- Achieved: 15 ms
- Status: Pass with 66x margin
6.6.6.3 MAE accuracy
- Target: Less than 1.0 deg C
- Achieved: 0.8 deg C
- Status: Pass
6.6.6.4 Offline operation
- Target: Required
- Achieved: Yes
- Status: Pass
Energy Savings:
6.6.6.5 Pre-deployment baseline
- HVAC cycles: 24 per day
- Behaviour: The system starts and stops at setpoint, then overshoots and wastes energy
6.6.6.6 Post-deployment behaviour
- HVAC pre-start: 30-45 minutes before needed
- Cycling reduction: 24 to 18 cycles per day, or 25% fewer cycles
6.6.6.7 Monthly energy comparison
- Baseline: 1,200 kWh per month
- Predictive: 1,020 kWh per month
- Savings: 180 kWh per month, or 15%
6.6.6.8 Annual economics
- Per floor: 180 kWh x 12 months x $0.12 per kWh = $259 per year
- Ten floors: $2,590 per year
- Edge gateway cost: $150 one-time
- Payback period: 0.7 months ($150 / ($2,590 / 12 months), assuming one gateway serves all ten floors)
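The economics can be reproduced with a few lines of arithmetic. In this sketch the `payback_months` helper is illustrative, and the input figures are the ones listed above. The 0.7-month payback corresponds to a single $150 gateway serving all ten floors; one gateway per floor gives a longer but still short payback.

```python
def payback_months(hardware_cost: float, monthly_kwh_saved: float,
                   price_per_kwh: float, floors: int = 1) -> float:
    """Months needed for energy savings to cover a one-time hardware cost."""
    monthly_savings = monthly_kwh_saved * price_per_kwh * floors
    return hardware_cost / monthly_savings


baseline_kwh, predictive_kwh = 1200, 1020
saved_kwh = baseline_kwh - predictive_kwh   # 180 kWh per floor per month
annual_per_floor = saved_kwh * 12 * 0.12    # about $259 per year

print(round(annual_per_floor, 2))
print(round(payback_months(150, saved_kwh, 0.12, floors=10), 2))  # one gateway, ten floors
print(round(payback_months(150, saved_kwh, 0.12, floors=1), 2))   # one gateway per floor
```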
6.7 Key Takeaways from HVAC Example
- Feature engineering drives accuracy: Lag features contributed 57% of model importance
- Quantization is nearly free: INT8 reduced size by 75% with only 0.1 deg C loss
- Time-series requires chronological splits: Random splits cause data leakage
- Cyclic encoding for time: sin/cos encoding preserves circular relationships
- Business value justifies complexity: 15% energy savings pays back the $150 gateway in < 1 month
6.8 Real-World Deployment: Sony’s Spresense and Edge AI for Wildlife Conservation
Wildlife Conservation Society (WCS) deployed edge ML on Sony’s Spresense microcontroller board (Cortex-M4F, 768 KB SRAM, $25/unit) to detect illegal chainsaw activity in tropical rainforests. The system illustrates the full edge ML deployment pipeline in a resource-constrained, real-world environment.
Problem: Illegal logging accounts for 15-30% of global timber trade. Traditional monitoring uses satellite imagery with 3-7 day update cycles – by the time deforestation is detected, the loggers are gone. Audio monitoring can detect chainsaws in real-time, but requires ML inference in remote locations with no internet and solar-only power.
Model optimization journey:
6.8.0.1 Cloud prototype
- Model: ResNet-18 (PyTorch)
- Size: 44 MB
- Accuracy: 97.2%
- Inference: 180 ms on GPU
- Power: N/A
6.8.0.2 Compressed
- Model: MobileNetV2 (TF Lite)
- Size: 3.4 MB
- Accuracy: 95.8%
- Inference: 420 ms on Raspberry Pi 4
- Power: 3.2 W
6.8.0.3 Quantized INT8
- Model: MobileNetV2 (TF Lite Micro)
- Size: 680 KB
- Accuracy: 94.1%
- Inference: 280 ms on Spresense
- Power: 45 mW
6.8.0.4 Pruned + quantized
- Model: Custom CNN (TF Lite Micro)
- Size: 210 KB
- Accuracy: 91.6%
- Inference: 85 ms on Spresense
- Power: 32 mW
The final 210 KB model fits comfortably in the Spresense’s 768 KB SRAM with room for audio buffers and firmware. The 5.6 percentage-point accuracy reduction from the cloud prototype was acceptable because the system’s value comes from real-time detection (minutes, not days) rather than perfect classification.
Deployment results (Peruvian Amazon, 12 months):
- 50 solar-powered nodes covering 2,500 hectares
- Battery life: 14 months on 3,000 mAh LiPo + 1W solar panel (duty-cycled: 5 seconds of audio analysis every 30 seconds)
- True positive rate: 89% for chainsaws at distances up to 300 meters
- False positives: 2.1 per node per day (mainly heavy rain on metal roofs)
- Response time: Alert transmitted via LoRa to ranger station within 90 seconds of detection
- Outcome: 7 illegal logging operations intercepted, estimated 180 hectares of forest saved
The key lesson: a 91.6% accurate model running in real-time on a $25 microcontroller delivered more conservation value than a 97.2% accurate model that would have required $500 in gateway hardware, cellular connectivity, and cloud processing per node – making the 50-node deployment financially impossible.
6.9 How It Works: INT8 Quantization
INT8 quantization transforms a floating-point neural network into an integer-only version, reducing model size by 4x with minimal accuracy loss. Here is the complete process (see the “Putting Numbers to It” callout above for the detailed math):
Step 1: Calibration – Feed 100-1000 representative samples from the validation set through the FP32 model, recording min/max activation values for each layer. For an activity classifier: Layer 1 activations range [-2.3, 4.7], Layer 2 ranges [-1.8, 3.2], etc.
Step 2: Scale and zero-point calculation – For each layer, compute scale = (max - min) / 255 and zero_point = -128 - round(min / scale), so that min maps to -128 and max maps to 127. These map the continuous FP32 range onto 256 discrete INT8 values.
Step 3: Weight quantization – Convert each FP32 weight to INT8: q_weight = clip(round(fp32_weight / scale) + zero_point, -128, 127). Each weight shrinks from 4 bytes to 1 byte.
Step 4: Runtime inference – All arithmetic uses integer operations. Each multiply-accumulate subtracts the zero points and sums the INT8 products in a 32-bit accumulator: acc += (q_input - input_zero_point) * (q_weight - weight_zero_point). The accumulator is then rescaled to the output's INT8 range with a fixed-point multiplier and a right shift: q_output = ((acc * multiplier) >> shift) + output_zero_point. No floating-point operations – this enables deployment on microcontrollers without an FPU.
Step 5: Dequantization (output layer only) – Convert INT8 output back to FP32: fp32_output = (q_output - zero_point) * scale. For classification, temperature scaling can optionally be applied if calibrated softmax probabilities are needed.
Why it works: Neural networks are inherently robust to quantization noise. Extra floating-point digits (e.g., 1.35472891) rarely contribute meaningfully more accuracy than the nearest INT8 level. The 1-3% typical accuracy loss comes from accumulated rounding error and from extreme activation values outside the calibration range, which get clipped.
Hardware acceleration: ARM Cortex-M cores with the DSP extension (such as Cortex-M4 and Cortex-M7) include SIMD instructions (SMLAD, SMUAD) that perform two 16-bit multiplications in a single instruction. Optimized kernels such as CMSIS-NN widen INT8 operands to 16 bits to use them, which can make quantized inference several times (typically 3-5x) faster than FP32 on the same hardware.
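The integer-only inference step can be sketched in NumPy. This is a simplified illustration under stated assumptions, not the TensorFlow Lite Micro kernel: the scales are hypothetical, the zero points are set to 0 for simplicity, and the floating-point scale ratio is folded into an integer multiplier and a right shift ahead of time.

```python
import numpy as np


def quantize(x, scale, zero_point):
    """FP32 -> INT8 with the given scale and zero point."""
    q = np.round(np.asarray(x) / scale) + zero_point
    return np.clip(q, -128, 127).astype(np.int8)


def int8_dense(q_x, q_w, x_zp, w_zp, multiplier, shift, out_zp):
    """Integer-only dense layer: INT8 inputs and weights, INT32 accumulation."""
    # Widen to INT32 before subtracting zero points so nothing overflows.
    acc = (q_x.astype(np.int32) - x_zp) @ (q_w.astype(np.int32) - w_zp)
    # Rescale with a fixed-point multiplier and a rounding right shift.
    rounding = np.int64(1) << (shift - 1)
    scaled = (acc.astype(np.int64) * multiplier + rounding) >> shift
    return np.clip(scaled + out_zp, -128, 127).astype(np.int8)


rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(1, 8))
w = rng.uniform(-0.5, 0.5, size=(8, 4))

x_scale, w_scale, out_scale = 2 / 255, 1 / 255, 8 / 255  # hypothetical calibrated scales
x_zp = w_zp = out_zp = 0                                  # symmetric ranges for simplicity

# Fold the three FP32 scales into one integer multiplier ahead of time.
shift = 24
multiplier = int(round(x_scale * w_scale / out_scale * (1 << shift)))

q_out = int8_dense(quantize(x, x_scale, x_zp), quantize(w, w_scale, w_zp),
                   x_zp, w_zp, multiplier, shift, out_zp)
approx = (q_out.astype(np.float32) - out_zp) * out_scale  # dequantize for comparison
print("max abs error vs FP32:", np.max(np.abs(approx - x @ w)))
```

Only the final comparison uses floating point; the layer itself touches nothing but integers, which is what lets it run on cores without an FPU.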
6.10 Knowledge Check
Common Pitfalls
1. Quantising a model without post-quantisation accuracy evaluation
INT8 quantisation typically reduces accuracy by 1–3%, but for some model architectures or data distributions it can drop by 10–20%. Always evaluate quantised model accuracy on a representative test set before deploying.
2. Designing the ML pipeline without considering the target edge hardware first
A model designed for a high-end GPU and then ‘ported’ to a microcontroller is almost always too large and too slow. Start with the edge hardware constraints (RAM, flash, MIPS) and design the model architecture within those limits from the beginning.
3. Treating model deployment as a one-time task
IoT environments change (new noise sources, seasonal variation, equipment ageing) causing model accuracy to degrade. Plan for periodic model retraining and OTA model updates from the start of deployment.
4. Ignoring inference latency requirements in model selection
A 100 ms inference time is fine for batch analysis but unacceptable for real-time motor control requiring <10 ms response. Profile inference latency on target hardware during model development, not after deployment.
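Latency profiling can start with a small timing harness like the sketch below. It is written for a host-side Python prototype; on a microcontroller the same idea would use a hardware timer or cycle counter instead. The `profile_latency` and `meets_budget` helpers and the `dummy_model` stand-in are illustrative assumptions.

```python
import statistics
import time


def profile_latency(infer, sample, warmup: int = 5, runs: int = 50) -> dict:
    """Measure per-call latency of `infer(sample)` in milliseconds."""
    for _ in range(warmup):  # let caches and lazy initialisation settle
        infer(sample)
    timings_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        infer(sample)
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    timings_ms.sort()
    return {
        "median_ms": statistics.median(timings_ms),
        "p95_ms": timings_ms[int(0.95 * (runs - 1))],
        "max_ms": timings_ms[-1],
    }


def meets_budget(stats: dict, budget_ms: float) -> bool:
    """Compare the worst observed latency, not the average, against the budget."""
    return stats["max_ms"] <= budget_ms


def dummy_model(x):  # stand-in for a real interpreter call
    return sum(v * v for v in x)


stats = profile_latency(dummy_model, list(range(1000)))
print(stats, meets_budget(stats, budget_ms=10.0))
```

Checking the worst case rather than the median matters for control loops, where a single slow inference can miss a deadline.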
6.11 Summary
This chapter covered edge ML and TinyML deployment:
- Edge vs Cloud Decision: Use edge for low latency, privacy, and offline requirements
- Quantization: INT8 reduces model size by 4x with 1-3% accuracy loss
- Pruning: Remove 70% of weights with <2% accuracy loss
- Knowledge Distillation: Train small models to match large model performance
- HVAC Example: Quantized LSTM achieves 15% energy savings with 0.7-month payback
Key Insight: Start with the simplest model that meets constraints. Quantization is nearly free—always try it before abandoning edge deployment.
Key Takeaway
INT8 quantization is the single most impactful optimization for edge ML – it reduces model size by 4x and inference latency by about 3x with only 1-3% accuracy loss. Always try quantization before concluding a model cannot run on edge hardware. The HVAC example demonstrates that even a 50KB quantized LSTM on a $150 edge gateway with 512KB of RAM can deliver 15% energy savings with a payback period under one month.
For Kids: Meet the Sensor Squad!
Can a tiny computer be as smart as a big one? The Sensor Squad finds out!
Max the Microcontroller has a problem. He wants to predict when a room will get too hot so he can turn on the air conditioning early. But the super-smart brain (neural network) that can do this is TOO BIG to fit in his tiny memory!
“I only have 512 kilobytes of memory!” Max sighs. “That is like having a bookshelf that only fits 5 books, but the brain needs 200 books!”
Sammy the Sensor has an idea: “What if we make the brain SMALLER?”
They try three tricks:
Trick 1 - Shrink the Numbers (Quantization) Instead of using really precise numbers like 3.14159265, Max rounds everything to simpler numbers like 3. This makes the brain 4 times smaller!
“It is like drawing with 8 crayons instead of 32,” explains Lila the LED. “You lose a tiny bit of detail, but the picture still looks great!”
Trick 2 - Remove the Lazy Parts (Pruning) Some parts of the brain do almost nothing. Max removes 70% of the lazy connections, and the brain still works almost as well!
Trick 3 - Learn from the Expert (Distillation) A big expert brain teaches a tiny student brain. The student learns the shortcuts and becomes almost as smart!
After all three tricks, the brain goes from 200KB down to 50KB – it fits perfectly in Max’s memory! And it only takes 15 milliseconds to make a prediction.
Bella the Battery cheers: “And because the brain is so small, I barely use any energy running it! I can last for DAYS!”
The building saves 15% on electricity because Max can predict when to turn on the AC ahead of time, instead of waiting until the room is already too hot.
6.11.1 Try This at Home!
Draw a detailed picture of your pet (or favorite animal) using 32 colored pencils. Now try drawing the SAME picture using only 8 colors. It still looks like your pet, right? That is quantization! You used fewer colors (less precision) but kept the important information. Computers do the same thing with numbers to make smart brains fit in tiny devices.
6.12 Concept Relationships
Edge ML builds on:
- ML Fundamentals - Training vs inference separation enables edge deployment
- IoT ML Pipeline - Stages 4-5 (optimize, deploy) focus on edge preparation
- Edge Computing Architecture - Edge vs cloud decision framework
Edge ML enables:
- Audio Processing - Wake word detection on Cortex-M4 using quantized models
- Production ML - Real-time anomaly detection without cloud dependency
Parallel concepts:
- Model quantization (FP32 → INT8) ↔︎ Audio compression (PCM → MP3): Both sacrifice precision for size with minimal quality loss
- Pruning ↔︎ Feature selection: Both remove low-value components to improve efficiency
- Knowledge distillation (teacher → student) ↔︎ Transfer learning: Both leverage large models to improve small models
6.13 See Also
Chapter series:
- ML Fundamentals - Core concepts and edge vs cloud decision
- IoT ML Pipeline - Complete 7-step pipeline
- Audio Feature Processing - MFCC for voice recognition
- Production ML - Monitoring deployed models
Edge deployment resources:
- TensorFlow Lite - Official quantization and deployment guide
- TensorFlow Lite Micro - MCU deployment
- ARM CMSIS-NN - Optimized neural network kernels for Cortex-M
- Edge Impulse - End-to-end TinyML platform
Hardware platforms:
- ESP32 - $5 MCU with 520KB SRAM, Wi-Fi/Bluetooth
- Raspberry Pi 4 - $35-75 SBC with 4-8GB RAM, quad-core ARM
- NVIDIA Jetson Nano - $99 GPU accelerator with 4GB RAM
- Google Coral Dev Board - $149 Edge TPU accelerator