%%{init: {'theme': 'base', 'themeVariables': {'primaryColor':'#2C3E50','primaryTextColor':'#fff','primaryBorderColor':'#16A085','lineColor':'#16A085','secondaryColor':'#E67E22','tertiaryColor':'#7F8C8D','clusterBkg':'#ECF0F1','clusterBorder':'#16A085','edgeLabelBackground':'#ECF0F1'}}}%%
flowchart LR
subgraph input["Voice Input"]
A["Microphone<br/>Analog Signal<br/>20 Hz - 20 kHz"]
end
subgraph preprocessing["Preprocessing"]
B["Band-Pass Filter<br/>300 Hz - 3400 Hz"]
C["Anti-Aliasing<br/>Low-Pass 4 kHz"]
end
subgraph digitization["Digitization"]
D["Sample at 8 kHz"]
E["Quantize to<br/>8 bits (256 levels)"]
end
subgraph compression["Compression"]
F{"Compression<br/>Method?"}
G["Companding<br/>(μ-Law/A-Law)<br/>64 → 64 kbps"]
H["LPC Vocoder<br/>64 → 2-8 kbps"]
I["CELP/ACELP<br/>64 → 8-16 kbps"]
end
subgraph output["Compressed Output"]
J["Transmit over<br/>IoT Network"]
end
A --> B --> C --> D --> E --> F
F -->|"Simple"| G
F -->|"High Compression"| H
F -->|"Balanced"| I
G --> J
H --> J
I --> J
style input fill:#E67E22,stroke:#2C3E50,stroke-width:3px,color:#000
style preprocessing fill:#16A085,stroke:#2C3E50,stroke-width:2px,color:#fff
style digitization fill:#2C3E50,stroke:#16A085,stroke-width:2px,color:#fff
style compression fill:#7F8C8D,stroke:#2C3E50,stroke-width:2px,color:#fff
style output fill:#16A085,stroke:#2C3E50,stroke-width:3px,color:#fff
style A fill:#ECF0F1,stroke:#E67E22,stroke-width:2px,color:#000
style B fill:#ECF0F1,stroke:#16A085,stroke-width:1px,color:#000
style C fill:#ECF0F1,stroke:#16A085,stroke-width:1px,color:#000
style D fill:#ECF0F1,stroke:#2C3E50,stroke-width:1px,color:#000
style E fill:#ECF0F1,stroke:#2C3E50,stroke-width:1px,color:#000
style F fill:#ECF0F1,stroke:#7F8C8D,stroke-width:2px,color:#000
style G fill:#ECF0F1,stroke:#E67E22,stroke-width:1px,color:#000
style H fill:#ECF0F1,stroke:#16A085,stroke-width:1px,color:#000
style I fill:#ECF0F1,stroke:#16A085,stroke-width:1px,color:#000
style J fill:#ECF0F1,stroke:#16A085,stroke-width:2px,color:#000
68 Voice and Audio Compression for IoT
68.1 Learning Objectives
By the end of this chapter, you will be able to:
- Understand Toll Quality: Explain the 64 kbps baseline for voice transmission
- Apply Companding: Use μ-law and A-law compression for better audio SNR
- Understand LPC: Explain the source-filter model and how it achieves 8-27× compression
- Select Audio Codecs: Choose appropriate compression for IoT bandwidth constraints
- Calculate Compression Ratios: Determine bit rate requirements for voice IoT applications
Fundamentals:
- Signal Processing Overview - Sampling and Nyquist theorem
- Aliasing and ADC Resolution - Previous chapter
- Sensor Dynamics - Next in series
- Data Representation - Binary encoding

Networking:
- Long-Range Protocols - LoRaWAN bandwidth constraints
- Short-Range Protocols - BLE audio streaming

Practical:
- Signal Processing Labs - Hands-on experiments
68.2 Prerequisites
- Signal Processing Overview: Sampling rate and ADC concepts
- Aliasing and ADC: Understanding quantization
- Basic logarithms: Understanding log scales for companding formulas
68.3 Voice and Audio Compression for IoT
The Problem: Voice-enabled IoT devices—smart speakers, intercoms, wearable communicators, and industrial voice systems—need to transmit audio over constrained networks. Uncompressed audio is prohibitively expensive for IoT bandwidth and storage budgets.
The Scale of the Challenge:
- Telephone-quality audio: 8-bit samples at 8 kHz = 64 kbps (kilobits per second)
- Typical IoT uplink: 10-50 kbps (LoRaWAN, NB-IoT, BLE)
- Gap: Raw audio needs 1.3-6.4× the available bandwidth!
The Problem in Plain Terms: Imagine you have a walkie-talkie that can only send 50 small packages per second, but your voice generates 64 packages per second. You can’t keep up! You need to either speak slower (not practical) or pack more information into fewer packages (compression).
Real-World Analogy: Think of a newspaper headline vs. the full article. The headline captures the essential meaning in far fewer words. Audio compression does something similar—it captures the essential sounds while discarding information your ear won’t miss.
Why This Matters for IoT:
- Smart speakers: Need to stream voice commands to the cloud for processing
- Baby monitors: Continuous audio over Wi-Fi without clogging the network
- Industrial intercoms: Clear communication in noisy factories over low-bandwidth radio
- Wearable translators: Real-time voice translation requires fast, compressed transmission
Key Insight: Good compression reduces 64 kbps to 8 kbps or less—an 8× reduction—while keeping speech intelligible!
68.3.1 Toll Quality: The Baseline for Voice
Toll quality refers to telephone-standard voice quality established by telecom standards:
| Parameter | Value | Calculation |
|---|---|---|
| Sampling rate | 8,000 Hz | Voice frequencies 300-3400 Hz (Nyquist: >6800 Hz) |
| Bit depth | 8 bits | 256 quantization levels |
| Bit rate | 64 kbps | 8 bits × 8000 samples/sec |
| Bandwidth | 3.1 kHz | 300 Hz to 3400 Hz (telephone band) |
Why 8 kHz?: Human speech contains most intelligible content between 300-3400 Hz. Sampling at 8 kHz (twice 4 kHz) captures this range with margin.
Why 8 bits?: Early telephone systems used 8-bit quantization as a balance between quality and transmission cost. This became the G.711 PCM standard still used today.
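To make the gap concrete, here is a minimal Python sketch of the toll-quality bit-rate arithmetic against the uplink figures quoted earlier (representative values from this chapter, not measurements):

```python
# Toll-quality PCM bit rate vs. typical IoT uplinks (values from the text)
SAMPLE_RATE_HZ = 8_000    # covers the 300-3400 Hz telephone band with margin
BITS_PER_SAMPLE = 8       # G.711 PCM: 256 quantization levels

pcm_bps = SAMPLE_RATE_HZ * BITS_PER_SAMPLE
print(f"Toll-quality PCM: {pcm_bps // 1000} kbps")   # 64 kbps

for link, uplink_bps in [("IoT uplink, low end", 10_000),
                         ("IoT uplink, high end", 50_000)]:
    print(f"{link}: raw audio needs {pcm_bps / uplink_bps:.1f}x the bandwidth")
```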
Voice Compression Pipeline: Audio flows from microphone through band-pass filtering (telephone band 300-3400 Hz), anti-aliasing, 8 kHz sampling, 8-bit quantization, and then through one of three compression methods: companding (simple, preserves 64 kbps), LPC vocoder (high compression to 2-8 kbps), or CELP (balanced quality at 8-16 kbps). {fig-alt=“Flowchart showing complete voice compression pipeline for IoT. Input stage shows microphone capturing 20Hz-20kHz analog signal. Preprocessing applies band-pass filter (300-3400 Hz) and anti-aliasing low-pass at 4 kHz. Digitization samples at 8 kHz and quantizes to 8 bits. Compression decision point offers three paths: companding (mu-law/A-law, 64 kbps), LPC vocoder (2-8 kbps high compression), or CELP/ACELP (8-16 kbps balanced). All paths lead to compressed output for IoT network transmission.”}
68.3.2 Companding: Compression + Expansion
Companding (compression + expansion) is a technique that improves perceived audio quality without changing the bit rate. It exploits a fundamental property of human hearing:
The Key Insight: Human ears perceive loudness logarithmically, not linearly. A sound 10× louder doesn’t seem 10× louder—it seems about twice as loud. This is the Weber-Fechner law.
How Companding Works:
- Compression (transmitter): Apply logarithmic encoding to give more precision to quiet sounds
- Expansion (receiver): Apply inverse (exponential) decoding to restore original dynamics
%%{init: {'theme': 'base', 'themeVariables': {'primaryColor':'#2C3E50','primaryTextColor':'#fff','primaryBorderColor':'#16A085','lineColor':'#16A085','secondaryColor':'#E67E22','tertiaryColor':'#7F8C8D','clusterBkg':'#ECF0F1','clusterBorder':'#16A085','edgeLabelBackground':'#ECF0F1'}}}%%
flowchart TB
subgraph problem["The Problem with Linear Quantization"]
P1["Quiet sounds: Few bits<br/>→ High quantization noise"]
P2["Loud sounds: Many bits<br/>→ Wasteful precision"]
P3["SNR varies with<br/>signal amplitude"]
end
subgraph solution["Companding Solution"]
direction LR
S1["Compress<br/>(Logarithmic)<br/>Before ADC"]
S2["Uniform<br/>Quantization<br/>(8 bits)"]
S3["Expand<br/>(Exponential)<br/>After DAC"]
end
subgraph result["Result"]
R1["Quiet sounds: More bits<br/>→ Lower noise floor"]
R2["Loud sounds: Fewer bits<br/>→ Still adequate"]
R3["SNR constant across<br/>all amplitudes!"]
end
problem --> solution --> result
subgraph standards["Regional Standards"]
MU["μ-Law (mu-law)<br/>• North America & Japan<br/>• μ = 255<br/>• Slightly better for low signals"]
AL["A-Law<br/>• Europe & rest of world<br/>• A = 87.6<br/>• Slightly better dynamic range"]
end
solution --> standards
style problem fill:#E67E22,stroke:#2C3E50,stroke-width:3px,color:#000
style solution fill:#16A085,stroke:#2C3E50,stroke-width:3px,color:#fff
style result fill:#2C3E50,stroke:#16A085,stroke-width:3px,color:#fff
style standards fill:#7F8C8D,stroke:#2C3E50,stroke-width:2px,color:#fff
style P1 fill:#ECF0F1,stroke:#E67E22,stroke-width:1px,color:#000
style P2 fill:#ECF0F1,stroke:#E67E22,stroke-width:1px,color:#000
style P3 fill:#ECF0F1,stroke:#E67E22,stroke-width:1px,color:#000
style S1 fill:#ECF0F1,stroke:#16A085,stroke-width:2px,color:#000
style S2 fill:#ECF0F1,stroke:#16A085,stroke-width:2px,color:#000
style S3 fill:#ECF0F1,stroke:#16A085,stroke-width:2px,color:#000
style R1 fill:#ECF0F1,stroke:#2C3E50,stroke-width:1px,color:#000
style R2 fill:#ECF0F1,stroke:#2C3E50,stroke-width:1px,color:#000
style R3 fill:#ECF0F1,stroke:#2C3E50,stroke-width:1px,color:#000
style MU fill:#ECF0F1,stroke:#E67E22,stroke-width:2px,color:#000
style AL fill:#ECF0F1,stroke:#16A085,stroke-width:2px,color:#000
Companding Principle: Linear quantization wastes bits on loud sounds while under-representing quiet sounds. Companding applies logarithmic compression before digitization and exponential expansion after, achieving constant signal-to-noise ratio (SNR) across all amplitudes. Two standards exist: μ-law (North America/Japan, μ=255) and A-law (Europe/international, A=87.6). {fig-alt=“Three-stage diagram explaining companding. Problem section (orange): linear quantization gives quiet sounds few bits (high noise), loud sounds many bits (wasteful), and SNR varies with amplitude. Solution section (teal): compress with logarithmic function before ADC, uniform 8-bit quantization, expand with exponential function after DAC. Result section (navy): quiet sounds get more bits (lower noise), loud sounds get fewer but adequate bits, SNR is constant across all amplitudes. Standards section shows μ-law (North America/Japan, μ=255) and A-law (Europe, A=87.6).”}
The μ-Law Formula (North America, Japan):
\[F(x) = \text{sgn}(x) \cdot \frac{\ln(1 + \mu |x|)}{\ln(1 + \mu)}\]
Where:
- \(x\) = normalized input (-1 to +1)
- \(\mu\) = compression parameter (typically 255)
- \(\text{sgn}(x)\) = sign of x (+1 or -1)
The A-Law Formula (Europe, international):
\[F(x) = \text{sgn}(x) \cdot \begin{cases} \frac{A|x|}{1 + \ln(A)} & |x| < \frac{1}{A} \\ \frac{1 + \ln(A|x|)}{1 + \ln(A)} & |x| \geq \frac{1}{A} \end{cases}\]
Where A = 87.6 (standard value)
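Both formulas translate directly into code. Below is a minimal NumPy sketch of continuous μ-law and A-law companding plus a small SNR demonstration. Note that production G.711 codecs use segmented piecewise-linear approximations of these curves; the helper names (`quantize`, `snr_db`) and the quiet 440 Hz test tone are illustrative assumptions.

```python
import numpy as np

MU = 255.0   # mu-law parameter (North America/Japan)
A = 87.6     # A-law parameter (Europe/international)

def mu_law_compress(x, mu=MU):
    """Continuous mu-law compressor; x is normalized to [-1, +1]."""
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

def mu_law_expand(y, mu=MU):
    """Inverse of mu_law_compress."""
    return np.sign(y) * ((1.0 + mu) ** np.abs(y) - 1.0) / mu

def a_law_compress(x, a=A):
    """Continuous A-law compressor (piecewise, as in the formula above)."""
    ax, denom = np.abs(x), 1.0 + np.log(a)
    small = ax < 1.0 / a
    y = np.where(small, a * ax / denom,
                 (1.0 + np.log(np.where(small, 1.0, a * ax))) / denom)
    return np.sign(x) * y

def a_law_expand(y, a=A):
    """Inverse of a_law_compress."""
    ay, denom = np.abs(y), 1.0 + np.log(a)
    small = ay < 1.0 / denom
    x = np.where(small, ay * denom / a,
                 np.exp(np.where(small, 0.0, ay * denom - 1.0)) / a)
    return np.sign(y) * x

def quantize(v, bits=8):
    """Uniform quantizer on [-1, +1]."""
    levels = 2 ** (bits - 1)
    return np.round(v * levels) / levels

def snr_db(clean, noisy):
    return 10 * np.log10(np.sum(clean ** 2) / np.sum((clean - noisy) ** 2))

# Demo: 8-bit quantization of a quiet tone, with and without companding
t = np.arange(0, 0.02, 1 / 8000)
x = 0.01 * np.sin(2 * np.pi * 440 * t)   # quiet signal: 1% of full scale
linear = quantize(x)
companded = mu_law_expand(quantize(mu_law_compress(x)))
print(f"Linear 8-bit SNR: {snr_db(x, linear):.1f} dB")
print(f"mu-law 8-bit SNR: {snr_db(x, companded):.1f} dB")
```

On this quiet tone the companded path should show a markedly higher SNR (on the order of tens of dB better under these assumptions), which is exactly the point of the technique.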
Comparison: μ-Law vs A-Law:
| Aspect | μ-Law | A-Law |
|---|---|---|
| Region | North America, Japan | Europe, rest of world |
| Parameter | μ = 255 | A = 87.6 |
| Low-level signals | Slightly better SNR | Slightly worse |
| Dynamic range | Slightly less | Slightly more |
| Standard | ITU-T G.711 | ITU-T G.711 |
Important: Companding doesn’t change the bit rate (still 64 kbps), but it dramatically improves perceived quality by matching quantization precision to human hearing sensitivity.
68.3.3 Linear Predictive Coding (LPC): 8× Compression
The Breakthrough: Instead of transmitting the audio waveform directly, transmit parameters that describe how to synthesize it. This is the key to achieving 8× or greater compression.
The Source-Filter Model of Speech:
LPC is based on a revolutionary model of human speech developed by Gunnar Fant (KTH, Sweden) and Bishnu Atal (AT&T Bell Labs):
%%{init: {'theme': 'base', 'themeVariables': {'primaryColor':'#2C3E50','primaryTextColor':'#fff','primaryBorderColor':'#16A085','lineColor':'#16A085','secondaryColor':'#E67E22','tertiaryColor':'#7F8C8D','clusterBkg':'#ECF0F1','clusterBorder':'#16A085','edgeLabelBackground':'#ECF0F1'}}}%%
flowchart LR
subgraph source["Excitation Source"]
E1["Voiced Sounds<br/>(vowels, m, n, l)<br/>Periodic impulse train<br/>from vocal cords"]
E2["Unvoiced Sounds<br/>(s, f, sh, t)<br/>Random noise<br/>from turbulent airflow"]
end
subgraph filter["Vocal Tract Filter"]
F1["Lips, tongue, jaw<br/>shape the sound"]
F2["Modeled as<br/>All-Pole Filter<br/>10-16 coefficients"]
end
subgraph output["Speech Output"]
O1["Intelligible<br/>Speech Signal"]
end
E1 --> F1
E2 --> F1
F1 --> F2 --> O1
subgraph transmission["What LPC Transmits"]
T1["• Voiced/Unvoiced flag (1 bit)"]
T2["• Pitch period (6-7 bits)"]
T3["• Filter coefficients (10-16 × 4-6 bits)"]
T4["• Gain (5-6 bits)"]
T5["Total: ~50-80 bits per 20ms frame"]
end
output --> transmission
style source fill:#E67E22,stroke:#2C3E50,stroke-width:3px,color:#000
style filter fill:#16A085,stroke:#2C3E50,stroke-width:3px,color:#fff
style output fill:#2C3E50,stroke:#16A085,stroke-width:3px,color:#fff
style transmission fill:#7F8C8D,stroke:#2C3E50,stroke-width:2px,color:#fff
style E1 fill:#ECF0F1,stroke:#E67E22,stroke-width:2px,color:#000
style E2 fill:#ECF0F1,stroke:#E67E22,stroke-width:2px,color:#000
style F1 fill:#ECF0F1,stroke:#16A085,stroke-width:1px,color:#000
style F2 fill:#ECF0F1,stroke:#16A085,stroke-width:2px,color:#000
style O1 fill:#ECF0F1,stroke:#2C3E50,stroke-width:2px,color:#000
style T1 fill:#ECF0F1,stroke:#7F8C8D,stroke-width:1px,color:#000
style T2 fill:#ECF0F1,stroke:#7F8C8D,stroke-width:1px,color:#000
style T3 fill:#ECF0F1,stroke:#7F8C8D,stroke-width:1px,color:#000
style T4 fill:#ECF0F1,stroke:#7F8C8D,stroke-width:1px,color:#000
style T5 fill:#ECF0F1,stroke:#7F8C8D,stroke-width:1px,color:#000
Source-Filter Model: Human speech is modeled as an excitation source (vocal cords for voiced sounds, turbulent noise for unvoiced) filtered by the vocal tract (lips, tongue, jaw). LPC transmits only the source type, pitch, filter coefficients, and gain—about 50-80 bits per 20ms frame instead of 1280 bits for raw PCM. {fig-alt=“Diagram of source-filter speech model used in LPC. Source section (orange) shows two excitation types: voiced sounds (vowels, m, n, l) as periodic impulse train from vocal cords, and unvoiced sounds (s, f, sh, t) as random noise from turbulent airflow. Filter section (teal) shows vocal tract shaping by lips/tongue/jaw, modeled as all-pole filter with 10-16 coefficients. Output (navy) is intelligible speech. Transmission section (gray) lists what LPC sends: voiced/unvoiced flag (1 bit), pitch period (6-7 bits), filter coefficients (10-16 × 4-6 bits), gain (5-6 bits), totaling 50-80 bits per 20ms frame.”}
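To make the "transmit filter parameters" idea concrete, here is a minimal sketch of the analysis side of LPC using the autocorrelation method with Levinson-Durbin recursion. The frame length, filter order, Hamming window, and synthetic "vowel" test signal are illustrative assumptions; LPC-10 itself specifies its own windowing, quantization, and coefficient format.

```python
import numpy as np

def lpc_coefficients(frame, order=10):
    """Estimate LPC coefficients by the autocorrelation method
    with Levinson-Durbin recursion (analysis side only)."""
    frame = frame * np.hamming(len(frame))              # taper frame edges
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    a = np.zeros(order + 1)
    a[0] = 1.0                                          # A(z) = 1 + a1*z^-1 + ...
    err = r[0]                                          # prediction error energy
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err  # reflection coefficient
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]             # update lower-order coeffs
        a[i] = k
        err *= 1.0 - k * k
    return a, err

# One 20 ms frame (160 samples at 8 kHz) of a crude two-formant "vowel"
fs = 8000
t = np.arange(160) / fs
frame = np.sin(2 * np.pi * 150 * t) + 0.5 * np.sin(2 * np.pi * 700 * t)
a, gain_sq = lpc_coefficients(frame, order=10)
print("Predictor coefficients:", np.round(a[1:], 3))
print("Residual energy (gain^2):", round(float(gain_sq), 4))
```

The decoder rebuilds speech by driving the all-pole filter 1/A(z) with either an impulse train (voiced) or noise (unvoiced) at the transmitted gain.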
The LPC Compression Calculation:
Raw PCM (Toll Quality):
- Sampling rate: 8,000 Hz
- Bits per sample: 8 bits
- Bit rate: 8 × 8,000 = 64,000 bps = 64 kbps

LPC-10 (US Government Standard):
- Frame size: 20 ms (160 samples)
- Bits per frame:
  - Voiced/unvoiced: 1 bit
  - Pitch: 6 bits
  - Gain: 5 bits
  - 10 reflection coefficients: ~36 bits
  - Total: ~48 bits per frame
- Frames per second: 50 (1000 ms / 20 ms)
- Bit rate: 48 × 50 = 2,400 bps = 2.4 kbps
Compression Ratio: 64 kbps / 2.4 kbps = 26.7:1
That’s just 0.3 bits per sample (2,400 bps ÷ 8,000 samples/s) instead of 8 bits per sample!
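The frame-budget arithmetic can be verified in a few lines (bit allocations copied from the breakdown above):

```python
# LPC-10 frame budget (bits per 20 ms frame, values from the text)
bits_per_frame = 1 + 6 + 5 + 36      # voiced/unvoiced + pitch + gain + 10 coefficients
frames_per_second = 1000 // 20       # 20 ms frames -> 50 frames/s
bit_rate_bps = bits_per_frame * frames_per_second

print(f"Bit rate: {bit_rate_bps} bps")                      # 2400
print(f"Compression ratio: {64_000 / bit_rate_bps:.1f}:1")  # 26.7:1
print(f"Bits per sample: {bit_rate_bps / 8_000:.2f}")       # 0.30
```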
Trade-off: LPC-10 sounds “robotic” because it perfectly models the vocal tract but loses the fine nuances that make voices sound natural. It’s intelligible but not pleasant for long conversations.
Modern LPC Variants for IoT:
| Codec | Bit Rate | Quality | Use Case |
|---|---|---|---|
| LPC-10 | 2.4 kbps | Robotic, intelligible | Military secure voice, extreme low bandwidth |
| CELP | 4.8-16 kbps | Good | Early digital cellular |
| AMR-NB | 4.75-12.2 kbps | Good to excellent | GSM/3G voice calls |
| Opus | 6-510 kbps | Excellent | VoIP, smart speakers, gaming |
| Codec2 | 0.7-3.2 kbps | Fair | Amateur radio, IoT ultra-low bandwidth |
68.3.4 Practical IoT Voice Applications
Application 1: Smart Doorbell
- Constraint: Wi-Fi connected, but video already consumes bandwidth
- Solution: Use Opus at 16 kbps for voice (vs 64 kbps PCM)
- Savings: 75% bandwidth reduction, more headroom for video

Application 2: LoRaWAN Voice Alert System
- Constraint: 250 bps effective data rate
- Solution: Pre-recorded vocabulary + Codec2 at 700 bps
- Approach: Don’t stream; buffer 2-3 second messages and transmit over 20-30 seconds (see the timing sketch after Application 3)

Application 3: Wearable Translator
- Constraint: BLE bandwidth ~1 Mbps, but must be real-time
- Solution: AMR-WB at 12.65 kbps for high quality
- Result: Multiple simultaneous voice streams possible
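A quick sanity check on Application 2’s store-and-forward timing; the 3-second message length and the 2× overhead factor for headers, duty-cycle gaps, and retries are assumptions for illustration:

```python
# Store-and-forward timing for a LoRaWAN voice alert (Application 2)
message_seconds = 3.0      # buffered message length (assumed)
codec_bps = 700            # Codec2 lowest-rate mode from the text
link_bps = 250             # effective LoRaWAN uplink from the text
overhead_factor = 2.0      # assumed allowance for headers/duty cycle/retries

payload_bits = message_seconds * codec_bps
transmit_seconds = payload_bits / link_bps * overhead_factor
print(f"{payload_bits:.0f} bits take ~{transmit_seconds:.0f} s to deliver")
# ~17 s here; stricter duty-cycle limits push this toward the 20-30 s quoted
```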
Scenario: You’re designing a voice-enabled industrial IoT intercom system for a noisy factory floor. Requirements:
- Communication range: 500 meters (using LoRa radio, 10 kbps effective)
- Voice quality: Must be intelligible over background noise
- Latency: <500 ms end-to-end acceptable
- Battery: Solar-powered, always listening
Questions:
- Can you transmit raw 64 kbps PCM audio? Why or why not?
- Which compression approach would you choose: companding, LPC-10, or CELP?
- What bit rate is achievable, and what’s the compression ratio?
Answers:
Q1: Can you transmit 64 kbps PCM?
- No. Your radio supports 10 kbps, but 64 kbps audio requires 6.4× the available bandwidth.
- Even with protocol overhead removed, you need >6× compression at minimum.
Q2: Which compression approach?
| Approach | Bit Rate | Pros | Cons |
|---|---|---|---|
| Companding | 64 kbps | Best quality | Too high—still 6.4× over budget |
| LPC-10 | 2.4 kbps | Fits easily (24% of channel) | Robotic, hard to understand in noise |
| CELP/AMR | 4.75-8 kbps | Good quality, fits channel | 47-80% of channel used |
| Codec2 | 1.2-3.2 kbps | Ultra-low, open source | Lower quality, but intelligible |
Best choice: CELP/AMR at ~6 kbps (or Codec2 at 3.2 kbps for safety margin)
Reasoning:
- 6 kbps uses 60% of the 10 kbps channel, leaving room for headers and retransmits
- CELP quality is acceptable even in noisy environments
- Latency budget allows for ~100-200 ms encoding buffer

Q3: Compression ratio:
- Raw: 64 kbps
- Compressed (CELP @ 6 kbps): 6 kbps
- Ratio: 64/6 = 10.7:1 compression
With Codec2 @ 3.2 kbps: 20:1 compression
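A minimal feasibility check over the candidate codecs (bit rates taken from the table above; the 6 kbps CELP/AMR operating point is the one chosen in the reasoning):

```python
# Does each codec fit the 10 kbps LoRa channel from the scenario?
LINK_BPS = 10_000
RAW_BPS = 64_000

candidates = [("Companded PCM", 64_000), ("LPC-10", 2_400),
              ("CELP/AMR", 6_000), ("Codec2", 3_200)]

for name, bps in candidates:
    verdict = "fits" if bps <= LINK_BPS else "too big"
    print(f"{name:14s} {RAW_BPS / bps:5.1f}:1  "
          f"{bps / LINK_BPS:4.0%} of channel  ({verdict})")
```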
Key Insight: Voice compression makes the difference between “impossible” (64 kbps over 10 kbps link) and “comfortable” (6 kbps with 40% margin).
68.3.5 Summary: Voice Compression for IoT
| Technique | Compression | Bit Rate | Quality | IoT Use Case |
|---|---|---|---|---|
| Raw PCM | 1:1 | 64 kbps | Perfect | Local processing only |
| Companding (μ/A-law) | ~1:1* | 64 kbps | Excellent | High-bandwidth Wi-Fi devices |
| LPC-10 | 27:1 | 2.4 kbps | Robotic | Emergency/military systems |
| CELP/AMR | 4-8:1 | 8-16 kbps | Good | Smart speakers, intercoms |
| Opus | Variable | 6-64 kbps | Excellent | VoIP, wearables |
| Codec2 | 20-90:1 | 0.7-3.2 kbps | Fair | LoRa, ultra-low bandwidth |
*Companding improves quality but doesn’t reduce bit rate
Key Takeaways:
- Toll quality baseline: 64 kbps (8 bits × 8 kHz)
- Companding: Logarithmic encoding improves SNR for quiet sounds (μ-law vs A-law)
- Source-filter model: Speech = excitation (voiced/unvoiced) + vocal tract filter
- LPC magic: Transmit filter parameters instead of waveform → 8-27× compression
- IoT applications: Match codec to available bandwidth (Codec2 for LoRa, Opus for Wi-Fi)
68.4 What’s Next
- Sensor Dynamics - Understanding sensor temporal response
- Signal Processing Labs - Hands-on experiments
- Networking Protocols - Understanding bandwidth constraints