30 Voice and Audio Compression for IoT
30.1 Learning Objectives
By the end of this chapter, you will be able to:
- Derive Toll-Quality Bit Rates: Calculate the 64 kbps baseline from sampling rate and bit depth parameters
- Apply Companding: Use mu-law and A-law logarithmic encoding to improve SNR for quiet audio signals
- Analyse LPC Models: Decompose speech into excitation source and vocal tract filter to explain 8-27x compression
- Select Audio Codecs for IoT: Evaluate Opus, CELP, AMR, and Codec2 against bandwidth, quality, and latency constraints
- Estimate Real-World Bandwidth: Factor protocol overhead and duty-cycle limits into codec bit-rate calculations
- Design Voice Pipelines: Architect end-to-end voice compression solutions for constrained IoT networks
Related Chapters
Fundamentals:
- Signal Processing Overview - Sampling and Nyquist theorem
- Aliasing and ADC Resolution - Previous chapter
- Sensor Dynamics - Next in series
- Data Representation - Binary encoding
Networking:
- Long-Range Protocols - LoRaWAN bandwidth constraints
- Short-Range Protocols - BLE audio streaming
Practical:
- Signal Processing Labs - Hands-on experiments
30.2 Prerequisites
- Signal Processing Overview: Sampling rate and ADC concepts
- Aliasing and ADC: Understanding quantization
- Basic logarithms: Understanding log scales for companding formulas
30.3 Voice and Audio Compression for IoT
Key Concepts
- 64 kbps baseline: Toll-quality PCM starts at 8 kHz and 8 bits per sample, so raw voice is already too large for many IoT uplinks before packet overhead is added.
- Companding is quality shaping, not bandwidth reduction: μ-law and A-law give more quantization resolution to quiet sounds, but the stream still stays at 64 kbps.
- Speech models create the real compression: LPC, CELP, AMR, Opus, and Codec2 shrink the stream by sending speech parameters or efficient codebooks instead of every raw sample.
- Bandwidth must be budgeted end to end: Codec bit rate, framing overhead, retransmissions, and duty-cycle limits all matter when deciding whether a link can carry voice.
- Quality is multidimensional: Naturalness, intelligibility in noise, codec delay, CPU cost, and power draw all influence the right codec choice.
- Network fit matters more than codec popularity: Opus is excellent on Wi-Fi, AMR/CELP works on cellular links, and Codec2 is reserved for extremely constrained links where intelligibility matters more than natural sound.
The Problem: Voice-enabled IoT devices—smart speakers, intercoms, wearable communicators, and industrial voice systems—need to transmit audio over constrained networks. Uncompressed audio is prohibitively expensive for IoT bandwidth and storage budgets.
The Scale of the Challenge:
- Telephone-quality audio: 8-bit samples at 8 kHz = 64 kbps (kilobits per second)
- Typical IoT uplink: 10-50 kbps (LoRaWAN, NB-IoT, BLE)
- Gap: Raw audio exceeds available bandwidth by 1.3-6.4×!
For Beginners: Why Audio Compression Matters for IoT
The Problem in Plain Terms: Imagine you have a walkie-talkie that can only send 50 small packages per second, but your voice generates 64 packages per second. You can’t keep up! You need to either speak slower (not practical) or pack more information into fewer packages (compression).
Real-World Analogy: Think of a newspaper headline vs. the full article. The headline captures the essential meaning in far fewer words. Audio compression does something similar—it captures the essential sounds while discarding information your ear won’t miss.
Why This Matters for IoT:
- Smart speakers: Need to stream voice commands to the cloud for processing
- Baby monitors: Continuous audio over Wi-Fi without clogging the network
- Industrial intercoms: Clear communication in noisy factories over low-bandwidth radio
- Wearable translators: Real-time voice translation requires fast, compressed transmission
Key Insight: Good compression reduces 64 kbps to 8 kbps or less—an 8× reduction—while keeping speech intelligible!
30.3.1 Toll Quality: The Baseline for Voice
Toll quality refers to telephone-standard voice quality established by telecom standards:
- Sampling rate: 8,000 Hz, which safely captures the 300-3400 Hz telephone speech band.
- Bit depth: 8 bits, giving 256 quantization levels for each sample.
- Bit rate: 64 kbps, because 8 bits × 8,000 samples/second = 64,000 bits/second.
- Useful voice bandwidth: about 3.1 kHz, focused on intelligibility rather than hi-fi audio.
Why 8 kHz?: Human speech contains most intelligible content between 300-3400 Hz. Sampling at 8 kHz (twice 4 kHz) captures this range with margin.
Why 8 bits?: Early telephone systems used 8-bit quantization as a balance between quality and transmission cost. This became the G.711 PCM standard still used today.
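The two parameters multiply directly into the baseline rate; a one-line check in Python:

```python
# Toll-quality PCM parameters (ITU-T G.711 telephony)
SAMPLE_RATE_HZ = 8_000   # twice the 4 kHz band edge, with margin over the 3.4 kHz speech band
BITS_PER_SAMPLE = 8      # 256 quantization levels

bit_rate_bps = SAMPLE_RATE_HZ * BITS_PER_SAMPLE
print(bit_rate_bps)  # 64000 -> the 64 kbps baseline
```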
Voice Compression Pipeline: Audio flows from microphone through band-pass filtering (telephone band 300-3400 Hz), anti-aliasing, 8 kHz sampling, 8-bit quantization, and then through one of three compression methods: companding (simple, preserves 64 kbps), LPC vocoder (high compression to 2-8 kbps), or CELP (balanced quality at 8-16 kbps).
30.3.2 Companding: Compression + Expansion
Companding (compression + expansion) is a technique that improves perceived audio quality without changing the bit rate. It exploits a fundamental property of human hearing:
The Key Insight: Human ears perceive loudness logarithmically, not linearly. A sound 10× louder doesn’t seem 10× louder—it seems about twice as loud. This is the Weber-Fechner law.
How Companding Works:
- Compression (transmitter): Apply logarithmic encoding to give more precision to quiet sounds
- Expansion (receiver): Apply inverse (exponential) decoding to restore original dynamics
Companding Principle: Linear quantization wastes bits on loud sounds while under-representing quiet sounds. Companding applies logarithmic compression before digitization and exponential expansion after, achieving constant signal-to-noise ratio (SNR) across all amplitudes. Two standards exist: μ-law (North America/Japan, μ=255) and A-law (Europe/international, A=87.6).
The μ-Law Formula (North America, Japan):
F(x) = sgn(x) * ln(1 + μ|x|) / ln(1 + μ)
Where:
- \(x\) = normalized input (-1 to +1)
- \(\mu\) = compression parameter (typically 255)
- \(\text{sgn}(x)\) = sign of x (+1 or -1)
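A minimal Python sketch of the μ-law pair (the function names are mine, and real G.711 further quantizes this curve into 8-bit segments; this is just the continuous formula):

```python
import math

MU = 255  # standard North American / Japanese value

def mu_law_compress(x: float, mu: float = MU) -> float:
    """Map a normalized sample x in [-1, 1] through the mu-law curve."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)

def mu_law_expand(y: float, mu: float = MU) -> float:
    """Inverse of mu_law_compress: recover the original amplitude."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)

x = 0.1                     # a quiet sample at 10% of full scale
y = mu_law_compress(x)
print(round(y, 2))          # ~0.59: the quiet input is stretched to 59% of the range
print(round(mu_law_expand(y), 6))  # ~0.1: expansion restores the original amplitude
```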
Putting Numbers to It
Let’s calculate how μ-law companding improves signal-to-noise ratio for quiet audio signals.
Scenario: A whisper at -20 dBFS (10% of full scale) is digitized with 8-bit PCM versus μ-law (μ=255).
Linear 8-bit PCM:
- Full scale: 256 quantization levels
- Signal at 10%: Uses only \(256 \times 0.1 = 25.6\) levels (effectively 5 bits)
- Quantization SNR: \(6.02 \times 5 = 30\) dB (poor!)
μ-Law Companding: Apply the compression function: \(F(0.1) = \frac{\ln(1 + 255 \times 0.1)}{\ln(1 + 255)} = \frac{\ln(26.5)}{\ln(256)} = \frac{3.28}{5.55} = 0.59\)
The whisper now occupies 59% of the dynamic range, using \(256 \times 0.59 = 151\) levels (7.2 effective bits).
Quantization SNR: \(6.02 \times 7.2 = 43.3\) dB Improvement: \(43.3 - 30 = 13.3\) dB better SNR for quiet sounds!
This is why telephone voice sounds clear even when you whisper – μ-law companding gives quiet speech the same SNR as loud speech.
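The worked numbers above round the effective bit counts to 5 and 7.2; redoing the calculation without rounding (a sketch of the arithmetic, not the integer G.711 codec) gives a slightly larger improvement:

```python
import math

MU = 255

def snr_db(bits: float) -> float:
    """Rule-of-thumb quantization SNR: ~6.02 dB per effective bit."""
    return 6.02 * bits

# Linear 8-bit PCM: a whisper at 10% of full scale uses only 25.6 of 256 levels
lin_bits = math.log2(256 * 0.1)                      # ~4.68 effective bits

# mu-law: the same whisper is first stretched to ~59% of the range
compressed = math.log1p(MU * 0.1) / math.log1p(MU)   # ~0.591
mu_bits = math.log2(256 * compressed)                # ~7.24 effective bits

improvement = snr_db(mu_bits) - snr_db(lin_bits)
print(round(improvement, 1))   # ~15.4 dB with unrounded bit counts
```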
The A-Law Formula (Europe, international):
F(x) = sgn(x) * (A|x| / (1 + ln(A))) for |x| < 1/A, otherwise sgn(x) * (1 + ln(A|x|)) / (1 + ln(A))
Where A = 87.6 (standard value)
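A companion sketch of the A-law curve (again a continuous-formula illustration, not the segmented G.711 codec), including a check that the linear and logarithmic branches meet continuously at |x| = 1/A:

```python
import math

A = 87.6  # standard A-law parameter

def a_law_compress(x: float, a: float = A) -> float:
    """A-law curve: linear segment below |x| = 1/A, logarithmic above."""
    ax = abs(x)
    if ax < 1 / a:
        y = a * ax / (1 + math.log(a))
    else:
        y = (1 + math.log(a * ax)) / (1 + math.log(a))
    return math.copysign(y, x)

# The two branches join continuously at the 1/A breakpoint (~0.183 output)
eps = 1e-9
below = a_law_compress(1 / A - eps)
above = a_law_compress(1 / A + eps)
print(round(below, 4), round(above, 4))  # both ~0.1827
```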
Comparison: μ-Law vs A-Law:
- μ-Law: Used mainly in North America and Japan, with μ = 255, and gives slightly better protection for very quiet signals.
- A-Law: Used in Europe and much of the rest of the world, with A = 87.6, and is slightly gentler around zero amplitude.
- Shared standard: Both are part of ITU-T G.711, both target telephone speech, and both keep the stream at 64 kbps.
- Design takeaway: Choose the standard that matches the telephony ecosystem you must interoperate with; do not expect either one to solve a bandwidth bottleneck.
Important: Companding doesn’t change the bit rate (still 64 kbps), but it dramatically improves perceived quality by matching quantization precision to human hearing sensitivity.
30.3.3 Linear Predictive Coding (LPC): 8× Compression
The Breakthrough: Instead of transmitting the audio waveform directly, transmit parameters that describe how to synthesize it. This is the key to achieving 8× or greater compression.
The Source-Filter Model of Speech:
LPC is based on a revolutionary model of human speech developed by Gunnar Fant (KTH, Sweden) and Bishnu Atal (AT&T Bell Labs):
Source-Filter Model: Human speech is modeled as an excitation source (vocal cords for voiced sounds, turbulent noise for unvoiced) filtered by the vocal tract (lips, tongue, jaw). LPC transmits only the source type, pitch, filter coefficients, and gain—about 50-80 bits per 20ms frame instead of 1280 bits for raw PCM.
The LPC Compression Calculation:
Worked Example: LPC Compression Ratio
Raw PCM (Toll Quality):
- Sampling rate: 8,000 Hz
- Bits per sample: 8 bits
- Bit rate: 8 × 8,000 = 64,000 bps = 64 kbps
LPC-10 (US Government Standard):
- Frame size: 20 ms (160 samples)
- Bits per frame:
- Voiced/unvoiced: 1 bit
- Pitch: 6 bits
- Gain: 5 bits
- 10 reflection coefficients: ~36 bits
- Total: ~48 bits per frame
- Frames per second: 50 (1000 ms / 20 ms)
- Bit rate: 48 × 50 = 2,400 bps = 2.4 kbps
Compression Ratio: 64 kbps / 2.4 kbps = 26.7:1
That’s about 0.3 bits per sample instead of 8 bits per sample!
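The frame budget above can be tallied in a few lines, using the approximate bit counts from the breakdown:

```python
# LPC-10 frame bit budget (approximate values from the breakdown above)
FRAME_MS = 20
bits_per_frame = 1 + 6 + 5 + 36    # voiced flag + pitch + gain + 10 reflection coeffs
frames_per_sec = 1000 // FRAME_MS  # 50 frames/s

lpc_bps = bits_per_frame * frames_per_sec
pcm_bps = 8 * 8_000

print(lpc_bps)                      # 2400 bps
print(round(pcm_bps / lpc_bps, 1))  # 26.7x compression
print(lpc_bps / 8_000)              # 0.3 bits per sample vs 8 for raw PCM
```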
Trade-off: LPC-10 sounds “robotic” because it perfectly models the vocal tract but loses the fine nuances that make voices sound natural. It’s intelligible but not pleasant for long conversations.
Modern LPC Variants for IoT:
- LPC-10: 2.4 kbps, intelligible but robotic, used where link budget matters more than naturalness.
- CELP: 4.8-16 kbps, good quality, a strong fit for intercoms and older cellular voice systems.
- AMR-NB: 4.75-12.2 kbps, good to excellent for narrowband cellular IoT voice.
- Opus: 6-64 kbps, excellent quality and low delay, best on Wi-Fi, Ethernet, and stronger BLE links.
- Codec2: 0.7-3.2 kbps, rough but usable speech for ultra-constrained long-range links.
30.3.4 Practical IoT Voice Applications
Application 1: Smart Doorbell
- Constraint: Wi-Fi connected, but video already consumes bandwidth
- Solution: Use Opus at 16 kbps for voice (vs 64 kbps PCM)
- Savings: 75% bandwidth reduction, more headroom for video
Application 2: LoRaWAN Voice Alert System
- Constraint: 250 bps effective data rate
- Solution: Pre-recorded vocabulary + Codec2 at 700 bps
- Approach: Don’t stream—buffer 2-3 second messages, transmit over 20-30 seconds
Application 3: Wearable Translator
- Constraint: BLE bandwidth ~1 Mbps, but must be real-time
- Solution: AMR-WB at 12.65 kbps for high quality
- Result: Multiple simultaneous voice streams possible
Quiz: Voice Compression Design
Scenario: You’re designing a voice-enabled industrial IoT intercom system for a noisy factory floor. Requirements:
- Communication range: 500 meters (using LoRa radio, 10 kbps effective)
- Voice quality: Must be intelligible over background noise
- Latency: <500 ms end-to-end acceptable
- Battery: Solar-powered, always listening
Questions:
- Can you transmit raw 64 kbps PCM audio? Why or why not?
- Which compression approach would you choose: companding, LPC-10, or CELP?
- What bit rate is achievable, and what’s the compression ratio?
Answers:
Q1: Can you transmit 64 kbps PCM?
- No. Your radio supports 10 kbps, but 64 kbps audio requires 6.4× more bandwidth than available.
- Even with protocol overhead removed, you need >6× compression minimum.
Q2: Which compression approach?
- Companding: 64 kbps. Best quality on generous links, but still 6.4× over the factory radio budget.
- LPC-10: 2.4 kbps. Fits easily, but the robotic output will be hard to understand in heavy noise.
- CELP/AMR: 4.75-8 kbps. Strong balance of naturalness and bandwidth, leaving some headroom for headers and retransmissions.
- Codec2: 1.2-3.2 kbps. Safest for extreme link budgets, but quality is notably lower than CELP/AMR.
Best choice: CELP/AMR at ~6 kbps (or Codec2 at 3.2 kbps for safety margin)
Reasoning:
- 6 kbps uses 60% of 10 kbps channel, leaving room for headers and retransmits
- CELP quality is acceptable even in noisy environments
- Latency budget allows for ~100-200 ms encoding buffer
Q3: Compression ratio:
- Raw: 64 kbps
- Compressed (CELP @ 6 kbps): 6 kbps
- Ratio: 64/6 = 10.7:1 compression
With Codec2 @ 3.2 kbps: 20:1 compression
Key Insight: Voice compression makes the difference between “impossible” (64 kbps over 10 kbps link) and “comfortable” (6 kbps with 40% margin).
30.3.5 Summary: Voice Compression for IoT
- Raw PCM: 64 kbps, perfect waveform fidelity, suitable mainly when processing stays local.
- Companding (μ/A-law): still 64 kbps, excellent telephone speech quality, useful when you want better quiet-signal SNR without changing infrastructure.
- LPC-10: about 2.4 kbps, very high compression, intelligible but robotic.
- CELP/AMR: roughly 8-16 kbps, good balance for intercoms and cellular voice.
- Opus: 6-64 kbps, the best choice for high-quality low-latency voice on stronger links.
- Codec2: 0.7-3.2 kbps, specialized for ultra-low-bandwidth links where intelligibility matters more than natural sound.
30.4 Common Mistake: Confusing Codec Bit Rate with Application Bandwidth
Common Mistake: “Codec2 at 3.2 kbps Means I Need 3.2 kbps Network Bandwidth”
The Misconception: “My codec compresses voice to 3.2 kbps, so I can use it over a 5 kbps LoRaWAN link with room to spare.”
Why This Is Wrong:
Codec bit rate is payload only. Real network transmission adds protocol overhead that can double or triple the required bandwidth.
Worked Example: LoRaWAN Voice Link
Scenario: Transmit voice over LoRaWAN using Codec2 at 3.2 kbps
Naive calculation:
- Voice codec: 3.2 kbps
- Available bandwidth: 5.5 kbps (LoRaWAN SF7)
- Conclusion: “5.5 kbps > 3.2 kbps, it will work!” ❌ WRONG
Step 1: Calculate Protocol Overhead
Codec2 frame structure:
- Frame size: 20 ms
- Bits per frame: 3200 bps × 0.02 s = 64 bits = 8 bytes payload
LoRaWAN packet structure:
- Payload: 8 bytes
- Frame header (FHDR): 13 bytes
- MIC (authentication): 4 bytes
- Total packet: 8 + 13 + 4 = 25 bytes
Overhead ratio: 25 / 8 = 3.125× (protocol adds 212% overhead!)
Step 2: Calculate Real Bandwidth Requirement
Packet transmission:
- Packet size: 25 bytes
- Packet frequency: 1 every 20 ms = 50 packets/second
- Required bandwidth: 25 bytes × 8 bits × 50 /sec = 10,000 bps = 10 kbps
Comparison:
- Codec payload only: 3.2 kbps
- Including protocol overhead: 10 kbps (3.125× more!)
- LoRaWAN SF7 max: 5.5 kbps
- Result: System needs 182% of channel capacity ❌
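The overhead math from Steps 1-2, condensed (all values taken from the example above):

```python
# Per-frame overhead math for Codec2-over-LoRaWAN
CODEC_BPS = 3200
FRAME_S = 0.020
OVERHEAD_BYTES = 13 + 4          # frame header + MIC per packet

payload_bytes = int(CODEC_BPS * FRAME_S / 8)   # 8 bytes per 20 ms frame
packet_bytes = payload_bytes + OVERHEAD_BYTES  # 25 bytes on air
packets_per_sec = 1 / FRAME_S                  # 50 packets/s

required_bps = packet_bytes * 8 * packets_per_sec
print(required_bps)            # 10000.0 bps, ~3.1x the codec's 3200 bps
print(required_bps / 5500)     # ~1.82: needs 182% of the SF7 channel
```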
Step 3: Add Real-World Constraints
LoRaWAN duty cycle (EU868):
- Maximum: 1% duty cycle
- Available air time per hour: 36 seconds
- At 10 kbps: Can transmit for 36 seconds/hour
- Voice time available: 36 seconds every 60 minutes = 1% uptime
Clearly insufficient for continuous voice!
Step 4: Corrected Approach - Buffer and Burst
Solution: Don’t stream real-time voice. Buffer, compress heavily, transmit in bursts.
Revised design:
- Record voice: 10 seconds
- Compress with Codec2 @ 1.2 kbps (lowest mode): 10 s × 1,200 bits/s = 12,000 bits = 1,500 bytes
- Split into LoRaWAN max payload (51 bytes): 30 packets
- Include overhead: 30 × (51 + 17) = 2,040 bytes on air
Transmission time (SF9, 125 kHz):
- Air time per packet: ~400 ms
- Total: 30 × 400 ms = 12 seconds
Voice delay: 10 seconds recording + 12 seconds transmission = 22 seconds latency
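Recomputing the burst plan directly from the stated parameters (the ~400 ms per-packet airtime is the assumption used above):

```python
import math

# Buffer-and-burst budget for a 10-second voice message
RECORD_S = 10
CODEC_BPS = 1200      # Codec2 lowest mode
MAX_PAYLOAD = 51      # LoRaWAN max payload bytes at this data rate
OVERHEAD = 17         # header + MIC bytes per packet
AIRTIME_S = 0.4       # assumed airtime per packet at SF9 / 125 kHz

message_bytes = RECORD_S * CODEC_BPS // 8         # 1500 bytes
packets = math.ceil(message_bytes / MAX_PAYLOAD)  # 30 packets
on_air_bytes = packets * (MAX_PAYLOAD + OVERHEAD) # 2040 bytes
tx_time_s = packets * AIRTIME_S                   # ~12 s on air
print(packets, tx_time_s, RECORD_S + tx_time_s)   # packets, tx time, ~22 s total latency
```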
Acceptable for:
- Asynchronous voice messages (walkie-talkie style)
- Alert notifications with voice playback
- Remote site check-ins
NOT acceptable for:
- Real-time phone calls (<150 ms latency expected)
- Interactive two-way conversation
- Emergency voice channels
Comparison Table: Codec Payload vs Network Reality
- Wi-Fi: 16 kbps Opus stays close to 16.2 kbps after minimal RTP overhead, so real-time voice is practical.
- BLE: 16 kbps AMR-NB lands near 16.3 kbps, making short-range real-time voice practical.
- NB-IoT: 8 kbps CELP grows to about 8.5 kbps, so near-real-time voice can work with careful budgeting.
- LoRaWAN: 3.2 kbps Codec2 becomes roughly 10 kbps once LoRa MAC overhead is included, so streaming fails and burst mode is required.
- Sigfox: even 0.7 kbps Codec2 is effectively message-oriented, not conversational voice.
Key Lessons:
- Always factor protocol overhead into bandwidth calculations
- LoRaWAN: 2-3× overhead (frame headers, MIC, ACKs)
- NB-IoT/LTE-M: 1.5-2× overhead (UDP/IP/CoAP headers)
- Wi-Fi/Ethernet: 1.1-1.2× overhead (minimal packet wrapping)
- Duty cycle limits real throughput
- EU868 LoRaWAN: 1% duty cycle = 36 sec/hour usable
- US915 LoRaWAN: 4-second dwell time limits long transmissions
- NB-IoT: PSM mode limits continuous uplink
- Match application to network capabilities
- Wi-Fi / BLE: Real-time voice ✓
- Cellular IoT (NB-IoT, LTE-M): Compressed voice ✓
- LoRaWAN / Sigfox: Burst messages only (walkie-talkie mode)
- Don’t design for “theoretical bit rate”
- Measure actual air time and overhead in lab before deployment
- Account for retransmissions (LoRaWAN confirmed uplinks add 50-100% air time)
- Factor in collision probability (LoRaWAN ALOHA-based, ~5-10% packet loss at scale)
Takeaway: When the chapter says “Codec2 achieves 0.7-3.2 kbps,” that’s the compressed voice payload. Add 2-3× protocol overhead for total bandwidth requirement. For LoRaWAN voice, you need buffer-and-burst mode, not streaming.
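As a rough planning aid, the overhead ranges above can be encoded in a small helper. `OVERHEAD_FACTOR` and `required_bandwidth_kbps` are illustrative names, and the factors are rule-of-thumb midpoints rather than measurements (the worked LoRaWAN example lands nearer 3×):

```python
# Rule-of-thumb overhead multipliers (midpoints of the ranges listed above)
OVERHEAD_FACTOR = {
    "lorawan": 2.5,   # 2-3x: frame headers, MIC, ACKs
    "nb-iot": 1.75,   # 1.5-2x: UDP/IP/CoAP headers
    "wifi": 1.15,     # 1.1-1.2x: minimal packet wrapping
}

def required_bandwidth_kbps(codec_kbps: float, network: str) -> float:
    """Estimate total on-air bandwidth: codec payload rate plus protocol overhead."""
    return codec_kbps * OVERHEAD_FACTOR[network]

# Codec2 at 3.2 kbps over LoRaWAN really needs ~8 kbps or more, not 3.2
print(round(required_bandwidth_kbps(3.2, "lorawan"), 1))  # 8.0
print(round(required_bandwidth_kbps(16.0, "wifi"), 1))    # 18.4
```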
Key Takeaways:
- Toll quality baseline: 64 kbps (8 bits × 8 kHz)
- Companding: Logarithmic encoding improves SNR for quiet sounds (μ-law vs A-law)
- Source-filter model: Speech = excitation (voiced/unvoiced) + vocal tract filter
- LPC magic: Transmit filter parameters instead of waveform → 8-27× compression
- IoT applications: Match codec to available bandwidth (Codec2 for LoRa, Opus for Wi-Fi)
Key Takeaway
Voice compression transforms “impossible” IoT audio (64 kbps raw PCM over a 10 kbps link) into “comfortable” transmission by exploiting speech models. LPC’s source-filter model achieves 8-27x compression by transmitting vocal tract parameters instead of waveforms. For IoT codec selection, match to your available bandwidth: Opus for Wi-Fi devices (6-64 kbps, excellent quality), CELP/AMR for cellular IoT (8-16 kbps, good quality), and Codec2 for LoRa and ultra-low bandwidth links (0.7-3.2 kbps, intelligible speech).
For Kids: Meet the Sensor Squad!
Sammy the Sensor had a voice message to send, but the IoT radio was too slow!
“My voice recording is 64 kilobits per second, but our radio can only handle 10 kilobits!” Sammy worried.
Max the Microcontroller had three ideas:
“Plan A: Companding – I can make quiet sounds easier to hear by encoding them differently, but the file stays the same size. That doesn’t help with our speed problem.”
“Plan B: LPC magic – instead of recording every sound wave, I describe HOW you’re speaking! ‘Voiced sound, pitch 150 Hz, mouth shape #7.’ That’s like sending a recipe instead of the whole cake – 27 times smaller!”
Lila the LED was amazed: “From 64 kilobits down to just 2.4 kilobits? That’s like squeezing a big book into a text message!”
“Plan C: Codec2 – an open-source codec that compresses to just 0.7 kilobits. Perfect for our tiny LoRa radio!” Max beamed.
Bella the Battery smiled: “Less data to send means less radio time, which means I last longer too. Voice compression is a triple win – bandwidth, time, AND energy!”
30.5 Knowledge Check
Try It Yourself: Codec Selection
Time: ~10 min | Difficulty: Intermediate | No hardware needed
Challenge: You’re designing a voice-enabled smart doorbell. It connects via Wi-Fi (11 Mbps available) and streams audio to a cloud service for voice recognition. Choose between PCM (64 kbps) and Opus (16 kbps).
Steps:
- Calculate monthly data for 100 doorbells, each streaming 5 minutes of audio per day
- Compare cloud data costs: PCM vs Opus
- Verify Wi-Fi bandwidth is sufficient for both
30.5.1 Solution
Monthly data calculation:
- PCM: 64 kbps × 300 seconds/day × 30 days × 100 doorbells = 57.6 Gbit = 7.2 GB/month
- Opus: 16 kbps × 300 seconds/day × 30 days × 100 doorbells = 14.4 Gbit = 1.8 GB/month
- Savings: 5.4 GB/month = 75% reduction
Cloud cost (assuming $0.12/GB):
- PCM: $0.86/month
- Opus: $0.22/month
- Annual savings: about $7.78 for 100 doorbells
Recommendation: Use Opus at 16 kbps — saves 75% bandwidth and cloud costs with negligible quality loss for voice recognition.
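Recomputing the fleet totals from the stated parameters (careful with units: kbps is kilobits per second, so divide by 8 to get bytes):

```python
# Monthly cloud-data estimate for the doorbell fleet
def monthly_gb(codec_kbps: float, seconds_per_day: int = 300,
               days: int = 30, devices: int = 100) -> float:
    """Total monthly traffic in gigaBYTES for a fleet of streaming devices."""
    bits = codec_kbps * 1000 * seconds_per_day * days * devices
    return bits / 8 / 1e9   # bits -> bytes -> GB

pcm = monthly_gb(64)    # 7.2 GB/month
opus = monthly_gb(16)   # 1.8 GB/month
saved_pct = (pcm - opus) / pcm * 100
print(pcm, opus, round(saved_pct))  # 7.2 GB vs 1.8 GB: 75% saved
```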
Concept Relationships: Voice Compression
- Companding -> Signal-to-noise ratio: Logarithmic encoding protects quiet speech without lowering bit rate.
- LPC -> Speech production model: The codec sends source and vocal-tract parameters instead of the raw waveform.
- Opus -> Variable-rate transport: It adapts bit rate to the link budget while preserving low delay and high quality.
- Codec2 -> Ultra-low-bandwidth networking: It trades naturalness for intelligibility on LoRa, HF radio, and satellite-style links.
Cross-module connection: LoRaWAN Overview explains bandwidth constraints that require aggressive voice compression.
30.6 See Also
- Signal Processing Overview — Sampling fundamentals for audio signals
- LoRaWAN Fundamentals — Understanding long-range, low-bandwidth constraints
- BLE Audio — Bluetooth audio streaming for wearables
Common Pitfalls
1. Treating Companding as Bandwidth Compression
μ-law and A-law reshape the sample distribution, but they do not reduce the 64 kbps toll-quality payload. If the radio budget is 10 kbps, companding alone does not solve the problem.
2. Forgetting Headers, Retransmissions, and Duty Cycle
Codec bit rate is only the payload. Packet headers, MAC overhead, acknowledgements, and regional airtime limits can easily double or triple the true network budget, especially on LoRaWAN and cellular IoT links.
3. Picking the Lowest Bit Rate Without Listening in Context
A 2.4 kbps codec that sounds acceptable in a quiet office can become unusable in a factory or on a wearable. Validate intelligibility, noise robustness, delay, and CPU load with the real microphone, speaker, and network path before shipping.
30.7 What’s Next
- Sensor Dynamics: see how sensor settling time and dynamic range interact with the sampled data you eventually encode.
- Signal Processing Labs: practice sampling, filtering, and compression with hands-on experiments.
- Data Representation: review the binary encoding choices that sit underneath codec payloads.
- LoRaWAN Overview: connect codec selection to real long-range airtime and duty-cycle limits.
- BLE Audio Streaming: compare this chapter’s voice decisions with short-range wearable audio links.