%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ecf0f1', 'noteTextColor': '#2C3E50', 'noteBkgColor': '#fff9e6', 'textColor': '#2C3E50', 'fontSize': '14px'}}}%%
graph TB
subgraph VoIPStack["VoIP/SIP Protocol Stack"]
direction TB
subgraph Control["Signaling Layer"]
SIP["SIP<br/>Session Initiation Protocol<br/>RFC 3261"]
end
subgraph Media["Media Transport Layer"]
RTP["RTP<br/>Real-time Transport Protocol<br/>RFC 3550"]
RTCP["RTCP<br/>RTP Control Protocol<br/>Quality Feedback"]
end
subgraph Transport["Transport Layer"]
UDP["UDP<br/>User Datagram Protocol<br/>RFC 768"]
end
subgraph Network["Network Layer"]
IP["IP<br/>Internet Protocol"]
end
SIP --> UDP
RTP --> UDP
RTCP --> UDP
UDP --> IP
end
subgraph Security["Security Options"]
SRTP["SRTP<br/>Encrypted Media"]
TLS["TLS<br/>Encrypted Signaling"]
zRTP["ZRTP<br/>Key Exchange"]
end
Security -.->|Protects| VoIPStack
style Control fill:#E67E22,stroke:#D35400,color:#fff
style Media fill:#16A085,stroke:#16A085,color:#fff
style Transport fill:#2C3E50,stroke:#2C3E50,color:#fff
style Network fill:#7F8C8D,stroke:#7F8C8D,color:#fff
style Security fill:#ecf0f1,stroke:#16A085,color:#2C3E50
1178 Real-Time Protocols for IoT: VoIP, SIP, and RTP
1178.1 Learning Objectives
By the end of this chapter, you will be able to:
- Understand VoIP Architecture: Explain the protocol stack for real-time audio/video in IoT
- Compare RTP vs Messaging Protocols: Differentiate when to use RTP instead of MQTT/CoAP
- Design Secure Streaming: Apply encryption and authentication for real-time IoT streams
- Select Appropriate Protocols: Choose between SIP, RTSP, and WebRTC for different IoT devices
1178.2 Prerequisites
Before diving into this chapter, you should be familiar with:
- Application Protocols Overview: Basic understanding of IoT application protocols
- Transport Fundamentals: UDP vs TCP characteristics
1178.3 Real-time Protocols for IoT
While MQTT and CoAP handle most IoT data exchange scenarios, some applications require real-time audio and video streaming with strict latency requirements. Video doorbells, baby monitors, voice assistants, and intercom systems all need protocols designed specifically for continuous media streams.
1178.4 VoIP and SIP Architecture
Voice over IP (VoIP) enables real-time voice and video communication over IP networks. The protocol stack for VoIP consists of several layers working together:
VoIP Protocol Stack: SIP handles session control (call setup/teardown), while RTP carries the actual audio/video data over UDP and RTCP reports stream quality. Security options (SRTP, TLS, ZRTP) protect both signaling and media streams.
1178.5 Key Protocol Components
| Protocol | RFC | Purpose | IoT Relevance |
|---|---|---|---|
| SIP | RFC 3261 | Session Initiation Protocol - multimedia session control | Call setup for video doorbells, intercoms |
| RTP | RFC 3550 | Real-time Transport Protocol - media stream delivery | Audio/video streaming from cameras |
| RTCP | RFC 3550 | RTP Control Protocol - quality monitoring | Adaptive bitrate for constrained networks |
| UDP | RFC 768 | User Datagram Protocol - connectionless transport | Low-latency delivery (no TCP handshake) |
| SRTP | RFC 3711 | Secure RTP - encrypted media | Privacy for baby monitors, doorbells |
| ZRTP | RFC 6189 | Key exchange for SRTP | End-to-end encryption setup |
| TLS | RFC 8446 (TLS 1.3; RFC 5246 for TLS 1.2) | Transport Layer Security - encrypted signaling | Secure SIP (SIPS) on port 5061 |
1178.6 SIP Ports and Security
| Port | Protocol | Security | Use Case |
|---|---|---|---|
| 5060 | SIP over UDP/TCP | Unencrypted | Internal/trusted networks |
| 5061 | SIPS over TLS | Encrypted | Internet-facing devices |
For IoT devices accessible from the internet (video doorbells, remote intercoms), always use port 5061 with TLS encryption to prevent eavesdropping and session hijacking.
1178.7 Real-time IoT Applications
%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ecf0f1', 'noteTextColor': '#2C3E50', 'noteBkgColor': '#fff9e6', 'textColor': '#2C3E50', 'fontSize': '14px'}}}%%
graph LR
subgraph Devices["IoT Devices with Real-time Audio/Video"]
Doorbell["Video Doorbell"]
Monitor["Baby Monitor"]
Intercom["Smart Intercom"]
Assistant["Voice Assistant"]
Camera["Security Camera"]
end
subgraph Protocols["Protocol Selection"]
SIPbased["SIP + RTP/SRTP<br/>Two-way communication"]
RTSPbased["RTSP + RTP<br/>One-way streaming"]
WebRTC["WebRTC<br/>Browser-based"]
end
Doorbell --> SIPbased
Intercom --> SIPbased
Monitor --> RTSPbased
Camera --> RTSPbased
Assistant --> WebRTC
style Devices fill:#2C3E50,stroke:#2C3E50,color:#fff
style Protocols fill:#16A085,stroke:#16A085,color:#fff
Real-time Protocol Selection for IoT: Two-way communication devices (doorbells, intercoms) use SIP+RTP, while one-way streaming devices (cameras, monitors) often use RTSP+RTP. Voice assistants increasingly use WebRTC for browser compatibility.
Common IoT Use Cases:
| Device | Protocol | Why |
|---|---|---|
| Video Doorbell | SIP + SRTP | Two-way audio/video with visitor, needs secure encrypted stream |
| Baby Monitor | RTP (or RTSP) | One-way video stream, low latency critical for responsiveness |
| Smart Intercom | SIP + RTP | Full duplex audio between rooms, session-based communication |
| Voice Assistant | WebRTC or proprietary | Browser/app integration, cloud speech processing |
| Security Camera | RTSP + RTP | Continuous streaming, ONVIF standard interoperability |
1178.8 Comparison: Real-time vs Messaging Protocols
When should you use VoIP/RTP instead of MQTT/CoAP?
| Requirement | MQTT/CoAP | VoIP/RTP |
|---|---|---|
| Data Type | Sensor readings, commands, telemetry | Continuous audio/video streams |
| Latency Tolerance | 100ms - seconds acceptable | <150ms one-way required (ITU-T G.114 guideline for conversational voice) |
| Packet Loss | Retransmit (reliability critical) | Skip/interpolate (continuity critical) |
| Bandwidth | Low (bytes to KB per message) | High (64kbps - 2Mbps continuous) |
| Connection Model | Message-based (discrete) | Session-based (continuous stream) |
| Typical Payload | JSON, CBOR, binary sensor data | PCM audio, H.264/H.265 video |
Use RTP/SRTP when:
- Streaming continuous audio/video (doorbell live view)
- Two-way voice communication (intercom, doorbell talk)
- Latency under 150ms is critical
- You need synchronized audio/video playback
Use MQTT when:
- Sending audio clips/recordings (doorbell motion event)
- Push notifications with audio alert
- Speech-to-text results from cloud processing
- Audio metadata (noise level, voice detection events)
Hybrid approach (common in smart doorbells):
Motion detected → MQTT notification to phone app
User opens app → SIP session established
Live video/audio → RTP/SRTP stream
User speaks → RTP audio to doorbell speaker
Call ends → SIP session terminated
Event recorded → Video clip stored, MQTT notification
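The hybrid sequence above can be sketched as a small state machine. This is only an illustration of which protocol carries each step: the `Doorbell` class, the event names, and the state names are hypothetical, and no real MQTT or SIP traffic is sent. `INVITE` and `BYE` are the SIP methods that open and close a session.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    NOTIFIED = auto()
    IN_CALL = auto()


# (current state, event) -> (next state, protocol action to perform).
# Event names are made up for this sketch.
TRANSITIONS = {
    (State.IDLE, "motion_detected"):
        (State.NOTIFIED, "MQTT: publish motion notification"),
    (State.NOTIFIED, "app_opened"):
        (State.IN_CALL, "SIP: INVITE, then RTP/SRTP media in both directions"),
    (State.IN_CALL, "call_ended"):
        (State.IDLE, "SIP: BYE, then MQTT: publish clip-recorded event"),
}


class Doorbell:
    """Tracks which protocol is responsible for each phase of a visit."""

    def __init__(self) -> None:
        self.state = State.IDLE
        self.log = []

    def handle(self, event: str) -> str:
        key = (self.state, event)
        if key not in TRANSITIONS:
            raise ValueError(f"event {event!r} not valid in state {self.state.name}")
        self.state, action = TRANSITIONS[key]
        self.log.append(action)
        return action


if __name__ == "__main__":
    bell = Doorbell()
    for event in ("motion_detected", "app_opened", "call_ended"):
        print(f"{event:16} -> {bell.handle(event)}")
```

Keeping the notification path (MQTT) separate from the media path (SIP plus RTP) means the device only pays the bandwidth and power cost of streaming while a session is actually open.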
1178.9 Security for Real-time IoT
Real-time audio and video streams require strong security to prevent eavesdropping:
| Security Layer | Protocol | Protects |
|---|---|---|
| Signaling Encryption | TLS (SIPS on port 5061) | Call setup, session details |
| Media Encryption | SRTP | Audio/video content |
| Key Exchange | ZRTP, DTLS-SRTP | Secure key negotiation |
| Authentication | Digest auth, certificates | Caller/device identity |
Many consumer IoT devices have been compromised due to unencrypted streams. Always ensure:
- SRTP enabled - Not just RTP (encrypted vs plaintext audio/video)
- TLS for SIP - Port 5061, not 5060
- Strong authentication - Not default credentials
- End-to-end encryption - ZRTP prevents cloud provider access to content
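The checklist above can be turned into a simple configuration audit. The sketch below is a minimal illustration, not a real device API: `StreamConfig`, its field names, and the short `DEFAULT_LOGINS` set are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical sample of factory-default logins; a real audit would use a fuller list.
DEFAULT_LOGINS = {("admin", "admin"), ("admin", "password"), ("root", "root")}


@dataclass
class StreamConfig:
    """Assumed shape of a device's streaming settings (illustrative only)."""
    srtp_enabled: bool
    sip_port: int
    sip_over_tls: bool
    username: str
    password: str
    zrtp_enabled: bool


def audit(cfg: StreamConfig) -> List[str]:
    """Return one finding per checklist item the configuration fails."""
    findings = []
    if not cfg.srtp_enabled:
        findings.append("media: plain RTP in use; enable SRTP")
    if not cfg.sip_over_tls or cfg.sip_port != 5061:
        findings.append("signaling: use SIP over TLS on port 5061, not 5060")
    if (cfg.username.lower(), cfg.password.lower()) in DEFAULT_LOGINS:
        findings.append("auth: default credentials still set")
    if not cfg.zrtp_enabled:
        findings.append("keying: no end-to-end key exchange (ZRTP)")
    return findings


if __name__ == "__main__":
    factory = StreamConfig(False, 5060, False, "admin", "admin", False)
    for finding in audit(factory):
        print(finding)
```

An empty list means every item on the checklist passed; it does not prove the device is secure, only that these four settings are in place.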
1178.10 RTP Packet Structure
Understanding RTP helps diagnose audio/video issues in IoT deployments:
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|V=2|P|X|  CC   |M|     PT      |       Sequence Number         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                           Timestamp                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           Synchronization Source (SSRC) Identifier            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|            Contributing Source (CSRC) Identifiers             |
|                             ....                              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         Payload Data                          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Field | Bits | Purpose |
|---|---|---|
| V | 2 | Version (always 2) |
| P | 1 | Padding flag |
| X | 1 | Extension header present |
| CC | 4 | CSRC count |
| M | 1 | Marker (frame boundary) |
| PT | 7 | Payload type (codec identifier) |
| Sequence | 16 | Packet ordering/loss detection |
| Timestamp | 32 | Media timing in clock ticks (sample count for audio, 90 kHz clock for video) |
| SSRC | 32 | Stream identifier |
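As a rough illustration of the field layout in the table above, the following sketch decodes the fixed 12-byte header and any CSRC entries with Python's standard `struct` module. The names `RtpHeader` and `parse_rtp_header` are made up for this example, and the optional extension header (signalled by X) is not parsed.

```python
import struct
from dataclasses import dataclass
from typing import Tuple


@dataclass
class RtpHeader:
    version: int
    padding: bool
    extension: bool
    csrc_count: int
    marker: bool
    payload_type: int
    sequence: int
    timestamp: int
    ssrc: int
    csrcs: Tuple[int, ...]
    header_length: int  # fixed header plus CSRC list, in bytes


def parse_rtp_header(packet: bytes) -> RtpHeader:
    if len(packet) < 12:
        raise ValueError("shorter than the 12-byte fixed RTP header")
    # Network byte order: two flag bytes, 16-bit sequence, 32-bit timestamp, 32-bit SSRC.
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
    version = b0 >> 6
    if version != 2:
        raise ValueError(f"unsupported RTP version {version}")
    cc = b0 & 0x0F
    end = 12 + 4 * cc
    if len(packet) < end:
        raise ValueError("packet truncated inside the CSRC list")
    csrcs = struct.unpack(f"!{cc}I", packet[12:end])
    return RtpHeader(
        version=version,
        padding=bool(b0 & 0x20),
        extension=bool(b0 & 0x10),
        csrc_count=cc,
        marker=bool(b1 & 0x80),
        payload_type=b1 & 0x7F,
        sequence=seq,
        timestamp=ts,
        ssrc=ssrc,
        csrcs=csrcs,
        header_length=end,
    )


if __name__ == "__main__":
    # Synthetic packet: V=2, marker set, payload type 0, 160 bytes of silence.
    demo = struct.pack("!BBHII", 0x80, 0x80, 1000, 160, 0x12345678) + bytes(160)
    print(parse_rtp_header(demo))
```

When diagnosing a stream, gaps in `sequence` between consecutive packets point to loss or reordering, while uneven arrival times relative to `timestamp` point to jitter.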
RTP Header Overhead: 12 bytes minimum (vs CoAP's 4 bytes, MQTT's 2 bytes)
The larger header is justified for streaming media because:
- Sequence numbers detect packet loss and reordering
- Timestamps enable jitter buffering and synchronization
- SSRC allows multiple streams in one session
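To put the 12-byte header in context, the sketch below estimates the on-the-wire bitrate of a voice stream once RTP, UDP, and IPv4 headers are added to every packet. It assumes a 64 kbps codec (the low end of the bandwidth range quoted earlier, matching G.711), a 20 ms packetization interval, no CSRC entries or header extensions, and no IP options; link-layer framing and SRTP authentication tags are ignored. The function name `wire_bitrate` is hypothetical.

```python
RTP_HEADER_BYTES = 12   # fixed RTP header, no CSRC list or extension
UDP_HEADER_BYTES = 8
IPV4_HEADER_BYTES = 20  # no IP options


def wire_bitrate(codec_bps: int, packet_ms: int):
    """Return (payload bytes per packet, bits/s on the wire, header share of each packet)."""
    payload_bytes = codec_bps * packet_ms // 8000          # bits/s * ms -> bytes
    header_bytes = RTP_HEADER_BYTES + UDP_HEADER_BYTES + IPV4_HEADER_BYTES
    packets_per_second = 1000 / packet_ms
    packet_bytes = payload_bytes + header_bytes
    wire_bps = packet_bytes * 8 * packets_per_second
    overhead = header_bytes / packet_bytes
    return payload_bytes, wire_bps, overhead


if __name__ == "__main__":
    for interval_ms in (10, 20, 40):
        payload, bps, overhead = wire_bitrate(64_000, interval_ms)
        print(f"{interval_ms} ms packets: {payload} B payload, "
              f"{bps / 1000:.0f} kbps on the wire, {overhead:.0%} headers")
```

Under these assumptions a 20 ms packet carries 160 payload bytes plus 40 header bytes, so the 64 kbps codec occupies 80 kbps on the wire. Shorter packets lower latency but raise the header share, which is the trade-off to weigh on constrained IoT links.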
1178.11 Summary Table: Protocol Comparison
| Criterion | CoAP | MQTT | RTP/SIP |
|---|---|---|---|
| Best Use Case | Direct device queries | Event distribution | Audio/video streaming |
| Communication | Request-Response | Publish-Subscribe | Session-based streams |
| Transport | UDP (lightweight) | TCP (reliable) | UDP (low latency) |
| Power | Ultra-low | Low | Medium-High |
| Reliability | Optional | Built-in (QoS) | Lost packets skipped |
| Scalability | Good | Excellent | Per-session |
| Complexity | Low | Medium | High |
| Browser Support | Limited | Good (WebSockets) | WebRTC bridge |
| Setup | No broker needed | Requires broker | SIP server optional |
| Latency | Low | Medium | Ultra-low (<150ms) |
| Data Type | Sensor data | Events/telemetry | Continuous media |
1178.12 Key Takeaways
- No single protocol is always best - Evaluate based on specific requirements
- CoAP excels in constrained environments - Direct, lightweight, low power
- MQTT excels in event-driven systems - Reliable, scalable, many-to-many
- VoIP/RTP for real-time media - Video doorbells and intercoms typically use SIP+RTP, cameras often use RTSP+RTP, and voice assistants often use WebRTC or proprietary stacks
- Hybrid approaches are common - Use both where appropriate (MQTT for notifications, RTP for live streams)
- Consider the entire system - Not just protocol features, but network, power, and architecture
Protocol Deep Dives:
- MQTT - Pub/sub messaging
- CoAP - RESTful IoT
- AMQP - Enterprise messaging
- XMPP - Presence protocol
Protocol Selection:
- Protocol Selection Framework - Choosing protocols
- IoT Protocols Overview - Full landscape
Architecture:
- IoT Reference Models - Architecture layers
- Edge Fog Computing - Protocol placement
Interactive Tools:
- Simulations Hub - Protocol comparison tool
Learning Hubs:
- Quiz Navigator - Protocol quizzes
1178.13 Visual Reference Gallery
This comprehensive overview shows how the major IoT application protocols relate to each other, helping guide protocol selection based on device constraints and communication patterns.
Understanding the fundamental differences between CoAP (RESTful, UDP-based) and MQTT (pub/sub, TCP-based) is essential for selecting the right protocol for specific IoT scenarios.
1178.14 What's Next?
Return to the Application Protocols Overview for links to all protocol chapters, or continue to MQTT Fundamentals for a deep dive into the most widely-used IoT messaging protocol.