1178  Real-Time Protocols for IoT: VoIP, SIP, and RTP

1178.1 Learning Objectives

By the end of this chapter, you will be able to:

  • Understand VoIP Architecture: Explain the protocol stack for real-time audio/video in IoT
  • Compare RTP vs Messaging Protocols: Differentiate when to use RTP instead of MQTT/CoAP
  • Design Secure Streaming: Apply encryption and authentication for real-time IoT streams
  • Select Appropriate Protocols: Choose between SIP, RTSP, and WebRTC for different IoT devices

1178.2 Prerequisites

Before diving into this chapter, you should be familiar with:


1178.3 Real-time Protocols for IoT

While MQTT and CoAP handle most IoT data exchange scenarios, some applications require real-time audio and video streaming with strict latency requirements. Video doorbells, baby monitors, voice assistants, and intercom systems all need protocols designed specifically for continuous media streams.

1178.4 VoIP and SIP Architecture

Voice over IP (VoIP) enables real-time voice and video communication over IP networks. The protocol stack for VoIP consists of several layers working together:

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ecf0f1', 'noteTextColor': '#2C3E50', 'noteBkgColor': '#fff9e6', 'textColor': '#2C3E50', 'fontSize': '14px'}}}%%
graph TB
    subgraph VoIPStack["VoIP/SIP Protocol Stack"]
        direction TB

        subgraph Control["Signaling Layer"]
            SIP["SIP<br/>Session Initiation Protocol<br/>RFC 3261"]
        end

        subgraph Media["Media Transport Layer"]
            RTP["RTP<br/>Real-time Transport Protocol<br/>RFC 3550"]
            RTCP["RTCP<br/>RTP Control Protocol<br/>Quality Feedback"]
        end

        subgraph Transport["Transport Layer"]
            UDP["UDP<br/>User Datagram Protocol<br/>RFC 768"]
        end

        subgraph Network["Network Layer"]
            IP["IP<br/>Internet Protocol"]
        end

        SIP --> UDP
        RTP --> UDP
        RTCP --> UDP
        UDP --> IP
    end

    subgraph Security["Security Options"]
        SRTP["SRTP<br/>Encrypted Media"]
        TLS["TLS<br/>Encrypted Signaling"]
        zRTP["ZRTP<br/>Key Exchange"]
    end

    Security -.->|Protects| VoIPStack

    style Control fill:#E67E22,stroke:#D35400,color:#fff
    style Media fill:#16A085,stroke:#16A085,color:#fff
    style Transport fill:#2C3E50,stroke:#2C3E50,color:#fff
    style Network fill:#7F8C8D,stroke:#7F8C8D,color:#fff
    style Security fill:#ecf0f1,stroke:#16A085,color:#2C3E50

Figure 1178.1: VoIP Protocol Stack: SIP Signaling with RTP Media Transport

VoIP Protocol Stack: SIP handles session control (call setup/teardown), while RTP carries the actual audio/video data over UDP. Security layers (SRTP, TLS, ZRTP) protect both signaling and media streams.

Figure 1178.2
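To make the split between signaling and media concrete, the following is a minimal sketch of a SIP INVITE whose SDP body tells the callee where to send RTP. The function name `build_invite`, the addresses, the fixed `Call-ID`, tag, and branch values are all illustrative assumptions; a real user agent generates unique values per call and also handles responses, retransmissions, and authentication.

```python
def build_invite(caller: str, callee: str, local_ip: str, rtp_port: int) -> str:
    """Build a minimal SIP INVITE whose SDP body offers one PCMU audio stream."""
    # SDP body: describes the media session (where to send RTP and which codec).
    sdp = "\r\n".join([
        "v=0",
        f"o=- 0 0 IN IP4 {local_ip}",
        "s=doorbell",
        f"c=IN IP4 {local_ip}",
        "t=0 0",
        f"m=audio {rtp_port} RTP/AVP 0",  # payload type 0 = PCMU (G.711 mu-law)
        "a=rtpmap:0 PCMU/8000",
    ]) + "\r\n"
    # SIP headers: signaling only; no audio travels in this message.
    headers = [
        f"INVITE sip:{callee} SIP/2.0",
        f"Via: SIP/2.0/UDP {local_ip}:5060;branch=z9hG4bK-demo1",  # fixed demo branch
        "Max-Forwards: 70",
        f"From: <sip:{caller}>;tag=demo-tag",                      # fixed demo tag
        f"To: <sip:{callee}>",
        f"Call-ID: demo-call-1@{local_ip}",                        # fixed demo Call-ID
        "CSeq: 1 INVITE",
        f"Contact: <sip:{caller}>",
        "Content-Type: application/sdp",
        f"Content-Length: {len(sdp.encode('utf-8'))}",
    ]
    return "\r\n".join(headers) + "\r\n\r\n" + sdp


if __name__ == "__main__":
    print(build_invite("doorbell@192.0.2.10", "phone@192.0.2.20", "192.0.2.10", 40000))
```

The `m=audio` line is where the two layers meet: SIP carries the offer, and RTP packets then flow to the advertised UDP port.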

1178.5 Key Protocol Components

| Protocol | RFC | Purpose | IoT Relevance |
|----------|-----|---------|---------------|
| SIP | RFC 3261 | Session Initiation Protocol - multimedia session control | Call setup for video doorbells, intercoms |
| RTP | RFC 3550 | Real-time Transport Protocol - media stream delivery | Audio/video streaming from cameras |
| RTCP | RFC 3550 | RTP Control Protocol - quality monitoring | Adaptive bitrate for constrained networks |
| UDP | RFC 768 | User Datagram Protocol - connectionless transport | Low-latency delivery (no TCP handshake) |
| SRTP | RFC 3711 | Secure RTP - encrypted media | Privacy for baby monitors, doorbells |
| ZRTP | RFC 6189 | Key exchange for SRTP | End-to-end encryption setup |
| TLS | RFC 5246 (TLS 1.2); RFC 8446 (TLS 1.3) | Transport Layer Security - encrypted signaling | Secure SIP (SIPS) on port 5061 |

1178.6 SIP Ports and Security

Note: SIP Port Assignments

| Port | Protocol | Security | Use Case |
|------|----------|----------|----------|
| 5060 | SIP over UDP/TCP | Unencrypted | Internal/trusted networks |
| 5061 | SIPS over TLS | Encrypted | Internet-facing devices |

For IoT devices accessible from the internet (video doorbells, remote intercoms), always use port 5061 with TLS encryption to prevent eavesdropping and session hijacking.
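As a small illustration of that rule, the sketch below picks a URI scheme, transport, and port from a single deployment flag. The names `SipEndpoint`, `pick_sip_endpoint`, and `sip_uri` are hypothetical helpers for this example, not part of any SIP library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SipEndpoint:
    scheme: str     # "sip" or "sips"
    transport: str  # "UDP" or "TLS"
    port: int


def pick_sip_endpoint(internet_facing: bool) -> SipEndpoint:
    """Internet-facing devices get SIPS over TLS on 5061; trusted LANs may use 5060."""
    if internet_facing:
        return SipEndpoint("sips", "TLS", 5061)
    return SipEndpoint("sip", "UDP", 5060)


def sip_uri(user: str, host: str, endpoint: SipEndpoint) -> str:
    """Format a SIP/SIPS URI with an explicit port."""
    return f"{endpoint.scheme}:{user}@{host}:{endpoint.port}"


if __name__ == "__main__":
    ep = pick_sip_endpoint(internet_facing=True)
    print(sip_uri("doorbell", "example.com", ep))
```

Choosing the port is only the visible part of the decision; the device must also actually negotiate TLS and validate the server certificate for the signaling to be protected.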

1178.7 Real-time IoT Applications

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ecf0f1', 'noteTextColor': '#2C3E50', 'noteBkgColor': '#fff9e6', 'textColor': '#2C3E50', 'fontSize': '14px'}}}%%
graph LR
    subgraph Devices["IoT Devices with Real-time Audio/Video"]
        Doorbell["Video Doorbell"]
        Monitor["Baby Monitor"]
        Intercom["Smart Intercom"]
        Assistant["Voice Assistant"]
        Camera["Security Camera"]
    end

    subgraph Protocols["Protocol Selection"]
        SIPbased["SIP + RTP/SRTP<br/>Two-way communication"]
        RTSPbased["RTSP + RTP<br/>One-way streaming"]
        WebRTC["WebRTC<br/>Browser-based"]
    end

    Doorbell --> SIPbased
    Intercom --> SIPbased
    Monitor --> RTSPbased
    Camera --> RTSPbased
    Assistant --> WebRTC

    style Devices fill:#2C3E50,stroke:#2C3E50,color:#fff
    style Protocols fill:#16A085,stroke:#16A085,color:#fff

Figure 1178.3: Real-Time IoT Device Protocol Selection: SIP, RTSP, and WebRTC

Real-time Protocol Selection for IoT: Two-way communication devices (doorbells, intercoms) use SIP+RTP, while one-way streaming devices (cameras, monitors) often use RTSP+RTP. Voice assistants increasingly use WebRTC for browser compatibility.

Figure 1178.4

Common IoT Use Cases:

| Device | Protocol | Why |
|--------|----------|-----|
| Video Doorbell | SIP + SRTP | Two-way audio/video with visitor, needs secure encrypted stream |
| Baby Monitor | RTP (or RTSP) | One-way video stream, low latency critical for responsiveness |
| Smart Intercom | SIP + RTP | Full duplex audio between rooms, session-based communication |
| Voice Assistant | WebRTC or proprietary | Browser/app integration, cloud speech processing |
| Security Camera | RTSP + RTP | Continuous streaming, ONVIF standard interoperability |

1178.8 Comparison: Real-time vs Messaging Protocols

When should you use VoIP/RTP instead of MQTT/CoAP?

| Requirement | MQTT/CoAP | VoIP/RTP |
|-------------|-----------|----------|
| Data Type | Sensor readings, commands, telemetry | Continuous audio/video streams |
| Latency Tolerance | 100 ms to seconds acceptable | <150 ms required (human perception) |
| Packet Loss | Retransmit (reliability critical) | Skip/interpolate (continuity critical) |
| Bandwidth | Low (bytes to KB per message) | High (64 kbps to 2 Mbps continuous) |
| Connection Model | Message-based (discrete) | Session-based (continuous stream) |
| Typical Payload | JSON, CBOR, binary sensor data | PCM audio, H.264/H.265 video |

Use RTP/SRTP when:

  • Streaming continuous audio/video (doorbell live view)
  • Two-way voice communication (intercom, doorbell talk)
  • Latency under 150 ms is critical
  • You need synchronized audio/video playback

Use MQTT when:

  • Sending audio clips/recordings (doorbell motion event)
  • Push notifications with audio alert
  • Speech-to-text results from cloud processing
  • Audio metadata (noise level, voice detection events)
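The two checklists above can be condensed into a rough decision helper. This is only a sketch: the function name `choose_protocol` and its parameters are assumptions made for illustration, and the 150 ms threshold is taken directly from the comparison table rather than measured.

```python
def choose_protocol(*, continuous_stream: bool, two_way_voice: bool,
                    latency_budget_ms: float, needs_av_sync: bool = False) -> str:
    """Return "RTP/SRTP" for live media, otherwise "MQTT" for discrete messages."""
    # Any one of the RTP criteria from the checklist is enough to pick RTP/SRTP.
    if continuous_stream or two_way_voice or needs_av_sync or latency_budget_ms < 150:
        return "RTP/SRTP"
    # Clips, notifications, transcripts, and metadata are discrete messages.
    return "MQTT"


if __name__ == "__main__":
    # Doorbell live view: continuous stream with a tight latency budget.
    print(choose_protocol(continuous_stream=True, two_way_voice=True, latency_budget_ms=100))
    # Motion-event notification: a single message, seconds of delay are fine.
    print(choose_protocol(continuous_stream=False, two_way_voice=False, latency_budget_ms=2000))
```

A real selection would also weigh power budget, NAT traversal, and codec support, which this helper deliberately ignores.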

Hybrid approach (common in smart doorbells):

Motion detected → MQTT notification to phone app
User opens app → SIP session established
Live video/audio → RTP/SRTP stream
User speaks → RTP audio to doorbell speaker
Call ends → SIP session terminated
Event recorded → Video clip stored, MQTT notification
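One way to reason about this hybrid flow is as a small state machine in which each event triggers an action on a specific protocol. The sketch below is a simulation only: the state names, event names, and action strings are assumptions for illustration, and no network traffic is generated.

```python
from enum import Enum
from typing import Dict, List, Tuple


class State(Enum):
    IDLE = "idle"
    NOTIFIED = "notified"
    IN_CALL = "in_call"


# (current state, event) -> (next state, protocol action to perform)
TRANSITIONS: Dict[Tuple[State, str], Tuple[State, str]] = {
    (State.IDLE, "motion"): (State.NOTIFIED, "MQTT: publish motion notification"),
    (State.NOTIFIED, "app_opened"): (State.IN_CALL, "SIP: INVITE, then RTP/SRTP media"),
    (State.IN_CALL, "hangup"): (State.IDLE, "SIP: BYE, then MQTT: publish clip-recorded event"),
}


def step(state: State, event: str) -> Tuple[State, str]:
    """Apply one event; reject events that make no sense in the current state."""
    key = (state, event)
    if key not in TRANSITIONS:
        raise ValueError(f"event {event!r} is not valid in state {state.name}")
    return TRANSITIONS[key]


def run(events: List[str]) -> List[str]:
    """Replay a list of events from IDLE and return the protocol actions taken."""
    state, log = State.IDLE, []
    for event in events:
        state, action = step(state, event)
        log.append(action)
    return log


if __name__ == "__main__":
    for line in run(["motion", "app_opened", "hangup"]):
        print(line)
```

Keeping the notification path (MQTT) and the media path (SIP plus RTP) in separate transitions mirrors the design choice in the flow: the expensive streaming session exists only while a user is actively watching.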

1178.9 Security for Real-time IoT

Real-time audio and video streams require strong security to prevent eavesdropping:

| Security Layer | Protocol | Protects |
|----------------|----------|----------|
| Signaling Encryption | TLS (SIPS on port 5061) | Call setup, session details |
| Media Encryption | SRTP | Audio/video content |
| Key Exchange | ZRTP, DTLS-SRTP | Secure key negotiation |
| Authentication | Digest auth, certificates | Caller/device identity |

Warning: Security Best Practice for IoT Doorbells

Many consumer IoT devices have been compromised due to unencrypted streams. Always ensure:

  1. SRTP enabled - Not plain RTP (encrypted vs. plaintext audio/video)
  2. TLS for SIP - Port 5061, not 5060
  3. Strong authentication - No default credentials
  4. End-to-end encryption - ZRTP negotiates keys directly between endpoints, so a cloud relay cannot decrypt the content
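The checklist above can be turned into a simple configuration audit. This is a sketch under stated assumptions: the configuration keys (`media`, `sip_transport`, `password`, `key_exchange`) and the short default-password list are hypothetical and do not correspond to any particular device's settings format.

```python
from typing import Dict, List

# Hypothetical sample of weak defaults; a real audit would use a much larger list.
DEFAULT_PASSWORDS = {"", "admin", "password", "1234", "12345"}


def audit_stream_config(cfg: Dict[str, object]) -> List[str]:
    """Return one finding per checklist item that the configuration fails."""
    findings = []
    if cfg.get("media") != "SRTP":
        findings.append("media is plaintext RTP; enable SRTP")
    if cfg.get("sip_transport") != "TLS":
        findings.append("SIP signaling is not protected by TLS (port 5061)")
    if str(cfg.get("password", "")) in DEFAULT_PASSWORDS:
        findings.append("default or empty credentials in use")
    if cfg.get("key_exchange") != "ZRTP":
        findings.append("no end-to-end key agreement (ZRTP) configured")
    return findings


if __name__ == "__main__":
    weak = {"media": "RTP", "sip_transport": "UDP", "password": "admin"}
    for finding in audit_stream_config(weak):
        print("FAIL:", finding)
```

An empty result means only that the four items in the checklist pass; it says nothing about firmware vulnerabilities or certificate validation.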

1178.10 RTP Packet Structure

Understanding RTP helps diagnose audio/video issues in IoT deployments:

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|V=2|P|X|  CC   |M|     PT      |       Sequence Number         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                           Timestamp                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           Synchronization Source (SSRC) Identifier            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|            Contributing Source (CSRC) Identifiers             |
|                             ....                              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         Payload Data                          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

| Field | Bits | Purpose |
|-------|------|---------|
| V | 2 | Version (always 2) |
| P | 1 | Padding flag |
| X | 1 | Extension header present |
| CC | 4 | CSRC count |
| M | 1 | Marker (frame boundary) |
| PT | 7 | Payload type (codec identifier) |
| Sequence | 16 | Packet ordering/loss detection |
| Timestamp | 32 | Media timing (audio sample count) |
| SSRC | 32 | Stream identifier |
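The field layout above maps directly onto a 12-byte fixed header that can be unpacked with Python's standard `struct` module. The sketch below parses only the fixed header; the names `RtpHeader` and `parse_rtp_header` are illustrative, and CSRC entries and header extensions are not decoded.

```python
import struct
from typing import NamedTuple


class RtpHeader(NamedTuple):
    version: int
    padding: bool
    extension: bool
    csrc_count: int
    marker: bool
    payload_type: int
    sequence: int
    timestamp: int
    ssrc: int


def parse_rtp_header(packet: bytes) -> RtpHeader:
    """Decode the 12-byte fixed RTP header (network byte order)."""
    if len(packet) < 12:
        raise ValueError("RTP packet shorter than the 12-byte fixed header")
    # B: V/P/X/CC byte, B: M/PT byte, H: sequence, I: timestamp, I: SSRC
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
    return RtpHeader(
        version=b0 >> 6,             # top 2 bits
        padding=bool(b0 & 0x20),     # bit P
        extension=bool(b0 & 0x10),   # bit X
        csrc_count=b0 & 0x0F,        # low 4 bits
        marker=bool(b1 & 0x80),      # bit M
        payload_type=b1 & 0x7F,      # low 7 bits
        sequence=seq,
        timestamp=ts,
        ssrc=ssrc,
    )


if __name__ == "__main__":
    # Hand-built sample: V=2, marker set, PT=0 (PCMU), followed by 160 payload bytes.
    sample = struct.pack("!BBHII", 0x80, 0x80, 1234, 160, 0xDEADBEEF) + bytes(160)
    print(parse_rtp_header(sample))
```

When diagnosing a capture, a version other than 2 usually means the bytes are not RTP at all (or are encrypted at a different layer), which is a quick first sanity check.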

RTP Header Overhead: 12 bytes minimum (vs. CoAP’s 4-byte header and MQTT’s 2-byte minimum fixed header)
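To see what that overhead costs on the wire, the sketch below adds the RTP, UDP, and IPv4 headers to each packet of a constant-bitrate audio stream. The 20 ms packetization interval and the use of IPv4 without options are assumptions chosen for the example; link-layer framing and SRTP authentication tags are ignored.

```python
RTP_HEADER_BYTES = 12   # fixed RTP header, no CSRC entries or extensions
UDP_HEADER_BYTES = 8
IPV4_HEADER_BYTES = 20  # IPv4 without options


def on_wire_kbps(codec_kbps: float, packet_ms: float) -> float:
    """Bandwidth of a constant-bitrate stream including RTP/UDP/IPv4 headers."""
    payload_bytes = codec_kbps * 1000 / 8 * packet_ms / 1000
    packet_bytes = payload_bytes + RTP_HEADER_BYTES + UDP_HEADER_BYTES + IPV4_HEADER_BYTES
    packets_per_second = 1000 / packet_ms
    return packet_bytes * 8 * packets_per_second / 1000


if __name__ == "__main__":
    # 64 kbps audio in 20 ms packets: 160 payload bytes + 40 header bytes per packet.
    print(on_wire_kbps(64, 20))  # 80.0
```

With these assumptions a 64 kbps stream occupies 80 kbps on the network, so headers account for 20% of each packet. Longer packetization intervals reduce that share at the cost of added latency.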

The larger header is justified for streaming media because:

  • Sequence numbers detect packet loss and reordering
  • Timestamps enable jitter buffering and synchronization
  • SSRC allows multiple streams in one session
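The first two points can be illustrated with a short sketch: counting gaps in the 16-bit sequence numbers, and updating the running interarrival jitter estimate defined in RFC 3550 (each new deviation is weighted by 1/16). The function names are illustrative, and the loss counter is deliberately simplified: it ignores reordered packets, so a packet that arrives late is still counted as lost.

```python
from typing import List, Tuple


def count_lost(seqs: List[int]) -> int:
    """Count missing packets from arrival-order sequence numbers (16-bit wrap aware)."""
    lost = 0
    for prev, cur in zip(seqs, seqs[1:]):
        gap = (cur - prev) & 0xFFFF      # forward distance modulo 2**16
        if 1 < gap < 0x8000:             # a forward jump larger than 1 means a hole
            lost += gap - 1
    return lost


def update_jitter(jitter: float, prev_transit: int,
                  arrival_ts: int, rtp_ts: int) -> Tuple[float, int]:
    """One step of the RFC 3550 jitter estimator.

    arrival_ts must be expressed in the same clock units as the RTP timestamp.
    Returns the new jitter estimate and the transit value to pass to the next call.
    """
    transit = arrival_ts - rtp_ts
    deviation = abs(transit - prev_transit)
    return jitter + (deviation - jitter) / 16, transit


if __name__ == "__main__":
    print(count_lost([65534, 65535, 0, 2]))   # one packet (sequence 1) is missing
    print(update_jitter(0.0, 100, 300, 160))  # transit 140, deviation 40
```

A receiver typically sizes its jitter buffer from the jitter estimate and reports both values back to the sender in RTCP receiver reports.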

1178.11 Summary Table: Protocol Comparison

| Criterion | CoAP | MQTT | RTP/SIP |
|-----------|------|------|---------|
| Best Use Case | Direct device queries | Event distribution | Audio/video streaming |
| Communication | Request-Response | Publish-Subscribe | Session-based streams |
| Transport | UDP (lightweight) | TCP (reliable) | UDP (low latency) |
| Power | Ultra-low | Low | Medium-High |
| Reliability | Optional | Built-in (QoS) | Lost packets skipped |
| Scalability | Good | Excellent | Per-session |
| Complexity | Low | Medium | High |
| Browser Support | Limited | Good (WebSockets) | WebRTC bridge |
| Setup | No broker needed | Requires broker | SIP server optional |
| Latency | Low | Medium | Ultra-low (<150 ms) |
| Data Type | Sensor data | Events/telemetry | Continuous media |

1178.12 Key Takeaways

Note: Protocol Selection Principles

  1. No single protocol is always best - Evaluate based on specific requirements
  2. CoAP excels in constrained environments - Direct, lightweight, low power
  3. MQTT excels in event-driven systems - Reliable, scalable, many-to-many
  4. VoIP/RTP for real-time media - Video doorbells and intercoms typically use SIP+RTP; cameras often use RTSP+RTP, and voice assistants WebRTC or a proprietary stack
  5. Hybrid approaches are common - Use both where appropriate (MQTT for notifications, RTP for live streams)
  6. Consider the entire system - Not just protocol features, but network, power, and architecture

Protocol Deep Dives:

  • MQTT - Pub/sub messaging
  • CoAP - RESTful IoT
  • AMQP - Enterprise messaging
  • XMPP - Presence protocol

Protocol Selection:

  • Protocol Selection Framework - Choosing protocols
  • IoT Protocols Overview - Full landscape

Architecture:

  • IoT Reference Models - Architecture layers
  • Edge Fog Computing - Protocol placement

Interactive Tools:

  • Simulations Hub - Protocol comparison tool

Learning Hubs:

  • Quiz Navigator - Protocol quizzes

1178.14 What’s Next?

Return to the Application Protocols Overview for links to all protocol chapters, or continue to MQTT Fundamentals for a deep dive into the most widely-used IoT messaging protocol.