1171  IoT Application Protocols: Real-time Protocols for Audio and Video

1171.1 Learning Objectives

By the end of this chapter, you will be able to:

  • Understand VoIP/SIP/RTP Architecture: Explain protocols for audio/video IoT applications
  • Compare RTP vs MQTT: Select appropriate protocol for real-time vs telemetry data
  • Design Secure Doorbell Systems: Apply security best practices for IoT audio/video devices
  • Apply Protocol Selection Principles: Choose protocols based on requirements and constraints
  • Understand Visual References: Navigate protocol landscape diagrams

1171.2 Prerequisites

Before diving into this chapter, you should be familiar with:

1171.3 How This Chapter Fits

Chapter Series Navigation:

  1. Introduction and Why Lightweight Protocols Matter
  2. Protocol Overview and Comparison
  3. REST API Design for IoT
  4. Real-time Protocols (this chapter)
  5. Worked Examples

This chapter extends the protocol discussion to real-time audio/video applications like smart doorbells, intercoms, and video surveillance systems.


1171.4 Real-time Protocols for IoT

While MQTT and CoAP handle most IoT data exchange scenarios, some applications require real-time audio and video streaming with strict latency requirements. Video doorbells, baby monitors, voice assistants, and intercom systems all need protocols designed specifically for continuous media streams.

1171.4.1 VoIP and SIP Architecture

Voice over IP (VoIP) enables real-time voice and video communication over IP networks. The protocol stack for VoIP consists of several layers working together:

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ecf0f1', 'noteTextColor': '#2C3E50', 'noteBkgColor': '#fff9e6', 'textColor': '#2C3E50', 'fontSize': '14px'}}}%%
graph TB
    subgraph VoIPStack["VoIP/SIP Protocol Stack"]
        direction TB

        subgraph Control["Signaling Layer"]
            SIP["SIP<br/>Session Initiation Protocol<br/>RFC 3261"]
        end

        subgraph Media["Media Transport Layer"]
            RTP["RTP<br/>Real-time Transport Protocol<br/>RFC 3550"]
            RTCP["RTCP<br/>RTP Control Protocol<br/>Quality Feedback"]
        end

        subgraph Transport["Transport Layer"]
            UDP["UDP<br/>User Datagram Protocol<br/>RFC 768"]
        end

        subgraph Network["Network Layer"]
            IP["IP<br/>Internet Protocol"]
        end

        SIP --> UDP
        RTP --> UDP
        RTCP --> UDP
        UDP --> IP
    end

    subgraph Security["Security Options"]
        SRTP["SRTP<br/>Encrypted Media"]
        TLS["TLS<br/>Encrypted Signaling"]
        zRTP["ZRTP<br/>Key Exchange"]
    end

    Security -.->|Protects| VoIPStack

    style Control fill:#E67E22,stroke:#D35400,color:#fff
    style Media fill:#16A085,stroke:#16A085,color:#fff
    style Transport fill:#2C3E50,stroke:#2C3E50,color:#fff
    style Network fill:#7F8C8D,stroke:#7F8C8D,color:#fff
    style Security fill:#ecf0f1,stroke:#16A085,color:#2C3E50

Figure 1171.1: VoIP Protocol Stack: SIP Signaling with RTP Media Transport

VoIP Protocol Stack: SIP handles session control (call setup/teardown), while RTP carries the actual audio/video data over UDP. The security options protect both paths: TLS encrypts signaling, SRTP encrypts media, and ZRTP negotiates the SRTP keys.


1171.4.2 Key Protocol Components

| Protocol | RFC | Purpose | IoT Relevance |
|----------|-----|---------|---------------|
| SIP | RFC 3261 | Session Initiation Protocol: multimedia session control | Call setup for video doorbells, intercoms |
| RTP | RFC 3550 | Real-time Transport Protocol: media stream delivery | Audio/video streaming from cameras |
| RTCP | RFC 3550 | RTP Control Protocol: quality monitoring | Adaptive bitrate for constrained networks |
| UDP | RFC 768 | User Datagram Protocol: connectionless transport | Low-latency delivery (no TCP handshake) |
| SRTP | RFC 3711 | Secure RTP: encrypted media | Privacy for baby monitors, doorbells |
| ZRTP | RFC 6189 | Key exchange for SRTP | End-to-end encryption setup |
| TLS | RFC 5246 (TLS 1.2), RFC 8446 (TLS 1.3) | Transport Layer Security: encrypted signaling | Secure SIP (SIPS) on port 5061 |

1171.4.3 SIP Ports and Security

Note: SIP Port Assignments

| Port | Protocol | Security | Use Case |
|------|----------|----------|----------|
| 5060 | SIP over UDP/TCP | Unencrypted | Internal/trusted networks |
| 5061 | SIPS over TLS | Encrypted | Internet-facing devices |

For IoT devices accessible from the internet (video doorbells, remote intercoms), always use port 5061 with TLS encryption to prevent eavesdropping and session hijacking.
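
The port rule above can be expressed as a small helper. This is a minimal sketch, not a SIP client: `signaling_uri` and `make_tls_context` are hypothetical names, the host names are placeholders, and the TLS 1.2 minimum is an assumption rather than a requirement stated in this chapter.

```python
import ssl

SIP_PORT = 5060   # plain SIP over UDP/TCP
SIPS_PORT = 5061  # SIP over TLS (SIPS)


def signaling_uri(host: str, internet_facing: bool) -> str:
    """Build a SIP or SIPS URI, forcing SIPS for internet-facing devices."""
    if internet_facing:
        return f"sips:{host}:{SIPS_PORT}"
    return f"sip:{host}:{SIP_PORT}"


def make_tls_context() -> ssl.SSLContext:
    """Create a client-side TLS context that verifies the server certificate."""
    ctx = ssl.create_default_context()            # certificate and hostname checks on
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2  # assumption: refuse older TLS versions
    return ctx


print(signaling_uri("doorbell.example.com", internet_facing=True))
print(signaling_uri("intercom.local", internet_facing=False))
```

The context returned by `make_tls_context` would be used to wrap the TCP socket that carries SIPS traffic; opening the connection itself is left out of the sketch.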

1171.4.4 Real-time IoT Applications

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#16A085', 'secondaryColor': '#E67E22', 'tertiaryColor': '#ecf0f1', 'noteTextColor': '#2C3E50', 'noteBkgColor': '#fff9e6', 'textColor': '#2C3E50', 'fontSize': '14px'}}}%%
graph LR
    subgraph Devices["IoT Devices with Real-time Audio/Video"]
        Doorbell["🔔 Video Doorbell"]
        Monitor["👶 Baby Monitor"]
        Intercom["📞 Smart Intercom"]
        Assistant["🎤 Voice Assistant"]
        Camera["📹 Security Camera"]
    end

    subgraph Protocols["Protocol Selection"]
        SIPbased["SIP + RTP/SRTP<br/>Two-way communication"]
        RTSPbased["RTSP + RTP<br/>One-way streaming"]
        WebRTC["WebRTC<br/>Browser-based"]
    end

    Doorbell --> SIPbased
    Intercom --> SIPbased
    Monitor --> RTSPbased
    Camera --> RTSPbased
    Assistant --> WebRTC

    style Devices fill:#2C3E50,stroke:#2C3E50,color:#fff
    style Protocols fill:#16A085,stroke:#16A085,color:#fff

Figure 1171.3: Real-Time IoT Device Protocol Selection: SIP, RTSP, and WebRTC

Real-time Protocol Selection for IoT: Two-way communication devices (doorbells, intercoms) use SIP+RTP, while one-way streaming devices (cameras, monitors) often use RTSP+RTP. Voice assistants increasingly use WebRTC for browser compatibility.


Common IoT Use Cases:

| Device | Protocol | Why |
|--------|----------|-----|
| Video Doorbell | SIP + SRTP | Two-way audio/video with visitor, needs secure encrypted stream |
| Baby Monitor | RTP (or RTSP) | One-way video stream, low latency critical for responsiveness |
| Smart Intercom | SIP + RTP | Full duplex audio between rooms, session-based communication |
| Voice Assistant | WebRTC or proprietary | Browser/app integration, cloud speech processing |
| Security Camera | RTSP + RTP | Continuous streaming, ONVIF standard interoperability |

1171.4.5 Comparison: Real-time vs Messaging Protocols

When should you use VoIP/RTP instead of MQTT/CoAP?

| Requirement | MQTT/CoAP | VoIP/RTP |
|-------------|-----------|----------|
| Data Type | Sensor readings, commands, telemetry | Continuous audio/video streams |
| Latency Tolerance | 100 ms to seconds acceptable | <150 ms required (human perception) |
| Packet Loss | Retransmit (reliability critical) | Skip/interpolate (continuity critical) |
| Bandwidth | Low (bytes to KB per message) | High (64 kbps to 2 Mbps continuous) |
| Connection Model | Message-based (discrete) | Session-based (continuous stream) |
| Typical Payload | JSON, CBOR, binary sensor data | PCM audio, H.264/H.265 video |
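
To make the bandwidth row concrete, the sketch below estimates what a 64 kbps audio stream costs on the wire once per-packet headers are added. The 20 ms packet interval and IPv4 transport are assumptions chosen for illustration, and `rtp_wire_bitrate` is a hypothetical helper name.

```python
# Per-packet header sizes in bytes (IPv4 without options, UDP, RTP without CSRCs).
IPV4_HEADER = 20
UDP_HEADER = 8
RTP_HEADER = 12


def rtp_wire_bitrate(codec_bps, packet_ms):
    """Return (payload_bytes, packets_per_second, wire_bps) for a constant-rate stream."""
    packets_per_second = 1000 / packet_ms
    payload_bytes = codec_bps / 8 / packets_per_second
    overhead = IPV4_HEADER + UDP_HEADER + RTP_HEADER
    wire_bps = (payload_bytes + overhead) * 8 * packets_per_second
    return payload_bytes, packets_per_second, wire_bps


payload, pps, wire = rtp_wire_bitrate(codec_bps=64_000, packet_ms=20)
print(f"{payload:.0f} B payload, {pps:.0f} packets/s, {wire / 1000:.0f} kbps on the wire")
```

Under these assumptions each packet carries 160 bytes of audio plus 40 bytes of headers, so the stream needs about 80 kbps at the IP layer. Shorter packet intervals lower latency but raise the header share.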

Use RTP/SRTP when:

  • Streaming continuous audio/video (doorbell live view)
  • Two-way voice communication (intercom, doorbell talk)
  • Latency under 150 ms is critical
  • You need synchronized audio/video playback

Use MQTT when:

  • Sending audio clips/recordings (doorbell motion event)
  • Push notifications with audio alert
  • Speech-to-text results from cloud processing
  • Audio metadata (noise level, voice detection events)

Hybrid approach (common in smart doorbells):

Motion detected → MQTT notification to phone app
User opens app → SIP session established
Live video/audio → RTP/SRTP stream
User speaks → RTP audio to doorbell speaker
Call ends → SIP session terminated
Event recorded → Video clip stored, MQTT notification
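
The flow above can be modelled as a small state machine that records which protocol carries each step. This is an illustrative sketch only: the state names, event names, and the `run` helper are hypothetical, and no real MQTT or SIP traffic is generated.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    NOTIFIED = auto()
    IN_CALL = auto()


# (current state, event) -> (next state, protocol carrying the traffic)
TRANSITIONS = {
    (State.IDLE, "motion_detected"): (State.NOTIFIED, "MQTT"),
    (State.NOTIFIED, "user_opens_app"): (State.IN_CALL, "SIP"),
    (State.IN_CALL, "media_frame"): (State.IN_CALL, "RTP/SRTP"),
    (State.IN_CALL, "call_ends"): (State.IDLE, "SIP"),
    (State.IDLE, "clip_stored"): (State.IDLE, "MQTT"),
}


def run(events):
    """Replay events from IDLE and return the final state plus a protocol trace."""
    state = State.IDLE
    trace = []
    for event in events:
        key = (state, event)
        if key not in TRANSITIONS:
            raise ValueError(f"{event!r} is not valid in state {state.name}")
        state, protocol = TRANSITIONS[key]
        trace.append((event, protocol))
    return state, trace


final_state, trace = run(
    ["motion_detected", "user_opens_app", "media_frame", "call_ends", "clip_stored"]
)
for event, protocol in trace:
    print(f"{event:<16} via {protocol}")
```

Keeping the notification path (MQTT) and the media path (SIP plus RTP/SRTP) in separate transitions mirrors the hybrid design: media frames are rejected unless a SIP session is already active.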

1171.4.6 Security for Real-time IoT

Real-time audio and video streams require strong security to prevent eavesdropping:

| Security Layer | Protocol | Protects |
|----------------|----------|----------|
| Signaling Encryption | TLS (SIPS on port 5061) | Call setup, session details |
| Media Encryption | SRTP | Audio/video content |
| Key Exchange | ZRTP, DTLS-SRTP | Secure key negotiation |
| Authentication | Digest auth, certificates | Caller/device identity |

Warning: Security Best Practice for IoT Doorbells

Many consumer IoT devices have been compromised due to unencrypted streams. Always ensure:

  1. SRTP enabled - Not just RTP (encrypted vs plaintext audio/video)
  2. TLS for SIP - Port 5061, not 5060
  3. Strong authentication - Not default credentials
  4. End-to-end encryption - ZRTP negotiates keys directly between the endpoints, which prevents cloud provider access to content
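
The checklist can be turned into a simple configuration audit. The sketch below is a hypothetical illustration: `DoorbellConfig`, its field names, and the short list of default passwords are assumptions, not the settings of any real product.

```python
from dataclasses import dataclass

# Hypothetical examples of factory-default passwords, for illustration only.
DEFAULT_PASSWORDS = {"admin", "password", "12345", ""}


@dataclass
class DoorbellConfig:
    media_profile: str   # "RTP" or "SRTP"
    sip_port: int
    sip_transport: str   # "udp", "tcp" or "tls"
    password: str
    key_exchange: str    # e.g. "ZRTP" or "DTLS-SRTP"


def audit(cfg: DoorbellConfig) -> list:
    """Return one finding per checklist item that the configuration fails."""
    findings = []
    if cfg.media_profile.upper() != "SRTP":
        findings.append("media is not encrypted: enable SRTP")
    if cfg.sip_transport.lower() != "tls" or cfg.sip_port != 5061:
        findings.append("signaling is not SIPS over TLS on port 5061")
    if cfg.password.lower() in DEFAULT_PASSWORDS:
        findings.append("default or empty credentials in use")
    if cfg.key_exchange.upper() != "ZRTP":
        findings.append("key exchange is not ZRTP: intermediaries may see keys")
    return findings


weak = DoorbellConfig("RTP", 5060, "udp", "admin", "none")
for finding in audit(weak):
    print("FAIL:", finding)
```

A configuration that passes all four checks returns an empty list, so the same function could gate a provisioning step.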

1171.4.7 RTP Packet Structure

Understanding RTP helps diagnose audio/video issues in IoT deployments:

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|V=2|P|X|  CC   |M|     PT      |       Sequence Number         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                           Timestamp                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           Synchronization Source (SSRC) Identifier            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|            Contributing Source (CSRC) Identifiers             |
|                             ....                              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         Payload Data                          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

| Field | Bits | Purpose |
|-------|------|---------|
| V | 2 | Version (always 2) |
| P | 1 | Padding flag |
| X | 1 | Extension header present |
| CC | 4 | CSRC count |
| M | 1 | Marker (frame boundary) |
| PT | 7 | Payload type (codec identifier) |
| Sequence | 16 | Packet ordering/loss detection |
| Timestamp | 32 | Media timing (audio sample count) |
| SSRC | 32 | Stream identifier |
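
The field layout above maps directly onto a few bit operations. The following is a minimal parsing sketch using Python's `struct` module; `RtpHeader` and `parse_rtp_header` are hypothetical names, and the sample packet is synthetic.

```python
import struct
from collections import namedtuple

RtpHeader = namedtuple(
    "RtpHeader",
    "version padding extension csrc_count marker payload_type "
    "sequence timestamp ssrc csrcs header_length",
)


def parse_rtp_header(packet: bytes) -> RtpHeader:
    """Decode the fixed 12-byte RTP header and any CSRC identifiers."""
    if len(packet) < 12:
        raise ValueError("RTP packet shorter than the 12-byte fixed header")
    # Network byte order: two flag bytes, 16-bit sequence, 32-bit timestamp, 32-bit SSRC.
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
    version = b0 >> 6
    if version != 2:
        raise ValueError(f"unsupported RTP version {version}")
    cc = b0 & 0x0F
    header_length = 12 + 4 * cc  # excludes any extension header signalled by X
    if len(packet) < header_length:
        raise ValueError("packet truncated inside the CSRC list")
    csrcs = struct.unpack(f"!{cc}I", packet[12:header_length])
    return RtpHeader(
        version=version,
        padding=bool(b0 & 0x20),
        extension=bool(b0 & 0x10),
        csrc_count=cc,
        marker=bool(b1 & 0x80),
        payload_type=b1 & 0x7F,
        sequence=seq,
        timestamp=ts,
        ssrc=ssrc,
        csrcs=csrcs,
        header_length=header_length,
    )


# Synthetic packet: V=2, marker set, payload type 0, followed by 160 payload bytes.
sample = struct.pack("!BBHII", 0x80, 0x80, 1000, 160, 0x12345678) + b"\x00" * 160
header = parse_rtp_header(sample)
print(header.sequence, header.timestamp, hex(header.ssrc), header.header_length)
```

When diagnosing a stream, the sequence number and timestamp from consecutive packets are usually the first fields worth logging.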

RTP Header Overhead: 12 bytes minimum (vs CoAP’s 4 bytes, MQTT’s 2 bytes)

The larger header is justified for streaming media because:

  • Sequence numbers detect packet loss and reordering
  • Timestamps enable jitter buffering and synchronization
  • SSRC allows multiple streams in one session
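
As an illustration of the first point, the sketch below counts missing packets from a list of received sequence numbers, allowing for wraparound of the 16-bit counter. It is a simplification under stated assumptions: `count_lost` is a hypothetical helper, and a packet that arrives late is still counted as lost rather than being reordered.

```python
SEQ_MODULUS = 1 << 16  # RTP sequence numbers are 16 bits wide


def count_lost(sequence_numbers):
    """Count packets missing from a stream, allowing for 16-bit wraparound."""
    lost = 0
    expected = None
    for seq in sequence_numbers:
        if expected is not None:
            gap = (seq - expected) % SEQ_MODULUS
            if gap >= SEQ_MODULUS // 2:
                # Late or duplicate packet: ignore it and keep waiting for `expected`.
                continue
            lost += gap  # a forward gap means `gap` packets were skipped
        expected = (seq + 1) % SEQ_MODULUS
    return lost


# The counter wraps from 65535 to 0; packets 2 and 3 never arrive.
print(count_lost([65534, 65535, 0, 1, 4]))
```

A production receiver would track this per SSRC and combine it with a jitter buffer, so that slightly late packets can still be played out.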

1171.5 Summary Table: Quick Reference

| Criterion | CoAP | MQTT | RTP/SIP |
|-----------|------|------|---------|
| Best Use Case | Direct device queries | Event distribution | Audio/video streaming |
| Communication | Request-Response | Publish-Subscribe | Session-based streams |
| Transport | UDP (lightweight) | TCP (reliable) | UDP (low latency) |
| Power | Ultra-low | Low | Medium-High |
| Reliability | Optional | Built-in (QoS) | Lost packets skipped |
| Scalability | Good | Excellent | Per-session |
| Complexity | Low | Medium | High |
| Browser Support | Limited | Good (WebSockets) | WebRTC bridge |
| Setup | No broker needed | Requires broker | SIP server optional |
| Latency | Low | Medium | Ultra-low (<150 ms) |
| Data Type | Sensor data | Events/telemetry | Continuous media |

1171.6 Key Takeaways

Note: Protocol Selection Principles

  1. No single protocol is always best - Evaluate based on specific requirements
  2. CoAP excels in constrained environments - Direct, lightweight, low power
  3. MQTT excels in event-driven systems - Reliable, scalable, many-to-many
  4. VoIP/RTP for real-time media - Video doorbells and intercoms typically use SIP+RTP; cameras often use RTSP+RTP, and voice assistants often use WebRTC or proprietary stacks
  5. Hybrid approaches are common - Use both where appropriate (MQTT for notifications, RTP for live streams)
  6. Consider the entire system - Not just protocol features, but network, power, and architecture

Protocol Deep Dives:

  • MQTT - Pub/sub messaging
  • CoAP - RESTful IoT
  • AMQP - Enterprise messaging
  • XMPP - Presence protocol

Protocol Selection:

  • Protocol Selection Framework - Choosing protocols
  • IoT Protocols Overview - Full landscape

Architecture:

  • IoT Reference Models - Architecture layers
  • Edge Fog Computing - Protocol placement

Interactive Tools:

  • Simulations Hub - Protocol comparison tool

Learning Hubs:

  • Quiz Navigator - Protocol quizzes

The following figures from the CP IoT System Design Guide provide alternative visual representations of IoT application protocol concepts covered in this chapter.

Application Protocols Overview:

IoT application protocols diagram showing the relative positioning of MQTT, CoAP, HTTP, and AMQP in the network stack, with their transport layer dependencies (TCP vs UDP) and typical use cases in IoT systems

IoT Application Protocols showing MQTT, CoAP, HTTP, and AMQP positioning

CoAP vs MQTT Comparison:

Comparison table showing CoAP versus MQTT across dimensions including transport protocol (UDP vs TCP), messaging pattern (request-response vs publish-subscribe), effectiveness in LLNs, security mechanisms (DTLS vs SSL/TLS), and relative strengths and weaknesses for IoT applications

Detailed comparison between CoAP and MQTT protocols

Source: CP IoT System Design Guide, Chapter 4 - Application Protocols

1171.8 Summary

This chapter covered application layer protocols for IoT - the languages devices speak to exchange data:

  • MQTT: Publish-subscribe pattern over TCP, lightweight headers, QoS levels (0/1/2), ideal for event-driven telemetry and cloud connectivity
  • CoAP: RESTful (GET/PUT/POST/DELETE) over UDP, binary headers, supports observe/multicast, perfect for constrained devices with request-response patterns
  • HTTP/REST: Universal compatibility but high overhead, best for gateways and web integration
  • AMQP: Enterprise-grade message queuing with guaranteed delivery, transactions, and complex routing
  • VoIP/SIP/RTP: Real-time audio and video streaming over UDP with SIP session control (RFC 3261), RTP media transport (RFC 3550), and SRTP encryption for video doorbells, baby monitors, and voice assistants
  • Hybrid Architectures: Combine protocols (CoAP at edge, MQTT to cloud, RTP for live streams) to leverage each protocol’s strengths

Understanding protocol trade-offs enables optimal IoT system design matching communication patterns to device capabilities.

1171.8.1 IoT Application Protocol Selection (Variant View)

This decision framework guides protocol selection based on application requirements, device constraints, and communication patterns:

%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#E67E22', 'secondaryColor': '#16A085', 'tertiaryColor': '#7F8C8D'}}}%%
flowchart TD
    START(["Application Protocol<br/>Selection"])
    Q1{"Communication<br/>pattern?"}
    Q2{"Device<br/>constraints?"}
    Q3{"Reliability<br/>requirement?"}
    Q4{"Many subscribers?"}
    Q5{"Real-time<br/>media?"}

    MQTT["MQTT<br/>Publish-Subscribe"]
    COAP["CoAP<br/>Request-Response"]
    HTTP["HTTP/REST<br/>Web Integration"]
    AMQP["AMQP<br/>Enterprise Messaging"]
    RTP["RTP/SIP<br/>Real-Time Media"]

    MQTT_DETAILS["MQTT:<br/>• Publish-subscribe pattern<br/>• TCP transport<br/>• QoS 0/1/2 levels<br/>• 2-byte header minimum<br/>• Broker-based"]

    COAP_DETAILS["CoAP:<br/>• Request-response (REST)<br/>• UDP transport<br/>• 4-byte header<br/>• Observe pattern<br/>• Proxy-friendly"]

    HTTP_DETAILS["HTTP/REST:<br/>• Request-response<br/>• TCP transport<br/>• Large headers<br/>• Universal compatibility<br/>• Cacheable"]

    AMQP_DETAILS["AMQP:<br/>• Publish-subscribe + queues<br/>• TCP transport<br/>• Guaranteed delivery<br/>• Transactions support<br/>• Enterprise routing"]

    START --> Q1
    Q1 -->|"Events/Telemetry"| Q4
    Q1 -->|"Resource Access"| Q2
    Q1 -->|"Audio/Video"| Q5

    Q4 -->|"Yes (fan-out)"| MQTT
    Q4 -->|"No (point-to-point)"| Q3

    Q2 -->|"Constrained MCU"| COAP
    Q2 -->|"Gateway/PC"| HTTP

    Q3 -->|"Critical (guaranteed)"| AMQP
    Q3 -->|"Best effort OK"| MQTT

    Q5 -->|"Yes"| RTP

    MQTT --> MQTT_DETAILS
    COAP --> COAP_DETAILS
    HTTP --> HTTP_DETAILS
    AMQP --> AMQP_DETAILS

    style START fill:#7F8C8D,color:#fff
    style Q1 fill:#2C3E50,color:#fff
    style Q2 fill:#2C3E50,color:#fff
    style Q3 fill:#2C3E50,color:#fff
    style Q4 fill:#2C3E50,color:#fff
    style Q5 fill:#2C3E50,color:#fff
    style MQTT fill:#16A085,color:#fff
    style COAP fill:#E67E22,color:#fff
    style HTTP fill:#3498db,color:#fff
    style AMQP fill:#9b59b6,color:#fff
    style RTP fill:#c0392b,color:#fff
    style MQTT_DETAILS fill:#d4efdf,color:#2C3E50
    style COAP_DETAILS fill:#fdebd0,color:#2C3E50
    style HTTP_DETAILS fill:#d6eaf8,color:#2C3E50
    style AMQP_DETAILS fill:#ebdef0,color:#2C3E50

Figure 1171.5: IoT application protocol selection decision tree. Events/telemetry with many subscribers leads to MQTT (publish-subscribe, broker-based). Point-to-point with critical reliability leads to AMQP (guaranteed delivery, transactions); best-effort point-to-point leads to MQTT. Resource access on a constrained MCU leads to CoAP (RESTful over UDP, 4-byte header). Gateway/PC leads to HTTP (universal compatibility). Audio/video leads to RTP/SIP (real-time media streaming).
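
The decision tree in the figure can be written as a short function. This is a sketch of the branches shown in the flowchart only; the function name, argument names, and pattern strings are hypothetical.

```python
def select_protocol(pattern, many_subscribers=False, guaranteed=False, constrained=False):
    """Follow the flowchart branches and return the suggested protocol name."""
    if pattern == "audio_video":
        return "RTP/SIP"
    if pattern == "resource_access":
        # Constrained MCUs get CoAP; gateways and PCs can afford HTTP.
        return "CoAP" if constrained else "HTTP/REST"
    if pattern == "events_telemetry":
        if many_subscribers:
            return "MQTT"  # fan-out through a broker
        # Point-to-point: guaranteed delivery favours AMQP, best effort stays on MQTT.
        return "AMQP" if guaranteed else "MQTT"
    raise ValueError(f"unknown communication pattern: {pattern!r}")


print(select_protocol("events_telemetry", many_subscribers=True))
print(select_protocol("resource_access", constrained=True))
print(select_protocol("audio_video"))
```

Real selection involves more inputs than these four (network, power budget, existing infrastructure), so the function is a starting point for discussion rather than a rule.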

1171.8.2 Protocol Overhead Comparison (Variant View)

This visualization compares message overhead and efficiency across IoT application protocols:

%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#E67E22', 'secondaryColor': '#16A085', 'tertiaryColor': '#7F8C8D'}}}%%
graph TB
    subgraph Header["Protocol Message Overhead (20-byte payload)"]
        direction LR
        H1["Lower Overhead"]
        H2["→"]
        H3["Higher Overhead"]
    end

    subgraph Minimal["Minimal Overhead"]
        COAP_OH["CoAP:<br/>4-byte header<br/>24 bytes total<br/>83% efficiency"]
        MQTT_OH["MQTT (QoS 0):<br/>2-byte header + topic<br/>~30 bytes total<br/>67% efficiency"]
    end

    subgraph Moderate["Moderate Overhead"]
        MQTT_Q1["MQTT (QoS 1):<br/>~35 bytes total<br/>57% efficiency<br/>+ ACK message"]
        AMQP_OH["AMQP:<br/>8-byte header + framing<br/>~50 bytes total<br/>40% efficiency"]
    end

    subgraph Heavy["Heavy Overhead"]
        HTTP_OH["HTTP:<br/>~200+ bytes headers<br/>~220 bytes total<br/>9% efficiency"]
        XMPP_OH["XMPP:<br/>~280+ bytes XML<br/>~300 bytes total<br/>7% efficiency"]
    end

    subgraph Impact["Battery/Bandwidth Impact"]
        I1["Low power sensors:<br/>Use CoAP or MQTT QoS 0<br/>Minimize TX time"]
        I2["Reliable delivery:<br/>MQTT QoS 1/2 or AMQP<br/>Accept overhead"]
        I3["Web integration:<br/>HTTP at gateway<br/>Not on sensors"]
    end

    Minimal --> Impact
    Moderate --> Impact
    Heavy --> Impact

    style Header fill:#f9f9f9,stroke:#2C3E50
    style Minimal fill:#16A085,color:#fff
    style Moderate fill:#E67E22,color:#fff
    style Heavy fill:#c0392b,color:#fff
    style Impact fill:#7F8C8D,color:#fff
    style COAP_OH fill:#d4efdf,color:#2C3E50
    style MQTT_OH fill:#d4efdf,color:#2C3E50
    style MQTT_Q1 fill:#fdebd0,color:#2C3E50
    style AMQP_OH fill:#fdebd0,color:#2C3E50
    style HTTP_OH fill:#fadbd8,color:#2C3E50
    style XMPP_OH fill:#fadbd8,color:#2C3E50
    style I1 fill:#e8e8e8,color:#2C3E50
    style I2 fill:#e8e8e8,color:#2C3E50
    style I3 fill:#e8e8e8,color:#2C3E50

Figure 1171.6: Protocol overhead comparison for 20-byte payload. Minimal overhead (teal): CoAP at 4-byte header (24 bytes total, 83% efficiency), MQTT QoS 0 at ~30 bytes (67% efficiency). Moderate overhead (orange): MQTT QoS 1 at ~35 bytes (57% efficiency) with ACK, AMQP at ~50 bytes (40% efficiency). Heavy overhead (red): HTTP at ~220 bytes (9% efficiency), XMPP at ~300 bytes (7% efficiency). Impact guidance: low-power sensors should use CoAP/MQTT QoS 0, reliable delivery needs MQTT QoS 1/2 or AMQP, HTTP should be used at gateways not on sensors.
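
The efficiency percentages in the figure are the payload size divided by the total message size. The sketch below reproduces them from the figure's approximate totals; the dictionary and function names are hypothetical.

```python
PAYLOAD_BYTES = 20

# Approximate total message sizes in bytes, taken from the figure above.
TOTAL_BYTES = {
    "CoAP": 24,
    "MQTT (QoS 0)": 30,
    "MQTT (QoS 1)": 35,
    "AMQP": 50,
    "HTTP": 220,
    "XMPP": 300,
}


def efficiency(payload, total):
    """Return payload share of the message as a whole-number percentage."""
    return round(100 * payload / total)


for name, total in TOTAL_BYTES.items():
    print(f"{name:<13} {total:>4} B total  {efficiency(PAYLOAD_BYTES, total):>3}% efficiency")
```

Because the totals are approximations, the percentages are indicative. Real overhead also depends on topic length, header options, and the transport handshake.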


1171.9 Summary

This chapter explored real-time protocols for IoT audio and video applications:

Key topics:

  • RTP (Real-time Transport Protocol): UDP-based protocol for audio/video streaming with timing and sequencing
  • SIP (Session Initiation Protocol): Call setup, management, and teardown for VoIP systems
  • Port assignments: Standard ports for SIP (5060), RTP (dynamic), and related protocols
  • RTP vs MQTT comparison: When to use streaming protocols vs telemetry protocols
  • Security best practices: SRTP, TLS, authentication, and network isolation for IoT doorbells
  • Protocol selection principles: Quick reference tables and decision frameworks

1171.10 What’s Next?

Complete the application protocols series:

  • Worked Examples: Agricultural sensor network protocol selection case study

For related topics:

  • MQTT Fundamentals
  • CoAP Fundamentals and Architecture
  • Privacy and Security