43  Real-time Protocols

In 60 Seconds

Real-time IoT applications like video doorbells and intercoms require specialized protocols (RTP/SIP/WebRTC) that prioritize latency over reliability, running over UDP and skipping lost packets rather than retransmitting them. The key threshold is 150ms end-to-end delay for acceptable human perception – use RTP for continuous audio/video streams, but stick with MQTT/CoAP for discrete sensor telemetry.

43.1 Learning Objectives

By the end of this chapter, you will be able to:

  • Implement VoIP/SIP/RTP Architecture: Configure protocols for audio/video IoT applications and deploy a working signaling stack
  • Compare RTP vs MQTT: Select the appropriate protocol for real-time media vs telemetry data based on latency and reliability requirements
  • Design Secure Doorbell Systems: Apply security best practices including SRTP, TLS, and strong authentication for IoT audio/video devices
  • Calculate RTP Bandwidth: Determine bitrate requirements for audio codecs using packet size and interval formulas
  • Analyze the Latency Budget: Evaluate end-to-end delay across capture, encode, network, decode, and render stages to assess protocol suitability
  • Distinguish Protocol Roles: Justify why SIP handles session control separately from RTP media delivery in a VoIP stack

Real-time protocols ensure data arrives within strict time limits, which is essential for applications like factory automation and medical devices. When a robotic arm needs a command in under one millisecond, regular internet protocols are too slow. Real-time protocols guarantee timely delivery, even under heavy network load.

“My temperature reading can arrive a second late and nobody cares,” said Sammy the Sensor. “But what about a robot arm on a factory floor?”

Max the Microcontroller looked serious. “That robot arm needs its command in under ONE MILLISECOND. If the ‘stop’ command arrives 10 milliseconds late, the arm crashes into something. That’s why factories use real-time protocols with guaranteed timing. Regular MQTT or HTTP can’t promise when your message will arrive.”

“Think of it like traffic lights versus ambulance sirens,” explained Lila the LED. “Normal traffic (regular protocols) follows rules but can get stuck in jams. An ambulance (real-time protocol) gets guaranteed priority – other traffic has to move aside. Real-time protocols reserve bandwidth and prioritize time-critical messages.”

Bella the Battery added: “Real-time isn’t just about speed – it’s about predictability. A message that usually arrives in 1 ms but sometimes takes 500 ms is NOT real-time. Real-time means ALWAYS within the deadline. That’s why these protocols are used in medical devices, self-driving cars, and industrial robots where ‘usually fast’ isn’t good enough!”

43.2 Prerequisites

Before diving into this chapter, you should be familiar with:

Key Concepts

  • DDS (Data Distribution Service): A publish-subscribe middleware standard by OMG with deterministic QoS policies; used in robotics, autonomous vehicles, and industrial real-time IoT.
  • WebSocket: A full-duplex TCP-based protocol providing persistent connections with low overhead after initial HTTP upgrade; used for real-time browser-to-IoT gateway communication.
  • MQTT 5.0: The latest MQTT version adding message expiry, response topic, user properties, and shared subscriptions — improving real-time IoT application support over MQTT 3.1.1.
  • Jitter: Variation in message delivery latency; critical for real-time IoT applications — high jitter can violate timing requirements even when average latency is acceptable.
  • End-to-End Latency: Total delay from sensor measurement to actuator response including sensing, protocol processing, network transmission, broker/server processing, and actuation.
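
To make the jitter definition concrete, the following minimal sketch compares two latency traces that share the same average. The sample values and the helper name `summarize` are assumptions chosen for illustration, and jitter is reported simply as the spread between the slowest and fastest delivery.

```python
from statistics import mean

def summarize(latencies_ms):
    """Return (mean latency, worst case, peak-to-peak jitter) in milliseconds."""
    return mean(latencies_ms), max(latencies_ms), max(latencies_ms) - min(latencies_ms)

# Two made-up traces with the same 20 ms average latency.
steady = [19, 20, 21, 20, 20]
bursty = [5, 5, 5, 5, 80]

for name, trace in (("steady", steady), ("bursty", bursty)):
    avg, worst, jitter = summarize(trace)
    print(f"{name}: mean={avg:.0f} ms, worst={worst} ms, jitter={jitter} ms")
```

Both traces average 20 ms, yet only the steady one would meet a 25 ms deadline on every message.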

43.3 How This Chapter Fits

Chapter Series Navigation:

  1. Introduction and Why Lightweight Protocols Matter
  2. Protocol Overview and Comparison
  3. REST API Design for IoT
  4. Real-time Protocols (this chapter)
  5. Worked Examples

This chapter extends the protocol discussion to real-time audio/video applications like smart doorbells, intercoms, and video surveillance systems.


43.4 Real-time Protocols for IoT

While MQTT and CoAP handle most IoT data exchange scenarios, some applications require real-time audio and video streaming with strict latency requirements. Video doorbells, baby monitors, voice assistants, and intercom systems all need protocols designed specifically for continuous media streams.

43.4.1 VoIP and SIP Architecture

Voice over IP (VoIP) enables real-time voice and video communication over IP networks. The protocol stack for VoIP consists of several layers working together:

Layered VoIP protocol stack showing SIP (Session Initiation Protocol) at the application layer for call setup and teardown, RTP (Real-time Transport Protocol) for audio and video media delivery over UDP, SRTP for encrypted media, TLS for encrypted SIP signaling on port 5061, and ZRTP for key exchange between endpoints
Figure 43.1: VoIP Protocol Stack: SIP Signaling with RTP Media Transport

VoIP Protocol Stack: SIP handles session control (call setup/teardown), while RTP carries the actual audio/video data over UDP. Security layers (SRTP, TLS, ZRTP) protect both signaling and media streams.

Figure 43.2

43.4.2 Key Protocol Components

| Protocol | RFC | Purpose | IoT Relevance |
|---|---|---|---|
| SIP | RFC 3261 | Session Initiation Protocol - multimedia session control | Call setup for video doorbells, intercoms |
| RTP | RFC 3550 | Real-time Transport Protocol - media stream delivery | Audio/video streaming from cameras |
| RTCP | RFC 3550 | RTP Control Protocol - quality monitoring | Adaptive bitrate for constrained networks |
| UDP | RFC 768 | User Datagram Protocol - connectionless transport | Low-latency delivery (no TCP handshake) |
| SRTP | RFC 3711 | Secure RTP - encrypted media | Privacy for baby monitors, doorbells |
| ZRTP | RFC 6189 | Key exchange for SRTP | End-to-end encryption setup |
| TLS | RFC 8446 | Transport Layer Security - encrypted signaling | Secure SIP (SIPS) on port 5061 |

43.4.3 SIP Ports and Security

SIP Port Assignments
| Port | Protocol | Security | Use Case |
|---|---|---|---|
| 5060 | SIP over UDP/TCP | Unencrypted | Internal/trusted networks |
| 5061 | SIPS over TLS | Encrypted | Internet-facing devices |

For IoT devices accessible from the internet (video doorbells, remote intercoms), always use port 5061 with TLS encryption to prevent eavesdropping and session hijacking.

43.4.4 Real-time IoT Applications

Decision diagram for real-time IoT protocol selection showing two-way communication devices (doorbells, intercoms) using SIP plus RTP for bidirectional audio and video, one-way streaming devices (IP cameras, baby monitors) using RTSP plus RTP, and voice assistants using WebRTC for browser compatibility and cloud integration
Figure 43.3: Real-Time IoT Device Protocol Selection: SIP, RTSP, and WebRTC

Real-time Protocol Selection for IoT: Two-way communication devices (doorbells, intercoms) use SIP+RTP, while one-way streaming devices (cameras, monitors) often use RTSP+RTP. Voice assistants increasingly use WebRTC for browser compatibility.

Figure 43.4

Common IoT Use Cases:

| Device | Protocol | Why |
|---|---|---|
| Video Doorbell | SIP + SRTP | Two-way audio/video with visitor, needs secure encrypted stream |
| Baby Monitor | RTP (or RTSP) | One-way video stream, low latency critical for responsiveness |
| Smart Intercom | SIP + RTP | Full duplex audio between rooms, session-based communication |
| Voice Assistant | WebRTC or proprietary | Browser/app integration, cloud speech processing |
| Security Camera | RTSP + RTP | Continuous streaming, ONVIF standard interoperability |

43.4.5 Comparison: Real-time vs Messaging Protocols

When should you use VoIP/RTP instead of MQTT/CoAP?

| Requirement | MQTT/CoAP | VoIP/RTP |
|---|---|---|
| Data Type | Sensor readings, commands, telemetry | Continuous audio/video streams |
| Latency Tolerance | 100 ms to seconds acceptable | <150 ms required (human perception) |
| Packet Loss | Retransmit (reliability critical) | Skip/interpolate (continuity critical) |
| Bandwidth | Low (bytes to KB per message) | High (64 kbps - 2 Mbps continuous) |
| Connection Model | Message-based (discrete) | Session-based (continuous stream) |
| Typical Payload | JSON, CBOR, binary sensor data | PCM audio, H.264/H.265 video |

Let’s calculate the latency budget for a video doorbell and understand why RTP is required over MQTT.

Human perception threshold: Interactive speech requires \(L_{\text{max}} < 150\) ms end-to-end latency (ITU-T G.114 recommendation).

Latency budget breakdown for doorbell live view: \[L_{\text{total}} = L_{\text{capture}} + L_{\text{encode}} + L_{\text{network}} + L_{\text{decode}} + L_{\text{render}}\] \[= 30 \text{ ms} + 20 \text{ ms} + 50 \text{ ms} + 20 \text{ ms} + 16 \text{ ms} = 136 \text{ ms}\]

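Network budget: Only \(50\) ms is available for packet transmission. For MQTT over TCP, opening a connection costs a three-way handshake of about \(1.5 \times \text{RTT}\). With \(\text{RTT} = 40\) ms, that setup overhead is \(60\) ms, which alone exceeds the 50 ms network budget. On an already-open connection the handshake is not repeated, but a lost segment must be retransmitted and holds back all later data until it arrives, which typically adds one RTT or more to the affected frames.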

RTP over UDP: No handshake, no retransmit delays. One-way delay \(= \text{RTT}/2 = 20\) ms, well within budget.
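
The arithmetic above can be checked with a short sketch. The stage values, the 40 ms RTT, and the 1.5 × RTT setup cost are taken from the text; the variable names are illustrative assumptions.

```python
PERCEPTION_LIMIT_MS = 150  # ITU-T G.114 threshold cited above

# Per-stage budget from the breakdown above, in milliseconds.
budget_ms = {"capture": 30, "encode": 20, "network": 50, "decode": 20, "render": 16}
total_ms = sum(budget_ms.values())

rtt_ms = 40
tcp_handshake_ms = 1.5 * rtt_ms  # setup cost quoted in the text for a new TCP connection
udp_one_way_ms = rtt_ms / 2      # RTP over UDP: a single one-way trip

print(f"total budget: {total_ms} ms (limit {PERCEPTION_LIMIT_MS} ms)")
print(f"TCP setup {tcp_handshake_ms:.0f} ms fits network budget: "
      f"{tcp_handshake_ms <= budget_ms['network']}")
print(f"UDP one-way {udp_one_way_ms:.0f} ms fits network budget: "
      f"{udp_one_way_ms <= budget_ms['network']}")
```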

RTP bandwidth: G.711 audio codec at 64 kbps uses 20 ms packets (160 bytes of PCM) with a 12-byte RTP header: \[B_{\text{RTP}} = \frac{(12 + 160) \times 8}{0.020 \text{ s}} = 68{,}800 \text{ bps} \approx 69 \text{ kbps}\]

RTP’s low-latency, connectionless design is essential for real-time media.

Use RTP/SRTP when:

  • Streaming continuous audio/video (doorbell live view)
  • Two-way voice communication (intercom, doorbell talk)
  • Latency under 150ms is critical
  • You need synchronized audio/video playback

Use MQTT when:

  • Sending audio clips/recordings (doorbell motion event)
  • Push notifications with audio alert
  • Speech-to-text results from cloud processing
  • Audio metadata (noise level, voice detection events)

Hybrid approach (common in smart doorbells):

Motion detected → MQTT notification to phone app
User opens app → SIP session established
Live video/audio → RTP/SRTP stream
User speaks → RTP audio to doorbell speaker
Call ends → SIP session terminated
Event recorded → Video clip stored, MQTT notification
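
The selection rule behind this hybrid flow can be sketched as a small function. This is a rule of thumb under stated assumptions, not a complete decision procedure: the `Traffic` type, the `choose_protocol` name, and the latency tolerances given to the two notification flows are hypothetical, and SIP signaling is left out because it only sets up the session.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Traffic:
    name: str
    continuous_stream: bool  # ongoing audio/video rather than discrete messages
    max_latency_ms: int      # how late the data may arrive and still be useful

def choose_protocol(t: Traffic) -> str:
    """Live media with a tight deadline goes over RTP/SRTP, everything else over MQTT."""
    if t.continuous_stream and t.max_latency_ms <= 150:
        return "RTP/SRTP"
    return "MQTT"

flows = [
    Traffic("motion notification", continuous_stream=False, max_latency_ms=2000),
    Traffic("live video/audio", continuous_stream=True, max_latency_ms=150),
    Traffic("talk-back audio", continuous_stream=True, max_latency_ms=150),
    Traffic("clip-recorded notification", continuous_stream=False, max_latency_ms=5000),
]

for t in flows:
    print(f"{t.name}: {choose_protocol(t)}")
```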

43.4.6 Security for Real-time IoT

Real-time audio and video streams require strong security to prevent eavesdropping:

| Security Layer | Protocol | Protects |
|---|---|---|
| Signaling Encryption | TLS (SIPS on port 5061) | Call setup, session details |
| Media Encryption | SRTP | Audio/video content |
| Key Exchange | ZRTP, DTLS-SRTP | Secure key negotiation |
| Authentication | Digest auth, certificates | Caller/device identity |

Security Best Practice for IoT Doorbells

Many consumer IoT devices have been compromised due to unencrypted streams. Always ensure:

  1. SRTP enabled - Not just RTP (encrypted vs plaintext audio/video)
  2. TLS for SIP - Port 5061, not 5060
  3. Strong authentication - Not default credentials
  4. End-to-end encryption - ZRTP prevents cloud provider access to content
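
The four checks can be expressed as a small configuration audit. This is a minimal sketch: the configuration keys (`srtp_enabled`, `sip_port`, `sip_tls`, `password`, `zrtp_enabled`) and the list of default passwords are hypothetical and do not correspond to any particular device or SIP stack.

```python
# Hypothetical examples of factory-default passwords.
DEFAULT_PASSWORDS = {"admin", "password", "1234", "0000"}

def audit_doorbell_config(cfg: dict) -> list:
    """Return a list of findings; an empty list means all four checks passed."""
    findings = []
    if not cfg.get("srtp_enabled", False):
        findings.append("media is plain RTP; enable SRTP")
    if cfg.get("sip_port") != 5061 or not cfg.get("sip_tls", False):
        findings.append("SIP signaling is not on TLS port 5061")
    password = cfg.get("password")
    if not password or password in DEFAULT_PASSWORDS:
        findings.append("missing or default credentials")
    if not cfg.get("zrtp_enabled", False):
        findings.append("no end-to-end key exchange (ZRTP) configured")
    return findings

hardened = {"srtp_enabled": True, "sip_port": 5061, "sip_tls": True,
            "password": "k7!vQ2-long-unique", "zrtp_enabled": True}
factory = {"srtp_enabled": False, "sip_port": 5060, "sip_tls": False,
           "password": "admin", "zrtp_enabled": False}

print(audit_doorbell_config(hardened))
for finding in audit_doorbell_config(factory):
    print("-", finding)
```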

43.4.7 RTP Packet Structure

Understanding RTP helps diagnose audio/video issues in IoT deployments:

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|V=2|P|X|  CC   |M|     PT      |       Sequence Number         |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                           Timestamp                           |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|           Synchronization Source (SSRC) Identifier            |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|            Contributing Source (CSRC) Identifiers             |
|                             ....                              |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                         Payload Data                          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

| Field | Bits | Purpose |
|---|---|---|
| V | 2 | Version (always 2) |
| P | 1 | Padding flag |
| X | 1 | Extension header present |
| CC | 4 | CSRC count |
| M | 1 | Marker (frame boundary) |
| PT | 7 | Payload type (codec identifier) |
| Sequence | 16 | Packet ordering/loss detection |
| Timestamp | 32 | Media timing (audio sample count) |
| SSRC | 32 | Stream identifier |
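
The field layout can be decoded with Python's standard `struct` module. The sketch below handles only the 12-byte fixed header and ignores CSRC entries and header extensions; the function name `parse_rtp_header` and the sample packet are illustrative assumptions.

```python
import struct

def parse_rtp_header(packet: bytes) -> dict:
    """Decode the 12-byte fixed RTP header (RFC 3550) into named fields."""
    if len(packet) < 12:
        raise ValueError("RTP packet shorter than the 12-byte fixed header")
    # Network byte order: two single bytes, 16-bit sequence, 32-bit timestamp, 32-bit SSRC.
    b0, b1, sequence, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    return {
        "version": b0 >> 6,
        "padding": (b0 >> 5) & 0x1,
        "extension": (b0 >> 4) & 0x1,
        "csrc_count": b0 & 0x0F,
        "marker": b1 >> 7,
        "payload_type": b1 & 0x7F,
        "sequence": sequence,
        "timestamp": timestamp,
        "ssrc": ssrc,
    }

# Sample packet: V=2, payload type 0 (G.711 u-law), sequence 1, timestamp 160,
# followed by 160 bytes of placeholder payload.
sample = struct.pack("!BBHII", 0x80, 0x00, 1, 160, 0x12345678) + bytes(160)
header = parse_rtp_header(sample)
print(header)
```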

RTP Header Overhead: 12 bytes minimum (vs CoAP’s 4 bytes, MQTT’s 2 bytes)

The larger header is justified for streaming media because:

  • Sequence numbers detect packet loss and reordering
  • Timestamps enable jitter buffering and synchronization
  • SSRC allows multiple streams in one session
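
As an illustration of the first point, a receiver can count missing packets from the sequence numbers alone. This is a simplified sketch, not a full receiver: the function name `count_lost` is an assumption, and a packet that arrives late stays counted as lost because it has already missed its turn.

```python
def count_lost(sequence_numbers):
    """Count packets missing from an arrival order, allowing 16-bit wraparound."""
    lost = 0
    expected = None
    for seq in sequence_numbers:
        if expected is None:
            expected = seq
        gap = (seq - expected) & 0xFFFF  # distance modulo 65536
        if gap < 0x8000:
            # In-order arrival; gap is the number of packets skipped before it.
            lost += gap
            expected = (seq + 1) & 0xFFFF
        # Otherwise the packet is a duplicate or arrived late; ignore it here.
    return lost

# Sequence numbers wrap from 65535 to 0; packets 2 and 3 never arrive.
print(count_lost([65534, 65535, 0, 1, 4]))
```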

43.4.8 Interactive: RTP Bandwidth Calculator
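
A minimal, non-interactive sketch of the calculation, assuming the formula from Section 43.4.5: bitrate equals header plus payload bytes, times 8, divided by the packet interval. The function names are illustrative. The 40-byte header variant, which adds the 8-byte UDP and 20-byte IPv4 headers to the 12-byte RTP header, goes beyond the chapter's formula and is included only for comparison.

```python
def rtp_bitrate_bps(payload_bytes: int, packet_interval_ms: int,
                    header_bytes: int = 12) -> float:
    """Bits per second on the wire for one RTP stream."""
    return (header_bytes + payload_bytes) * 8 * 1000 / packet_interval_ms

def g711_payload_bytes(packet_interval_ms: int) -> int:
    """G.711 produces 64 kbps, i.e. 8 bytes of audio per millisecond."""
    return 8 * packet_interval_ms

for interval_ms in (10, 20, 30):
    payload = g711_payload_bytes(interval_ms)
    rtp_only = rtp_bitrate_bps(payload, interval_ms)
    with_ip_udp = rtp_bitrate_bps(payload, interval_ms, header_bytes=12 + 8 + 20)
    print(f"{interval_ms} ms packets: {rtp_only / 1000:.1f} kbps RTP only, "
          f"{with_ip_udp / 1000:.1f} kbps including UDP/IPv4 headers")
```

Shorter packet intervals lower the packetization delay but raise the share of bandwidth spent on headers.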

43.5 Worked Example: Designing a Video Doorbell Protocol Stack

Scenario: A smart home company is designing a video doorbell that must stream 720p video and two-way audio to the homeowner’s phone app. The doorbell connects via Wi-Fi (802.11n, 20 MHz channel) with an average round-trip latency of 80ms to the cloud relay server. The app must display live video within 300ms of motion detection. Calculate the bandwidth requirements and evaluate whether the link can support the stream.

Step 1: Calculate Minimum Bandwidth Requirements

| Stream | Codec | Bitrate | Direction |
|---|---|---|---|
| Video (720p, 15 fps) | H.264 Baseline | 1.2 Mbps | Doorbell to phone |
| Audio (voice) | G.711 u-law | 64 kbps | Doorbell to phone |
| Audio (talk-back) | G.711 u-law | 64 kbps | Phone to doorbell |
| RTP headers | 12 bytes/packet | ~48 kbps (video) + ~13 kbps (audio) | Both directions |
| SRTP overhead | 10 bytes/packet | ~40 kbps (video) + ~11 kbps (audio) | Both directions |
| Total upstream | | ~1.37 Mbps | Doorbell to cloud |
| Total downstream | | ~0.13 Mbps | Cloud to doorbell |
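
Per-packet overhead scales with packet rate, not with bitrate. The sketch below reproduces the video overhead figures under one assumption that the text does not state: a video packet rate of 500 packets per second, inferred here because it makes 12 and 10 bytes per packet come out at the table's ~48 and ~40 kbps.

```python
def header_overhead_kbps(packets_per_second: float, bytes_per_packet: int) -> float:
    """Bandwidth spent on a fixed per-packet header, in kbps."""
    return packets_per_second * bytes_per_packet * 8 / 1000

VIDEO_PPS = 500  # assumption: inferred to match the table, not given in the scenario

rtp_video_kbps = header_overhead_kbps(VIDEO_PPS, 12)   # RTP fixed header
srtp_video_kbps = header_overhead_kbps(VIDEO_PPS, 10)  # SRTP overhead per packet

# Upstream subtotal before the (much smaller) audio header overhead is added.
upstream_subtotal_kbps = 1200 + 64 + rtp_video_kbps + srtp_video_kbps

print(f"RTP video overhead:  {rtp_video_kbps:.0f} kbps")
print(f"SRTP video overhead: {srtp_video_kbps:.0f} kbps")
print(f"upstream subtotal:   {upstream_subtotal_kbps / 1000:.2f} Mbps")
```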

Step 2: Evaluate Wi-Fi Link Capacity

802.11n (20 MHz, 1 spatial stream):
  PHY rate: 72 Mbps (MCS7)
  Typical throughput: ~35 Mbps (50% MAC efficiency)
  Doorbell upload: 1.37 Mbps = 3.9% of available capacity

Verdict: Comfortable margin (96% headroom)
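
The utilization figure is a single division, sketched here with the values from the text. The function name `utilization_percent` is illustrative.

```python
def utilization_percent(stream_mbps: float, usable_mbps: float) -> float:
    """Share of usable link throughput consumed by the stream, in percent."""
    return 100.0 * stream_mbps / usable_mbps

USABLE_MBPS = 35             # ~50% of the 72 Mbps PHY rate, rounded as in the text
DOORBELL_UPLOAD_MBPS = 1.37  # total upstream from Step 1

used = utilization_percent(DOORBELL_UPLOAD_MBPS, USABLE_MBPS)
print(f"utilization: {used:.1f}%, headroom: {100 - used:.0f}%")
```

This treats the doorbell as the only station on the channel; other devices sharing the access point reduce the usable throughput.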

Step 3: Verify End-to-End Latency Budget

The 300ms target must accommodate the full pipeline:

| Stage | Latency | Running Total |
|---|---|---|
| Camera capture (1 frame at 15 fps) | 67 ms | 67 ms |
| H.264 encoding (hardware encoder) | 20 ms | 87 ms |
| RTP packetization | 2 ms | 89 ms |
| Wi-Fi transmission + contention | 15 ms | 104 ms |
| Doorbell to cloud relay (half RTT) | 40 ms | 144 ms |
| Cloud relay to phone (half RTT) | 50 ms | 194 ms |
| Jitter buffer (1 frame at 15 fps) | 67 ms | 261 ms |
| H.264 decode + render | 15 ms | 276 ms |

Result: 276 ms is within the 300 ms target, with a 24 ms margin. However, this assumes ideal network conditions. Under Wi-Fi contention (multiple devices), the Wi-Fi stage could grow from 15 ms to 40-80 ms, pushing total latency to roughly 300-340 ms, which is at or beyond the target.
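
The running totals and the contention scenario can be reproduced with a short sketch. Stage values come from the table in this step; the helper name `total_latency_ms` is an illustrative assumption.

```python
from itertools import accumulate

TARGET_MS = 300
STAGES_MS = [
    ("Camera capture", 67),
    ("H.264 encoding", 20),
    ("RTP packetization", 2),
    ("Wi-Fi transmission", 15),
    ("Doorbell to cloud relay", 40),
    ("Cloud relay to phone", 50),
    ("Jitter buffer", 67),
    ("H.264 decode + render", 15),
]

def total_latency_ms(stages, wifi_ms=None):
    """Sum all stages, optionally overriding the Wi-Fi stage to model contention."""
    return sum(
        wifi_ms if wifi_ms is not None and name.startswith("Wi-Fi") else ms
        for name, ms in stages
    )

running_totals = list(accumulate(ms for _, ms in STAGES_MS))
print("running totals:", running_totals)

for wifi_ms in (15, 40, 80):
    total = total_latency_ms(STAGES_MS, wifi_ms)
    verdict = "within" if total <= TARGET_MS else "over"
    print(f"Wi-Fi stage {wifi_ms} ms: total {total} ms ({verdict} {TARGET_MS} ms target)")
```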

Step 4: Protocol Selection Decision

| Component | Protocol Choice | Rationale |
|---|---|---|
| Call signaling | SIP over TLS (port 5061) | Standard call setup, encrypted |
| Video stream | RTP over UDP | Low latency, skip lost frames |
| Audio stream | RTP over UDP (separate SSRC) | Synchronized with video |
| Encryption | SRTP (AES-128-CM) | Encrypts media without TCP overhead |
| Motion alerts | MQTT QoS 1 (port 8883) | Reliable push notification |
| Clip storage | HTTPS POST | Reliability critical for recordings |

Key Insight: The doorbell uses three protocols simultaneously: MQTT for event notifications (< 1 kbps, TCP), SRTP for live streaming (1.37 Mbps, UDP), and HTTPS for clip uploads (burst, TCP). Each matches its data type: discrete events need reliability (TCP), continuous media needs low latency (UDP), and file uploads need guaranteed delivery (TCP). Trying to force all traffic through a single protocol would compromise either latency or reliability.

Common Pitfalls

MQTT provides no timing guarantees — broker processing delay and TCP retransmission can add hundreds of milliseconds. For soft real-time (human-interface) it works; for hard real-time (motor control) it does not.

Protocol latency measured in a lab with direct connections differs significantly from real deployments with broker load, multiple concurrent clients, and network congestion. Always measure end-to-end latency under realistic load.

Real-time high-frequency data (100 Hz sensor sampling) generates large message queues if consumers are slower than producers. Implement queue depth monitoring and drop policies for real-time data that becomes stale.
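
One way to combine queue depth monitoring with a drop policy is a bounded queue that evicts the oldest sample when full and discards stale samples on read. This is a minimal single-threaded sketch; the class name `FreshQueue`, its parameters, and the millisecond timestamps are assumptions for illustration.

```python
import time
from collections import deque

class FreshQueue:
    """Bounded FIFO that drops the oldest sample when full and skips stale ones on read."""

    def __init__(self, max_depth: int, max_age_ms: float):
        self._items = deque(maxlen=max_depth)
        self._max_age_ms = max_age_ms
        self.dropped = 0  # samples lost to overflow or staleness

    def put(self, value, now_ms=None):
        now_ms = time.monotonic() * 1000 if now_ms is None else now_ms
        if len(self._items) == self._items.maxlen:
            self.dropped += 1  # the deque evicts its oldest entry on append
        self._items.append((now_ms, value))

    def get(self, now_ms=None):
        now_ms = time.monotonic() * 1000 if now_ms is None else now_ms
        while self._items:
            stamped_ms, value = self._items.popleft()
            if now_ms - stamped_ms <= self._max_age_ms:
                return value
            self.dropped += 1  # too old to be useful
        return None

    def depth(self) -> int:
        return len(self._items)

# A 100 Hz producer (one sample every 10 ms) feeding a consumer that reads late.
q = FreshQueue(max_depth=3, max_age_ms=50)
for i in range(5):
    q.put(i, now_ms=i * 10)
first = q.get(now_ms=80)
print(f"read {first}, depth {q.depth()}, dropped {q.dropped}")
```

Explicit timestamps are passed in the demo so the result does not depend on wall-clock timing; in a real consumer the defaults would be used.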

43.6 What’s Next?

| Chapter | Focus | Why Read It |
|---|---|---|
| Worked Examples | Agricultural sensor network protocol selection case study | Apply protocol selection principles to a complete real-world IoT design scenario |
| MQTT Fundamentals | Pub/sub messaging architecture, QoS levels, broker configuration | Deepen your understanding of the telemetry protocol used alongside RTP in hybrid systems |
| CoAP Fundamentals and Architecture | RESTful IoT protocol over UDP, observe pattern, constrained devices | Compare CoAP’s request-response model against RTP’s continuous streaming model |
| Protocol Overview and Comparison | Side-by-side comparison of all major IoT application protocols | Consolidate your protocol selection skills with a comprehensive comparison framework |
| Transport Protocols | UDP, TCP, QUIC trade-offs for IoT | Understand why RTP’s choice of UDP is fundamental to its real-time performance |
| Privacy and Security | IoT threat landscape, attack vectors, and defenses | Extend the SRTP/TLS security concepts from this chapter to the full IoT security architecture |