Real-time IoT applications like video doorbells and intercoms require specialized protocols (RTP/SIP/WebRTC) that prioritize low latency over reliability: they run over UDP and skip lost packets rather than retransmitting them. The key threshold is roughly 150 ms of one-way, end-to-end delay for acceptable conversational quality. Use RTP for continuous audio/video streams, but stick with MQTT/CoAP for discrete sensor telemetry.
43.1 Learning Objectives
By the end of this chapter, you will be able to:
Implement VoIP/SIP/RTP Architecture: Configure protocols for audio/video IoT applications and deploy a working signaling stack
Compare RTP vs MQTT: Select the appropriate protocol for real-time media vs telemetry data based on latency and reliability requirements
Design Secure Doorbell Systems: Apply security best practices including SRTP, TLS, and strong authentication for IoT audio/video devices
Calculate RTP Bandwidth: Determine bitrate requirements for audio codecs using packet size and interval formulas
Analyze the Latency Budget: Evaluate end-to-end delay across capture, encode, network, decode, and render stages to assess protocol suitability
Distinguish Protocol Roles: Justify why SIP handles session control separately from RTP media delivery in a VoIP stack
For Beginners: Real-Time Protocols
Real-time protocols ensure data arrives within strict time limits, which is essential for applications like factory automation and medical devices. When a robotic arm needs a command in under one millisecond, regular internet protocols are too slow. Real-time protocols guarantee timely delivery, even under heavy network load.
Sensor Squad: When Timing is Everything
“My temperature reading can arrive a second late and nobody cares,” said Sammy the Sensor. “But what about a robot arm on a factory floor?”
Max the Microcontroller looked serious. “That robot arm needs its command in under ONE MILLISECOND. If the ‘stop’ command arrives 10 milliseconds late, the arm crashes into something. That’s why factories use real-time protocols with guaranteed timing. Regular MQTT or HTTP can’t promise when your message will arrive.”
“Think of it like traffic lights versus ambulance sirens,” explained Lila the LED. “Normal traffic (regular protocols) follows rules but can get stuck in jams. An ambulance (real-time protocol) gets guaranteed priority – other traffic has to move aside. Real-time protocols reserve bandwidth and prioritize time-critical messages.”
Bella the Battery added: “Real-time isn’t just about speed – it’s about predictability. A message that usually arrives in 1 ms but sometimes takes 500 ms is NOT real-time. Real-time means ALWAYS within the deadline. That’s why these protocols are used in medical devices, self-driving cars, and industrial robots where ‘usually fast’ isn’t good enough!”
43.2 Prerequisites
Before diving into this chapter, you should be familiar with:
DDS (Data Distribution Service): A publish-subscribe middleware standard by OMG with deterministic QoS policies; used in robotics, autonomous vehicles, and industrial real-time IoT.
WebSocket: A full-duplex TCP-based protocol providing persistent connections with low overhead after initial HTTP upgrade; used for real-time browser-to-IoT gateway communication.
MQTT 5.0: The latest MQTT version adding message expiry, response topic, user properties, and shared subscriptions — improving real-time IoT application support over MQTT 3.1.1.
Jitter: Variation in message delivery latency; critical for real-time IoT applications — high jitter can violate timing requirements even when average latency is acceptable.
End-to-End Latency: Total delay from sensor measurement to actuator response including sensing, protocol processing, network transmission, broker/server processing, and actuation.
This chapter extends the protocol discussion to real-time audio/video applications like smart doorbells, intercoms, and video surveillance systems.
43.4 Real-time Protocols for IoT
While MQTT and CoAP handle most IoT data exchange scenarios, some applications require real-time audio and video streaming with strict latency requirements. Video doorbells, baby monitors, voice assistants, and intercom systems all need protocols designed specifically for continuous media streams.
43.4.1 VoIP and SIP Architecture
Voice over IP (VoIP) enables real-time voice and video communication over IP networks. The protocol stack for VoIP consists of several layers working together:
Figure 43.1: VoIP Protocol Stack: SIP Signaling with RTP Media Transport
VoIP Protocol Stack: SIP handles session control (call setup/teardown), while RTP carries the actual audio/video data over UDP. Security layers (SRTP, TLS, ZRTP) protect both signaling and media streams.
Figure 43.2
43.4.2 Key Protocol Components
| Protocol | RFC | Purpose | IoT Relevance |
|---|---|---|---|
| SIP | RFC 3261 | Session Initiation Protocol: multimedia session control | Call setup for video doorbells, intercoms |
| RTP | RFC 3550 | Real-time Transport Protocol: media stream delivery | Audio/video streaming from cameras |
| RTCP | RFC 3550 | RTP Control Protocol: quality monitoring | Adaptive bitrate for constrained networks |
| UDP | RFC 768 | User Datagram Protocol: connectionless transport | Low-latency delivery (no TCP handshake) |
| SRTP | RFC 3711 | Secure RTP: encrypted media | Privacy for baby monitors, doorbells |
| ZRTP | RFC 6189 | Key exchange for SRTP | End-to-end encryption setup |
| TLS | RFC 8446 | Transport Layer Security: encrypted signaling | Secure SIP (SIPS) on port 5061 |
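The "quality monitoring" role of RTCP can be made concrete with the interarrival jitter estimate that receivers report. The sketch below follows the running-average formula from RFC 3550 (section 6.4.1); the function names and the sample packets are illustrative assumptions, and timestamp wraparound is deliberately ignored to keep it short.

```python
def interarrival_jitter(packets, clock_rate=8000):
    """Running jitter estimate in RTP timestamp units (RFC 3550, section 6.4.1).

    packets: iterable of (arrival_time_seconds, rtp_timestamp) pairs in arrival order.
    clock_rate: RTP clock in Hz (8000 for G.711 audio).
    Note: 32-bit timestamp wraparound is not handled in this sketch.
    """
    jitter = 0.0
    prev_transit = None
    for arrival_s, rtp_ts in packets:
        # Relative transit time: arrival converted to timestamp units minus the RTP timestamp.
        transit = arrival_s * clock_rate - rtp_ts
        if prev_transit is not None:
            d = abs(transit - prev_transit)
            # Exponential smoothing with gain 1/16, as specified by the RFC.
            jitter += (d - jitter) / 16.0
        prev_transit = transit
    return jitter


def jitter_ms(jitter_units, clock_rate=8000):
    """Convert a jitter value from timestamp units to milliseconds."""
    return jitter_units / clock_rate * 1000.0


# Hypothetical G.711 stream: 20 ms packets, the second one arrives 10 ms late.
late_stream = [(0.00, 0), (0.03, 160), (0.04, 320)]
print(round(jitter_ms(interarrival_jitter(late_stream)), 3), "ms")
```

A sender that sees this value rising in receiver reports can lower its bitrate, which is the adaptive behaviour the table refers to.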
43.4.3 SIP Ports and Security
SIP Port Assignments
| Port | Protocol | Security | Use Case |
|---|---|---|---|
| 5060 | SIP over UDP/TCP | Unencrypted | Internal/trusted networks |
| 5061 | SIPS over TLS | Encrypted | Internet-facing devices |
For IoT devices accessible from the internet (video doorbells, remote intercoms), always use port 5061 with TLS encryption to prevent eavesdropping and session hijacking.
Real-time Protocol Selection for IoT: Two-way communication devices (doorbells, intercoms) use SIP+RTP, while one-way streaming devices (cameras, monitors) often use RTSP+RTP. Voice assistants increasingly use WebRTC for browser compatibility.
Figure 43.4
Common IoT Use Cases:
| Device | Protocol | Why |
|---|---|---|
| Video Doorbell | SIP + SRTP | Two-way audio/video with visitor, needs secure encrypted stream |
| Baby Monitor | RTP (or RTSP) | One-way video stream, low latency critical for responsiveness |
| Smart Intercom | SIP + RTP | Full duplex audio between rooms, session-based communication |
| Voice Assistant | WebRTC or proprietary | Browser/app integration, cloud speech processing |
| Security Camera | RTSP + RTP | Continuous streaming, ONVIF standard interoperability |
43.4.5 Comparison: Real-time vs Messaging Protocols
When should you use VoIP/RTP instead of MQTT/CoAP?
| Requirement | MQTT/CoAP | VoIP/RTP |
|---|---|---|
| Data Type | Sensor readings, commands, telemetry | Continuous audio/video streams |
| Latency Tolerance | 100 ms to seconds acceptable | <150 ms required (human perception) |
| Packet Loss | Retransmit (reliability critical) | Skip/interpolate (continuity critical) |
| Bandwidth | Low (bytes to KB per message) | High (64 kbps to 2 Mbps continuous) |
| Connection Model | Message-based (discrete) | Session-based (continuous stream) |
| Typical Payload | JSON, CBOR, binary sensor data | PCM audio, H.264/H.265 video |
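The "skip/interpolate" entry in the Packet Loss row can be sketched as follows. This is a minimal illustration, not a production receiver: it assumes packets are handed over as `(sequence_number, payload)` pairs, and the names `seq_gap` and `conceal_losses` are hypothetical. Lost packets are concealed by repeating the last payload instead of waiting for a retransmission, and the 16-bit RTP sequence number is compared modulo 65536 so wraparound is handled.

```python
def seq_gap(prev_seq, seq):
    """Forward distance between two 16-bit RTP sequence numbers (modulo 65536)."""
    return (seq - prev_seq) & 0xFFFF


def conceal_losses(packets):
    """Return a gap-free playout list from (seq, payload) pairs in arrival order.

    Missing packets are replaced by a copy of the previous payload
    (a very simple concealment strategy); duplicates and late packets are dropped.
    """
    out = []
    prev_seq = prev_payload = None
    for seq, payload in packets:
        if prev_seq is not None:
            gap = seq_gap(prev_seq, seq)
            if gap == 0 or gap >= 0x8000:
                continue  # duplicate or late packet: its playout slot has passed
            out.extend([prev_payload] * (gap - 1))  # fill each lost slot
        out.append(payload)
        prev_seq, prev_payload = seq, payload
    return out


# Sequence number 0 is lost across the 65535 -> 0 wraparound.
print(conceal_losses([(65534, "a"), (65535, "b"), (1, "d")]))
```

Real codecs use smarter concealment (interpolation or codec-specific loss recovery), but the control flow is the same: playback never blocks on a missing packet.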
Putting Numbers to It
Let’s calculate the latency budget for a video doorbell and understand why RTP is required over MQTT.
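Human perception threshold: Interactive speech requires a one-way (mouth-to-ear) latency of \(L_{\text{max}} < 150\) ms (ITU-T G.114 recommendation).
Network budget: Only \(50\) ms of that total is available for packet transmission. For MQTT over TCP, opening a connection costs a three-way handshake of about \(1.5 \times \text{RTT}\). With \(\text{RTT} = 40\) ms, TCP setup overhead \(= 60\) ms, exceeding the 50 ms network budget. On an already established connection the handshake is paid only once, but every lost segment still stalls the data behind it until the retransmission arrives (head-of-line blocking), which adds at least one more RTT.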
RTP over UDP: No handshake, no retransmit delays. One-way delay \(= \text{RTT}/2 = 20\) ms, well within budget.
RTP bandwidth: G.711 audio codec at 64 kbps uses 20 ms packets (160 bytes of PCM) with a 12-byte RTP header: \[B_{\text{RTP}} = \frac{(12 + 160) \times 8}{0.020 \text{ s}} = 68{,}800 \text{ bps} \approx 69 \text{ kbps}\]
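The same calculation can be sketched in code. The function name and parameters below are illustrative assumptions; the second call additionally counts the 8-byte UDP and 20-byte IPv4 headers, which the formula above leaves out, to show what the stream costs on the wire.

```python
def rtp_bitrate_bps(payload_bytes, interval_ms, rtp_header_bytes=12, lower_header_bytes=0):
    """Bitrate of a constant-rate RTP stream in bits per second.

    payload_bytes: codec payload per packet (160 for 20 ms of G.711).
    interval_ms: packetization interval in milliseconds.
    lower_header_bytes: optional UDP/IP overhead per packet (0 = RTP level only).
    """
    packet_bits = (payload_bytes + rtp_header_bytes + lower_header_bytes) * 8
    return packet_bits * 1000 / interval_ms


print(rtp_bitrate_bps(160, 20))                         # RTP header + payload only
print(rtp_bitrate_bps(160, 20, lower_header_bytes=28))  # plus UDP (8) + IPv4 (20)
```

Shorter packet intervals lower latency but raise the header share of each packet, so the overhead term matters more as `interval_ms` shrinks.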
RTP’s low-latency, connectionless design is essential for real-time media.
When to Use RTP vs MQTT for IoT Audio
Use RTP/SRTP when:
Streaming continuous audio/video (doorbell live view)
Two-way voice communication (intercom, doorbell talk)
The larger header (12 bytes for RTP, versus a 2-byte minimum fixed header for MQTT) is justified for streaming media because:
Sequence numbers detect packet loss and reordering
Timestamps enable jitter buffering and synchronization
SSRC identifies each source, allowing multiple streams in one session
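To show where these fields sit, here is a minimal sketch that packs and parses the fixed 12-byte RTP header defined in RFC 3550 using Python's `struct` module. The helper names are hypothetical, and the sketch assumes no padding, no header extension, and no CSRC entries.

```python
import struct

# Fixed RTP header: 2 flag bytes, 16-bit sequence, 32-bit timestamp, 32-bit SSRC.
RTP_HEADER = struct.Struct("!BBHII")  # "!" = network (big-endian) byte order


def pack_rtp_header(seq, timestamp, ssrc, payload_type=0, marker=False):
    """Build a 12-byte RTP header (version 2, no padding/extension/CSRC)."""
    byte0 = 2 << 6                                     # V=2, P=0, X=0, CC=0
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)  # M bit + 7-bit payload type
    return RTP_HEADER.pack(byte0, byte1, seq & 0xFFFF,
                           timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF)


def parse_rtp_header(data):
    """Decode the fixed header fields from the start of an RTP packet."""
    byte0, byte1, seq, timestamp, ssrc = RTP_HEADER.unpack_from(data)
    return {
        "version": byte0 >> 6,
        "marker": bool(byte1 >> 7),
        "payload_type": byte1 & 0x7F,
        "seq": seq,
        "timestamp": timestamp,
        "ssrc": ssrc,
    }


# Payload type 0 is PCMU (G.711 mu-law); the SSRC value here is arbitrary.
header = pack_rtp_header(seq=1, timestamp=160, ssrc=0x12345678)
print(len(header), parse_rtp_header(header))
```

In a real sender the SSRC is chosen randomly per stream and the initial sequence number and timestamp are randomized as well; fixed values are used here only to keep the output easy to read.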
43.5 Worked Example: Designing a Video Doorbell Protocol Stack
Scenario: A smart home company is designing a video doorbell that must stream 720p video and two-way audio to the homeowner’s phone app. The doorbell connects via Wi-Fi (802.11n, 20 MHz channel) with an average round-trip latency of 80ms to the cloud relay server. The app must display live video within 300ms of motion detection. Calculate the bandwidth requirements and evaluate whether the link can support the stream.
The 300ms target must accommodate the full pipeline:
| Stage | Latency | Running Total |
|---|---|---|
| Camera capture (1 frame at 15 fps) | 67 ms | 67 ms |
| H.264 encoding (hardware encoder) | 20 ms | 87 ms |
| RTP packetization | 2 ms | 89 ms |
| Wi-Fi transmission + contention | 15 ms | 104 ms |
| Doorbell to cloud relay (half RTT) | 40 ms | 144 ms |
| Cloud relay to phone (half RTT) | 50 ms | 194 ms |
| Jitter buffer (1 frame at 15 fps) | 67 ms | 261 ms |
| H.264 decode + render | 15 ms | 276 ms |
Result: 276 ms is within the 300 ms target, leaving a 24 ms margin. However, this assumes ideal network conditions. Under Wi-Fi contention (multiple devices), the Wi-Fi stage could increase from 15 ms to 40-80 ms, pushing total latency to roughly 301-341 ms and past the target.
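The budget check can be sketched as a few lines of code, which makes it easy to re-run with different stage values. The stage latencies are the ones from the table; the function names and the `with_wifi_delay` helper are illustrative assumptions.

```python
from itertools import accumulate

# (stage, latency in ms), taken from the latency table above.
STAGES_MS = [
    ("Camera capture", 67),
    ("H.264 encoding", 20),
    ("RTP packetization", 2),
    ("Wi-Fi transmission + contention", 15),
    ("Doorbell to cloud relay", 40),
    ("Cloud relay to phone", 50),
    ("Jitter buffer", 67),
    ("H.264 decode + render", 15),
]


def budget(stages, target_ms=300):
    """Return (total latency, remaining margin); a negative margin means the target is missed."""
    total = sum(ms for _, ms in stages)
    return total, target_ms - total


def with_wifi_delay(stages, wifi_ms):
    """Copy of the stage list with the Wi-Fi stage replaced by wifi_ms."""
    return [(name, wifi_ms if name.startswith("Wi-Fi") else ms) for name, ms in stages]


running = list(accumulate(ms for _, ms in STAGES_MS))
for (name, ms), total in zip(STAGES_MS, running):
    print(f"{name:<34}{ms:>4} ms{total:>6} ms")

print(budget(STAGES_MS))                       # ideal conditions
print(budget(with_wifi_delay(STAGES_MS, 40)))  # moderate contention
print(budget(with_wifi_delay(STAGES_MS, 80)))  # heavy contention
```

Because the margin is so thin, the Wi-Fi stage and the jitter buffer are the two values most worth measuring on real hardware before committing to the 300 ms target.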
Step 4: Protocol Selection Decision
| Component | Protocol Choice | Rationale |
|---|---|---|
| Call signaling | SIP over TLS (port 5061) | Standard call setup, encrypted |
| Video stream | RTP over UDP | Low latency, skip lost frames |
| Audio stream | RTP over UDP (separate SSRC) | Synchronized with video |
| Encryption | SRTP (AES-128-CM) | Encrypts media without TCP overhead |
| Motion alerts | MQTT QoS 1 (port 8883) | Reliable push notification |
| Clip storage | HTTPS POST | Reliability critical for recordings |
Key Insight: Alongside SIP signaling for call setup, the doorbell runs three data protocols simultaneously: MQTT for event notifications (< 1 kbps, TCP), SRTP for live streaming (1.37 Mbps, UDP), and HTTPS for clip uploads (burst, TCP). Each matches its data type: discrete events need reliability (TCP), continuous media needs low latency (UDP), and file uploads need guaranteed delivery (TCP). Trying to force all traffic through a single protocol would compromise either latency or reliability.
Common Pitfalls
1. Assuming MQTT Is Suitable for All Real-Time Scenarios
MQTT provides no timing guarantees — broker processing delay and TCP retransmission can add hundreds of milliseconds. For soft real-time (human-interface) it works; for hard real-time (motor control) it does not.
2. Not Measuring End-to-End Latency in Real Conditions
Protocol latency measured in a lab with direct connections differs significantly from real deployments with broker load, multiple concurrent clients, and network congestion. Always measure end-to-end latency under realistic load.
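Once end-to-end samples have been collected, it helps to summarize them by tail percentiles rather than by the mean alone, since a few slow messages are exactly what breaks a real-time deadline. The sketch below uses a nearest-rank percentile; the function name and the sample values are made up for illustration.

```python
import math


def latency_summary(samples_ms):
    """Mean, median, 99th percentile and maximum of latency samples (nearest-rank method)."""
    ordered = sorted(samples_ms)
    n = len(ordered)

    def percentile(p):
        # Nearest rank: smallest value with at least p percent of samples at or below it.
        rank = max(1, math.ceil(p * n / 100))
        return ordered[rank - 1]

    return {
        "mean": sum(ordered) / n,
        "p50": percentile(50),
        "p99": percentile(99),
        "max": ordered[-1],
    }


# Hypothetical measurement: 98 messages at 20 ms, 2 delayed to 500 ms.
samples = [20] * 98 + [500] * 2
print(latency_summary(samples))
```

In this made-up data set the mean stays under 30 ms while the 99th percentile is 500 ms, so a report based only on the average would hide the deadline misses.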
3. Ignoring Protocol Buffer Bloat for High-Rate Data
Real-time high-frequency data (100 Hz sensor sampling) generates large message queues if consumers are slower than producers. Implement queue depth monitoring and drop policies for real-time data that becomes stale.
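One way to combine queue depth monitoring with a drop policy is a bounded queue that evicts the oldest sample when full and discards samples that have aged past a deadline. The class below is a minimal single-threaded sketch; `FreshnessQueue` and its parameters are hypothetical names, and a multi-threaded consumer would need locking around it.

```python
import time
from collections import deque


class FreshnessQueue:
    """Bounded queue that prefers fresh data: drops oldest when full, skips stale on read."""

    def __init__(self, max_depth, max_age_s, clock=time.monotonic):
        self._items = deque(maxlen=max_depth)  # deque evicts the oldest entry when full
        self._max_age_s = max_age_s
        self._clock = clock                    # injectable clock makes testing easy
        self.dropped = 0                       # counter to export as a monitoring metric

    def put(self, value):
        if len(self._items) == self._items.maxlen:
            self.dropped += 1                  # the oldest sample is about to be evicted
        self._items.append((self._clock(), value))

    def get(self):
        """Return the oldest sample that is still fresh, or None if there is none."""
        now = self._clock()
        while self._items:
            stamp, value = self._items.popleft()
            if now - stamp <= self._max_age_s:
                return value
            self.dropped += 1                  # stale sample: discard instead of delivering late
        return None

    def depth(self):
        return len(self._items)


# A 100 Hz producer outrunning its consumer: only the 3 newest samples are kept.
q = FreshnessQueue(max_depth=3, max_age_s=0.05)
for reading in range(5):
    q.put(reading)
print(q.depth(), q.dropped, q.get())
```

Tracking `depth()` and `dropped` over time shows whether the consumer is keeping up, which is the signal the pitfall above says is usually missing.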