Scenario: Manufacturing plant needs to monitor bearing vibration on 200 production machines to predict failures. System must detect anomalies within 100ms to prevent damage.
Requirements Analysis:
Sampling rate: 1000 Hz (1 sample per millisecond)
Sample size: 12 bytes (timestamp: 4B, X-axis: 2B, Y-axis: 2B, Z-axis: 2B, machine ID: 2B)
Data rate per machine: 1000 samples/sec × 12 bytes = 12 KB/sec
Total system: 200 machines × 12 KB/sec = 2.4 MB/sec
Real-time requirement: Detect anomaly within 100ms
Network: Gigabit Ethernet (low latency, <1ms)
Processing: Edge gateway runs anomaly detection locally
Alert delivery: To maintenance dashboard when threshold exceeded
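The requirements arithmetic can be checked directly. The sketch below assumes a hypothetical little-endian wire layout for the 12-byte sample (timestamp as uint32 milliseconds, each axis as int16, machine ID as uint16); the original text specifies only the field sizes, not the encoding.

```python
import struct

# Hypothetical wire format for the 12-byte sample:
# timestamp (uint32, ms), X/Y/Z axes (int16 each), machine ID (uint16).
SAMPLE_FORMAT = "<IhhhH"
SAMPLE_SIZE = struct.calcsize(SAMPLE_FORMAT)   # 4 + 2 + 2 + 2 + 2 = 12 bytes

SAMPLING_HZ = 1000
MACHINES = 200

per_machine_bps = SAMPLING_HZ * SAMPLE_SIZE    # 12,000 B/s = 12 KB/s per machine
system_bps = MACHINES * per_machine_bps        # 2,400,000 B/s = 2.4 MB/s total

print(SAMPLE_SIZE, per_machine_bps, system_bps)
```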
Step 1: Evaluate Protocol Options for Sensor → Gateway
Option A: TCP
Pros:
- Reliable delivery (no lost samples)
- In-order delivery (critical for time-series)
- Flow control (prevents gateway overload)
Cons:
- Head-of-line blocking: Lost packet at T=0 delays packets at T=1,2,3...
- With 1000 Hz sampling, HOL blocking can cause latency spikes on the order of tens of milliseconds even when fast retransmit recovers the loss
- Retransmission timeout: 1 second minimum (RFC 6298) = far exceeds 100ms alert window
- Connection overhead: 200 machines × ~4 KB (TCP control block plus socket buffers) ≈ 800 KB RAM
Latency analysis:
- Normal: <1ms (Ethernet)
- With 0.1% packet loss: 1 packet/sec lost per machine
- Retransmit timeout: 1 second minimum (RFC 6298 default RTO)
- Worst-case alert delay: up to 1 second (one RTO) while in-order delivery stalls behind the lost packet >> 100ms requirement FAILED
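The TCP failure mode is easy to quantify: during one minimum RTO, every sample generated at 1000 Hz queues behind the lost packet before in-order delivery can resume.

```python
# Back-of-envelope check of TCP head-of-line blocking at 1000 Hz.
RTO_S = 1.0           # RFC 6298 minimum retransmission timeout
SAMPLING_HZ = 1000
ALERT_BUDGET_S = 0.100

# Samples that stall behind a single lost packet during one timeout:
stalled_samples = int(RTO_S * SAMPLING_HZ)
print(stalled_samples)
```

A single timeout stalls 1000 samples for up to a full second, an order of magnitude past the 100ms alert budget.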
Option B: UDP
Pros:
- No head-of-line blocking (lost packet doesn't delay future)
- Consistent latency: 1ms (no retransmission delays)
- No connection state: 0 bytes RAM overhead
- Packet loss: 0.1% means 1 sample/sec lost = acceptable for anomaly detection
Cons:
- Lost samples create data gaps
- Out-of-order delivery (Ethernet reordering rare but possible)
- No flow control (gateway must buffer bursts)
Latency analysis:
- Best case: <1ms
- Worst case for delivered packets: ~1ms (no retransmit delays; a lost packet simply never arrives)
- With 0.1% loss: 999/1000 samples arrive on time
- Alert latency: 1-2ms << 100ms requirement PASSED
Decision for Sensor → Gateway: UDP
Reasoning: Real-time 100ms requirement makes TCP’s head-of-line blocking and retransmit timeouts unacceptable. 0.1% sample loss doesn’t prevent anomaly detection (999/1000 samples sufficient).
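A minimal loopback sketch of the sensor → gateway path, assuming the hypothetical 12-byte packed layout from the requirements (the original does not specify an encoding): one unconnected UDP socket on the gateway receives datagrams from any number of sensors, and the sensor side keeps no connection state at all.

```python
import socket
import struct

SAMPLE_FORMAT = "<IhhhH"   # timestamp ms, X, Y, Z, machine ID (12 bytes, assumed layout)

def pack_sample(ts_ms, x, y, z, machine_id):
    return struct.pack(SAMPLE_FORMAT, ts_ms, x, y, z, machine_id)

# Gateway side: a single UDP socket receives from all 200 sensors.
gateway = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
gateway.bind(("127.0.0.1", 0))        # ephemeral port, loopback demo only
addr = gateway.getsockname()

# Sensor side: fire-and-forget datagram, no handshake, no retransmit queue.
sensor = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sensor.sendto(pack_sample(1234, 100, -50, 2047, 42), addr)

data, _ = gateway.recvfrom(64)
ts_ms, x, y, z, machine_id = struct.unpack(SAMPLE_FORMAT, data)
print(ts_ms, x, y, z, machine_id)
sensor.close()
gateway.close()
```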
Step 2: Evaluate Protocol Options for Gateway → Dashboard (Alerts)
Alert characteristics:
Frequency: 5-10 alerts/hour per machine (normal operation)
Criticality: HIGH (must not lose alert)
Latency: <1 second acceptable (human response time)
Size: 200 bytes (machine ID, timestamp, severity, vibration metrics, predicted failure time)
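One way to realize the ~200-byte alert is a small JSON payload carrying the fields listed above. The schema and encoding here are assumptions for illustration; the original text specifies only the field list and size budget.

```python
import json
import time

# Hypothetical alert schema; the JSON encoding is an assumption,
# not part of the original design.
def make_alert(machine_id, severity, rms_g, peak_g, predicted_failure_ts):
    return json.dumps({
        "machine_id": machine_id,
        "timestamp": int(time.time()),
        "severity": severity,
        "vibration": {"rms_g": rms_g, "peak_g": peak_g},
        "predicted_failure_ts": predicted_failure_ts,
    }).encode()

alert = make_alert(42, "HIGH", 1.83, 4.10, 1_700_000_000)
print(len(alert))   # comfortably within the ~200-byte budget
```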
Option A: UDP (for consistency with sensor path)
Pros:
- Low latency: ~1ms
- Consistent with sensor protocol
Cons:
- Alert loss risk: 0.1% loss means 1 in 1000 alerts lost
- For critical alerts, loss unacceptable
- No confirmation of delivery
Option B: TCP
Pros:
- Guaranteed delivery: No alert loss
- In-order delivery: Alerts arrive in sequence
- Connection persistence: Dashboard maintains single TCP connection
Cons:
- Connection setup: 3ms (one-time cost, then persistent)
- Head-of-line blocking: Not relevant (alerts are infrequent)
- Latency: <10ms (acceptable for human-scale response)
Cost analysis:
- Handshake: 3ms (once per dashboard connection)
- Per alert: 200 bytes + 40 bytes ACK = 240 bytes
- Latency: ~5ms average
Decision for Gateway → Dashboard: TCP
Reasoning: Alert loss is unacceptable (maintenance window missed = $10K+ machine damage). TCP’s 5ms latency is far below the 1-second requirement. Connection overhead amortizes across multiple alerts.
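The gateway → dashboard path can be sketched as one persistent TCP connection carrying length-prefixed alerts, so the handshake cost is paid once and TCP's reliability guarantees apply to every alert. The framing (4-byte length prefix) and the in-process "dashboard" thread are illustrative assumptions for a self-contained loopback demo.

```python
import socket
import struct
import threading

received = []

def dashboard(server_sock):
    """Dashboard side: accept one persistent connection, read length-prefixed alerts."""
    conn, _ = server_sock.accept()
    with conn:
        while True:
            header = conn.recv(4)
            if len(header) < 4:          # peer closed the connection
                break
            (length,) = struct.unpack("!I", header)
            payload = b""
            while len(payload) < length:
                chunk = conn.recv(length - len(payload))
                if not chunk:
                    return
                payload += chunk
            received.append(payload)

server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)
t = threading.Thread(target=dashboard, args=(server,))
t.start()

# Gateway side: one persistent connection; the handshake is paid once.
gateway = socket.create_connection(server.getsockname())
for alert in (b'{"machine_id": 7, "severity": "HIGH"}',
              b'{"machine_id": 9, "severity": "MED"}'):
    gateway.sendall(struct.pack("!I", len(alert)) + alert)
gateway.close()      # FIN ends the dashboard read loop after all data is delivered
t.join()
server.close()
print(len(received))
```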
Step 3: Calculate Overall System Performance
Data Path: Sensor → Gateway (UDP) → Dashboard (TCP)
End-to-end latency for alert:
1. Sensor samples vibration: 0ms (continuous)
2. UDP transmission to gateway: 1ms
3. Gateway anomaly detection: 10ms (sliding window analysis)
4. TCP alert to dashboard: 5ms
Total: 16ms << 100ms requirement PASSED
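The budget arithmetic above, as a quick sanity check:

```python
# Latency budget for the alert path, in milliseconds (figures from the analysis above).
BUDGET_MS = 100
steps = {
    "udp_sensor_to_gateway": 1,
    "gateway_anomaly_detection": 10,
    "tcp_alert_to_dashboard": 5,
}
total_ms = sum(steps.values())
print(total_ms)   # 16 ms, well inside the 100 ms budget
```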
Packet loss handling:
- Sensor UDP loss: 0.1% = 1 sample/sec lost per machine
- 999/1000 samples arrive, sufficient for anomaly detection
- Alert TCP loss: 0% (reliable delivery)
- Critical alerts always reach dashboard
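Why 0.1% sample loss is tolerable becomes concrete with a sliding-window detector: a missing sample is simply absent from the window and does not corrupt the statistics. The z-score scheme below is a sketch of one plausible approach; the plant's actual anomaly-detection algorithm is not specified in the text.

```python
from collections import deque
import math

class VibrationAnomalyDetector:
    """Sliding-window z-score detector (illustrative sketch).
    Lost UDP samples are simply absent from the window, so 0.1% loss
    does not break detection."""

    def __init__(self, window=100, threshold=4.0):
        self.window = deque(maxlen=window)
        self.threshold = threshold

    def add_sample(self, magnitude):
        anomalous = False
        if len(self.window) >= 30:   # require a baseline before flagging
            mean = sum(self.window) / len(self.window)
            var = sum((v - mean) ** 2 for v in self.window) / len(self.window)
            std = math.sqrt(var) or 1e-9
            anomalous = abs(magnitude - mean) / std > self.threshold
        self.window.append(magnitude)
        return anomalous

det = VibrationAnomalyDetector()
normal = [1.0 + 0.01 * ((i * 7) % 5) for i in range(99)]   # quiet baseline
flags = [det.add_sample(v) for v in normal]
spike_flag = det.add_sample(5.0)                            # sudden bearing spike
print(any(flags), spike_flag)
```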
Resource consumption:
- Gateway RAM: no per-sensor connection state for UDP, 1 × 4KB for the dashboard TCP connection = 4KB total
- vs full TCP: 200 × 4KB = 800KB saved
- Gateway CPU: UDP socket receive + anomaly detection
Step 4: Verify Under Load
Stress test results:
Normal operation:
- 200 machines × 12 KB/sec = 2.4 MB/sec inbound UDP
- 200 machines × 10 alerts/hour = 2,000 alerts/hour ≈ 1 TCP message every 1.8 seconds
- Total bandwidth: 2.4 MB/sec (no issues on Gigabit)
Peak failure scenario (10 machines failing simultaneously):
- Same 2.4 MB/sec sensor data (continuous)
- 10 alerts in <1 second via TCP
- Alert burst: 10 × 240 bytes = 2.4 KB in 1 second
- No congestion, no dropped alerts
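The peak-failure numbers above can be checked against the link capacity; even a usable-throughput estimate that ignores framing overhead shows the link is almost idle.

```python
# Peak-failure burst vs. link capacity, per the stress scenario above.
GIGABIT_BPS = 1_000_000_000 // 8      # ~125 MB/s ceiling (framing overhead ignored)
sensor_bps = 200 * 12_000             # 2.4 MB/s continuous UDP telemetry
alert_burst_bytes = 10 * 240          # 10 simultaneous alerts, 240 B each on the wire

utilization = sensor_bps / GIGABIT_BPS
print(alert_burst_bytes, round(utilization * 100, 2))   # burst is 2.4 KB; link ~2% utilized
```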
Alert delivery verification:
- 30-day test: 200 machines × 10 alerts/hour × 24 hours/day × 30 days = 1,440,000 alerts sent
- Alerts received: 1,440,000 (100% delivery via TCP)
- False negatives: 0 (UDP sample loss did not prevent detection)
Key Insight: Hybrid protocol strategy (UDP for high-frequency telemetry, TCP for critical alerts) achieves both real-time performance and reliability. Using TCP for 1000 Hz sensor data would violate the 100ms requirement due to retransmit delays, while using UDP for alerts would risk losing critical failure notifications.
Lesson Learned: Different data flows in the same system have different requirements. High-frequency telemetry prioritizes latency (UDP), while infrequent critical alerts prioritize reliability (TCP). Don’t force a single protocol choice across heterogeneous data streams.