985  Zigbee Routing and Self-Healing

AODV routing protocol, path discovery, and automatic mesh recovery

985.1 Learning Objectives

By the end of this chapter, you will be able to:

  • Explain how AODV (Ad-hoc On-Demand Distance Vector) routing works in Zigbee
  • Describe the route discovery process using RREQ and RREP messages
  • Understand how Zigbee mesh networks self-heal when devices fail
  • Calculate expected routing latency for multi-hop paths
  • Design networks with adequate redundancy for reliable self-healing

985.2 Introduction

Zigbee’s mesh networking capability relies on sophisticated routing protocols to deliver messages across multiple hops. The primary routing mechanism is AODV (Ad-hoc On-Demand Distance Vector), which discovers routes only when needed, saving memory and reducing overhead on resource-constrained devices.

Imagine you need to send a package across the country, but there’s no direct route. Instead, the package travels through multiple cities:

Your city → City A → City B → City C → Destination

Each city is like a Zigbee Router, and the package is your message. Routing protocols figure out:

  1. Which cities (routers) to use
  2. What to do if a city (router) is unavailable
  3. How to find the best path

AODV is the “GPS” of Zigbee - it finds routes when you need them.

985.3 AODV Routing Protocol

AODV (Ad-hoc On-Demand Distance Vector) is a reactive routing protocol - it discovers routes only when traffic needs to flow, rather than maintaining routes to all destinations proactively.

985.3.1 Why On-Demand Routing?

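| Approach        | Memory Usage             | Network Traffic         | Route Freshness |
|-----------------|--------------------------|-------------------------|-----------------|
| Proactive       | High (all routes)        | High (periodic updates) | Always fresh    |
| Reactive (AODV) | Low (active routes only) | Low (on-demand)         | Fresh when used |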

For resource-constrained Zigbee devices with 8-32KB RAM, reactive routing is essential.

985.3.2 Route Discovery Process

When a device needs to send a message to a destination without a known route:

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#E67E22', 'signalColor': '#16A085', 'actorLineColor': '#2C3E50', 'fontSize': '11px'}}}%%
sequenceDiagram
    participant S as Source<br/>(End Device)
    participant R1 as Router 1
    participant R2 as Router 2
    participant D as Destination<br/>(Coordinator)

    Note over S: No route to Destination

    rect rgb(230, 126, 34)
        Note over S,D: Phase 1: Route Request (RREQ) Flood
        S->>R1: RREQ: "Looking for Coordinator"
        S->>R2: RREQ (broadcast)
        R1->>R2: RREQ (forward)
        R1->>D: RREQ (forward)
        R2->>D: RREQ (forward)
    end

    rect rgb(22, 160, 133)
        Note over S,D: Phase 2: Route Reply (RREP) Unicast
        D->>R1: RREP: "Route found, 1 hop"
        R1->>S: RREP: "Route found, 2 hops"
    end

    Note over S: Route cached: S → R1 → D

    rect rgb(44, 62, 80)
        Note over S,D: Phase 3: Data Transmission
        S->>R1: DATA "Temperature: 23°C"
        R1->>D: DATA (forwarded)
        D->>R1: ACK
        R1->>S: ACK
    end

Figure 985.1: AODV route discovery showing RREQ broadcast, RREP unicast, and subsequent data transmission

985.3.3 RREQ (Route Request)

When a device needs a route, it broadcasts an RREQ:

RREQ Message Contents:
- Source Address: 0x0023 (the sensor)
- Destination Address: 0x0000 (the coordinator)
- Sequence Number: 12345 (prevents loops)
- Hop Count: 0 (incremented at each hop)
- TTL: 5 (maximum hops allowed)

RREQ Propagation:
1. Source broadcasts RREQ to all neighbors
2. Each Router that receives RREQ:
   - Checks if it’s the destination → If yes, send RREP
   - Checks if it has already seen this RREQ (by source address and sequence number) → If yes, drop
   - Otherwise, increment hop count and rebroadcast
3. RREQ floods through the network until the destination is reached or the TTL runs out
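The propagation rules above can be sketched as a small simulation. This is an illustrative model only, not a Zigbee stack API: the function name `flood_rreq`, the `neighbors` adjacency map, and the node labels are assumptions chosen to mirror Figure 985.1, and the broadcast is modeled as a breadth-first traversal with duplicate suppression and a hop limit.

```python
from collections import deque

def flood_rreq(neighbors, source, destination, ttl=5):
    """Simulate an RREQ flood.

    Returns (reverse_hop, hops): reverse_hop maps each node to the neighbor
    it first heard the RREQ from; hops is the hop count at which the
    destination was reached, or None if the TTL ran out first.
    """
    seen = {source}                 # nodes that already processed this RREQ
    reverse_hop = {}                # node -> neighbor the RREQ arrived from
    queue = deque([(source, 0)])
    while queue:
        node, hops = queue.popleft()
        if node == destination:
            return reverse_hop, hops        # destination would now send an RREP
        if hops >= ttl:
            continue                        # TTL exhausted: do not rebroadcast
        for nbr in neighbors[node]:
            if nbr in seen:
                continue                    # duplicate RREQ: drop
            seen.add(nbr)
            reverse_hop[nbr] = node
            queue.append((nbr, hops + 1))   # hop count incremented on rebroadcast
    return reverse_hop, None

# Hypothetical topology mirroring Figure 985.1
neighbors = {
    "S": ["R1", "R2"],
    "R1": ["S", "R2", "D"],
    "R2": ["S", "R1", "D"],
    "D": ["R1", "R2"],
}
reverse_hop, hops = flood_rreq(neighbors, "S", "D")
```

With this topology the destination is reached in 2 hops and hears the request first from `R1`; calling the function with `ttl=1` returns `None` for the hop count because the flood stops before reaching `D`.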

985.3.4 RREP (Route Reply)

When the destination (or a router with a fresh route) receives the RREQ:

RREP Message Contents:
- Source Address: 0x0000 (coordinator)
- Destination Address: 0x0023 (original requester)
- Hop Count: 2 (hops from destination)
- Route Lifetime: 60 seconds (how long to cache route)

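RREP Propagation:
1. Destination sends RREP back along the path the RREQ arrived on
2. Each Router on that path stores the forward route (toward the destination); the reverse route (toward the source) was already recorded when the RREQ passed through
3. RREP travels unicast (not broadcast) - efficient
4. Source receives RREP and caches the route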
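As a rough sketch of the reply phase, the function below walks an RREP from the destination back to the source by following the pointers left behind by the RREQ, and at each node records the next hop toward the destination. The name `send_rrep` and the `reverse_hop` map are assumptions for illustration; the map is hard-coded to match Figure 985.1.

```python
def send_rrep(reverse_hop, source, destination):
    """Walk the RREP from destination back to source.

    reverse_hop maps node -> neighbor the RREQ arrived from.
    Returns node -> (next hop toward destination, hop count).
    """
    forward_routes = {}
    node, hops = destination, 0
    while node != source:
        prev = reverse_hop[node]            # unicast to where the RREQ came from
        hops += 1
        forward_routes[prev] = (node, hops) # prev now knows how to reach destination
        node = prev
    return forward_routes

# Pointers left behind by the RREQ flood in Figure 985.1 (assumed values)
reverse_hop = {"R1": "S", "R2": "S", "D": "R1"}
routes = send_rrep(reverse_hop, "S", "D")
```

Only nodes on the reply path end up with a route: `R1` learns the destination is 1 hop away, `S` learns it is 2 hops away via `R1`, and `R2` stores nothing because the unicast reply never passes through it.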

985.3.5 Route Table Entries

After route discovery, each device stores route information:

Router 1 Routing Table:
| Destination | Next Hop | Hop Count | Lifetime |
|-------------|----------|-----------|----------|
| 0x0000      | 0x0000   | 1         | 60s      |
| 0x0023      | 0x0023   | 1         | 60s      |

Sensor Routing Table:
| Destination | Next Hop | Hop Count | Lifetime |
|-------------|----------|-----------|----------|
| 0x0000      | 0x0001   | 2         | 60s      |
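One possible in-memory representation of these tables is a dictionary keyed by destination address. The `RouteEntry` class and `next_hop` helper are hypothetical names for this sketch; the values are copied from the two tables above.

```python
from dataclasses import dataclass

@dataclass
class RouteEntry:
    next_hop: int      # 16-bit short address of the neighbor to forward to
    hop_count: int     # hops remaining to the destination
    lifetime_s: int    # seconds before the entry expires

# Router 1 (0x0001) and the sensor (0x0023), as in the tables above
router1_table = {
    0x0000: RouteEntry(next_hop=0x0000, hop_count=1, lifetime_s=60),
    0x0023: RouteEntry(next_hop=0x0023, hop_count=1, lifetime_s=60),
}
sensor_table = {
    0x0000: RouteEntry(next_hop=0x0001, hop_count=2, lifetime_s=60),
}

def next_hop(table, destination):
    """Return the next-hop address, or None if route discovery is needed."""
    entry = table.get(destination)
    return entry.next_hop if entry else None
```

A lookup that returns `None` corresponds to the "no route to destination" state that triggers a new RREQ.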

985.4 Route Maintenance

Routes don’t last forever. AODV includes mechanisms for maintaining valid routes:

985.4.1 Route Expiration

Route Lifecycle:
1. Route discovered → Lifetime timer starts (60s typical)
2. Data transmitted → Timer reset to full value
3. No activity → Timer counts down
4. Timer expires → Route marked invalid
5. Next transmission → New route discovery required
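The lifecycle above amounts to a timer that is refreshed on every use. A minimal sketch, assuming the 60-second lifetime quoted in this chapter and passing the current time in explicitly so the behavior is easy to follow (the `TimedRoute` class is a hypothetical name):

```python
class TimedRoute:
    """Route entry whose lifetime timer resets whenever it carries data."""

    def __init__(self, next_hop, now, lifetime_s=60.0):
        self.next_hop = next_hop
        self.lifetime_s = lifetime_s
        self.expires_at = now + lifetime_s      # step 1: timer starts

    def is_valid(self, now):
        return now < self.expires_at            # steps 3-4: counts down, then invalid

    def use(self, now):
        """Return the next hop and reset the timer, or None if expired."""
        if not self.is_valid(now):
            return None                         # step 5: rediscovery required
        self.expires_at = now + self.lifetime_s # step 2: reset to full value
        return self.next_hop

route = TimedRoute(next_hop=0x0001, now=0.0)
```

Sending data at t = 30 s pushes the expiry out to t = 90 s, so the route is still valid at t = 80 s; a send attempted at t = 95 s gets `None` and would have to start a new discovery.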

985.4.2 Route Error (RERR)

When a link fails (device offline, out of range), the detecting device sends RERR:

RERR Trigger Conditions:
- MAC-layer ACK not received after retries
- Neighbor timeout (no heartbeats)
- Explicit device leave notification

RERR Message:
- Unreachable Destination: 0x0001 (failed router)
- Affected Routes: List of destinations via failed router

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#E67E22', 'signalColor': '#E67E22', 'actorLineColor': '#2C3E50', 'fontSize': '11px'}}}%%
sequenceDiagram
    participant S as Sensor
    participant R1 as Router 1
    participant R2 as Router 2
    participant C as Coordinator

    Note over R1: Router 1 FAILS!

    S->>R1: DATA (attempt 1)
    Note over S: No ACK...
    S->>R1: DATA (attempt 2)
    Note over S: No ACK...
    S->>R1: DATA (attempt 3)
    Note over S: No ACK - Link failed!

    rect rgb(230, 126, 34)
        Note over S,C: Route Error & Recovery
        S->>S: Mark route via R1 invalid
        S->>R2: RREQ: "New route to Coordinator?"
        R2->>C: RREQ (forward)
        C->>R2: RREP: "1 hop via me"
        R2->>S: RREP: "2 hops"
    end

    Note over S: New route: S → R2 → C

    S->>R2: DATA (via new route)
    R2->>C: DATA (forwarded)

Figure 985.2: Route error detection and automatic recovery through alternate path
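The invalidation step in Figure 985.2 can be sketched as a filter over the routing table: every destination whose next hop is the failed neighbor is removed and listed as an affected route. The function name and the table layout (destination → next hop) are assumptions for this example, and the addresses 0x0045 and 0x0067 are made up.

```python
def handle_link_failure(routing_table, failed_next_hop):
    """Invalidate routes through a failed neighbor.

    routing_table maps destination -> next hop and is modified in place.
    Returns the sorted list of affected destinations (the RERR payload).
    """
    affected = sorted(
        dest for dest, hop in routing_table.items() if hop == failed_next_hop
    )
    for dest in affected:
        del routing_table[dest]     # next send to dest triggers a new RREQ
    return affected

# Hypothetical table: two routes go through Router 1 (0x0001), one through 0x0002
table = {0x0000: 0x0001, 0x0045: 0x0001, 0x0067: 0x0002}
affected = handle_link_failure(table, failed_next_hop=0x0001)
```

After the call, `affected` holds the two destinations that were reachable via 0x0001, and only the route through 0x0002 remains in the table.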

985.5 Self-Healing Mesh

Self-healing is one of Zigbee’s most valuable features for reliability-critical deployments.

985.5.1 Self-Healing Timeline

When a Router fails, the network recovers automatically:

T = 0: Router fails (power loss, damage, interference)

T = 0-300ms: Failure detection
- Devices sending to failed router don't receive ACKs
- 3 retries × 100ms timeout = 300ms

T = 300ms-3s: Route invalidation
- Devices mark routes via failed router as invalid
- RERR propagates to affected sources

T = 3-10s: Route rediscovery
- Affected devices broadcast new RREQs
- Parallel discovery for all affected routes

T ≈ 10s: Network recovered (typically within 5-15s)
- All devices have new routes
- Traffic flows through alternate paths

985.5.2 Redundancy Design

For reliable self-healing, design networks with path redundancy:

Minimum Redundancy (N+1):

Every end device should have at least 2 routers in range
If Router A fails → Route through Router B

Recommended Redundancy (N+2 or N+3):

3-4 routers in range of each end device
Multiple alternate paths available
Faster recovery, better load balancing
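A redundancy rule like this is easy to check against site-survey data. The sketch below assumes you have, for each end device, a list of routers it can hear; the `site_survey` data and the function name are invented for illustration.

```python
def under_redundant(routers_in_range, minimum=2):
    """Return end devices that can hear fewer than `minimum` distinct routers."""
    return sorted(
        device
        for device, routers in routers_in_range.items()
        if len(set(routers)) < minimum
    )

# Hypothetical survey results: end device -> routers within radio range
site_survey = {
    "sensor_a": ["R1", "R2", "R3"],
    "sensor_b": ["R2"],
    "sensor_c": ["R1", "R3"],
}

single_points_of_failure = under_redundant(site_survey)          # N+1 check
below_recommended = under_redundant(site_survey, minimum=3)      # N+2 check
```

Here `sensor_b` fails the minimum (N+1) check because it depends on a single router, and both `sensor_b` and `sensor_c` fall short of the recommended three routers.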

985.5.3 Self-Healing Verification

Test your network’s self-healing capability:

Test Procedure:
1. Monitor message delivery rate (baseline)
2. Disable one router (power off)
3. Measure recovery time (messages start flowing again)
4. Verify new routes in device tables
5. Re-enable router, verify mesh rebalances

Expected Results:
- Recovery time: 5-15 seconds
- Message loss during recovery: 1-5 messages
- Full restoration of delivery rate
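To turn the test procedure into numbers, a delivery log can be reduced to a recovery time and a lost-message count. This is a sketch under assumed inputs: the log format (send time in seconds, delivered flag) and the sample values are hypothetical, not output from any particular tool.

```python
def analyze_recovery(log, failure_time):
    """Measure recovery from a delivery log.

    log: list of (send_time_s, delivered) tuples ordered by time.
    Returns (recovery_time_s, messages_lost); recovery_time_s is None
    if no message was delivered after the failure.
    """
    lost = 0
    for send_time, delivered in log:
        if send_time < failure_time:
            continue                        # baseline traffic before the failure
        if delivered:
            return send_time - failure_time, lost
        lost += 1
    return None, lost

# Hypothetical log: one report every 5 s, router powered off at t = 8 s
log = [(0, True), (5, True), (10, False), (15, False), (20, True), (25, True)]
recovery_s, lost = analyze_recovery(log, failure_time=8)
```

For this sample log the network recovers 12 seconds after the failure with 2 messages lost, which falls inside the expected ranges listed above. Note that the measured recovery time is limited by the reporting interval, so a shorter interval gives a more precise figure.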

985.6 Latency Considerations

Multi-hop routing adds latency. Understanding this helps set appropriate expectations:

985.6.1 Per-Hop Latency

| Component                            | Time    |
|--------------------------------------|---------|
| CSMA/CA backoff                      | 5-20ms  |
| Transmission (127 bytes @ 250 kbps)  | ~4ms    |
| Processing at router                 | 1-5ms   |
| Total per hop                        | 10-30ms |
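The per-hop total is simply the sum of the three components, and the transmission time follows from frame size and bit rate. A short worked calculation, using the figures from the table (function names are illustrative):

```python
def transmission_ms(frame_bytes=127, bitrate_bps=250_000):
    """Time on air for one frame, in milliseconds (payload bits / bit rate)."""
    return frame_bytes * 8 / bitrate_bps * 1000

def per_hop_ms(backoff_ms, processing_ms, frame_bytes=127):
    """Per-hop latency: CSMA/CA backoff + transmission + router processing."""
    return backoff_ms + transmission_ms(frame_bytes) + processing_ms

best_case = per_hop_ms(backoff_ms=5, processing_ms=1)     # about 10 ms
worst_case = per_hop_ms(backoff_ms=20, processing_ms=5)   # about 29 ms
```

A full 127-byte frame is 1016 bits, which takes roughly 4.06 ms at 250 kbps; adding the backoff and processing ranges gives about 10 ms to 29 ms, matching the 10-30ms total in the table. Smaller frames shorten only the transmission term.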

985.6.2 End-to-End Latency Examples

| Path Length | Typical Latency | Worst Case |
|-------------|-----------------|------------|
| 1 hop       | 10-30ms         | 50ms       |
| 2 hops      | 20-60ms         | 100ms      |
| 3 hops      | 30-90ms         | 150ms      |
| 5 hops      | 50-150ms        | 250ms      |

985.6.3 First Message Latency

The first message to a new destination incurs route discovery overhead:

Route Discovery Time:
- RREQ flood: 100-500ms (depends on network size)
- RREP return: 50-200ms
- Total: 150-700ms (first message)

Subsequent Messages:
- Use cached route: 10-30ms per hop

Design Implication: For latency-sensitive applications, keep hop counts low (3-4 hops maximum) and consider pre-establishing routes during network formation.
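For quick budgeting, the figures in this section can be combined into a small estimator. This sketch assumes latency scales linearly with hop count and that route discovery time simply adds to the first message; both are simplifications, and the constants are the ranges quoted above.

```python
PER_HOP_MS = (10, 30)       # typical per-hop range from Section 985.6.1
DISCOVERY_MS = (150, 700)   # RREQ flood + RREP return for the first message

def path_latency_ms(hops, first_message=False):
    """Return (low, high) estimated latency in ms for a path of `hops` hops."""
    low = hops * PER_HOP_MS[0]
    high = hops * PER_HOP_MS[1]
    if first_message:
        low += DISCOVERY_MS[0]      # route must be discovered before data flows
        high += DISCOVERY_MS[1]
    return low, high

steady_state = path_latency_ms(3)
first_send = path_latency_ms(3, first_message=True)
```

For a 3-hop path this gives 30-90 ms once the route is cached and 180-790 ms for the first message, which shows why pre-establishing routes matters for latency-sensitive traffic.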

985.7 Alternative Routing Methods

While AODV is the primary routing protocol, Zigbee supports additional methods:

985.7.1 Tree Routing

Hierarchical routing based on network address structure:

Address-Based Routing:
- Coordinator: 0x0000
- Router A (child of Coordinator): 0x1000
- Router B (child of Router A): 0x1100
- End Device (child of Router B): 0x1110

Route to 0x1110:
0x0000 → 0x1000 → 0x1100 → 0x1110
(Follow address hierarchy)

Advantage: No route discovery needed - address implies route

Disadvantage: Rigid structure, no alternate paths
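The "address implies route" idea can be shown with the simplified scheme used in the example above, where each tree level occupies one hexadecimal digit of the address. This is a teaching sketch of that illustration only: real Zigbee tree addressing derives address blocks from network parameters (the Cskip function) rather than from fixed digits, and `tree_path` is a hypothetical helper.

```python
def tree_path(dest):
    """Path from the coordinator (0x0000) down to `dest`.

    Assumes the chapter's simplified scheme: one hex digit per tree level,
    with non-zero digits identifying the child at each level.
    """
    path = [0x0000]
    for level in range(4):
        # Keep the top (level + 1) hex digits, zero the rest
        mask = (0xFFFF << (4 * (3 - level))) & 0xFFFF
        prefix = dest & mask
        if prefix != path[-1]:
            path.append(prefix)     # a new ancestor (or dest itself) at this level
    return path

route = tree_path(0x1110)
```

For the end device 0x1110 this reproduces the route shown above (0x0000 → 0x1000 → 0x1100 → 0x1110) without any table lookup or discovery traffic.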

985.7.2 Source Routing

Sender specifies the complete path in the packet:

Source Route Header:
- Path: [0x0001, 0x0002, 0x0003, 0x0000]
- Each router forwards to next in list

Advantage: Predictable path, no route lookups at routers

Disadvantage: Larger packet headers, sender must know full path
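Forwarding with a source route needs no routing table at all: a router only has to find its own position in the path and hand the packet to the next entry. A minimal sketch using the path from the header example (the function name is an assumption, and the path is treated as a plain list rather than an on-air header format):

```python
def forward_source_routed(path, current):
    """Return the next hop after `current`, or None if `current` is the last entry."""
    idx = path.index(current)       # raises ValueError if current is not on the path
    return path[idx + 1] if idx + 1 < len(path) else None

# Path from the source route header above, ending at the coordinator
path = [0x0001, 0x0002, 0x0003, 0x0000]
hop_after_first = forward_source_routed(path, 0x0001)
hop_after_third = forward_source_routed(path, 0x0003)
```

Router 0x0001 forwards to 0x0002, router 0x0003 forwards to the coordinator 0x0000, and the coordinator gets `None` because it is the final destination. The cost is visible in the data structure: every packet carries the whole list.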

985.7.3 Many-to-One Routing

Optimized for sensor networks where all traffic flows to a central collector:

Coordinator (acting as concentrator) broadcasts a many-to-one route request
Every router that hears it stores a route toward the coordinator
Devices send route records so the coordinator learns the path back to them
Routes from any device to coordinator pre-established

Advantage: Efficient for data collection scenarios

Disadvantage: Only optimizes traffic toward coordinator
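The effect of a single broadcast from the collector can be modeled as a breadth-first traversal: each device remembers the neighbor from which it first heard the broadcast as its next hop toward the collector. The topology and names below are hypothetical, and the sketch ignores link quality, which a real network would factor into the choice.

```python
from collections import deque

def many_to_one_routes(neighbors, collector):
    """Return node -> next hop toward the collector after one broadcast."""
    next_hop = {}
    visited = {collector}
    queue = deque([collector])
    while queue:
        node = queue.popleft()
        for nbr in neighbors[node]:
            if nbr not in visited:      # first copy of the broadcast wins
                visited.add(nbr)
                next_hop[nbr] = node
                queue.append(nbr)
    return next_hop

# Hypothetical network: coordinator C, routers R1-R3, sensor S
neighbors = {
    "C": ["R1", "R2"],
    "R1": ["C", "R3"],
    "R2": ["C", "R3"],
    "R3": ["R1", "R2", "S"],
    "S": ["R3"],
}
routes = many_to_one_routes(neighbors, "C")
```

One broadcast leaves every device with a route to the collector, so no sensor needs to run its own discovery before reporting; routes between two arbitrary sensors are not created by this process.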

985.8 Routing Best Practices

985.8.1 Network Design

  1. Limit hop count: Design for maximum 5-7 hops
  2. Ensure redundancy: 2-3 routers in range of each device
  3. Place routers strategically: Hallways, central locations
  4. Avoid bottlenecks: Multiple paths between network regions

985.8.2 Monitoring

Track these metrics to identify routing issues:

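| Metric               | Healthy  | Warning  | Critical |
|----------------------|----------|----------|----------|
| Average hop count    | 2-3      | 4-5      | 6+       |
| Route discovery rate | < 1/min  | 1-5/min  | > 5/min  |
| Route failures       | < 1/hour | 1-5/hour | > 5/hour |
| Message delivery     | > 99%    | 95-99%   | < 95%    |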
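These thresholds translate directly into an alerting rule. The sketch below encodes the table as predicates; the metric key names are invented for this example, and how you collect the values depends on your gateway or coordinator tooling.

```python
# metric -> (is_warning, is_critical) predicates, taken from the table above
THRESHOLDS = {
    "avg_hop_count": (lambda v: v >= 4, lambda v: v >= 6),
    "route_discoveries_per_min": (lambda v: v >= 1, lambda v: v > 5),
    "route_failures_per_hour": (lambda v: v >= 1, lambda v: v > 5),
    "delivery_pct": (lambda v: v <= 99, lambda v: v < 95),
}

def classify(metric, value):
    """Return 'healthy', 'warning', or 'critical' for one metric reading."""
    is_warning, is_critical = THRESHOLDS[metric]
    if is_critical(value):
        return "critical"           # check the stricter condition first
    if is_warning(value):
        return "warning"
    return "healthy"

# Hypothetical readings from a monitoring poll
readings = {"avg_hop_count": 3, "route_discoveries_per_min": 2, "delivery_pct": 94}
status = {metric: classify(metric, value) for metric, value in readings.items()}
```

For these sample readings the hop count is healthy, the discovery rate is in the warning band, and message delivery is critical, which together would point toward marginal links rather than an oversized network.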

985.8.3 Troubleshooting

Common routing problems and solutions:

| Symptom                  | Likely Cause            | Solution                        |
|--------------------------|-------------------------|---------------------------------|
| High latency             | Too many hops           | Add routers to reduce hop count |
| Frequent route changes   | Marginal links          | Improve router placement        |
| Devices dropping offline | Insufficient redundancy | Add backup routers              |
| Recovery too slow        | Network too large       | Segment into multiple PANs      |

985.9 Summary

This chapter covered Zigbee routing and self-healing:

  • AODV Protocol: On-demand route discovery saves memory and bandwidth
  • Route Discovery: RREQ broadcasts find paths, RREP unicasts establish routes
  • Route Maintenance: Lifetime timers and RERR messages keep routes valid
  • Self-Healing: Automatic recovery in 5-15 seconds when paths fail
  • Latency: 10-30ms per hop, plus discovery overhead for first messages

Key design principles:

  • Plan for redundancy (2-3 routers per end device)
  • Limit maximum hop count to 5-7
  • Test self-healing before deployment
  • Monitor routing metrics in production

985.10 What’s Next

In the next chapter, Zigbee Application Profiles, we explore how ZHA, ZLL, and Zigbee 3.0 profiles enable device interoperability across manufacturers.