985 Zigbee Routing and Self-Healing

AODV routing protocol, path discovery, and automatic mesh recovery

985.1 Learning Objectives

By the end of this chapter, you will be able to:

Explain how AODV (Ad-hoc On-Demand Distance Vector) routing works in Zigbee
Describe the route discovery process using RREQ and RREP messages
Understand how Zigbee mesh networks self-heal when devices fail
Calculate expected routing latency for multi-hop paths
Design networks with adequate redundancy for reliable self-healing

985.2 Introduction

Zigbee’s mesh networking capability relies on sophisticated routing protocols to deliver messages across multiple hops. The primary routing mechanism is AODV (Ad-hoc On-Demand Distance Vector), which discovers routes only when needed, saving memory and reducing overhead on resource-constrained devices.

For Beginners: What is Routing?

Imagine you need to send a package across the country, but there’s no direct route. Instead, the package travels through multiple cities:

Your city → City A → City B → City C → Destination

Each city is like a Zigbee Router, and the package is your message. Routing protocols figure out:

1. Which cities (routers) to use
2. What to do if a city (router) is unavailable
3. How to find the best path

AODV is the “GPS” of Zigbee - it finds routes when you need them.

985.3 AODV Routing Protocol

AODV (Ad-hoc On-Demand Distance Vector) is a reactive routing protocol - it discovers routes only when traffic needs to flow, rather than maintaining routes to all destinations proactively.

985.3.1 Why On-Demand Routing?

Approach	Memory Usage	Network Traffic	Route Freshness
Proactive	High (all routes)	High (periodic updates)	Always fresh
Reactive (AODV)	Low (active routes only)	Low (on-demand)	Fresh when used

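For resource-constrained Zigbee devices with only 8-32 KB of RAM, reactive routing is essential: a device stores only the routes it is actively using rather than a route to every node in the network.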

985.3.2 Route Discovery Process

When a device needs to send a message but has no known route to the destination, discovery proceeds in three phases:

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#E67E22', 'signalColor': '#16A085', 'actorLineColor': '#2C3E50', 'fontSize': '11px'}}}%%
sequenceDiagram
    participant S as Source<br/>(End Device)
    participant R1 as Router 1
    participant R2 as Router 2
    participant D as Destination<br/>(Coordinator)

    Note over S: No route to Destination

    rect rgb(230, 126, 34)
        Note over S,D: Phase 1: Route Request (RREQ) Flood
        S->>R1: RREQ: "Looking for Coordinator"
        S->>R2: RREQ (broadcast)
        R1->>R2: RREQ (forward)
        R1->>D: RREQ (forward)
        R2->>D: RREQ (forward)
    end

    rect rgb(22, 160, 133)
        Note over S,D: Phase 2: Route Reply (RREP) Unicast
        D->>R1: RREP: "Route found, 1 hop"
        R1->>S: RREP: "Route found, 2 hops"
    end

    Note over S: Route cached: S → R1 → D

    rect rgb(44, 62, 80)
        Note over S,D: Phase 3: Data Transmission
        S->>R1: DATA "Temperature: 23°C"
        R1->>D: DATA (forwarded)
        D->>R1: ACK
        R1->>S: ACK
    end

Figure 985.1: AODV route discovery showing RREQ broadcast, RREP unicast, and subsequent data transmission

985.3.3 RREQ (Route Request)

When a device needs a route, it broadcasts an RREQ:

RREQ Message Contents:
- Source Address: 0x0023 (the sensor)
- Destination Address: 0x0000 (the coordinator)
- Sequence Number: 12345 (prevents loops)
- Hop Count: 0 (incremented at each hop)
- TTL: 5 (maximum hops allowed)

RREQ Propagation:
1. Source broadcasts RREQ to all neighbors
2. Each Router that receives RREQ:
   - Checks if it’s the destination → If yes, send RREP
   - Checks if already seen this RREQ (by sequence number) → If yes, drop
   - Otherwise, increment hop count and rebroadcast
3. RREQ floods through network until destination reached
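The propagation rules above can be sketched as a small simulation. This is an illustrative model, not stack code: the `neighbors` map, the node names, and the `flood_rreq` function are hypothetical, and the breadth-first order assumes the first copy of an RREQ always arrives over the fewest hops, which real radio timing does not guarantee.

```python
from collections import deque

def flood_rreq(neighbors, source, destination, ttl=5):
    """Simulate an RREQ flood; return {node: (previous_hop, hop_count)}."""
    # reverse[node] records which neighbor first delivered the RREQ.
    reverse = {source: (None, 0)}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        hops = reverse[node][1]
        if node == destination or hops >= ttl:
            # The destination replies instead of rebroadcasting,
            # and a request whose TTL is used up goes no further.
            continue
        for peer in neighbors[node]:
            if peer in reverse:
                continue  # duplicate RREQ: already seen, so drop it
            reverse[peer] = (node, hops + 1)  # hop count incremented
            queue.append(peer)
    return reverse

# Hypothetical topology matching Figure 985.1
neighbors = {
    "S": ["R1", "R2"],
    "R1": ["S", "R2", "D"],
    "R2": ["S", "R1", "D"],
    "D": ["R1", "R2"],
}
reverse = flood_rreq(neighbors, "S", "D")
print(reverse["D"])  # ('R1', 2)
```

The second copy of the RREQ that reaches D through R2 is dropped as a duplicate, which is why only one reverse pointer per node survives.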

985.3.4 RREP (Route Reply)

When the destination (or a router with a fresh route) receives the RREQ, it answers with an RREP:

RREP Message Contents:
- Source Address: 0x0000 (coordinator)
- Destination Address: 0x0023 (original requester)
- Hop Count: 2 (hops from destination)
- Route Lifetime: 60 seconds (how long to cache route)

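RREP Propagation:
1. Destination sends RREP back along the path the RREQ arrived on
2. Each Router on that path stores the forward route (toward the destination); the reverse route (toward the source) was already recorded as the RREQ passed through
3. RREP travels unicast (not broadcast), so it is efficient
4. Source receives RREP and caches the route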
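As a rough sketch of the reply phase, the function below walks an RREP from the destination back to the source and records the next hop each node learns along the way. The `reverse` map (who delivered the RREQ to whom) and the `send_rrep` name are assumptions made for illustration.

```python
def send_rrep(reverse, source, destination):
    """Walk the RREP back to the source; return the routes learned on the way."""
    forward = {}  # forward[node] = (next hop toward destination, hop count)
    node, hops = destination, 0
    while node != source:
        previous = reverse[node]  # neighbor that delivered the RREQ to this node
        hops += 1
        forward[previous] = (node, hops)
        node = previous
    return forward

# Reverse pointers left behind by the RREQ flood in Figure 985.1
reverse = {"R1": "S", "R2": "S", "D": "R1"}
forward = send_rrep(reverse, "S", "D")
print(forward)  # {'R1': ('D', 1), 'S': ('R1', 2)}
```

R2 never sees the RREP, so it learns nothing about this route. That is the saving unicast replies give over a second flood.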

985.3.5 Route Table Entries

After route discovery, each device stores route information:

Router 1 Routing Table:
| Destination | Next Hop | Hop Count | Lifetime |
|-------------|----------|-----------|----------|
| 0x0000      | 0x0000   | 1         | 60s      |
| 0x0023      | 0x0023   | 1         | 60s      |

Sensor Routing Table:
| Destination | Next Hop | Hop Count | Lifetime |
|-------------|----------|-----------|----------|
| 0x0000      | 0x0001   | 2         | 60s      |
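One way to picture these tables in code is a dictionary keyed by destination address. The `RouteEntry` fields mirror the table columns above; the class and function names are illustrative and do not come from any particular Zigbee stack.

```python
from dataclasses import dataclass

@dataclass
class RouteEntry:
    next_hop: int
    hop_count: int
    lifetime_s: int

# Sensor (0x0023) routing table from the example above
sensor_routes = {
    0x0000: RouteEntry(next_hop=0x0001, hop_count=2, lifetime_s=60),
}

def next_hop_for(routes, destination):
    """Return the next hop, or None when route discovery is required."""
    entry = routes.get(destination)
    return entry.next_hop if entry else None

print(hex(next_hop_for(sensor_routes, 0x0000)))  # 0x1
print(next_hop_for(sensor_routes, 0x0042))       # None
```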

985.4 Route Maintenance

Routes don’t last forever. AODV includes mechanisms for maintaining valid routes:

985.4.1 Route Expiration

Route Lifecycle:
1. Route discovered → Lifetime timer starts (60s typical)
2. Data transmitted → Timer reset to full value
3. No activity → Timer counts down
4. Timer expires → Route marked invalid
5. Next transmission → New route discovery required
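The lifecycle above amounts to a timer that is refreshed on every use. The sketch below passes the current time in explicitly so the behaviour is easy to follow; `TimedRoute` is a hypothetical name and the 60-second default is the typical value quoted above.

```python
class TimedRoute:
    """A cached route whose lifetime timer is reset whenever it carries data."""

    def __init__(self, next_hop, lifetime_s=60.0, now=0.0):
        self.next_hop = next_hop
        self.lifetime_s = lifetime_s
        self.expires_at = now + lifetime_s  # step 1: timer starts

    def is_valid(self, now):
        return now < self.expires_at        # steps 3-4: counts down, then expires

    def use(self, now):
        """Return the next hop and reset the timer, or None if rediscovery is needed."""
        if not self.is_valid(now):
            return None                     # step 5: new route discovery required
        self.expires_at = now + self.lifetime_s  # step 2: reset to full value
        return self.next_hop

route = TimedRoute(next_hop=0x0001, now=0.0)
print(route.use(now=30.0))   # 1  (timer now runs until t = 90 s)
print(route.use(now=95.0))   # None (idle too long, route expired)
```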

985.4.2 Route Error (RERR)

When a link fails (device offline, out of range), the device that detects the failure sends a Route Error (RERR) message:

RERR Trigger Conditions:
- MAC-layer ACK not received after retries
- Neighbor timeout (no heartbeats)
- Explicit device leave notification

RERR Message:
- Unreachable Destination: 0x0001 (failed router)
- Affected Routes: List of destinations via failed router
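A minimal sketch of how a device might react to an RERR, assuming its routing table is reduced to a destination-to-next-hop map. The addresses other than 0x0000 and 0x0001 are made up for the example, and `apply_rerr` is an illustrative name.

```python
def apply_rerr(routes, failed_next_hop):
    """Remove every route that depends on the failed router.

    Returns the affected destinations, which now need rediscovery.
    """
    affected = [dest for dest, hop in routes.items() if hop == failed_next_hop]
    for dest in affected:
        del routes[dest]
    return affected

# destination -> next hop
routes = {0x0000: 0x0001, 0x0045: 0x0001, 0x0050: 0x0002}
print([hex(d) for d in apply_rerr(routes, failed_next_hop=0x0001)])  # ['0x0', '0x45']
print(routes)  # {80: 2}
```

The route to 0x0050 survives because it never passed through the failed router.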

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#2C3E50', 'primaryTextColor': '#fff', 'primaryBorderColor': '#16A085', 'lineColor': '#E67E22', 'signalColor': '#E67E22', 'actorLineColor': '#2C3E50', 'fontSize': '11px'}}}%%
sequenceDiagram
    participant S as Sensor
    participant R1 as Router 1
    participant R2 as Router 2
    participant C as Coordinator

    Note over R1: Router 1 FAILS!

    S->>R1: DATA (attempt 1)
    Note over S: No ACK...
    S->>R1: DATA (attempt 2)
    Note over S: No ACK...
    S->>R1: DATA (attempt 3)
    Note over S: No ACK - Link failed!

    rect rgb(230, 126, 34)
        Note over S,C: Route Error & Recovery
        S->>S: Mark route via R1 invalid
        S->>R2: RREQ: "New route to Coordinator?"
        R2->>C: RREQ (forward)
        C->>R2: RREP: "1 hop via me"
        R2->>S: RREP: "2 hops"
    end

    Note over S: New route: S → R2 → C

    S->>R2: DATA (via new route)
    R2->>C: DATA (forwarded)

Figure 985.2: Route error detection and automatic recovery through alternate path

985.5 Self-Healing Mesh

Self-healing is one of Zigbee’s most valuable features for reliability-critical deployments.

985.5.1 Self-Healing Timeline

When a Router fails, the network recovers automatically:

T = 0: Router fails (power loss, damage, interference)

T = 0-300ms: Failure detection
- Devices sending to failed router don't receive ACKs
- 3 retries × 100ms timeout = 300ms

T = 300ms-3s: Route invalidation
- Devices mark routes via failed router as invalid
- RERR propagates to affected sources

T = 3-10s: Route rediscovery
- Affected devices broadcast new RREQs
- Parallel discovery for all affected routes

T ≈ 10s: Network recovered (5-15s is typical in practice)
- All devices have new routes
- Traffic flows through alternate paths

985.5.2 Redundancy Design

For reliable self-healing, design networks with path redundancy:

Minimum Redundancy (N+1):

Every end device should have at least 2 routers in range
If Router A fails → Route through Router B

Recommended Redundancy (N+2 or N+3):

3-4 routers in range of each end device
Multiple alternate paths available
Faster recovery, better load balancing
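A redundancy rule like this is easy to check against site-survey data. The sketch below assumes you already have, for each end device, a list of routers it can hear; the device and router names are invented for illustration.

```python
def under_protected(routers_in_range, minimum=2):
    """Return the end devices that hear fewer than `minimum` distinct routers."""
    return sorted(
        device
        for device, routers in routers_in_range.items()
        if len(set(routers)) < minimum
    )

# Hypothetical survey results: end device -> routers within radio range
site_survey = {
    "sensor_a": ["R1", "R2", "R3"],
    "sensor_b": ["R2"],
    "sensor_c": ["R1", "R3"],
}
print(under_protected(site_survey))             # ['sensor_b']
print(under_protected(site_survey, minimum=3))  # ['sensor_b', 'sensor_c']
```

With the minimum (N+1) rule only `sensor_b` is flagged. Under the recommended 3-router rule, `sensor_c` also needs another router nearby.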

985.5.3 Self-Healing Verification

Test your network’s self-healing capability:

Test Procedure:
1. Monitor message delivery rate (baseline)
2. Disable one router (power off)
3. Measure recovery time (messages start flowing again)
4. Verify new routes in device tables
5. Re-enable router, verify mesh rebalances

Expected Results:
- Recovery time: 5-15 seconds
- Message loss during recovery: 1-5 messages
- Full restoration of delivery rate
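Steps 1 and 3 of the procedure can be scored from a simple delivery log. The log format below, a list of `(send_time_s, delivered)` tuples, is an assumption for this sketch; adapt it to whatever your gateway or sniffer actually records.

```python
def recovery_stats(log, failure_time):
    """Return (recovery_time_s, messages_lost) after a router failure.

    `log` is a time-ordered list of (send_time_s, delivered) tuples.
    recovery_time_s is None if no message was delivered after the failure.
    """
    lost = 0
    for sent_at, delivered in log:
        if sent_at < failure_time:
            continue  # baseline traffic before the router was disabled
        if delivered:
            return sent_at - failure_time, lost
        lost += 1
    return None, lost

# Hypothetical test run: one message every 2 s, router powered off at t = 3 s
log = [(0, True), (2, True), (4, False), (6, False), (8, False), (10, True), (12, True)]
print(recovery_stats(log, failure_time=3))  # (7, 3)
```

Here the network recovered in 7 seconds with 3 messages lost, inside the expected 5-15 second and 1-5 message ranges.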

985.6 Latency Considerations

Multi-hop routing adds latency. Understanding this helps set appropriate expectations:

985.6.1 Per-Hop Latency

Component	Time
CSMA/CA backoff	5-20ms
Transmission (127 bytes @ 250 kbps)	4ms
Processing at router	1-5ms
Total per hop	10-30ms

985.6.2 End-to-End Latency Examples

Path Length	Typical Latency	Worst Case
1 hop	10-30ms	50ms
2 hops	20-60ms	100ms
3 hops	30-90ms	150ms
5 hops	50-150ms	250ms
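The figures in these tables follow from simple arithmetic, reproduced in the sketch below. The function names are illustrative, and the 10-30ms per-hop range is taken from the per-hop table rather than measured.

```python
FRAME_BYTES = 127    # maximum IEEE 802.15.4 frame size used in the table
BIT_RATE = 250_000   # bits per second (250 kbps)

def airtime_ms(frame_bytes=FRAME_BYTES, bit_rate=BIT_RATE):
    """Time on air for one frame, in milliseconds."""
    return frame_bytes * 8 / bit_rate * 1000

def path_latency_ms(hops, per_hop_ms=(10, 30)):
    """Typical (low, high) end-to-end latency for a path of `hops` hops."""
    low, high = per_hop_ms
    return hops * low, hops * high

print(round(airtime_ms(), 2))  # 4.06
print(path_latency_ms(3))      # (30, 90)
```

The worst-case column in the table is not covered by this formula: it allows roughly 50ms per hop to account for retries and long backoffs.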

985.6.3 First Message Latency

The first message to a new destination incurs route discovery overhead:

Route Discovery Time:
- RREQ flood: 100-500ms (depends on network size)
- RREP return: 50-200ms
- Total: 150-700ms (first message)

Subsequent Messages:
- Use cached route: 10-30ms per hop
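To put numbers on the difference, the sketch below adds the discovery overhead to the per-hop delivery time. Both ranges are the estimates quoted above, not measurements, and the function name is illustrative.

```python
def first_message_latency_ms(hops, discovery_ms=(150, 700), per_hop_ms=(10, 30)):
    """(low, high) latency of the first message, including route discovery."""
    return (
        discovery_ms[0] + hops * per_hop_ms[0],
        discovery_ms[1] + hops * per_hop_ms[1],
    )

print(first_message_latency_ms(3))  # (180, 790)
```

For a 3-hop path the first message can take up to about 790ms, while every later message on the cached route needs only 30-90ms.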

Design Implication: For latency-sensitive applications, keep hop counts low (3-4 hops maximum) and consider pre-establishing routes during network formation.

985.7 Alternative Routing Methods

While AODV is the primary routing protocol, Zigbee supports additional methods:

985.7.1 Tree Routing

Hierarchical routing based on network address structure:

Address-Based Routing:
- Coordinator: 0x0000
- Router A (child of Coordinator): 0x1000
- Router B (child of Router A): 0x1100
- End Device (child of Router B): 0x1110

Route to 0x1110:
0x0000 → 0x1000 → 0x1100 → 0x1110
(Follow address hierarchy)

Advantage: No route discovery needed - address implies route
Disadvantage: Rigid structure, no alternate paths
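The idea that the address implies the route can be shown with the simplified addressing used in this example, where each hexadecimal digit names one level of the tree. This digit-per-level layout is an assumption made for illustration only: real Zigbee stacks allocate tree addresses with a different, Cskip-based scheme.

```python
def tree_path(destination):
    """Expand a 16-bit address into its chain of ancestors, one hex digit per level."""
    path = [0x0000]  # every path starts at the coordinator
    prefix = 0
    for shift in (12, 8, 4, 0):
        digit = (destination >> shift) & 0xF
        if digit == 0:
            break  # no deeper level is encoded in this address
        prefix |= digit << shift
        path.append(prefix)
    return path

print([f"0x{a:04X}" for a in tree_path(0x1110)])
# ['0x0000', '0x1000', '0x1100', '0x1110']
```

No table lookup or discovery traffic is involved. The price is that the path cannot change if one of those ancestors fails.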

985.7.2 Source Routing

Sender specifies the complete path in the packet:

Source Route Header:
- Path: [0x0001, 0x0002, 0x0003, 0x0000]
- Each router forwards to next in list

Advantage: Predictable path, no route lookups at routers
Disadvantage: Larger packet headers, sender must know full path
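Forwarding under source routing needs no routing table at all, as this small sketch suggests. It assumes each address appears only once in the path; `next_relay` is a hypothetical helper, not a stack API.

```python
def next_relay(path, current):
    """Return the address that follows `current` in the source route.

    Returns None when `current` is the final destination.
    """
    index = path.index(current)
    return path[index + 1] if index + 1 < len(path) else None

path = [0x0001, 0x0002, 0x0003, 0x0000]
print(hex(next_relay(path, 0x0002)))  # 0x3
print(next_relay(path, 0x0000))       # None
```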

985.7.3 Many-to-One Routing

Optimized for sensor networks where all traffic flows to a central collector:

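Coordinator (acting as the concentrator) broadcasts a many-to-one route request
Each router that hears it stores a route back toward the coordinator
Devices send route record messages so the coordinator learns the path to each of them
Routes from any device to the coordinator are established without per-device discovery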

Advantage: Efficient for data collection scenarios
Disadvantage: Only optimizes traffic toward coordinator

985.8 Routing Best Practices

985.8.1 Network Design

Limit hop count: Design for a maximum of 5-7 hops
Ensure redundancy: 2-3 routers in range of each device
Place routers strategically: Hallways, central locations
Avoid bottlenecks: Multiple paths between network regions

985.8.2 Monitoring

Track these metrics to identify routing issues:

Metric	Healthy	Warning	Critical
Average hop count	2-3	4-5	6+
Route discovery rate	< 1/min	1-5/min	> 5/min
Route failures	< 1/hour	1-5/hour	> 5/hour
Message delivery	> 99%	95-99%	< 95%
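These thresholds translate directly into alerting rules. The functions below are one possible encoding. The table leaves small gaps (for example an average hop count of 3.5), and this sketch resolves them toward the milder status, which is a judgment call rather than part of the table.

```python
def hop_status(average_hops):
    if average_hops >= 6:
        return "critical"
    if average_hops >= 4:
        return "warning"
    return "healthy"

def rate_status(events_per_interval):
    """Route discoveries per minute, or route failures per hour."""
    if events_per_interval > 5:
        return "critical"
    if events_per_interval >= 1:
        return "warning"
    return "healthy"

def delivery_status(percent_delivered):
    if percent_delivered < 95:
        return "critical"
    if percent_delivered <= 99:
        return "warning"
    return "healthy"

print(hop_status(2.5), rate_status(3), delivery_status(93.0))
# healthy warning critical
```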

985.8.3 Troubleshooting

Common routing problems and solutions:

Symptom	Likely Cause	Solution
High latency	Too many hops	Add routers to reduce hop count
Frequent route changes	Marginal links	Improve router placement
Devices dropping offline	Insufficient redundancy	Add backup routers
Recovery too slow	Network too large	Segment into multiple PANs

985.9 Summary

This chapter covered Zigbee routing and self-healing:

AODV Protocol: On-demand route discovery saves memory and bandwidth
Route Discovery: RREQ broadcasts find paths, RREP unicasts establish routes
Route Maintenance: Lifetime timers and RERR messages keep routes valid
Self-Healing: Automatic recovery in 5-15 seconds when paths fail
Latency: 10-30ms per hop, plus discovery overhead for first messages

Key design principles:
- Plan for redundancy (2-3 routers per end device)
- Limit maximum hop count to 5-7
- Test self-healing before deployment
- Monitor routing metrics in production

985.10 What’s Next

In the next chapter, Zigbee Application Profiles, we explore how ZHA, ZLL, and Zigbee 3.0 profiles enable device interoperability across manufacturers.

Related Chapters

Zigbee Network Topologies - Star, tree, and mesh configurations
Zigbee Network Formation - Device joining process
Zigbee Security - Encrypted routing